dealspace/docs/ENRICHMENT-SPEC.md

Enrichment & Identity — Spec

Status: Draft
Author: Johan
Date: 2026-03-15

Overview

When a user logs in with just an email address, DealSpace should derive as much identity and company context as possible — eliminating manual data entry and creating an immediate sense of a populated, professional environment.

1. Email-Based Identity Resolution

Flow

  1. User enters email → receives 6-digit TOTP code
  2. While user enters code, backend fires enrichment pipeline (concurrent goroutines):
    • Parse email → extract name parts + company domain
    • Domain scrape → company metadata
    • Enrichment API → person + company data
  3. User lands in app with profile pre-populated
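The fan-out in step 2 can be sketched as goroutines sharing one deadline, so a slow source never delays login. This is a minimal sketch, not the DealSpace implementation: `Stage`, `Profile`, `Enrich`, the stub stages, and the 100 ms demo budget are all hypothetical names and values chosen for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// demoBudget is an assumed overall deadline; the spec does not state one.
const demoBudget = 100 * time.Millisecond

// Profile accumulates whatever the stages produce. Field names are illustrative.
type Profile struct {
	mu     sync.Mutex
	fields map[string]string
}

// Set is safe to call from several stages at once.
func (p *Profile) Set(key, value string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.fields[key] = value
}

// Stage is a hypothetical enrichment step. It must return promptly once ctx is done.
type Stage func(ctx context.Context, email string, out *Profile)

// Enrich runs every stage in its own goroutine and returns what was collected
// once all stages have finished or given up at the deadline.
func Enrich(email string, budget time.Duration, stages ...Stage) map[string]string {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	out := &Profile{fields: map[string]string{}}
	var wg sync.WaitGroup
	for _, s := range stages {
		wg.Add(1)
		go func(s Stage) {
			defer wg.Done()
			s(ctx, email, out)
		}(s)
	}
	wg.Wait()
	return out.fields
}

// parseStage stands in for email parsing: instant, no I/O.
func parseStage(_ context.Context, email string, out *Profile) {
	out.Set("email", email)
}

// slowStage stands in for a network call that outlives the budget.
func slowStage(ctx context.Context, _ string, out *Profile) {
	select {
	case <-time.After(2 * time.Second):
		out.Set("title", "Managing Director")
	case <-ctx.Done(): // deadline hit: drop this source, keep the rest
	}
}

func main() {
	fields := Enrich("john.smith@gs.com", demoBudget, parseStage, slowStage)
	fmt.Println(fields) // only the fast stage contributes
}
```

Stages write partial results as they go, so whatever finished before the deadline still reaches the profile.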

Email Parsing

| Pattern | First | Last |
|---|---|---|
| john.smith@gs.com | John | Smith |
| jsmith@company.com | J | Smith (flag as partial) |
| john@company.com | John | |
| john.smith.jr@company.com | John | Smith Jr |

Personal domains (gmail, outlook, yahoo, proton, icloud, hey) → skip company enrichment, person-only.
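A minimal sketch of the parsing rules above. Two things here are assumptions, not part of the spec: the exact personal domains (the spec names providers only), and the `knownFirstNames` lookup used to tell `john@` from `jsmith@`, since the address alone cannot distinguish a first name from initial + surname. `ParsedEmail` and `ParseEmail` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// ParsedEmail holds what can be inferred from the address alone.
type ParsedEmail struct {
	First, Last, Domain string
	Partial             bool // only an initial was recovered for the first name
	Personal            bool // consumer mailbox: skip company enrichment
}

// personalDomains is an illustrative list, not an exhaustive one.
var personalDomains = map[string]bool{
	"gmail.com": true, "outlook.com": true, "yahoo.com": true,
	"proton.me": true, "icloud.com": true, "hey.com": true,
}

// knownFirstNames is a stand-in for a real first-name dictionary (assumption).
var knownFirstNames = map[string]bool{"john": true, "jane": true, "craig": true}

// capitalize upper-cases the first rune of s.
func capitalize(s string) string {
	r := []rune(s)
	if len(r) == 0 {
		return ""
	}
	r[0] = unicode.ToUpper(r[0])
	return string(r)
}

// ParseEmail extracts name parts and the domain from an address.
func ParseEmail(addr string) (ParsedEmail, error) {
	local, domain, ok := strings.Cut(strings.ToLower(strings.TrimSpace(addr)), "@")
	if !ok || local == "" || domain == "" {
		return ParsedEmail{}, fmt.Errorf("not an email address: %q", addr)
	}
	p := ParsedEmail{Domain: domain, Personal: personalDomains[domain]}

	// Split on dots and drop empty pieces ("john..smith").
	var parts []string
	for _, s := range strings.Split(local, ".") {
		if s != "" {
			parts = append(parts, s)
		}
	}
	switch {
	case len(parts) >= 2: // john.smith, john.smith.jr
		p.First = capitalize(parts[0])
		rest := make([]string, 0, len(parts)-1)
		for _, s := range parts[1:] {
			rest = append(rest, capitalize(s))
		}
		p.Last = strings.Join(rest, " ")
	case len(parts) == 1 && (knownFirstNames[parts[0]] || len(parts[0]) < 3): // john
		p.First = capitalize(parts[0])
	case len(parts) == 1: // jsmith: guess initial + surname, flag as partial
		r := []rune(parts[0])
		p.First = string(unicode.ToUpper(r[0]))
		p.Last = capitalize(string(r[1:]))
		p.Partial = true
	}
	return p, nil
}

func main() {
	for _, a := range []string{"john.smith@gs.com", "jsmith@company.com", "john@company.com"} {
		p, _ := ParseEmail(a)
		fmt.Printf("%s -> first=%q last=%q partial=%v\n", a, p.First, p.Last, p.Partial)
	}
}
```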

Person Enrichment (API)

Primary: Apollo.io or Clearbit Enrichment API. Lookup by email returns:

  • Full name, title, phone
  • LinkedIn URL
  • Headshot URL
  • Company association (confirms domain mapping)

Fallback chain: Apollo → Clearbit → Hunter.io → parse-only.
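The fallback chain can be sketched as an ordered list of providers behind one interface. The `Enricher` interface, `Person` fields, and the stub providers below are hypothetical; they do not reflect the real Apollo, Clearbit, or Hunter.io client APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// Person is an illustrative subset of the enrichment result.
type Person struct {
	Email, FullName, Title, Source string
}

// ErrNotFound signals that a provider has no record for the email.
var ErrNotFound = errors.New("person not found")

// Enricher is a hypothetical wrapper around one vendor API.
type Enricher interface {
	Name() string
	Lookup(email string) (Person, error)
}

// EnrichWithFallback tries each provider in order and returns the first hit.
// If every provider fails, it falls back to a parse-only record.
func EnrichWithFallback(email string, chain []Enricher) Person {
	for _, e := range chain {
		p, err := e.Lookup(email)
		if err == nil {
			p.Source = e.Name()
			return p
		}
	}
	return Person{Email: email, Source: "parse-only"}
}

// stub is an in-memory provider used only for this example.
type stub struct {
	name  string
	known map[string]Person
}

func (s stub) Name() string { return s.name }

func (s stub) Lookup(email string) (Person, error) {
	if p, ok := s.known[email]; ok {
		return p, nil
	}
	return Person{}, ErrNotFound
}

// demoChain mirrors the order Apollo -> Clearbit -> Hunter.io with fake data.
func demoChain() []Enricher {
	return []Enricher{
		stub{name: "apollo"}, // knows nobody in this demo
		stub{name: "clearbit", known: map[string]Person{
			"ann@example.com": {Email: "ann@example.com", FullName: "Ann Example", Title: "VP"},
		}},
		stub{name: "hunter"},
	}
}

func main() {
	fmt.Println(EnrichWithFallback("ann@example.com", demoChain()).Source)
	fmt.Println(EnrichWithFallback("nobody@example.com", demoChain()).Source)
}
```

Recording `Source` on the result matches the `source` column in the data model in §9.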

2. Domain-Based Company Scraping

What to scrape (single fetch of homepage)

  • Company name (from <title>, OG tags, JSON-LD)
  • Logo URL (OG image, JSON-LD logo, favicon fallback)
  • Description / tagline
  • Address, phone, fax
  • Industry signals (from meta keywords, page content)
  • Social links (LinkedIn, Twitter)
  • Tech stack hints (optional, from script tags / headers)

Deep scrape (if team page exists)

Follow /about, /team, /our-team, /people links from nav:

  • Person name, title, photo URL, email, phone
  • Bio, education, prior experience (from individual profile pages)

Implementation

Go + goquery for HTML parsing. No headless browser is needed as long as the target sites are server-rendered, which is typical for WordPress and marketing sites; JavaScript-rendered pages will yield little. Timeout: 5s per fetch, max 3 pages per domain (homepage + team listing + one profile to detect email pattern).
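A sketch of the fetch budget described above, kept to the standard library so it runs without dependencies. Link and field extraction with goquery is deliberately left out. `Fetcher`, `HTTPFetcher`, `PickTeamLink`, `FetchPages`, and the 2 MiB body cap are assumptions for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

const (
	fetchTimeout = 5 * time.Second // per fetch, from the spec
	maxPages     = 3               // homepage + team listing + one profile
	maxBodyBytes = 2 << 20         // assumption: cap bodies at 2 MiB
)

// Fetcher abstracts a single page fetch so it can be faked in tests.
type Fetcher func(url string) (string, error)

// HTTPFetcher returns a Fetcher backed by an http.Client with the 5s timeout.
func HTTPFetcher() Fetcher {
	client := &http.Client{Timeout: fetchTimeout}
	return func(url string) (string, error) {
		resp, err := client.Get(url)
		if err != nil {
			return "", err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return "", fmt.Errorf("GET %s: %s", url, resp.Status)
		}
		b, err := io.ReadAll(io.LimitReader(resp.Body, maxBodyBytes))
		return string(b), err
	}
}

// teamPaths are the nav targets named in the spec.
var teamPaths = []string{"/about", "/team", "/our-team", "/people"}

// PickTeamLink returns the first href that ends in one of the team paths.
func PickTeamLink(hrefs []string) (string, bool) {
	for _, h := range hrefs {
		clean := strings.TrimSuffix(strings.ToLower(h), "/")
		for _, p := range teamPaths {
			if strings.HasSuffix(clean, p) {
				return h, true
			}
		}
	}
	return "", false
}

// FetchPages attempts at most maxPages URLs in order; failed fetches are
// skipped but still count against the budget.
func FetchPages(urls []string, get Fetcher) map[string]string {
	pages := map[string]string{}
	for i, u := range urls {
		if i >= maxPages {
			break
		}
		if body, err := get(u); err == nil {
			pages[u] = body
		}
	}
	return pages
}

func main() {
	// A fake fetcher keeps the example offline; swap in HTTPFetcher() for real use.
	fake := func(url string) (string, error) { return "<html>" + url + "</html>", nil }
	urls := []string{"https://example.com/", "https://example.com/team",
		"https://example.com/team/a", "https://example.com/team/b"}
	fmt.Println(len(FetchPages(urls, fake)))
	fmt.Println(PickTeamLink([]string{"/contact", "/our-team/"}))
}
```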

Caching

Cache company data per domain in SQLite. TTL: 30 days. Headshot images: download and store locally (don't hotlink — external URLs go stale).

3. Headshot Strategy

Source Priority

  1. Enrichment API (Apollo/Clearbit) — highest quality, most reliable
  2. Gravatar: https://gravatar.com/avatar/{md5(email)}?d=404 (a 404 response means no image)
  3. Company website — scraped from team page (see §2)
  4. LinkedIn — if user linked their profile (manual, not scraped)
  5. Fallback — generated initials avatar with company brand color (extracted from logo dominant color)
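A short sketch of the priority walk and the Gravatar URL. Gravatar hashes the trimmed, lowercased address; `?d=404` is the parameter from the list above. The `candidate` type and `pickHeadshot` are hypothetical names, and the actual HTTP check for the 404 is not shown.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// gravatarURL builds the lookup URL; a 404 response means no image exists.
func gravatarURL(email string) string {
	normalized := strings.ToLower(strings.TrimSpace(email))
	sum := md5.Sum([]byte(normalized))
	return "https://gravatar.com/avatar/" + hex.EncodeToString(sum[:]) + "?d=404"
}

// candidate is one possible headshot source; URL is empty when the source
// had nothing to offer.
type candidate struct {
	Source string
	URL    string
}

// pickHeadshot expects candidates in priority order and returns the first
// one with a URL, or the initials fallback when none has one.
func pickHeadshot(cands []candidate) candidate {
	for _, c := range cands {
		if c.URL != "" {
			return c
		}
	}
	return candidate{Source: "initials"}
}

func main() {
	email := "john.smith@gs.com"
	chosen := pickHeadshot([]candidate{
		{Source: "enrichment_api"},                     // no photo returned
		{Source: "gravatar", URL: gravatarURL(email)},  // assumes the 404 check passed
		{Source: "company_website"},
	})
	fmt.Println(chosen.Source, chosen.URL)
}
```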

Storage

  • Download and store all headshots locally at invite/enrichment time
  • Serve from DealSpace CDN, never hotlink external URLs
  • Standard size: 256×256px, JPEG, quality 85
  • Thumbnail: 48×48px for inline use

Display Locations (mini headshots throughout)

  • Deal room member list — face next to name and role
  • Activity feed — "{face} Craig Lawson viewed the CIM — 2m ago"
  • Document access log — who opened what, when, with face
  • Comments / annotations — face next to every note
  • Invite flow — show headshots of discovered colleagues
  • Online presence — 24px avatars in header: "3 people in this room"
  • Watermark metadata — avatar optionally embedded in PDF watermark (v2)

4. Colleague Discovery & Invite

Flow

After first user from a domain authenticates:

  1. Query enrichment API for known people at same domain
  2. Present: "We found N people at {Company}. Invite them to this deal room?"
  3. Show as a list with headshots, name, title — checkboxes to select
  4. Generate invite emails using detected email pattern from first user
  5. One-click send

Email Pattern Detection

From the authenticated user's email, derive the pattern:

  • john.smith@hpc.com → {first}.{last}@hpc.com
  • jsmith@hpc.com → {first_initial}{last}@hpc.com
  • john@hpc.com → {first}@hpc.com

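Apply the pattern to generate emails for discovered colleagues. Optional: verify generated addresses with SMTP RCPT TO before sending. Treat the result as a hint only: catch-all domains accept every recipient, and some servers reject or defer such probes.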
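Pattern detection and application can be sketched as follows. This assumes the authenticated user's first and last name are already known (for example from enrichment), since the local part alone is ambiguous. `Pattern`, `DetectPattern`, and `Apply` are hypothetical names, and the initial is taken bytewise, so the sketch assumes ASCII names.

```go
package main

import (
	"fmt"
	"strings"
)

// Pattern is a template for the local part of an address.
type Pattern string

const (
	FirstDotLast Pattern = "{first}.{last}"
	InitialLast  Pattern = "{first_initial}{last}"
	FirstOnly    Pattern = "{first}"
	Unknown      Pattern = ""
)

// DetectPattern compares the local part against the user's known name.
func DetectPattern(email, first, last string) Pattern {
	local, _, ok := strings.Cut(strings.ToLower(email), "@")
	if !ok {
		return Unknown
	}
	f, l := strings.ToLower(first), strings.ToLower(last)
	switch {
	case f != "" && l != "" && local == f+"."+l:
		return FirstDotLast
	case f != "" && l != "" && local == f[:1]+l:
		return InitialLast
	case f != "" && local == f:
		return FirstOnly
	}
	return Unknown
}

// Apply generates a colleague's address from a detected pattern.
// It reports false when the pattern is unknown or a needed name part is missing.
func Apply(p Pattern, first, last, domain string) (string, bool) {
	f, l := strings.ToLower(first), strings.ToLower(last)
	if p == Unknown || f == "" || (p != FirstOnly && l == "") {
		return "", false
	}
	r := strings.NewReplacer("{first_initial}", f[:1], "{first}", f, "{last}", l)
	return r.Replace(string(p)) + "@" + domain, true
}

func main() {
	p := DetectPattern("jsmith@hpc.com", "John", "Smith")
	addr, ok := Apply(p, "Craig", "Lawson", "hpc.com")
	fmt.Println(p, addr, ok)
}
```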

Privacy Considerations

  • Only show colleague suggestions to users whose role ends in _admin (e.g., ib_admin, seller_admin)
  • Never expose enrichment data to users outside the deal room
  • Allow users to dismiss / hide colleague suggestions
  • Enrichment data is deal-room-scoped, not global

5. Company Card Auto-Population

When a company domain enters the system (via user email or manual entry), auto-generate a company card:

| Field | Source |
|---|---|
| Name | HTML title / OG / JSON-LD |
| Logo | OG image / favicon |
| Description | Meta description / JSON-LD |
| Address | Page scrape / JSON-LD |
| Phone | Page scrape |
| Industry | Enrichment API / page signals |
| Website | The domain itself |
| Team size | Enrichment API |
| Key people | Team page scrape |
| LinkedIn | Social links from page |

For bulge bracket firms (Goldman, JPM, Morgan Stanley, etc.) — maintain a static seed database of ~50 major firms with pre-filled cards. Don't scrape; just look up.
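The seed-first lookup can be sketched as a map check that short-circuits the scraper. The three entries and their domains are illustrative only, and `CompanyCard`, `seedFirms`, and `CardFor` are hypothetical names; the scraper is passed in as a function so the example stays offline.

```go
package main

import (
	"fmt"
	"strings"
)

// CompanyCard is a trimmed-down card for illustration.
type CompanyCard struct {
	Name, Website, Source string
}

// seedFirms stands in for the static seed database (illustrative entries).
var seedFirms = map[string]CompanyCard{
	"gs.com":            {Name: "Goldman Sachs", Website: "gs.com", Source: "seed"},
	"jpmorgan.com":      {Name: "JPMorgan", Website: "jpmorgan.com", Source: "seed"},
	"morganstanley.com": {Name: "Morgan Stanley", Website: "morganstanley.com", Source: "seed"},
}

// CardFor returns the seeded card when one exists and only otherwise
// calls the scraper.
func CardFor(domain string, scrape func(domain string) (CompanyCard, error)) (CompanyCard, error) {
	d := strings.ToLower(strings.TrimSpace(domain))
	if card, ok := seedFirms[d]; ok {
		return card, nil
	}
	return scrape(d)
}

func main() {
	scrape := func(d string) (CompanyCard, error) {
		return CompanyCard{Name: d, Website: d, Source: "scrape"}, nil
	}
	for _, d := range []string{"GS.com", "example.com"} {
		card, _ := CardFor(d, scrape)
		fmt.Println(card.Name, card.Source)
	}
}
```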

6. Spreadsheet Anonymization

Use Case

Seller uploads financial model or data room spreadsheet. Buyer-side viewers should see anonymized company names / counterparty names.

Approach

  • Go + excelize library (MIT, actively maintained)
  • Consistent deterministic mapping across all tabs: "Acme Corp" → "Target Alpha" everywhere
  • Mapping stored per deal room, reversible by deal room owner (ib_admin, seller_admin)
  • Preserve formulas — VLOOKUPs resolve correctly because mapping is consistent
  • User-configurable: tag which columns/fields to anonymize (or auto-suggest from headers like "Company", "Name", "Contact", "Counterparty")
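The consistent, reversible mapping can be sketched independently of the spreadsheet library. Reading and writing cells with excelize is not shown; rows are plain string slices here. `Anonymizer`, the "Target" prefix handling, and the numbered suffix after the Greek letters run out are assumptions. In practice the forward and reverse maps would be persisted per deal room so aliases stay stable across uploads.

```go
package main

import (
	"fmt"
	"strconv"
)

var greek = []string{"Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta", "Eta", "Theta"}

// Anonymizer assigns one stable alias per real name, in first-seen order.
type Anonymizer struct {
	prefix  string
	forward map[string]string // real name -> alias
	reverse map[string]string // alias -> real name (owner-only lookup)
}

func NewAnonymizer(prefix string) *Anonymizer {
	return &Anonymizer{prefix: prefix, forward: map[string]string{}, reverse: map[string]string{}}
}

// Alias returns the existing alias for a name or mints the next one.
func (a *Anonymizer) Alias(real string) string {
	if alias, ok := a.forward[real]; ok {
		return alias
	}
	n := len(a.forward)
	alias := a.prefix + " " + greek[n%len(greek)]
	if round := n / len(greek); round > 0 {
		alias += " " + strconv.Itoa(round+1) // "Target Alpha 2" once letters run out
	}
	a.forward[real] = alias
	a.reverse[alias] = real
	return alias
}

// Reveal maps an alias back to the real name.
func (a *Anonymizer) Reveal(alias string) (string, bool) {
	real, ok := a.reverse[alias]
	return real, ok
}

// AnonymizeColumns rewrites the tagged columns of one sheet in place.
// The first row is assumed to be a header and is left untouched.
func (a *Anonymizer) AnonymizeColumns(rows [][]string, cols ...int) {
	for i, row := range rows {
		if i == 0 {
			continue
		}
		for _, c := range cols {
			if c < len(row) && row[c] != "" {
				row[c] = a.Alias(row[c])
			}
		}
	}
}

func main() {
	a := NewAnonymizer("Target")
	summary := [][]string{{"Company", "Revenue"}, {"Acme Corp", "10"}, {"Globex", "20"}}
	detail := [][]string{{"Counterparty"}, {"Globex"}, {"Acme Corp"}}
	a.AnonymizeColumns(summary, 0)
	a.AnonymizeColumns(detail, 0)
	fmt.Println(summary)
	fmt.Println(detail)
}
```

Because both tabs go through the same `Anonymizer`, a lookup keyed on the company name in one tab still matches the other.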

Limitations

  • Formula-embedded string literals (e.g., =IF(A1="Acme Corp",...)) require formula string parsing — defer to v2
  • Conditional formatting rules referencing text values — defer to v2

7. API / Enrichment Vendor Evaluation

Vendor Person Company Email verify Price
Apollo.io Free tier: 50/mo, paid from $49/mo
Clearbit (HubSpot) Enterprise pricing
Hunter.io partial Free tier: 25/mo, paid from $49/mo
PDL (People Data Labs) Pay-per-record
Gravatar photo only Free

Recommendation: Start with Apollo free tier for MVP. Evaluate PDL for scale.

8. Implementation Phases

Phase 1 (MVP): Email parsing + Gravatar headshot + initials fallback. Zero external API dependency.

Phase 2: Domain homepage scrape → company card auto-population. Still zero paid API.

Phase 3: Apollo/Clearbit integration → full person enrichment + colleague discovery.

Phase 4: Team page deep scrape + spreadsheet anonymization.

9. Profile Isolation — Deal-Scoped Identity

Core Principle

Profiles are scoped to the deal room, not global. The same physical person can have three separate profiles across three deals, each with different levels of PII. This is by design.

Why

  • Deal A: someone's cell phone was shared in a conversation → their profile has it
  • Deal B: same person, but only their work email is known → no cell
  • Deal C: same person joined via a different email entirely → different enrichment data

We do NOT merge these. We do NOT leak PII from Deal A into Deal B. Each deal room is a self-contained universe of identity.

Data Model

CREATE TABLE DealProfile (
    id              TEXT PRIMARY KEY,
    deal_id         TEXT NOT NULL,  -- FK to deal room
    email           TEXT NOT NULL,  -- the email used in THIS deal
    first_name      TEXT,
    last_name       TEXT,
    title           TEXT,
    phone           TEXT,           -- may be NULL in some deals
    headshot_path   TEXT,           -- locally stored, per-deal copy
    company_id      TEXT,           -- FK to DealCompany (also deal-scoped)
    linkedin_url    TEXT,
    source          TEXT,           -- 'enrichment', 'manual', 'scrape', 'invite'
    created_at      DATETIME,
    enriched_at     DATETIME,

    UNIQUE (deal_id, email)
);

Rules

  1. No cross-deal profile lookups. A query in Deal A never touches Deal B's profiles.
  2. No global person table. There is no Person entity — only DealProfile.
  3. Headshots are copied per deal. Even if the same headshot URL was scraped, each deal gets its own stored copy. If the source disappears, existing deals are unaffected.
  4. Enrichment runs per deal. If the same email appears in two deals, enrichment runs independently for each. Redundant but safe.
  5. Deletion is deal-scoped. Deleting a deal room deletes all its profiles. No orphans, no leaks.
  6. Admin export includes only that deal's PII. Audit logs, data exports, and compliance reports are deal-scoped.
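One way to make rules 1, 2, and 5 hard to violate is to give the storage layer no method that takes an email without a deal ID. The in-memory `Store` below is a hypothetical sketch of that idea, not the SQLite-backed implementation.

```go
package main

import "fmt"

// DealProfile is a trimmed-down profile; every instance belongs to one deal.
type DealProfile struct {
	DealID, Email, FirstName, LastName, Phone string
}

// profileKey mirrors UNIQUE(deal_id, email).
type profileKey struct{ dealID, email string }

// Store deliberately has no lookup by email alone.
type Store struct {
	profiles map[profileKey]DealProfile
}

func NewStore() *Store {
	return &Store{profiles: map[profileKey]DealProfile{}}
}

// Upsert writes the profile for exactly one (deal, email) pair.
func (s *Store) Upsert(p DealProfile) {
	s.profiles[profileKey{p.DealID, p.Email}] = p
}

// Get reads within a single deal; other deals are never consulted.
func (s *Store) Get(dealID, email string) (DealProfile, bool) {
	p, ok := s.profiles[profileKey{dealID, email}]
	return p, ok
}

// DeleteDeal removes every profile in the deal and reports how many were removed.
func (s *Store) DeleteDeal(dealID string) int {
	n := 0
	for k := range s.profiles {
		if k.dealID == dealID {
			delete(s.profiles, k)
			n++
		}
	}
	return n
}

func main() {
	s := NewStore()
	s.Upsert(DealProfile{DealID: "deal-a", Email: "craig@hpc.com", Phone: "555-0100"})
	s.Upsert(DealProfile{DealID: "deal-b", Email: "craig@hpc.com"})

	b, _ := s.Get("deal-b", "craig@hpc.com")
	fmt.Printf("deal-b phone: %q\n", b.Phone) // the phone from deal-a does not leak
	fmt.Println("removed from deal-a:", s.DeleteDeal("deal-a"))
}
```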

Rationale

In M&A, confidentiality isn't just about documents — it's about who knows what about whom. A buyer in Deal A should never learn that the same banker is also involved in Deal B, or gain contact details that were only shared in a different context. Three profiles for one person is not a bug. It's the feature.