# Enrichment & Identity — Spec

> Status: Draft
> Author: Johan
> Date: 2026-03-15

## Overview

When a user logs in with just an email address, DealSpace should derive as much identity and company context as possible — eliminating manual data entry and creating an immediate sense of a populated, professional environment.

## 1. Email-Based Identity Resolution

### Flow

1. User enters email → receives 6-digit TOTP code
2. While user enters code, backend fires enrichment pipeline (concurrent goroutines):
   - Parse email → extract name parts + company domain
   - Domain scrape → company metadata
   - Enrichment API → person + company data
3. User lands in app with profile pre-populated

### Email Parsing

| Pattern | First | Last |
|---------|-------|------|
| `john.smith@gs.com` | John | Smith |
| `jsmith@company.com` | J | Smith (flag as partial) |
| `john@company.com` | John | — |
| `john.smith.jr@company.com` | John | Smith Jr |

Personal domains (gmail, outlook, yahoo, proton, icloud, hey) → skip company enrichment, person-only.

### Person Enrichment (API)

Primary: Apollo.io or Clearbit Enrichment API. Lookup by email returns:

- Full name, title, phone
- LinkedIn URL
- Headshot URL
- Company association (confirms domain mapping)

Fallback chain: Apollo → Clearbit → Hunter.io → parse-only.

## 2. Domain-Based Company Scraping

### What to scrape (single fetch of homepage)

- Company name (from `<title>`, OG tags, JSON-LD)
- Logo URL (OG image, JSON-LD logo, favicon fallback)
- Description / tagline
- Address, phone, fax
- Industry signals (from meta keywords, page content)
- Social links (LinkedIn, Twitter)
- Tech stack hints (optional, from script tags / headers)

### Deep scrape (if team page exists)

Follow `/about`, `/team`, `/our-team`, `/people` links from nav:

- Person name, title, photo URL, email, phone
- Bio, education, prior experience (from individual profile pages)

### Implementation

Go + `goquery` for HTML parsing.
No headless browser needed — these are server-rendered WordPress/marketing sites. Timeout: 5s per fetch, max 3 pages per domain (homepage + team listing + one profile to detect email pattern).

### Caching

Cache company data per domain in SQLite. TTL: 30 days. Headshot images: download and store locally (don't hotlink — external URLs go stale).

## 3. Headshot Strategy

### Source Priority

1. **Enrichment API** (Apollo/Clearbit) — highest quality, most reliable
2. **Gravatar** — `https://gravatar.com/avatar/{md5(email)}?d=404` (check for 404 = no image)
3. **Company website** — scraped from team page (see §2)
4. **LinkedIn** — if user linked their profile (manual, not scraped)
5. **Fallback** — generated initials avatar with company brand color (extracted from logo dominant color)

### Storage

- Download and store all headshots locally at invite/enrichment time
- Serve from DealSpace CDN, never hotlink external URLs
- Standard size: 256×256px, JPEG, quality 85
- Thumbnail: 48×48px for inline use

### Display Locations (mini headshots throughout)

- **Deal room member list** — face next to name and role
- **Activity feed** — "{face} Craig Lawson viewed the CIM — 2m ago"
- **Document access log** — who opened what, when, with face
- **Comments / annotations** — face next to every note
- **Invite flow** — show headshots of discovered colleagues
- **Online presence** — 24px avatars in header: "3 people in this room"
- **Watermark metadata** — avatar optionally embedded in PDF watermark (v2)

## 4. Colleague Discovery & Invite

### Flow

After first user from a domain authenticates:

1. Query enrichment API for known people at same domain
2. Present: "We found N people at {Company}. Invite them to this deal room?"
3. Show as a list with headshots, name, title — checkboxes to select
4. Generate invite emails using detected email pattern from first user
5. One-click send

### Email Pattern Detection

From the authenticated user's email, derive the pattern:

- `john.smith@hpc.com` → `{first}.{last}@hpc.com`
- `jsmith@hpc.com` → `{first_initial}{last}@hpc.com`
- `john@hpc.com` → `{first}@hpc.com`

Apply pattern to generate emails for discovered colleagues. Optional: SMTP RCPT TO verification before sending (many servers support this).

### Privacy Considerations

- Only show colleague suggestions to users with `_admin` roles
- Never expose enrichment data to users outside the deal room
- Allow users to dismiss / hide colleague suggestions
- Enrichment data is deal-room-scoped, not global

## 5. Company Card Auto-Population

When a company domain enters the system (via user email or manual entry), auto-generate a company card:

| Field | Source |
|-------|--------|
| Name | HTML title / OG / JSON-LD |
| Logo | OG image / favicon |
| Description | Meta description / JSON-LD |
| Address | Page scrape / JSON-LD |
| Phone | Page scrape |
| Industry | Enrichment API / page signals |
| Website | The domain itself |
| Team size | Enrichment API |
| Key people | Team page scrape |
| LinkedIn | Social links from page |

For bulge bracket firms (Goldman, JPM, Morgan Stanley, etc.) — maintain a static seed database of ~50 major firms with pre-filled cards. Don't scrape; just look up.

## 6. Spreadsheet Anonymization

### Use Case

Seller uploads financial model or data room spreadsheet. Buyer-side viewers should see anonymized company names / counterparty names.
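
As a rough illustration of the consistent renaming this use case calls for, the sketch below keeps a two-way map from real names to aliases, so the same name always gets the same alias and the mapping can be reversed. The names `Anonymizer`, `Alias`, `Reveal` and the Greek-letter suffix list are hypothetical, and reading or writing workbook cells is deliberately left out; this is one possible shape under those assumptions, not the planned implementation.

```go
package main

import "fmt"

// greek supplies alias suffixes; the list is illustrative only.
var greek = []string{"Alpha", "Beta", "Gamma", "Delta", "Epsilon"}

// Anonymizer (hypothetical) holds one deal room's name mapping in both directions.
type Anonymizer struct {
	forward map[string]string // real name -> alias
	reverse map[string]string // alias -> real name
}

// NewAnonymizer returns an empty mapping.
func NewAnonymizer() *Anonymizer {
	return &Anonymizer{forward: map[string]string{}, reverse: map[string]string{}}
}

// Alias returns the alias for name, assigning the next free alias the first
// time a name is seen. Repeated calls with the same name return the same alias.
func (a *Anonymizer) Alias(name string) string {
	if alias, ok := a.forward[name]; ok {
		return alias
	}
	n := len(a.forward)
	alias := "Target " + greek[n%len(greek)]
	if n >= len(greek) {
		// After the suffix list is exhausted, add a round number to stay unique.
		alias = fmt.Sprintf("%s %d", alias, n/len(greek)+1)
	}
	a.forward[name] = alias
	a.reverse[alias] = name
	return alias
}

// Reveal maps an alias back to the real name, if the alias is known.
func (a *Anonymizer) Reveal(alias string) (string, bool) {
	name, ok := a.reverse[alias]
	return name, ok
}

func main() {
	anon := NewAnonymizer()
	// The same cell value on different tabs resolves to the same alias.
	for _, cell := range []string{"Acme Corp", "Globex", "Acme Corp"} {
		fmt.Println(cell, "->", anon.Alias(cell))
	}
}
```

In this sketch aliases are assigned in order of first appearance, so the stored map (not the input order) is what keeps them stable across uploads. Matching is on exact strings, so spelling variants of one company would need normalizing before lookup.
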
### Approach

- Go + `excelize` library (BSD 3-Clause, actively maintained)
- Consistent deterministic mapping across all tabs: "Acme Corp" → "Target Alpha" everywhere
- Mapping stored per deal room, reversible by deal room owner (`ib_admin`, `seller_admin`)
- Preserve formulas — VLOOKUPs resolve correctly because mapping is consistent
- User-configurable: tag which columns/fields to anonymize (or auto-suggest from headers like "Company", "Name", "Contact", "Counterparty")

### Limitations

- Formula-embedded string literals (e.g., `=IF(A1="Acme Corp",...)`) require formula string parsing — defer to v2
- Conditional formatting rules referencing text values — defer to v2

## 7. API / Enrichment Vendor Evaluation

| Vendor | Person | Company | Email verify | Price |
|--------|--------|---------|-------------|-------|
| Apollo.io | ✓ | ✓ | ✓ | Free tier: 50/mo, paid from $49/mo |
| Clearbit (HubSpot) | ✓ | ✓ | — | Enterprise pricing |
| Hunter.io | partial | — | ✓ | Free tier: 25/mo, paid from $49/mo |
| PDL (People Data Labs) | ✓ | ✓ | — | Pay-per-record |
| Gravatar | photo only | — | — | Free |

Recommendation: Start with Apollo free tier for MVP. Evaluate PDL for scale.

## 8. Implementation Phases

**Phase 1 (MVP):** Email parsing + Gravatar headshot + initials fallback. Zero external API dependency.

**Phase 2:** Domain homepage scrape → company card auto-population. Still zero paid API.

**Phase 3:** Apollo/Clearbit integration → full person enrichment + colleague discovery.

**Phase 4:** Team page deep scrape + spreadsheet anonymization.

## 9. Profile Isolation — Deal-Scoped Identity

### Core Principle

**Profiles are scoped to the deal room, not global.** The same physical person can have three separate profiles across three deals, each with different levels of PII. This is by design.
### Why

- Deal A: someone's cell phone was shared in a conversation → their profile has it
- Deal B: same person, but only their work email is known → no cell
- Deal C: same person joined via a different email entirely → different enrichment data

We do NOT merge these. We do NOT leak PII from Deal A into Deal B. Each deal room is a self-contained universe of identity.

### Data Model

```sql
CREATE TABLE DealProfile (
    id            TEXT PRIMARY KEY,
    deal_id       TEXT NOT NULL,  -- FK to deal room
    email         TEXT NOT NULL,  -- the email used in THIS deal
    first_name    TEXT,
    last_name     TEXT,
    title         TEXT,
    phone         TEXT,           -- may be NULL in some deals
    headshot_path TEXT,           -- locally stored, per-deal copy
    company_id    TEXT,           -- FK to DealCompany (also deal-scoped)
    linkedin_url  TEXT,
    source        TEXT,           -- 'enrichment', 'manual', 'scrape', 'invite'
    created_at    DATETIME,
    enriched_at   DATETIME,
    UNIQUE (deal_id, email)
);
```

### Rules

1. **No cross-deal profile lookups.** A query in Deal A never touches Deal B's profiles.
2. **No global person table.** There is no `Person` entity — only `DealProfile`.
3. **Headshots are copied per deal.** Even if the same headshot URL was scraped, each deal gets its own stored copy. If the source disappears, existing deals are unaffected.
4. **Enrichment runs per deal.** If the same email appears in two deals, enrichment runs independently for each. Redundant but safe.
5. **Deletion is deal-scoped.** Deleting a deal room deletes all its profiles. No orphans, no leaks.
6. **Admin export includes only that deal's PII.** Audit logs, data exports, and compliance reports are deal-scoped.

### Rationale

In M&A, confidentiality isn't just about documents — it's about *who knows what about whom*. A buyer in Deal A should never learn that the same banker is also involved in Deal B, or gain contact details that were only shared in a different context.

Three profiles for one person is not a bug. It's the feature.
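
The isolation rules above can be illustrated with a small in-memory sketch in which every read and delete requires a deal ID and no lookup by email alone exists. `ProfileStore`, `profileKey`, and the method names are hypothetical stand-ins for the real table access layer, and only a few of the `DealProfile` fields are carried; treat it as an assumption-laden sketch, not the intended storage code.

```go
package main

import "fmt"

// profileKey mirrors UNIQUE(deal_id, email): an email alone never identifies a profile.
type profileKey struct {
	DealID string
	Email  string
}

// DealProfile carries a subset of the spec's fields for brevity.
type DealProfile struct {
	DealID string
	Email  string
	Phone  string // may be empty in some deals
}

// ProfileStore is a hypothetical in-memory stand-in for the DealProfile table.
type ProfileStore struct {
	profiles map[profileKey]DealProfile
}

// NewProfileStore returns an empty store.
func NewProfileStore() *ProfileStore {
	return &ProfileStore{profiles: map[profileKey]DealProfile{}}
}

// Put inserts or replaces the profile for (p.DealID, p.Email).
func (s *ProfileStore) Put(p DealProfile) {
	s.profiles[profileKey{p.DealID, p.Email}] = p
}

// Get requires a deal ID; there is deliberately no lookup by email alone.
func (s *ProfileStore) Get(dealID, email string) (DealProfile, bool) {
	p, ok := s.profiles[profileKey{dealID, email}]
	return p, ok
}

// DeleteDeal removes every profile in one deal room and returns how many were removed.
func (s *ProfileStore) DeleteDeal(dealID string) int {
	n := 0
	for k := range s.profiles {
		if k.DealID == dealID {
			delete(s.profiles, k) // deleting during range is safe in Go
			n++
		}
	}
	return n
}

func main() {
	s := NewProfileStore()
	s.Put(DealProfile{DealID: "deal-a", Email: "banker@example.com", Phone: "+1 555 0100"})
	s.Put(DealProfile{DealID: "deal-b", Email: "banker@example.com"}) // same person, no phone known here

	a, _ := s.Get("deal-a", "banker@example.com")
	b, _ := s.Get("deal-b", "banker@example.com")
	fmt.Printf("deal-a phone=%q, deal-b phone=%q\n", a.Phone, b.Phone)

	fmt.Println("removed from deal-a:", s.DeleteDeal("deal-a"))
	_, stillThere := s.Get("deal-b", "banker@example.com")
	fmt.Println("deal-b profile intact:", stillThere)
}
```

Because the deal ID is part of the key, the same email in two deals yields two independent records, and deleting one deal leaves the other untouched. A SQL-backed version would get the same effect by including `deal_id` in every `WHERE` clause.
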