diff --git a/docs/ENRICHMENT-SPEC.md b/docs/ENRICHMENT-SPEC.md
new file mode 100644
index 0000000..f688e63
--- /dev/null
+++ b/docs/ENRICHMENT-SPEC.md
@@ -0,0 +1,231 @@
+# Enrichment & Identity — Spec
+
+> Status: Draft
+> Author: Johan
+> Date: 2026-03-15
+
+## Overview
+
+When a user logs in with just an email address, DealSpace should derive as much identity and company context as possible — eliminating manual data entry and creating an immediate sense of a populated, professional environment.
+
+## 1. Email-Based Identity Resolution
+
+### Flow
+
+1. User enters email → receives 6-digit TOTP code
+2. While user enters code, backend fires enrichment pipeline (concurrent goroutines):
+   - Parse email → extract name parts + company domain
+   - Domain scrape → company metadata
+   - Enrichment API → person + company data
+3. User lands in app with profile pre-populated
+
+### Email Parsing
+
+| Pattern | First | Last |
+|---------|-------|------|
+| `john.smith@gs.com` | John | Smith |
+| `jsmith@company.com` | J | Smith (flag as partial) |
+| `john@company.com` | John | — |
+| `john.smith.jr@company.com` | John | Smith Jr |
+
+Personal domains (gmail, outlook, yahoo, proton, icloud, hey) → skip company enrichment, person-only.
+
+### Person Enrichment (API)
+
+Primary: Apollo.io or Clearbit Enrichment API. Lookup by email returns:
+- Full name, title, phone
+- LinkedIn URL
+- Headshot URL
+- Company association (confirms domain mapping)
+
+Fallback chain: Apollo → Clearbit → Hunter.io → parse-only.
+
+## 2. Domain-Based Company Scraping
+
+### What to scrape (single fetch of homepage)
+
+- Company name (from `<title>`, OG tags, JSON-LD)
+- Logo URL (OG image, JSON-LD logo, favicon fallback)
+- Description / tagline
+- Address, phone, fax
+- Industry signals (from meta keywords, page content)
+- Social links (LinkedIn, Twitter)
+- Tech stack hints (optional, from script tags / headers)
+
+### Deep scrape (if team page exists)
+
+Follow `/about`, `/team`, `/our-team`, `/people` links from nav:
+- Person name, title, photo URL, email, phone
+- Bio, education, prior experience (from individual profile pages)
+
+### Implementation
+
+Go + `goquery` for HTML parsing. No headless browser needed — these are server-rendered WordPress/marketing sites. Timeout: 5s per fetch, max 3 pages per domain (homepage + team listing + one profile to detect email pattern).
+
+### Caching
+
+Cache company data per domain in SQLite. TTL: 30 days. Headshot images: download and store locally (don't hotlink — external URLs go stale).
+
+## 3. Headshot Strategy
+
+### Source Priority
+
+1. **Enrichment API** (Apollo/Clearbit) — highest quality, most reliable
+2. **Gravatar** — `https://gravatar.com/avatar/{md5(email)}?d=404` (check for 404 = no image)
+3. **Company website** — scraped from team page (see §2)
+4. **LinkedIn** — if user linked their profile (manual, not scraped)
+5. **Fallback** — generated initials avatar with company brand color (extracted from logo dominant color)
+
+### Storage
+
+- Download and store all headshots locally at invite/enrichment time
+- Serve from DealSpace CDN, never hotlink external URLs
+- Standard size: 256×256px, JPEG, quality 85
+- Thumbnail: 48×48px for inline use
+
+### Display Locations (mini headshots throughout)
+
+- **Deal room member list** — face next to name and role
+- **Activity feed** — "{face} Craig Lawson viewed the CIM — 2m ago"
+- **Document access log** — who opened what, when, with face
+- **Comments / annotations** — face next to every note
+- **Invite flow** — show headshots of discovered colleagues
+- **Online presence** — 24px avatars in header: "3 people in this room"
+- **Watermark metadata** — avatar optionally embedded in PDF watermark (v2)
+
+## 4. Colleague Discovery & Invite
+
+### Flow
+
+After first user from a domain authenticates:
+
+1. Query enrichment API for known people at same domain
+2. Present: "We found N people at {Company}. Invite them to this deal room?"
+3. Show as a list with headshots, name, title — checkboxes to select
+4. Generate invite emails using detected email pattern from first user
+5. One-click send
+
+### Email Pattern Detection
+
+From the authenticated user's email, derive the pattern:
+- `john.smith@hpc.com` → `{first}.{last}@hpc.com`
+- `jsmith@hpc.com` → `{first_initial}{last}@hpc.com`
+- `john@hpc.com` → `{first}@hpc.com`
+
+Apply pattern to generate emails for discovered colleagues. Optional: SMTP RCPT TO verification before sending (many servers support this).
+
+### Privacy Considerations
+
+- Only show colleague suggestions to users with `_admin` roles
+- Never expose enrichment data to users outside the deal room
+- Allow users to dismiss / hide colleague suggestions
+- Enrichment data is deal-room-scoped, not global
+
+## 5. Company Card Auto-Population
+
+When a company domain enters the system (via user email or manual entry), auto-generate a company card:
+
+| Field | Source |
+|-------|--------|
+| Name | HTML title / OG / JSON-LD |
+| Logo | OG image / favicon |
+| Description | Meta description / JSON-LD |
+| Address | Page scrape / JSON-LD |
+| Phone | Page scrape |
+| Industry | Enrichment API / page signals |
+| Website | The domain itself |
+| Team size | Enrichment API |
+| Key people | Team page scrape |
+| LinkedIn | Social links from page |
+
+For bulge bracket firms (Goldman, JPM, Morgan Stanley, etc.) — maintain a static seed database of ~50 major firms with pre-filled cards. Don't scrape; just look up.
+
+## 6. Spreadsheet Anonymization
+
+### Use Case
+
+Seller uploads financial model or data room spreadsheet. Buyer-side viewers should see anonymized company names / counterparty names.
+
+### Approach
+
+- Go + `excelize` library (MIT, actively maintained)
+- Consistent deterministic mapping across all tabs: "Acme Corp" → "Target Alpha" everywhere
+- Mapping stored per deal room, reversible by deal room owner (`ib_admin`, `seller_admin`)
+- Preserve formulas — VLOOKUPs resolve correctly because mapping is consistent
+- User-configurable: tag which columns/fields to anonymize (or auto-suggest from headers like "Company", "Name", "Contact", "Counterparty")
+
+### Limitations
+
+- Formula-embedded string literals (e.g., `=IF(A1="Acme Corp",...)`) require formula string parsing — defer to v2
+- Conditional formatting rules referencing text values — defer to v2
+
+## 7. API / Enrichment Vendor Evaluation
+
+| Vendor | Person | Company | Email verify | Price |
+|--------|--------|---------|-------------|-------|
+| Apollo.io | ✓ | ✓ | ✓ | Free tier: 50/mo, paid from $49/mo |
+| Clearbit (HubSpot) | ✓ | ✓ | — | Enterprise pricing |
+| Hunter.io | partial | — | ✓ | Free tier: 25/mo, paid from $49/mo |
+| PDL (People Data Labs) | ✓ | ✓ | — | Pay-per-record |
+| Gravatar | photo only | — | — | Free |
+
+Recommendation: Start with Apollo free tier for MVP. Evaluate PDL for scale.
+
+## 8. Implementation Phases
+
+**Phase 1 (MVP):** Email parsing + Gravatar headshot + initials fallback. Zero external API dependency.
+
+**Phase 2:** Domain homepage scrape → company card auto-population. Still zero paid API.
+
+**Phase 3:** Apollo/Clearbit integration → full person enrichment + colleague discovery.
+
+**Phase 4:** Team page deep scrape + spreadsheet anonymization.
+
+## 9. Profile Isolation — Deal-Scoped Identity
+
+### Core Principle
+
+**Profiles are scoped to the deal room, not global.** The same physical person can have three separate profiles across three deals, each with different levels of PII. This is by design.
+
+### Why
+
+- Deal A: someone's cell phone was shared in a conversation → their profile has it
+- Deal B: same person, but only their work email is known → no cell
+- Deal C: same person joined via a different email entirely → different enrichment data
+
+We do NOT merge these. We do NOT leak PII from Deal A into Deal B. Each deal room is a self-contained universe of identity.
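
One way to picture the isolation described above is a store whose key always includes the deal. The Go sketch below is illustrative only: `ProfileStore`, `profileKey`, and the in-memory map are assumptions made for this example, not part of the spec, and a real implementation would sit on whatever storage backs the deal profiles. It shows the same email yielding two unrelated profiles in two deals.

```go
package main

import "fmt"

// profileKey scopes a profile to one deal room; email alone is never a key.
type profileKey struct {
	DealID string
	Email  string
}

// DealProfile holds only what is known about a person inside one deal.
type DealProfile struct {
	FirstName string
	LastName  string
	Phone     string // empty when this deal never learned a phone number
}

// ProfileStore is a hypothetical in-memory stand-in for the real storage layer.
type ProfileStore struct {
	profiles map[profileKey]DealProfile
}

// NewProfileStore returns an empty store.
func NewProfileStore() *ProfileStore {
	return &ProfileStore{profiles: make(map[profileKey]DealProfile)}
}

// Put stores a profile for exactly one deal.
func (s *ProfileStore) Put(dealID, email string, p DealProfile) {
	s.profiles[profileKey{DealID: dealID, Email: email}] = p
}

// Get requires a deal ID, so a lookup by email across all deals cannot be written.
func (s *ProfileStore) Get(dealID, email string) (DealProfile, bool) {
	p, ok := s.profiles[profileKey{DealID: dealID, Email: email}]
	return p, ok
}

func main() {
	store := NewProfileStore()
	// Deal A learned a cell number; Deal B only knows the name.
	store.Put("deal-a", "craig@example.com", DealProfile{FirstName: "Craig", Phone: "+1 555 0100"})
	store.Put("deal-b", "craig@example.com", DealProfile{FirstName: "Craig"})

	a, _ := store.Get("deal-a", "craig@example.com")
	b, _ := store.Get("deal-b", "craig@example.com")
	fmt.Printf("deal-a phone=%q, deal-b phone=%q\n", a.Phone, b.Phone)
}
```

Because `Get` has no variant that omits the deal ID, a cross-deal lookup cannot be expressed by accident.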
+
+### Data Model
+
+```
+DealProfile {
+    id            TEXT PRIMARY KEY
+    deal_id       TEXT NOT NULL     -- FK to deal room
+    email         TEXT NOT NULL     -- the email used in THIS deal
+    first_name    TEXT
+    last_name     TEXT
+    title         TEXT
+    phone         TEXT              -- may be NULL in some deals
+    headshot_path TEXT              -- locally stored, per-deal copy
+    company_id    TEXT              -- FK to DealCompany (also deal-scoped)
+    linkedin_url  TEXT
+    source        TEXT              -- 'enrichment', 'manual', 'scrape', 'invite'
+    created_at    DATETIME
+    enriched_at   DATETIME
+
+    UNIQUE(deal_id, email)
+}
+```
+
+### Rules
+
+1. **No cross-deal profile lookups.** A query in Deal A never touches Deal B's profiles.
+2. **No global person table.** There is no `Person` entity — only `DealProfile`.
+3. **Headshots are copied per deal.** Even if the same headshot URL was scraped, each deal gets its own stored copy. If the source disappears, existing deals are unaffected.
+4. **Enrichment runs per deal.** If the same email appears in two deals, enrichment runs independently for each. Redundant but safe.
+5. **Deletion is deal-scoped.** Deleting a deal room deletes all its profiles. No orphans, no leaks.
+6. **Admin export includes only that deal's PII.** Audit logs, data exports, and compliance reports are deal-scoped.
+
+### Rationale
+
+In M&A, confidentiality isn't just about documents — it's about *who knows what about whom*. A buyer in Deal A should never learn that the same banker is also involved in Deal B, or gain contact details that were only shared in a different context. Three profiles for one person is not a bug. It's the feature.
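
Rule 3 and the Gravatar lookup from §3 can be sketched together. The Go snippet below is a minimal illustration under stated assumptions: the `headshots/<deal>/<hash>.jpg` layout, the function names, and hashing the email to form the file name are all hypothetical choices, and `dealID` is assumed to be a trusted internal identifier. Only the Gravatar URL shape comes from the spec.

```go
package main

import (
	"crypto/md5"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
	"strings"
)

// normalizeEmail trims whitespace and lowercases the address,
// which is the form Gravatar expects before hashing.
func normalizeEmail(email string) string {
	return strings.ToLower(strings.TrimSpace(email))
}

// gravatarURL builds the lookup URL from §3.
// d=404 makes a missing avatar return HTTP 404 instead of a default image.
func gravatarURL(email string) string {
	sum := md5.Sum([]byte(normalizeEmail(email)))
	return "https://gravatar.com/avatar/" + hex.EncodeToString(sum[:]) + "?d=404"
}

// headshotPath returns a per-deal storage path (hypothetical layout).
// The deal ID is part of the path, so two deals never share a file,
// and the email is hashed so the file name carries no readable PII.
func headshotPath(dealID, email string) string {
	sum := sha256.Sum256([]byte(normalizeEmail(email)))
	name := hex.EncodeToString(sum[:8]) + ".jpg"
	return path.Join("headshots", dealID, name)
}

func main() {
	email := " John.Smith@GS.com "
	fmt.Println(gravatarURL(email))
	// Same person, two deals: two independent stored copies.
	fmt.Println(headshotPath("deal-a", email))
	fmt.Println(headshotPath("deal-b", email))
}
```

Deleting a deal room then reduces to removing one directory, which lines up with rule 5.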