dealspace/docs/ENRICHMENT-SPEC.md

# Enrichment & Identity — Spec
> Status: Draft
> Author: Johan
> Date: 2026-03-15
## Overview
When a user logs in with just an email address, DealSpace should derive as much identity and company context as possible — eliminating manual data entry and creating an immediate sense of a populated, professional environment.
## 1. Email-Based Identity Resolution
### Flow
1. User enters email → receives 6-digit one-time code (OTP)
2. While user enters code, backend fires enrichment pipeline (concurrent goroutines):
   - Parse email → extract name parts + company domain
   - Domain scrape → company metadata
   - Enrichment API → person + company data
3. User lands in app with profile pre-populated
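
The concurrent fan-out in step 2 could be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: the `Step` type, `Enrich`, and the two stand-in steps are hypothetical names, and real steps would wrap the parser, the scraper, and the vendor client.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Step is one enrichment task. It returns the fields it found.
// The type and all names below are hypothetical, not part of the spec.
type Step func(ctx context.Context, email string) (map[string]string, error)

// Enrich runs every step in its own goroutine under a shared deadline and
// merges whatever succeeded. Failed or timed-out steps are skipped.
func Enrich(ctx context.Context, email string, timeout time.Duration, steps ...Step) map[string]string {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	results := make(chan map[string]string, len(steps)) // buffered: senders never block
	var wg sync.WaitGroup
	for _, step := range steps {
		wg.Add(1)
		go func(step Step) {
			defer wg.Done()
			if fields, err := step(ctx, email); err == nil {
				results <- fields
			}
		}(step)
	}
	wg.Wait() // returns promptly only if steps honor ctx cancellation
	close(results)

	merged := map[string]string{}
	for fields := range results {
		for k, v := range fields {
			merged[k] = v // steps are assumed to write disjoint keys
		}
	}
	return merged
}

// parseStep stands in for local email parsing: instant, no I/O.
func parseStep(_ context.Context, email string) (map[string]string, error) {
	return map[string]string{"email": email}, nil
}

// slowStep stands in for a vendor call that outlives the deadline.
func slowStep(ctx context.Context, _ string) (map[string]string, error) {
	select {
	case <-time.After(2 * time.Second):
		return map[string]string{"company": "Example Co"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// demo runs both steps with a 50ms budget.
func demo() map[string]string {
	return Enrich(context.Background(), "john.smith@gs.com", 50*time.Millisecond, parseStep, slowStep)
}

func main() {
	fmt.Println(demo())
}
```

The shared deadline matters because the user is typing the code while this runs: a slow vendor should cost a missing field, not a delayed login.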
### Email Parsing
| Pattern | First | Last |
|---------|-------|------|
| `john.smith@gs.com` | John | Smith |
| `jsmith@company.com` | J | Smith (flag as partial) |
| `john@company.com` | John | — |
| `john.smith.jr@company.com` | John | Smith Jr |
Personal domains (gmail, outlook, yahoo, proton, icloud, hey) → skip company enrichment, person-only.
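
A minimal parsing sketch for the dotted rows of the table and the personal-domain check. It does not split an undotted local part such as `jsmith` into initial plus surname, since telling `jsmith` from `john` needs a first-name list or enrichment data; such addresses are only flagged as partial here. The exact domain list, `ParsedEmail`, and `ParseEmail` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// ParsedEmail holds what can be derived from the address alone (hypothetical type).
type ParsedEmail struct {
	First, Last string
	Domain      string
	Partial     bool // name could not be fully split, e.g. "jsmith" or "john"
	Personal    bool // personal mailbox provider: skip company enrichment
}

// personalDomains is an assumed expansion of the provider names in the spec.
var personalDomains = map[string]bool{
	"gmail.com": true, "outlook.com": true, "yahoo.com": true,
	"proton.me": true, "protonmail.com": true, "icloud.com": true, "hey.com": true,
}

// capitalize upper-cases the first byte; it assumes ASCII local parts.
func capitalize(s string) string {
	if s == "" {
		return s
	}
	return strings.ToUpper(s[:1]) + s[1:]
}

// ParseEmail splits the local part on dots: first token is the first name,
// the remaining tokens are joined into the last name.
func ParseEmail(addr string) (ParsedEmail, error) {
	addr = strings.ToLower(strings.TrimSpace(addr))
	at := strings.LastIndex(addr, "@")
	if at <= 0 || at == len(addr)-1 {
		return ParsedEmail{}, fmt.Errorf("invalid email %q", addr)
	}
	local, domain := addr[:at], addr[at+1:]
	p := ParsedEmail{Domain: domain, Personal: personalDomains[domain]}

	parts := strings.FieldsFunc(local, func(r rune) bool { return r == '.' })
	switch len(parts) {
	case 0:
		return ParsedEmail{}, fmt.Errorf("empty local part in %q", addr)
	case 1:
		p.First, p.Partial = capitalize(parts[0]), true
	default:
		p.First = capitalize(parts[0])
		rest := make([]string, 0, len(parts)-1)
		for _, part := range parts[1:] {
			rest = append(rest, capitalize(part))
		}
		p.Last = strings.Join(rest, " ")
	}
	return p, nil
}

func main() {
	for _, addr := range []string{"john.smith@gs.com", "john.smith.jr@company.com", "john@gmail.com"} {
		p, err := ParseEmail(addr)
		fmt.Printf("%+v %v\n", p, err)
	}
}
```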
### Person Enrichment (API)
Primary: Apollo.io or Clearbit Enrichment API. Lookup by email returns:
- Full name, title, phone
- LinkedIn URL
- Headshot URL
- Company association (confirms domain mapping)
Fallback chain: Apollo → Clearbit → Hunter.io → parse-only.
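
The fallback chain can be modeled as an ordered list of providers behind one interface. The sketch below uses stub providers only; `Provider`, `Person`, and `LookupWithFallback` are hypothetical names, and no real vendor API call is shown or implied. When every provider misses, the caller falls back to parse-only data.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Person carries the fields listed above (hypothetical shape).
type Person struct {
	FullName, Title, Phone   string
	LinkedInURL, HeadshotURL string
	Source                   string // which provider answered
}

// Provider is an assumed abstraction over one enrichment vendor.
type Provider interface {
	Name() string
	Lookup(ctx context.Context, email string) (*Person, error)
}

// ErrNotFound signals that no provider in the chain had a match.
var ErrNotFound = errors.New("person not found")

// LookupWithFallback tries providers in order and returns the first hit.
// Any error (no match, quota, outage) moves on to the next provider.
func LookupWithFallback(ctx context.Context, email string, chain ...Provider) (*Person, error) {
	for _, p := range chain {
		if err := ctx.Err(); err != nil {
			return nil, err // deadline hit: stop instead of trying more vendors
		}
		person, err := p.Lookup(ctx, email)
		if err == nil && person != nil {
			person.Source = p.Name()
			return person, nil
		}
	}
	return nil, ErrNotFound
}

// stubProvider simulates a vendor; a nil person means "no match".
type stubProvider struct {
	name   string
	person *Person
}

func (s stubProvider) Name() string { return s.name }

func (s stubProvider) Lookup(_ context.Context, _ string) (*Person, error) {
	if s.person == nil {
		return nil, ErrNotFound
	}
	p := *s.person // copy so callers cannot mutate the stub
	return &p, nil
}

// demo wires the chain in the order given above, with only the last stub matching.
func demo() (*Person, error) {
	chain := []Provider{
		stubProvider{name: "apollo"},
		stubProvider{name: "clearbit"},
		stubProvider{name: "hunter", person: &Person{FullName: "John Smith"}},
	}
	return LookupWithFallback(context.Background(), "john.smith@gs.com", chain...)
}

func main() {
	p, err := demo()
	fmt.Println(p, err)
}
```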
## 2. Domain-Based Company Scraping
### What to scrape (single fetch of homepage)
- Company name (from `<title>`, OG tags, JSON-LD)
- Logo URL (OG image, JSON-LD logo, favicon fallback)
- Description / tagline
- Address, phone, fax
- Industry signals (from meta keywords, page content)
- Social links (LinkedIn, Twitter)
- Tech stack hints (optional, from script tags / headers)
### Deep scrape (if team page exists)
Follow `/about`, `/team`, `/our-team`, `/people` links from nav:
- Person name, title, photo URL, email, phone
- Bio, education, prior experience (from individual profile pages)
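
Choosing which nav links to follow might look like the sketch below. It assumes the `href` values have already been extracted from the homepage (for example with `goquery`, not shown) and keeps only same-host links whose path is one of the four candidates. `TeamLinks` and its `limit` parameter are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// teamPaths are the candidate paths named in the spec.
var teamPaths = map[string]bool{"/about": true, "/team": true, "/our-team": true, "/people": true}

// TeamLinks resolves nav hrefs against the homepage URL and returns same-host
// links whose path is a team-page candidate, in nav order, without duplicates.
// A limit <= 0 means no limit.
func TeamLinks(homepage string, hrefs []string, limit int) ([]string, error) {
	base, err := url.Parse(homepage)
	if err != nil {
		return nil, err
	}
	var out []string
	seen := map[string]bool{}
	for _, h := range hrefs {
		ref, err := url.Parse(strings.TrimSpace(h))
		if err != nil {
			continue // skip malformed hrefs
		}
		u := base.ResolveReference(ref)
		path := strings.ToLower(strings.TrimSuffix(u.Path, "/"))
		if u.Host != base.Host || !teamPaths[path] || seen[path] {
			continue
		}
		seen[path] = true
		u.Fragment, u.RawQuery = "", ""
		out = append(out, u.String())
		if len(out) == limit {
			break
		}
	}
	return out, nil
}

func main() {
	hrefs := []string{"/about/", "https://other.com/team", "/blog", "/about", "/people"}
	links, err := TeamLinks("https://example.com/", hrefs, 2)
	fmt.Println(links, err)
}
```

Restricting to the same host keeps the scraper from wandering onto third-party sites through nav links.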
### Implementation
Go + `goquery` for HTML parsing. No headless browser is needed, since most target sites are server-rendered WordPress or marketing pages. Timeout: 5s per fetch; max 3 pages per domain (homepage, team listing, and one profile page to detect the email pattern).
### Caching
Cache company data per domain in SQLite. TTL: 30 days. Headshot images: download and store locally (don't hotlink — external URLs go stale).
## 3. Headshot Strategy
### Source Priority
1. **Enrichment API** (Apollo/Clearbit) — highest quality, most reliable
2. **Gravatar**: `https://gravatar.com/avatar/{md5(email)}?d=404` (a 404 response means no image)
3. **Company website** — scraped from team page (see §2)
4. **LinkedIn** — if user linked their profile (manual, not scraped)
5. **Fallback** — generated initials avatar with company brand color (extracted from logo dominant color)
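
For sources 2 and 5, a small sketch of the Gravatar URL and the initials used by the fallback avatar. Gravatar hashes the trimmed, lowercased address, so the helper normalizes before hashing. `GravatarURL` and `Initials` are hypothetical names; the HTTP request and the 404 check are not shown.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// GravatarURL builds the lookup URL for an email address.
// d=404 asks Gravatar to answer 404 instead of a default image.
func GravatarURL(email string) string {
	normalized := strings.ToLower(strings.TrimSpace(email))
	sum := md5.Sum([]byte(normalized))
	return "https://gravatar.com/avatar/" + hex.EncodeToString(sum[:]) + "?d=404"
}

// Initials returns the upper-cased first letter of each non-empty name,
// or "?" when neither name is known.
func Initials(first, last string) string {
	var b strings.Builder
	for _, name := range []string{first, last} {
		name = strings.TrimSpace(name)
		if name == "" {
			continue
		}
		r, _ := utf8.DecodeRuneInString(name)
		b.WriteRune(unicode.ToUpper(r))
	}
	if b.Len() == 0 {
		return "?"
	}
	return b.String()
}

func main() {
	fmt.Println(GravatarURL("john.smith@gs.com"))
	fmt.Println(Initials("Craig", "Lawson"))
}
```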
### Storage
- Download and store all headshots locally at invite/enrichment time
- Serve from DealSpace CDN, never hotlink external URLs
- Standard size: 256×256px, JPEG, quality 85
- Thumbnail: 48×48px for inline use
### Display Locations (mini headshots throughout)
- **Deal room member list** — face next to name and role
- **Activity feed** — "{face} Craig Lawson viewed the CIM — 2m ago"
- **Document access log** — who opened what, when, with face
- **Comments / annotations** — face next to every note
- **Invite flow** — show headshots of discovered colleagues
- **Online presence** — 24px avatars in header: "3 people in this room"
- **Watermark metadata** — avatar optionally embedded in PDF watermark (v2)
## 4. Colleague Discovery & Invite
### Flow
After first user from a domain authenticates:
1. Query enrichment API for known people at same domain
2. Present: "We found N people at {Company}. Invite them to this deal room?"
3. Show as a list with headshots, name, title — checkboxes to select
4. Generate invite emails using detected email pattern from first user
5. One-click send
### Email Pattern Detection
From the authenticated user's email, derive the pattern:
- `john.smith@hpc.com` → `{first}.{last}@hpc.com`
- `jsmith@hpc.com` → `{first_initial}{last}@hpc.com`
- `john@hpc.com` → `{first}@hpc.com`
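Apply the pattern to generate emails for discovered colleagues. Optional: SMTP `RCPT TO` verification before sending. Treat the result as a hint only: catch-all domains accept every recipient, and some servers reject or rate-limit such probes.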
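
Pattern detection and application could be sketched like this, assuming the authenticated user's first and last name are already known from enrichment. `Pattern`, `DetectPattern`, and `Apply` are hypothetical names, and the code assumes ASCII names without spaces or hyphens.

```go
package main

import (
	"fmt"
	"strings"
)

// Pattern is the local-part template for a domain.
type Pattern string

const (
	FirstDotLast Pattern = "{first}.{last}"
	InitialLast  Pattern = "{first_initial}{last}"
	FirstOnly    Pattern = "{first}"
	Unknown      Pattern = ""
)

// DetectPattern compares the local part of a known user's email with
// candidates built from that user's name. It returns the pattern and domain.
func DetectPattern(email, first, last string) (Pattern, string) {
	local, domain, ok := strings.Cut(strings.ToLower(strings.TrimSpace(email)), "@")
	if !ok {
		return Unknown, ""
	}
	f, l := strings.ToLower(first), strings.ToLower(last)
	switch {
	case f != "" && l != "" && local == f+"."+l:
		return FirstDotLast, domain
	case f != "" && l != "" && local == f[:1]+l:
		return InitialLast, domain
	case f != "" && local == f:
		return FirstOnly, domain
	}
	return Unknown, domain
}

// Apply fills the pattern with a colleague's name. It reports false when the
// pattern is unknown or a required name part is missing.
func Apply(p Pattern, domain, first, last string) (string, bool) {
	f, l := strings.ToLower(first), strings.ToLower(last)
	if p == Unknown || domain == "" || f == "" || (p != FirstOnly && l == "") {
		return "", false
	}
	r := strings.NewReplacer("{first_initial}", f[:1], "{first}", f, "{last}", l)
	return r.Replace(string(p)) + "@" + domain, true
}

func main() {
	p, domain := DetectPattern("jsmith@hpc.com", "John", "Smith")
	addr, ok := Apply(p, domain, "Craig", "Lawson")
	fmt.Println(p, addr, ok)
}
```

Generated addresses are guesses, so they should stay unsent until the inviting user has reviewed the list.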
### Privacy Considerations
- Only show colleague suggestions to users whose role ends in `_admin` (e.g. `ib_admin`, `seller_admin`)
- Never expose enrichment data to users outside the deal room
- Allow users to dismiss / hide colleague suggestions
- Enrichment data is deal-room-scoped, not global
## 5. Company Card Auto-Population
When a company domain enters the system (via user email or manual entry), auto-generate a company card:
| Field | Source |
|-------|--------|
| Name | HTML title / OG / JSON-LD |
| Logo | OG image / favicon |
| Description | Meta description / JSON-LD |
| Address | Page scrape / JSON-LD |
| Phone | Page scrape |
| Industry | Enrichment API / page signals |
| Website | The domain itself |
| Team size | Enrichment API |
| Key people | Team page scrape |
| LinkedIn | Social links from page |
For bulge bracket firms (Goldman, JPM, Morgan Stanley, etc.) — maintain a static seed database of ~50 major firms with pre-filled cards. Don't scrape; just look up.
## 6. Spreadsheet Anonymization
### Use Case
Seller uploads financial model or data room spreadsheet. Buyer-side viewers should see anonymized company names / counterparty names.
### Approach
- Go + `excelize` library (MIT, actively maintained)
- Consistent deterministic mapping across all tabs: "Acme Corp" → "Target Alpha" everywhere
- Mapping stored per deal room, reversible by deal room owner (`ib_admin`, `seller_admin`)
- Preserve formulas — VLOOKUPs resolve correctly because mapping is consistent
- User-configurable: tag which columns/fields to anonymize (or auto-suggest from headers like "Company", "Name", "Contact", "Counterparty")
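
The consistent mapping is the core of the approach and can be sketched without any spreadsheet I/O. The example assigns aliases in first-seen order and keeps a reverse map for the deal room owner. `Anonymizer`, the Greek-letter labels beyond "Alpha", and the name normalization are assumptions; reading and writing cells with `excelize` is not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// greek supplies alias suffixes; the list and overflow scheme are assumptions.
var greek = []string{"Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta", "Eta", "Theta"}

// Anonymizer maps real names to stable aliases within one deal room.
type Anonymizer struct {
	prefix  string
	forward map[string]string // normalized real name -> alias
	reverse map[string]string // alias -> first-seen real name
}

func NewAnonymizer(prefix string) *Anonymizer {
	return &Anonymizer{prefix: prefix, forward: map[string]string{}, reverse: map[string]string{}}
}

// normalize lower-cases and collapses whitespace so "Acme  Corp" and
// "acme corp" share one alias.
func normalize(name string) string {
	return strings.Join(strings.Fields(strings.ToLower(name)), " ")
}

// label returns Alpha, Beta, ... and then "Alpha 2", "Beta 2", ...
func label(n int) string {
	if n < len(greek) {
		return greek[n]
	}
	return fmt.Sprintf("%s %d", greek[n%len(greek)], n/len(greek)+1)
}

// Alias returns the existing alias for name or assigns the next free one.
func (a *Anonymizer) Alias(name string) string {
	key := normalize(name)
	if key == "" {
		return name // nothing to anonymize
	}
	if alias, ok := a.forward[key]; ok {
		return alias
	}
	alias := a.prefix + " " + label(len(a.forward))
	a.forward[key] = alias
	a.reverse[alias] = strings.TrimSpace(name)
	return alias
}

// Reveal maps an alias back to the real name (owner-only in the spec).
func (a *Anonymizer) Reveal(alias string) (string, bool) {
	name, ok := a.reverse[alias]
	return name, ok
}

func main() {
	a := NewAnonymizer("Target")
	for _, cell := range []string{"Acme Corp", "Globex", "acme corp"} {
		fmt.Printf("%q -> %q\n", cell, a.Alias(cell))
	}
}
```

Because aliases depend on first-seen order, the mapping has to be persisted with the deal room rather than recomputed on each upload.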
### Limitations
- Formula-embedded string literals (e.g., `=IF(A1="Acme Corp",...)`) require formula string parsing — defer to v2
- Conditional formatting rules referencing text values — defer to v2
## 7. API / Enrichment Vendor Evaluation
| Vendor | Person | Company | Email verify | Price |
|--------|--------|---------|-------------|-------|
| Apollo.io | ✓ | ✓ | ✓ | Free tier: 50/mo, paid from $49/mo |
| Clearbit (HubSpot) | ✓ | ✓ | — | Enterprise pricing |
| Hunter.io | partial | — | ✓ | Free tier: 25/mo, paid from $49/mo |
| PDL (People Data Labs) | ✓ | ✓ | — | Pay-per-record |
| Gravatar | photo only | — | — | Free |
Recommendation: Start with Apollo free tier for MVP. Evaluate PDL for scale.
## 8. Implementation Phases
**Phase 1 (MVP):** Email parsing + Gravatar headshot + initials fallback. Zero external API dependency.
**Phase 2:** Domain homepage scrape → company card auto-population. Still zero paid API.
**Phase 3:** Apollo/Clearbit integration → full person enrichment + colleague discovery.
**Phase 4:** Team page deep scrape + spreadsheet anonymization.
## 9. Profile Isolation — Deal-Scoped Identity
### Core Principle
**Profiles are scoped to the deal room, not global.** The same physical person can have three separate profiles across three deals, each with different levels of PII. This is by design.
### Why
- Deal A: someone's cell phone was shared in a conversation → their profile has it
- Deal B: same person, but only their work email is known → no cell
- Deal C: same person joined via a different email entirely → different enrichment data
We do NOT merge these. We do NOT leak PII from Deal A into Deal B. Each deal room is a self-contained universe of identity.
### Data Model
```
CREATE TABLE DealProfile (
    id            TEXT PRIMARY KEY,
    deal_id       TEXT NOT NULL,   -- FK to deal room
    email         TEXT NOT NULL,   -- the email used in THIS deal
    first_name    TEXT,
    last_name     TEXT,
    title         TEXT,
    phone         TEXT,            -- may be NULL in some deals
    headshot_path TEXT,            -- locally stored, per-deal copy
    company_id    TEXT,            -- FK to DealCompany (also deal-scoped)
    linkedin_url  TEXT,
    source        TEXT,            -- 'enrichment', 'manual', 'scrape', 'invite'
    created_at    DATETIME,
    enriched_at   DATETIME,
    UNIQUE (deal_id, email)
);
```
### Rules
1. **No cross-deal profile lookups.** A query in Deal A never touches Deal B's profiles.
2. **No global person table.** There is no `Person` entity — only `DealProfile`.
3. **Headshots are copied per deal.** Even if the same headshot URL was scraped, each deal gets its own stored copy. If the source disappears, existing deals are unaffected.
4. **Enrichment runs per deal.** If the same email appears in two deals, enrichment runs independently for each. Redundant but safe.
5. **Deletion is deal-scoped.** Deleting a deal room deletes all its profiles. No orphans, no leaks.
6. **Admin export includes only that deal's PII.** Audit logs, data exports, and compliance reports are deal-scoped.
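
Rules 1, 2, and 5 can be enforced at the API level by never offering a lookup that omits the deal ID. The in-memory sketch below illustrates the shape only; `Store` and its methods are hypothetical, and a real implementation would sit on the SQLite table above.

```go
package main

import "fmt"

// DealProfile mirrors a subset of the data model for illustration.
type DealProfile struct {
	DealID, Email              string
	FirstName, LastName, Phone string
}

// profileKey makes (deal_id, email) the only way to address a profile.
type profileKey struct{ dealID, email string }

// Store deliberately has no method that searches by email across deals.
type Store struct{ profiles map[profileKey]DealProfile }

func NewStore() *Store { return &Store{profiles: map[profileKey]DealProfile{}} }

// Put inserts or replaces the profile for its own deal only.
func (s *Store) Put(p DealProfile) { s.profiles[profileKey{p.DealID, p.Email}] = p }

// Get requires the deal ID, so a query in one deal cannot see another.
func (s *Store) Get(dealID, email string) (DealProfile, bool) {
	p, ok := s.profiles[profileKey{dealID, email}]
	return p, ok
}

// DeleteDeal removes every profile in the deal and returns the count.
func (s *Store) DeleteDeal(dealID string) int {
	n := 0
	for k := range s.profiles {
		if k.dealID == dealID {
			delete(s.profiles, k) // deleting during range is safe in Go
			n++
		}
	}
	return n
}

func main() {
	s := NewStore()
	s.Put(DealProfile{DealID: "deal-a", Email: "john.smith@gs.com", Phone: "+1 555 0100"})
	s.Put(DealProfile{DealID: "deal-b", Email: "john.smith@gs.com"})

	b, _ := s.Get("deal-b", "john.smith@gs.com")
	fmt.Printf("deal-b phone: %q\n", b.Phone) // empty: nothing leaks from deal-a
	fmt.Println("deleted from deal-a:", s.DeleteDeal("deal-a"))
}
```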
### Rationale
In M&A, confidentiality isn't just about documents — it's about *who knows what about whom*. A buyer in Deal A should never learn that the same banker is also involved in Deal B, or gain contact details that were only shared in a different context. Three profiles for one person is not a bug. It's the feature.