chore: auto-commit uncommitted changes

James 2026-03-16 00:01:27 -04:00
parent 379a79bc9d
commit bfade7a86f
1 changed files with 231 additions and 0 deletions

docs/ENRICHMENT-SPEC.md Normal file

@@ -0,0 +1,231 @@
# Enrichment & Identity — Spec
> Status: Draft
> Author: Johan
> Date: 2026-03-15
## Overview
When a user logs in with just an email address, DealSpace should derive as much identity and company context as possible — eliminating manual data entry and creating an immediate sense of a populated, professional environment.
## 1. Email-Based Identity Resolution
### Flow
1. User enters email → receives a 6-digit one-time code (OTP) by email
2. While user enters code, backend fires enrichment pipeline (concurrent goroutines):
- Parse email → extract name parts + company domain
- Domain scrape → company metadata
- Enrichment API → person + company data
3. User lands in app with profile pre-populated
### Email Parsing
| Pattern | First | Last |
|---------|-------|------|
| `john.smith@gs.com` | John | Smith |
| `jsmith@company.com` | J | Smith (flag as partial) |
| `john@company.com` | John | — |
| `john.smith.jr@company.com` | John | Smith Jr |
Personal domains (gmail, outlook, yahoo, proton, icloud, hey) → skip company enrichment, person-only.
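A minimal sketch of the parsing table, assuming only dots separate name parts. The spec does not say how to tell `jsmith` from `john`, so this sketch uses a hypothetical `knownFirstNames` set as the heuristic. The exact personal-domain strings are also assumptions.

```go
package main

import "strings"

// personalDomains: the spec names the providers; the exact domains are assumed.
var personalDomains = map[string]bool{
	"gmail.com": true, "outlook.com": true, "yahoo.com": true,
	"proton.me": true, "icloud.com": true, "hey.com": true,
}

// knownFirstNames is a hypothetical stand-in for a real first-name dictionary.
var knownFirstNames = map[string]bool{"john": true, "jane": true, "craig": true}

// ParsedEmail holds what can be derived from the address alone.
type ParsedEmail struct {
	First, Last, Domain string
	Partial             bool // only an initial was recovered for the first name
	Personal            bool // personal mail provider: skip company enrichment
}

func capitalize(s string) string {
	r := []rune(s)
	if len(r) == 0 {
		return s
	}
	return strings.ToUpper(string(r[0])) + string(r[1:])
}

// ParseEmail returns false when the address has no usable local part or domain.
func ParseEmail(email string) (ParsedEmail, bool) {
	email = strings.ToLower(strings.TrimSpace(email))
	at := strings.LastIndex(email, "@")
	if at <= 0 || at == len(email)-1 {
		return ParsedEmail{}, false
	}
	local, domain := email[:at], email[at+1:]
	p := ParsedEmail{Domain: domain, Personal: personalDomains[domain]}

	var parts []string
	for _, t := range strings.Split(local, ".") {
		if t != "" {
			parts = append(parts, t)
		}
	}
	switch {
	case len(parts) == 0:
		return ParsedEmail{}, false
	case len(parts) >= 2: // john.smith or john.smith.jr
		p.First = capitalize(parts[0])
		for i, t := range parts[1:] {
			if i > 0 {
				p.Last += " "
			}
			p.Last += capitalize(t)
		}
	case knownFirstNames[parts[0]]: // john
		p.First = capitalize(parts[0])
	default: // jsmith: treat as initial + last name and flag it
		r := []rune(parts[0])
		p.First = strings.ToUpper(string(r[0]))
		p.Last = capitalize(string(r[1:]))
		p.Partial = true
	}
	return p, true
}
```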
### Person Enrichment (API)
Primary: Apollo.io or Clearbit Enrichment API. Lookup by email returns:
- Full name, title, phone
- LinkedIn URL
- Headshot URL
- Company association (confirms domain mapping)
Fallback chain: Apollo → Clearbit → Hunter.io → parse-only.
## 2. Domain-Based Company Scraping
### What to scrape (single fetch of homepage)
- Company name (from `<title>`, OG tags, JSON-LD)
- Logo URL (OG image, JSON-LD logo, favicon fallback)
- Description / tagline
- Address, phone, fax
- Industry signals (from meta keywords, page content)
- Social links (LinkedIn, Twitter)
- Tech stack hints (optional, from script tags / headers)
### Deep scrape (if team page exists)
Follow `/about`, `/team`, `/our-team`, `/people` links from nav:
- Person name, title, photo URL, email, phone
- Bio, education, prior experience (from individual profile pages)
### Implementation
Go + `goquery` for HTML parsing. No headless browser is needed for the typical case: most target sites are server-rendered WordPress/marketing pages. Timeout: 5s per fetch, max 3 pages per domain (homepage + team listing + one profile to detect email pattern).
### Caching
Cache company data per domain in SQLite. TTL: 30 days. Headshot images: download and store locally (don't hotlink — external URLs go stale).
## 3. Headshot Strategy
### Source Priority
1. **Enrichment API** (Apollo/Clearbit) — highest quality, most reliable
2. **Gravatar**: `https://gravatar.com/avatar/{md5(email)}?d=404`, where the email is trimmed and lowercased before hashing (a 404 response means no image)
3. **Company website** — scraped from team page (see §2)
4. **LinkedIn** — if user linked their profile (manual, not scraped)
5. **Fallback** — generated initials avatar with company brand color (extracted from logo dominant color)
### Storage
- Download and store all headshots locally at invite/enrichment time
- Serve from DealSpace CDN, never hotlink external URLs
- Standard size: 256×256px, JPEG, quality 85
- Thumbnail: 48×48px for inline use
### Display Locations (mini headshots throughout)
- **Deal room member list** — face next to name and role
- **Activity feed** — "{face} Craig Lawson viewed the CIM — 2m ago"
- **Document access log** — who opened what, when, with face
- **Comments / annotations** — face next to every note
- **Invite flow** — show headshots of discovered colleagues
- **Online presence** — 24px avatars in header: "3 people in this room"
- **Watermark metadata** — avatar optionally embedded in PDF watermark (v2)
## 4. Colleague Discovery & Invite
### Flow
After first user from a domain authenticates:
1. Query enrichment API for known people at same domain
2. Present: "We found N people at {Company}. Invite them to this deal room?"
3. Show as a list with headshots, name, title — checkboxes to select
4. Generate invite emails using detected email pattern from first user
5. One-click send
### Email Pattern Detection
From the authenticated user's email, derive the pattern:
- `john.smith@hpc.com` → `{first}.{last}@hpc.com`
- `jsmith@hpc.com` → `{first_initial}{last}@hpc.com`
- `john@hpc.com` → `{first}@hpc.com`
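Apply the detected pattern to generate emails for discovered colleagues. Optional: verify each address with an SMTP `RCPT TO` probe before sending. Treat the result as a hint only, since catch-all domains accept every recipient and some servers reject or rate-limit such probes.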
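Pattern detection and application might be sketched as below. The sketch assumes the authenticated user's first and last name are already known (for example from enrichment), since the local part alone is ambiguous. The `Pattern` type and function names are hypothetical.

```go
package main

import "strings"

// Pattern is a local-part template using the placeholders from the list above.
type Pattern string

const (
	FirstDotLast     Pattern = "{first}.{last}"
	FirstInitialLast Pattern = "{first_initial}{last}"
	FirstOnly        Pattern = "{first}"
)

// Apply renders the local part for a person, or false if a needed name is missing.
func (p Pattern) Apply(first, last string) (string, bool) {
	first = strings.ToLower(strings.TrimSpace(first))
	last = strings.ToLower(strings.TrimSpace(last))
	if first == "" || (p != FirstOnly && last == "") {
		return "", false
	}
	initial := string([]rune(first)[:1])
	r := strings.NewReplacer("{first_initial}", initial, "{first}", first, "{last}", last)
	return r.Replace(string(p)), true
}

// DetectPattern finds which known pattern reproduces the user's own address.
// It returns the pattern and the domain, or false if none matches.
func DetectPattern(email, first, last string) (Pattern, string, bool) {
	local, domain, found := strings.Cut(strings.ToLower(strings.TrimSpace(email)), "@")
	if !found || local == "" || domain == "" {
		return "", "", false
	}
	for _, p := range []Pattern{FirstDotLast, FirstInitialLast, FirstOnly} {
		if candidate, ok := p.Apply(first, last); ok && candidate == local {
			return p, domain, true
		}
	}
	return "", "", false
}

// ColleagueEmail applies a detected pattern to a discovered colleague.
func ColleagueEmail(p Pattern, domain, first, last string) (string, bool) {
	local, ok := p.Apply(first, last)
	if !ok {
		return "", false
	}
	return local + "@" + domain, true
}
```

Generated addresses are guesses until a delivery or verification step confirms them, so the invite flow should tolerate bounces.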
### Privacy Considerations
- Only show colleague suggestions to users whose role ends in `_admin` (e.g., `ib_admin`, `seller_admin`)
- Never expose enrichment data to users outside the deal room
- Allow users to dismiss / hide colleague suggestions
- Enrichment data is deal-room-scoped, not global
## 5. Company Card Auto-Population
When a company domain enters the system (via user email or manual entry), auto-generate a company card:
| Field | Source |
|-------|--------|
| Name | HTML title / OG / JSON-LD |
| Logo | OG image / favicon |
| Description | Meta description / JSON-LD |
| Address | Page scrape / JSON-LD |
| Phone | Page scrape |
| Industry | Enrichment API / page signals |
| Website | The domain itself |
| Team size | Enrichment API |
| Key people | Team page scrape |
| LinkedIn | Social links from page |
For bulge bracket firms (Goldman, JPM, Morgan Stanley, etc.), maintain a static seed database of ~50 major firms with pre-filled cards. Don't scrape these; just look them up.
## 6. Spreadsheet Anonymization
### Use Case
Seller uploads financial model or data room spreadsheet. Buyer-side viewers should see anonymized company names / counterparty names.
### Approach
- Go + `excelize` library (MIT, actively maintained)
- Consistent deterministic mapping across all tabs: "Acme Corp" → "Target Alpha" everywhere
- Mapping stored per deal room, reversible by deal room owner (`ib_admin`, `seller_admin`)
- Preserve formulas — VLOOKUPs resolve correctly because mapping is consistent
- User-configurable: tag which columns/fields to anonymize (or auto-suggest from headers like "Company", "Name", "Contact", "Counterparty")
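The consistent mapping described above could be sketched as follows, independent of the spreadsheet library. Reading and writing cells through `excelize` is left out. The `Mapper` type, the Greek-letter alias scheme beyond "Target Alpha", and the case-insensitive matching are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

var greek = []string{"Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta", "Eta", "Theta"}

// Mapper assigns one stable alias per real name. It would be persisted per
// deal room so every tab and every re-export sees the same mapping.
type Mapper struct {
	forward map[string]string // normalized real name -> alias
	reverse map[string]string // alias -> real name as first seen
}

func NewMapper() *Mapper {
	return &Mapper{forward: map[string]string{}, reverse: map[string]string{}}
}

// Alias returns the existing alias for name or allocates the next one.
func (m *Mapper) Alias(name string) string {
	key := strings.ToLower(strings.TrimSpace(name))
	if alias, ok := m.forward[key]; ok {
		return alias
	}
	i := len(m.forward)
	alias := "Target " + greek[i%len(greek)]
	if round := i / len(greek); round > 0 {
		alias = fmt.Sprintf("%s %d", alias, round+1) // e.g. "Target Alpha 2"
	}
	m.forward[key] = alias
	m.reverse[alias] = strings.TrimSpace(name)
	return alias
}

// Reveal reverses an alias; access would be limited to deal room owners.
func (m *Mapper) Reveal(alias string) (string, bool) {
	name, ok := m.reverse[alias]
	return name, ok
}

// AnonymizeRows returns a copy of rows with tagged columns replaced by
// aliases. Row 0 is assumed to be the header and is left untouched.
func (m *Mapper) AnonymizeRows(rows [][]string, tagged map[int]bool) [][]string {
	out := make([][]string, len(rows))
	for r, row := range rows {
		out[r] = append([]string(nil), row...)
		if r == 0 {
			continue
		}
		for c, cell := range row {
			if tagged[c] && strings.TrimSpace(cell) != "" {
				out[r][c] = m.Alias(cell)
			}
		}
	}
	return out
}
```

Only plain string cells are rewritten here, so lookups that match on those cell values keep resolving as long as both sides of the lookup pass through the same `Mapper`.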
### Limitations
- Formula-embedded string literals (e.g., `=IF(A1="Acme Corp",...)`) require formula string parsing — defer to v2
- Conditional formatting rules referencing text values — defer to v2
## 7. API / Enrichment Vendor Evaluation
| Vendor | Person | Company | Email verify | Price |
|--------|--------|---------|-------------|-------|
| Apollo.io | ✓ | ✓ | ✓ | Free tier: 50/mo, paid from $49/mo |
| Clearbit (HubSpot) | ✓ | ✓ | — | Enterprise pricing |
| Hunter.io | partial | — | ✓ | Free tier: 25/mo, paid from $49/mo |
| PDL (People Data Labs) | ✓ | ✓ | — | Pay-per-record |
| Gravatar | photo only | — | — | Free |
Recommendation: Start with Apollo free tier for MVP. Evaluate PDL for scale.
## 8. Implementation Phases
**Phase 1 (MVP):** Email parsing + Gravatar headshot + initials fallback. Zero external API dependency.
**Phase 2:** Domain homepage scrape → company card auto-population. Still zero paid API.
**Phase 3:** Apollo/Clearbit integration → full person enrichment + colleague discovery.
**Phase 4:** Team page deep scrape + spreadsheet anonymization.
## 9. Profile Isolation — Deal-Scoped Identity
### Core Principle
**Profiles are scoped to the deal room, not global.** The same physical person can have three separate profiles across three deals, each with different levels of PII. This is by design.
### Why
- Deal A: someone's cell phone was shared in a conversation → their profile has it
- Deal B: same person, but only their work email is known → no cell
- Deal C: same person joined via a different email entirely → different enrichment data
We do NOT merge these. We do NOT leak PII from Deal A into Deal B. Each deal room is a self-contained universe of identity.
### Data Model
```
CREATE TABLE DealProfile (
    id            TEXT PRIMARY KEY,
    deal_id       TEXT NOT NULL,  -- FK to deal room
    email         TEXT NOT NULL,  -- the email used in THIS deal
    first_name    TEXT,
    last_name     TEXT,
    title         TEXT,
    phone         TEXT,           -- may be NULL in some deals
    headshot_path TEXT,           -- locally stored, per-deal copy
    company_id    TEXT,           -- FK to DealCompany (also deal-scoped)
    linkedin_url  TEXT,
    source        TEXT,           -- 'enrichment', 'manual', 'scrape', 'invite'
    created_at    DATETIME,
    enriched_at   DATETIME,
    UNIQUE (deal_id, email)
);
```
### Rules
1. **No cross-deal profile lookups.** A query in Deal A never touches Deal B's profiles.
2. **No global person table.** There is no `Person` entity — only `DealProfile`.
3. **Headshots are copied per deal.** Even if the same headshot URL was scraped, each deal gets its own stored copy. If the source disappears, existing deals are unaffected.
4. **Enrichment runs per deal.** If the same email appears in two deals, enrichment runs independently for each. Redundant but safe.
5. **Deletion is deal-scoped.** Deleting a deal room deletes all its profiles. No orphans, no leaks.
6. **Admin export includes only that deal's PII.** Audit logs, data exports, and compliance reports are deal-scoped.
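One way to make the rules above hard to violate is an access layer in which every operation requires a deal ID. The in-memory store below is a hypothetical stand-in for the SQLite-backed version and shows only the shape of the API. All names are illustrative.

```go
package main

import "sync"

// DealProfile mirrors a subset of the data model's columns.
type DealProfile struct {
	DealID, Email, FirstName, LastName, Phone string
}

// profileKey matches the UNIQUE(deal_id, email) constraint.
type profileKey struct{ dealID, email string }

// ProfileStore exposes no method that looks a person up by email alone,
// so a query in one deal cannot reach another deal's profiles.
type ProfileStore struct {
	mu       sync.RWMutex
	profiles map[profileKey]DealProfile
}

func NewProfileStore() *ProfileStore {
	return &ProfileStore{profiles: make(map[profileKey]DealProfile)}
}

// Upsert stores the profile under its own deal only.
func (s *ProfileStore) Upsert(p DealProfile) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.profiles[profileKey{p.DealID, p.Email}] = p
}

// Get requires the deal ID; there is deliberately no cross-deal variant.
func (s *ProfileStore) Get(dealID, email string) (DealProfile, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	p, ok := s.profiles[profileKey{dealID, email}]
	return p, ok
}

// DeleteDeal removes every profile in the deal and reports how many were removed.
func (s *ProfileStore) DeleteDeal(dealID string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	removed := 0
	for k := range s.profiles {
		if k.dealID == dealID {
			delete(s.profiles, k)
			removed++
		}
	}
	return removed
}
```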
### Rationale
In M&A, confidentiality isn't just about documents — it's about *who knows what about whom*. A buyer in Deal A should never learn that the same banker is also involved in Deal B, or gain contact details that were only shared in a different context. Three profiles for one person is not a bug. It's the feature.