chore: auto-commit uncommitted changes (commit bfade7a86f, parent 379a79bc9d)

# Enrichment & Identity: Spec

> Status: Draft
> Author: Johan
> Date: 2026-03-15

## Overview

When a user logs in with just an email address, DealSpace should derive as much identity and company context as possible. This eliminates manual data entry and makes the environment feel populated and professional from the first session.

## 1. Email-Based Identity Resolution

### Flow

1. User enters email → receives 6-digit TOTP code
2. While user enters code, backend fires enrichment pipeline (concurrent goroutines):
   - Parse email → extract name parts + company domain
   - Domain scrape → company metadata
   - Enrichment API → person + company data
3. User lands in app with profile pre-populated

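The concurrent fan-out in step 2 could be sketched roughly as follows. This is an illustration under assumptions, not the DealSpace implementation: the `step` type, `enrich`, the stub steps, and the overall 5-second deadline are all hypothetical names and values chosen for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"sync"
	"time"
)

// step is one enrichment task: parse, scrape, or API lookup (hypothetical type).
type step func(ctx context.Context, email string) (map[string]string, error)

// enrich runs all steps concurrently and merges whatever succeeds.
// A failing step is skipped so that login never blocks on enrichment.
func enrich(ctx context.Context, email string, steps ...step) map[string]string {
	// Assumed overall deadline; steps must honor ctx for it to take effect.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged = make(map[string]string)
	)
	for _, s := range steps {
		wg.Add(1)
		go func(s step) {
			defer wg.Done()
			fields, err := s(ctx, email)
			if err != nil {
				return // best effort: drop this source
			}
			mu.Lock()
			defer mu.Unlock()
			for k, v := range fields {
				merged[k] = v
			}
		}(s)
	}
	wg.Wait()
	return merged
}

// parseStep is a stand-in for the email parser.
func parseStep(_ context.Context, email string) (map[string]string, error) {
	_, domain, ok := strings.Cut(email, "@")
	if !ok {
		return nil, errors.New("not an email address")
	}
	return map[string]string{"domain": domain}, nil
}

// failingStep simulates an enrichment API that is down.
func failingStep(_ context.Context, _ string) (map[string]string, error) {
	return nil, errors.New("vendor unavailable")
}

// enrichLogin wires the stub steps together for the demo below.
func enrichLogin(email string) map[string]string {
	return enrich(context.Background(), email, parseStep, failingStep)
}

func main() {
	fmt.Println(enrichLogin("john.smith@gs.com")) // map[domain:gs.com]
}
```

When two steps return the same key, the last writer wins in this sketch; a real pipeline would probably want an explicit source priority per field.
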
### Email Parsing

| Pattern | First | Last | Notes |
|---------|-------|------|-------|
| `john.smith@gs.com` | John | Smith | |
| `jsmith@company.com` | J | Smith | Flag as partial (initial only) |
| `john@company.com` | John | (none) | |
| `john.smith.jr@company.com` | John | Smith Jr | |

For personal email domains (gmail, outlook, yahoo, proton, icloud, hey), skip company enrichment and run person-only enrichment.

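A minimal sketch of the parsing rules in the table, under stated assumptions. The spec does not say how to tell `jsmith` (initial + last name) from `john` (first name only), so this example uses a tiny hypothetical `knownFirstNames` list as a stand-in for a real name dictionary. Matching personal providers on the first label of the domain is also an assumption, as are all type and function names.

```go
package main

import (
	"fmt"
	"strings"
)

// ParsedEmail holds what can be derived from the address alone.
type ParsedEmail struct {
	First, Last string
	Domain      string
	Partial     bool // true when only an initial could be derived
	Personal    bool // true for consumer mail providers
}

// Providers listed in the spec, matched on the first domain label (assumption).
var personalProviders = map[string]bool{
	"gmail": true, "outlook": true, "yahoo": true,
	"proton": true, "icloud": true, "hey": true,
}

// knownFirstNames is a tiny hypothetical stand-in for a real name list.
var knownFirstNames = map[string]bool{"john": true, "jane": true, "craig": true}

// title upper-cases the first byte; adequate for ASCII local parts only.
func title(s string) string {
	if s == "" {
		return s
	}
	return strings.ToUpper(s[:1]) + s[1:]
}

// parseEmail applies the table's rules; ok is false for malformed input.
func parseEmail(addr string) (ParsedEmail, bool) {
	local, domain, ok := strings.Cut(strings.ToLower(strings.TrimSpace(addr)), "@")
	if !ok || domain == "" {
		return ParsedEmail{}, false
	}
	provider, _, _ := strings.Cut(domain, ".")
	p := ParsedEmail{Domain: domain, Personal: personalProviders[provider]}

	parts := strings.FieldsFunc(local, func(r rune) bool { return r == '.' })
	switch {
	case len(parts) == 0:
		return ParsedEmail{}, false
	case len(parts) >= 2: // john.smith, john.smith.jr
		p.First = title(parts[0])
		rest := parts[1:]
		for i := range rest {
			rest[i] = title(rest[i])
		}
		p.Last = strings.Join(rest, " ")
	case knownFirstNames[parts[0]] || len(parts[0]) < 2: // john
		p.First = title(parts[0])
	default: // jsmith: initial + last name, flagged as partial
		p.First = title(parts[0][:1])
		p.Last = title(parts[0][1:])
		p.Partial = true
	}
	return p, true
}

func main() {
	for _, addr := range []string{"john.smith@gs.com", "jsmith@company.com", "john@gmail.com"} {
		p, ok := parseEmail(addr)
		fmt.Printf("%s ok=%v %+v\n", addr, ok, p)
	}
}
```
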
### Person Enrichment (API)

Primary: Apollo.io or Clearbit Enrichment API. Lookup by email returns:

- Full name, title, phone
- LinkedIn URL
- Headshot URL
- Company association (confirms domain mapping)

Fallback chain: Apollo → Clearbit → Hunter.io → parse-only.

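One way to express the fallback chain is an ordered list of providers tried until one returns a match. The sketch below uses stub lookups rather than real vendor clients; `PersonData`, `provider`, and `firstMatch` are hypothetical names, and no actual Apollo, Clearbit, or Hunter.io API call is shown.

```go
package main

import (
	"errors"
	"fmt"
)

// PersonData is the subset of enrichment fields used in this example.
type PersonData struct {
	Name, Title string
	Source      string // which link of the chain produced the data
}

// provider pairs a vendor name with its lookup function.
type provider struct {
	name   string
	lookup func(email string) (PersonData, error)
}

var errNotFound = errors.New("no match")

// firstMatch walks the chain in order and returns the first hit.
// If every provider fails, the caller is left with parse-only data.
func firstMatch(email string, chain []provider) PersonData {
	for _, p := range chain {
		data, err := p.lookup(email)
		if err == nil {
			data.Source = p.name
			return data
		}
	}
	return PersonData{Source: "parse-only"}
}

// demoLookup uses stubs: the first vendor misses, the second one hits.
func demoLookup(email string) PersonData {
	chain := []provider{
		{"apollo", func(string) (PersonData, error) { return PersonData{}, errNotFound }},
		{"clearbit", func(string) (PersonData, error) {
			return PersonData{Name: "John Smith", Title: "Analyst"}, nil
		}},
		{"hunter", func(string) (PersonData, error) { return PersonData{}, errNotFound }},
	}
	return firstMatch(email, chain)
}

func main() {
	fmt.Printf("%+v\n", demoLookup("john.smith@gs.com"))
}
```
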
## 2. Domain-Based Company Scraping

### What to scrape (single fetch of homepage)

- Company name (from `<title>`, OG tags, JSON-LD)
- Logo URL (OG image, JSON-LD logo, favicon fallback)
- Description / tagline
- Address, phone, fax
- Industry signals (from meta keywords, page content)
- Social links (LinkedIn, Twitter)
- Tech stack hints (optional, from script tags / headers)

### Deep scrape (if team page exists)

Follow `/about`, `/team`, `/our-team`, `/people` links from nav:

- Person name, title, photo URL, email, phone
- Bio, education, prior experience (from individual profile pages)

### Implementation

Go + `goquery` for HTML parsing. No headless browser is needed, because the target sites are typically server-rendered WordPress or marketing sites. Timeout: 5s per fetch. Budget: at most 3 pages per domain (homepage + team listing + one profile to detect the email pattern).

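The timeout and page budget might be enforced along these lines. This sketch deliberately leaves out `goquery` and covers only the fetch limits, using the standard library. `fetcher`, `domainCrawl`, and the 2 MiB body cap are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"
)

const (
	fetchTimeout = 5 * time.Second // per-fetch timeout from the spec
	maxPages     = 3               // homepage + team listing + one profile
)

var errBudget = errors.New("page budget exhausted for domain")

// fetcher abstracts the HTTP call so the budget logic can be tested offline.
type fetcher func(url string) (string, error)

// httpFetcher returns a fetcher backed by net/http with the 5s timeout.
func httpFetcher() fetcher {
	client := &http.Client{Timeout: fetchTimeout}
	return func(url string) (string, error) {
		resp, err := client.Get(url)
		if err != nil {
			return "", err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return "", fmt.Errorf("GET %s: %s", url, resp.Status)
		}
		// Assumed safety cap on body size; not part of the spec.
		body, err := io.ReadAll(io.LimitReader(resp.Body, 2<<20))
		return string(body), err
	}
}

// domainCrawl enforces the per-domain page budget.
type domainCrawl struct {
	fetch fetcher
	used  int
}

// get fetches a page unless the budget for this domain is already spent.
func (c *domainCrawl) get(url string) (string, error) {
	if c.used >= maxPages {
		return "", errBudget
	}
	c.used++
	return c.fetch(url)
}

// countFetches tries the given number of pages against a stub fetcher
// and reports how many were allowed through.
func countFetches(attempts int) int {
	crawl := &domainCrawl{fetch: func(string) (string, error) { return "<html></html>", nil }}
	allowed := 0
	for i := 0; i < attempts; i++ {
		if _, err := crawl.get(fmt.Sprintf("https://example.com/page%d", i)); err == nil {
			allowed++
		}
	}
	return allowed
}

func main() {
	fmt.Println(countFetches(5)) // 3
}
```

In production the stub would be swapped for `httpFetcher()`, and the returned HTML handed to the parser.
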
### Caching

Cache company data per domain in SQLite. TTL: 30 days. Headshot images: download and store locally (don't hotlink; external URLs go stale).

## 3. Headshot Strategy

### Source Priority

1. **Enrichment API** (Apollo/Clearbit): highest quality, most reliable
2. **Gravatar**: `https://gravatar.com/avatar/{md5(email)}?d=404`, where the hash is taken over the trimmed, lowercased email (a 404 response means no image)
3. **Company website**: scraped from team page (see §2)
4. **LinkedIn**: if user linked their profile (manual, not scraped)
5. **Fallback**: generated initials avatar with company brand color (extracted from logo dominant color)

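The Gravatar lookup URL and the initials used by the fallback avatar can be computed as below. The function names are illustrative; the hashing rule (MD5 of the trimmed, lowercased address) follows Gravatar's documented convention, and `d=404` is the parameter already given in the list above.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// gravatarURL builds the lookup URL; d=404 makes Gravatar answer
// 404 Not Found instead of serving a default image.
func gravatarURL(email string) string {
	normalized := strings.ToLower(strings.TrimSpace(email))
	sum := md5.Sum([]byte(normalized))
	return "https://gravatar.com/avatar/" + hex.EncodeToString(sum[:]) + "?d=404"
}

// initials returns the upper-cased first letters of the given names,
// skipping any name that is empty.
func initials(first, last string) string {
	var b strings.Builder
	for _, name := range []string{first, last} {
		if r, size := utf8.DecodeRuneInString(name); size > 0 {
			b.WriteRune(unicode.ToUpper(r))
		}
	}
	return b.String()
}

func main() {
	fmt.Println(gravatarURL(" John.Smith@GS.com "))
	fmt.Println(initials("craig", "lawson")) // CL
}
```
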
### Storage

- Download and store all headshots locally at invite/enrichment time
- Serve from DealSpace CDN, never hotlink external URLs
- Standard size: 256×256px, JPEG, quality 85
- Thumbnail: 48×48px for inline use

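Producing the two stored sizes could look roughly like this. The sketch uses only the standard library with nearest-neighbor sampling, which is an assumption made to keep the example dependency-free; a production version would more likely crop to a square first and use a higher-quality scaler. `resizeNearest`, `encodeHeadshot`, and `demoSize` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/jpeg"
)

// resizeNearest scales src to size×size using nearest-neighbor sampling.
// Non-square sources are stretched; cropping is left out of this sketch.
func resizeNearest(src image.Image, size int) *image.RGBA {
	dst := image.NewRGBA(image.Rect(0, 0, size, size))
	b := src.Bounds()
	for y := 0; y < size; y++ {
		for x := 0; x < size; x++ {
			sx := b.Min.X + x*b.Dx()/size
			sy := b.Min.Y + y*b.Dy()/size
			dst.Set(x, y, src.At(sx, sy))
		}
	}
	return dst
}

// encodeHeadshot resizes and encodes as JPEG at quality 85, per the spec.
func encodeHeadshot(src image.Image, size int) ([]byte, error) {
	var buf bytes.Buffer
	err := jpeg.Encode(&buf, resizeNearest(src, size), &jpeg.Options{Quality: 85})
	return buf.Bytes(), err
}

// demoSize encodes a blank stand-in photo and reports the decoded dimensions.
func demoSize(size int) (int, int, error) {
	src := image.NewRGBA(image.Rect(0, 0, 600, 400))
	data, err := encodeHeadshot(src, size)
	if err != nil {
		return 0, 0, err
	}
	cfg, err := jpeg.DecodeConfig(bytes.NewReader(data))
	return cfg.Width, cfg.Height, err
}

func main() {
	for _, size := range []int{256, 48} {
		w, h, err := demoSize(size)
		fmt.Println(w, h, err)
	}
}
```
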
### Display Locations (mini headshots throughout)

- **Deal room member list**: face next to name and role
- **Activity feed**: "{face} Craig Lawson viewed the CIM — 2m ago"
- **Document access log**: who opened what, when, with face
- **Comments / annotations**: face next to every note
- **Invite flow**: show headshots of discovered colleagues
- **Online presence**: 24px avatars in header: "3 people in this room"
- **Watermark metadata**: avatar optionally embedded in PDF watermark (v2)

## 4. Colleague Discovery & Invite

### Flow

After the first user from a domain authenticates:

1. Query enrichment API for known people at same domain
2. Present: "We found N people at {Company}. Invite them to this deal room?"
3. Show as a list with headshots, name, title, and checkboxes to select
4. Generate invite email addresses using the pattern detected from the first user's email
5. One-click send

### Email Pattern Detection

From the authenticated user's email, derive the pattern:

- `john.smith@hpc.com` → `{first}.{last}@hpc.com`
- `jsmith@hpc.com` → `{first_initial}{last}@hpc.com`
- `john@hpc.com` → `{first}@hpc.com`

Apply the pattern to generate emails for discovered colleagues. Optional: probe each generated address with SMTP `RCPT TO` before sending. Treat the result as a hint only, since some servers accept every recipient and others reject or throttle such probes.

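Pattern detection and application might be sketched as follows. The example assumes the authenticated user's first and last name are already known (for instance from enrichment), since `jsmith` alone does not reveal the first name. `detectPattern` and `applyPattern` are hypothetical helpers, and only the three patterns listed above are recognized.

```go
package main

import (
	"fmt"
	"strings"
)

// detectPattern infers the address template from a person whose
// name and email are both known. ok is false if nothing matches.
func detectPattern(email, first, last string) (string, bool) {
	local, domain, ok := strings.Cut(strings.ToLower(strings.TrimSpace(email)), "@")
	f, l := strings.ToLower(first), strings.ToLower(last)
	if !ok || f == "" {
		return "", false
	}
	switch {
	case l != "" && local == f+"."+l:
		return "{first}.{last}@" + domain, true
	case l != "" && local == f[:1]+l:
		return "{first_initial}{last}@" + domain, true
	case local == f:
		return "{first}@" + domain, true
	}
	return "", false
}

// applyPattern fills the template for a discovered colleague.
// Names containing spaces (e.g. "Smith Jr") are not handled here.
func applyPattern(pattern, first, last string) string {
	f, l := strings.ToLower(first), strings.ToLower(last)
	initial := ""
	if f != "" {
		initial = f[:1]
	}
	return strings.NewReplacer(
		"{first_initial}", initial,
		"{first}", f,
		"{last}", l,
	).Replace(pattern)
}

func main() {
	pattern, ok := detectPattern("jsmith@hpc.com", "John", "Smith")
	fmt.Println(pattern, ok)
	fmt.Println(applyPattern(pattern, "Craig", "Lawson"))
}
```
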
### Privacy Considerations

- Only show colleague suggestions to users whose role ends in `_admin` (e.g. `ib_admin`, `seller_admin`)
- Never expose enrichment data to users outside the deal room
- Allow users to dismiss / hide colleague suggestions
- Enrichment data is deal-room-scoped, not global

## 5. Company Card Auto-Population

When a company domain enters the system (via user email or manual entry), auto-generate a company card:

| Field | Source |
|-------|--------|
| Name | HTML title / OG / JSON-LD |
| Logo | OG image / favicon |
| Description | Meta description / JSON-LD |
| Address | Page scrape / JSON-LD |
| Phone | Page scrape |
| Industry | Enrichment API / page signals |
| Website | The domain itself |
| Team size | Enrichment API |
| Key people | Team page scrape |
| LinkedIn | Social links from page |

For bulge bracket firms (Goldman, JPM, Morgan Stanley, etc.), maintain a static seed database of ~50 major firms with pre-filled cards. Don't scrape these; just look them up.

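The "seed first, scrape otherwise" rule could be sketched as below. The single seed entry is illustrative only, and `CompanyCard`, `scraper`, and `resolveCard` are hypothetical names; the scraper is a stub standing in for the homepage scrape from §2.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// CompanyCard holds a small subset of the card fields for this example.
type CompanyCard struct {
	Name, Domain string
	Source       string // "seed" or "scrape"
}

// seedCards stands in for the static database of ~50 major firms.
// The entry below is illustrative, not a real seed file.
var seedCards = map[string]CompanyCard{
	"gs.com": {Name: "Goldman Sachs", Domain: "gs.com", Source: "seed"},
}

// scraper is whatever produces a card from a homepage fetch.
type scraper func(domain string) (CompanyCard, error)

// resolveCard consults the seed database first and only scrapes on a miss.
func resolveCard(domain string, scrape scraper) (CompanyCard, error) {
	domain = strings.ToLower(strings.TrimSpace(domain))
	if domain == "" {
		return CompanyCard{}, errors.New("empty domain")
	}
	if card, ok := seedCards[domain]; ok {
		return card, nil
	}
	card, err := scrape(domain)
	if err != nil {
		return CompanyCard{}, err
	}
	card.Domain, card.Source = domain, "scrape"
	return card, nil
}

// demoCard resolves a domain with a stub scraper.
func demoCard(domain string) CompanyCard {
	card, _ := resolveCard(domain, func(d string) (CompanyCard, error) {
		return CompanyCard{Name: "Scraped from " + d}, nil
	})
	return card
}

func main() {
	fmt.Printf("%+v\n", demoCard("gs.com"))
	fmt.Printf("%+v\n", demoCard("hpc.com"))
}
```
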
## 6. Spreadsheet Anonymization

### Use Case

A seller uploads a financial model or data room spreadsheet. Buyer-side viewers should see anonymized company and counterparty names.

### Approach

- Go + `excelize` library (MIT, actively maintained)
- Consistent deterministic mapping across all tabs: "Acme Corp" → "Target Alpha" everywhere
- Mapping stored per deal room, reversible by deal room owner (`ib_admin`, `seller_admin`)
- Preserve formulas: VLOOKUPs still resolve because the same name maps to the same alias on every tab
- User-configurable: tag which columns/fields to anonymize (or auto-suggest from headers like "Company", "Name", "Contact", "Counterparty")

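The consistent, reversible mapping at the heart of this approach might look like the sketch below. It covers only the name-to-alias bookkeeping; reading and writing cells with `excelize` is omitted. The alias scheme ("Target" plus a Greek letter, assigned in order of first appearance) is an assumption extrapolated from the "Target Alpha" example, and `Anonymizer` is a hypothetical type.

```go
package main

import (
	"fmt"
	"strings"
)

var greek = []string{"Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta", "Eta", "Theta"}

// Anonymizer keeps one mapping per deal room so that the same
// real name always yields the same alias on every tab.
type Anonymizer struct {
	forward map[string]string // real name -> alias
	reverse map[string]string // alias -> real name
}

func NewAnonymizer() *Anonymizer {
	return &Anonymizer{forward: map[string]string{}, reverse: map[string]string{}}
}

// Alias returns the existing alias for name or assigns the next one.
func (a *Anonymizer) Alias(name string) string {
	key := strings.TrimSpace(name)
	if alias, ok := a.forward[key]; ok {
		return alias
	}
	n := len(a.forward)
	alias := "Target " + greek[n%len(greek)]
	if round := n / len(greek); round > 0 {
		alias = fmt.Sprintf("%s %d", alias, round+1) // e.g. "Target Alpha 2"
	}
	a.forward[key] = alias
	a.reverse[alias] = key
	return alias
}

// Reveal reverses the mapping; callers must restrict it to deal room owners.
func (a *Anonymizer) Reveal(alias string) (string, bool) {
	name, ok := a.reverse[alias]
	return name, ok
}

func main() {
	anon := NewAnonymizer()
	fmt.Println(anon.Alias("Acme Corp")) // Target Alpha
	fmt.Println(anon.Alias("Globex"))    // Target Beta
	fmt.Println(anon.Alias("Acme Corp")) // Target Alpha
}
```

Because lookups are keyed on the exact (trimmed) cell text, variants such as "Acme Corp." and "ACME CORP" would receive different aliases; normalizing those is left open here.
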
### Limitations

- Formula-embedded string literals (e.g., `=IF(A1="Acme Corp",...)`) require formula string parsing; defer to v2
- Conditional formatting rules referencing text values: defer to v2

## 7. API / Enrichment Vendor Evaluation

| Vendor | Person | Company | Email verify | Price |
|--------|--------|---------|--------------|-------|
| Apollo.io | ✓ | ✓ | ✓ | Free tier: 50/mo, paid from $49/mo |
| Clearbit (HubSpot) | ✓ | ✓ | ✗ | Enterprise pricing |
| Hunter.io | partial | ✗ | ✓ | Free tier: 25/mo, paid from $49/mo |
| PDL (People Data Labs) | ✓ | ✓ | ✗ | Pay-per-record |
| Gravatar | photo only | ✗ | ✗ | Free |

Recommendation: Start with Apollo free tier for MVP. Evaluate PDL for scale.

## 8. Implementation Phases

**Phase 1 (MVP):** Email parsing + Gravatar headshot + initials fallback. No paid API dependency (Gravatar is the only external call, and it is free).

**Phase 2:** Domain homepage scrape → company card auto-population. Still zero paid API.

**Phase 3:** Apollo/Clearbit integration → full person enrichment + colleague discovery.

**Phase 4:** Team page deep scrape + spreadsheet anonymization.

## 9. Profile Isolation: Deal-Scoped Identity

### Core Principle

**Profiles are scoped to the deal room, not global.** The same physical person can have three separate profiles across three deals, each with different levels of PII. This is by design.

### Why

- Deal A: someone's cell phone was shared in a conversation → their profile has it
- Deal B: same person, but only their work email is known → no cell
- Deal C: same person joined via a different email entirely → different enrichment data

We do NOT merge these. We do NOT leak PII from Deal A into Deal B. Each deal room is a self-contained universe of identity.

### Data Model

```sql
CREATE TABLE DealProfile (
    id            TEXT PRIMARY KEY,
    deal_id       TEXT NOT NULL,   -- FK to deal room
    email         TEXT NOT NULL,   -- the email used in THIS deal
    first_name    TEXT,
    last_name     TEXT,
    title         TEXT,
    phone         TEXT,            -- may be NULL in some deals
    headshot_path TEXT,            -- locally stored, per-deal copy
    company_id    TEXT,            -- FK to DealCompany (also deal-scoped)
    linkedin_url  TEXT,
    source        TEXT,            -- 'enrichment', 'manual', 'scrape', 'invite'
    created_at    DATETIME,
    enriched_at   DATETIME,

    UNIQUE (deal_id, email)
);
```

### Rules

1. **No cross-deal profile lookups.** A query in Deal A never touches Deal B's profiles.
2. **No global person table.** There is no `Person` entity, only `DealProfile`.
3. **Headshots are copied per deal.** Even if the same headshot URL was scraped, each deal gets its own stored copy. If the source disappears, existing deals are unaffected.
4. **Enrichment runs per deal.** If the same email appears in two deals, enrichment runs independently for each. Redundant but safe.
5. **Deletion is deal-scoped.** Deleting a deal room deletes all its profiles. No orphans, no leaks.
6. **Admin export includes only that deal's PII.** Audit logs, data exports, and compliance reports are deal-scoped.

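Rules 1, 2, and 5 can be made hard to violate by giving the storage layer no method that takes an email without a deal ID. The in-memory sketch below illustrates that shape; `Store`, `profileKey`, and the method names are hypothetical, and a real implementation would sit on top of the SQLite table rather than a map.

```go
package main

import "fmt"

// DealProfile mirrors a few columns of the data model for this example.
type DealProfile struct {
	DealID, Email, Phone string
}

// profileKey matches the UNIQUE(deal_id, email) constraint.
type profileKey struct{ dealID, email string }

// Store exposes no lookup by email alone, so cross-deal reads
// cannot be written by accident.
type Store struct {
	profiles map[profileKey]DealProfile
}

func NewStore() *Store {
	return &Store{profiles: map[profileKey]DealProfile{}}
}

// Put inserts or replaces the profile for (DealID, Email).
func (s *Store) Put(p DealProfile) {
	s.profiles[profileKey{p.DealID, p.Email}] = p
}

// Get only ever sees profiles belonging to the given deal.
func (s *Store) Get(dealID, email string) (DealProfile, bool) {
	p, ok := s.profiles[profileKey{dealID, email}]
	return p, ok
}

// DeleteDeal removes every profile in the deal and reports how many.
func (s *Store) DeleteDeal(dealID string) int {
	removed := 0
	for k := range s.profiles {
		if k.dealID == dealID {
			delete(s.profiles, k)
			removed++
		}
	}
	return removed
}

func main() {
	s := NewStore()
	s.Put(DealProfile{DealID: "deal-a", Email: "craig@hpc.com", Phone: "555-0100"})
	s.Put(DealProfile{DealID: "deal-b", Email: "craig@hpc.com"})

	b, _ := s.Get("deal-b", "craig@hpc.com")
	fmt.Printf("deal-b phone: %q\n", b.Phone) // deal-b phone: ""
	fmt.Println("removed:", s.DeleteDeal("deal-a"))
}
```
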
### Rationale

In M&A, confidentiality isn't just about documents; it's about *who knows what about whom*. A buyer in Deal A should never learn that the same banker is also involved in Deal B, or gain contact details that were only shared in a different context. Three profiles for one person is not a bug. It's the feature.