# Dealspace — AI Matching & Embedding Specification

**Version:** 0.1 — 2026-02-28
**Status:** Pre-implementation. Addresses SPEC-REVIEW.md section 3 (race conditions) and section 7.4 (magic threshold).

---

## 1. Embedding Model Selection

### 1.1 Candidates

| Model | Provider | Dimensions | Context (tokens) | Latency | Cost | Retention |
|-------|----------|------------|------------------|---------|------|-----------|
| `nomic-embed-text-v1.5` | Fireworks | 768 | 8192 | ~50ms | $0.008/1M | Zero |
| `voyage-finance-2` | Voyage AI | 1024 | 16000 | ~80ms | $0.12/1M | 30 days |

### 1.2 Domain Analysis

M&A requests contain:
- **Financial terminology:** EBITDA, working capital adjustments, earnout provisions, rep & warranty
- **Legal terminology:** indemnification, IP assignments, material adverse change, disclosure schedules
- **Industry-specific terms:** varies by deal (tech: ARR, churn; healthcare: HIPAA, 340B; manufacturing: capex, inventory turns)

**Voyage-finance-2** was trained specifically on financial documents (10-Ks, credit agreements, M&A filings). It shows ~8% improvement on financial similarity benchmarks vs. general-purpose models.

**Nomic-embed-text-v1.5** is general-purpose but performs well on semantic matching. Zero retention is critical for M&A confidentiality.

### 1.3 Recommendation

**Use Fireworks nomic-embed-text-v1.5** for MVP:
1. **Zero retention** — Voyage's 30-day retention is unacceptable for M&A data
2. **15x cheaper** — allows generous matching without cost concerns
3. **Proven stack** — same infrastructure as inou (known good)
4. **Good enough** — general semantic similarity works for request matching; we're matching "audited financials" to "FY2024 audit report," not parsing covenant calculations

**Revisit voyage-finance-2 when:**
- Voyage offers a zero-retention API option
- Match quality metrics show <85% human confirmation rate
- We expand to quantitative matching (finding answers by numerical similarity)

### 1.4 Fireworks Configuration

```go
const (
	EmbedModel     = "nomic-embed-text-v1.5"
	EmbedEndpoint  = "https://api.fireworks.ai/inference/v1/embeddings"
	EmbedDimension = 768
	EmbedMaxTokens = 8192 // model context limit
)
```

API key stored in environment: `FIREWORKS_API_KEY`

---

## 2. What to Embed

### 2.1 Request Embedding

Embed the **semantic intent**, not raw fields.

```go
func BuildRequestEmbedText(r *Request, ws *Workstream) string {
	var b strings.Builder

	// Workstream context (aids cross-workstream relevance)
	b.WriteString("Workstream: ")
	b.WriteString(ws.Name)
	b.WriteString("\n\n")

	// Title is high-signal
	b.WriteString("Request: ")
	b.WriteString(r.Title)
	b.WriteString("\n\n")

	// Body provides detail
	if r.Body != "" {
		b.WriteString("Details: ")
		b.WriteString(r.Body)
	}

	return strings.TrimSpace(b.String())
}
```

**Do NOT embed:**
- Request ID, ref numbers (non-semantic)
- Due dates, priority (operational, not semantic)
- Assignee names (PII, not relevant to matching)
- Status (changes frequently, embedding is point-in-time)

### 2.2 Answer Embedding

Answers may be long (multi-document explanations). Use chunking with overlap.

```go
const (
	AnswerChunkSize    = 1500 // tokens (~6000 chars)
	AnswerChunkOverlap = 200  // tokens overlap for context continuity
)

func BuildAnswerEmbedTexts(a *Answer, ws *Workstream) []string {
	var chunks []string

	// Prefix every chunk with context
	prefix := fmt.Sprintf("Workstream: %s\nAnswer: %s\n\n", ws.Name, a.Title)

	// Sizes below are in bytes, using the rough estimate of 4 bytes per token.
	// NOTE: offsets are byte-based, so a cut can land inside a multi-byte UTF-8 character.
	maxLen := AnswerChunkSize * 4
	overlap := AnswerChunkOverlap * 4

	body := a.Body
	if len(body) <= maxLen { // ~6000 chars = 1 chunk
		return []string{prefix + body}
	}

	// Chunk long answers
	for start := 0; start < len(body); {
		end := start + maxLen
		if end > len(body) {
			end = len(body)
		}

		// Find sentence boundary near end
		if end < len(body) {
			if idx := strings.LastIndex(body[start:end], ". "); idx > 0 {
				end = start + idx + 1
			}
		}

		chunks = append(chunks, prefix+body[start:end])
		if end == len(body) {
			break // last chunk emitted; stepping back by the overlap here would loop forever
		}

		// Step back by the overlap, but always move forward so the loop terminates
		next := end - overlap
		if next <= start {
			next = end
		}
		start = next
	}

	return chunks
}
```

**Do NOT embed:**
- File contents (privacy: see section 10)
- File names (may contain sensitive identifiers)
- Rejection reasons (operational, not semantic)
- Internal comments (IB-only context)

### 2.3 Embedding Storage

```sql
CREATE TABLE embeddings (
    id         TEXT PRIMARY KEY,            -- UUID
    entry_id   TEXT NOT NULL,               -- request or answer entry_id
    chunk_idx  INTEGER NOT NULL DEFAULT 0,  -- 0 for single-chunk, 0..N for multi-chunk
    vector     BLOB NOT NULL,               -- 768 float32 = 3072 bytes
    text_hash  TEXT NOT NULL,               -- SHA-256 of embedded text (dedup check)
    model      TEXT NOT NULL,               -- "nomic-embed-text-v1.5"
    created_at INTEGER NOT NULL,
    UNIQUE(entry_id, chunk_idx)
);

CREATE INDEX idx_embeddings_entry ON embeddings(entry_id);
```

**Note:** Vector stored as raw `float32` bytes (little-endian). Cosine similarity computed in Go, not SQL.
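
The storage format in the note can be sketched as a pair of helpers that convert between `[]float32` and the `vector` BLOB. This is a minimal sketch, not the project's code: the names `EncodeVector` and `DecodeVector` are hypothetical, and the length check assumes the caller knows the expected dimension (768 for the model chosen above).

```go
package main

import (
	"encoding/binary"
	"errors"
	"math"
)

// EncodeVector serializes a float32 vector as little-endian bytes,
// 4 bytes per value (hypothetical helper, not part of the spec).
func EncodeVector(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// DecodeVector parses a BLOB produced by EncodeVector.
// dim is the expected dimension; a mismatch is reported as an error
// rather than silently truncating or padding.
func DecodeVector(blob []byte, dim int) ([]float32, error) {
	if len(blob) != 4*dim {
		return nil, errors.New("vector blob has unexpected length")
	}
	v := make([]float32, dim)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(blob[4*i:]))
	}
	return v, nil
}
```

With a 768-dimensional vector, `EncodeVector` produces the 3072 bytes quoted in the schema comment. Rejecting a wrong-length BLOB on read also catches rows written by a model with a different dimension.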

---

## 3. Retroactive Matching (Bidirectional)

### 3.1 The Problem (from SPEC-REVIEW §3.1)

The original spec only describes matching when a buyer submits a request. But:
- **New request** should search existing published answers
- **New answer** (when published) should search open requests

Both directions are required for complete coverage.

### 3.2 Matching Directions

```
Direction 1: Request → Answers
  Trigger: Request created OR request text updated
  Search:  All published answers in accessible workstreams
  Output:  Suggested answer_links with ai_score

Direction 2: Answer → Requests
  Trigger: Answer published (stage = dataroom)
  Search:  All open requests in accessible workstreams
  Output:  Suggested answer_links with ai_score
```

### 3.3 Implementation

```go
// Called when request is created or body/title changes.
// Outline only; the full signatures (with db and actorID) are in section 11.1.
func MatchRequestToAnswers(ctx context.Context, requestID string) ([]AnswerMatch, error) {
	// 1. Get request embedding (create if missing)
	// 2. Load all published answer embeddings in accessible workstreams
	//    (own workstream plus opted-in cross-workstream, see section 6)
	// 3. Cosine similarity against each
	// 4. Return matches above threshold
	return nil, nil // placeholder: steps above not implemented in this outline
}

// Called when answer is published
func MatchAnswerToRequests(ctx context.Context, answerID string) ([]RequestMatch, error) {
	// 1. Get answer embedding(s) (multi-chunk: use max score across chunks)
	// 2. Load all open request embeddings in accessible workstreams
	// 3. Cosine similarity against each
	// 4. Return matches above threshold
	return nil, nil // placeholder: steps above not implemented in this outline
}
```
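
Step 1 of `MatchAnswerToRequests` says a multi-chunk answer is scored by its best chunk. A minimal sketch of that reduction follows, assuming the per-chunk similarities have already been computed. `ChunkScore` and `BestPerEntry` are hypothetical names, and treating the threshold as inclusive is an assumption based on the "minimum score" wording in section 5.2.

```go
package main

import "sort"

// ChunkScore is a hypothetical intermediate value: one similarity
// score per stored chunk of an entry.
type ChunkScore struct {
	EntryID  string
	ChunkIdx int
	Score    float32
}

// BestPerEntry keeps the highest-scoring chunk for each entry, drops
// entries whose best score is below threshold, and sorts the rest by
// score (descending), breaking ties by entry ID for stable output.
func BestPerEntry(scores []ChunkScore, threshold float32) []ChunkScore {
	best := make(map[string]ChunkScore)
	for _, s := range scores {
		if cur, ok := best[s.EntryID]; !ok || s.Score > cur.Score {
			best[s.EntryID] = s
		}
	}

	out := make([]ChunkScore, 0, len(best))
	for _, s := range best {
		if s.Score >= threshold {
			out = append(out, s)
		}
	}

	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].EntryID < out[j].EntryID
	})
	return out
}
```

Keeping the winning `ChunkIdx` lets a reviewer see which part of a long answer triggered the suggestion.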

### 3.4 Matching on Update

If a request body is edited:
1. Recompute embedding (check `text_hash` — skip if unchanged)
2. Re-run matching
3. New suggestions appear; existing confirmed links preserved

If an answer body is edited before publish:
- No action (draft state)

If an answer is re-published (correction):
- Re-run matching
- Flag for human review if new requests match
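
The `text_hash` check in step 1 can be sketched as below. This is an illustration under assumptions: the spec only says the hash is SHA-256 of the embedded text, so the hex encoding and the helper names `TextHash` and `NeedsReembed` are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// TextHash returns the SHA-256 of the exact text sent for embedding,
// hex-encoded (the encoding is an assumption; the spec only names SHA-256).
func TextHash(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

// NeedsReembed reports whether the stored hash no longer matches the
// text that would be embedded now. When it returns false, the API call
// and the re-match can both be skipped.
func NeedsReembed(storedHash, currentText string) bool {
	return storedHash != TextHash(currentText)
}
```

The hash should be taken over the output of `BuildRequestEmbedText`, not the raw body, so that a workstream rename or title edit also triggers a refresh.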

---

## 4. Broadcast Idempotency

### 4.1 The Problem (from SPEC-REVIEW §3.2)

Multiple requests can link to the same answer. Without idempotency:
- Confirming R1↔A1 sends broadcast to Buyer A
- Confirming R2↔A1 sends another broadcast to Buyer B
- If Buyer A also asked R2... they get two notifications

### 4.2 Broadcasts Table

```sql
CREATE TABLE broadcasts (
    id           TEXT PRIMARY KEY,
    answer_id    TEXT NOT NULL REFERENCES entries(entry_id),
    request_id   TEXT NOT NULL REFERENCES entries(entry_id),
    recipient_id TEXT NOT NULL REFERENCES users(id),
    channel      TEXT NOT NULL,  -- "web" | "email" | "slack" | "teams"
    sent_at      INTEGER NOT NULL,
    UNIQUE(answer_id, request_id, recipient_id, channel)
);

CREATE INDEX idx_broadcasts_answer ON broadcasts(answer_id);
CREATE INDEX idx_broadcasts_recipient ON broadcasts(recipient_id);
```

### 4.3 Broadcast Logic

```go
func BroadcastAnswer(ctx context.Context, tx *sql.Tx, answerID string) error {
	// 1. Get all confirmed answer_links for this answer
	links, err := getConfirmedLinks(tx, answerID)
	if err != nil {
		return err
	}

	// 2. For each link, get the request's origin_id (ultimate requester)
	recipients := make(map[string][]string) // user_id -> []request_id
	for _, link := range links {
		req, err := getRequest(tx, link.RequestID)
		if err != nil {
			return err
		}
		recipients[req.OriginID] = append(recipients[req.OriginID], link.RequestID)
	}

	// 3. For each recipient, check idempotency and send
	for userID, requestIDs := range recipients {
		for _, reqID := range requestIDs {
			// Check if already sent
			exists, err := broadcastExists(tx, answerID, reqID, userID, "web")
			if err != nil {
				return err
			}
			if exists {
				continue // idempotent skip
			}

			// Record broadcast
			err = insertBroadcast(tx, answerID, reqID, userID, "web", time.Now().UnixMilli())
			if err != nil {
				return err
			}

			// Queue notification. The push is not covered by the transaction:
			// if tx later rolls back, the queued item must be discarded, or
			// delivery must be deferred until after commit.
			NotifyQueue.Push(Notification{
				UserID:    userID,
				AnswerID:  answerID,
				RequestID: reqID,
			})
		}
	}

	return nil
}
```

### 4.4 Notification Deduplication

Even with idempotency per (answer, request, recipient), a user might get multiple notifications if they asked 5 equivalent questions.

**User-facing behavior:** Collapse into single notification:
```
"Answer published: [Title] — resolves 5 of your requests"
```

This is a presentation concern, not a data model change. The `broadcasts` table tracks each link; the notification renderer collapses them.
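
One way the renderer could collapse per-link notifications is to group them by recipient and answer before display. The sketch below assumes the `Notification` fields used in section 4.3; `Collapsed` and `Collapse` are hypothetical names introduced for illustration.

```go
package main

// Notification mirrors the fields pushed to the queue in section 4.3.
type Notification struct {
	UserID    string
	AnswerID  string
	RequestID string
}

// Collapsed is a hypothetical render-time view: one item per
// (recipient, answer) pair, listing every request it resolves.
type Collapsed struct {
	UserID     string
	AnswerID   string
	RequestIDs []string
}

// Collapse groups notifications by (UserID, AnswerID), preserving the
// order in which each group was first seen.
func Collapse(ns []Notification) []Collapsed {
	type key struct{ user, answer string }
	index := make(map[key]int)
	var out []Collapsed

	for _, n := range ns {
		k := key{n.UserID, n.AnswerID}
		i, ok := index[k]
		if !ok {
			i = len(out)
			index[k] = i
			out = append(out, Collapsed{UserID: n.UserID, AnswerID: n.AnswerID})
		}
		out[i].RequestIDs = append(out[i].RequestIDs, n.RequestID)
	}
	return out
}
```

The renderer can then use `len(c.RequestIDs)` for the "resolves N of your requests" count while the `broadcasts` rows stay one per link.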

---

## 5. Configurable Similarity Threshold

### 5.1 The Problem (from SPEC-REVIEW §7.4)

Hardcoded 0.72 is a magic number that:
- May be too strict for some workstreams (legal requests are verbose)
- May be too loose for others (financial requests are terse)
- Cannot be tuned without code changes

### 5.2 Per-Workstream Configuration

Add to workstream entry's Data:

```json
{
  "name": "Finance",
  "match_config": {
    "threshold": 0.72,
    "auto_confirm_threshold": 0.95,
    "cross_workstream": ["Legal"]
  }
}
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `threshold` | float | 0.72 | Minimum score to suggest match |
| `auto_confirm_threshold` | float | null | If set, scores above this auto-confirm (no human review) |
| `cross_workstream` | []string | [] | Workstream slugs to include in matching (see section 6) |
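
The defaults in the table can be applied while decoding the workstream's Data JSON. The following is a sketch under assumptions: `MatchConfig`, `ParseMatchConfig`, and `DefaultThreshold` are hypothetical names, and a missing or `null` `match_config` is treated the same as an empty one.

```go
package main

import "encoding/json"

// DefaultThreshold mirrors the documented default for match_config.threshold.
const DefaultThreshold = 0.72

// MatchConfig is a hypothetical Go view of match_config with defaults applied.
type MatchConfig struct {
	Threshold            float64
	AutoConfirmThreshold *float64 // nil means auto-confirm is disabled
	CrossWorkstream      []string
}

// ParseMatchConfig reads match_config from a workstream's Data JSON and
// fills in the documented defaults for fields that are absent.
func ParseMatchConfig(data []byte) (MatchConfig, error) {
	// Pointer fields distinguish "absent" from an explicit zero value.
	var raw struct {
		MatchConfig struct {
			Threshold            *float64 `json:"threshold"`
			AutoConfirmThreshold *float64 `json:"auto_confirm_threshold"`
			CrossWorkstream      []string `json:"cross_workstream"`
		} `json:"match_config"`
	}
	if err := json.Unmarshal(data, &raw); err != nil {
		return MatchConfig{}, err
	}

	cfg := MatchConfig{
		Threshold:            DefaultThreshold,
		AutoConfirmThreshold: raw.MatchConfig.AutoConfirmThreshold,
		CrossWorkstream:      raw.MatchConfig.CrossWorkstream,
	}
	if raw.MatchConfig.Threshold != nil {
		cfg.Threshold = *raw.MatchConfig.Threshold
	}
	return cfg, nil
}
```

Returning a fully populated struct means the matching code never has to special-case a workstream that has no configuration yet.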

### 5.3 Threshold Tuning Guidance

| Workstream Type | Recommended Threshold | Rationale |
|-----------------|-----------------------|-----------|
| Finance | 0.72 | Standard M&A requests, well-defined terminology |
| Legal | 0.68 | Verbose requests with boilerplate, semantic core is smaller |
| IT | 0.75 | Technical specificity matters, false positives costly |
| HR | 0.70 | Mix of standard and org-specific terms |
| Operations | 0.72 | General business terminology |

### 5.4 Calibration Process

After initial deal data:
1. Export all (request, answer, human_confirmed) tuples
2. Compute score distribution for confirmed vs. rejected matches
3. Adjust threshold to maximize F1 score per workstream
4. Log threshold changes to audit for compliance
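
Step 3 can be sketched as a sweep over candidate thresholds on the exported tuples. This is an illustrative sketch: `Labeled` and `BestThreshold` are hypothetical names, and a score at or above the threshold counts as a suggested match.

```go
package main

// Labeled is one exported tuple reduced to what calibration needs:
// the AI score and whether a human confirmed the match.
type Labeled struct {
	Score     float64
	Confirmed bool
}

// BestThreshold evaluates each candidate threshold against the labeled
// samples and returns the one with the highest F1 score. Ties keep the
// earlier candidate. It returns (0, 0) if no candidate yields a true positive.
func BestThreshold(samples []Labeled, candidates []float64) (best, bestF1 float64) {
	for _, t := range candidates {
		var tp, fp, fn float64
		for _, s := range samples {
			predicted := s.Score >= t
			switch {
			case predicted && s.Confirmed:
				tp++
			case predicted && !s.Confirmed:
				fp++
			case !predicted && s.Confirmed:
				fn++
			}
		}
		if tp == 0 {
			continue // F1 is zero or undefined at this threshold
		}
		// F1 = 2TP / (2TP + FP + FN)
		f1 := 2 * tp / (2*tp + fp + fn)
		if f1 > bestF1 {
			best, bestF1 = t, f1
		}
	}
	return best, bestF1
}
```

One caveat worth keeping in mind: pairs that scored below the live threshold were never suggested, so they never received a human label. The exported data therefore undercounts false negatives, and lowering the threshold cannot be validated from it alone.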

---

## 6. Cross-Workstream Matching

### 6.1 Use Case

An IT request ("Describe cybersecurity insurance coverage") may be answered by a Legal answer (cyber liability policy document).

Without cross-workstream matching, the IT buyer never sees the Legal answer.

### 6.2 Opt-In Per Workstream Pair

Configured in each workstream's `match_config.cross_workstream`:

```json
// IT workstream
{
  "match_config": {
    "cross_workstream": ["Legal"] // IT requests search Legal answers
  }
}

// Legal workstream
{
  "match_config": {
    "cross_workstream": ["IT", "Finance"] // Legal requests search IT and Finance
  }
}
```

**Relationship is directional:** IT searching Legal doesn't imply Legal searches IT.

### 6.3 RBAC Interaction

Cross-workstream matching only returns answers the requester can access:

```go
func GetMatchableAnswers(ctx context.Context, actorID, requestWorkstreamID string) ([]Answer, error) {
	// 1. Get workstream config
	ws, err := getWorkstream(requestWorkstreamID)
	if err != nil {
		return nil, err
	}

	// 2. Build workstream list (self + cross)
	workstreams := []string{ws.ID}
	workstreams = append(workstreams, ws.MatchConfig.CrossWorkstream...)

	// 3. Filter by access (RBAC)
	var accessible []string
	for _, wsID := range workstreams {
		if hasAccess(actorID, wsID, "read") {
			accessible = append(accessible, wsID)
		}
	}

	// 4. Get published answers from accessible workstreams
	return getPublishedAnswers(accessible)
}
```

---

## 7. Request Deduplication (Auto-Suggest Existing Answers)

### 7.1 The Problem

Buyer B asks the same question Buyer A already got answered. Without dedup:
- Seller does duplicate work
- IB reviews duplicate request
- Buyer B waits when answer already exists

### 7.2 Dedup Flow

```
Buyer B submits request
  → Embed request
  → Search published answers (same logic as section 3)
  → If match score ≥ threshold:
      → Show Buyer B: "Similar answer already exists — view it?"
      → If Buyer B accepts: link request to existing answer, mark resolved
      → If Buyer B declines: proceed with normal request flow
```

### 7.3 UX Considerations

**Don't block submission.** Show suggestion after submit, not as a gate:

```
┌────────────────────────────────────────────────────────────┐
│ Your request has been submitted.                           │
│                                                            │
│ 💡 We found a similar published answer that may help:      │
│                                                            │
│ "FY2024 Audited Financial Statements"                      │
│ Published: 2026-02-15 | Similarity: 89%                    │
│                                                            │
│ [View Answer] [This doesn't answer my question]            │
└────────────────────────────────────────────────────────────┘
```

### 7.4 Data Model

When buyer accepts the suggestion:

```sql
INSERT INTO answer_links (answer_id, request_id, linked_by, linked_at, status, ai_score)
VALUES (?, ?, ?, ?, 'self_confirmed', ?);
-- status = 'self_confirmed' means buyer accepted the AI suggestion
-- no IB review required
```

When buyer declines:

```sql
INSERT INTO answer_links (answer_id, request_id, linked_by, linked_at, status, ai_score)
VALUES (?, ?, ?, ?, 'rejected_by_requester', ?);
-- Prevents suggesting this answer again for this request
-- Request proceeds to normal IB/Seller workflow
```

---

## 8. Race Condition Fixes (DB Transactions)

### 8.1 The Problem (from SPEC-REVIEW §3)

Without transactions:
1. IB confirms match R1↔A1
2. Concurrent: IB publishes A1
3. Broadcast fires during confirm
4. Confirm completes, tries to broadcast again
5. Duplicate notifications or worse — inconsistent state

### 8.2 Transaction Boundaries

**Atomic operation 1: Confirm Match**
```go
func ConfirmMatch(ctx context.Context, answerID, requestID, actorID string) error {
	return db.Transaction(func(tx *sql.Tx) error {
		// 1. Verify answer exists and is published.
		// A lookup failure is returned as-is so it is not mistaken for "not published".
		answer, err := getAnswer(tx, answerID)
		if err != nil {
			return err
		}
		if answer.Status != "published" {
			return ErrAnswerNotPublished
		}

		// 2. Update answer_link status
		err = updateAnswerLink(tx, answerID, requestID, "confirmed", actorID)
		if err != nil {
			return err
		}

		// 3. Broadcast (idempotent)
		err = BroadcastAnswer(ctx, tx, answerID)
		if err != nil {
			return err
		}

		return nil
	})
}
```

**Atomic operation 2: Publish Answer**
```go
func PublishAnswer(ctx context.Context, answerID, actorID string) error {
	return db.Transaction(func(tx *sql.Tx) error {
		// 1. Update answer status
		err := updateAnswerStatus(tx, answerID, "published", actorID)
		if err != nil {
			return err
		}

		// 2. Update entry stage to dataroom
		err = updateEntryStage(tx, answerID, "dataroom")
		if err != nil {
			return err
		}

		// 3. Run retroactive matching (creates pending answer_links)
		matches, err := MatchAnswerToRequests(ctx, answerID)
		if err != nil {
			return err
		}

		for _, m := range matches {
			err = insertAnswerLink(tx, answerID, m.RequestID, "pending", m.Score)
			if err != nil {
				return err
			}
		}

		// 4. Broadcast already-confirmed links (if any pre-existed)
		err = BroadcastAnswer(ctx, tx, answerID)
		if err != nil {
			return err
		}

		return nil
	})
}
```

### 8.3 Optimistic Locking

Add a `version` column to detect concurrent modifications. With this change, `updateAnswerStatus` takes the version the caller last read and returns the new one, so the call in section 8.2 (which passes `actorID` and expects a single return value) has to be adapted accordingly:

```sql
ALTER TABLE entries ADD COLUMN version INTEGER NOT NULL DEFAULT 1;
```

```go
func updateAnswerStatus(tx *sql.Tx, answerID, status string, expectedVersion int) (int, error) {
	result, err := tx.Exec(`
		UPDATE entries
		SET data = json_set(data, '$.status', ?),
		    version = version + 1,
		    updated_at = ?
		WHERE entry_id = ? AND version = ?
	`, status, time.Now().UnixMilli(), answerID, expectedVersion)

	if err != nil {
		return 0, err
	}

	rows, err := result.RowsAffected()
	if err != nil {
		return 0, err
	}
	if rows == 0 {
		// Either the entry does not exist or another writer bumped the version first
		return 0, ErrConcurrentModification
	}

	return expectedVersion + 1, nil
}
```

---

## 9. SQLite Cosine Similarity & Qdrant Migration

### 9.1 Pure Go Cosine Similarity

SQLite doesn't have native vector operations. Compute in Go:

```go
// CosineSimilarity computes similarity between two vectors.
// Vectors must be same length. Returns value in [-1, 1].
func CosineSimilarity(a, b []float32) float32 {
	if len(a) != len(b) {
		panic("vector length mismatch")
	}

	var dotProduct, normA, normB float32
	for i := range a {
		dotProduct += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}

	if normA == 0 || normB == 0 {
		return 0
	}

	return dotProduct / (float32(math.Sqrt(float64(normA))) * float32(math.Sqrt(float64(normB))))
}

// BatchCosineSimilarity computes query vs all candidates.
// Plain scalar loops: the Go compiler does not auto-vectorize them,
// so any SIMD speedup would need assembly or a dedicated library.
func BatchCosineSimilarity(query []float32, candidates [][]float32) []float32 {
	scores := make([]float32, len(candidates))

	// Pre-compute query norm
	var queryNorm float32
	for _, v := range query {
		queryNorm += v * v
	}
	queryNorm = float32(math.Sqrt(float64(queryNorm)))

	for i, candidate := range candidates {
		if len(candidate) != len(query) {
			panic("vector length mismatch")
		}

		var dot, candNorm float32
		for j := range query {
			dot += query[j] * candidate[j]
			candNorm += candidate[j] * candidate[j]
		}
		candNorm = float32(math.Sqrt(float64(candNorm)))

		if queryNorm == 0 || candNorm == 0 {
			scores[i] = 0
		} else {
			scores[i] = dot / (queryNorm * candNorm)
		}
	}

	return scores
}
```

### 9.2 Performance Characteristics (SQLite + Go)

| Embeddings | Load Time | Search Time | Memory |
|------------|-----------|-------------|--------|
| 1,000 | 50ms | 2ms | 3 MB |
| 10,000 | 500ms | 20ms | 30 MB |
| 100,000 | 5s | 200ms | 300 MB |
| 1,000,000 | 50s | 2s | 3 GB |

**Acceptable for MVP:** Most deals have <10,000 documents. Search under 100ms is fine.
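
The Memory column follows from the vector size alone: each embedding is 768 `float32` values at 4 bytes each, so 3,072 bytes per row before any slice or row overhead. A small helper makes the arithmetic explicit; `EstimateVectorBytes` is a hypothetical name used only for this illustration.

```go
package main

// EstimateVectorBytes returns the raw storage needed for n float32
// vectors of the given dimension, ignoring slice headers and any
// per-row overhead in SQLite or in Go.
func EstimateVectorBytes(n, dim int) int64 {
	const bytesPerFloat32 = 4
	return int64(n) * int64(dim) * bytesPerFloat32
}
```

For example, `EstimateVectorBytes(100000, 768)` is 307,200,000 bytes, which matches the roughly 300 MB shown for 100,000 embeddings.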

### 9.3 Qdrant Migration Threshold

Migrate to Qdrant when:
1. **Embedding count > 100,000** — search latency exceeds 200ms
2. **Memory pressure** — embeddings consume >500MB RAM
3. **Multi-tenancy** — need isolated collections per client (compliance)

### 9.4 Qdrant Integration (Future)

```go
type VectorStore interface {
	Upsert(id string, vector []float32, metadata map[string]any) error
	Search(query []float32, filter map[string]any, limit int) ([]SearchResult, error)
	Delete(id string) error
}

// SQLiteVectorStore implements VectorStore using embeddings table
type SQLiteVectorStore struct { ... }

// QdrantVectorStore implements VectorStore using Qdrant API
type QdrantVectorStore struct { ... }
```

Abstract behind the interface now; the implementation can be swapped later without changing the calling code.

### 9.5 Hybrid Mode (Transition)

During migration:
1. Write to both SQLite and Qdrant
2. Read from Qdrant (with SQLite fallback)
3. Validate results match for first 1000 queries
4. Drop SQLite embeddings table after validation

---

## 10. Privacy: Plaintext Only, Never Files

### 10.1 Embedding Content Policy

**ALLOWED to embed:**
- Request title
- Request body text
- Answer title
- Answer body text (the explanation, not file contents)
- Workstream name (context)

**NEVER embed:**
- File contents (PDF, DOCX, XLSX, images)
- File names (may contain deal names, party names)
- Internal comments
- Routing/assignment metadata
- User names or email addresses

### 10.2 Why No File Embedding?

1. **Privacy:** M&A documents contain material non-public information. Sending to ANY external API (even zero-retention) creates compliance risk.

2. **Size:** A single PDF may be 100+ pages. Embedding would require chunking, storage, and search across potentially millions of chunks. Overkill for request-matching.

3. **Semantic mismatch:** Request asks "audited financials for FY2024." The answer body says "Please find attached the FY2024 audited financial statements." The body text + title is sufficient for matching — we don't need to embed page 47 of the PDF.

### 10.3 Future: On-Premise OCR + Embedding

If file-level search becomes required:
1. Run OCR on-premise (GLM-OCR on forge, not external API)
2. Store extracted text in `entry.data` (encrypted at rest)
3. Embed extracted text (still goes to Fireworks, but it's our extracted text, not raw file)

This is out of scope for MVP.

### 10.4 Audit Trail

Log every embedding request for compliance:

```sql
CREATE TABLE embed_audit (
    id           TEXT PRIMARY KEY,
    entry_id     TEXT NOT NULL,
    text_hash    TEXT NOT NULL,     -- SHA-256 of text sent
    text_len     INTEGER NOT NULL,  -- character count
    model        TEXT NOT NULL,
    requested_at INTEGER NOT NULL,
    latency_ms   INTEGER,
    success      INTEGER NOT NULL
);
```

**Do NOT log the actual text** — that defeats the privacy purpose. Log the hash for correlation if needed.

---

## 11. lib/embed.go — Function Signatures

### 11.1 Public API

```go
package lib

import (
	"context"
	"database/sql"
	"time"
)

// EmbedConfig holds embedding service configuration.
type EmbedConfig struct {
	APIKey     string // FIREWORKS_API_KEY
	Endpoint   string // defaults to Fireworks endpoint
	Model      string // defaults to nomic-embed-text-v1.5
	Timeout    time.Duration
	MaxRetries int
}

// EmbedResult contains the embedding and metadata.
type EmbedResult struct {
	Vector     []float32
	TextHash   string // SHA-256 of input text
	Model      string
	TokenCount int
	LatencyMs  int64
}

// MatchResult represents a potential match with score.
type MatchResult struct {
	EntryID   string
	ChunkIdx  int
	Score     float32
	EntryType string // "request" | "answer"
}

// Embed generates an embedding for the given text.
// Returns ErrTextTooLong if text exceeds model context.
// Returns ErrEmptyText if text is empty or whitespace only.
func Embed(ctx context.Context, cfg *EmbedConfig, text string) (*EmbedResult, error)

// EmbedBatch generates embeddings for multiple texts.
// More efficient than calling Embed in a loop (single API call).
// Max 100 texts per batch.
func EmbedBatch(ctx context.Context, cfg *EmbedConfig, texts []string) ([]*EmbedResult, error)

// EmbedRequest creates and stores embedding for a request entry.
// Idempotent: skips if embedding exists and text_hash matches.
func EmbedRequest(ctx context.Context, db *sql.DB, cfg *EmbedConfig, requestID string) error

// EmbedAnswer creates and stores embedding(s) for an answer entry.
// May produce multiple chunks for long answers.
// Idempotent: skips chunks where text_hash matches.
func EmbedAnswer(ctx context.Context, db *sql.DB, cfg *EmbedConfig, answerID string) error

// MatchRequestToAnswers finds published answers matching the request.
// Returns matches above the workstream's configured threshold.
// Respects cross-workstream config and RBAC.
func MatchRequestToAnswers(ctx context.Context, db *sql.DB, actorID, requestID string) ([]MatchResult, error)

// MatchAnswerToRequests finds open requests matching the answer.
// Returns matches above the workstream's configured threshold.
// Respects cross-workstream config and RBAC.
func MatchAnswerToRequests(ctx context.Context, db *sql.DB, actorID, answerID string) ([]MatchResult, error)

// FindDuplicateRequests finds existing requests similar to the given text.
// Used for deduplication suggestions before/after submission.
func FindDuplicateRequests(ctx context.Context, db *sql.DB, actorID, workstreamID, text string) ([]MatchResult, error)

// CosineSimilarity computes similarity between two vectors.
func CosineSimilarity(a, b []float32) float32

// DeleteEmbeddings removes all embeddings for an entry.
// Called when entry is deleted.
func DeleteEmbeddings(ctx context.Context, db *sql.DB, entryID string) error

// RefreshEmbedding re-embeds an entry if content changed.
// Compares text_hash to detect changes.
// Returns true if embedding was updated.
func RefreshEmbedding(ctx context.Context, db *sql.DB, cfg *EmbedConfig, entryID string) (bool, error)
```

### 11.2 Async Embedding on Publish

Embedding should not block the user action. Use async processing:

```go
// EmbedQueue is a background worker that processes embedding requests.
type EmbedQueue struct {
	cfg   *EmbedConfig
	db    *sql.DB
	queue chan embedJob
	wg    sync.WaitGroup
}

type embedJob struct {
	EntryID   string
	EntryType string // "request" | "answer"
	Priority  int    // 0 = normal, 1 = high (new request needs matching)
}

// Start begins processing the embedding queue.
// Workers defaults to 2 (Fireworks rate limit friendly).
func (q *EmbedQueue) Start(workers int)

// Stop gracefully shuts down the queue.
func (q *EmbedQueue) Stop()

// Enqueue adds an entry for embedding.
// Non-blocking; returns immediately.
func (q *EmbedQueue) Enqueue(entryID, entryType string, priority int)
```

### 11.3 Integration Points

**On Request Create:**
```go
func HandleCreateRequest(w http.ResponseWriter, r *http.Request) {
	// ... validation, RBAC, insert entry ...

	// Queue embedding (non-blocking)
	embedQueue.Enqueue(request.ID, "request", 1) // high priority

	// Return success immediately
	respondJSON(w, request)
}
```

**On Answer Publish:**
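```go
func HandlePublishAnswer(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	// ... validation, RBAC, load answer and actorID ...

	err := db.Transaction(func(tx *sql.Tx) error {
		// ... update status, stage ...

		// Embedding happens inline for matching (within transaction timeout).
		// NOTE: section 11.1 declares EmbedAnswer and MatchAnswerToRequests with
		// *sql.DB; passing tx here assumes variants that accept a transaction.
		if err := EmbedAnswer(ctx, tx, cfg, answer.ID); err != nil {
			// Log but don't fail — matching can happen later
			log.Warn("embedding failed, will retry", "error", err)
		}

		// Match and create answer_links
		matches, err := MatchAnswerToRequests(ctx, tx, actorID, answer.ID)
		if err != nil {
			// Same policy as above: publishing succeeds without suggestions
			log.Warn("matching failed, skipping suggestions", "error", err)
		}
		for _, m := range matches {
			if err := insertAnswerLink(tx, answer.ID, m.EntryID, "pending", m.Score); err != nil {
				return err
			}
		}

		// Broadcast confirmed links
		return BroadcastAnswer(ctx, tx, answer.ID)
	})
	if err != nil {
		// Transaction rolled back: report the failure instead of returning the answer
		http.Error(w, "publish failed", http.StatusInternalServerError)
		return
	}

	respondJSON(w, answer)
}
```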

### 11.4 Error Handling

```go
var (
	ErrTextTooLong      = errors.New("text exceeds model context limit")
	ErrEmptyText        = errors.New("text is empty or whitespace only")
	ErrEmbeddingFailed  = errors.New("embedding API call failed")
	ErrRateLimited      = errors.New("embedding API rate limited")
	ErrNoEmbedding      = errors.New("entry has no embedding")
	ErrWorkstreamConfig = errors.New("workstream missing match configuration")
)
```

Retry policy for transient errors:
- `ErrRateLimited`: exponential backoff, max 3 retries
- `ErrEmbeddingFailed`: retry once after 1s
- All others: fail immediately

---

## 12. answer_links Table (Updated)

Incorporates SPEC-REVIEW feedback on rejection tracking:

```sql
CREATE TABLE answer_links (
    answer_id     TEXT NOT NULL REFERENCES entries(entry_id),
    request_id    TEXT NOT NULL REFERENCES entries(entry_id),

    -- Who created the link
    linked_by     TEXT NOT NULL,
    linked_at     INTEGER NOT NULL,

    -- AI matching metadata
    ai_score      REAL,     -- cosine similarity at time of match
    ai_model      TEXT,     -- model used for matching

    -- Review status
    status        TEXT NOT NULL DEFAULT 'pending',
    -- 'pending': AI suggested, awaiting human review
    -- 'confirmed': IB confirmed the match
    -- 'rejected': IB rejected the match
    -- 'self_confirmed': requester accepted dedup suggestion
    -- 'rejected_by_requester': requester declined dedup suggestion

    reviewed_by   TEXT,     -- who reviewed (if status != pending)
    reviewed_at   INTEGER,  -- when reviewed
    reject_reason TEXT,     -- why rejected (if status = rejected)

    PRIMARY KEY (answer_id, request_id)
);

CREATE INDEX idx_links_answer ON answer_links(answer_id);
CREATE INDEX idx_links_request ON answer_links(request_id);
CREATE INDEX idx_links_status ON answer_links(status);
```

---

## 13. Summary: What Gets Built

| Component | Type | Purpose |
|-----------|------|---------|
| `lib/embed.go` | Core embedding logic | API calls, similarity, storage |
| `embeddings` table | Schema | Vector storage |
| `broadcasts` table | Schema | Idempotency |
| `answer_links` | Schema update | Status + rejection tracking |
| `embed_audit` table | Schema | Compliance logging |
| `EmbedQueue` | Background worker | Async processing |
| Workstream config | Entry.Data | Per-workstream thresholds |

**Not built (future):**
- Qdrant integration (interface defined, impl deferred)
- File content embedding (privacy: out of scope)
- Auto-confirm (threshold defined, feature disabled for MVP)

---

*This document extends SPEC.md. If conflicts exist, discuss before implementing.*