# import-genome

Fast genetic data importer using `lib.Save()` for direct database access.

## Performance

~1.5 seconds in total to:
- Read an 18 MB file
- Parse 674,160 variants
- Sort by rsid
- Match against 9,403 SNPedia rsids
- Insert 5,382 entries (1 parent plus 5,381 matched variants) via `lib.Save()`

## Installation

```bash
cd ~/dev/inou
make import-genome
```

## Usage

```bash
import-genome <plain-file> <dossier-id>

# Help
import-genome --help
```

## Supported Formats

| Format      | Delimiter    | Columns | Alleles  |
|-------------|--------------|---------|----------|
| AncestryDNA | Tab          | 5       | Split    |
| 23andMe     | Tab          | 4       | Combined |
| MyHeritage  | CSV (quoted) | 4       | Combined |
| FTDNA       | CSV          | 4       | Combined |

The format is auto-detected from the first data line. "Split" means the two alleles are in separate columns; "Combined" means a single genotype column.
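
As a rough illustration of how detection from a single data line could work, the sketch below classifies a line using only the delimiter and column count listed in the table. `detectFormat` and every label other than `ancestry` are assumptions made for illustration, as are the column layouts named in the comments; the actual detection logic in import-genome may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// detectFormat guesses the vendor format from one data line, using only the
// delimiter and column count from the table. Hypothetical helper: the labels
// other than "ancestry" and the column layouts below are assumptions.
func detectFormat(line string) string {
	line = strings.TrimRight(line, "\r\n")
	switch {
	case strings.Contains(line, "\t"):
		switch len(strings.Split(line, "\t")) {
		case 5:
			return "ancestry" // assumed: rsid, chromosome, position, allele1, allele2
		case 4:
			return "23andme" // assumed: rsid, chromosome, position, genotype
		}
	case strings.Contains(line, ","):
		if len(strings.Split(line, ",")) != 4 {
			return "unknown"
		}
		if strings.HasPrefix(line, `"`) {
			return "myheritage" // quoted CSV
		}
		return "ftdna" // plain CSV
	}
	return "unknown"
}

func main() {
	// Made-up sample lines; positions are placeholders.
	fmt.Println(detectFormat("rs1801133\t1\t12345\tT\tT"))
	fmt.Println(detectFormat(`"rs1801133","1","12345","TT"`))
}
```

Keying on the delimiter first and the column count second keeps the check cheap, but it cannot tell apart two vendors that share both properties, so a real implementation may also need to inspect header comments.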

## Data Model

The importer creates one parent entry with one child entry per matched variant:

```
Parent (genome/extraction):
  id: 3b38234f2b0f7ee6
  data: {"source": "ancestry", "variants": 5381}

Children (genome/variant):
  parent_id: 3b38234f2b0f7ee6
  type: rs1801133 (rsid)
  value: TT (genotype)
```
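
The hierarchy above could be assembled in Go roughly as follows. `Entry` is a hypothetical stand-in for `lib.Entry`, and `Variant` and `buildEntries` are illustrative names; the real field names and the way IDs are generated are not shown in this README, so treat this as a sketch under those assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry is a hypothetical stand-in for lib.Entry; real field names may differ.
type Entry struct {
	ID       string
	ParentID string
	Category string // "genome/extraction" or "genome/variant"
	Type     string // rsid, for variant entries
	Value    string // genotype, for variant entries
	Data     string // JSON metadata, for the parent entry
}

// Variant is one matched rsid/genotype pair (illustrative type).
type Variant struct {
	RSID     string
	Genotype string
}

// buildEntries returns the parent entry followed by one child per variant.
func buildEntries(parentID, source string, matched []Variant) ([]Entry, error) {
	meta, err := json.Marshal(map[string]interface{}{
		"source":   source,
		"variants": len(matched),
	})
	if err != nil {
		return nil, err
	}
	entries := make([]Entry, 0, len(matched)+1)
	entries = append(entries, Entry{
		ID:       parentID,
		Category: "genome/extraction",
		Data:     string(meta),
	})
	for _, v := range matched {
		entries = append(entries, Entry{
			ParentID: parentID,
			Category: "genome/variant",
			Type:     v.RSID,
			Value:    v.Genotype,
		})
	}
	return entries, nil
}

func main() {
	entries, err := buildEntries("3b38234f2b0f7ee6", "ancestry",
		[]Variant{{"rs1801133", "TT"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(entries), entries[0].Data)
}
```

The slice always holds one more element than the number of matched variants, which is why 5,381 matches produce 5,382 entries.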

## Databases

- **SNPedia reference**: `~/dev/inou/snpedia-genotypes/genotypes.db` (read-only, direct SQL)
- **Entries**: via `lib.Save()` to `/tank/inou/data/inou.db` (single transaction)

## Algorithm

1. Read the plain-text genome file
2. Auto-detect the format from the first data line
3. Parse all variants (rsid + genotype)
4. Sort by rsid
5. Load the SNPedia rsid set into memory
6. Match user variants against SNPedia (O(1) lookup per variant)
7. Delete existing genome entries for the dossier
8. Build the `[]lib.Entry` slice
9. Call `lib.Save()`: a single transaction with prepared statements
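
Steps 4 to 6 can be sketched as a sort followed by a set lookup. This is a minimal illustration, not the tool's actual code: `Variant` and `matchVariants` are hypothetical names, the SNPedia set is passed in as a plain map rather than loaded from the database, and the sort is lexicographic on the rsid string, which may differ from the ordering the importer uses.

```go
package main

import (
	"fmt"
	"sort"
)

// Variant is one parsed rsid/genotype pair (illustrative type).
type Variant struct {
	RSID     string
	Genotype string
}

// matchVariants sorts the variants by rsid in place, then keeps only those
// whose rsid is present in the known set. Each map lookup is O(1) on average.
func matchVariants(variants []Variant, known map[string]struct{}) []Variant {
	sort.Slice(variants, func(i, j int) bool {
		return variants[i].RSID < variants[j].RSID
	})
	var matched []Variant
	for _, v := range variants {
		if _, ok := known[v.RSID]; ok {
			matched = append(matched, v)
		}
	}
	return matched
}

func main() {
	// Made-up sample data standing in for the parsed file and SNPedia set.
	variants := []Variant{{"rs2", "AG"}, {"rs1801133", "TT"}, {"rs3", "CC"}}
	known := map[string]struct{}{"rs1801133": {}, "rs3": {}}
	for _, v := range matchVariants(variants, known) {
		fmt.Println(v.RSID, v.Genotype)
	}
}
```

Using `map[string]struct{}` as the set keeps memory low, since the empty struct takes no space per value.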

## Example

```bash
./bin/import-genome /path/to/ancestry.txt 3b38234f2b0f7ee6

# Output:
# Phase 1 - Read: 24ms (18320431 bytes)
# Detected format: ancestry
# Phase 2 - Parse: 162ms (674160 variants)
# Phase 3 - Sort: 306ms
# Phase 4 - Load SNPedia: 47ms (9403 rsids)
# Phase 5 - Match & normalize: 40ms (5381 matched)
# Phase 6 - Init & delete existing: 15ms
# Phase 7 - Build entries: 8ms (5382 entries)
# Phase 8 - lib.Save: 850ms (5382 entries saved)
#
# TOTAL: 1.5s
# Parent ID: c286564f3195445a
```