Files
portfolio/Ralph/progress.txt
T

83 lines
5.7 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Progress Log — Semantic Search & AI Chat
# Branch: ralph/semantic-search
# Started: 2026-02-15
## Codebase Patterns
- `@xenova/transformers` pipeline with `pooling: 'mean'` and `normalize: true` returns a Tensor; use `Array.from(output.data as Float32Array)` to extract the 384-d vector
- Scripts live in `scripts/` and run via `npx tsx` (tsx is not a project dep, npx fetches it)
- tsconfig `include` only covers `src/` — scripts are type-checked by tsx at runtime, not by `tsc --noEmit`
- Project uses `"type": "module"` in package.json
- Palette item IDs: `exp-{consultation.id}`, `skill-{skill.id}`, `proj-{investigation.id}`, `ach-{0-3}`, `edu-{0-3}`, `action-{0-3}`
- `buildEmbeddingTexts()` in `src/lib/search.ts` returns `Array<{ id: string, text: string }>` with IDs matching PaletteItem IDs — use this for both embedding generation and chat context
- `src/data/embeddings.json` is an array of `{ id: string, embedding: number[] }` — 42 items, 384-d vectors, IDs match PaletteItem IDs. Vite imports JSON natively.
- `src/lib/embedding-model.ts` exports `initModel()`, `embedQuery(text)`, `isModelReady()` — check `isModelReady()` before calling `embedQuery()`
- `initModel()` is called fire-and-forget in `App.tsx` on mount — model loads during boot/ECG/login phases
---
## 2026-02-15 - US-001
- Installed `@xenova/transformers` (^2.17.2)
- Created `scripts/generate-embeddings.ts` with main() that loads `Xenova/all-MiniLM-L6-v2` and embeds a test string
- Added `"generate-embeddings"` npm script
- Verified: outputs vector length 384 and exits cleanly
- Typecheck passes
- Files changed: `package.json`, `package-lock.json`, `scripts/generate-embeddings.ts`
- **Learnings for future iterations:**
- `pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2')` auto-downloads and caches the ONNX model (~23MB)
- First run takes a few seconds for model download; subsequent runs are near-instant from cache
- The pipeline's `pooling: 'mean'` and `normalize: true` options handle mean-pooling and L2 normalization in one step — no manual tensor manipulation needed
- `output.data` is a `Float32Array`; wrap in `Array.from()` for a plain number array
---
## 2026-02-15 - US-002
- Added `buildEmbeddingTexts()` function to `src/lib/search.ts`
- Imports all raw data files (consultations, skills, kpis, investigations, documents)
- Generates natural-language paragraphs for each palette item type:
- Consultations: role, org, duration, history narrative, examination bullets, coded entry descriptions
- Skills: name, category, frequency, proficiency %, years of experience
- Achievements: title, subtitle, full KPI explanation + story context + outcomes
- Investigations: name, methodology, tech stack, results
- Education: title, type, institution, duration, classification, research detail, notes (from documents.ts)
- Quick Actions: title + subtitle
- IDs match PaletteItem IDs (e.g. `exp-{id}`, `skill-{id}`, `ach-{i}`, `proj-{id}`, `edu-{i}`, `action-{i}`)
- Typecheck and lint pass
- Files changed: `src/lib/search.ts`
- **Learnings for future iterations:**
- Education items in `buildPaletteData()` are hardcoded arrays (not iterated from `documents`), with ids `edu-0` through `edu-3`. The mapping to `documents.ts` entries is: edu-0→doc-mary-seacole, edu-1→doc-mpharm, edu-2→doc-alevels, edu-3→doc-gphc
- Achievement items are similarly hardcoded with ids `ach-0` through `ach-3`, each linked to a KPI id
- Quick action items are `action-0` through `action-3`
- `documents.ts` is imported but wasn't previously used in `search.ts` — now used for education embedding text
---
## 2026-02-15 - US-003
- Updated `scripts/generate-embeddings.ts` to import `buildEmbeddingTexts()` and generate full embeddings
- Script embeds all 42 palette items sequentially using `Xenova/all-MiniLM-L6-v2`
- Outputs `src/data/embeddings.json` as `Array<{ id: string, embedding: number[] }>`
- Each embedding is a 384-dimensional float array
- File is ~453KB (42 items × 384 floats with pretty-printed JSON)
- `npm run generate-embeddings` regenerates the file successfully
- Typecheck and lint pass
- Files changed: `scripts/generate-embeddings.ts`, `src/data/embeddings.json`
- **Learnings for future iterations:**
- `import.meta.dirname` works in tsx/Node ESM scripts — use it instead of `__dirname` (which isn't available in ESM)
- `@/` path alias works in `npx tsx` scripts because tsx resolves tsconfig paths automatically
- The embeddings file is ~450KB with pretty-print; could be reduced with compact JSON but readability is preferred for now
- Processing 42 items takes ~10-15 seconds on first run (model cached after first download)
---
## 2026-02-15 - US-004
- Created `src/lib/embedding-model.ts` with three exports: `initModel()`, `embedQuery()`, `isModelReady()`
- Module-level `let extractor` pattern avoids React re-render issues
- `initModel()` uses `loading` guard to prevent duplicate pipeline loads
- `embedQuery()` uses same `pooling: 'mean'` and `normalize: true` as the build script
- `initModel()` called fire-and-forget in `App.tsx` `useEffect([], [])` — runs during boot phase
- Silent failure: try/catch swallows errors, `isModelReady()` stays false
- Typecheck, lint, and build all pass
- Files changed: `src/lib/embedding-model.ts` (new), `src/App.tsx`
- **Learnings for future iterations:**
- `FeatureExtractionPipeline` type is exported from `@xenova/transformers` and can be used for the module-level variable
- The `loading` boolean guard prevents race conditions if `initModel()` is called multiple times (e.g., React strict mode double-mount)
- `initModel()` is intentionally not awaited — it's fire-and-forget so it doesn't block the boot animation
- Consumers should check `isModelReady()` before calling `embedQuery()` — it throws if model isn't loaded
---