merge

2026-02-15 23:20:24 +00:00
parent 4580ca9c84
commit 0fbbf9e46f
10 changed files with 1576 additions and 2 deletions
@@ -0,0 +1,125 @@
+# PRD: Chat Widget Polish & Model Updates
+
+## Introduction
+
+The semantic search and AI chat features are functionally complete (US-001 through US-010). This PRD covers four polish items: mobile full-screen chat experience, a welcome message with suggested questions, self-hosting the ONNX embedding model, and updating from Gemini 2.0 Flash to Gemini 3 Flash Preview.
+
+## Goals
+
+- Full-screen chat on mobile (<768px) for a better small-screen experience
+- Welcome message with suggested question chips to reduce blank-state friction
+- Self-host the ONNX model (`all-MiniLM-L6-v2`) to eliminate dependency on Hugging Face CDN
+- Update Gemini model to `gemini-3-flash-preview` and show which model powers the chat
+- Refresh system prompt while updating the model
+
+## User Stories
+
+### US-011: Mobile full-screen chat panel
+**Description:** As a mobile visitor, I want the chat panel to be a full-screen overlay so it's easy to use on small screens.
+
+**Acceptance Criteria:**
+- [ ] Below the `md` breakpoint (768px), the chat panel renders as a full-screen overlay (100vw × 100vh, or `100dvh` so collapsing mobile browser toolbars do not crop the panel)
+- [ ] Full-screen mode has a visible header with close button
+- [ ] Floating chat button is hidden while panel is open on mobile
+- [ ] Above 768px, existing panel behavior unchanged (380px wide, anchored bottom-right)
+- [ ] Smooth transition between open/closed states respects `prefers-reduced-motion`
+- [ ] Typecheck passes
+- [ ] Verify in browser using dev-browser skill
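The breakpoint rules above could be captured in one pure function so the panel and the floating button never disagree. This is a minimal sketch, not the component's actual code: `resolveChatLayout`, `ChatLayout`, and `MOBILE_BREAKPOINT_PX` are hypothetical names.

```typescript
// Hypothetical helper: names are illustrative, not taken from the codebase.
const MOBILE_BREAKPOINT_PX = 768; // the `md` breakpoint cited in the criteria

interface ChatLayout {
  fullScreen: boolean;         // panel covers the whole viewport
  hideFloatingButton: boolean; // floating chat button is not rendered
}

function resolveChatLayout(viewportWidth: number, panelOpen: boolean): ChatLayout {
  const isMobile = viewportWidth < MOBILE_BREAKPOINT_PX;
  const fullScreen = isMobile && panelOpen;
  // The button is hidden only while the full-screen panel is open on mobile;
  // at 768px and above the existing behaviour is left untouched.
  return { fullScreen, hideFloatingButton: fullScreen };
}
```

A component could call this with the current viewport width and open state, then pick its class names from the result.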
+
+### US-012: Welcome message with suggested questions
+**Description:** As a visitor opening the chat for the first time, I want to see a friendly welcome and clickable suggested questions so I know what to ask.
+
+**Acceptance Criteria:**
+- [ ] When chat panel opens and conversation is empty, display welcome message: "Hey! I'm here to help you learn more about Andy. What would you like to know?"
+- [ ] Below the welcome message, show 2-3 clickable pill/chip buttons with suggested questions (e.g., "What's his NHS experience?", "Tell me about his data skills", "What projects has he built?")
+- [ ] Clicking a suggested question sends it as a user message (same as typing and pressing Enter)
+- [ ] Welcome message and chips are always visible when conversation is empty (persist across open/close if no messages sent)
+- [ ] Once a message is sent, the welcome/chips area is replaced by the conversation
+- [ ] Chips use design system tokens (teal accent border, hover state)
+- [ ] Typecheck passes
+- [ ] Verify in browser using dev-browser skill
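The empty-state rules above reduce to two small pieces of logic: when to show the welcome area, and what a chip click does. The sketch below assumes a message list and a `sendMessage` callback; `ChatMessage`, `shouldShowWelcome`, and `handleChipClick` are hypothetical names, while the question strings come from the criteria.

```typescript
// Hypothetical message shape; the real widget may differ.
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Example chips taken from the acceptance criteria above.
const SUGGESTED_QUESTIONS: string[] = [
  "What's his NHS experience?",
  'Tell me about his data skills',
  'What projects has he built?',
];

// Welcome and chips are shown whenever no message has been sent yet,
// which also covers closing and reopening the panel.
function shouldShowWelcome(messages: ChatMessage[]): boolean {
  return messages.length === 0;
}

// A chip click goes through the same path as typing and pressing Enter.
function handleChipClick(question: string, sendMessage: (text: string) => void): void {
  sendMessage(question);
}
```

Because the welcome state is derived from the message list rather than stored separately, it disappears as soon as the first message is appended.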
+
+### US-013: Self-host ONNX embedding model
+**Description:** As a developer, I want the ONNX model files served from the same host as the site, so there's no runtime dependency on Hugging Face CDN.
+
+**Acceptance Criteria:**
+- [ ] Model files for `all-MiniLM-L6-v2` downloaded and placed in `public/models/all-MiniLM-L6-v2/` (or `public/models/onnx/` — whichever is cleaner)
+- [ ] Files include at minimum: `onnx/model_quantized.onnx`, `tokenizer.json`, `tokenizer_config.json`, `config.json`
+- [ ] `src/lib/embedding-model.ts` updated to load from local path instead of Hugging Face CDN
+- [ ] Build-time embedding script (`scripts/generate-embeddings.ts`) also uses local model path
+- [ ] `.gitignore` does NOT ignore the model files — they are committed as static assets
+- [ ] Verify model loads correctly in browser (semantic search still works in command palette)
+- [ ] Typecheck passes
+
+### US-014: Update to Gemini 3 Flash Preview + model indicator
+**Description:** As a developer, I want to use the latest free Gemini model, and as a visitor, I want to see what model powers the chat.
+
+**Acceptance Criteria:**
+- [ ] `GEMINI_API_BASE` in `src/lib/gemini.ts` updated from `gemini-2.0-flash` to `gemini-3-flash-preview`
+- [ ] Review and update the system prompt for clarity (ensure it's well-structured for the new model)
+- [ ] Review and update the response format instructions (the `[ITEMS: ...]` suffix pattern)
+- [ ] Small text indicator in chat panel header or footer showing the model name (e.g., "Gemini 3 Flash" in `font-geist`, 11px, tertiary color)
+- [ ] If the model string needs to change in future, it should be a single constant — not hardcoded in multiple places
+- [ ] Typecheck passes
+- [ ] Verify in browser using dev-browser skill
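One way to satisfy the single-constant criterion is to derive both the endpoint and the UI label from constants declared side by side. This is a sketch under assumptions: only `GEMINI_API_BASE` is named in this PRD, so `GEMINI_MODEL`, `GEMINI_API_VERSION`, `GEMINI_DISPLAY_NAME`, and `geminiStreamUrl` are hypothetical, and the `v1beta` path is the open question listed below, not a confirmed value.

```typescript
// Hypothetical constants: changing the model means editing this block only.
const GEMINI_MODEL: string = 'gemini-3-flash-preview';
const GEMINI_API_VERSION: string = 'v1beta'; // assumption; verify against the API docs
const GEMINI_DISPLAY_NAME: string = 'Gemini 3 Flash';

// Base URL for the model, built from the constants above.
const GEMINI_API_BASE: string =
  'https://generativelanguage.googleapis.com/' + GEMINI_API_VERSION + '/models/' + GEMINI_MODEL;

// Streaming endpoint in SSE mode.
function geminiStreamUrl(): string {
  return GEMINI_API_BASE + ':streamGenerateContent?alt=sse';
}
```

The chat header would then render `GEMINI_DISPLAY_NAME` instead of a literal string.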
+
+## Functional Requirements
+
+- FR-1: Chat panel below 768px uses full-screen overlay layout (`position: fixed; inset: 0`)
+- FR-2: Chat button hidden when full-screen panel is open on mobile
+- FR-3: Welcome message and suggested question chips shown when conversation is empty
+- FR-4: Clicking a suggested question chip triggers the same flow as manually typing and sending
+- FR-5: ONNX model files served from `public/models/` as static assets
+- FR-6: `embedding-model.ts` configures Transformers.js to use local model path
+- FR-7: Gemini API calls use `gemini-3-flash-preview` model
+- FR-8: Chat UI displays model name indicator
+
+## Non-Goals
+
+- No changes to the command palette UI or semantic search ranking logic
+- No persistent chat history across page loads
+- No rate limiting or abuse prevention
+- No changes to the boot/ECG/login flow
+- No model fine-tuning or custom training
+
+## Design Considerations
+
+### Mobile Full-Screen Chat
+- Full viewport with safe area insets (`env(safe-area-inset-*)`) for notched devices
+- Header matches existing panel header style but full-width
+- Input pinned to bottom, messages scroll above
+
+### Welcome Message & Chips
+- Welcome text styled as an AI message bubble (left-aligned, light background)
+- Chips: small rounded pills with teal border, teal text on hover, `font-ui` 12-13px
+- 2-3 chips arranged in a flex-wrap row below the welcome bubble
+- Example questions: "What's his NHS experience?", "Tell me about his data skills", "What projects has he built?"
+
+### Model Indicator
+- Placed in the chat panel header, right-aligned or below the "Ask about Andy" title
+- `font-geist`, 11px, `var(--text-tertiary)` color
+- Format: "Powered by Gemini 3 Flash" or just "Gemini 3 Flash"
+
+## Technical Considerations
+
+### Self-Hosting ONNX Model
+- Transformers.js exposes `env.localModelPath` (used together with `env.allowRemoteModels = false`) to redirect model loading from the HF CDN to a local path
+- The quantized model (`model_quantized.onnx`) is ~23MB — acceptable for a static deploy
+- Files must be served with correct MIME types (`.onnx` as `application/octet-stream`)
+- The build-time script and browser runtime must both point to the same model files
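The redirect described above might look like the following. To keep the sketch self-contained it declares a local `ModelEnv` interface covering only the settings it touches; in the app the argument would be the `env` object exported by Transformers.js. `useLocalModels` and `ModelEnv` are hypothetical names.

```typescript
// Subset of the Transformers.js `env` settings used here, declared locally
// so the sketch compiles without the library installed.
interface ModelEnv {
  localModelPath: string;
  allowLocalModels: boolean;
  allowRemoteModels: boolean;
}

// Points model loading at a local directory and disables the remote fallback,
// so a missing file fails loudly instead of silently hitting the CDN.
function useLocalModels(env: ModelEnv, basePath: string): void {
  env.localModelPath = basePath;
  env.allowLocalModels = true;
  env.allowRemoteModels = false;
}
```

Both `src/lib/embedding-model.ts` and the build-time script could call the same helper, with a browser path such as `/models/` in one case and a filesystem path in the other. The library resolves the model id relative to the base path, so the directory name under `public/models/` should match the id passed to `pipeline()`; this is worth confirming against the library version in use.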
+
+### Gemini Model Update
+- `gemini-3-flash-preview` may have a different API path structure — verify against the Generative Language API docs
+- The streaming SSE format should be identical across Flash models, but verify the response shape
+
+## Success Metrics
+
+- Mobile chat is comfortable to use on a phone-sized viewport (no overflow, no cropping)
+- Suggested questions reduce "blank screen" hesitation — visitors engage faster
+- ONNX model loads successfully from local path (no HF CDN requests in network tab)
+- Chat responses come through on the new Gemini model with correct item references
+
+## Open Questions
+
+- Should the suggested question chips be configurable from a data file, or hardcoded in the component?
+- Does `gemini-3-flash-preview` require a different API version path (`v1beta` vs `v1`)?
@@ -0,0 +1,197 @@
+# PRD: Improve LLM CV Knowledge Accuracy
+
+## Introduction
+
+The portfolio's AI chat gives inaccurate or shallow answers about Andy's work history. The root cause is that the system prompt is built from `buildEmbeddingTexts()` summaries rather than the full CV detail, so questions about specific achievements, methodology, clinical specialties, or cross-role context produce vague or incorrect responses. This PRD defines an iterative improvement process: enrich the LLM's context, measure accuracy against 10 verifiable benchmark questions, and repeat until the pass threshold is met, while ensuring changes are structural (not question-specific hacks).
+
+Additionally, the LLM provider is changing from Gemini to **OpenRouter** using the **z-ai/glm-5** model. This requires migrating the API integration, renaming the module, and updating the benchmark harness to use the new provider.
+
+## Goals
+
+- Answer all 10 benchmark questions with factually correct, detailed, citation-worthy answers (pass threshold: 18/20, with no question scoring 0)
+- Ensure improvements are structural — benefiting all possible queries, not just the 10 benchmarks
+- Maintain the existing architecture (no new APIs beyond OpenRouter, no RAG infrastructure, no backend)
+- Migrate from Gemini to OpenRouter (z-ai/glm-5) for both production chat and benchmark scoring
+- Regenerate embeddings when embedding texts change, keeping search and LLM context in sync
+
+## Benchmark Questions
+
+These 10 questions have verifiable answers from CV_v4.md and the structured data files. Each tests a different knowledge gap.
+
+| # | Question | Expected Answer (summary) | Tests |
+|---|----------|--------------------------|-------|
+| Q1 | "How many years has Andy been employed by the NHS?" | ~3.5 years (May 2022–present). Tesco was private sector. | NHS vs non-NHS employer distinction |
+| Q2 | "What was Andy's involvement with tirzepatide?" | Supported NICE TA1026 commissioning, authored executive paper advocating primary care model, drove GP-led delivery. | Deep role-specific detail |
+| Q3 | "What specific tools and software has Andy built?" | 5 projects: switching algorithm, Blueteq generator, CD monitoring, Sankey tool, PharMetrics. Each with outcomes. | Cross-role aggregation |
+| Q4 | "What were Andy's A-level subjects and grades?" | Maths A*, Chemistry B, Politics C. Highworth Grammar School, 2009–2011. | Specific education detail |
+| Q5 | "Was Andy's Tesco role part of the NHS?" | No. Tesco PLC is private. Community pharmacy, not NHS employment. LPC representative for Norfolk. | Employer classification |
+| Q6 | "How did the patient switching algorithm work?" | Python, real-world GP data, auto-identified patients for alternatives, 3 days vs months manual, 14,000 patients, £2.6M, novel GP payment system. | Methodology depth |
+| Q7 | "What clinical specialties has Andy worked across?" | Rheumatology, ophthalmology (wet AMD, DMO, RVO), dermatology, gastroenterology, neurology, migraine — from high-cost drugs role. | Narrative detail not in bullet summaries |
+| Q8 | "What is Andy's experience with the dm+d?" | Created comprehensive medicines data table integrating all dm+d products with standardised strengths, morphine equivalents, Anticholinergic Burden scoring — single source of truth. | Technical achievement context |
+| Q9 | "What budget does Andy manage and how?" | £220M prescribing budget. Forecasting models, variance analysis, financial reporting to executive team, interactive expenditure dashboard. | Figure + methodology |
+| Q10 | "What leadership training does Andy have?" | Mary Seacole Programme (2018, 78%). Also national induction programme at Tesco, NVQ3 supervision. | Cross-role synthesis |
+
+### Scoring Criteria
+
+Each question scored 0–2:
+- **0 — Incorrect**: Wrong facts, invented detail, or contradicts CV
+- **1 — Partial**: Correct but missing key detail, or vague where specifics are available
+- **2 — Accurate**: Factually correct, appropriately detailed, cites specific achievements/metrics
+
+**Pass threshold**: 18/20 (90%), with no question scoring 0.
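The threshold combines two conditions, which is easy to get wrong because a run can reach 18/20 while still containing a zero (nine 2s and one 0). A minimal sketch of the check, using integer arithmetic for the 90% comparison; `evaluateBenchmark` and `BenchmarkVerdict` are hypothetical names.

```typescript
type Score = 0 | 1 | 2;

interface BenchmarkVerdict {
  total: number;
  max: number;
  passed: boolean;
}

function evaluateBenchmark(scores: Score[]): BenchmarkVerdict {
  const total = scores.reduce<number>((sum, s) => sum + s, 0);
  const max = scores.length * 2;
  const hasZero = scores.some((s) => s === 0);
  // total / max >= 90%, written without floating point: 18/20 for 10 questions.
  const meetsTotal = total * 10 >= max * 9;
  return { total, max, passed: scores.length > 0 && meetsTotal && !hasZero };
}
```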
+
+### Anti-Benchmaxing Rules
+
+- No hardcoded answers or question-specific prompt clauses
+- Every change must be a structural improvement (richer context, better prompt patterns, enriched embeddings)
+- After each iteration, mentally evaluate: "Would this help a question NOT in the benchmark?" — if no, reject the change
+- The system prompt must not reference benchmark questions or their specific phrasings
+
+## User Stories
+
+### US-001: Migrate production chat from Gemini to OpenRouter
+**Description:** As a developer, I need to replace the Gemini API integration with OpenRouter so the chat uses z-ai/glm-5.
+
+**Acceptance Criteria:**
+- [ ] Rename `src/lib/gemini.ts` → `src/lib/llm.ts`
+- [ ] Update all imports across the codebase (`ChatWidget.tsx`, `search.ts`, etc.)
+- [ ] Replace Gemini API calls with OpenRouter's OpenAI-compatible API (`https://openrouter.ai/api/v1/chat/completions`)
+- [ ] Model set to `z-ai/glm-5`
+- [ ] API key read from `VITE_OPEN_ROUTER_API_KEY` env var
+- [ ] SSE streaming still works (OpenRouter supports `stream: true`)
+- [ ] System prompt and message format adapted to OpenAI chat completions format (`messages` array with `role`/`content`)
+- [ ] Export updated display name constant (e.g., `LLM_DISPLAY_NAME = 'GLM-5'`) and update model indicator in chat UI
+- [ ] Rename `isGeminiAvailable()` → `isLLMAvailable()` (or similar)
+- [ ] Typecheck passes
+- [ ] **Verify in browser**: chat opens, sends a message, streams a response
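The request shape implied by these criteria could be built in one pure function, which keeps it testable without network access. This is a sketch, not the module's real code: `buildChatRequest`, `LlmMessage`, `ChatRequest`, and the `site` parameter are hypothetical, while the endpoint, model id, and header names are those listed in this PRD.

```typescript
interface LlmMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatRequest {
  url: string;
  method: 'POST';
  headers: { [name: string]: string };
  body: string;
}

const OPENROUTER_URL = 'https://openrouter.ai/api/v1/chat/completions';
const LLM_MODEL = 'z-ai/glm-5';

function buildChatRequest(
  apiKey: string,
  systemPrompt: string,
  history: LlmMessage[],
  site: { url: string; title: string }
): ChatRequest {
  // The system prompt becomes the first entry of the `messages` array.
  const messages: LlmMessage[] = [{ role: 'system', content: systemPrompt }, ...history];
  return {
    url: OPENROUTER_URL,
    method: 'POST',
    headers: {
      'Authorization': 'Bearer ' + apiKey,
      'Content-Type': 'application/json',
      'HTTP-Referer': site.url, // optional attribution headers
      'X-Title': site.title,
    },
    body: JSON.stringify({ model: LLM_MODEL, messages: messages, stream: true }),
  };
}
```

The result maps directly onto `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`, and the benchmark script could reuse the same builder.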
+
+### US-002: Migrate benchmark script to OpenRouter
+**Description:** As a developer, I need the benchmark harness to use OpenRouter so it tests the same model and prompt path as production.
+
+**Acceptance Criteria:**
+- [ ] `scripts/benchmark.ts` uses OpenRouter API instead of Gemini
+- [ ] API key read from `VITE_OPEN_ROUTER_API_KEY` (loaded from `.env`)
+- [ ] Request format uses OpenAI chat completions structure
+- [ ] Model identifier set to `z-ai/glm-5`
+- [ ] Rate limit/retry logic updated for OpenRouter's error responses
+- [ ] Scoring calls also use OpenRouter (same provider for all LLM calls)
+- [ ] `npm run benchmark` still works end-to-end
+- [ ] Typecheck passes
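For the rate-limit criterion, the retry decision can be separated from the waiting itself. The sketch below assumes exponential backoff and retries only on HTTP 429; the function name, attempt cap, and base delay are hypothetical and would need tuning against the limits actually observed.

```typescript
// Returns how long to wait before the next attempt, or null to give up.
// `attempt` is the number of attempts already made (1-based).
function retryDelayMs(
  status: number,
  attempt: number,
  maxAttempts: number = 4,
  baseDelayMs: number = 1000
): number | null {
  if (status !== 429) return null;          // only rate-limit responses are retried
  if (attempt >= maxAttempts) return null;  // attempt budget exhausted
  return baseDelayMs * Math.pow(2, attempt - 1); // 1000, 2000, 4000, ...
}
```

The benchmark loop would call this after each failed response and sleep for the returned duration before retrying.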
+
+### US-003: Enrich system prompt with full CV context
+**Description:** As a portfolio visitor, I want the AI to have comprehensive knowledge of Andy's background so it can answer detailed questions accurately.
+
+**Acceptance Criteria:**
+- [ ] System prompt includes full professional profile narrative (from CV_v4.md profile section)
+- [ ] Each role includes full achievement bullets, not just summaries
+- [ ] Clear distinction between NHS employment (May 2022+) and private sector (Tesco)
+- [ ] Clinical specialties, methodology details, and specific outcomes included
+- [ ] Education includes specific grades, subjects, research topics
+- [ ] Prompt is well-structured with clear sections for easy LLM parsing
+- [ ] No invented or extrapolated content — everything sourced from CV_v4.md and data files
+- [ ] Typecheck passes
+
+### US-004: Improve system prompt instructions
+**Description:** As a portfolio visitor, I want the AI to use its knowledge effectively — citing specifics, distinguishing between employers, and aggregating across roles when asked.
+
+**Acceptance Criteria:**
+- [ ] Prompt instructs LLM to distinguish NHS employment from private sector roles
+- [ ] Prompt instructs LLM to aggregate across roles when asked broad questions (e.g., "what tools has Andy built?")
+- [ ] Prompt instructs LLM to cite specific metrics, dates, and outcomes when available
+- [ ] Temperature and token limits are appropriate for detailed answers (review current 0.7 temp, 512 max tokens)
+- [ ] Typecheck passes
+
+### US-005: Enrich embedding texts for semantic search
+**Description:** As a portfolio visitor, I want semantic search to surface relevant results even for nuanced queries so the chat and command palette find the right content.
+
+**Acceptance Criteria:**
+- [ ] `buildEmbeddingTexts()` generates richer text per item — full achievement narratives, methodology detail, clinical specialties
+- [ ] Role `history` narratives are included (currently only `examination` bullets and `codedEntries` are used)
+- [ ] Cross-references included where items relate (e.g., CD monitoring links to controlled drugs skill)
+- [ ] Embedding texts remain well-formed natural language (not keyword soup)
+- [ ] Typecheck passes
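As an illustration of "natural language, not keyword soup", a per-role text could be assembled from complete sentences. This is a sketch only: the PRD names the `history`, `examination`, and `codedEntries` fields, but their types, the other fields, and the helper names below are assumptions about the data shape.

```typescript
// Hypothetical shape; field types are assumed for illustration.
interface RoleItem {
  title: string;
  organisation: string;
  history: string;        // narrative paragraph
  examination: string[];  // achievement bullets
  codedEntries: string[];
  relatedItems: string[]; // titles of cross-referenced items
}

// Makes sure each fragment ends as a sentence.
function asSentence(text: string): string {
  const trimmed = text.trim();
  if (trimmed.length === 0) return '';
  return /[.!?]$/.test(trimmed) ? trimmed : trimmed + '.';
}

function buildRoleEmbeddingText(role: RoleItem): string {
  const parts: string[] = [
    asSentence(role.title + ' at ' + role.organisation),
    asSentence(role.history),
  ];
  role.examination.forEach((bullet) => parts.push(asSentence(bullet)));
  if (role.codedEntries.length > 0) {
    parts.push(asSentence('Coded entries: ' + role.codedEntries.join(', ')));
  }
  if (role.relatedItems.length > 0) {
    parts.push(asSentence('Related: ' + role.relatedItems.join(', ')));
  }
  // Empty fragments are dropped so the text has no stray separators.
  return parts.filter((p) => p.length > 0).join(' ');
}
```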
+
+### US-006: Regenerate embeddings
+**Description:** As a developer, I need embeddings regenerated whenever embedding texts change so semantic search results match the enriched content.
+
+**Acceptance Criteria:**
+- [ ] Embeddings regenerated using the same model (all-MiniLM-L6-v2)
+- [ ] Output written to `src/data/embeddings.json`
+- [ ] Number of embeddings matches number of palette items
+- [ ] Regeneration can be triggered via script (`npm run generate-embeddings` or similar)
+- [ ] Typecheck passes
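The count criterion lends itself to an automated check after each regeneration. The sketch below assumes `embeddings.json` holds entries with an `id` and a `vector`; that shape and the names `EmbeddingEntry` and `validateEmbeddings` are hypothetical. The 384-dimension figure is the output size of all-MiniLM-L6-v2.

```typescript
const EMBEDDING_DIM = 384; // output size of all-MiniLM-L6-v2

// Hypothetical entry shape for src/data/embeddings.json.
interface EmbeddingEntry {
  id: string;
  vector: number[];
}

// Returns a list of human-readable problems; an empty list means the file is consistent.
function validateEmbeddings(itemIds: string[], embeddings: EmbeddingEntry[]): string[] {
  const problems: string[] = [];
  if (embeddings.length !== itemIds.length) {
    problems.push('expected ' + itemIds.length + ' embeddings, found ' + embeddings.length);
  }
  const seen: { [id: string]: boolean } = {};
  embeddings.forEach((entry) => {
    seen[entry.id] = true;
    if (entry.vector.length !== EMBEDDING_DIM) {
      problems.push(entry.id + ': vector has ' + entry.vector.length + ' dimensions');
    }
  });
  itemIds.forEach((id) => {
    if (!seen[id]) problems.push(id + ': no embedding found');
  });
  return problems;
}
```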
+
+### US-007: Iterative benchmark loop
+**Description:** As a developer, I want to run the benchmark, review scores, make improvements, and repeat until the pass threshold is met.
+
+**Acceptance Criteria:**
+- [ ] Run benchmark → review scores → identify failing questions → make structural improvements → repeat
+- [ ] Each iteration logged with: changes made, scores before/after, rationale
+- [ ] Minimum 2 iterations, maximum 10
+- [ ] Stop when 18/20 achieved with no question scoring 0
+- [ ] Final iteration results saved as evidence
+- [ ] All changes pass typecheck before benchmarking
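The loop bounds interact: a pass on the first run still needs a second iteration, and a failing run stops at ten. One possible reading of those rules as code, assuming 10 questions scored 0 to 2; `IterationResult` and `shouldRunAnotherIteration` are hypothetical names.

```typescript
const MIN_ITERATIONS = 2;
const MAX_ITERATIONS = 10;

// One logged iteration: what changed, why, and the scores on either side.
interface IterationResult {
  iteration: number;      // 1-based
  changes: string;
  rationale: string;
  scoresBefore: number[]; // each entry is 0, 1, or 2
  scoresAfter: number[];
}

// 18/20 overall with no question scoring 0 (assumes 10 questions).
function meetsThreshold(scores: number[]): boolean {
  const total = scores.reduce((sum, s) => sum + s, 0);
  return total >= 18 && scores.every((s) => s > 0);
}

function shouldRunAnotherIteration(log: IterationResult[]): boolean {
  if (log.length >= MAX_ITERATIONS) return false; // hard cap
  if (log.length < MIN_ITERATIONS) return true;   // always do at least two
  return !meetsThreshold(log[log.length - 1].scoresAfter);
}
```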
+
+### US-008: Validate no regression on general queries
+**Description:** As a portfolio visitor, I want the AI to still handle general questions well after the benchmark-focused improvements.
+
+**Acceptance Criteria:**
+- [ ] Test 5 general questions not in the benchmark (e.g., "Tell me about Andy", "What does Andy do?", "How can I contact Andy?", "What is this website?", "What are Andy's strongest skills?")
+- [ ] All general questions produce sensible, accurate responses
+- [ ] No degradation in response quality for broad queries
+- [ ] System prompt has not grown enough to noticeably slow responses (target: under 8KB, per Success Metrics)
+
+## Functional Requirements
+
+- FR-1: Production chat must use OpenRouter API with model `z-ai/glm-5`
+- FR-2: API key sourced from `VITE_OPEN_ROUTER_API_KEY` environment variable
+- FR-3: LLM module renamed from `gemini.ts` to `llm.ts` with updated exports
+- FR-4: Chat UI displays "GLM-5" as the model indicator (replacing "Gemini 3 Flash")
+- FR-5: Benchmark harness must use the identical system prompt construction path as production (`buildSystemPrompt()` from `llm.ts`)
+- FR-6: System prompt changes must be made in `llm.ts` and/or `search.ts` — the same files that serve production
+- FR-7: Embedding text changes must be in `buildEmbeddingTexts()` in `search.ts`
+- FR-8: Scoring must be automated via LLM (OpenRouter), not manual review
+- FR-9: All benchmark artifacts (questions, expected answers, results) stored in `scripts/`
+- FR-10: Embedding regeneration must produce deterministic output for the same input texts
+- FR-11: System prompt must remain a single self-contained context block (no external retrieval at runtime)
+
+## Non-Goals
+
+- No RAG infrastructure or vector database
+- No additional API integrations beyond OpenRouter
+- No changes to the chat UI layout, streaming UX, or item linking (beyond model name display)
+- No changes to the command palette search UX
+- No changes to boot sequence, ECG, or login phases
+- No new backend or server-side components
+- Not optimising for adversarial/trick questions — focus is on legitimate CV queries
+- No keeping Gemini as a fallback — this is a full replacement
+
+## Technical Considerations
+
+- **OpenRouter API format**: Uses OpenAI-compatible chat completions endpoint (`POST https://openrouter.ai/api/v1/chat/completions`). Messages use `{ role: 'system' | 'user' | 'assistant', content: string }` format. Streaming uses `stream: true` with SSE `data:` lines containing `choices[0].delta.content`.
+- **Authentication**: `Authorization: Bearer <VITE_OPEN_ROUTER_API_KEY>` header. Include `HTTP-Referer` and `X-Title` headers as recommended by OpenRouter.
+- **Rate limits**: OpenRouter has per-model rate limits. Add retry logic for 429 responses. The benchmark script should include delays between calls.
+- **Embedding regeneration**: Needs a Node.js script that loads the ONNX model and processes all texts. The existing `scripts/generate-embeddings` script should be reused.
+- **Temperature**: Current 0.7 may introduce variability in answers. Consider lowering to 0.3–0.5 for more consistent factual responses. Benchmark both.
+- **Max tokens**: Current 512 may truncate detailed answers. Consider increasing to 768 or 1024 for benchmark testing.
+- **Prompt structure**: LLMs tend to follow well-structured prompts with clear headings and sections more reliably than flat text. Consider markdown structure in the system prompt.
+- **CORS**: OpenRouter supports browser-side calls. The existing client-side fetch pattern should work without changes.
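The streaming format described above can be handled by a small parser that pulls `choices[0].delta.content` out of each `data:` line. This is a minimal sketch that assumes it receives only complete lines; `extractDeltas` is a hypothetical name.

```typescript
// Extracts text deltas from a block of complete SSE lines.
function extractDeltas(sseText: string): string[] {
  const deltas: string[] = [];
  const lines = sseText.split('\n');
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    // Blank lines and SSE comments (lines starting with ':') are skipped.
    if (line.indexOf('data:') !== 0) continue;
    const payload = line.slice(5).trim();
    if (payload === '[DONE]') break; // end-of-stream sentinel
    try {
      const parsed = JSON.parse(payload);
      const choice = parsed.choices && parsed.choices[0];
      const content = choice && choice.delta && choice.delta.content;
      if (typeof content === 'string' && content.length > 0) deltas.push(content);
    } catch (e) {
      // Malformed JSON is ignored here; a real reader would buffer and retry.
    }
  }
  return deltas;
}
```

Network chunks can end mid-line, so a production reader would keep the trailing partial line in a buffer and pass only complete lines to a function like this.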
+
+## Success Metrics
+
+- 18/20 or higher on benchmark (90%+ accuracy)
+- No question scores 0 (no factual errors)
+- 5/5 general validation questions pass
+- System prompt remains under 8KB
+- No typecheck or lint regressions
+- Embedding regeneration completes without errors
+- Chat streaming works in-browser with OpenRouter
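The 8KB budget is a byte count, not a character count, and CV text contains multi-byte characters such as "£". A sketch of a size check that could run in the benchmark script, assuming 8KB means 8 × 1024 bytes of UTF-8; the function and constant names are hypothetical.

```typescript
const MAX_PROMPT_BYTES = 8 * 1024; // assumed reading of "under 8KB"

// Counts UTF-8 bytes without relying on TextEncoder or Buffer.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // high surrogate: the pair encodes one 4-byte code point
      i++;        // skip the low surrogate
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

function isPromptWithinBudget(prompt: string): boolean {
  return utf8ByteLength(prompt) < MAX_PROMPT_BYTES;
}
```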
+
+## Resolved Questions
+
+- **Model provider**: OpenRouter with z-ai/glm-5 (replaces Gemini 3 Flash).
+- **File naming**: `gemini.ts` renamed to `llm.ts` for provider-agnostic naming.
+- **Benchmark provider**: OpenRouter used for both chat answers and scoring (single provider).
+- **Benchmark results are git-tracked.** Each iteration's scores are committed so improvement over time is visible and auditable.
+- **A `scripts/generate-embeddings` script already exists.** Review and adapt it as needed rather than building from scratch.
+- **Benchmark harness is permanent.** Kept as an ongoing regression test (`npm run benchmark`) for validating LLM accuracy after any data or prompt changes. Question set can be expanded over time.