2026-02-15 23:20:24 +00:00
parent 4580ca9c84
commit 0fbbf9e46f
10 changed files with 1576 additions and 2 deletions
+125
@@ -0,0 +1,125 @@
# PRD: Chat Widget Polish & Model Updates
## Introduction
The semantic search and AI chat features are functionally complete (US-001 through US-010). This PRD covers four polish items: mobile full-screen chat experience, a welcome message with suggested questions, self-hosting the ONNX embedding model, and updating from Gemini 2.0 Flash to Gemini 3 Flash Preview.
## Goals
- Full-screen chat on mobile (<768px) for a better small-screen experience
- Welcome message with suggested question chips to reduce blank-state friction
- Self-host the ONNX model (`all-MiniLM-L6-v2`) to eliminate dependency on Hugging Face CDN
- Update Gemini model to `gemini-3-flash-preview` and show which model powers the chat
- Refresh system prompt while updating the model
## User Stories
### US-011: Mobile full-screen chat panel
**Description:** As a mobile visitor, I want the chat panel to be a full-screen overlay so it's easy to use on small screens.
**Acceptance Criteria:**
- [ ] Below `md` breakpoint (768px), chat panel renders as full-screen overlay (100vw x 100vh, using `dvh` units where supported so mobile browser chrome is accounted for)
- [ ] Full-screen mode has a visible header with close button
- [ ] Floating chat button is hidden while panel is open on mobile
- [ ] Above 768px, existing panel behavior unchanged (380px wide, anchored bottom-right)
- [ ] Smooth transition between open/closed states respects `prefers-reduced-motion`
- [ ] Typecheck passes
- [ ] Verify in browser using dev-browser skill
### US-012: Welcome message with suggested questions
**Description:** As a visitor opening the chat for the first time, I want to see a friendly welcome and clickable suggested questions so I know what to ask.
**Acceptance Criteria:**
- [ ] When chat panel opens and conversation is empty, display welcome message: "Hey! I'm here to help you learn more about Andy. What would you like to know?"
- [ ] Below the welcome message, show 2-3 clickable pill/chip buttons with suggested questions (e.g., "What's his NHS experience?", "Tell me about his data skills", "What projects has he built?")
- [ ] Clicking a suggested question sends it as a user message (same as typing and pressing Enter)
- [ ] Welcome message and chips are always visible when conversation is empty (persist across open/close if no messages sent)
- [ ] Once a message is sent, the welcome/chips area is replaced by the conversation
- [ ] Chips use design system tokens (teal accent border, hover state)
- [ ] Typecheck passes
- [ ] Verify in browser using dev-browser skill
### US-013: Self-host ONNX embedding model
**Description:** As a developer, I want the ONNX model files served from the same host as the site, so there's no runtime dependency on Hugging Face CDN.
**Acceptance Criteria:**
- [ ] Model files for `all-MiniLM-L6-v2` downloaded and placed in `public/models/all-MiniLM-L6-v2/` (or `public/models/onnx/` — whichever is cleaner)
- [ ] Files include at minimum: `onnx/model_quantized.onnx`, `tokenizer.json`, `tokenizer_config.json`, `config.json`
- [ ] `src/lib/embedding-model.ts` updated to load from local path instead of Hugging Face CDN
- [ ] Build-time embedding script (`scripts/generate-embeddings.ts`) also uses local model path
- [ ] `.gitignore` does NOT ignore the model files — they are committed as static assets
- [ ] Verify model loads correctly in browser (semantic search still works in command palette)
- [ ] Typecheck passes
### US-014: Update to Gemini 3 Flash Preview + model indicator
**Description:** As a developer, I want to use the latest free Gemini model, and as a visitor, I want to see what model powers the chat.
**Acceptance Criteria:**
- [ ] `GEMINI_API_BASE` in `src/lib/gemini.ts` updated from `gemini-2.0-flash` to `gemini-3-flash-preview`
- [ ] Review and update the system prompt for clarity (ensure it's well-structured for the new model)
- [ ] Review and update the response format instructions (the `[ITEMS: ...]` suffix pattern)
- [ ] Small text indicator in chat panel header or footer showing the model name (e.g., "Gemini 3 Flash" in `font-geist`, 11px, tertiary color)
- [ ] If the model string needs to change in future, it should be a single constant — not hardcoded in multiple places
- [ ] Typecheck passes
- [ ] Verify in browser using dev-browser skill
## Functional Requirements
- FR-1: Chat panel below 768px uses full-screen overlay layout (`position: fixed; inset: 0`)
- FR-2: Chat button hidden when full-screen panel is open on mobile
- FR-3: Welcome message and suggested question chips shown when conversation is empty
- FR-4: Clicking a suggested question chip triggers the same flow as manually typing and sending
- FR-5: ONNX model files served from `public/models/` as static assets
- FR-6: `embedding-model.ts` configures Transformers.js to use local model path
- FR-7: Gemini API calls use `gemini-3-flash-preview` model
- FR-8: Chat UI displays model name indicator
## Non-Goals
- No changes to the command palette UI or semantic search ranking logic
- No persistent chat history across page loads
- No rate limiting or abuse prevention
- No changes to the boot/ECG/login flow
- No model fine-tuning or custom training
## Design Considerations
### Mobile Full-Screen Chat
- Full viewport with safe area insets (`env(safe-area-inset-*)`) for notched devices
- Header matches existing panel header style but full-width
- Input pinned to bottom, messages scroll above
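The breakpoint rule in FR-1 and FR-2 can be expressed as a small pure function, which keeps the layout decision testable outside the component. This is a minimal sketch: `chatLayoutFor`, `ChatLayout`, and `MD_BREAKPOINT_PX` are hypothetical names, and only the 768px threshold comes from this PRD.

```typescript
// Hypothetical helper; only the 768px breakpoint is taken from the PRD.
const MD_BREAKPOINT_PX = 768;

type ChatLayout = {
  mode: "fullscreen" | "docked";
  // True only while the full-screen panel is open on mobile (FR-2).
  hideFloatingButton: boolean;
};

function chatLayoutFor(viewportWidth: number, panelOpen: boolean): ChatLayout {
  const isMobile = viewportWidth < MD_BREAKPOINT_PX;
  return {
    mode: isMobile ? "fullscreen" : "docked",
    hideFloatingButton: isMobile && panelOpen,
  };
}
```

A component could feed this from `window.innerWidth` or a media-query hook and map `mode: "fullscreen"` to the `position: fixed; inset: 0` overlay styles.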
### Welcome Message & Chips
- Welcome text styled as an AI message bubble (left-aligned, light background)
- Chips: small rounded pills with teal border, teal text on hover, `font-ui` 12-13px
- 2-3 chips arranged in a flex-wrap row below the welcome bubble
- Example questions: "What's his NHS experience?", "Tell me about his data skills", "What projects has he built?"
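The empty-state rule (FR-3) and the chip click path (FR-4) could be modelled roughly as follows. This is a sketch under assumptions: `ChatMessage`, `shouldShowWelcome`, `submitMessage`, and `onChipClick` are hypothetical names, not the existing component API. The welcome text and example questions are copied from this PRD.

```typescript
// Hypothetical message shape; the real ChatWidget state may differ.
type ChatMessage = { role: "user" | "assistant"; content: string };

const WELCOME_TEXT =
  "Hey! I'm here to help you learn more about Andy. What would you like to know?";

const SUGGESTED_QUESTIONS: readonly string[] = [
  "What's his NHS experience?",
  "Tell me about his data skills",
  "What projects has he built?",
];

// Welcome bubble and chips are shown only while the conversation is empty.
function shouldShowWelcome(messages: readonly ChatMessage[]): boolean {
  return messages.length === 0;
}

// Single entry point for typed input; blank input is ignored.
function submitMessage(messages: readonly ChatMessage[], text: string): ChatMessage[] {
  const trimmed = text.trim();
  if (trimmed === "") return [...messages];
  return [...messages, { role: "user", content: trimmed }];
}

// A chip click reuses submitMessage, so both paths behave identically.
function onChipClick(messages: readonly ChatMessage[], question: string): ChatMessage[] {
  return submitMessage(messages, question);
}
```

Because the welcome state is derived from `messages.length`, it persists across open/close without extra flags for as long as nothing has been sent.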
### Model Indicator
- Placed in the chat panel header, right-aligned or below the "Ask about Andy" title
- `font-geist`, 11px, `var(--text-tertiary)` color
- Format: "Powered by Gemini 3 Flash" or just "Gemini 3 Flash"
## Technical Considerations
### Self-Hosting ONNX Model
- Transformers.js exposes `env.localModelPath` (together with `env.allowLocalModels` and `env.allowRemoteModels`) to redirect model loading from the HF CDN to a local path
- The quantized model (`model_quantized.onnx`) is ~23MB — acceptable for a static deploy
- Files must be served with correct MIME types (`.onnx` as `application/octet-stream`)
- The build-time script and browser runtime must both point to the same model files
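One way to keep the build script and the browser runtime pointed at the same files is to derive every URL from one base path and one file list. The sketch below assumes the first directory layout proposed in US-013 (`public/models/all-MiniLM-L6-v2/`) and that the site serves `public/` at the root. `modelFileUrls` and `missingModelFiles` are hypothetical helpers, not part of Transformers.js.

```typescript
const MODEL_ID = "all-MiniLM-L6-v2";

// Minimum file set listed in US-013.
const REQUIRED_MODEL_FILES = [
  "onnx/model_quantized.onnx",
  "tokenizer.json",
  "tokenizer_config.json",
  "config.json",
] as const;

// Build the URL of every required file under a base path such as "/models/".
function modelFileUrls(basePath: string, modelId: string = MODEL_ID): string[] {
  const base = basePath.replace(/\/+$/, ""); // drop trailing slashes
  return REQUIRED_MODEL_FILES.map((file) => `${base}/${modelId}/${file}`);
}

// Report which required files are absent from a set of known URLs,
// e.g. a listing of the deployed static assets.
function missingModelFiles(available: ReadonlySet<string>, basePath: string): string[] {
  return modelFileUrls(basePath).filter((url) => !available.has(url));
}
```

A check like `missingModelFiles` could run in CI after the build so a missing tokenizer file fails early instead of surfacing as a broken search in the browser.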
### Gemini Model Update
- `gemini-3-flash-preview` may have a different API path structure — verify against the Generative Language API docs
- The streaming SSE format should be identical across Flash models, but verify the response shape
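To satisfy the single-constant requirement in US-014, the model id, display name, and endpoint could all be derived in one place. This is a sketch, not the current contents of `src/lib/gemini.ts`: `GEMINI_MODEL`, `GEMINI_DISPLAY_NAME`, and `geminiStreamUrl` are hypothetical names, and the URL shape follows the pattern used by existing Flash models on the Generative Language API, which still needs verifying for `gemini-3-flash-preview`. The API version is a parameter because the `v1beta` vs `v1` question is open.

```typescript
// Hypothetical constants; change the model in exactly one place.
const GEMINI_MODEL = "gemini-3-flash-preview";
const GEMINI_DISPLAY_NAME = "Gemini 3 Flash";

type ApiVersion = "v1" | "v1beta";

// Assumed path shape; verify against the Generative Language API docs.
function geminiStreamUrl(model: string, version: ApiVersion): string {
  return (
    `https://generativelanguage.googleapis.com/${version}` +
    `/models/${model}:streamGenerateContent?alt=sse`
  );
}
```

The chat header indicator can then render `GEMINI_DISPLAY_NAME` rather than a second hardcoded string.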
## Success Metrics
- Mobile chat is comfortable to use on a phone-sized viewport (no overflow, no cropping)
- Suggested questions reduce "blank screen" hesitation — visitors engage faster
- ONNX model loads successfully from local path (no HF CDN requests in network tab)
- Chat responses come through on the new Gemini model with correct item references
## Open Questions
- Should the suggested question chips be configurable from a data file, or hardcoded in the component?
- Does `gemini-3-flash-preview` require a different API version path (`v1beta` vs `v1`)?
+197
@@ -0,0 +1,197 @@
# PRD: Improve LLM CV Knowledge Accuracy
## Introduction
The portfolio's AI chat gives inaccurate or shallow answers about Andy's work history. The root cause: the system prompt feeds `buildEmbeddingTexts()` summaries rather than the full CV detail. Questions about specific achievements, methodology, clinical specialties, or cross-role context produce vague or incorrect responses. This PRD defines an iterative improvement process: enrich the LLM's context, measure accuracy against 10 verifiable benchmark questions, and repeat until all pass — while ensuring changes are structural (not question-specific hacks).
Additionally, the LLM provider is changing from Gemini to **OpenRouter** using the **z-ai/glm-5** model. This requires migrating the API integration, renaming the module, and updating the benchmark harness to use the new provider.
## Goals
- Achieve 10/10 accuracy on benchmark questions with factually correct, detailed, citation-worthy answers
- Ensure improvements are structural — benefiting all possible queries, not just the 10 benchmarks
- Maintain the existing architecture (no new APIs beyond OpenRouter, no RAG infrastructure, no backend)
- Migrate from Gemini to OpenRouter (z-ai/glm-5) for both production chat and benchmark scoring
- Regenerate embeddings when embedding texts change, keeping search and LLM context in sync
## Benchmark Questions
These 10 questions have verifiable answers from CV_v4.md and the structured data files. Each tests a different knowledge gap.
| # | Question | Expected Answer (summary) | Tests |
|---|----------|--------------------------|-------|
| Q1 | "How many years has Andy been employed by the NHS?" | ~3.5 years (May 2022-present). Tesco was private sector. | NHS vs non-NHS employer distinction |
| Q2 | "What was Andy's involvement with tirzepatide?" | Supported NICE TA1026 commissioning, authored executive paper advocating primary care model, drove GP-led delivery. | Deep role-specific detail |
| Q3 | "What specific tools and software has Andy built?" | 5 projects: switching algorithm, Blueteq generator, CD monitoring, Sankey tool, PharMetrics. Each with outcomes. | Cross-role aggregation |
| Q4 | "What were Andy's A-level subjects and grades?" | Maths A*, Chemistry B, Politics C. Highworth Grammar School, 2009-2011. | Specific education detail |
| Q5 | "Was Andy's Tesco role part of the NHS?" | No. Tesco PLC is private. Community pharmacy, not NHS employment. LPC representative for Norfolk. | Employer classification |
| Q6 | "How did the patient switching algorithm work?" | Python, real-world GP data, auto-identified patients for alternatives, 3 days vs months manual, 14,000 patients, £2.6M, novel GP payment system. | Methodology depth |
| Q7 | "What clinical specialties has Andy worked across?" | Rheumatology, ophthalmology (wet AMD, DMO, RVO), dermatology, gastroenterology, neurology, migraine — from high-cost drugs role. | Narrative detail not in bullet summaries |
| Q8 | "What is Andy's experience with the dm+d?" | Created comprehensive medicines data table integrating all dm+d products with standardised strengths, morphine equivalents, Anticholinergic Burden scoring — single source of truth. | Technical achievement context |
| Q9 | "What budget does Andy manage and how?" | £220M prescribing budget. Forecasting models, variance analysis, financial reporting to executive team, interactive expenditure dashboard. | Figure + methodology |
| Q10 | "What leadership training does Andy have?" | Mary Seacole Programme (2018, 78%). Also national induction programme at Tesco, NVQ3 supervision. | Cross-role synthesis |
### Scoring Criteria
Each question scored 0-2:
- **0 — Incorrect**: Wrong facts, invented detail, or contradicts CV
- **1 — Partial**: Correct but missing key detail, or vague where specifics are available
- **2 — Accurate**: Factually correct, appropriately detailed, cites specific achievements/metrics
**Pass threshold**: 18/20 (90%), with no question scoring 0.
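The pass rule combines two conditions, a total of at least 18 and no individual zero, so a run can reach 18/20 and still fail. A minimal sketch of that check, with hypothetical names (`Score`, `evaluateBenchmark`) that are not taken from `scripts/benchmark.ts`:

```typescript
type Score = 0 | 1 | 2;

const PASS_TOTAL = 18; // 90% of the 20 available points

type BenchmarkResult = {
  total: number;
  max: number;
  zeroCount: number;
  passed: boolean;
};

function evaluateBenchmark(scores: readonly Score[]): BenchmarkResult {
  const total = scores.reduce<number>((sum, s) => sum + s, 0);
  const zeroCount = scores.filter((s) => s === 0).length;
  return {
    total,
    max: scores.length * 2,
    zeroCount,
    // Both conditions must hold: enough points AND no incorrect answer.
    passed: total >= PASS_TOTAL && zeroCount === 0,
  };
}
```

For example, nine answers scoring 2 and one scoring 0 total 18 points but fail, while eight 2s and two 1s also total 18 and pass.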
### Anti-Benchmaxing Rules
- No hardcoded answers or question-specific prompt clauses
- Every change must be a structural improvement (richer context, better prompt patterns, enriched embeddings)
- After each iteration, mentally evaluate: "Would this help a question NOT in the benchmark?" — if no, reject the change
- The system prompt must not reference benchmark questions or their specific phrasings
## User Stories
### US-001: Migrate production chat from Gemini to OpenRouter
**Description:** As a developer, I need to replace the Gemini API integration with OpenRouter so the chat uses z-ai/glm-5.
**Acceptance Criteria:**
- [ ] Rename `src/lib/gemini.ts` → `src/lib/llm.ts`
- [ ] Update all imports across the codebase (`ChatWidget.tsx`, `search.ts`, etc.)
- [ ] Replace Gemini API calls with OpenRouter's OpenAI-compatible API (`https://openrouter.ai/api/v1/chat/completions`)
- [ ] Model set to `z-ai/glm-5`
- [ ] API key read from `VITE_OPEN_ROUTER_API_KEY` env var
- [ ] SSE streaming still works (OpenRouter supports `stream: true`)
- [ ] System prompt and message format adapted to OpenAI chat completions format (`messages` array with `role`/`content`)
- [ ] Export updated display name constant (e.g., `LLM_DISPLAY_NAME = 'GLM-5'`) and update model indicator in chat UI
- [ ] Rename `isGeminiAvailable()` → `isLLMAvailable()` (or similar)
- [ ] Typecheck passes
- [ ] **Verify in browser**: chat opens, sends a message, streams a response
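The migration has two moving parts: building an OpenAI-style request and reading `choices[0].delta.content` out of each SSE `data:` line. The sketch below illustrates both as pure functions. `buildChatRequest` and `parseSseLine` are hypothetical helpers rather than the planned `llm.ts` exports, and the optional `HTTP-Referer` / `X-Title` headers are omitted for brevity.

```typescript
type Role = "system" | "user" | "assistant";
type ChatMessage = { role: Role; content: string };

const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";
const LLM_MODEL = "z-ai/glm-5";

// Assemble URL, headers and JSON body for a streaming chat completion.
function buildChatRequest(
  systemPrompt: string,
  history: readonly ChatMessage[],
  apiKey: string,
) {
  return {
    url: OPENROUTER_URL,
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: LLM_MODEL,
      stream: true,
      messages: [{ role: "system", content: systemPrompt }, ...history],
    }),
  };
}

// Return the text delta carried by one SSE line, or null for comment
// lines, blank lines, the [DONE] sentinel, and malformed JSON.
function parseSseLine(line: string): string | null {
  if (!line.startsWith("data:")) return null;
  const payload = line.slice("data:".length).trim();
  if (payload === "" || payload === "[DONE]") return null;
  try {
    const chunk = JSON.parse(payload) as {
      choices?: { delta?: { content?: string } }[];
    };
    return chunk.choices?.[0]?.delta?.content ?? null;
  } catch {
    return null;
  }
}
```

A streaming reader would split the response body on newlines, pass each line through `parseSseLine`, and append the non-null results to the assistant message.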
### US-002: Migrate benchmark script to OpenRouter
**Description:** As a developer, I need the benchmark harness to use OpenRouter so it tests the same model and prompt path as production.
**Acceptance Criteria:**
- [ ] `scripts/benchmark.ts` uses OpenRouter API instead of Gemini
- [ ] API key read from `VITE_OPEN_ROUTER_API_KEY` (loaded from `.env`)
- [ ] Request format uses OpenAI chat completions structure
- [ ] Model identifier set to `z-ai/glm-5`
- [ ] Rate limit/retry logic updated for OpenRouter's error responses
- [ ] Scoring calls also use OpenRouter (same provider for all LLM calls)
- [ ] `npm run benchmark` still works end-to-end
- [ ] Typecheck passes
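For the rate limit/retry criterion, one common approach is exponential backoff on HTTP 429. The sketch below is an assumption about how the harness might do it, not a description of the existing script: `backoffDelayMs` and `withRetry` are hypothetical, the base delay, cap, and retry count are illustrative defaults, and other OpenRouter error shapes are not handled.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Re-run `call` while it returns status 429, up to `maxRetries` retries.
// `sleep` is injectable so tests do not need to wait on real timers.
async function withRetry<T extends { status: number }>(
  call: () => Promise<T>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const res = await call();
    // Return on success, on non-429 errors, or when retries are exhausted.
    if (res.status !== 429 || attempt >= maxRetries) return res;
    await sleep(backoffDelayMs(attempt));
  }
}
```

In the benchmark script, both the answer call and the scoring call could be wrapped, for example `withRetry(() => fetch(url, init))`, since a `fetch` `Response` exposes `status`.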
### US-003: Enrich system prompt with full CV context
**Description:** As a portfolio visitor, I want the AI to have comprehensive knowledge of Andy's background so it can answer detailed questions accurately.
**Acceptance Criteria:**
- [ ] System prompt includes full professional profile narrative (from CV_v4.md profile section)
- [ ] Each role includes full achievement bullets, not just summaries
- [ ] Clear distinction between NHS employment (May 2022+) and private sector (Tesco)
- [ ] Clinical specialties, methodology details, and specific outcomes included
- [ ] Education includes specific grades, subjects, research topics
- [ ] Prompt is well-structured with clear sections for easy LLM parsing
- [ ] No invented or extrapolated content — everything sourced from CV_v4.md and data files
- [ ] Typecheck passes
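The "clear sections" criterion could be met by rendering the prompt from a list of headed sections instead of concatenating strings by hand. The helper below is a hypothetical illustration (`PromptSection` and `renderPromptSections` are not existing code, and the real `buildSystemPrompt()` may be organised differently). It emits markdown headings with bullet lines and drops empty sections.

```typescript
type PromptSection = { heading: string; lines: string[] };

// Render sections as markdown: "## Heading" followed by "- line" bullets.
// Sections with no lines are skipped so the prompt carries no empty headings.
function renderPromptSections(sections: readonly PromptSection[]): string {
  return sections
    .filter((section) => section.lines.length > 0)
    .map(
      (section) =>
        `## ${section.heading}\n` +
        section.lines.map((line) => `- ${line}`).join("\n"),
    )
    .join("\n\n");
}
```

Keeping the section data separate from the rendering also makes it easier to confirm that every line is sourced from CV_v4.md or the data files.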
### US-004: Improve system prompt instructions
**Description:** As a portfolio visitor, I want the AI to use its knowledge effectively — citing specifics, distinguishing between employers, and aggregating across roles when asked.
**Acceptance Criteria:**
- [ ] Prompt instructs LLM to distinguish NHS employment from private sector roles
- [ ] Prompt instructs LLM to aggregate across roles when asked broad questions (e.g., "what tools has Andy built?")
- [ ] Prompt instructs LLM to cite specific metrics, dates, and outcomes when available
- [ ] Temperature and token limits are appropriate for detailed answers (review current 0.7 temp, 512 max tokens)
- [ ] Typecheck passes
### US-005: Enrich embedding texts for semantic search
**Description:** As a portfolio visitor, I want semantic search to surface relevant results even for nuanced queries so the chat and command palette find the right content.
**Acceptance Criteria:**
- [ ] `buildEmbeddingTexts()` generates richer text per item — full achievement narratives, methodology detail, clinical specialties
- [ ] Role `history` narratives are included (currently only `examination` bullets and `codedEntries`)
- [ ] Cross-references included where items relate (e.g., CD monitoring links to controlled drugs skill)
- [ ] Embedding texts remain well-formed natural language (not keyword soup)
- [ ] Typecheck passes
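A rough illustration of how a role's embedding text could be assembled as sentences instead of keywords. The field names `history`, `examination`, and `codedEntries` come from this PRD, but their types here are assumptions, and `relatedTitles`, `RoleItem`, and `buildRoleEmbeddingText` are hypothetical. The real `buildEmbeddingTexts()` in `search.ts` may be shaped differently.

```typescript
// Assumed shape: the PRD names these fields but not their types.
type RoleItem = {
  title: string;
  history?: string;          // narrative paragraph
  examination?: string[];    // achievement bullets
  codedEntries?: string[];   // short coded labels
  relatedTitles?: string[];  // hypothetical cross-reference field
};

// Ensure each fragment reads as a sentence once the parts are joined.
function endWithPeriod(text: string): string {
  const trimmed = text.trim();
  return /[.!?]$/.test(trimmed) ? trimmed : `${trimmed}.`;
}

function buildRoleEmbeddingText(item: RoleItem): string {
  const parts: string[] = [item.title];
  if (item.history) parts.push(item.history);
  parts.push(...(item.examination ?? []));
  if (item.codedEntries?.length) {
    parts.push(`Coded entries: ${item.codedEntries.join(", ")}`);
  }
  if (item.relatedTitles?.length) {
    parts.push(`Related to ${item.relatedTitles.join(" and ")}`);
  }
  return parts
    .filter((part) => part.trim() !== "")
    .map(endWithPeriod)
    .join(" ");
}
```

Joining full sentences keeps the input close to the natural-language text that sentence-embedding models such as all-MiniLM-L6-v2 are trained on.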
### US-006: Regenerate embeddings
**Description:** As a developer, I need embeddings regenerated whenever embedding texts change so semantic search results match the enriched content.
**Acceptance Criteria:**
- [ ] Embeddings regenerated using the same model (all-MiniLM-L6-v2)
- [ ] Output written to `src/data/embeddings.json`
- [ ] Number of embeddings matches number of palette items
- [ ] Regeneration can be triggered via script (`npm run generate-embeddings` or similar)
- [ ] Typecheck passes
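The count criterion is easy to check mechanically after each regeneration. The sketch below assumes `embeddings.json` holds a list of `{ id, vector }` entries, which is a guess at the file shape. `validateEmbeddings` is a hypothetical helper. The 384-dimension check reflects the output size of all-MiniLM-L6-v2.

```typescript
const EMBEDDING_DIM = 384; // output dimension of all-MiniLM-L6-v2

// Assumed entry shape; adjust to the real embeddings.json layout.
type EmbeddingEntry = { id: string; vector: number[] };

// Return a list of human-readable problems; an empty list means valid.
function validateEmbeddings(
  itemIds: readonly string[],
  entries: readonly EmbeddingEntry[],
): string[] {
  const problems: string[] = [];
  if (entries.length !== itemIds.length) {
    problems.push(`expected ${itemIds.length} embeddings, found ${entries.length}`);
  }
  const known = new Set(entries.map((entry) => entry.id));
  for (const id of itemIds) {
    if (!known.has(id)) problems.push(`missing embedding for ${id}`);
  }
  for (const entry of entries) {
    if (entry.vector.length !== EMBEDDING_DIM) {
      problems.push(`${entry.id}: dimension ${entry.vector.length}, expected ${EMBEDDING_DIM}`);
    }
  }
  return problems;
}
```

Running this at the end of the generation script would turn a silent mismatch between palette items and embeddings into a build failure.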
### US-007: Iterative benchmark loop
**Description:** As a developer, I want to run the benchmark, review scores, make improvements, and repeat until the pass threshold is met.
**Acceptance Criteria:**
- [ ] Run benchmark → review scores → identify failing questions → make structural improvements → repeat
- [ ] Each iteration logged with: changes made, scores before/after, rationale
- [ ] Minimum 2 iterations, maximum 10
- [ ] Stop when 18/20 achieved with no question scoring 0
- [ ] Final iteration results saved as evidence
- [ ] All changes pass typecheck before benchmarking
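The loop's exit rule interacts with the iteration bounds. Under one reading of the criteria, a passing score before the second iteration does not end the loop, and the tenth iteration ends it regardless of score. A sketch of that interpretation, with hypothetical names (`IterationLog`, `meetsThreshold`, `shouldStop`):

```typescript
const MIN_ITERATIONS = 2;
const MAX_ITERATIONS = 10;
const PASS_TOTAL = 18;

// Fields mirror the logging criterion: changes, before/after scores, rationale.
type IterationLog = {
  iteration: number;
  changes: string;
  rationale: string;
  scoresBefore: number[];
  scoresAfter: number[];
};

function meetsThreshold(scores: readonly number[]): boolean {
  const total = scores.reduce((sum, s) => sum + s, 0);
  return total >= PASS_TOTAL && !scores.includes(0);
}

// Decide whether to stop after `completed` iterations with the latest scores.
function shouldStop(completed: number, latestScores: readonly number[]): boolean {
  if (completed >= MAX_ITERATIONS) return true;  // hard cap
  if (completed < MIN_ITERATIONS) return false;  // always run at least two
  return meetsThreshold(latestScores);
}
```

Each pass through the loop would append one `IterationLog` entry, and the final entry doubles as the saved evidence.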
### US-008: Validate no regression on general queries
**Description:** As a portfolio visitor, I want the AI to still handle general questions well after the benchmark-focused improvements.
**Acceptance Criteria:**
- [ ] Test 5 general questions not in the benchmark (e.g., "Tell me about Andy", "What does Andy do?", "How can I contact Andy?", "What is this website?", "What are Andy's strongest skills?")
- [ ] All general questions produce sensible, accurate responses
- [ ] No degradation in response quality for broad queries
- [ ] System prompt size stays under the 8KB budget listed in Success Metrics, so response speed is not noticeably degraded
## Functional Requirements
- FR-1: Production chat must use OpenRouter API with model `z-ai/glm-5`
- FR-2: API key sourced from `VITE_OPEN_ROUTER_API_KEY` environment variable
- FR-3: LLM module renamed from `gemini.ts` to `llm.ts` with updated exports
- FR-4: Chat UI displays "GLM-5" as the model indicator (replacing "Gemini 3 Flash")
- FR-5: Benchmark harness must use the identical system prompt construction path as production (`buildSystemPrompt()` from `llm.ts`)
- FR-6: System prompt changes must be made in `llm.ts` and/or `search.ts` — the same files that serve production
- FR-7: Embedding text changes must be in `buildEmbeddingTexts()` in `search.ts`
- FR-8: Scoring must be automated via LLM (OpenRouter), not manual review
- FR-9: All benchmark artifacts (questions, expected answers, results) stored in `scripts/`
- FR-10: Embedding regeneration must produce deterministic output for the same input texts
- FR-11: System prompt must remain a single self-contained context block (no external retrieval at runtime)
## Non-Goals
- No RAG infrastructure or vector database
- No additional API integrations beyond OpenRouter
- No changes to the chat UI layout, streaming UX, or item linking (beyond model name display)
- No changes to the command palette search UX
- No changes to boot sequence, ECG, or login phases
- No new backend or server-side components
- Not optimising for adversarial/trick questions — focus is on legitimate CV queries
- No keeping Gemini as a fallback — this is a full replacement
## Technical Considerations
- **OpenRouter API format**: Uses OpenAI-compatible chat completions endpoint (`POST https://openrouter.ai/api/v1/chat/completions`). Messages use `{ role: 'system' | 'user' | 'assistant', content: string }` format. Streaming uses `stream: true` with SSE `data:` lines containing `choices[0].delta.content`.
- **Authentication**: `Authorization: Bearer <VITE_OPEN_ROUTER_API_KEY>` header. Include `HTTP-Referer` and `X-Title` headers as recommended by OpenRouter.
- **Rate limits**: OpenRouter has per-model rate limits. Add retry logic for 429 responses. The benchmark script should include delays between calls.
- **Embedding regeneration**: Needs a Node.js script that loads the ONNX model and processes all texts. The existing `scripts/generate-embeddings` script should be reused.
- **Temperature**: Current 0.7 may introduce variability in answers. Consider lowering to 0.3-0.5 for more consistent factual responses. Benchmark both.
- **Max tokens**: Current 512 may truncate detailed answers. Consider increasing to 768 or 1024 for benchmark testing.
- **Prompt structure**: Well-structured prompts with clear headings/sections parse better for LLMs than flat text. Consider markdown structure in system prompt.
- **CORS**: OpenRouter supports browser-side calls. The existing client-side fetch pattern should work without changes.
## Success Metrics
- 18/20 or higher on benchmark (90%+ accuracy)
- No question scores 0 (no factual errors)
- 5/5 general validation questions pass
- System prompt remains under 8KB
- No typecheck or lint regressions
- Embedding regeneration completes without errors
- Chat streaming works in-browser with OpenRouter
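The 8KB budget is a byte limit, so it should be measured on the UTF-8 encoding, not on `string.length`. Characters such as `£` occupy two bytes. A minimal sketch, assuming "8KB" means 8 × 1024 bytes and that "under" is strict. `promptSizeBytes` and `isPromptWithinBudget` are hypothetical helpers.

```typescript
// Assumption: 8KB is read as 8192 bytes.
const MAX_PROMPT_BYTES = 8 * 1024;

// Size of the prompt once encoded as UTF-8.
function promptSizeBytes(prompt: string): number {
  return new TextEncoder().encode(prompt).length;
}

// "Under 8KB" is treated as strictly less than the limit.
function isPromptWithinBudget(prompt: string): boolean {
  return promptSizeBytes(prompt) < MAX_PROMPT_BYTES;
}
```

The benchmark script could log `promptSizeBytes(...)` on each iteration so prompt growth is visible alongside the scores.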
## Resolved Questions
- **Model provider**: OpenRouter with z-ai/glm-5 (replaces Gemini 3 Flash).
- **File naming**: `gemini.ts` renamed to `llm.ts` for provider-agnostic naming.
- **Benchmark provider**: OpenRouter used for both chat answers and scoring (single provider).
- **Benchmark results are git-tracked.** Each iteration's scores are committed so improvement over time is visible and auditable.
- **A `scripts/generate-embeddings` script already exists.** Review and adapt as needed rather than building from scratch.
- **Benchmark harness is permanent.** Kept as an ongoing regression test (`npm run benchmark`) for validating LLM accuracy after any data or prompt changes. Question set can be expanded over time.