PRD: Improve LLM CV Knowledge Accuracy
Introduction
The portfolio's AI chat gives inaccurate or shallow answers about Andy's work history. The root cause: the system prompt feeds buildEmbeddingTexts() summaries rather than the full CV detail. Questions about specific achievements, methodology, clinical specialties, or cross-role context produce vague or incorrect responses. This PRD defines an iterative improvement process: enrich the LLM's context, measure accuracy against 10 verifiable benchmark questions, and repeat until all pass — while ensuring changes are structural (not question-specific hacks).
Additionally, the LLM provider is changing from Gemini to OpenRouter using the z-ai/glm-5 model. This requires migrating the API integration, renaming the module, and updating the benchmark harness to use the new provider.
Goals
- Achieve 10/10 accuracy on benchmark questions with factually correct, detailed, citation-worthy answers
- Ensure improvements are structural — benefiting all possible queries, not just the 10 benchmarks
- Maintain the existing architecture (no new APIs beyond OpenRouter, no RAG infrastructure, no backend)
- Migrate from Gemini to OpenRouter (z-ai/glm-5) for both production chat and benchmark scoring
- Regenerate embeddings when embedding texts change, keeping search and LLM context in sync
Benchmark Questions
These 10 questions have verifiable answers from CV_v4.md and the structured data files. Each tests a different knowledge gap.
| # | Question | Expected Answer (summary) | Tests |
|---|---|---|---|
| Q1 | "How many years has Andy been employed by the NHS?" | ~3.5 years (May 2022–present). Tesco was private sector. | NHS vs non-NHS employer distinction |
| Q2 | "What was Andy's involvement with tirzepatide?" | Supported NICE TA1026 commissioning, authored executive paper advocating primary care model, drove GP-led delivery. | Deep role-specific detail |
| Q3 | "What specific tools and software has Andy built?" | 5 projects: switching algorithm, Blueteq generator, CD monitoring, Sankey tool, PharMetrics. Each with outcomes. | Cross-role aggregation |
| Q4 | "What were Andy's A-level subjects and grades?" | Maths A*, Chemistry B, Politics C. Highworth Grammar School, 2009–2011. | Specific education detail |
| Q5 | "Was Andy's Tesco role part of the NHS?" | No. Tesco PLC is private. Community pharmacy, not NHS employment. LPC representative for Norfolk. | Employer classification |
| Q6 | "How did the patient switching algorithm work?" | Python, real-world GP data, auto-identified patients for alternatives, 3 days vs months manual, 14,000 patients, £2.6M, novel GP payment system. | Methodology depth |
| Q7 | "What clinical specialties has Andy worked across?" | Rheumatology, ophthalmology (wet AMD, DMO, RVO), dermatology, gastroenterology, neurology, migraine — from high-cost drugs role. | Narrative detail not in bullet summaries |
| Q8 | "What is Andy's experience with the dm+d?" | Created comprehensive medicines data table integrating all dm+d products with standardised strengths, morphine equivalents, Anticholinergic Burden scoring — single source of truth. | Technical achievement context |
| Q9 | "What budget does Andy manage and how?" | £220M prescribing budget. Forecasting models, variance analysis, financial reporting to executive team, interactive expenditure dashboard. | Figure + methodology |
| Q10 | "What leadership training does Andy have?" | Mary Seacole Programme (2018, 78%). Also national induction programme at Tesco, NVQ3 supervision. | Cross-role synthesis |
Scoring Criteria
Each question scored 0–2:
- 0 — Incorrect: Wrong facts, invented detail, or contradicts CV
- 1 — Partial: Correct but missing key detail, or vague where specifics are available
- 2 — Accurate: Factually correct, appropriately detailed, cites specific achievements/metrics
Pass threshold: 18/20 (90%), with no question scoring 0.
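
The pass rule combines a total with a per-question floor, so a run can reach 18/20 and still fail. The following is a minimal sketch of that rule; `evaluateBenchmark` and `BenchmarkVerdict` are hypothetical names for illustration, not part of the existing harness.

```typescript
// Hypothetical types and helper; names are illustrative, not taken from scripts/benchmark.ts.
type Score = 0 | 1 | 2;

interface BenchmarkVerdict {
  total: number;        // sum of all question scores
  max: number;          // 2 points per question
  zeroScored: string[]; // ids of questions that scored 0
  passed: boolean;
}

function evaluateBenchmark(
  scores: Record<string, Score>,
  threshold = 18,
): BenchmarkVerdict {
  const entries = Object.entries(scores);
  const total = entries.reduce<number>((sum, [, score]) => sum + score, 0);
  const zeroScored = entries.filter(([, score]) => score === 0).map(([id]) => id);
  return {
    total,
    max: entries.length * 2,
    zeroScored,
    // Both conditions must hold: threshold reached and no question scored 0.
    passed: total >= threshold && zeroScored.length === 0,
  };
}
```

For example, nine questions scoring 2 and one scoring 0 total 18/20 but still fail, because of the zero.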
Anti-Benchmaxing Rules
- No hardcoded answers or question-specific prompt clauses
- Every change must be a structural improvement (richer context, better prompt patterns, enriched embeddings)
- After each iteration, mentally evaluate: "Would this help a question NOT in the benchmark?" — if no, reject the change
- The system prompt must not reference benchmark questions or their specific phrasings
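
The last rule lends itself to an automated check. The sketch below flags benchmark questions whose wording appears in the system prompt; `findBenchmarkLeaks` is a hypothetical helper, and it only catches near-verbatim phrasing, so it complements rather than replaces the manual review described above.

```typescript
// Hypothetical guard: returns the benchmark questions whose wording appears in the prompt.
function findBenchmarkLeaks(systemPrompt: string, questions: string[]): string[] {
  // Normalise case, punctuation and whitespace so trivial reformatting still matches.
  const normalise = (text: string): string =>
    text.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
  const haystack = normalise(systemPrompt);
  return questions.filter((question) => {
    const needle = normalise(question);
    // Skip empty questions, which would otherwise match any prompt.
    return needle.length > 0 && haystack.includes(needle);
  });
}
```

An empty result does not prove the prompt is free of question-specific clauses; a paraphrased hint would pass this check unnoticed.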
User Stories
US-001: Migrate production chat from Gemini to OpenRouter
Description: As a developer, I need to replace the Gemini API integration with OpenRouter so the chat uses z-ai/glm-5.
Acceptance Criteria:
- Rename `src/lib/gemini.ts` → `src/lib/llm.ts`
- Update all imports across the codebase (`ChatWidget.tsx`, `search.ts`, etc.)
- Replace Gemini API calls with OpenRouter's OpenAI-compatible API (`https://openrouter.ai/api/v1/chat/completions`)
- Model set to `z-ai/glm-5`
- API key read from `VITE_OPEN_ROUTER_API_KEY` env var
- SSE streaming still works (OpenRouter supports `stream: true`)
- System prompt and message format adapted to OpenAI chat completions format (`messages` array with `role`/`content`)
- Export updated display name constant (e.g., `LLM_DISPLAY_NAME = 'GLM-5'`) and update model indicator in chat UI
- Rename `isGeminiAvailable()` → `isLLMAvailable()` (or similar)
- Typecheck passes
- Verify in browser: chat opens, sends a message, streams a response
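
As an illustration of the target request shape, the sketch below assembles an OpenAI-style chat completions request for OpenRouter. `buildChatRequest`, `ChatMessage`, and the `appInfo` parameter are hypothetical names; the real `llm.ts` may be organised differently, and the referer and title values are placeholders supplied by the caller.

```typescript
// Hypothetical shapes; the real llm.ts may organise this differently.
type ChatRole = 'system' | 'user' | 'assistant';

interface ChatMessage {
  role: ChatRole;
  content: string;
}

interface ChatRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

const OPENROUTER_URL = 'https://openrouter.ai/api/v1/chat/completions';
const LLM_MODEL = 'z-ai/glm-5';

function buildChatRequest(
  apiKey: string,
  systemPrompt: string,
  history: ChatMessage[],
  userMessage: string,
  appInfo: { referer: string; title: string },
): ChatRequest {
  // The system prompt travels as the first message, followed by prior turns.
  const messages: ChatMessage[] = [
    { role: 'system', content: systemPrompt },
    ...history,
    { role: 'user', content: userMessage },
  ];
  return {
    url: OPENROUTER_URL,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
        'HTTP-Referer': appInfo.referer,
        'X-Title': appInfo.title,
      },
      body: JSON.stringify({ model: LLM_MODEL, messages, stream: true }),
    },
  };
}

// Usage (not executed here): const { url, init } = buildChatRequest(...);
// const response = await fetch(url, init);
```

Keeping request construction separate from the `fetch` call means the same builder could be shared by the chat and the benchmark script, which would keep both on one prompt path.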
US-002: Migrate benchmark script to OpenRouter
Description: As a developer, I need the benchmark harness to use OpenRouter so it tests the same model and prompt path as production.
Acceptance Criteria:
- `scripts/benchmark.ts` uses OpenRouter API instead of Gemini
- API key read from `VITE_OPEN_ROUTER_API_KEY` (loaded from `.env`)
- Request format uses OpenAI chat completions structure
- Model identifier set to `z-ai/glm-5`
- Rate limit/retry logic updated for OpenRouter's error responses
- Scoring calls also use OpenRouter (same provider for all LLM calls)
- `npm run benchmark` still works end-to-end
- Typecheck passes
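
One way the retry logic could look is sketched below, using exponential backoff on rate-limited responses. All names (`backoffDelayMs`, `isRetryable`, `callWithRetry`) are hypothetical, and the attempt count, delays, and the choice to retry 5xx responses are assumptions rather than OpenRouter guidance.

```typescript
// Hypothetical retry helpers; attempt counts and delays are assumptions.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  // attempt 0 -> baseMs, attempt 1 -> 2 * baseMs, and so on, capped at capMs.
  return Math.min(capMs, baseMs * 2 ** attempt);
}

function isRetryable(status: number): boolean {
  // 429 signals rate limiting; treating 5xx as transient is an assumption.
  return status === 429 || status >= 500;
}

async function callWithRetry<T extends { ok: boolean; status: number }>(
  send: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let response = await send();
  // One initial attempt plus up to (maxAttempts - 1) retries.
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    if (response.ok || !isRetryable(response.status)) break;
    await sleep(backoffDelayMs(attempt));
    response = await send();
  }
  return response;
}
```

The `sleep` parameter is injectable so the harness can be exercised without real delays. A `fetch` response satisfies the `{ ok, status }` constraint, so `send` could simply wrap the request call.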
US-003: Enrich system prompt with full CV context
Description: As a portfolio visitor, I want the AI to have comprehensive knowledge of Andy's background so it can answer detailed questions accurately.
Acceptance Criteria:
- System prompt includes full professional profile narrative (from CV_v4.md profile section)
- Each role includes full achievement bullets, not just summaries
- Clear distinction between NHS employment (May 2022+) and private sector (Tesco)
- Clinical specialties, methodology details, and specific outcomes included
- Education includes specific grades, subjects, research topics
- Prompt is well-structured with clear sections for easy LLM parsing
- No invented or extrapolated content — everything sourced from CV_v4.md and data files
- Typecheck passes
US-004: Improve system prompt instructions
Description: As a portfolio visitor, I want the AI to use its knowledge effectively — citing specifics, distinguishing between employers, and aggregating across roles when asked.
Acceptance Criteria:
- Prompt instructs LLM to distinguish NHS employment from private sector roles
- Prompt instructs LLM to aggregate across roles when asked broad questions (e.g., "what tools has Andy built?")
- Prompt instructs LLM to cite specific metrics, dates, and outcomes when available
- Temperature and token limits are appropriate for detailed answers (review current 0.7 temp, 512 max tokens)
- Typecheck passes
US-005: Enrich embedding texts for semantic search
Description: As a portfolio visitor, I want semantic search to surface relevant results even for nuanced queries so the chat and command palette find the right content.
Acceptance Criteria:
- `buildEmbeddingTexts()` generates richer text per item: full achievement narratives, methodology detail, clinical specialties
- Role `history` narratives are included (currently only `examination` bullets and `codedEntries`)
- Cross-references included where items relate (e.g., CD monitoring links to controlled drugs skill)
- Embedding texts remain well-formed natural language (not keyword soup)
- Typecheck passes
US-006: Regenerate embeddings
Description: As a developer, I need embeddings regenerated whenever embedding texts change so semantic search results match the enriched content.
Acceptance Criteria:
- Embeddings regenerated using the same model (all-MiniLM-L6-v2)
- Output written to `src/data/embeddings.json`
- Number of embeddings matches number of palette items
- Regeneration can be triggered via script (`npm run generate-embeddings` or similar)
- Typecheck passes
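
The count criterion could be verified after each regeneration with a check like the sketch below. The `EmbeddingEntry` shape is an assumption about how `embeddings.json` is laid out, and `validateEmbeddings` is a hypothetical helper; the 384-dimension check reflects the output size of all-MiniLM-L6-v2.

```typescript
// Assumed shape of one entry in embeddings.json; the real file may differ.
interface EmbeddingEntry {
  id: string;
  embedding: number[];
}

const MINILM_DIMENSIONS = 384; // output size of all-MiniLM-L6-v2

function validateEmbeddings(
  paletteItemIds: string[],
  entries: EmbeddingEntry[],
): string[] {
  const problems: string[] = [];
  if (entries.length !== paletteItemIds.length) {
    problems.push(
      `expected ${paletteItemIds.length} embeddings, found ${entries.length}`,
    );
  }
  // Every palette item should have a matching embedding.
  const embeddedIds = new Set(entries.map((entry) => entry.id));
  for (const id of paletteItemIds) {
    if (!embeddedIds.has(id)) problems.push(`missing embedding for "${id}"`);
  }
  // Every vector should have the dimensionality the model produces.
  for (const entry of entries) {
    if (entry.embedding.length !== MINILM_DIMENSIONS) {
      problems.push(`"${entry.id}" has ${entry.embedding.length} dimensions`);
    }
  }
  return problems;
}
```

An empty array means the checks passed; any returned strings describe what to fix before committing the regenerated file.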
US-007: Iterative benchmark loop
Description: As a developer, I want to run the benchmark, review scores, make improvements, and repeat until the pass threshold is met.
Acceptance Criteria:
- Run benchmark → review scores → identify failing questions → make structural improvements → repeat
- Each iteration logged with: changes made, scores before/after, rationale
- Minimum 2 iterations, maximum 10
- Stop when 18/20 achieved with no question scoring 0
- Final iteration results saved as evidence
- All changes pass typecheck before benchmarking
US-008: Validate no regression on general queries
Description: As a portfolio visitor, I want the AI to still handle general questions well after the benchmark-focused improvements.
Acceptance Criteria:
- Test 5 general questions not in the benchmark (e.g., "Tell me about Andy", "What does Andy do?", "How can I contact Andy?", "What is this website?", "What are Andy's strongest skills?")
- All general questions produce sensible, accurate responses
- No degradation in response quality for broad queries
- System prompt size hasn't grown to a point that degrades response speed noticeably
Functional Requirements
- FR-1: Production chat must use OpenRouter API with model `z-ai/glm-5`
- FR-2: API key sourced from `VITE_OPEN_ROUTER_API_KEY` environment variable
- FR-3: LLM module renamed from `gemini.ts` to `llm.ts` with updated exports
- FR-4: Chat UI displays "GLM-5" as the model indicator (replacing "Gemini 3 Flash")
- FR-5: Benchmark harness must use the identical system prompt construction path as production (`buildSystemPrompt()` from `llm.ts`)
- FR-6: System prompt changes must be made in `llm.ts` and/or `search.ts`, the same files that serve production
- FR-7: Embedding text changes must be in `buildEmbeddingTexts()` in `search.ts`
- FR-8: Scoring must be automated via LLM (OpenRouter), not manual review
- FR-9: All benchmark artifacts (questions, expected answers, results) stored in `scripts/`
- FR-10: Embedding regeneration must produce deterministic output for the same input texts
- FR-11: System prompt must remain a single self-contained context block (no external retrieval at runtime)
Non-Goals
- No RAG infrastructure or vector database
- No additional API integrations beyond OpenRouter
- No changes to the chat UI layout, streaming UX, or item linking (beyond model name display)
- No changes to the command palette search UX
- No changes to boot sequence, ECG, or login phases
- No new backend or server-side components
- Not optimising for adversarial/trick questions — focus is on legitimate CV queries
- No keeping Gemini as a fallback — this is a full replacement
Technical Considerations
- OpenRouter API format: Uses the OpenAI-compatible chat completions endpoint (`POST https://openrouter.ai/api/v1/chat/completions`). Messages use `{ role: 'system' | 'user' | 'assistant', content: string }` format. Streaming uses `stream: true` with SSE `data:` lines containing `choices[0].delta.content`.
- Authentication: `Authorization: Bearer <VITE_OPEN_ROUTER_API_KEY>` header. Include `HTTP-Referer` and `X-Title` headers as recommended by OpenRouter.
- Rate limits: OpenRouter has per-model rate limits. Add retry logic for 429 responses. The benchmark script should include delays between calls.
- Embedding regeneration: Needs a Node.js script that loads the ONNX model and processes all texts. The existing `scripts/generate-embeddings` script should be reused.
- Temperature: Current 0.7 may introduce variability in answers. Consider lowering to 0.3–0.5 for more consistent factual responses. Benchmark both.
- Max tokens: Current 512 may truncate detailed answers. Consider increasing to 768 or 1024 for benchmark testing.
- Prompt structure: Well-structured prompts with clear headings/sections parse better for LLMs than flat text. Consider markdown structure in system prompt.
- CORS: OpenRouter supports browser-side calls. The existing client-side fetch pattern should work without changes.
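
The streaming format described in the first point can be handled with a small incremental parser. This is a sketch under stated assumptions: `createSseDeltaParser` is a hypothetical name, each event is assumed to arrive as a single `data:` line, and the stream is assumed to end with the `data: [DONE]` sentinel used by OpenAI-compatible APIs.

```typescript
// Hypothetical parser for the streamed response body.
// Feed it decoded text chunks; it returns the content deltas completed so far.
function createSseDeltaParser(): (chunk: string) => string[] {
  let buffer = '';
  return (chunk: string): string[] => {
    buffer += chunk;
    const lines = buffer.split('\n');
    // Keep a trailing partial line for the next chunk.
    buffer = lines.pop() ?? '';
    const deltas: string[] = [];
    for (const rawLine of lines) {
      const line = rawLine.trim();
      // Skips blank separator lines and ":" comment lines.
      if (!line.startsWith('data:')) continue;
      const payload = line.slice('data:'.length).trim();
      if (payload === '[DONE]') continue;
      try {
        const delta = JSON.parse(payload)?.choices?.[0]?.delta?.content;
        if (typeof delta === 'string' && delta.length > 0) deltas.push(delta);
      } catch {
        // Ignore a malformed line rather than aborting the whole stream.
      }
    }
    return deltas;
  };
}
```

Buffering matters because network chunks do not align with event boundaries: a JSON payload can be split across two reads, and parsing each chunk in isolation would drop text.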
Success Metrics
- 18/20 or higher on benchmark (90%+ accuracy)
- No question scores 0 (no factual errors)
- 5/5 general validation questions pass
- System prompt remains under 8KB
- No typecheck or lint regressions
- Embedding regeneration completes without errors
- Chat streaming works in-browser with OpenRouter
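
The prompt size metric is easy to check mechanically. The sketch below measures the UTF-8 byte length of the prompt; the helper names are hypothetical, and interpreting "8KB" as 8,192 bytes is an assumption.

```typescript
// Assumes "8KB" means 8 * 1024 bytes of UTF-8 encoded text.
const MAX_PROMPT_BYTES = 8 * 1024;

// Hypothetical helper: byte length rather than character count,
// because characters such as "£" take more than one byte in UTF-8.
function promptSizeBytes(prompt: string): number {
  return new TextEncoder().encode(prompt).length;
}

function isPromptWithinBudget(prompt: string): boolean {
  return promptSizeBytes(prompt) < MAX_PROMPT_BYTES;
}
```

Running this against the output of `buildSystemPrompt()` in the benchmark script would surface a size regression in the same run that reports accuracy.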
Resolved Questions
- Model provider: OpenRouter with z-ai/glm-5 (replaces Gemini 3 Flash).
- File naming: `gemini.ts` renamed to `llm.ts` for provider-agnostic naming.
- Benchmark provider: OpenRouter used for both chat answers and scoring (single provider).
- Benchmark results are git-tracked. Each iteration's scores are committed so improvement over time is visible and auditable.
- A `scripts/generate-embeddings` script already exists. Review and adapt as needed rather than building from scratch.
- Benchmark harness is permanent. Kept as an ongoing regression test (`npm run benchmark`) for validating LLM accuracy after any data or prompt changes. Question set can be expanded over time.