portfolio/tasks/prd-llm-cv-knowledge.md
2026-02-15 23:20:24 +00:00


PRD: Improve LLM CV Knowledge Accuracy

Introduction

The portfolio's AI chat gives inaccurate or shallow answers about Andy's work history. The root cause is that the system prompt is built from buildEmbeddingTexts() summaries rather than the full CV detail, so questions about specific achievements, methodology, clinical specialties, or cross-role context produce vague or incorrect responses. This PRD defines an iterative improvement process: enrich the LLM's context, measure accuracy against 10 verifiable benchmark questions, and repeat until all pass, while ensuring changes are structural (not question-specific hacks).

Additionally, the LLM provider is changing from Gemini to OpenRouter using the z-ai/glm-5 model. This requires migrating the API integration, renaming the module, and updating the benchmark harness to use the new provider.

Goals

  • Achieve 10/10 accuracy on benchmark questions with factually correct, detailed, citation-worthy answers
  • Ensure improvements are structural — benefiting all possible queries, not just the 10 benchmarks
  • Maintain the existing architecture (no new APIs beyond OpenRouter, no RAG infrastructure, no backend)
  • Migrate from Gemini to OpenRouter (z-ai/glm-5) for both production chat and benchmark scoring
  • Regenerate embeddings when embedding texts change, keeping search and LLM context in sync

Benchmark Questions

These 10 questions have verifiable answers from CV_v4.md and the structured data files. Each tests a different knowledge gap.

| # | Question | Expected Answer (summary) | Tests |
|---|---|---|---|
| Q1 | "How many years has Andy been employed by the NHS?" | ~3.5 years (May 2022–present). Tesco was private sector. | NHS vs non-NHS employer distinction |
| Q2 | "What was Andy's involvement with tirzepatide?" | Supported NICE TA1026 commissioning, authored executive paper advocating primary care model, drove GP-led delivery. | Deep role-specific detail |
| Q3 | "What specific tools and software has Andy built?" | 5 projects: switching algorithm, Blueteq generator, CD monitoring, Sankey tool, PharMetrics. Each with outcomes. | Cross-role aggregation |
| Q4 | "What were Andy's A-level subjects and grades?" | Maths A*, Chemistry B, Politics C. Highworth Grammar School, 2009–2011. | Specific education detail |
| Q5 | "Was Andy's Tesco role part of the NHS?" | No. Tesco PLC is private. Community pharmacy, not NHS employment. LPC representative for Norfolk. | Employer classification |
| Q6 | "How did the patient switching algorithm work?" | Python, real-world GP data, auto-identified patients for alternatives, 3 days vs months manual, 14,000 patients, £2.6M, novel GP payment system. | Methodology depth |
| Q7 | "What clinical specialties has Andy worked across?" | Rheumatology, ophthalmology (wet AMD, DMO, RVO), dermatology, gastroenterology, neurology, migraine; all from the high-cost drugs role. | Narrative detail not in bullet summaries |
| Q8 | "What is Andy's experience with the dm+d?" | Created comprehensive medicines data table integrating all dm+d products with standardised strengths, morphine equivalents, Anticholinergic Burden scoring; a single source of truth. | Technical achievement context |
| Q9 | "What budget does Andy manage and how?" | £220M prescribing budget. Forecasting models, variance analysis, financial reporting to executive team, interactive expenditure dashboard. | Figure + methodology |
| Q10 | "What leadership training does Andy have?" | Mary Seacole Programme (2018, 78%). Also national induction programme at Tesco, NVQ3 supervision. | Cross-role synthesis |

Scoring Criteria

Each question is scored 0–2:

  • 0 — Incorrect: Wrong facts, invented detail, or contradicts CV
  • 1 — Partial: Correct but missing key detail, or vague where specifics are available
  • 2 — Accurate: Factually correct, appropriately detailed, cites specific achievements/metrics

Pass threshold: 18/20 (90%), with no question scoring 0.
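
The pass rule combines two conditions, so a run can reach 18/20 and still fail if any single question scored 0. A minimal sketch of that check follows; the function name and the Score type are hypothetical and not taken from the existing benchmark script.

```typescript
// Hypothetical helper: the PRD does not name a function for this check.
type Score = 0 | 1 | 2;

function meetsPassThreshold(scores: Score[], minTotal = 18): boolean {
  // Sum the per-question scores (10 questions, maximum 20).
  const total = scores.reduce<number>((sum, s) => sum + s, 0);
  // A single 0 fails the run regardless of the total.
  const hasZero = scores.some((s) => s === 0);
  return total >= minTotal && !hasZero;
}
```

For example, nine questions at 2 and one at 0 total 18 but do not pass, whereas eight at 2 and two at 1 also total 18 and do pass.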

Anti-Benchmaxing Rules

  • No hardcoded answers or question-specific prompt clauses
  • Every change must be a structural improvement (richer context, better prompt patterns, enriched embeddings)
  • After each iteration, mentally evaluate: "Would this help a question NOT in the benchmark?" — if no, reject the change
  • The system prompt must not reference benchmark questions or their specific phrasings
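
The last rule can be partly automated. The sketch below flags any benchmark question whose wording appears verbatim (ignoring case and whitespace) in the system prompt. It is an assumption about how such a guard could look, with hypothetical names, and it only catches exact phrasings; paraphrased question-specific clauses still need human review.

```typescript
// Hypothetical guard; neither name exists in the project.
function findLeakedPhrasings(
  systemPrompt: string,
  benchmarkQuestions: string[],
): string[] {
  // Lower-case and collapse whitespace so formatting differences do not hide a match.
  const normalise = (s: string): string =>
    s.toLowerCase().replace(/\s+/g, " ").trim();
  const haystack = normalise(systemPrompt);
  // Return every question that occurs verbatim inside the prompt.
  return benchmarkQuestions.filter(
    (q) => haystack.indexOf(normalise(q)) !== -1,
  );
}
```

A benchmark run could refuse to start when the returned array is non-empty.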

User Stories

US-001: Migrate production chat from Gemini to OpenRouter

Description: As a developer, I need to replace the Gemini API integration with OpenRouter so the chat uses z-ai/glm-5.

Acceptance Criteria:

  • Rename src/lib/gemini.ts → src/lib/llm.ts
  • Update all imports across the codebase (ChatWidget.tsx, search.ts, etc.)
  • Replace Gemini API calls with OpenRouter's OpenAI-compatible API (https://openrouter.ai/api/v1/chat/completions)
  • Model set to z-ai/glm-5
  • API key read from VITE_OPEN_ROUTER_API_KEY env var
  • SSE streaming still works (OpenRouter supports stream: true)
  • System prompt and message format adapted to OpenAI chat completions format (messages array with role/content)
  • Export updated display name constant (e.g., LLM_DISPLAY_NAME = 'GLM-5') and update model indicator in chat UI
  • Rename isGeminiAvailable() → isLLMAvailable() (or similar)
  • Typecheck passes
  • Verify in browser: chat opens, sends a message, streams a response
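
As a rough illustration of the request shape these criteria describe, the sketch below builds the URL, headers, and JSON body for a streaming chat completion. The function and interface names are hypothetical, the HTTP-Referer and X-Title values are placeholders, and the defaults mirror the current 0.7 temperature and 512 max tokens. In the Vite app the key would presumably come from import.meta.env.VITE_OPEN_ROUTER_API_KEY; here it is passed in so the function stays pure.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Hypothetical builder: returns everything needed for fetch(url, { method: "POST", headers, body }).
function buildChatRequest(
  apiKey: string,
  systemPrompt: string,
  history: ChatMessage[],
  opts: { temperature: number; maxTokens: number } = { temperature: 0.7, maxTokens: 512 },
): ChatRequest {
  // The system prompt becomes the first message in the OpenAI-style messages array.
  const messages: ChatMessage[] = [
    { role: "system", content: systemPrompt },
    ...history,
  ];
  return {
    url: "https://openrouter.ai/api/v1/chat/completions",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
      // Placeholder attribution values; replace with the real site URL and title.
      "HTTP-Referer": "https://example.com",
      "X-Title": "Portfolio chat",
    },
    body: JSON.stringify({
      model: "z-ai/glm-5",
      messages,
      stream: true,
      temperature: opts.temperature,
      max_tokens: opts.maxTokens,
    }),
  };
}
```

Keeping request construction separate from the fetch call would let the benchmark script and the production chat share one builder.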

US-002: Migrate benchmark script to OpenRouter

Description: As a developer, I need the benchmark harness to use OpenRouter so it tests the same model and prompt path as production.

Acceptance Criteria:

  • scripts/benchmark.ts uses OpenRouter API instead of Gemini
  • API key read from VITE_OPEN_ROUTER_API_KEY (loaded from .env)
  • Request format uses OpenAI chat completions structure
  • Model identifier set to z-ai/glm-5
  • Rate limit/retry logic updated for OpenRouter's error responses
  • Scoring calls also use OpenRouter (same provider for all LLM calls)
  • npm run benchmark still works end-to-end
  • Typecheck passes
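
One way the retry criterion could be met is exponential backoff on HTTP 429. The sketch below is an assumption rather than the existing script's logic: the names, the 1 s base delay, the 30 s cap, and the three-retry limit are all illustrative. The send callback stands in for whatever performs the actual request.

```typescript
// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

// Calls `send` until it returns a non-429 status or the retry budget is spent.
async function withRetry<T extends { status: number }>(
  send: () => Promise<T>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const res = await send();
    // Return on success, on any non-rate-limit error, or once retries run out.
    if (res.status !== 429 || attempt >= maxRetries) return res;
    await sleep(backoffDelayMs(attempt));
  }
}
```

Because sleep is injectable, the wrapper can be exercised in tests without real delays.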

US-003: Enrich system prompt with full CV context

Description: As a portfolio visitor, I want the AI to have comprehensive knowledge of Andy's background so it can answer detailed questions accurately.

Acceptance Criteria:

  • System prompt includes full professional profile narrative (from CV_v4.md profile section)
  • Each role includes full achievement bullets, not just summaries
  • Clear distinction between NHS employment (May 2022+) and private sector (Tesco)
  • Clinical specialties, methodology details, and specific outcomes included
  • Education includes specific grades, subjects, research topics
  • Prompt is well-structured with clear sections for easy LLM parsing
  • No invented or extrapolated content — everything sourced from CV_v4.md and data files
  • Typecheck passes
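
To make "well-structured with clear sections" concrete, the following sketch renders a profile narrative and a list of roles into a sectioned prompt, with the employer sector stated explicitly on every role. The Role shape and both function names are hypothetical; the real data files and buildSystemPrompt() may be organised differently.

```typescript
// Hypothetical shape; the project's structured data may differ.
interface Role {
  title: string;
  employer: string;
  sector: "NHS" | "Private";
  period: string;
  achievements: string[];
}

// One role becomes a small section with explicit sector and period lines.
function renderRoleSection(role: Role): string {
  return [
    `### ${role.title}, ${role.employer}`,
    `Sector: ${role.sector}`,
    `Period: ${role.period}`,
    ...role.achievements.map((a) => `- ${a}`),
  ].join("\n");
}

// Sections are separated by blank lines so headings stand out to the model.
function buildPromptSketch(profile: string, roles: Role[]): string {
  return [
    "## Professional profile",
    profile,
    "## Roles",
    ...roles.map(renderRoleSection),
  ].join("\n\n");
}
```

Stating the sector on each role is one structural way to support the NHS versus private sector distinction without writing a clause for any particular question.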

US-004: Improve system prompt instructions

Description: As a portfolio visitor, I want the AI to use its knowledge effectively — citing specifics, distinguishing between employers, and aggregating across roles when asked.

Acceptance Criteria:

  • Prompt instructs LLM to distinguish NHS employment from private sector roles
  • Prompt instructs LLM to aggregate across roles when asked broad questions (e.g., "what tools has Andy built?")
  • Prompt instructs LLM to cite specific metrics, dates, and outcomes when available
  • Temperature and token limits are appropriate for detailed answers (review current 0.7 temp, 512 max tokens)
  • Typecheck passes

US-005: Enrich embedding texts

Description: As a portfolio visitor, I want semantic search to surface relevant results even for nuanced queries so the chat and command palette find the right content.

Acceptance Criteria:

  • buildEmbeddingTexts() generates richer text per item — full achievement narratives, methodology detail, clinical specialties
  • Role history narratives are included (currently only examination bullets and codedEntries)
  • Cross-references included where items relate (e.g., CD monitoring links to controlled drugs skill)
  • Embedding texts remain well-formed natural language (not keyword soup)
  • Typecheck passes

US-006: Regenerate embeddings

Description: As a developer, I need embeddings regenerated whenever embedding texts change so semantic search results match the enriched content.

Acceptance Criteria:

  • Embeddings regenerated using the same model (all-MiniLM-L6-v2)
  • Output written to src/data/embeddings.json
  • Number of embeddings matches number of palette items
  • Regeneration can be triggered via script (npm run generate-embeddings or similar)
  • Typecheck passes
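
The count criterion lends itself to a post-generation sanity check. The sketch below assumes embeddings.json holds an array of { id, vector } entries, which may not match the real file layout, and uses a hypothetical function name. The 384-dimension default reflects the output size of all-MiniLM-L6-v2.

```typescript
// Assumed entry shape; the real embeddings.json may be structured differently.
interface EmbeddingEntry {
  id: string;
  vector: number[];
}

// Returns a list of human-readable problems; an empty list means the file looks consistent.
function validateEmbeddings(
  itemIds: string[],
  embeddings: EmbeddingEntry[],
  dims = 384,
): string[] {
  const problems: string[] = [];
  if (embeddings.length !== itemIds.length) {
    problems.push(
      `expected ${itemIds.length} embeddings, found ${embeddings.length}`,
    );
  }
  // Every palette item should have an embedding with a matching id.
  const embeddedIds = new Set(embeddings.map((e) => e.id));
  for (const id of itemIds) {
    if (!embeddedIds.has(id)) problems.push(`missing embedding for ${id}`);
  }
  // Every vector should have the model's output dimensionality.
  for (const e of embeddings) {
    if (e.vector.length !== dims) {
      problems.push(`${e.id} has ${e.vector.length} dimensions, expected ${dims}`);
    }
  }
  return problems;
}
```

The generation script could exit with a non-zero status when this list is non-empty.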

US-007: Iterative benchmark loop

Description: As a developer, I want to run the benchmark, review scores, make improvements, and repeat until the pass threshold is met.

Acceptance Criteria:

  • Run benchmark → review scores → identify failing questions → make structural improvements → repeat
  • Each iteration logged with: changes made, scores before/after, rationale
  • Minimum 2 iterations, maximum 10
  • Stop when 18/20 achieved with no question scoring 0
  • Final iteration results saved as evidence
  • All changes pass typecheck before benchmarking
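
The stopping rule can be expressed as a small predicate over a logged iteration. This sketch reads "minimum 2 iterations" as meaning the loop continues to a second iteration even if the first one passes; that reading, the IterationLog shape, and the function name are assumptions.

```typescript
// Hypothetical log record covering the fields the criteria ask for.
interface IterationLog {
  iteration: number; // 1-based
  changes: string;
  rationale: string;
  scoresBefore: number[];
  scoresAfter: number[];
}

function shouldStop(
  log: IterationLog,
  minIterations = 2,
  maxIterations = 10,
): boolean {
  const total = log.scoresAfter.reduce((sum, s) => sum + s, 0);
  // Pass means at least 18/20 with no question scoring 0.
  const passed = total >= 18 && log.scoresAfter.every((s) => s > 0);
  // Hard stop at the iteration cap, whether or not the run passed.
  if (log.iteration >= maxIterations) return true;
  return passed && log.iteration >= minIterations;
}
```

Serialising each IterationLog to a file under scripts/ would also satisfy the requirement that results are saved as evidence.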

US-008: Validate no regression on general queries

Description: As a portfolio visitor, I want the AI to still handle general questions well after the benchmark-focused improvements.

Acceptance Criteria:

  • Test 5 general questions not in the benchmark (e.g., "Tell me about Andy", "What does Andy do?", "How can I contact Andy?", "What is this website?", "What are Andy's strongest skills?")
  • All general questions produce sensible, accurate responses
  • No degradation in response quality for broad queries
  • System prompt size hasn't grown to a point that degrades response speed noticeably

Functional Requirements

  • FR-1: Production chat must use OpenRouter API with model z-ai/glm-5
  • FR-2: API key sourced from VITE_OPEN_ROUTER_API_KEY environment variable
  • FR-3: LLM module renamed from gemini.ts to llm.ts with updated exports
  • FR-4: Chat UI displays "GLM-5" as the model indicator (replacing "Gemini 3 Flash")
  • FR-5: Benchmark harness must use the identical system prompt construction path as production (buildSystemPrompt() from llm.ts)
  • FR-6: System prompt changes must be made in llm.ts and/or search.ts — the same files that serve production
  • FR-7: Embedding text changes must be in buildEmbeddingTexts() in search.ts
  • FR-8: Scoring must be automated via LLM (OpenRouter), not manual review
  • FR-9: All benchmark artifacts (questions, expected answers, results) stored in scripts/
  • FR-10: Embedding regeneration must produce deterministic output for the same input texts
  • FR-11: System prompt must remain a single self-contained context block (no external retrieval at runtime)

Non-Goals

  • No RAG infrastructure or vector database
  • No additional API integrations beyond OpenRouter
  • No changes to the chat UI layout, streaming UX, or item linking (beyond model name display)
  • No changes to the command palette search UX
  • No changes to boot sequence, ECG, or login phases
  • No new backend or server-side components
  • Not optimising for adversarial/trick questions — focus is on legitimate CV queries
  • No keeping Gemini as a fallback — this is a full replacement

Technical Considerations

  • OpenRouter API format: Uses OpenAI-compatible chat completions endpoint (POST https://openrouter.ai/api/v1/chat/completions). Messages use { role: 'system' | 'user' | 'assistant', content: string } format. Streaming uses stream: true with SSE data: lines containing choices[0].delta.content.
  • Authentication: Authorization: Bearer <VITE_OPEN_ROUTER_API_KEY> header. Include HTTP-Referer and X-Title headers as recommended by OpenRouter.
  • Rate limits: OpenRouter has per-model rate limits. Add retry logic for 429 responses. The benchmark script should include delays between calls.
  • Embedding regeneration: Needs a Node.js script that loads the ONNX model and processes all texts. The existing scripts/generate-embeddings script should be reused.
  • Temperature: Current 0.7 may introduce variability in answers. Consider lowering to 0.3–0.5 for more consistent factual responses. Benchmark both.
  • Max tokens: Current 512 may truncate detailed answers. Consider increasing to 768 or 1024 for benchmark testing.
  • Prompt structure: Prompts with clear headings and sections are easier for LLMs to parse than flat text. Consider markdown structure in the system prompt.
  • CORS: OpenRouter supports browser-side calls. The existing client-side fetch pattern should work without changes.
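
The streaming format described in the first item can be handled with a small line parser. The sketch below pulls choices[0].delta.content out of each SSE data: line and stops at the OpenAI-style data: [DONE] terminator. The function name is hypothetical, and the code assumes it receives complete lines; a real reader would need to buffer partial lines across network chunks before calling it.

```typescript
// Extracts streamed text fragments from a block of complete SSE lines.
function extractDeltas(sseText: string): string[] {
  const deltas: string[] = [];
  for (const line of sseText.split("\n")) {
    const trimmed = line.trim();
    // Skip blank separators and SSE comment lines (those starting with ":").
    if (trimmed.indexOf("data:") !== 0) continue;
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") break;
    try {
      const parsed = JSON.parse(payload);
      const content = parsed?.choices?.[0]?.delta?.content;
      if (typeof content === "string" && content.length > 0) {
        deltas.push(content);
      }
    } catch {
      // Ignore payloads that are not valid JSON rather than aborting the stream.
    }
  }
  return deltas;
}
```

Joining the returned fragments in order reproduces the text streamed so far.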

Success Metrics

  • 18/20 or higher on benchmark (90%+ accuracy)
  • No question scores 0 (no factual errors)
  • 5/5 general validation questions pass
  • System prompt remains under 8KB
  • No typecheck or lint regressions
  • Embedding regeneration completes without errors
  • Chat streaming works in-browser with OpenRouter
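
The 8KB limit can be checked mechanically. The sketch below assumes the limit means 8 × 1024 bytes of UTF-8 rather than characters, which the PRD does not specify; the function names are hypothetical. The distinction matters slightly here because characters such as "£" take two bytes.

```typescript
// Size of the prompt in UTF-8 bytes (not characters).
function promptSizeBytes(prompt: string): number {
  return new TextEncoder().encode(prompt).length;
}

// Assumed interpretation: "under 8KB" means at most 8 * 1024 bytes.
function isWithinPromptBudget(prompt: string, maxBytes = 8 * 1024): boolean {
  return promptSizeBytes(prompt) <= maxBytes;
}
```

Running this against the output of buildSystemPrompt() in the benchmark script would surface prompt growth on every iteration.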

Resolved Questions

  • Model provider: OpenRouter with z-ai/glm-5 (replaces Gemini 3 Flash).
  • File naming: gemini.ts renamed to llm.ts for provider-agnostic naming.
  • Benchmark provider: OpenRouter used for both chat answers and scoring (single provider).
  • Benchmark results are git-tracked. Each iteration's scores are committed so improvement over time is visible and auditable.
  • A scripts/generate-embeddings script already exists. Review and adapt it as needed rather than building from scratch.
  • Benchmark harness is permanent. Kept as an ongoing regression test (npm run benchmark) for validating LLM accuracy after any data or prompt changes. Question set can be expanded over time.