feat: US-017 - Improve system prompt instructions and LLM parameters

2026-02-16 00:42:58 +00:00
parent 194f83f490
commit f0870cf320
4 changed files with 35 additions and 19 deletions
@@ -333,7 +333,7 @@
        "Typecheck passes"
      ],
      "priority": 17,
-      "passes": false,
+      "passes": true,
      "notes": "These are behavioral instructions that go in the Rules section of the system prompt. Keep them concise — LLMs follow shorter, clearer rules better than long paragraphs. Consider: '1. Distinguish NHS employment (May 2022–present, ICB) from private sector (Tesco PLC). 2. When asked about tools/skills across career, aggregate from ALL roles. 3. Cite specific numbers, dates, and outcomes — never say approximate when exact figures are available. 4. If the answer isn't in the context, say so clearly.' Temperature and maxTokens are set in the API request config, not the prompt."
    },
    {
@@ -396,3 +396,23 @@
  - System prompt no longer depends on `buildEmbeddingTexts()` — the CV context is hardcoded. This means prompt content and embedding texts can diverge (prompt is optimised for Q&A, embeddings for semantic search)
  - When the prompt is close to the 8KB limit, trim verbose connecting phrases and redundant qualifiers first — the specific facts and numbers are what matter for accuracy
 ---
+
+## 2026-02-16 - US-017
+- Improved Response Rules in system prompt (`src/lib/llm.ts`) with numbered, clearer behavioral instructions:
+  1. Explicit "I don't have that information" phrasing for missing data
+  2. Stronger employer distinction instruction with "Never conflate the two"
+  3. Aggregation instruction broadened to include "projects" alongside tools/skills/achievements
+  4. Explicit prohibition on "approximately" and "around" when exact figures exist
+  5. Adaptive length instruction: thorough for list/detail questions, concise for simple ones
+- Lowered temperature from 0.7 to 0.4 for more consistent factual responses
+- Increased max_tokens from 512 to 800 to avoid truncating detailed answers
+- Preserved [ITEMS: ...] suffix instruction unchanged
+- Mirrored identical changes in `scripts/benchmark.ts` (prompt, temperature defaults, max_tokens defaults)
+- Typecheck (0 errors), lint (0 errors), and production build all pass
+- Files changed: `src/lib/llm.ts`, `scripts/benchmark.ts`
+- **Learnings for future iterations:**
+  - Numbered rules in system prompts tend to be followed more reliably by LLMs than bullet points
+  - Temperature 0.4 seems a good balance for factual Q&A: low enough for consistency, high enough to avoid repetitive phrasing
+  - The benchmark script's `callLLM()` uses default params `temperature = 0.4, maxTokens = 800`, which match production. The scoring call overrides temperature to 0 so that scoring is as repeatable as possible
+  - The adaptive length rule ("thorough for detailed questions, concise for simple ones") replaces the fixed "2-4 sentences" rule — this should improve scores on questions requiring enumeration
+---
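
As an illustration of the parameter changes logged above, the following TypeScript sketch shows one way the new defaults and the scoring override could fit together. The names `buildRequest`, `ChatRequest`, and the message shape are hypothetical and are not taken from `src/lib/llm.ts` or `scripts/benchmark.ts`; only the values 0.4 and 800 and the temperature-0 scoring override come from the log.

```typescript
// Hypothetical request shape; the real one in src/lib/llm.ts may differ.
interface ChatRequest {
  messages: { role: "system" | "user"; content: string }[];
  temperature: number;
  max_tokens: number;
}

// Values recorded in the US-017 log entry (previously 0.7 and 512).
const DEFAULT_TEMPERATURE = 0.4;
const DEFAULT_MAX_TOKENS = 800;

// Builds the request body only; no network call is made here.
function buildRequest(
  systemPrompt: string,
  userContent: string,
  temperature: number = DEFAULT_TEMPERATURE,
  maxTokens: number = DEFAULT_MAX_TOKENS,
): ChatRequest {
  return {
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userContent },
    ],
    temperature,
    max_tokens: maxTokens,
  };
}

// Answering call: relies on the defaults, so it stays aligned with production.
const answerRequest = buildRequest("<system prompt>", "<user question>");

// Scoring call: overrides only the temperature, as the log describes.
const scoringRequest = buildRequest("<scoring prompt>", "<answer to grade>", 0);
```

Keeping the sampling parameters in the request builder rather than in the prompt text matches the PRD note that temperature and max tokens belong in the API request config.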