docs: update progress.txt with iteration 9 (Task 3.3 in progress)

Fixed two critical bugs preventing GP diagnosis matching:
1. SNOMED codes in scientific notation now converted to integers
2. Using PseudoNHSNoLinked (not PersonKey) for GP record lookup

Full refresh is running in background - next iteration should verify completion.
Andrew Charlwood
2026-02-05 15:51:17 +00:00
parent 5b1569ed5c
commit 843b4f23cc
@@ -468,3 +468,77 @@ For a patient on drug X:
### Blocked items:
- Task 3.3 verification requires Snowflake connection (NHS SSO)
## Iteration 9 — 2026-02-05
### Task: 3.3 Test Full Refresh Pipeline
### Why this task:
- Previous iteration completed Task 3.2 but noted verification was pending
- Task 3.3 is the final task in Phase 3 to validate all pipeline work
- Phase 4 (UI) depends on having working indication chart data
- Following "validate before moving forward" principle
### Status: IN PROGRESS
### What was done:
1. **First refresh attempt**: Ran `python -m cli.refresh_pathways --chart-type all --verbose`
- Directory chart: Processed successfully (293 nodes for all_6mo)
- Indication chart: **0% diagnosis matches** - all 37,257 patients fell back to FALLBACK
- This resulted in no indication pathway data (charts empty)
2. **Diagnosed root cause #1**: SNOMED codes stored in scientific notation
- CSV has codes like "1.0629311000119108e+16" due to pandas/Excel export
- The `clean_snomed_code()` function only handled ".0" suffix removal
- Codes were stored in forms like "1.06e+16", which can never match Snowflake data
- **Fix**: Updated `clean_snomed_code()` to convert scientific notation to integers
- Reloaded 144,056 SNOMED mappings with properly formatted codes
3. **Diagnosed root cause #2**: Wrong patient identifier used for GP lookup
- `batch_lookup_indication_groups()` was using `PersonKey` column
- `PersonKey` is an alias for `LocalPatientID` (provider-specific, e.g. "J188448")
- GP records use `PatientPseudonym` which matches `PseudoNHSNoLinked` (SHA-256 hash)
- **Fix**: Changed to use `PseudoNHSNoLinked` column for GP record matching
- Test showed ~20% match rate for ADALIMUMAB patients with correct identifier
4. **Committed fixes**: `5b1569e` "fix: correct patient identifier for GP diagnosis lookup (Task 3.3)"
5. **Started second refresh**: Running in background (task ID: be9b9e7)
- Processing time expected: ~15-20 minutes total
- Should now show non-zero GP matches
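A minimal sketch of the kind of conversion described in step 2, not the project's actual code. The function name `clean_snomed_code()` comes from this log, but the body below is an assumption. It parses with `decimal.Decimal` so that long codes are never routed through a float.

```python
from decimal import Decimal, InvalidOperation


def clean_snomed_code(raw):
    """Return a SNOMED code as a plain digit string (hypothetical body).

    Handles a trailing ".0" and scientific notation such as
    "1.0629311000119108e+16". Values that are not whole numbers
    are returned unchanged so the caller can inspect them.
    """
    text = str(raw).strip()
    if not text:
        return ""
    try:
        value = Decimal(text)  # exact decimal parse, no float rounding
    except InvalidOperation:
        return text  # not numeric at all: leave as-is
    if not value.is_finite() or value != value.to_integral_value():
        return text  # NaN, infinity, or a fractional value: leave as-is
    return str(int(value))


print(clean_snomed_code("1.0629311000119108e+16"))  # 10629311000119108
print(clean_snomed_code("44054006.0"))              # 44054006
```

If the exporter already rounded a value before writing the CSV, no conversion can recover the lost digits. Reading the source file with `dtype=str` in `pandas.read_csv` may avoid the problem at its origin.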
### Validation results:
- Tier 1 (Code): Syntax check passed for both modified files
- Tier 1 (Code): Import check passed
- Tier 2 (Data): SNOMED codes now properly formatted (0 scientific notation entries)
- Tier 2 (Data): GP record matching test: 20 matches found in 100 ADALIMUMAB patients
- Tier 2 (Data): Full refresh still running (started 15:XX) - pending final verification
### Files changed:
- `data_processing/load_snomed_mapping.py` — fixed clean_snomed_code() for scientific notation
- `data_processing/diagnosis_lookup.py` — changed to use PseudoNHSNoLinked for GP lookup
- `IMPLEMENTATION_PLAN.md` — marked Task 3.3 as in progress
### Committed: 5b1569e "fix: correct patient identifier for GP diagnosis lookup (Task 3.3)"
### Patterns discovered:
- **Critical**: PersonKey ≠ PatientPseudonym. HCD data has two patient identifiers:
- `LocalPatientID` (aliased as PersonKey) — provider-specific, NOT in GP records
- `PseudoNHSNoLinked` — pseudonymised NHS number, matches `PatientPseudonym` in GP records
- SNOMED codes can have 15 or more digits (the example above has 17), causing float precision issues in pandas/Excel exports
- Scientific notation must be converted back to integers for string matching
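The identifier mismatch above can be illustrated with a small sketch. The helper `gp_match_rate()` and the sample rows are invented for illustration (only the column names and the "J188448" example come from this log), and the short hash strings stand in for real SHA-256 values.

```python
def gp_match_rate(patients, gp_pseudonyms, id_column):
    """Fraction of patients whose id_column value appears in the GP pseudonym set."""
    if not patients:
        return 0.0
    matched = sum(1 for p in patients if p.get(id_column) in gp_pseudonyms)
    return matched / len(patients)


# Invented sample data: two HCD patients, one of whom has a GP record.
patients = [
    {"PersonKey": "J188448", "PseudoNHSNoLinked": "a3f1"},
    {"PersonKey": "K200001", "PseudoNHSNoLinked": "9bc2"},
]
gp_pseudonyms = {"a3f1"}  # stands in for PatientPseudonym values in GP records

print(gp_match_rate(patients, gp_pseudonyms, "PersonKey"))          # 0.0
print(gp_match_rate(patients, gp_pseudonyms, "PseudoNHSNoLinked"))  # 0.5
```

A rate of exactly zero across a large cohort, as seen in the first refresh, is a strong hint that the join key is wrong and not merely that coverage is low.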
### Next iteration should:
1. **Check refresh completion**: Read output from task be9b9e7
- Look for "DIAGNOSIS matches: X%" line in batch lookup output
- Should now show non-zero percentage (expected 10-30% based on ADALIMUMAB test)
- Look for "indication: X nodes total" confirming indication charts generated
2. **If refresh succeeded**: Verify database state
- `SELECT chart_type, COUNT(*) FROM pathway_nodes GROUP BY chart_type`
- Should show both "directory" (293) and "indication" (expected 300-600) rows
- `SELECT DISTINCT directory FROM pathway_nodes WHERE chart_type='indication' LIMIT 20`
- Should show Search_Term values like "rheumatoid arthritis", "macular degeneration"
3. **Mark Task 3.3 complete** with validation evidence:
- Processing time
- Record counts per chart type
- Coverage percentage (diagnosis vs fallback)
4. **If refresh still running**: Wait or check `tail -50` of output file
5. **Start Phase 4**: If 3.3 passes, begin Task 4.1 (Add Chart Type State to Reflex)
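The database check in step 2 could be scripted roughly as follows. This is a sketch under assumptions: it uses an in-memory `sqlite3` database as a stand-in, since this log does not say which engine holds `pathway_nodes`, and the helper names and sample rows are hypothetical.

```python
import sqlite3


def chart_type_counts(conn):
    """Run the count query from step 2 and return {chart_type: row_count}."""
    rows = conn.execute(
        "SELECT chart_type, COUNT(*) FROM pathway_nodes GROUP BY chart_type"
    ).fetchall()
    return {chart_type: count for chart_type, count in rows}


def both_chart_types_present(counts):
    """True when both directory and indication rows exist."""
    return counts.get("directory", 0) > 0 and counts.get("indication", 0) > 0


# Stand-in table with invented rows; a real run would connect to the project database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathway_nodes (chart_type TEXT, directory TEXT)")
conn.executemany(
    "INSERT INTO pathway_nodes VALUES (?, ?)",
    [
        ("directory", "example directory"),
        ("indication", "rheumatoid arthritis"),
        ("indication", "macular degeneration"),
    ],
)

counts = chart_type_counts(conn)
print(counts)
print(both_chart_types_present(counts))
```

Recording the returned counts alongside the processing time would supply the validation evidence listed in step 3.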
### Blocked items:
- None (Snowflake connection established)