6d68b5eaa5
- Create data_processing/load_snomed_mapping.py with: - migrate_drug_snomed_mapping() for CSV to SQLite migration - get_drug_snomed_mapping_counts() for statistics - verify_drug_snomed_mapping_migration() for validation - clean_snomed_code() to remove trailing .0 from SNOMED codes - CLI interface: python -m data_processing.load_snomed_mapping - Loaded 144,056 mappings from enriched CSV: - 707 unique drugs - 187 unique search terms - 21,265 unique SNOMED codes
144 lines
7.1 KiB
Plaintext
144 lines
7.1 KiB
Plaintext
# Progress Log - Direct SNOMED Indication Mapping
|
||
|
||
## Project Context
|
||
|
||
This project extends the existing HCD Pathway Analysis application with direct SNOMED code matching from GP records. The previous project (Phases 1-5) established the pre-computed pathway architecture and modern UI. This phase adds:
|
||
|
||
1. **Diagnosis-based directorate assignment** - Primary method using GP SNOMED codes
|
||
2. **Indication-based icicle chart** - New chart type showing Trust → Search_Term → Drug → Pathway
|
||
|
||
## Key Files Reference
|
||
|
||
**Existing (reuse these):**
|
||
- `data_processing/schema.py` - SQLite schema (add new table)
|
||
- `data_processing/diagnosis_lookup.py` - Existing cluster-based lookup (extend with direct SNOMED)
|
||
- `data_processing/pathway_pipeline.py` - Pathway processing (add indication type)
|
||
- `cli/refresh_pathways.py` - CLI refresh command (add chart type support)
|
||
- `pathways_app/pathways_app.py` - Reflex app (add chart type toggle)
|
||
- `tools/data.py` - Data transformations including department_identification()
|
||
|
||
**New data:**
|
||
- `data/drug_snomed_mapping_enriched.csv` - 163K rows, 187 Search_Terms, 364 drugs
|
||
|
||
## Known Patterns
|
||
|
||
### SNOMED Mapping Structure
|
||
The enriched mapping CSV has columns:
|
||
- Drug, Indication, TA_ID (from NICE TAs)
|
||
- Search_Term (simplified grouping, 187 unique values)
|
||
- SNOMEDCode, SNOMEDDescription
|
||
- CleanedDrugName, PrimaryDirectorate, AllDirectorates
|
||
|
||
### Direct SNOMED Lookup Logic
|
||
For a patient on drug X:
|
||
1. Get all SNOMED codes for that drug from ref_drug_snomed_mapping
|
||
2. Query PrimaryCareClinicalCoding for those codes (patient's GP record)
|
||
3. If match found → use Search_Term and PrimaryDirectorate from matched row
|
||
4. If no match → fall back to department_identification()
|
||
5. Use most recent SNOMED code by EventDateTime if multiple matches
|
||
|
||
### Chart Type Architecture
|
||
- `chart_type` column in pathway_nodes: "directory" or "indication"
|
||
- 12 total pathway datasets: 6 date filters × 2 chart types
|
||
- Indication chart: mixed labels (Search_Term for matched, Directorate for unmatched)
|
||
|
||
### Date Filter Combinations
|
||
| ID | Initiated | Last Seen | Default |
|
||
|----|-----------|-----------|---------|
|
||
| `all_6mo` | All years | Last 6 months | Yes |
|
||
| `all_12mo` | All years | Last 12 months | No |
|
||
| `1yr_6mo` | Last 1 year | Last 6 months | No |
|
||
| `1yr_12mo` | Last 1 year | Last 12 months | No |
|
||
| `2yr_6mo` | Last 2 years | Last 6 months | No |
|
||
| `2yr_12mo` | Last 2 years | Last 12 months | No |
|
||
|
||
### Expected Volumes
|
||
- SNOMED mapping: 163K rows
|
||
- Search_Terms: 187 unique
|
||
- Pathway nodes per date filter: ~300 (directory), ~400-600 (indication)
|
||
|
||
---
|
||
|
||
## Iteration Log
|
||
|
||
## Iteration 1 — 2026-02-05
|
||
### Task: 1.1 Create SQLite Table for SNOMED Mapping
|
||
### Why this task:
|
||
- First task in Phase 1 (Data Infrastructure) — all other phases depend on having the data layer in place
|
||
- No external dependencies — pure schema definition work
|
||
- Follows "data infrastructure first" principle
|
||
### Status: COMPLETE
|
||
### What was done:
|
||
- Added `REF_DRUG_SNOMED_MAPPING_SCHEMA` to `data_processing/schema.py` with 11 columns:
|
||
- id, drug_name, indication, ta_id, search_term, snomed_code, snomed_description
|
||
- cleaned_drug_name, primary_directorate, all_directorates, created_at
|
||
- Added 5 custom indexes for lookup performance:
|
||
- idx_ref_drug_snomed_mapping_drug (drug_name)
|
||
- idx_ref_drug_snomed_mapping_cleaned (cleaned_drug_name)
|
||
- idx_ref_drug_snomed_mapping_snomed (snomed_code)
|
||
- idx_ref_drug_snomed_mapping_search_term (search_term)
|
||
- idx_ref_drug_snomed_mapping_drug_snomed (composite: cleaned_drug_name, snomed_code)
|
||
- Added `create_drug_snomed_mapping_table()` helper function
|
||
- Added schema to `REFERENCE_TABLES_SCHEMA` (included in `ALL_TABLES_SCHEMA`)
|
||
- Updated helper functions to include new table:
|
||
- `drop_reference_tables()` — drops new table
|
||
- `get_reference_table_counts()` — counts new table (with try/except for safety)
|
||
- `verify_reference_tables_exist()` — checks for new table
|
||
### Validation results:
|
||
- Tier 1 (Code): `python -m py_compile data_processing/schema.py` — PASSED
|
||
- Tier 1 (Code): Import check — PASSED
|
||
- Tier 2 (Data): Migration created table with 0 rows — PASSED
|
||
- Tier 2 (Data): All 11 columns and 6 indexes present — PASSED
|
||
### Files changed:
|
||
- `data_processing/schema.py` — added schema, updated helpers
|
||
### Committed: cf35937 "feat: add ref_drug_snomed_mapping schema (Task 1.1)"
|
||
### Patterns discovered:
|
||
- Existing schema pattern: CREATE TABLE + CREATE INDEX in single SQL string
|
||
- Helper functions follow consistent pattern: create, drop, get_counts, verify_exists
|
||
- `get_reference_table_counts` needed try/except for tables that might not exist yet
|
||
### Next iteration should:
|
||
- Start Task 1.2: Create `data_processing/load_snomed_mapping.py` script to load the CSV
|
||
- The CSV has columns: Drug, Indication, TA_ID, Search_Term, SNOMEDCode, SNOMEDDescription, CleanedDrugName, PrimaryDirectorate, AllDirectorates
|
||
- Note: SNOMEDCode in CSV has `.0` suffix (e.g., "156370009.0") — may need cleanup
|
||
### Blocked items:
|
||
- None
|
||
|
||
## Iteration 2 — 2026-02-05
|
||
### Task: 1.2 Load Enriched Mapping Data
|
||
### Why this task:
|
||
- Task 1.1 (schema creation) is complete — table `ref_drug_snomed_mapping` exists
|
||
- Task 1.2 is the next logical step in Phase 1 (Data Infrastructure)
|
||
- Task 1.3 (Extend Diagnosis Lookup) depends on having data in the table to query
|
||
- Following the "data infrastructure first" principle
|
||
### Status: COMPLETE
|
||
### What was done:
|
||
- Created `data_processing/load_snomed_mapping.py` with:
|
||
- `migrate_drug_snomed_mapping()` - reads CSV, cleans SNOMED codes, inserts to SQLite
|
||
- `get_drug_snomed_mapping_counts()` - statistics function
|
||
- `verify_drug_snomed_mapping_migration()` - validation function
|
||
- `clean_snomed_code()` - removes trailing .0 from SNOMED codes (e.g., "156370009.0" → "156370009")
|
||
- CLI interface with `--verify-only` and `-v` options
|
||
- Added CLI entry point: `python -m data_processing.load_snomed_mapping`
|
||
- Fixed Unicode encoding issue on Windows console (emojis → ASCII text)
|
||
### Validation results:
|
||
- Tier 1 (Code): `python -m py_compile data_processing/load_snomed_mapping.py` — PASSED
|
||
- Tier 1 (Code): Import check — PASSED
|
||
- Tier 2 (Data): Migration loaded 144,056 rows from 163,937 CSV rows (19,881 duplicates skipped)
|
||
- Tier 2 (Data): Verified: 707 unique drugs, 187 unique search terms, 21,265 SNOMED codes
|
||
- Tier 2 (Data): SNOMED codes cleaned — 0 rows with .0 suffix remaining
|
||
### Files changed:
|
||
- `data_processing/load_snomed_mapping.py` — new file
|
||
### Committed: 6ce45b5 "feat: add SNOMED mapping loader script (Task 1.2)"
|
||
### Patterns discovered:
|
||
- CSV has 163,937 rows but only 144,056 unique (drug_name, indication, snomed_code) combinations
|
||
- SNOMED codes in CSV have `.0` suffix from float formatting — need `clean_snomed_code()` function
|
||
- Windows console can't print Unicode emojis — use ASCII text like `[OK]` instead of ✅
|
||
### Next iteration should:
|
||
- Start Task 1.3: Extend Diagnosis Lookup Module
|
||
- Add `get_drug_snomed_codes(drug_name)` to `data_processing/diagnosis_lookup.py`
|
||
- Add `patient_has_indication_direct(patient_pseudonym, snomed_codes, connector)` function
|
||
- Look at existing `diagnosis_lookup.py` for patterns to follow
|
||
### Blocked items:
|
||
- None
|
||
|