feat: add SNOMED mapping loader script (Task 1.2)

- Create data_processing/load_snomed_mapping.py with:
  - migrate_drug_snomed_mapping() for CSV to SQLite migration
  - get_drug_snomed_mapping_counts() for statistics
  - verify_drug_snomed_mapping_migration() for validation
  - clean_snomed_code() to remove trailing .0 from SNOMED codes
  - CLI interface: python -m data_processing.load_snomed_mapping
- Loaded 144,056 mappings from enriched CSV:
  - 707 unique drugs
  - 187 unique search terms
  - 21,265 unique SNOMED codes
2026-02-05 14:10:22 +00:00
parent 9943e85761
commit 6d68b5eaa5
3 changed files with 425 additions and 3 deletions
@@ -103,3 +103,41 @@ For a patient on drug X:
 ### Blocked items:
 - None

+## Iteration 2 — 2026-02-05
+### Task: 1.2 Load Enriched Mapping Data
+### Why this task:
+- Task 1.1 (schema creation) is complete — table `ref_drug_snomed_mapping` exists
+- Task 1.2 is the next logical step in Phase 1 (Data Infrastructure)
+- Task 1.3 (Extend Diagnosis Lookup) depends on having data in the table to query
+- Following the "data infrastructure first" principle
+### Status: COMPLETE
+### What was done:
+- Created `data_processing/load_snomed_mapping.py` with:
+  - `migrate_drug_snomed_mapping()` - reads the CSV, cleans SNOMED codes, inserts rows into SQLite
+  - `get_drug_snomed_mapping_counts()` - statistics function
+  - `verify_drug_snomed_mapping_migration()` - validation function
+  - `clean_snomed_code()` - removes trailing .0 from SNOMED codes (e.g., "156370009.0" → "156370009")
+  - CLI interface with `--verify-only` and `-v` options
+- Added CLI entry point: `python -m data_processing.load_snomed_mapping`
+- Fixed Unicode encoding issue on Windows console (emojis → ASCII text)
+### Validation results:
+- Tier 1 (Code): `python -m py_compile data_processing/load_snomed_mapping.py` — PASSED
+- Tier 1 (Code): Import check — PASSED
+- Tier 2 (Data): Migration loaded 144,056 rows from 163,937 CSV rows (19,881 duplicates skipped)
+- Tier 2 (Data): Verified 707 unique drugs, 187 unique search terms, 21,265 unique SNOMED codes
+- Tier 2 (Data): SNOMED codes cleaned — 0 rows with .0 suffix remaining
+### Files changed:
+- `data_processing/load_snomed_mapping.py` — new file
+### Committed: 6ce45b5 "feat: add SNOMED mapping loader script (Task 1.2)"
+### Patterns discovered:
+- CSV has 163,937 rows but only 144,056 unique (drug_name, indication, snomed_code) combinations
+- SNOMED codes in the CSV carry a `.0` suffix from float formatting, so `clean_snomed_code()` is needed to strip it before insert
+- Windows console can't print Unicode emojis — use ASCII text like `[OK]` instead of ✅
+### Next iteration should:
+- Start Task 1.3: Extend Diagnosis Lookup Module
+- Add `get_drug_snomed_codes(drug_name)` to `data_processing/diagnosis_lookup.py`
+- Add `patient_has_indication_direct(patient_pseudonym, snomed_codes, connector)` function
+- Look at existing `diagnosis_lookup.py` for patterns to follow
+### Blocked items:
+- None
+
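
The cleaning and de-duplication behaviour described in the log above can be illustrated with a minimal sketch. This is not the code in `data_processing/load_snomed_mapping.py`. The column layout, the `UNIQUE` constraint, the helper name `migrate_rows`, and the sample drug and indication values are all assumptions made for illustration. Only the table name, the `(drug_name, indication, snomed_code)` key, and the `.0` suffix come from the log.

```python
import csv
import io
import sqlite3


def clean_snomed_code(raw: str) -> str:
    """Strip whitespace and a trailing '.0' left behind by float formatting."""
    code = raw.strip()
    if code.endswith(".0"):
        code = code[:-2]
    return code


def migrate_rows(conn: sqlite3.Connection, rows: list) -> tuple:
    """Insert rows, skipping duplicate (drug_name, indication, snomed_code) keys.

    Hypothetical helper: the schema below is assumed, not taken from the commit.
    Returns (inserted, skipped).
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS ref_drug_snomed_mapping (
            drug_name   TEXT NOT NULL,
            indication  TEXT NOT NULL,
            snomed_code TEXT NOT NULL,
            UNIQUE (drug_name, indication, snomed_code)
        )
        """
    )
    before = conn.total_changes
    conn.executemany(
        # INSERT OR IGNORE silently drops rows that violate the UNIQUE constraint.
        "INSERT OR IGNORE INTO ref_drug_snomed_mapping VALUES (?, ?, ?)",
        [
            (r["drug_name"], r["indication"], clean_snomed_code(r["snomed_code"]))
            for r in rows
        ],
    )
    conn.commit()
    inserted = conn.total_changes - before
    return inserted, len(rows) - inserted


# Made-up sample data: the first two rows collapse to one key after cleaning.
SAMPLE_CSV = """drug_name,indication,snomed_code
drug_a,indication_x,156370009.0
drug_a,indication_x,156370009
drug_b,indication_y,123456.0
"""

if __name__ == "__main__":
    sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    connection = sqlite3.connect(":memory:")
    print(migrate_rows(connection, sample_rows))
```

Cleaning before insert matters here: without it, `156370009.0` and `156370009` would count as distinct keys and both rows would be loaded.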