diff --git a/progress.txt b/progress.txt
index aafa441..96fd0ed 100644
--- a/progress.txt
+++ b/progress.txt
@@ -193,3 +193,63 @@ This project extends the indication-based pathway charts (Phase 1-5 complete) wi
 - The final Phase 1 subtask (1.1 verify with live Snowflake) will be tested during Phase 3/4 integration
 ### Blocked items:
 - Task 1.1 final subtask "Verify: Query returns more rows" requires live Snowflake — deferred to Phase 3/4
+
+## Iteration 4 — 2026-02-05
+### Task: 2.1 + 2.2 — Create assign_drug_indications() with tiebreaker logic
+### Why this task:
+- All Phase 1 dependencies complete (1.1 query returns ALL matches, 1.2 asthma merged, 1.3 drug mapping loaded)
+- Task 2.1 is the core matching function needed before Phase 3 pipeline integration
+- Task 2.2 (tiebreaker) is naturally part of 2.1 — implemented together
+- Can be built and tested with mock data (no Snowflake needed)
+### Status: COMPLETE
+### What was done:
+- Added `assign_drug_indications()` to `diagnosis_lookup.py`:
+  - Input: HCD df (with UPID, Drug Name, PseudoNHSNoLinked, Directory), gp_matches_df, search_term_to_fragments
+  - Output: (modified_df with UPID|search_term, indication_df mapping modified_UPID → Search_Term)
+  - Builds GP lookup: PseudoNHSNoLinked → {Search_Term: code_frequency}
+  - Caches drug→Search_Term lookups to avoid recomputing per row
+  - For each (UPID, Drug Name) pair:
+    - Intersects drug's Search_Terms with patient's GP matches
+    - Single match: use it
+    - Multiple matches: highest code_frequency wins, alphabetical tiebreak
+    - No match: fallback to "{Directory} (no GP dx)"
+  - Applies modified UPIDs via df.apply() (vectorized lookup from cache)
+  - Builds indication_df with unique modified UPID → Directory column
+  - Comprehensive logging: match rate, tiebreaker count, fallback count, top 5 indications
+- Updated __all__ exports
+### Validation results:
+- Tier 1 (Code): py_compile PASSED, import check PASSED
+- Tier 2 (Data): Mock data tests ALL PASSED:
+  - ADALIMUMAB + GP dx (RA + asthma) → matched to RA (drug mapping intersection)
+  - OMALIZUMAB + GP dx (RA + asthma) → matched to asthma (drug mapping intersection)
+  - ADALIMUMAB + GP dx (RA 3 freq + crohn's 2 freq) → tiebreaker picks RA
+  - ADALIMUMAB + GP dx (psoriatic 5 freq + RA 5 freq) → alphabetical tiebreak picks psoriatic arthritis
+  - Higher frequency (47 RA vs 3 psoriatic) → RA wins
+  - No GP match → fallback to directory
+  - Empty GP DataFrame → all fallback
+  - Different drugs with different indications → different modified UPIDs
+- Tier 3 (Functional): N/A (no UI changes yet)
+### Files changed:
+- data_processing/diagnosis_lookup.py (added assign_drug_indications, updated __all__)
+- IMPLEMENTATION_PLAN.md (marked 2.1 and 2.2 subtasks [x])
+### Committed: 408976e "feat: add assign_drug_indications() for drug-aware indication matching (Task 2.1 + 2.2)"
+### Patterns discovered:
+- Function signature takes `search_term_to_fragments` (the second element from load_drug_indication_mapping()) — NOT the full tuple. Callers must destructure: `_, st_to_frags = load_drug_indication_mapping()`
+- The function uses df.apply() to set modified UPIDs — for large DataFrames (656K rows), this could be slow. If performance is an issue in Phase 3, could vectorize with merge operations instead. But apply with cached lookup dict should be OK.
+- "crohn's disease" is NOT in ADALIMUMAB's DimSearchTerm mapping (ADALIMUMAB maps to: ankylosing spondylitis, axial spondyloarthritis, plaque psoriasis, psoriatic arthritis, rheumatoid arthritis, ulcerative colitis, uveitis). Initial test assumption was wrong.
+- indication_df has 'Directory' column (not 'Search_Term') for compatibility with generate_icicle_chart_indication() which expects indication_df.loc[upid, 'Directory']
+### Next iteration should:
+- Work on Task 3.1: Update `refresh_pathways.py` indication processing to use assign_drug_indications()
+  - The current code at lines 424-428 uses `dict(zip(...))` which only keeps LAST match per patient — this must be replaced
+  - Key changes in the `elif current_chart_type == "indication":` block:
+    1. Load drug mapping: `_, st_to_frags = load_drug_indication_mapping()`
+    2. Pass `earliest_hcd_date=df['Intervention Date'].min().strftime('%Y-%m-%d')` to get_patient_indication_groups()
+    3. Call `assign_drug_indications(df, gp_matches_df, st_to_frags)` to get (modified_df, indication_df)
+    4. Use modified_df (not original df) for pathway processing
+    5. indication_df is already in the right format (indexed by modified UPID, 'Directory' column)
+    6. Remove the old match_lookup/dict(zip) code and the manual indication_df building
+  - Import assign_drug_indications and load_drug_indication_mapping at top of file
+  - This replaces ~50 lines of the old approach with ~10 lines using the new function
+  - Can verify with py_compile; full Snowflake test via --dry-run
+### Blocked items:
+- None
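
The selection rule described under "What was done" (intersect the drug's Search_Terms with the patient's GP matches, prefer the highest code_frequency, break ties alphabetically, otherwise fall back to the directory) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code committed in 408976e: `pick_indication`, its parameters, and the sample directory name "RHEUMATOLOGY" are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set


def pick_indication(
    drug_terms: Set[str],
    patient_matches: Dict[str, int],
    directory: str,
) -> str:
    """Hypothetical helper illustrating the tiebreaker rule.

    drug_terms:      Search_Terms the drug maps to (assumed already lowercased).
    patient_matches: Search_Term -> code_frequency for one patient.
    directory:       label used when no GP diagnosis matches the drug.
    """
    # Keep only the GP matches that are valid indications for this drug.
    candidates = {
        term: freq for term, freq in patient_matches.items() if term in drug_terms
    }
    if not candidates:
        # No overlap between the drug's indications and the patient's GP codes.
        return f"{directory} (no GP dx)"
    # Sort key: highest frequency first, then alphabetical order on ties.
    return min(candidates, key=lambda term: (-candidates[term], term))


if __name__ == "__main__":
    # Subset of the ADALIMUMAB mapping listed in the log above.
    adalimumab = {"rheumatoid arthritis", "psoriatic arthritis", "uveitis"}
    print(pick_indication(
        adalimumab,
        {"psoriatic arthritis": 5, "rheumatoid arthritis": 5},
        "RHEUMATOLOGY",
    ))
```

Using `min` with a `(-frequency, term)` key resolves both the frequency comparison and the alphabetical tiebreak in a single pass, which keeps the per-pair work small enough to cache per (UPID, Drug Name) pair as the log describes.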