# Progress Log - Drug-Aware Indication Matching

## Project Context

This project extends the indication-based pathway charts (Phase 1-5 complete) with drug-aware matching.

**Previous state**: Patients get ONE indication based on their most recent GP diagnosis match (SNOMED cluster codes). This ignores which drugs the patient is taking.

**New goal**: Match each drug to an indication by cross-referencing the patient's GP diagnoses AND the drug's Search_Term mapping from DimSearchTerm.csv.

## Key Data/Patterns

### DimSearchTerm.csv
- Located at `data/DimSearchTerm.csv`
- Columns: Search_Term, CleanedDrugName (pipe-separated), PrimaryDirectorate
- ~165 rows mapping clinical conditions to drug name fragments
- Drug fragments are substrings that match standardized drug names from HCD data
- Some entries have generic fragments: INHALED, CONTINUOUS, STANDARD-DOSE, PEGYLATED

### Current get_patient_indication_groups() in diagnosis_lookup.py
- Uses CLUSTER_MAPPING_SQL as CTE in Snowflake query
- Returns ONLY the most recent match per patient (QUALIFY ROW_NUMBER() = 1)
- Needs to return ALL matching Search_Terms per patient (remove QUALIFY)
- Batches 500 patients per query

### Modified UPID approach
- Current: UPID = Provider Code[:3] + PersonKey (e.g., "RMV12345")
- New: UPID = original + "|" + search_term (e.g., "RMV12345|rheumatoid arthritis")
- The pipe delimiter "|" is safe because existing UPIDs are alphanumeric
- generate_icicle_chart_indication() treats UPID as an opaque identifier, so modified UPIDs work transparently
- The " - " delimiter in pathway ids is used for hierarchy levels, not within UPIDs

### PseudoNHSNoLinked mapping
- HCD data has a PseudoNHSNoLinked column that matches PatientPseudonym in GP records
- PersonKey is a provider-specific local ID; do NOT use it for GP matching
- One PseudoNHSNoLinked can map to multiple UPIDs (multi-provider patients)
- GP match lookup: PseudoNHSNoLinked → list of matched Search_Terms

### Drug matching logic
- For each HCD row (UPID + Drug Name):
  1. Get the patient's GP-matched Search_Terms with code_frequency (via PseudoNHSNoLinked)
  2. Get which Search_Terms list this drug (from DimSearchTerm.csv)
  3. Intersection = valid indications
  4. If 1: use it. If multiple: pick the highest code_frequency (most GP coding = most likely indication). If 0: fall back to directory.
- Modified UPID groups drugs under the same indication together naturally
- code_frequency = COUNT(*) of matching SNOMED codes per Search_Term per patient in GP records
- GP code time range: only count codes from MIN(Intervention Date) onwards (the HCD data window)
  - Reduces noise from old/irrelevant diagnoses, makes frequency more meaningful
  - Pass earliest_hcd_date as a parameter to get_patient_indication_groups()
- Tiebreaker rationale: 47 RA codes vs 2 Crohn's codes → RA is clearly the active condition

### Known edge cases
- Some DimSearchTerm drug fragments are generic (INHALED, ORAL, CONTINUOUS)
  - These could match broadly but are constrained by the GP diagnosis requirement
- A patient visiting multiple providers has multiple UPIDs
  - Each UPID gets its own drug-indication matching independently
- The same Search_Term appears twice in DimSearchTerm.csv with different directorates
  - e.g., "diabetes" → DIABETIC MEDICINE and OPHTHALMOLOGY
  - For indication charts, we use Search_Term, not directorate, so this is fine

## Iteration Log

## Iteration 1 - 2026-02-05

### Task: 1.2 - Build drug-to-Search_Term lookup from DimSearchTerm.csv

### Why this task:
- First iteration, chose Phase 1 foundations. Task 1.2 (CSV loading) is self-contained and testable locally without Snowflake.
- Task 1.1 (Snowflake query update) can't be verified without a live connection, so it is better to do 1.2 first.
- Both 1.1 and 1.2 are independent, so order doesn't matter for dependencies.
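
Task 1.2 can be tested locally because it only involves parsing a pipe-separated CSV column and doing substring checks. The sketch below illustrates that idea under stated assumptions: `SAMPLE_CSV`, `build_mapping`, and `search_terms_for_drug` are hypothetical stand-ins rather than the project's functions, the directorate for the first two rows is a placeholder, and `DRUG_X`, `DRUG_Y`, `DRUG_Z` are invented fragments. Only the column names and the "diabetes appears twice" pattern come from the notes above.

```python
import csv
import io
from collections import defaultdict

# Illustrative stand-in for data/DimSearchTerm.csv (rows are NOT real data).
SAMPLE_CSV = """Search_Term,CleanedDrugName,PrimaryDirectorate
rheumatoid arthritis,ADALIMUMAB|DRUG_X,EXAMPLE DIRECTORATE
axial spondyloarthritis,ADALIMUMAB,EXAMPLE DIRECTORATE
diabetes,DRUG_Y,DIABETIC MEDICINE
diabetes,DRUG_Z,OPHTHALMOLOGY
"""


def build_mapping(csv_text):
    """Build both lookup directions from CSV text (hypothetical helper)."""
    fragment_to_search_terms = defaultdict(list)
    search_term_to_fragments = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        term = row["Search_Term"].strip()
        # CleanedDrugName is pipe-separated; fragments are stored upper-case.
        for fragment in row["CleanedDrugName"].split("|"):
            fragment = fragment.strip().upper()
            if not fragment:
                continue
            # Rows sharing a Search_Term (e.g. diabetes) are merged, no duplicates.
            if term not in fragment_to_search_terms[fragment]:
                fragment_to_search_terms[fragment].append(term)
            if fragment not in search_term_to_fragments[term]:
                search_term_to_fragments[term].append(fragment)
    return dict(fragment_to_search_terms), dict(search_term_to_fragments)


def search_terms_for_drug(drug_name, search_term_to_fragments):
    """Return every Search_Term with a fragment that is a substring of drug_name."""
    name = drug_name.upper()  # case-insensitive comparison
    return [
        term
        for term, fragments in search_term_to_fragments.items()
        if any(fragment in name for fragment in fragments)
    ]


fragment_to_terms, term_to_fragments = build_mapping(SAMPLE_CSV)
adalimumab_terms = search_terms_for_drug("Adalimumab 40mg", term_to_fragments)
```

Because matching is by substring, a drug name carrying dosage text such as "Adalimumab 40mg" still resolves to the same Search_Terms as the bare drug name.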

### Status: COMPLETE

### What was done:
- Added `load_drug_indication_mapping()` to `diagnosis_lookup.py`:
  - Loads `data/DimSearchTerm.csv`, builds two dicts:
    - `fragment_to_search_terms`: drug fragment (UPPER) → list of Search_Terms
    - `search_term_to_fragments`: search_term → list of drug fragments (UPPER)
  - Handles duplicate Search_Terms (e.g., "diabetes" rows combined)
  - Result: 164 Search_Terms, 346 drug fragments
- Added `get_search_terms_for_drug()` to `diagnosis_lookup.py`:
  - Returns all Search_Terms whose drug fragments are substrings of the drug name (case-insensitive)
  - Named differently from the plan's `drug_matches_search_term()`: it returns all matches at once rather than a single boolean, which is more practical for Phase 2
- Updated `__all__` exports

### Validation results:
- Tier 1 (Code): py_compile passed, import check passed
- Tier 2 (Data):
  - ADALIMUMAB → 7 indications (including axial spondyloarthritis, rheumatoid arthritis)
  - OMALIZUMAB → 4 indications (asthma, allergic asthma, etc.)
  - PEGYLATED LIPOSOMAL DOXORUBICIN → 4 matches via substring
  - "ADALIMUMAB 40MG" matches correctly with dosage info
  - diabetes fragments combined from 2 CSV rows
- Tier 3 (Functional): N/A (no UI changes)

### Files changed:
- data_processing/diagnosis_lookup.py (added load_drug_indication_mapping, get_search_terms_for_drug)
- IMPLEMENTATION_PLAN.md (marked 1.2 subtasks [x])

### Committed: 0779df7 "feat: add drug-to-indication mapping from DimSearchTerm.csv (Task 1.2)"

### Patterns discovered:
- DimSearchTerm.csv has 164 unique Search_Terms (not 165 as noted) because diabetes appears twice with different directorates but the same Search_Term
- Some drug fragments are very generic: INHALED, CONTINUOUS, ORAL, STANDARD-DOSE, INTRAVENOUS, PEGYLATED, ROUTINE, INDUCTION. These will match broadly but are constrained by the GP diagnosis requirement in Phase 2
- Function signatures for Phase 2: `get_search_terms_for_drug(drug_name, search_term_to_fragments)` returns list[str]; use this to get candidate indications per drug

### Next iteration should:
- Work on Task 1.1: Update `get_patient_indication_groups()` to return ALL matches with code_frequency
  - The current query at line ~1352 of diagnosis_lookup.py uses `QUALIFY ROW_NUMBER() OVER (PARTITION BY pc."PatientPseudonym" ORDER BY pc."EventDateTime" DESC) = 1`; this must be replaced with GROUP BY + COUNT(*)
  - Add `earliest_hcd_date` parameter to restrict GP codes to the HCD data window
  - Return columns: PatientPseudonym, Search_Term, code_frequency (not EventDateTime)
- OR if Snowflake isn't available to test, skip to Task 2.1 (assign_drug_indications function), which can be built and tested with mock data

### Blocked items:
- None
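
As a possible starting point for Task 2.1, the steps under "Drug matching logic" can be exercised against mock data without Snowflake. The following is a minimal sketch, not the planned `assign_drug_indications` implementation: `assign_indication`, `modified_upid`, `assign_all`, and all mock dictionaries are hypothetical, the drug-to-Search_Term lists are hand-written rather than loaded from DimSearchTerm.csv, and leaving the UPID unchanged when nothing matches is an assumption about how the directory fallback would be signalled.

```python
def assign_indication(drug_terms, gp_term_frequency):
    """Pick one Search_Term for a drug, or None if no GP diagnosis supports it.

    drug_terms: Search_Terms that list this drug.
    gp_term_frequency: {Search_Term: code_frequency} for one patient.
    """
    # Step 3: intersect the drug's Search_Terms with the patient's GP matches.
    candidates = [term for term in drug_terms if term in gp_term_frequency]
    if not candidates:
        return None  # Step 4, zero matches: caller falls back to directory
    # Step 4, one or many matches: highest code_frequency wins.
    # Exact ties resolve to the first candidate, an arbitrary choice here.
    return max(candidates, key=lambda term: gp_term_frequency[term])


def modified_upid(upid, search_term):
    """Append '|search_term' to the UPID; leave it unchanged on fallback."""
    return upid if search_term is None else f"{upid}|{search_term}"


def assign_all(hcd_rows, drug_terms_lookup, gp_matches):
    """Return one (possibly modified) UPID per HCD row, in row order."""
    result = []
    for row in hcd_rows:
        # GP matches are keyed by PseudoNHSNoLinked, never by PersonKey.
        frequencies = gp_matches.get(row["PseudoNHSNoLinked"], {})
        terms = drug_terms_lookup.get(row["Drug Name"], [])
        result.append(modified_upid(row["UPID"], assign_indication(terms, frequencies)))
    return result


# Mock data for illustration only.
GP_MATCHES = {"PSEUDO1": {"rheumatoid arthritis": 47, "crohn's disease": 2}}
DRUG_TERMS = {
    "ADALIMUMAB": ["crohn's disease", "rheumatoid arthritis", "axial spondyloarthritis"],
    "OMALIZUMAB": ["asthma", "allergic asthma"],
}
HCD_ROWS = [
    {"UPID": "RMV12345", "PseudoNHSNoLinked": "PSEUDO1", "Drug Name": "ADALIMUMAB"},
    {"UPID": "RMV12345", "PseudoNHSNoLinked": "PSEUDO1", "Drug Name": "OMALIZUMAB"},
]

assigned_upids = assign_all(HCD_ROWS, DRUG_TERMS, GP_MATCHES)
```

With this mock patient, ADALIMUMAB is grouped under rheumatoid arthritis (47 codes against 2), while OMALIZUMAB has no supporting GP diagnosis and keeps the original UPID.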