# Progress Log - Drug-Aware Indication Matching

## Project Context

This project extends the indication-based pathway charts (Phases 1-5 complete) with drug-aware matching.

**Previous state**: Each patient gets ONE indication, taken from their most recent GP diagnosis match (SNOMED cluster codes). This ignores which drugs the patient is taking.

**New goal**: Match each drug to an indication by cross-referencing the patient's GP diagnoses AND the drug's Search_Term mapping from DimSearchTerm.csv.

## Key Data/Patterns

### DimSearchTerm.csv
- Located at `data/DimSearchTerm.csv`
- Columns: Search_Term, CleanedDrugName (pipe-separated), PrimaryDirectorate
- ~165 rows mapping clinical conditions to drug name fragments
- Drug fragments are substrings that match standardized drug names from HCD data
- Some entries have generic fragments: INHALED, CONTINUOUS, STANDARD-DOSE, PEGYLATED

### Current get_patient_indication_groups() in diagnosis_lookup.py
- Uses CLUSTER_MAPPING_SQL as a CTE in the Snowflake query
- Returns ONLY the most recent match per patient (QUALIFY ROW_NUMBER() = 1)
- Needs to return ALL matching Search_Terms per patient (remove QUALIFY)
- Batches 500 patients per query

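The 500-patient batching can be sketched as follows. This is a minimal illustration, not the code in `diagnosis_lookup.py`: the helper names `iter_batches` and `in_clause_placeholders` are hypothetical, and the pseudonyms are made up.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 500  # batch size stated in the notes above


def iter_batches(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def in_clause_placeholders(batch: Sequence[str]) -> str:
    """Return one %s placeholder per value, for use inside an IN (...) clause."""
    return ", ".join(["%s"] * len(batch))


# Made-up pseudonyms, only to show how the batches fall out.
pseudonyms = [f"P{i:05d}" for i in range(1203)]
batches = list(iter_batches(pseudonyms))
print([len(b) for b in batches])  # [500, 500, 203]
```

Each batch would then be bound to its own query, so the number of placeholders never exceeds the batch size.
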
### Modified UPID approach
- Current: UPID = Provider Code[:3] + PersonKey (e.g., "RMV12345")
- New: UPID = original + "|" + search_term (e.g., "RMV12345|rheumatoid arthritis")
- The pipe delimiter "|" is safe because existing UPIDs are alphanumeric, so "|" can never appear in an original UPID
- generate_icicle_chart_indication() treats UPID as an opaque identifier, so modified UPIDs work without code changes
- The " - " delimiter in pathway ids is used for hierarchy levels, not within UPIDs

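A small sketch of composing and splitting a modified UPID under the rules above. The function names are hypothetical; only the `original + "|" + search_term` format comes from the notes.

```python
from typing import Optional, Tuple

UPID_DELIMITER = "|"  # delimiter described in the notes


def make_modified_upid(upid: str, search_term: str) -> str:
    """Append the indication to the UPID (hypothetical helper)."""
    if UPID_DELIMITER in upid:
        # Guards the assumption that original UPIDs are alphanumeric.
        raise ValueError(f"UPID already contains {UPID_DELIMITER!r}: {upid!r}")
    return f"{upid}{UPID_DELIMITER}{search_term}"


def split_modified_upid(modified_upid: str) -> Tuple[str, Optional[str]]:
    """Recover (original UPID, search_term); search_term is None if unmodified."""
    base, sep, term = modified_upid.partition(UPID_DELIMITER)
    return base, (term if sep else None)


modified = make_modified_upid("RMV12345", "rheumatoid arthritis")
print(modified)                         # RMV12345|rheumatoid arthritis
print(split_modified_upid(modified))    # ('RMV12345', 'rheumatoid arthritis')
print(split_modified_upid("RMV12345"))  # ('RMV12345', None)
```

Splitting on the first "|" only means a Search_Term that itself contained a pipe would still round-trip, although none of the Search_Terms in these notes do.
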
### PseudoNHSNoLinked mapping
- HCD data has a PseudoNHSNoLinked column that matches PatientPseudonym in GP records
- PersonKey is a provider-specific local ID: do NOT use it for GP matching
- One PseudoNHSNoLinked can map to multiple UPIDs (multi-provider patients)
- GP match lookup: PseudoNHSNoLinked → list of matched Search_Terms

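One way to picture the lookup is a plain dict keyed by pseudonym. The sketch below uses tuples and dicts instead of DataFrames to stay dependency-free; `build_gp_lookup` is a hypothetical name and every value is made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (PatientPseudonym, Search_Term, code_frequency), the planned GP match columns.
GpRow = Tuple[str, str, int]


def build_gp_lookup(rows: Iterable[GpRow]) -> Dict[str, Dict[str, int]]:
    """Map each pseudonym to {Search_Term: code_frequency} (hypothetical helper)."""
    lookup: Dict[str, Dict[str, int]] = defaultdict(dict)
    for pseudonym, term, freq in rows:
        lookup[pseudonym][term] = freq
    return dict(lookup)


gp_rows = [
    ("PSEUDO_A", "rheumatoid arthritis", 47),
    ("PSEUDO_A", "crohn's disease", 2),
    ("PSEUDO_B", "asthma", 5),
]
# The same patient seen at two providers: two UPIDs, one pseudonym.
hcd_rows = [
    {"UPID": "RMV12345", "PseudoNHSNoLinked": "PSEUDO_A"},
    {"UPID": "RJ198765", "PseudoNHSNoLinked": "PSEUDO_A"},
]

gp_lookup = build_gp_lookup(gp_rows)
for row in hcd_rows:
    matches = gp_lookup.get(row["PseudoNHSNoLinked"], {})
    print(row["UPID"], sorted(matches))
# RMV12345 ["crohn's disease", 'rheumatoid arthritis']
# RJ198765 ["crohn's disease", 'rheumatoid arthritis']
```

Both UPIDs resolve to the same GP matches because the key is the pseudonym, not PersonKey.
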
### Drug matching logic
- For each HCD row (UPID + Drug Name):
  1. Get the patient's GP-matched Search_Terms with code_frequency (via PseudoNHSNoLinked)
  2. Get the Search_Terms that list this drug (from DimSearchTerm.csv)
  3. Intersection = valid indications
  4. If 1: use it. If multiple: pick the highest code_frequency (most GP coding = most likely indication). If 0: fall back to directorate.
- Modified UPID groups drugs under the same indication together naturally
- code_frequency = COUNT(*) of matching SNOMED codes per Search_Term per patient in GP records
- GP code time range: only count codes from MIN(Intervention Date) onwards (the HCD data window)
  - Reduces noise from old/irrelevant diagnoses, makes frequency more meaningful
  - Pass earliest_hcd_date as a parameter to get_patient_indication_groups()
- Tiebreaker rationale: 47 RA codes vs 2 Crohn's codes → RA is clearly the active condition

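The four steps above can be sketched as a single selection function. This is an illustration under assumptions: `pick_indication` is a hypothetical name, the inputs are made up, and the alphabetical ordering for equal frequencies is an assumption of this sketch because the notes do not say how exact ties are resolved.

```python
from typing import Dict, Iterable, Optional


def pick_indication(drug_terms: Iterable[str], gp_matches: Dict[str, int]) -> Optional[str]:
    """Return the chosen Search_Term, or None when the caller should fall back.

    drug_terms: Search_Terms that list this drug (step 2).
    gp_matches: the patient's {Search_Term: code_frequency} (step 1).
    """
    # Step 3: keep only indications supported by both the drug and the GP record.
    candidates = [term for term in set(drug_terms) if term in gp_matches]
    if not candidates:
        return None  # step 4, zero matches: caller falls back to directorate
    # Step 4: highest code_frequency wins. Equal frequencies are ordered by
    # name purely to keep the result deterministic (assumption of this sketch).
    return sorted(candidates, key=lambda term: (-gp_matches[term], term))[0]


gp_matches = {"rheumatoid arthritis": 47, "crohn's disease": 2}
drug_terms = ["rheumatoid arthritis", "crohn's disease", "axial spondyloarthritis"]

print(pick_indication(drug_terms, gp_matches))   # rheumatoid arthritis
print(pick_indication(["asthma"], gp_matches))   # None
```

The first call reproduces the tiebreaker example from the notes: 47 RA codes beat 2 Crohn's codes.
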
### Known edge cases
- Some DimSearchTerm drug fragments are generic (INHALED, ORAL, CONTINUOUS)
  - These could match broadly but are constrained by the GP diagnosis requirement
- A patient visiting multiple providers has multiple UPIDs
  - Each UPID gets its own drug-indication matching independently
- The same Search_Term appears twice in DimSearchTerm.csv with different directorates
  - e.g., "diabetes" → DIABETIC MEDICINE and OPHTHALMOLOGY
  - For indication charts, we use Search_Term not directorate, so this is fine

## Iteration Log

## Iteration 1 - 2026-02-05
### Task: 1.3 - Build drug-to-Search_Term lookup from DimSearchTerm.csv
### Why this task:
- First iteration, so chose Phase 1 foundations. Task 1.3 (CSV loading) is self-contained and testable locally without Snowflake.
- Task 1.1 (Snowflake query update) can't be verified without a live connection, so it is better to do 1.3 first.
- Tasks 1.1 and 1.3 are independent of each other, so order doesn't matter for dependencies.
### Status: COMPLETE
### What was done:
- Added `load_drug_indication_mapping()` to `diagnosis_lookup.py`:
  - Loads `data/DimSearchTerm.csv`, builds two dicts:
    - `fragment_to_search_terms`: drug fragment (UPPER) → list of Search_Terms
    - `search_term_to_fragments`: search_term → list of drug fragments (UPPER)
  - Handles duplicate Search_Terms (e.g., "diabetes" rows combined)
  - Result: 164 Search_Terms, 346 drug fragments
- Added `get_search_terms_for_drug()` to `diagnosis_lookup.py`:
  - Returns all Search_Terms whose drug fragments are substrings of the drug name (case-insensitive)
  - Named differently from the plan's `drug_matches_search_term()`: it returns all matches at once rather than a single boolean, which is more practical for Phase 2
- Updated `__all__` exports

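A rough sketch of how the two dicts might be built. It is not the committed `load_drug_indication_mapping()`: the name `load_mapping_sketch` is hypothetical, it reads from an in-memory string rather than `data/DimSearchTerm.csv`, and the sample rows (including which drugs sit under which term) are made up. Only the column names and the two "diabetes" directorates come from the notes.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple

# Made-up sample rows using the documented columns.
SAMPLE_CSV = """Search_Term,CleanedDrugName,PrimaryDirectorate
rheumatoid arthritis,ADALIMUMAB|Tocilizumab,RHEUMATOLOGY
diabetes,INSULIN,DIABETIC MEDICINE
diabetes,RANIBIZUMAB|INSULIN,OPHTHALMOLOGY
"""


def load_mapping_sketch(handle: TextIO) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Build (fragment_to_search_terms, search_term_to_fragments) from CSV text."""
    fragment_to_search_terms: Dict[str, List[str]] = defaultdict(list)
    search_term_to_fragments: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(handle):
        term = row["Search_Term"].strip()
        for raw in row["CleanedDrugName"].split("|"):
            fragment = raw.strip().upper()  # fragments are stored upper-case
            if not fragment:
                continue
            # Duplicate Search_Term rows accumulate into the same lists.
            if fragment not in search_term_to_fragments[term]:
                search_term_to_fragments[term].append(fragment)
            if term not in fragment_to_search_terms[fragment]:
                fragment_to_search_terms[fragment].append(term)
    return dict(fragment_to_search_terms), dict(search_term_to_fragments)


frag_to_terms, term_to_frags = load_mapping_sketch(io.StringIO(SAMPLE_CSV))
print(term_to_frags["diabetes"])     # ['INSULIN', 'RANIBIZUMAB']
print(frag_to_terms["TOCILIZUMAB"])  # ['rheumatoid arthritis']
```

The two "diabetes" rows end up under one key with INSULIN listed once, which is the duplicate handling described above.
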
### Validation results:
- Tier 1 (Code): py_compile passed, import check passed
- Tier 2 (Data):
  - ADALIMUMAB → 7 indications (including axial spondyloarthritis, rheumatoid arthritis)
  - OMALIZUMAB → 4 indications (asthma, allergic asthma, etc.)
  - PEGYLATED LIPOSOMAL DOXORUBICIN → 4 matches via substring
  - "ADALIMUMAB 40MG" matches correctly with dosage info
  - diabetes fragments combined from 2 CSV rows
- Tier 3 (Functional): N/A (no UI changes)
### Files changed:
- data_processing/diagnosis_lookup.py (added load_drug_indication_mapping, get_search_terms_for_drug)
- IMPLEMENTATION_PLAN.md (marked 1.3 subtasks [x])
### Committed: 0779df7 "feat: add drug-to-indication mapping from DimSearchTerm.csv (Task 1.3)"
### Patterns discovered:
- DimSearchTerm.csv has 164 unique Search_Terms (not 165 as noted) because diabetes appears twice with different directorates but the same Search_Term
- Some drug fragments are very generic: INHALED, CONTINUOUS, ORAL, STANDARD-DOSE, INTRAVENOUS, PEGYLATED, ROUTINE, INDUCTION. These will match broadly but are constrained by the GP diagnosis requirement in Phase 2
- Function signature for Phase 2: `get_search_terms_for_drug(drug_name, search_term_to_fragments)` returns list[str]. Use this to get candidate indications per drug

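To make the substring behaviour and the generic-fragment risk concrete, here is a sketch with the same argument order as the signature above. It is not the committed function: `search_terms_for_drug_sketch` is a hypothetical name, the mapping is a made-up subset, and "INHALED TOBRAMYCIN" is an invented drug name chosen only to trigger the generic INHALED fragment.

```python
from typing import Dict, List


def search_terms_for_drug_sketch(
    drug_name: str, search_term_to_fragments: Dict[str, List[str]]
) -> List[str]:
    """Return every Search_Term with a fragment contained in drug_name (case-insensitive)."""
    name = drug_name.upper()
    return [
        term
        for term, fragments in search_term_to_fragments.items()
        if any(fragment.upper() in name for fragment in fragments)
    ]


# Made-up subset of the mapping.
mapping = {
    "rheumatoid arthritis": ["ADALIMUMAB"],
    "asthma": ["OMALIZUMAB", "INHALED"],
    "urticaria": ["OMALIZUMAB"],
}

print(search_terms_for_drug_sketch("Adalimumab 40mg", mapping))     # ['rheumatoid arthritis']
print(search_terms_for_drug_sketch("OMALIZUMAB", mapping))          # ['asthma', 'urticaria']
print(search_terms_for_drug_sketch("INHALED TOBRAMYCIN", mapping))  # ['asthma']
```

The last call shows the broad match from a generic fragment. It only becomes an assigned indication if the patient also has an asthma match in their GP record.
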
### Next iteration should:
- Work on Task 1.2: Merge asthma Search_Terms in CLUSTER_MAPPING_SQL and load_drug_indication_mapping()
  - Merge "allergic asthma", "asthma", "severe persistent allergic asthma" → "asthma"
  - Keep "urticaria" separate
  - This is self-contained and testable locally
- OR work on Task 1.1: Update `get_patient_indication_groups()` to return ALL matches with code_frequency
  - The current query at line ~1352 of diagnosis_lookup.py uses `QUALIFY ROW_NUMBER() OVER (PARTITION BY pc."PatientPseudonym" ORDER BY pc."EventDateTime" DESC) = 1`. This must be replaced with GROUP BY + COUNT(*)
  - Add `earliest_hcd_date` parameter to restrict GP codes to the HCD data window
  - Return columns: PatientPseudonym, Search_Term, code_frequency (not EventDateTime)
- OR if Snowflake isn't available to test 1.1, skip to Task 2.1 (assign_drug_indications function), which can be built and tested with mock data
### Blocked items:
- None

## Iteration 2 - 2026-02-05
### Task: 1.2 - Merge related asthma Search_Terms in CLUSTER_MAPPING_SQL
### Why this task:
- Previous iteration recommended this as the next task (self-contained, testable locally)
- Both CLUSTER_MAPPING_SQL and load_drug_indication_mapping() need consistent Search_Term names
- Must be done before Task 1.1 (Snowflake query) to ensure GP lookups return "asthma" not "allergic asthma"
### Status: COMPLETE
### What was done:
- Updated CLUSTER_MAPPING_SQL: changed 'allergic asthma' → 'asthma' (AST_COD) and 'severe persistent allergic asthma' → 'asthma' (SEVAST_COD)
  - Now 3 rows for 'asthma': AST_COD, eFI2_Asthma, SEVAST_COD
  - urticaria (XSAL_COD) stays separate
- Added SEARCH_TERM_MERGE_MAP constant: {"allergic asthma": "asthma", "severe persistent allergic asthma": "asthma"}
- Updated load_drug_indication_mapping() to apply SEARCH_TERM_MERGE_MAP when loading the CSV
  - Normalizes Search_Term before accumulating fragments
  - Drug fragments from all 3 original rows combined under the "asthma" key
- Exported SEARCH_TERM_MERGE_MAP in `__all__`

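The load-time normalization can be illustrated with a short sketch. The SEARCH_TERM_MERGE_MAP contents are copied from the notes; the helper names and the assignment of fragments to the pre-merge rows are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Contents as recorded in the notes above.
SEARCH_TERM_MERGE_MAP = {
    "allergic asthma": "asthma",
    "severe persistent allergic asthma": "asthma",
}


def normalize_search_term(term: str) -> str:
    """Map a variant to its merged name; other terms pass through unchanged."""
    return SEARCH_TERM_MERGE_MAP.get(term, term)


def accumulate_fragments(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Combine pipe-separated fragments under the normalized Search_Term."""
    merged: Dict[str, List[str]] = defaultdict(list)
    for term, pipe_separated in rows:
        key = normalize_search_term(term)  # normalize BEFORE accumulating
        for fragment in pipe_separated.upper().split("|"):
            if fragment and fragment not in merged[key]:
                merged[key].append(fragment)
    return dict(merged)


# Made-up pre-merge rows: (Search_Term, CleanedDrugName).
rows = [
    ("allergic asthma", "OMALIZUMAB"),
    ("asthma", "BENRALIZUMAB|INHALED"),
    ("severe persistent allergic asthma", "OMALIZUMAB|MEPOLIZUMAB"),
    ("urticaria", "OMALIZUMAB"),
]
merged = accumulate_fragments(rows)
print(sorted(merged))    # ['asthma', 'urticaria']
print(merged["asthma"])  # ['OMALIZUMAB', 'BENRALIZUMAB', 'INHALED', 'MEPOLIZUMAB']
```

Because the key is normalized before fragments are appended, no separate deduplication pass over the variant keys is needed afterwards.
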
### Validation results:
- Tier 1 (Code): py_compile passed, import check passed
- Tier 2 (Data):
  - "asthma" fragments: OMALIZUMAB, BENRALIZUMAB, DUPILUMAB, INHALED, MEPOLIZUMAB, RESLIZUMAB (complete combined list)
  - "allergic asthma" no longer exists as a separate key
  - "severe persistent allergic asthma" no longer exists as a separate key
  - "urticaria" → ['OMALIZUMAB'], correctly separate
  - OMALIZUMAB maps to: ['asthma', 'urticaria'], as expected
  - Total Search_Terms: 162 (was 164, 3 asthma entries → 1)
  - Total fragments: 346 (unchanged)
- Tier 3 (Functional): N/A (no UI changes)
### Files changed:
- data_processing/diagnosis_lookup.py (CLUSTER_MAPPING_SQL, SEARCH_TERM_MERGE_MAP, load_drug_indication_mapping)
- IMPLEMENTATION_PLAN.md (marked 1.2 subtasks [x])
### Committed: b0a8a9d "feat: merge asthma Search_Term variants in CLUSTER_MAPPING_SQL and drug mapping (Task 1.2)"
### Patterns discovered:
- SEARCH_TERM_MERGE_MAP is reusable: any future module that receives Search_Terms from Snowflake can apply the same normalization
- The merge approach (normalize at load time) is cleaner than post-hoc deduplication
### Next iteration should:
- Work on Task 1.1: Update `get_patient_indication_groups()` to return ALL matches with code_frequency
  - The current query at ~line 1467 uses `QUALIFY ROW_NUMBER() OVER (PARTITION BY pc."PatientPseudonym" ORDER BY pc."EventDateTime" DESC) = 1`
  - Replace with GROUP BY + COUNT(*) for code_frequency
  - Add `earliest_hcd_date` parameter to restrict GP codes to the HCD data window
  - Return columns: PatientPseudonym, Search_Term, code_frequency
  - Empty DataFrame columns should match the new return type
  - This requires Snowflake connectivity to fully test, but code changes can be verified with py_compile and import checks
- OR work on Task 2.1: Create assign_drug_indications(), which can be built and tested with mock data
  - This is independent of Task 1.1 if you mock the gp_matches_df input
### Blocked items:
- None

## Iteration 3 - 2026-02-05
### Task: 1.1 - Update get_patient_indication_groups() to return ALL matches with code_frequency
### Why this task:
- Previous iteration recommended this as the next task
- This is the last remaining Phase 1 dependency: Tasks 2.1/2.2 need the GP matches data in the right format (multiple rows per patient with code_frequency)
- Task 1.1 changes the query fundamentals; better to do this before building assign_drug_indications(), which consumes the output
### Status: COMPLETE
### What was done:
- Modified `get_patient_indication_groups()` in `diagnosis_lookup.py`:
  1. **Removed** `QUALIFY ROW_NUMBER() OVER (PARTITION BY pc."PatientPseudonym" ORDER BY pc."EventDateTime" DESC) = 1`
  2. **Replaced** with `GROUP BY pc."PatientPseudonym", aic.Search_Term` + `COUNT(*) AS "code_frequency"`
  3. **Added** `earliest_hcd_date: Optional[str] = None` parameter
  4. **Added** optional `AND pc."EventDateTime" >= %s` when earliest_hcd_date is provided
  5. **Updated** return columns from `(PatientPseudonym, Search_Term, EventDateTime)` to `(PatientPseudonym, Search_Term, code_frequency)`
  6. **Updated** all empty DataFrame returns to use the new column names
  7. **Updated** logging to show multiple-rows-per-patient stats (avg indications per patient)
  8. **Updated** docstring to describe the new behavior and parameters
- Backward compatible: `earliest_hcd_date` defaults to `None`, so existing callers still work
- Note: the caller in `refresh_pathways.py` (lines 424-428) does `dict(zip(...))`, which keeps only the last match per patient with the new multi-row format. This will be updated in Task 3.1

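The `dict(zip(...))` issue noted for `refresh_pathways.py` is easy to reproduce in isolation. The values below are made up, and the list-based replacement is only one possible shape, not what Task 3.1 will necessarily use.

```python
# Made-up rows in the new multi-row shape: one row per patient and Search_Term.
pseudonyms = ["PSEUDO_A", "PSEUDO_A", "PSEUDO_B"]
search_terms = ["rheumatoid arthritis", "crohn's disease", "asthma"]

# Old-style lookup: later rows silently overwrite earlier ones for the same key.
single_match = dict(zip(pseudonyms, search_terms))
print(single_match)
# {'PSEUDO_A': "crohn's disease", 'PSEUDO_B': 'asthma'}

# One possible replacement: keep every Search_Term per patient.
all_matches = {}
for pseudonym, term in zip(pseudonyms, search_terms):
    all_matches.setdefault(pseudonym, []).append(term)
print(all_matches)
# {'PSEUDO_A': ['rheumatoid arthritis', "crohn's disease"], 'PSEUDO_B': ['asthma']}
```

Which match survives in the first form depends only on row order, so the result is no longer "most recent" in any meaningful sense.
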
### Validation results:
- Tier 1 (Code): py_compile PASSED, import check PASSED, function signature verified
- Tier 2 (Data): Empty DataFrame returns correct columns ['PatientPseudonym', 'Search_Term', 'code_frequency']; live Snowflake test deferred to Phase 3/4
- Tier 3 (Functional): N/A (no UI changes)
### Files changed:
- data_processing/diagnosis_lookup.py (modified get_patient_indication_groups function)
- IMPLEMENTATION_PLAN.md (marked 1.1 subtasks [x])
### Committed: c93417f "feat: return ALL GP matches with code_frequency in get_patient_indication_groups (Task 1.1)"
### Patterns discovered:
- The `earliest_hcd_date` parameter is passed as a string in ISO format (YYYY-MM-DD) via a Snowflake %s placeholder; Snowflake handles the string-to-timestamp comparison implicitly
- The GROUP BY collapses all matching GP records into one row per patient and Search_Term, while COUNT(*) still counts every record: a patient with the same SNOMED code recorded 5 times gets code_frequency=5 (reflecting clinical activity intensity)
- The params list is built dynamically: `batch_pseudonyms + [earliest_hcd_date]` only when the date filter is active

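The query shape and the dynamic params list can be tried locally with SQLite standing in for Snowflake. This is a stand-in sketch, not the real query: SQLite uses `?` placeholders instead of `%s`, the single `gp_codes` table is a simplification of the join against the CLUSTER_MAPPING_SQL CTE, and the table name, function name, and rows are all made up.

```python
import sqlite3
from typing import List, Optional, Tuple


def fetch_code_frequency(
    conn: sqlite3.Connection,
    batch_pseudonyms: List[str],
    earliest_hcd_date: Optional[str] = None,
) -> List[Tuple[str, str, int]]:
    """Return (PatientPseudonym, Search_Term, code_frequency) rows for one batch."""
    if not batch_pseudonyms:
        return []  # avoid generating an empty IN () clause
    placeholders = ", ".join(["?"] * len(batch_pseudonyms))
    sql = (
        "SELECT PatientPseudonym, Search_Term, COUNT(*) AS code_frequency "
        "FROM gp_codes "
        f"WHERE PatientPseudonym IN ({placeholders}) "
    )
    params = list(batch_pseudonyms)
    if earliest_hcd_date is not None:
        # The date is appended only when the filter is active.
        sql += "AND EventDateTime >= ? "
        params.append(earliest_hcd_date)
    sql += "GROUP BY PatientPseudonym, Search_Term ORDER BY PatientPseudonym, Search_Term"
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gp_codes (PatientPseudonym TEXT, Search_Term TEXT, EventDateTime TEXT)")
conn.executemany(
    "INSERT INTO gp_codes VALUES (?, ?, ?)",
    [
        ("PSEUDO_A", "rheumatoid arthritis", "2018-03-01"),
        ("PSEUDO_A", "rheumatoid arthritis", "2021-06-10"),
        ("PSEUDO_A", "rheumatoid arthritis", "2022-01-15"),
        ("PSEUDO_A", "crohn's disease", "2015-09-30"),
    ],
)

print(fetch_code_frequency(conn, ["PSEUDO_A"]))
# [('PSEUDO_A', "crohn's disease", 1), ('PSEUDO_A', 'rheumatoid arthritis', 3)]
print(fetch_code_frequency(conn, ["PSEUDO_A"], "2020-01-01"))
# [('PSEUDO_A', 'rheumatoid arthritis', 2)]
```

With the date filter active, the 2015 Crohn's record drops out entirely and the RA count falls from 3 to 2, which is the noise reduction the HCD data window is meant to provide.
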
### Next iteration should:
- Work on Task 2.1: Create `assign_drug_indications()` function
  - This is now unblocked since 1.1 is complete (return format is known)
  - Input: HCD df, gp_matches_df (PatientPseudonym, Search_Term, code_frequency), drug_mapping from load_drug_indication_mapping()
  - Output: (modified_df with `UPID|search_term`, indication_df mapping modified_UPID → Search_Term)
  - Can be built and tested with mock data (no Snowflake needed)
  - Key logic: for each UPID+Drug pair, intersect the drug's Search_Terms with the patient's GP matches, pick highest code_frequency as tiebreaker
  - The function needs PseudoNHSNoLinked to look up GP matches, so the df must have that column
- Task 2.2 (tiebreaker logic) can be done within 2.1 or as a follow-up
- The final Phase 1 subtask (1.1 verify with live Snowflake) will be tested during Phase 3/4 integration
### Blocked items:
- Task 1.1 final subtask "Verify: Query returns more rows" requires live Snowflake, so it is deferred to Phase 3/4