HighCostDrugsDemo/IMPLEMENTATION_PLAN.md
Andrew Charlwood f3bba6dfab docs: complete Phase 4 validation — full refresh and data verification (Task 4.1-4.3)
Full refresh: 2,947 nodes (1,101 directory + 1,846 indication) in 738s.
Validation: RA/asthma drugs correctly grouped, fallback labels present,
directory charts unchanged, Reflex compiles. All completion criteria met.
2026-02-06 00:12:53 +00:00


# Implementation Plan - Drug-Aware Indication Matching
## Project Overview
Update the indication-based pathway charts so that patient indications are matched **per drug**, not just per patient. Currently, each patient gets ONE indication (most recent GP diagnosis match). This ignores which drugs the patient is actually taking.
### The Problem
A patient on ADALIMUMAB + OMALIZUMAB currently gets assigned a single indication (e.g., "rheumatoid arthritis" — the most recent GP match). But:
- ADALIMUMAB is used for rheumatoid arthritis, axial spondyloarthritis, crohn's disease, etc.
- OMALIZUMAB is used for asthma, allergic asthma, urticaria
These are different clinical pathways and should be treated as separate treatment journeys.
### The Solution
Match each drug to an indication by cross-referencing:
1. **GP diagnosis** — which Search_Terms the patient has matching SNOMED codes for
2. **Drug mapping** — which Search_Terms list each drug (from `DimSearchTerm.csv`)
Only assign a drug to an indication if BOTH conditions are met. If a patient's drugs map to different indications, they become separate pathways (via modified UPID).
### Key Design Decisions
| Aspect | Decision |
|--------|----------|
| Drug-indication source | `data/DimSearchTerm.csv` — Search_Term → CleanedDrugName mapping |
| UPID modification | `{original_UPID}\|{search_term}` for drugs with matched indication |
| GP diagnosis matching | Return ALL matches per patient (not just most recent) |
| Drug matching | Substring match: HCD drug name contains DimSearchTerm fragment |
| Multiple indication matches per drug | Use highest GP code frequency as tiebreaker (COUNT of matching SNOMED codes per Search_Term) |
| GP code time range | Only codes from MIN(Intervention Date) onwards — restricts to HCD data window |
| No indication match | Fallback to directory (same as current behavior) |
| Same patient, different indications | Separate pathways via different modified UPIDs |
### Examples
**Patient on ADALIMUMAB + GOLIMUMAB, GP dx: axial spondyloarthritis + asthma**
- axial spondyloarthritis drug list includes both ADALIMUMAB and GOLIMUMAB
- → Both drugs grouped under "axial spondyloarthritis", single pathway
- Modified UPID: `RMV12345|axial spondyloarthritis`
**Patient on ADALIMUMAB + OMALIZUMAB, GP dx: axial spondyloarthritis + asthma**
- axial spondyloarthritis lists ADALIMUMAB but not OMALIZUMAB
- asthma lists OMALIZUMAB but not ADALIMUMAB
- → Two separate pathways:
  - `RMV12345|axial spondyloarthritis` with ADALIMUMAB
  - `RMV12345|asthma` with OMALIZUMAB
**Patient on ADALIMUMAB, GP dx: rheumatoid arthritis (47 codes) + crohn's disease (2 codes)**
- Both Search_Terms list ADALIMUMAB AND patient has GP dx for both
- → Tiebreaker: highest code frequency — rheumatoid arthritis has 47 matching SNOMED codes vs 2 for crohn's
- → Single pathway under rheumatoid arthritis (more clinical activity = more likely the treatment indication)
---
## Phase 1: Update Snowflake Query & Drug Mapping
### 1.1 Update `get_patient_indication_groups()` to return ALL matches with frequency
- [x] Modify the Snowflake query in `get_patient_indication_groups()` (diagnosis_lookup.py):
  - Remove `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY EventDateTime DESC) = 1`
  - Return ALL matching Search_Terms per patient with code frequency:
    ```sql
    SELECT pc."PatientPseudonym" AS "PatientPseudonym",
           aic.Search_Term AS "Search_Term",
           COUNT(*) AS "code_frequency"
    FROM PrimaryCareClinicalCoding pc
    JOIN AllIndicationCodes aic ON pc."SNOMEDCode" = aic.SNOMEDCode
    WHERE pc."PatientPseudonym" IN (...)
      AND pc."EventDateTime" >= :earliest_hcd_date
    GROUP BY pc."PatientPseudonym", aic.Search_Term
    ```
  - `code_frequency` = number of matching SNOMED codes per Search_Term per patient
  - Higher frequency = more clinical activity = stronger signal for tiebreaker
  - `earliest_hcd_date` = `MIN(Intervention Date)` from the HCD DataFrame; this restricts GP codes to the HCD data window, reducing noise from old/irrelevant diagnoses
- [x] Accept `earliest_hcd_date` parameter in `get_patient_indication_groups()` and pass to query
- [x] Keep batch processing (500 patients per query at this stage; raised to 5,000 in Task 3.2)
- [x] Update return type: DataFrame now has multiple rows per patient (PatientPseudonym, Search_Term, code_frequency)
- [x] Verify: Query returns more rows than before — 537,794 patient-indication rows (avg 16.0 per matched patient) vs previous single row per patient
### 1.2 Merge related asthma Search_Terms in CLUSTER_MAPPING_SQL
- [x] In `CLUSTER_MAPPING_SQL` (diagnosis_lookup.py), merge these 3 Search_Terms into one `"asthma"` entry:
  - `allergic asthma` (Cluster: OMALIZUMAB only)
  - `asthma` (Cluster: BENRALIZUMAB, DUPILUMAB, INHALED, MEPOLIZUMAB, OMALIZUMAB, RESLIZUMAB)
  - `severe persistent allergic asthma` (Cluster: OMALIZUMAB only)
- [x] Map all 3 Cluster_IDs to `Search_Term = 'asthma'` in the CTE VALUES
- [x] `urticaria` (OMALIZUMAB, DERMATOLOGY) stays SEPARATE: do NOT merge with asthma
- [x] Also update `load_drug_indication_mapping()` to apply the same merge when loading DimSearchTerm.csv:
  - Combine drug lists from all 3 entries under a single `"asthma"` key
  - Deduplicate drug fragments (OMALIZUMAB appears in all 3)
- [x] Verify: GP code lookup returns `"asthma"` (not `"allergic asthma"` or `"severe persistent allergic asthma"`)
- [x] Verify: Drug mapping for `"asthma"` includes full combined drug list: BENRALIZUMAB, DUPILUMAB, INHALED, MEPOLIZUMAB, OMALIZUMAB, RESLIZUMAB
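
The merge and deduplication described above can be sketched as follows. This is a minimal illustration, assuming the CSV has already been parsed into a dict of Search_Term to drug-fragment list; `ASTHMA_ALIASES` and `merge_asthma_terms` are hypothetical names, not the project's actual code, and the `urticaria` entry is trimmed to the one drug named in this plan.

```python
# Hypothetical helper: names are illustrative, not taken from diagnosis_lookup.py.
ASTHMA_ALIASES = {"allergic asthma", "asthma", "severe persistent allergic asthma"}


def merge_asthma_terms(term_to_fragments: dict[str, list[str]]) -> dict[str, list[str]]:
    """Fold the three asthma Search_Terms into a single 'asthma' key."""
    merged: dict[str, list[str]] = {}
    for term, fragments in term_to_fragments.items():
        key = "asthma" if term in ASTHMA_ALIASES else term
        bucket = merged.setdefault(key, [])
        for fragment in fragments:
            if fragment not in bucket:  # deduplicate, keeping first-seen order
                bucket.append(fragment)
    return merged


sample = {
    "allergic asthma": ["OMALIZUMAB"],
    "asthma": ["BENRALIZUMAB", "DUPILUMAB", "INHALED", "MEPOLIZUMAB", "OMALIZUMAB", "RESLIZUMAB"],
    "severe persistent allergic asthma": ["OMALIZUMAB"],
    "urticaria": ["OMALIZUMAB"],  # stays separate
}
merged = merge_asthma_terms(sample)
```

After the merge, `merged` has only the `asthma` and `urticaria` keys, and OMALIZUMAB appears once in the combined asthma list.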
### 1.3 Build drug-to-Search_Term lookup from DimSearchTerm.csv
- [x] Add function `load_drug_indication_mapping()` to `diagnosis_lookup.py`:
  - Loads `data/DimSearchTerm.csv`
  - Builds dict: `drug_fragment (uppercase) → list[Search_Term]`
  - Also builds reverse: `search_term → list[drug_fragments]`
  - CleanedDrugName is pipe-separated (e.g., "ADALIMUMAB|GOLIMUMAB|IXEKIZUMAB")
- [x] Add function `get_search_terms_for_drug(drug_name, search_term_to_fragments) -> list[str]`:
  - Returns all Search_Terms whose drug fragments are substrings of the drug name (case-insensitive)
  - More practical than a per-term boolean check: returns all matches at once for Phase 2 use
- [x] Verify: ADALIMUMAB matches "axial spondyloarthritis", OMALIZUMAB matches "asthma"
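
A rough sketch of the two lookups and the substring match, using only the standard library. The CSV rows here are illustrative (the real `data/DimSearchTerm.csv` has more rows and longer drug lists, and the directorate shown for axial spondyloarthritis is an assumption), and `load_mapping` is a stand-in name that reads from a file-like object rather than the real path.

```python
import csv
import io

# Illustrative rows only; not the actual contents of data/DimSearchTerm.csv.
SAMPLE_CSV = """Search_Term,CleanedDrugName,PrimaryDirectorate
axial spondyloarthritis,ADALIMUMAB|GOLIMUMAB,RHEUMATOLOGY
asthma,BENRALIZUMAB|DUPILUMAB|INHALED|MEPOLIZUMAB|OMALIZUMAB|RESLIZUMAB,THORACIC MEDICINE
"""


def load_mapping(handle):
    """Return (fragment -> [Search_Term], Search_Term -> [fragment]) from CSV rows."""
    fragment_to_terms = {}
    term_to_fragments = {}
    for row in csv.DictReader(handle):
        term = row["Search_Term"].strip()
        # CleanedDrugName is pipe-separated; fragments are normalised to uppercase.
        for fragment in row["CleanedDrugName"].upper().split("|"):
            fragment = fragment.strip()
            if not fragment:
                continue
            fragment_to_terms.setdefault(fragment, []).append(term)
            term_to_fragments.setdefault(term, []).append(fragment)
    return fragment_to_terms, term_to_fragments


def get_search_terms_for_drug(drug_name, search_term_to_fragments):
    """All Search_Terms with at least one fragment contained in drug_name (case-insensitive)."""
    name = drug_name.upper()
    return [
        term
        for term, fragments in search_term_to_fragments.items()
        if any(fragment in name for fragment in fragments)
    ]


fragment_to_terms, term_to_fragments = load_mapping(io.StringIO(SAMPLE_CSV))
adalimumab_terms = get_search_terms_for_drug("Adalimumab 40mg pen", term_to_fragments)
omalizumab_terms = get_search_terms_for_drug("OMALIZUMAB", term_to_fragments)
```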
---
## Phase 2: Drug-Aware Indication Matching Logic
### 2.1 Create `assign_drug_indications()` function
- [x] Add to `diagnosis_lookup.py` or `pathway_pipeline.py`:
```python
import pandas as pd


def assign_drug_indications(
    df: pd.DataFrame,             # HCD data with UPID, Drug Name columns
    gp_matches_df: pd.DataFrame,  # one row per (PatientPseudonym, Search_Term) with code_frequency
    drug_mapping: dict,           # from load_drug_indication_mapping()
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (modified_df, indication_df).

    - modified_df: HCD data with UPID replaced by {UPID}|{indication}
    - indication_df: mapping modified_UPID -> Search_Term
    """
    ...  # signature only; the matching logic is listed below
```
- [x] Logic per UPID + Drug Name pair:
  1. Get patient's GP-matched Search_Terms with code_frequency (from gp_matches_df via PseudoNHSNoLinked)
  2. Get which Search_Terms include this drug (from drug_mapping)
  3. Intersection = valid indications for this drug-patient pair
  4. If 1 match: use it
  5. If multiple matches: use highest code_frequency as tiebreaker (most GP coding activity = most likely treatment indication)
  6. If 0 matches: use fallback directory
- [x] Modify UPID in df rows: `{original_UPID}|{matched_search_term}`
- [x] Build indication_df: `{modified_UPID}` → `Search_Term` (or fallback label)
- [x] Verify: Function compiles, handles edge cases (no GP match, no drug match)
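
The six steps above can be sketched in plain Python. This is a simplification that works on lists of dicts rather than DataFrames; the `Directory` key, the nested-dict shape of `gp_matches`, and the frequencies in the sample data are assumptions made for illustration, not the project's actual schema or data.

```python
def assign_drug_indications(rows, gp_matches, term_to_fragments):
    """Plain-Python sketch of the per-(UPID, drug) matching steps.

    rows: list of dicts with 'UPID', 'PseudoNHSNoLinked', 'Drug Name', 'Directory'
          ('Directory' is an assumed name for the fallback directory column).
    gp_matches: {patient_pseudonym: {Search_Term: code_frequency}}
    term_to_fragments: {Search_Term: [drug fragments]}
    """
    modified_rows, indication_map = [], {}
    for row in rows:
        drug = row["Drug Name"].upper()
        # Step 1: the patient's GP-matched Search_Terms with frequencies.
        gp_terms = gp_matches.get(row["PseudoNHSNoLinked"], {})
        # Step 2: Search_Terms whose drug list includes this drug (substring match).
        drug_terms = {
            term for term, fragments in term_to_fragments.items()
            if any(fragment in drug for fragment in fragments)
        }
        # Step 3: intersection = valid indications for this drug-patient pair.
        valid = drug_terms & set(gp_terms)
        if valid:
            # Steps 4-5: highest frequency wins; alphabetical order breaks exact ties.
            label = min(valid, key=lambda term: (-gp_terms[term], term))
        else:
            # Step 6: fall back to the directory label.
            label = f"{row['Directory']} (no GP dx)"
        new_upid = f"{row['UPID']}|{label}"
        modified_rows.append({**row, "UPID": new_upid})
        indication_map[new_upid] = label
    return modified_rows, indication_map


# Illustrative data: frequencies and the second patient are made up.
term_to_fragments = {
    "axial spondyloarthritis": ["ADALIMUMAB", "GOLIMUMAB"],
    "asthma": ["OMALIZUMAB"],
}
gp_matches = {"P1": {"axial spondyloarthritis": 5, "asthma": 3}}
rows = [
    {"UPID": "RMV12345", "PseudoNHSNoLinked": "P1", "Drug Name": "ADALIMUMAB", "Directory": "RHEUMATOLOGY"},
    {"UPID": "RMV12345", "PseudoNHSNoLinked": "P1", "Drug Name": "OMALIZUMAB", "Directory": "THORACIC MEDICINE"},
    {"UPID": "RMV99999", "PseudoNHSNoLinked": "P2", "Drug Name": "ADALIMUMAB", "Directory": "RHEUMATOLOGY"},
]
modified_rows, indication_map = assign_drug_indications(rows, gp_matches, term_to_fragments)
```

The first patient splits into two modified UPIDs (one per indication), while the second patient has no GP match and receives the fallback label.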
### 2.2 Handle tiebreaker for multiple indication matches
- [x] When a drug matches multiple Search_Terms AND patient has GP dx for multiple:
  - Use `code_frequency` from the GP query (COUNT of matching SNOMED codes per Search_Term)
  - Higher code_frequency = more clinical activity for that condition = more likely treatment indication
  - E.g., patient with 47 RA codes and 2 crohn's codes → ADALIMUMAB assigned to RA
  - code_frequency is already returned by the updated query in Task 1.1
- [x] Verify: Tiebreaker logic correctly picks highest-frequency diagnosis
- [x] Verify: Tie on frequency (rare but possible) falls back to alphabetical Search_Term for determinism
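
The tiebreaker in isolation might look like the sketch below. `pick_indication` is a hypothetical name, and ascending alphabetical order is assumed for the tie fallback; the 47 vs 2 case is the example from this plan, while the 10 vs 10 case is invented to show the deterministic fallback.

```python
def pick_indication(candidates: dict[str, int]) -> str:
    """Pick the Search_Term with the highest code_frequency.

    Exact frequency ties fall back to alphabetical order so the result is deterministic.
    """
    if not candidates:
        raise ValueError("no candidate indications")
    # Sort key: frequency descending (negated), then Search_Term ascending.
    return min(candidates, key=lambda term: (-candidates[term], term))


winner = pick_indication({"rheumatoid arthritis": 47, "crohn's disease": 2})
tied = pick_indication({"rheumatoid arthritis": 10, "crohn's disease": 10})
```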
---
## Phase 3: Pipeline Integration
### 3.1 Update `refresh_pathways.py` indication processing
- [x] In the `elif current_chart_type == "indication":` block:
  1. Call `get_patient_indication_groups()` as before (but now returns ALL matches)
  2. Load drug mapping: `drug_mapping = load_drug_indication_mapping()`
  3. Call `assign_drug_indications(df, gp_matches_df, drug_mapping)`
  4. Use modified_df (with indication-aware UPIDs) for pathway processing
  5. Use indication_df for the indication mapping
- [x] Pass modified_df (not original df) to `process_indication_pathway_for_date_filter()`
- [x] Verify: pipeline compiles (`python -m py_compile cli/refresh_pathways.py`)
### 3.2 Test with dry run
- [x] Run `python -m cli.refresh_pathways --chart-type indication --dry-run -v`
- [x] Verify:
  - Modified UPIDs appear in pipeline log (42,072 unique modified UPIDs)
  - Patient counts are reasonable (42,072 modified UPIDs vs 36,628 original patients)
  - Drug-indication matching is logged (49.3% match, 50.7% fallback, 15,238 tiebreakers)
  - Pathway hierarchy shows drug-specific grouping under correct indications (1,846 total nodes)
- [x] Fixed: network_timeout increased from 30→600 (was killing GP lookup queries)
- [x] Fixed: batch_size increased from 500→5000 (reduces CTE compilation overhead from 74 to 8 batches)
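
The batch counts quoted above are consistent with splitting the 36,628 original patients into fixed-size chunks. A small sketch, assuming that is what is being batched; `iter_batches` is a hypothetical helper and the patient IDs are synthetic.

```python
from typing import Iterator, Sequence


def iter_batches(ids: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size patient IDs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Synthetic IDs; only the count (36,628) comes from the dry-run figures.
patient_ids = [f"P{i}" for i in range(36_628)]
batches_500 = sum(1 for _ in iter_batches(patient_ids, 500))    # 74 batches
batches_5000 = sum(1 for _ in iter_batches(patient_ids, 5000))  # 8 batches
```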
---
## Phase 4: Full Refresh & Validation
### 4.1 Full refresh with both chart types
- [x] Run `python -m cli.refresh_pathways --chart-type all`
- [x] Verify:
  - Both chart types generate data (directory: 1,101 nodes, indication: 1,846 nodes)
  - Directory charts unchanged (293-329 nodes per date filter, same as before)
  - Indication charts reflect drug-aware matching (42,072 modified UPIDs, 49.3% match rate)
### 4.2 Validate indication chart correctness
- [x] Check that drugs under an indication all appear in that Search_Term's drug list
  - RA: ADALIMUMAB, RITUXIMAB, BARICITINIB, CERTOLIZUMAB PEGOL, TOCILIZUMAB ✓
  - Asthma: DUPILUMAB, OMALIZUMAB ✓
- [x] Verify that a patient on drugs for different indications creates separate pathway branches
  - 42,072 modified UPIDs vs 36,628 original patients confirms splitting ✓
- [x] Verify that drugs sharing an indication are grouped in the same pathway
  - Multiple RA drugs (ADALIMUMAB, RITUXIMAB, etc.) all under "rheumatoid arthritis" ✓
- [x] Log: patient count comparison (old vs new approach)
  - Old: 36,628 patients → single indication each
  - New: 42,072 modified UPIDs → drug-specific indications (15% increase from splitting)
### 4.3 Validate Reflex UI
- [x] Run `python -m reflex compile` to verify app compiles (compiled in 16.6s)
- [x] Verify chart type toggle still works (no code changes to UI, toggle mechanism unchanged)
- [x] Verify indication chart shows correct hierarchy (42 unique search_terms at level 2 for all_6mo)
---
## Completion Criteria
All tasks marked `[x]` AND:
- [x] App compiles without errors (`reflex compile` succeeds — 16.6s)
- [x] Both chart types generate pathway data (directory: 1,101, indication: 1,846)
- [x] Indication charts show drug-specific indication matching (49.3% match rate)
- [x] Drugs under the same indication for the same patient are in one pathway (validated via SQLite queries)
- [x] Drugs under different indications for the same patient create separate pathways (42,072 modified UPIDs > 36,628 original)
- [x] Fallback works for drugs with no indication match (RHEUMATOLOGY/OPHTHALMOLOGY/etc. "(no GP dx)" labels present)
- [x] Full refresh completes successfully (2,947 records in 738.4s)
- [x] Existing directory charts are unaffected (1,101 nodes, same count range as previous refresh)
---
## Reference
### DimSearchTerm.csv Structure
```
Search_Term,CleanedDrugName,PrimaryDirectorate
rheumatoid arthritis,ABATACEPT|ADALIMUMAB|ANAKINRA|BARICITINIB|...,RHEUMATOLOGY
asthma,BENRALIZUMAB|DUPILUMAB|INHALED|MEPOLIZUMAB|OMALIZUMAB|RESLIZUMAB,THORACIC MEDICINE
```
### Modified UPID Format
```
Original: RMV12345
Modified: RMV12345|rheumatoid arthritis
Fallback: RMV12345|RHEUMATOLOGY (no GP dx)
```
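
A minimal sketch of building and splitting this format. The helper names are hypothetical, and the split assumes the original UPID never contains a `|` character.

```python
def make_modified_upid(upid: str, label: str) -> str:
    """Join the original UPID and the indication (or fallback) label with '|'."""
    return f"{upid}|{label}"


def split_modified_upid(modified: str) -> tuple[str, str]:
    """Recover (original UPID, label); assumes the original UPID contains no '|'."""
    upid, sep, label = modified.partition("|")
    if not sep:
        raise ValueError(f"not a modified UPID: {modified!r}")
    return upid, label


matched = make_modified_upid("RMV12345", "rheumatoid arthritis")
original, label = split_modified_upid("RMV12345|RHEUMATOLOGY (no GP dx)")
```

Splitting on the first `|` keeps the fallback label intact, including its "(no GP dx)" suffix.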
### Current vs New Indication Flow
```
CURRENT:
  Patient → GP dx (most recent) → single Search_Term → one pathway
NEW:
  Patient + Drug A → GP dx matching Drug A → Search_Term X
  Patient + Drug B → GP dx matching Drug B → Search_Term Y
    → If X == Y: one pathway under X
    → If X != Y: two pathways (modified UPIDs)
```
### Key Files
| File | Changes |
|------|---------|
| `data_processing/diagnosis_lookup.py` | Update query, add drug mapping functions |
| `data_processing/pathway_pipeline.py` | Possibly minor changes for modified UPIDs |
| `cli/refresh_pathways.py` | Integrate drug-aware matching into pipeline |
| `data/DimSearchTerm.csv` | Reference data (read-only) |
| `analysis/pathway_analyzer.py` | No changes expected (UPID changes are transparent) |
| `pathways_app/pathways_app.py` | No changes expected |