HighCostDrugsDemo/IMPLEMENTATION_PLAN.md

# Implementation Plan - Drug-Aware Indication Matching

## Project Overview

Update the indication-based pathway charts so that patient indications are matched **per drug**, not just per patient. Currently, each patient gets ONE indication (most recent GP diagnosis match). This ignores which drugs the patient is actually taking.

### The Problem

A patient on ADALIMUMAB + OMALIZUMAB currently gets assigned a single indication (e.g., "rheumatoid arthritis" — the most recent GP match). But:
- ADALIMUMAB is used for rheumatoid arthritis, axial spondyloarthritis, crohn's disease, etc.
- OMALIZUMAB is used for asthma, allergic asthma, urticaria

These are different clinical pathways and should be treated as separate treatment journeys.

### The Solution

Match each drug to an indication by cross-referencing:
1. **GP diagnosis** — which Search_Terms the patient has matching SNOMED codes for
2. **Drug mapping** — which Search_Terms list each drug (from `DimSearchTerm.csv`)

Only assign a drug to an indication if BOTH conditions are met. If a patient's drugs map to different indications, they become separate pathways (via modified UPID).

### Key Design Decisions

| Aspect | Decision |
|--------|----------|
| Drug-indication source | `data/DimSearchTerm.csv` — Search_Term → CleanedDrugName mapping |
| UPID modification | `{original_UPID}\|{search_term}` for drugs with matched indication |
| GP diagnosis matching | Return ALL matches per patient (not just most recent) |
| Drug matching | Substring match: HCD drug name contains DimSearchTerm fragment |
| Multiple indication matches per drug | Use highest GP code frequency as tiebreaker (COUNT of matching SNOMED codes per Search_Term) |
| GP code time range | Only codes from MIN(Intervention Date) onwards — restricts to HCD data window |
| No indication match | Fallback to directory (same as current behavior) |
| Same patient, different indications | Separate pathways via different modified UPIDs |

### Examples

**Patient on ADALIMUMAB + GOLIMUMAB, GP dx: axial spondyloarthritis + asthma**
- axial spondyloarthritis drug list includes both ADALIMUMAB and GOLIMUMAB
- → Both drugs grouped under "axial spondyloarthritis", single pathway
- Modified UPID: `RMV12345|axial spondyloarthritis`

**Patient on ADALIMUMAB + OMALIZUMAB, GP dx: axial spondyloarthritis + asthma**
- axial spondyloarthritis lists ADALIMUMAB but not OMALIZUMAB
- asthma lists OMALIZUMAB but not ADALIMUMAB
- → Two separate pathways:
  - `RMV12345|axial spondyloarthritis` with ADALIMUMAB
  - `RMV12345|asthma` with OMALIZUMAB

**Patient on ADALIMUMAB, GP dx: rheumatoid arthritis (47 codes) + crohn's disease (2 codes)**
- Both Search_Terms list ADALIMUMAB AND patient has GP dx for both
- → Tiebreaker: highest code frequency — rheumatoid arthritis has 47 matching SNOMED codes vs 2 for crohn's
- → Single pathway under rheumatoid arthritis (more clinical activity = more likely the treatment indication)

---

## Phase 1: Update Snowflake Query & Drug Mapping

### 1.1 Update `get_patient_indication_groups()` to return ALL matches with frequency
- [x] Modify the Snowflake query in `get_patient_indication_groups()` (diagnosis_lookup.py):
  - Remove `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY EventDateTime DESC) = 1`
  - Return ALL matching Search_Terms per patient with code frequency:
    ```sql
    SELECT pc."PatientPseudonym" AS "PatientPseudonym",
           aic.Search_Term AS "Search_Term",
           COUNT(*) AS "code_frequency"
    FROM PrimaryCareClinicalCoding pc
    JOIN AllIndicationCodes aic ON pc."SNOMEDCode" = aic.SNOMEDCode
    WHERE pc."PatientPseudonym" IN (...)
      AND pc."EventDateTime" >= :earliest_hcd_date
    GROUP BY pc."PatientPseudonym", aic.Search_Term
    ```
  - `code_frequency` = number of matching SNOMED codes per Search_Term per patient
  - Higher frequency = more clinical activity = stronger signal for tiebreaker
  - `earliest_hcd_date` = `MIN(Intervention Date)` from the HCD DataFrame — restricts GP codes to the HCD data window, reducing noise from old/irrelevant diagnoses
- [x] Accept `earliest_hcd_date` parameter in `get_patient_indication_groups()` and pass to query
- [x] Keep batch processing (500 patients per query)
- [x] Update return type: DataFrame now has multiple rows per patient (PatientPseudonym, Search_Term, code_frequency)
- [x] Verify: Query returns more rows than before — 537,794 patient-indication rows (avg 16.0 per matched patient) vs previous single row per patient

### 1.2 Merge related asthma Search_Terms in CLUSTER_MAPPING_SQL
- [x] In `CLUSTER_MAPPING_SQL` (diagnosis_lookup.py), merge these 3 Search_Terms into one `"asthma"` entry:
  - `allergic asthma` (Cluster: OMALIZUMAB only)
  - `asthma` (Cluster: BENRALIZUMAB, DUPILUMAB, INHALED, MEPOLIZUMAB, OMALIZUMAB, RESLIZUMAB)
  - `severe persistent allergic asthma` (Cluster: OMALIZUMAB only)
- [x] Map all 3 Cluster_IDs to `Search_Term = 'asthma'` in the CTE VALUES
- [x] `urticaria` (OMALIZUMAB, DERMATOLOGY) stays SEPARATE — do NOT merge with asthma
- [x] Also update `load_drug_indication_mapping()` to apply the same merge when loading DimSearchTerm.csv:
  - Combine drug lists from all 3 entries under a single `"asthma"` key
  - Deduplicate drug fragments (OMALIZUMAB appears in all 3)
- [x] Verify: GP code lookup returns `"asthma"` (not `"allergic asthma"` or `"severe persistent allergic asthma"`)
- [x] Verify: Drug mapping for `"asthma"` includes full combined drug list: BENRALIZUMAB, DUPILUMAB, INHALED, MEPOLIZUMAB, OMALIZUMAB, RESLIZUMAB

### 1.3 Build drug-to-Search_Term lookup from DimSearchTerm.csv
- [x] Add function `load_drug_indication_mapping()` to `diagnosis_lookup.py`:
  - Loads `data/DimSearchTerm.csv`
  - Builds dict: `drug_fragment (uppercase) → list[Search_Term]`
  - Also builds reverse: `search_term → list[drug_fragments]`
  - CleanedDrugName is pipe-separated (e.g., "ADALIMUMAB|GOLIMUMAB|IXEKIZUMAB")
- [x] Add function `get_search_terms_for_drug(drug_name, search_term_to_fragments) -> list[str]`:
  - Returns all Search_Terms whose drug fragments are substrings of the drug name (case-insensitive)
  - More practical than per-term boolean check — returns all matches at once for Phase 2 use
- [x] Verify: ADALIMUMAB matches "axial spondyloarthritis", OMALIZUMAB matches "asthma"

---

## Phase 2: Drug-Aware Indication Matching Logic

### 2.1 Create `assign_drug_indications()` function
- [x] Add to `diagnosis_lookup.py` or `pathway_pipeline.py`:
  ```
  def assign_drug_indications(
      df: pd.DataFrame,              # HCD data with UPID, Drug Name columns
      gp_matches_df: pd.DataFrame,   # PatientPseudonym → list of matched Search_Terms
      drug_mapping: dict,             # From load_drug_indication_mapping()
  ) -> tuple[pd.DataFrame, pd.DataFrame]:
      Returns: (modified_df, indication_df)
      - modified_df: HCD data with UPID replaced by {UPID}|{indication}
      - indication_df: mapping modified_UPID → Search_Term
  ```
- [x] Logic per UPID + Drug Name pair:
  1. Get patient's GP-matched Search_Terms with code_frequency (from gp_matches_df via PseudoNHSNoLinked)
  2. Get which Search_Terms include this drug (from drug_mapping)
  3. Intersection = valid indications for this drug-patient pair
  4. If 1 match: use it
  5. If multiple matches: use highest code_frequency as tiebreaker (most GP coding activity = most likely treatment indication)
  6. If 0 matches: use fallback directory
- [x] Modify UPID in df rows: `{original_UPID}|{matched_search_term}`
- [x] Build indication_df: `{modified_UPID}` → `Search_Term` (or fallback label)
- [x] Verify: Function compiles, handles edge cases (no GP match, no drug match)

### 2.2 Handle tiebreaker for multiple indication matches
- [x] When a drug matches multiple Search_Terms AND patient has GP dx for multiple:
  - Use `code_frequency` from the GP query (COUNT of matching SNOMED codes per Search_Term)
  - Higher code_frequency = more clinical activity for that condition = more likely treatment indication
  - E.g., patient with 47 RA codes and 2 crohn's codes → ADALIMUMAB assigned to RA
  - code_frequency is already returned by the updated query in Task 1.1
- [x] Verify: Tiebreaker logic correctly picks highest-frequency diagnosis
- [x] Verify: Tie on frequency (rare but possible) falls back to alphabetical Search_Term for determinism

---

## Phase 3: Pipeline Integration

### 3.1 Update `refresh_pathways.py` indication processing
- [x] In the `elif current_chart_type == "indication":` block:
  1. Call `get_patient_indication_groups()` as before (but now returns ALL matches)
  2. Load drug mapping: `drug_mapping = load_drug_indication_mapping()`
  3. Call `assign_drug_indications(df, gp_matches_df, drug_mapping)`
  4. Use modified_df (with indication-aware UPIDs) for pathway processing
  5. Use indication_df for the indication mapping
- [x] Pass modified_df (not original df) to `process_indication_pathway_for_date_filter()`
- [x] Verify: Pipeline compiles, `python -m py_compile cli/refresh_pathways.py`

### 3.2 Test with dry run
- [x] Run `python -m cli.refresh_pathways --chart-type indication --dry-run -v`
- [x] Verify:
  - Modified UPIDs appear in pipeline log (42,072 unique modified UPIDs)
  - Patient counts are reasonable (42,072 modified UPIDs vs 36,628 original patients)
  - Drug-indication matching is logged (49.3% match, 50.7% fallback, 15,238 tiebreakers)
  - Pathway hierarchy shows drug-specific grouping under correct indications (1,846 total nodes)
- [x] Fixed: network_timeout increased from 30→600 (was killing GP lookup queries)
- [x] Fixed: batch_size increased from 500→5000 (reduces CTE compilation overhead from 74 to 8 batches)

---

## Phase 4: Full Refresh & Validation

### 4.1 Full refresh with both chart types
- [x] Run `python -m cli.refresh_pathways --chart-type all`
- [x] Verify:
  - Both chart types generate data (directory: 1,101 nodes, indication: 1,846 nodes)
  - Directory charts unchanged (293-329 nodes per date filter, same as before)
  - Indication charts reflect drug-aware matching (42,072 modified UPIDs, 49.3% match rate)

### 4.2 Validate indication chart correctness
- [x] Check that drugs under an indication all appear in that Search_Term's drug list
  - RA: ADALIMUMAB, RITUXIMAB, BARICITINIB, CERTOLIZUMAB PEGOL, TOCILIZUMAB ✓
  - Asthma: DUPILUMAB, OMALIZUMAB ✓
- [x] Verify that a patient on drugs for different indications creates separate pathway branches
  - 42,072 modified UPIDs vs 36,628 original patients confirms splitting ✓
- [x] Verify that drugs sharing an indication are grouped in the same pathway
  - Multiple RA drugs (ADALIMUMAB, RITUXIMAB, etc.) all under "rheumatoid arthritis" ✓
- [x] Log: patient count comparison (old vs new approach)
  - Old: 36,628 patients → single indication each
  - New: 42,072 modified UPIDs → drug-specific indications (15% increase from splitting)

### 4.3 Validate Reflex UI
- [x] Run `python -m reflex compile` to verify app compiles (compiled in 16.6s)
- [x] Verify chart type toggle still works (no code changes to UI, toggle mechanism unchanged)
- [x] Verify indication chart shows correct hierarchy (42 unique search_terms at level 2 for all_6mo)

---

## Completion Criteria

All tasks marked `[x]` AND:
- [x] App compiles without errors (`reflex compile` succeeds — 16.6s)
- [x] Both chart types generate pathway data (directory: 1,101, indication: 1,846)
- [x] Indication charts show drug-specific indication matching (49.3% match rate)
- [x] Drugs under the same indication for the same patient are in one pathway (validated via SQLite queries)
- [x] Drugs under different indications for the same patient create separate pathways (42,072 modified UPIDs > 36,628 original)
- [x] Fallback works for drugs with no indication match (RHEUMATOLOGY/OPHTHALMOLOGY/etc. "(no GP dx)" labels present)
- [x] Full refresh completes successfully (2,947 records in 738.4s)
- [x] Existing directory charts are unaffected (1,101 nodes, same count range as previous refresh)

---

## Reference

### DimSearchTerm.csv Structure
```
Search_Term,CleanedDrugName,PrimaryDirectorate
rheumatoid arthritis,ABATACEPT|ADALIMUMAB|ANAKINRA|BARICITINIB|...,RHEUMATOLOGY
asthma,BENRALIZUMAB|DUPILUMAB|INHALED|MEPOLIZUMAB|OMALIZUMAB|RESLIZUMAB,THORACIC MEDICINE
```

### Modified UPID Format
```
Original:  RMV12345
Modified:  RMV12345|rheumatoid arthritis
Fallback:  RMV12345|RHEUMATOLOGY (no GP dx)
```

### Current vs New Indication Flow
```
CURRENT:
  Patient → GP dx (most recent) → single Search_Term → one pathway

NEW:
  Patient + Drug A → GP dx matching Drug A → Search_Term X
  Patient + Drug B → GP dx matching Drug B → Search_Term Y
  → If X == Y: one pathway under X
  → If X != Y: two pathways (modified UPIDs)
```

### Key Files

| File | Changes |
|------|---------|
| `data_processing/diagnosis_lookup.py` | Update query, add drug mapping functions |
| `data_processing/pathway_pipeline.py` | Possibly minor changes for modified UPIDs |
| `cli/refresh_pathways.py` | Integrate drug-aware matching into pipeline |
| `data/DimSearchTerm.csv` | Reference data (read-only) |
| `analysis/pathway_analyzer.py` | No changes expected (UPID changes are transparent) |
| `pathways_app/pathways_app.py` | No changes expected |