feat: add assign_drug_indications() for drug-aware indication matching (Task 2.1 + 2.2)

Andrew Charlwood
2026-02-05 23:05:40 +00:00
parent 947b87a331
commit 408976e001
2 changed files with 179 additions and 8 deletions
+8 -8
@@ -107,7 +107,7 @@ Only assign a drug to an indication if BOTH conditions are met. If a patient's d
## Phase 2: Drug-Aware Indication Matching Logic
### 2.1 Create `assign_drug_indications()` function
- [ ] Add to `diagnosis_lookup.py` or `pathway_pipeline.py`:
- [x] Add to `diagnosis_lookup.py` or `pathway_pipeline.py`:
```
def assign_drug_indications(
df: pd.DataFrame, # HCD data with UPID, Drug Name columns
@@ -118,25 +118,25 @@ Only assign a drug to an indication if BOTH conditions are met. If a patient's d
- modified_df: HCD data with UPID replaced by {UPID}|{indication}
- indication_df: mapping modified_UPID → Search_Term
```
- [ ] Logic per UPID + Drug Name pair:
- [x] Logic per UPID + Drug Name pair:
1. Get patient's GP-matched Search_Terms with code_frequency (from gp_matches_df via PseudoNHSNoLinked)
2. Get which Search_Terms include this drug (from drug_mapping)
3. Intersection = valid indications for this drug-patient pair
4. If 1 match: use it
5. If multiple matches: use highest code_frequency as tiebreaker (most GP coding activity = most likely treatment indication)
6. If 0 matches: use fallback directory
- [ ] Modify UPID in df rows: `{original_UPID}|{matched_search_term}`
- [ ] Build indication_df: `{modified_UPID}` → `Search_Term` (or fallback label)
- [ ] Verify: Function compiles, handles edge cases (no GP match, no drug match)
- [x] Modify UPID in df rows: `{original_UPID}|{matched_search_term}`
- [x] Build indication_df: `{modified_UPID}` → `Search_Term` (or fallback label)
- [x] Verify: Function compiles, handles edge cases (no GP match, no drug match)
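The six steps listed for 2.1 can be sketched roughly as follows. This is a minimal illustration, not the committed implementation: the function name `assign_drug_indications_sketch`, the `fallback_col` parameter, the `Directory` column, and the shape of `drug_mapping` (a dict of `Search_Term` to a set of drug names) are all assumptions, as is the presence of a `PseudoNHSNoLinked` column in `df`.

```python
import pandas as pd


def assign_drug_indications_sketch(df, gp_matches_df, drug_mapping, fallback_col="Directory"):
    """Illustrative only. Assumes df has UPID, PseudoNHSNoLinked, Drug Name and
    a fallback column; gp_matches_df has PseudoNHSNoLinked, Search_Term,
    code_frequency; drug_mapping is {Search_Term: set of drug names}."""
    # One row per UPID + Drug Name pair.
    pairs = df[["UPID", "PseudoNHSNoLinked", "Drug Name"]].drop_duplicates(
        subset=["UPID", "Drug Name"]
    )
    # Step 1: attach the patient's GP-matched Search_Terms and code_frequency.
    cand = pairs.merge(gp_matches_df, on="PseudoNHSNoLinked", how="left")
    # Steps 2-3: keep only Search_Terms whose drug set includes this drug.
    mask = pd.Series(
        [
            isinstance(term, str) and drug in drug_mapping.get(term, set())
            for term, drug in zip(cand["Search_Term"], cand["Drug Name"])
        ],
        index=cand.index,
        dtype=bool,
    )
    valid = cand[mask]
    # Steps 4-5: highest code_frequency wins; alphabetical Search_Term breaks ties.
    best = valid.sort_values(
        ["code_frequency", "Search_Term"], ascending=[False, True]
    ).drop_duplicates(subset=["UPID", "Drug Name"])[["UPID", "Drug Name", "Search_Term"]]
    # Step 6: pairs with no valid indication fall back to the assumed fallback column.
    out = df.merge(best, on=["UPID", "Drug Name"], how="left")
    out["Search_Term"] = out["Search_Term"].fillna(out[fallback_col])
    # Rewrite UPID as {original_UPID}|{matched_search_term}.
    out["UPID"] = out["UPID"].astype(str) + "|" + out["Search_Term"].astype(str)
    indication_df = out[["UPID", "Search_Term"]].drop_duplicates().reset_index(drop=True)
    modified_df = out.drop(columns="Search_Term")
    return modified_df, indication_df
```

The merge-then-filter approach is one way to avoid a Python loop over every UPID + Drug Name pair; a per-pair loop would follow the same steps.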
### 2.2 Handle tiebreaker for multiple indication matches
- [ ] When a drug matches multiple Search_Terms AND patient has GP dx for multiple:
- [x] When a drug matches multiple Search_Terms AND patient has GP dx for multiple:
- Use `code_frequency` from the GP query (COUNT of matching SNOMED codes per Search_Term)
- Higher code_frequency = more clinical activity for that condition = more likely treatment indication
- E.g., patient with 47 RA codes and 2 Crohn's codes → ADALIMUMAB assigned to RA
- code_frequency is already returned by the updated query in Task 1.1
- [ ] Verify: Tiebreaker logic correctly picks highest-frequency diagnosis
- [ ] Verify: Tie on frequency (rare but possible) falls back to alphabetical Search_Term for determinism
- [x] Verify: Tiebreaker logic correctly picks highest-frequency diagnosis
- [x] Verify: Tie on frequency (rare but possible) falls back to alphabetical Search_Term for determinism
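The tiebreaker rule on its own can be expressed as a single sort key. The helper below is a hypothetical sketch (the name `pick_indication` and the dict input are assumptions): it takes the already-intersected candidates for one drug-patient pair as `{Search_Term: code_frequency}` and returns the chosen term.

```python
def pick_indication(candidates):
    """Pick one Search_Term from {Search_Term: code_frequency}.

    Hypothetical helper: highest code_frequency wins, and an exact tie on
    frequency falls back to alphabetical Search_Term so the result is
    deterministic. Returns None when there are no candidates, leaving the
    fallback to the caller.
    """
    if not candidates:
        return None
    # Negating the frequency lets min() prefer the largest count,
    # then the alphabetically first term among equal counts.
    return min(candidates, key=lambda term: (-candidates[term], term))
```

Calling it with `{"rheumatoid arthritis": 47, "crohn's disease": 2}` selects the rheumatoid arthritis term, matching the ADALIMUMAB example in the task list.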
---