feat: return ALL GP matches with code_frequency in get_patient_indication_groups (Task 1.1)

- Replace QUALIFY ROW_NUMBER()=1 with GROUP BY + COUNT(*) to return all matching Search_Terms per patient instead of just the most recent
- Add earliest_hcd_date parameter to restrict GP codes to HCD data window
- Return code_frequency column (count of matching SNOMED codes per Search_Term) for use as tiebreaker in drug-aware indication matching
- Update empty DataFrame returns to match new column format
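
For illustration only, the behavioural change can be sketched in plain Python. This is not the code in diagnosis_lookup.py: the sample rows, the helper names `most_recent_only` and `count_matches`, and the "on or after" date comparison are all assumptions made for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical GP rows: (PatientPseudonym, Search_Term, EventDateTime).
gp_rows = [
    ("P1", "asthma", date(2023, 1, 10)),
    ("P1", "asthma", date(2023, 6, 2)),
    ("P1", "psoriasis", date(2024, 2, 14)),
    ("P1", "asthma", date(2015, 3, 1)),  # older than the HCD window
    ("P2", "crohns disease", date(2022, 9, 30)),
]


def most_recent_only(rows):
    """Old behaviour (ROW_NUMBER() = 1): one Search_Term per patient."""
    latest = {}
    for patient, term, event_date in rows:
        if patient not in latest or event_date > latest[patient][1]:
            latest[patient] = (term, event_date)
    return sorted((p, t) for p, (t, _) in latest.items())


def count_matches(rows, earliest_hcd_date):
    """New behaviour (GROUP BY + COUNT(*)): every Search_Term per patient,
    with code_frequency counted inside the HCD window.
    Assumes the window keeps events on or after earliest_hcd_date."""
    counts = Counter(
        (patient, term)
        for patient, term, event_date in rows
        if event_date >= earliest_hcd_date
    )
    return sorted((p, t, n) for (p, t), n in counts.items())


old_result = most_recent_only(gp_rows)
new_result = count_matches(gp_rows, date(2020, 1, 1))
print(old_result)
print(new_result)
```

In this sample the old logic keeps only psoriasis for P1, while the new logic returns both asthma (frequency 2) and psoriasis (frequency 1), which is what lets code_frequency act as a tiebreaker downstream.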
2026-02-05 23:01:01 +00:00
parent 4fed0e53df
commit c93417f0e7
3 changed files with 89 additions and 27 deletions
@@ -57,7 +57,7 @@ Only assign a drug to an indication if BOTH conditions are met. If a patient's d
 ## Phase 1: Update Snowflake Query & Drug Mapping

 ### 1.1 Update `get_patient_indication_groups()` to return ALL matches with frequency
- [ ] Modify the Snowflake query in `get_patient_indication_groups()` (diagnosis_lookup.py):
+- [x] Modify the Snowflake query in `get_patient_indication_groups()` (diagnosis_lookup.py):
  - Remove `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY EventDateTime DESC) = 1`
  - Return ALL matching Search_Terms per patient with code frequency:
    ```sql
@@ -73,10 +73,10 @@ Only assign a drug to an indication if BOTH conditions are met. If a patient's d
  - `code_frequency` = number of matching SNOMED codes per Search_Term per patient
  - Higher frequency = more clinical activity = stronger signal for tiebreaker
  - `earliest_hcd_date` = `MIN(Intervention Date)` from the HCD DataFrame — restricts GP codes to the HCD data window, reducing noise from old/irrelevant diagnoses
- [ ] Accept `earliest_hcd_date` parameter in `get_patient_indication_groups()` and pass to query
- [ ] Keep batch processing (500 patients per query)
- [ ] Update return type: DataFrame now has multiple rows per patient (PatientPseudonym, Search_Term, code_frequency)
- [ ] Verify: Query returns more rows than before (patients with multiple matching diagnoses)
+- [x] Accept `earliest_hcd_date` parameter in `get_patient_indication_groups()` and pass to query
+- [x] Keep batch processing (500 patients per query)
+- [x] Update return type: DataFrame now has multiple rows per patient (PatientPseudonym, Search_Term, code_frequency)
+- [ ] Verify: Query returns more rows than before (patients with multiple matching diagnoses) *(requires live Snowflake — will be verified in Phase 3/4)*

 ### 1.2 Merge related asthma Search_Terms in CLUSTER_MAPPING_SQL
 - [x] In `CLUSTER_MAPPING_SQL` (diagnosis_lookup.py), merge these 3 Search_Terms into one `"asthma"` entry: