feat: return ALL GP matches with code_frequency in get_patient_indication_groups (Task 1.1)

- Replace QUALIFY ROW_NUMBER()=1 with GROUP BY + COUNT(*) to return all matching
  Search_Terms per patient instead of just the most recent
- Add earliest_hcd_date parameter to restrict GP codes to HCD data window
- Return code_frequency column (count of matching SNOMED codes per Search_Term)
  for use as tiebreaker in drug-aware indication matching
- Update empty DataFrame returns to match new column format
This commit is contained in:
Andrew Charlwood
2026-02-05 23:01:01 +00:00
parent 4fed0e53df
commit c93417f0e7
3 changed files with 89 additions and 27 deletions
+5 -5
View File
@@ -57,7 +57,7 @@ Only assign a drug to an indication if BOTH conditions are met. If a patient's d
## Phase 1: Update Snowflake Query & Drug Mapping ## Phase 1: Update Snowflake Query & Drug Mapping
### 1.1 Update `get_patient_indication_groups()` to return ALL matches with frequency ### 1.1 Update `get_patient_indication_groups()` to return ALL matches with frequency
- [ ] Modify the Snowflake query in `get_patient_indication_groups()` (diagnosis_lookup.py): - [x] Modify the Snowflake query in `get_patient_indication_groups()` (diagnosis_lookup.py):
- Remove `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY EventDateTime DESC) = 1` - Remove `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY EventDateTime DESC) = 1`
- Return ALL matching Search_Terms per patient with code frequency: - Return ALL matching Search_Terms per patient with code frequency:
```sql ```sql
@@ -73,10 +73,10 @@ Only assign a drug to an indication if BOTH conditions are met. If a patient's d
- `code_frequency` = number of matching SNOMED codes per Search_Term per patient - `code_frequency` = number of matching SNOMED codes per Search_Term per patient
- Higher frequency = more clinical activity = stronger signal for tiebreaker - Higher frequency = more clinical activity = stronger signal for tiebreaker
- `earliest_hcd_date` = `MIN(Intervention Date)` from the HCD DataFrame — restricts GP codes to the HCD data window, reducing noise from old/irrelevant diagnoses - `earliest_hcd_date` = `MIN(Intervention Date)` from the HCD DataFrame — restricts GP codes to the HCD data window, reducing noise from old/irrelevant diagnoses
- [ ] Accept `earliest_hcd_date` parameter in `get_patient_indication_groups()` and pass to query - [x] Accept `earliest_hcd_date` parameter in `get_patient_indication_groups()` and pass to query
- [ ] Keep batch processing (500 patients per query) - [x] Keep batch processing (500 patients per query)
- [ ] Update return type: DataFrame now has multiple rows per patient (PatientPseudonym, Search_Term, code_frequency) - [x] Update return type: DataFrame now has multiple rows per patient (PatientPseudonym, Search_Term, code_frequency)
- [ ] Verify: Query returns more rows than before (patients with multiple matching diagnoses) - [ ] Verify: Query returns more rows than before (patients with multiple matching diagnoses) *(requires live Snowflake — will be verified in Phase 3/4)*
### 1.2 Merge related asthma Search_Terms in CLUSTER_MAPPING_SQL ### 1.2 Merge related asthma Search_Terms in CLUSTER_MAPPING_SQL
- [x] In `CLUSTER_MAPPING_SQL` (diagnosis_lookup.py), merge these 3 Search_Terms into one `"asthma"` entry: - [x] In `CLUSTER_MAPPING_SQL` (diagnosis_lookup.py), merge these 3 Search_Terms into one `"asthma"` entry:
+40 -22
View File
@@ -1400,49 +1400,57 @@ def get_patient_indication_groups(
patient_pseudonyms: list[str], patient_pseudonyms: list[str],
connector: Optional[SnowflakeConnector] = None, connector: Optional[SnowflakeConnector] = None,
batch_size: int = 500, batch_size: int = 500,
earliest_hcd_date: Optional[str] = None,
) -> "pd.DataFrame": ) -> "pd.DataFrame":
""" """
Batch lookup GP diagnosis-based indication groups using Snowflake cluster query. Batch lookup GP diagnosis-based indication groups using Snowflake cluster query.
This function queries Snowflake directly using the embedded cluster CTE Returns ALL matching Search_Terms per patient with code_frequency (count of
(from snomed_indication_mapping_query.sql) to find patients with matching matching SNOMED codes). This enables drug-aware indication matching where
GP diagnoses. This is the NEW approach replacing the old SQLite-based lookup. each drug is cross-referenced against the patient's GP diagnoses.
The query: The query:
1. Uses the cluster mapping CTE to get all Search_Term -> SNOMED code mappings 1. Uses the cluster mapping CTE to get all Search_Term -> SNOMED code mappings
2. Joins with PrimaryCareClinicalCoding to find patients with matching codes 2. Joins with PrimaryCareClinicalCoding to find patients with matching codes
3. Returns the most recent match per patient (by EventDateTime) 3. Groups by patient + Search_Term and counts matching codes (code_frequency)
4. Optionally restricts to GP codes from earliest_hcd_date onwards
Args: Args:
patient_pseudonyms: List of PseudoNHSNoLinked values (matches PatientPseudonym in GP records) patient_pseudonyms: List of PseudoNHSNoLinked values (matches PatientPseudonym in GP records)
connector: Optional SnowflakeConnector (defaults to singleton) connector: Optional SnowflakeConnector (defaults to singleton)
batch_size: Number of patients per Snowflake query batch (default 500) batch_size: Number of patients per Snowflake query batch (default 500)
earliest_hcd_date: Optional ISO date string (YYYY-MM-DD). If provided, only
counts GP codes from this date onwards. Should be MIN(Intervention Date)
from the HCD DataFrame to restrict to the HCD data window.
Returns: Returns:
DataFrame with columns: DataFrame with columns:
- PatientPseudonym: The patient identifier (PseudoNHSNoLinked value) - PatientPseudonym: The patient identifier (PseudoNHSNoLinked value)
- Search_Term: The matched indication (e.g., "rheumatoid arthritis") - Search_Term: The matched indication (e.g., "rheumatoid arthritis")
- EventDateTime: Date of the GP diagnosis record - code_frequency: Count of matching SNOMED codes for this Search_Term
Multiple rows per patient (one per matched Search_Term).
Patients not found in results have no matching GP diagnosis. Patients not found in results have no matching GP diagnosis.
""" """
import pandas as pd import pandas as pd
logger.info(f"Starting Snowflake-direct indication lookup for {len(patient_pseudonyms)} patients...") logger.info(f"Starting Snowflake-direct indication lookup for {len(patient_pseudonyms)} patients...")
if earliest_hcd_date:
logger.info(f" Restricting GP codes to >= {earliest_hcd_date}")
# Handle edge case: empty patient list # Handle edge case: empty patient list
if not patient_pseudonyms: if not patient_pseudonyms:
logger.warning("Empty patient list provided") logger.warning("Empty patient list provided")
return pd.DataFrame(columns=['PatientPseudonym', 'Search_Term', 'EventDateTime']) return pd.DataFrame(columns=['PatientPseudonym', 'Search_Term', 'code_frequency'])
# Check Snowflake availability # Check Snowflake availability
if not SNOWFLAKE_AVAILABLE: if not SNOWFLAKE_AVAILABLE:
logger.error("Snowflake connector not available - cannot lookup GP records") logger.error("Snowflake connector not available - cannot lookup GP records")
return pd.DataFrame(columns=['PatientPseudonym', 'Search_Term', 'EventDateTime']) return pd.DataFrame(columns=['PatientPseudonym', 'Search_Term', 'code_frequency'])
if not is_snowflake_configured(): if not is_snowflake_configured():
logger.error("Snowflake not configured - cannot lookup GP records") logger.error("Snowflake not configured - cannot lookup GP records")
return pd.DataFrame(columns=['PatientPseudonym', 'Search_Term', 'EventDateTime']) return pd.DataFrame(columns=['PatientPseudonym', 'Search_Term', 'code_frequency'])
if connector is None: if connector is None:
connector = get_connector() connector = get_connector()
@@ -1463,36 +1471,43 @@ def get_patient_indication_groups(
# Build patient IN clause placeholders # Build patient IN clause placeholders
patient_placeholders = ", ".join(["%s"] * len(batch_pseudonyms)) patient_placeholders = ", ".join(["%s"] * len(batch_pseudonyms))
# Build WHERE clause with optional date filter
date_filter = ""
if earliest_hcd_date:
date_filter = f"\n AND pc.\"EventDateTime\" >= %s"
# Build the full query with cluster CTE # Build the full query with cluster CTE
# This finds the most recent matching diagnosis for each patient # Returns ALL matching Search_Terms per patient with code_frequency
# Note: Column names must be aliased to ensure consistent casing in results # code_frequency = COUNT of matching SNOMED codes per Search_Term per patient
query = f""" query = f"""
{CLUSTER_MAPPING_SQL} {CLUSTER_MAPPING_SQL}
SELECT SELECT
pc."PatientPseudonym" AS "PatientPseudonym", pc."PatientPseudonym" AS "PatientPseudonym",
aic.Search_Term AS "Search_Term", aic.Search_Term AS "Search_Term",
pc."EventDateTime" AS "EventDateTime" COUNT(*) AS "code_frequency"
FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pc FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pc
INNER JOIN AllIndicationCodes aic INNER JOIN AllIndicationCodes aic
ON pc."SNOMEDCode" = aic.SNOMEDCode ON pc."SNOMEDCode" = aic.SNOMEDCode
WHERE pc."PatientPseudonym" IN ({patient_placeholders}) WHERE pc."PatientPseudonym" IN ({patient_placeholders}){date_filter}
QUALIFY ROW_NUMBER() OVER ( GROUP BY pc."PatientPseudonym", aic.Search_Term
PARTITION BY pc."PatientPseudonym"
ORDER BY pc."EventDateTime" DESC
) = 1
""" """
# Build params: patient pseudonyms + optional date
params = list(batch_pseudonyms)
if earliest_hcd_date:
params.append(earliest_hcd_date)
try: try:
results = connector.execute_dict(query, tuple(batch_pseudonyms)) results = connector.execute_dict(query, tuple(params))
for row in results: for row in results:
all_results.append({ all_results.append({
'PatientPseudonym': row.get('PatientPseudonym'), 'PatientPseudonym': row.get('PatientPseudonym'),
'Search_Term': row.get('Search_Term'), 'Search_Term': row.get('Search_Term'),
'EventDateTime': row.get('EventDateTime'), 'code_frequency': row.get('code_frequency', 0),
}) })
logger.debug(f"Batch {batch_num}: found {len(results)} matches") logger.debug(f"Batch {batch_num}: found {len(results)} patient-indication matches")
except Exception as e: except Exception as e:
logger.error(f"Error querying GP records for batch {batch_num}: {e}") logger.error(f"Error querying GP records for batch {batch_num}: {e}")
@@ -1503,12 +1518,15 @@ QUALIFY ROW_NUMBER() OVER (
# Log summary statistics # Log summary statistics
if len(result_df) > 0: if len(result_df) > 0:
matched_count = len(result_df) unique_patients = result_df['PatientPseudonym'].nunique()
match_rate = 100 * matched_count / total_patients total_rows = len(result_df)
match_rate = 100 * unique_patients / total_patients
unique_terms = result_df['Search_Term'].nunique() unique_terms = result_df['Search_Term'].nunique()
avg_indications = total_rows / unique_patients if unique_patients > 0 else 0
logger.info(f"Indication lookup complete:") logger.info(f"Indication lookup complete:")
logger.info(f" Total patients queried: {total_patients}") logger.info(f" Total patients queried: {total_patients}")
logger.info(f" Patients with GP match: {matched_count} ({match_rate:.1f}%)") logger.info(f" Patients with GP match: {unique_patients} ({match_rate:.1f}%)")
logger.info(f" Total patient-indication rows: {total_rows} (avg {avg_indications:.1f} per patient)")
logger.info(f" Unique Search_Terms found: {unique_terms}") logger.info(f" Unique Search_Terms found: {unique_terms}")
# Log top Search_Terms # Log top Search_Terms
+44
View File
@@ -149,3 +149,47 @@ This project extends the indication-based pathway charts (Phase 1-5 complete) wi
- This is independent of Task 1.1 if you mock the gp_matches_df input - This is independent of Task 1.1 if you mock the gp_matches_df input
### Blocked items: ### Blocked items:
- None - None
## Iteration 3 — 2026-02-05
### Task: 1.1 — Update get_patient_indication_groups() to return ALL matches with code_frequency
### Why this task:
- Previous iteration recommended this as the next task
- This is the last remaining Phase 1 dependency — Tasks 2.1/2.2 need the GP matches data in the right format (multiple rows per patient with code_frequency)
- Task 1.1 changes the query fundamentals; better to do this before building assign_drug_indications() which consumes the output
### Status: COMPLETE
### What was done:
- Modified `get_patient_indication_groups()` in `diagnosis_lookup.py`:
1. **Removed** `QUALIFY ROW_NUMBER() OVER (PARTITION BY pc."PatientPseudonym" ORDER BY pc."EventDateTime" DESC) = 1`
2. **Replaced** with `GROUP BY pc."PatientPseudonym", aic.Search_Term` + `COUNT(*) AS "code_frequency"`
3. **Added** `earliest_hcd_date: Optional[str] = None` parameter
4. **Added** optional `AND pc."EventDateTime" >= %s` when earliest_hcd_date is provided
5. **Updated** return columns from `(PatientPseudonym, Search_Term, EventDateTime)` to `(PatientPseudonym, Search_Term, code_frequency)`
6. **Updated** all empty DataFrame returns to use new column names
7. **Updated** logging to show multiple-rows-per-patient stats (avg indications per patient)
8. **Updated** docstring to describe new behavior and parameters
- Backward compatible: `earliest_hcd_date` defaults to `None`, existing callers still work
- Note: caller in `refresh_pathways.py` (line 424-428) does `dict(zip(...))` which will only keep last match per patient with new multi-row format — this will be updated in Task 3.1
### Validation results:
- Tier 1 (Code): py_compile PASSED, import check PASSED, function signature verified
- Tier 2 (Data): Empty DataFrame returns correct columns ['PatientPseudonym', 'Search_Term', 'code_frequency']; live Snowflake test deferred to Phase 3/4
- Tier 3 (Functional): N/A (no UI changes)
### Files changed:
- data_processing/diagnosis_lookup.py (modified get_patient_indication_groups function)
- IMPLEMENTATION_PLAN.md (marked 1.1 subtasks [x])
### Committed: [pending]
### Patterns discovered:
- The `earliest_hcd_date` parameter is passed as a string in ISO format (YYYY-MM-DD) via Snowflake %s placeholder — Snowflake handles string-to-timestamp comparison implicitly
- The GROUP BY approach naturally deduplicates SNOMED codes within the same Search_Term — a patient with the same SNOMED code recorded 5 times gets code_frequency=5 (reflecting clinical activity intensity)
- params list is built dynamically: `batch_pseudonyms + [earliest_hcd_date]` only when date filter is active
### Next iteration should:
- Work on Task 2.1: Create `assign_drug_indications()` function
- This is now unblocked since 1.1 is complete (return format is known)
- Input: HCD df, gp_matches_df (PatientPseudonym, Search_Term, code_frequency), drug_mapping from load_drug_indication_mapping()
- Output: (modified_df with UPID|search_term, indication_df mapping modified_UPID → Search_Term)
- Can be built and tested with mock data (no Snowflake needed)
- Key logic: for each UPID+Drug pair, intersect drug's Search_Terms with patient's GP matches, pick highest code_frequency as tiebreaker
- The function needs PseudoNHSNoLinked to look up GP matches, so the df must have that column
- Task 2.2 (tiebreaker logic) can be done within 2.1 or as a follow-up
- The final Phase 1 subtask (1.1 verify with live Snowflake) will be tested during Phase 3/4 integration
### Blocked items:
- Task 1.1 final subtask "Verify: Query returns more rows" requires live Snowflake — deferred to Phase 3/4