fix: resolve Snowflake column casing and UPID mapping issues (Task 3.1)

Three issues identified and fixed during Task 3.1 testing:

1. Snowflake column name casing:
   - Snowflake folds unquoted identifiers to UPPERCASE, so aic.Search_Term
     was returned as SEARCH_TERM
   - Fixed by aliasing columns with quoted names: AS "Search_Term"
   - Now correctly populates 139 unique Search_Terms (was 0)

2. Duplicate UPID index error:
   - indication_df_for_chart could have duplicate UPIDs
   - Added drop_duplicates(subset=['UPID']) before set_index()
   - Keeps first occurrence (DIAGNOSIS over FALLBACK)

3. Missing UPIDs in indication lookup:
   - Old code: built indication_df from unique PseudoNHSNoLinked only
   - Problem: patients with multiple UPIDs (multi-provider) were missing
   - Fixed: now builds indication_df from ALL unique UPIDs in df
   - Also handles NaN values in Directory column safely
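
Fixes 2 and 3 can be illustrated with the minimal pandas sketch below. It is
not the code from this commit: the function name build_indication_df, the
assumption that diagnosis matches are already keyed by UPID, the "UNKNOWN"
placeholder for a missing Directory, and the Indication/Source column names
are all hypothetical. It only shows the pattern: one row per unique UPID,
duplicates dropped before set_index(), and NaN Directory values handled.

```python
import pandas as pd


def build_indication_df(df: pd.DataFrame, diagnosis_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one indication row per unique UPID in df."""
    # Start from ALL unique UPIDs in df, not from unique patients.
    base = (
        df.dropna(subset=["UPID"])
        .drop_duplicates(subset=["UPID"])[["UPID", "Directory"]]
    )
    # Keep only the first diagnosis match per UPID (keep="first" is the default).
    first_match = diagnosis_df[["UPID", "Search_Term"]].drop_duplicates(subset=["UPID"])
    merged = base.merge(first_match, on="UPID", how="left")

    # Fall back to Directory; "UNKNOWN" is an assumed placeholder for NaN.
    fallback = merged["Directory"].fillna("UNKNOWN").astype(str)
    merged["Source"] = merged["Search_Term"].notna().map(
        {True: "DIAGNOSIS", False: "FALLBACK"}
    )
    merged["Indication"] = merged["Search_Term"].fillna(fallback)

    # De-duplicate before set_index() so the index is guaranteed unique.
    return merged.drop_duplicates(subset=["UPID"]).set_index("UPID")


# Made-up sample data: P1 has two UPIDs (multi-provider), P1-A appears twice.
df = pd.DataFrame({
    "UPID": ["P1-A", "P1-B", "P2-A", "P3-A", "P1-A"],
    "Directory": ["CARDIOLOGY", "RENAL", None, float("nan"), "CARDIOLOGY"],
})
diagnosis_df = pd.DataFrame({
    "UPID": ["P1-A", "P1-A", "P2-A"],
    "Search_Term": ["diabetes", "influenza", "drug misuse"],
})

result = build_indication_df(df, diagnosis_df)
print(result[["Indication", "Source"]])
```

Without the drop_duplicates() call, the index could contain duplicate UPIDs,
and later label-based lookups or reindexing against it can fail or return
multiple rows.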

Validation results from test run:
- 36,628 patients queried
- 34,006 (92.8%) had GP diagnosis matches
- 139 unique Search_Terms found
- Top terms: drug misuse (8602), influenza (6239), diabetes (2476)

Still to verify: full pathway processing after these fixes.
Andrew Charlwood
2026-02-05 18:30:23 +00:00
parent f7166b38c8
commit 22222fe9ca
3 changed files with 27 additions and 15 deletions
+4 -3
@@ -1348,12 +1348,13 @@ def get_patient_indication_groups(
 # Build the full query with cluster CTE
 # This finds the most recent matching diagnosis for each patient
+# Note: Column names must be aliased to ensure consistent casing in results
 query = f"""
 {CLUSTER_MAPPING_SQL}
 SELECT
-pc."PatientPseudonym",
-aic.Search_Term,
-pc."EventDateTime"
+pc."PatientPseudonym" AS "PatientPseudonym",
+aic.Search_Term AS "Search_Term",
+pc."EventDateTime" AS "EventDateTime"
 FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pc
 INNER JOIN AllIndicationCodes aic
 ON pc."SNOMEDCode" = aic.SNOMEDCode