Three issues identified and fixed during Task 3.1 testing:
1. Snowflake column name casing:
- Snowflake folds unquoted identifiers to uppercase, so unquoted columns come back as e.g. SEARCH_TERM
- Fixed by aliasing columns with quoted, case-preserving names: AS "Search_Term"
- Now correctly populates 139 unique Search_Terms (was 0)
2. Duplicate UPID index error:
- indication_df_for_chart could have duplicate UPIDs
- Added drop_duplicates(subset=['UPID']) before set_index()
- Keeps the first occurrence per UPID, so a DIAGNOSIS row wins over a FALLBACK row
3. Missing UPIDs in indication lookup:
- Old code: built indication_df from unique PseudoNHSNoLinked only
- Problem: patients with multiple UPIDs (multi-provider) were missing
- Fixed: now builds indication_df from ALL unique UPIDs in df
- Also handles NaN values in Directory column safely
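A minimal sketch of the quoted-alias approach from fix 1, assuming the SELECT list is assembled in Python. The helper names (quoted_alias, build_select) and the table name are hypothetical and do not come from the codebase; only the AS "Search_Term" aliasing itself is taken from the fix.

```python
from typing import Dict


def quoted_alias(expression: str, alias: str) -> str:
    """Return 'expression AS "alias"' with a case-preserving quoted alias.

    Embedded double quotes are doubled, which is how SQL escapes them
    inside a quoted identifier.
    """
    escaped = alias.replace('"', '""')
    return f'{expression} AS "{escaped}"'


def build_select(columns: Dict[str, str], table: str) -> str:
    """Build a SELECT whose result columns keep the requested casing."""
    select_list = ",\n    ".join(
        quoted_alias(expression, alias) for expression, alias in columns.items()
    )
    return f"SELECT\n    {select_list}\nFROM {table}"


# Hypothetical column expressions and table name, for illustration only.
sql = build_select(
    {"search_term": "Search_Term", "event_date_time": "EventDateTime"},
    "hypothetical_table",
)
print(sql)
```

With aliases like these, the DataFrame built from the result can be indexed by "Search_Term" directly instead of by the uppercased name.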
Validation results from test run:
- 36,628 patients queried
- 34,006 (92.8%) had GP diagnosis matches
- 139 unique Search_Terms found
- Top 3: drug misuse (8602), influenza (6239), diabetes (2476)
Still to verify: full pathway processing after these fixes.
- Add CLUSTER_MAPPING_SQL constant embedding full snomed_indication_mapping_query.sql
- Add get_patient_indication_groups() function that queries Snowflake directly
- Uses QUALIFY ROW_NUMBER() to get most recent diagnosis per patient
- Returns DataFrame with PatientPseudonym, Search_Term, EventDateTime
- Handles edge cases: empty list, Snowflake unavailable
- Batch processing with configurable batch_size (default 500)
- Comprehensive logging for match statistics
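The batching described above could look roughly like the following sketch. The helper name iter_batches is hypothetical; only the batch_size parameter and its default of 500 are taken from the notes, and the real function may split the patient list differently.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int = 500) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items.

    An empty input yields nothing, so the caller can return an empty
    result without issuing any query.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Illustrative input: 1201 placeholder pseudonyms.
pseudonyms = [f"P{i}" for i in range(1201)]
batch_sizes = [len(batch) for batch in iter_batches(pseudonyms)]
print(batch_sizes)
```

Each batch would then feed one query, which keeps the IN list (or bound parameter list) for any single statement bounded.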
Two critical fixes for the indication-based pathway feature:
1. clean_snomed_code() now handles scientific notation (e.g., "1.06e+16")
- CSV export from pandas/Excel converts large SNOMED codes to scientific notation
- Without this fix, codes like "10629311000119108" were stored as "1.06e+16"
- Now properly converts to full integer strings
2. batch_lookup_indication_groups() now uses PseudoNHSNoLinked instead of PersonKey
- PersonKey is LocalPatientID (provider-specific like "J188448")
- PseudoNHSNoLinked is the pseudonymised NHS number that matches PatientPseudonym in GP records
- Without this fix, 0% of patients matched GP records
- Test shows ~20% match rate for ADALIMUMAB patients with correct identifier
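One way to expand scientific notation, shown as a sketch rather than the actual clean_snomed_code() implementation (the function name below is hypothetical). It uses decimal.Decimal instead of float so that 17-digit codes are not rounded during parsing. Note that expansion cannot restore digits that were already rounded away upstream: "1.06e+16" expands to 10600000000000000, not to the original code.

```python
from decimal import Decimal, InvalidOperation


def clean_snomed_code_sketch(raw: object) -> str:
    """Normalise a SNOMED code to a plain integer string where possible."""
    text = str(raw).strip()
    if not text:
        return ""
    try:
        value = Decimal(text)
    except InvalidOperation:
        # Not numeric at all: return the stripped text unchanged.
        return text
    if not value.is_finite():
        # "nan" and "inf" parse as Decimal but are not valid codes.
        return text
    if value != value.to_integral_value():
        # Has a real fractional part: leave it for the caller to inspect.
        return text
    return str(int(value))


print(clean_snomed_code_sketch("1.06e+16"))
print(clean_snomed_code_sketch("10629311000119108.0"))
```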
- Added DirectorateAssignment dataclass for return type
- Added get_directorate_from_diagnosis() function to diagnosis_lookup.py
- Logic: Try diagnosis-based lookup first (direct SNOMED match)
- Returns FALLBACK source if no match found, letting caller handle fallback
- Extracts PatientPseudonym from UPID (last part after provider code)
- Updated __all__ exports with new dataclass and function
- Tested: function handles no-match cases correctly
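The diagnosis-first flow with a FALLBACK result can be sketched as below. All names and fields here are hypothetical: the real DirectorateAssignment fields are not listed in these notes, and the lookup callable stands in for the actual diagnosis query. Only the DIAGNOSIS and FALLBACK source labels come from the notes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class DirectorateAssignmentSketch:
    """Assumed shape of the return type; the real fields may differ."""
    directorate: Optional[str]
    source: str  # "DIAGNOSIS" or "FALLBACK"
    search_term: Optional[str] = None


def assign_directorate_sketch(
    upid: str,
    lookup: Callable[[str], Optional[Tuple[str, str]]],
) -> DirectorateAssignmentSketch:
    """Try the diagnosis lookup first; signal FALLBACK when nothing matches."""
    match = lookup(upid)
    if match is None:
        # No diagnosis match: the caller decides how to fall back.
        return DirectorateAssignmentSketch(directorate=None, source="FALLBACK")
    directorate, search_term = match
    return DirectorateAssignmentSketch(directorate, "DIAGNOSIS", search_term)


# Illustrative in-memory lookup standing in for the real query.
known = {"UPID_1": ("RHEUMATOLOGY", "rheumatoid arthritis")}
matched = assign_directorate_sketch("UPID_1", known.get)
unmatched = assign_directorate_sketch("UPID_2", known.get)
```

Returning an explicit FALLBACK source instead of raising keeps the fallback policy in the caller, which matches the division of responsibility described above.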
Add two new functions to diagnosis_lookup.py for direct SNOMED code matching:
- get_drug_snomed_codes(drug_name): Query ref_drug_snomed_mapping for all
  SNOMED codes mapped to a drug. Returns list of DrugSnomedMapping with
  snomed_code, snomed_description, search_term, primary_directorate.
  Tested: ADALIMUMAB returns 1320 mappings across 10 Search_Terms.
- patient_has_indication_direct(patient_pseudonym, mappings, connector):
  Query PrimaryCareClinicalCoding for exact SNOMED code matches.
  Returns most recent match by EventDateTime with DirectSnomedMatchResult.
Both functions follow existing patterns in the module and are exported
in __all__. The lookup is case-insensitive for drug names.
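The "most recent exact match" selection can be illustrated in memory as follows. This is a sketch only: the real function queries PrimaryCareClinicalCoding through the connector, and the CodingEvent type, function name, and sample codes below are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class CodingEvent:
    """Hypothetical stand-in for one row of a patient's GP coding records."""
    snomed_code: str
    event_datetime: datetime


def most_recent_direct_match(
    events: Iterable[CodingEvent],
    mapped_codes: Set[str],
) -> Optional[CodingEvent]:
    """Return the latest event whose code is exactly in mapped_codes, or None."""
    matches = [event for event in events if event.snomed_code in mapped_codes]
    return max(matches, key=lambda event: event.event_datetime, default=None)


# Placeholder codes and dates, for illustration only.
events = [
    CodingEvent("111", datetime(2021, 3, 1)),
    CodingEvent("222", datetime(2023, 6, 15)),
    CodingEvent("999", datetime(2024, 1, 1)),  # not a mapped code
]
latest = most_recent_direct_match(events, {"111", "222"})
```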