Commit Graph

17 Commits

Author SHA1 Message Date
Andrew Charlwood c93417f0e7 feat: return ALL GP matches with code_frequency in get_patient_indication_groups (Task 1.1)
- Replace QUALIFY ROW_NUMBER()=1 with GROUP BY + COUNT(*) to return all matching
  Search_Terms per patient instead of just the most recent
- Add earliest_hcd_date parameter to restrict GP codes to HCD data window
- Return code_frequency column (count of matching SNOMED codes per Search_Term)
  for use as tiebreaker in drug-aware indication matching
- Update empty DataFrame returns to match new column format
2026-02-05 23:01:01 +00:00
Andrew Charlwood b0a8a9de1c feat: merge asthma Search_Term variants in CLUSTER_MAPPING_SQL and drug mapping (Task 1.2)
Merge 'allergic asthma' and 'severe persistent allergic asthma' into
canonical 'asthma' in both CLUSTER_MAPPING_SQL (Snowflake CTE) and
load_drug_indication_mapping() (DimSearchTerm.csv loader).

- CLUSTER_MAPPING_SQL: 3 Cluster_IDs (AST_COD, eFI2_Asthma, SEVAST_COD) now
  all map to Search_Term = 'asthma'
- Added SEARCH_TERM_MERGE_MAP constant for reusable normalization
- load_drug_indication_mapping() applies merge at CSV load time
- urticaria (XSAL_COD) stays separate — not merged with asthma
- Combined asthma drug list: BENRALIZUMAB, DUPILUMAB, INHALED, MEPOLIZUMAB,
  OMALIZUMAB, RESLIZUMAB
2026-02-05 22:56:29 +00:00
Andrew Charlwood 0779df78d1 feat: add drug-to-indication mapping from DimSearchTerm.csv (Task 1.2)
Add load_drug_indication_mapping() and get_search_terms_for_drug() to
diagnosis_lookup.py. Loads DimSearchTerm.csv to build bidirectional
lookup between drug name fragments and Search_Terms. Uses substring
matching for drug fragments (handles both exact names like ADALIMUMAB
and partial fragments like PEGYLATED). Handles duplicate Search_Terms
(e.g., diabetes appearing under two directorates) by combining fragments.
2026-02-05 22:48:09 +00:00
Andrew Charlwood 22222fe9ca fix: resolve Snowflake column casing and UPID mapping issues (Task 3.1)
Three issues identified and fixed during Task 3.1 testing:

1. Snowflake column name casing:
   - Unquoted columns in Snowflake are returned as UPPERCASE
   - Fixed by aliasing columns with quoted names: AS "Search_Term"
   - Now correctly populates 139 unique Search_Terms (was 0)

2. Duplicate UPID index error:
   - indication_df_for_chart could have duplicate UPIDs
   - Added drop_duplicates(subset=['UPID']) before set_index()
   - Keeps first occurrence (DIAGNOSIS over FALLBACK)

3. Missing UPIDs in indication lookup:
   - Old code: built indication_df from unique PseudoNHSNoLinked only
   - Problem: patients with multiple UPIDs (multi-provider) were missing
   - Fixed: now builds indication_df from ALL unique UPIDs in df
   - Also handles NaN values in Directory column safely

Validation results from test run:
- 36,628 patients queried
- 34,006 (92.8%) had GP diagnosis matches
- 139 unique Search_Terms found
- Top 5: drug misuse (8602), influenza (6239), diabetes (2476)

Still to verify: full pathway processing after these fixes.
2026-02-05 18:30:23 +00:00
Andrew Charlwood 1a817b8257 feat: add get_patient_indication_groups() for Snowflake-direct GP lookup (Task 1.1)
- Add CLUSTER_MAPPING_SQL constant embedding full snomed_indication_mapping_query.sql
- Add get_patient_indication_groups() function that queries Snowflake directly
- Uses QUALIFY ROW_NUMBER() to get most recent diagnosis per patient
- Returns DataFrame with PatientPseudonym, Search_Term, EventDateTime
- Handles edge cases: empty list, Snowflake unavailable
- Batch processing with configurable batch_size (default 500)
- Comprehensive logging for match statistics
2026-02-05 17:03:12 +00:00
Andrew Charlwood 5b1569ed5c fix: correct patient identifier for GP diagnosis lookup (Task 3.3)
Two critical fixes for the indication-based pathway feature:

1. clean_snomed_code() now handles scientific notation (e.g., "1.06e+16")
   - CSV export from pandas/Excel converts large SNOMED codes to scientific notation
   - Without this fix, codes like "10629311000119108" were stored as "1.06e+16"
   - Now properly converts to full integer strings

2. batch_lookup_indication_groups() now uses PseudoNHSNoLinked instead of PersonKey
   - PersonKey is LocalPatientID (provider-specific like "J188448")
   - PseudoNHSNoLinked is the pseudonymised NHS number that matches PatientPseudonym in GP records
   - Without this fix, 0% of patients matched GP records
   - Test shows ~20% match rate for ADALIMUMAB patients with correct identifier
2026-02-05 15:49:24 +00:00
Andrew Charlwood 8952156798 feat: integrate batch GP diagnosis lookup for indication charts (Task 3.2)
- Add batch_lookup_indication_groups() to diagnosis_lookup.py
  - Efficient batch Snowflake queries (500 patients per batch)
  - Returns UPID → Indication_Group mapping
  - Source tracking: DIAGNOSIS vs FALLBACK
- Update cli/refresh_pathways.py indication processing
  - Call batch_lookup_indication_groups() before chart generation
  - Build indication_df for process_indication_pathway_for_date_filter()
  - Log diagnosis coverage statistics
- Enables full --chart-type all functionality
2026-02-05 14:45:06 +00:00
Andrew Charlwood 7cbc648c6d feat: add indication pathway processing functions (Task 2.3)
- Add generate_icicle_chart_indication() to pathway_analyzer.py
  - Variant that uses indication_df instead of directory_df
  - Groups by Trust → Search_Term → Drug → Pathway
  - Accepts indication_df mapping UPID → Indication_Group

- Add process_indication_pathway_for_date_filter() to pathway_pipeline.py
  - Processes indication-based pathway for a single date filter
  - Uses generate_icicle_chart_indication() for hierarchy building

- Add extract_indication_fields() to pathway_pipeline.py
  - Extracts trust_name, search_term, drug_sequence from ids column
  - Similar to extract_denormalized_fields() but for indication charts

- Update convert_to_records() with chart_type parameter
  - Includes chart_type column in output records
  - Supports "directory" and "indication" values

- Add ChartType type alias (Literal["directory", "indication"])

- Update __all__ exports with new functions
2026-02-05 14:32:28 +00:00
Andrew Charlwood 19607d72b0 feat: add chart_type column to pathway_nodes schema (Task 2.2)
- Add chart_type column (TEXT NOT NULL DEFAULT 'directory')
- Update UNIQUE constraint to (date_filter_id, chart_type, ids)
- Add idx_pathway_nodes_chart_type index for filtering
- Add migrate_pathway_nodes_chart_type() function for existing databases
- Update initialize_database() to run migration automatically
- Existing rows default to 'directory' chart type
2026-02-05 14:24:57 +00:00
Andrew Charlwood 506769470d feat: add get_directorate_from_diagnosis() function (Task 2.1)
- Added DirectorateAssignment dataclass for return type
- Added get_directorate_from_diagnosis() function to diagnosis_lookup.py
- Logic: Try diagnosis-based lookup first (direct SNOMED match)
- Returns FALLBACK source if no match found, letting caller handle fallback
- Extracts PatientPseudonym from UPID (last part after provider code)
- Updated __all__ exports with new dataclass and function
- Tested: function handles no-match cases correctly
2026-02-05 14:19:18 +00:00
Andrew Charlwood b44d22de2c feat: add direct SNOMED lookup functions (Task 1.3)
Add two new functions to diagnosis_lookup.py for direct SNOMED code matching:

- get_drug_snomed_codes(drug_name): Query ref_drug_snomed_mapping for all
  SNOMED codes mapped to a drug. Returns list of DrugSnomedMapping with
  snomed_code, snomed_description, search_term, primary_directorate.
  Tested: ADALIMUMAB returns 1320 mappings across 10 Search_Terms.

- patient_has_indication_direct(patient_pseudonym, mappings, connector):
  Query PrimaryCareClinicalCoding for exact SNOMED code matches.
  Returns most recent match by EventDateTime with DirectSnomedMatchResult.

Both functions follow existing patterns in the module and are exported
in __all__. The lookup is case-insensitive for drug names.
2026-02-05 14:14:55 +00:00
Andrew Charlwood 6d68b5eaa5 feat: add SNOMED mapping loader script (Task 1.2)
- Create data_processing/load_snomed_mapping.py with:
  - migrate_drug_snomed_mapping() for CSV to SQLite migration
  - get_drug_snomed_mapping_counts() for statistics
  - verify_drug_snomed_mapping_migration() for validation
  - clean_snomed_code() to remove trailing .0 from SNOMED codes
  - CLI interface: python -m data_processing.load_snomed_mapping

- Loaded 144,056 mappings from enriched CSV:
  - 707 unique drugs
  - 187 unique search terms
  - 21,265 unique SNOMED codes
2026-02-05 14:10:36 +00:00
Andrew Charlwood 9943e85761 feat: add ref_drug_snomed_mapping schema (Task 1.1)
- Add REF_DRUG_SNOMED_MAPPING_SCHEMA with 11 columns for direct SNOMED mapping
- Add 5 indexes for lookup performance (drug, cleaned_drug, snomed, search_term, composite)
- Add create_drug_snomed_mapping_table() helper function
- Update helper functions (drop, get_counts, verify_exists) to include new table
- Table is included in REFERENCE_TABLES_SCHEMA and created by migration
2026-02-05 14:06:31 +00:00
Andrew Charlwood adc1dbfc58 feat: complete Task 2.2 - test refresh pipeline with Snowflake data
Tested full refresh pipeline end-to-end with real Snowflake data:
- Fixed trust filter to read Name column from defaultTrusts.csv
- Fixed Decimal type handling in calculate_cost_per_patient_per_annum
- Fixed array handling in convert_to_records for average_administered
- Added required reference CSV files to data/ directory
- Configured Snowflake connection (account, warehouse, user)

Results:
- Snowflake fetch: 656,695 records in ~7s
- Transformations: 519,848 records after UPID/drug/directory
- Pathway nodes: 293 for all_6mo (8 trusts, 14 directories)
- Total processing time: ~6.2 minutes
2026-02-05 00:20:12 +00:00
Andrew Charlwood 5945649ae3 feat: add pathway pipeline module (Task 1.2)
Create data_processing/pathway_pipeline.py with:
- DateFilterConfig dataclass for date filter configuration
- DATE_FILTER_CONFIGS with 6 pre-defined combinations
- compute_date_ranges() for computing actual dates from config
- fetch_and_transform_data() for Snowflake fetch + transformations
- process_pathway_for_date_filter() using existing generate_icicle_chart()
- extract_denormalized_fields() to parse trust/directory/drugs from ids
- convert_to_records() for SQLite insertion
- process_all_date_filters() convenience function
2026-02-04 23:21:39 +00:00
Andrew Charlwood 34396fef5e feat: add pathway data architecture schema (Task 1.1)
Add three new tables to support pre-computed pathway data:
- pathway_date_filters: 6 pre-defined date filter combinations
- pathway_nodes: pre-computed pathway hierarchy with all visualization data
- pathway_refresh_log: tracks data refresh status

Includes:
- 8 indexes for efficient filtering by date_filter_id, trust, directory, drug
- Helper functions: create/drop/verify/get_counts for pathway tables
- clear_pathway_nodes() for selective or full data clearing
- get_pathway_refresh_status() for checking last refresh
- Integration with existing ALL_TABLES_SCHEMA and combined helpers
2026-02-04 23:17:27 +00:00
Andrew Charlwood fdd33a67af Initial commit before Ralph loop 2026-02-04 13:04:29 +00:00