HighCostDrugsDemo

Author	SHA1	Message	Date
Andrew Charlwood	778ed99ef6	refactor: slim pathways.db from 351 MB to 3.5 MB by removing unused tables Drop fact_interventions (440K rows), mv_patient_treatment_summary (35K rows), ref_drug_snomed_mapping (144K rows), and processed_files — all unused since the app moved to pre-computed pathway_nodes. Key changes: - Rewrite load_data() to source from pathway_nodes + pathway_refresh_log - Remove 7 dead methods and 8 dead state vars from pathways_app.py - Delete patient_data.py, load_snomed_mapping.py, test_large_dataset_performance.py - Remove SQLiteDataLoader (depended on fact_interventions) - Remove file tracking schema (processed_files tracked fact_interventions loads) - Remove legacy diagnosis functions from diagnosis_lookup.py - Add source_row_count migration for pathway_refresh_log - Clean all cross-references in __init__.py, data_source.py, migrate.py	2026-02-06 08:51:03 +00:00
Andrew Charlwood	c6e426e36c	fix: increase network timeout and batch size for GP lookup queries (Task 3.2) Dry run test revealed GP lookup queries timing out at 30s (connection_timeout in snowflake.toml). Increased to 600s. Also increased batch_size from 500 to 5000 — query time is ~40s regardless of batch size (CTE compilation overhead), so larger batches reduce total time from ~50min to ~6min for 36K patients. Dry run results: 91.8% GP match rate, 49.3% drug-indication match rate, 42,072 modified UPIDs, 1,846 pathway nodes across 6 date filters.	2026-02-05 23:55:12 +00:00
Andrew Charlwood	920570b437	feat: integrate drug-aware indication matching into refresh pipeline (Task 3.1) Replace old per-patient indication matching in refresh_pathways.py with drug-aware matching via assign_drug_indications(). Each drug is now cross-referenced against both the patient's GP diagnoses AND the DimSearchTerm.csv drug mapping. GP codes restricted to HCD data window via earliest_hcd_date parameter.	2026-02-05 23:11:01 +00:00
Andrew Charlwood	22222fe9ca	fix: resolve Snowflake column casing and UPID mapping issues (Task 3.1) Three issues identified and fixed during Task 3.1 testing: 1. Snowflake column name casing: - Unquoted columns in Snowflake are returned as UPPERCASE - Fixed by aliasing columns with quoted names: AS "Search_Term" - Now correctly populates 139 unique Search_Terms (was 0) 2. Duplicate UPID index error: - indication_df_for_chart could have duplicate UPIDs - Added drop_duplicates(subset=['UPID']) before set_index() - Keeps first occurrence (DIAGNOSIS over FALLBACK) 3. Missing UPIDs in indication lookup: - Old code: built indication_df from unique PseudoNHSNoLinked only - Problem: patients with multiple UPIDs (multi-provider) were missing - Fixed: now builds indication_df from ALL unique UPIDs in df - Also handles NaN values in Directory column safely Validation results from test run: - 36,628 patients queried - 34,006 (92.8%) had GP diagnosis matches - 139 unique Search_Terms found - Top 5: drug misuse (8602), influenza (6239), diabetes (2476) Still to verify: full pathway processing after these fixes.	2026-02-05 18:30:23 +00:00
Andrew Charlwood	ad10b374cb	feat: integrate Snowflake-direct indication lookup into CLI refresh (Task 1.2, 2.3) Replace batch_lookup_indication_groups() with get_patient_indication_groups() for indication chart processing. The new approach: - Extracts unique PseudoNHSNoLinked values from HCD data - Queries Snowflake directly using the cluster CTE - Builds indication_df mapping UPID → Search_Term (matched) or Directory (fallback) - Logs coverage statistics (diagnosis % vs fallback %) This completes the integration of the new Snowflake-direct GP lookup approach.	2026-02-05 17:06:34 +00:00
Andrew Charlwood	8952156798	feat: integrate batch GP diagnosis lookup for indication charts (Task 3.2) - Add batch_lookup_indication_groups() to diagnosis_lookup.py - Efficient batch Snowflake queries (500 patients per batch) - Returns UPID → Indication_Group mapping - Source tracking: DIAGNOSIS vs FALLBACK - Update cli/refresh_pathways.py indication processing - Call batch_lookup_indication_groups() before chart generation - Build indication_df for process_indication_pathway_for_date_filter() - Log diagnosis coverage statistics - Enables full --chart-type all functionality	2026-02-05 14:45:06 +00:00
Andrew Charlwood	593d14c70f	feat: add chart_type argument to refresh command (Task 3.1) - Add --chart-type argument with choices: directory, indication, all - Update insert_pathway_records to include chart_type column - Update refresh_pathways to process multiple chart types - Update logging to show chart type counts - Indication chart processing deferred to Task 3.2 (GP diagnosis integration)	2026-02-05 14:38:57 +00:00
Andrew Charlwood	adc1dbfc58	feat: complete Task 2.2 - test refresh pipeline with Snowflake data Tested full refresh pipeline end-to-end with real Snowflake data: - Fixed trust filter to read Name column from defaultTrusts.csv - Fixed Decimal type handling in calculate_cost_per_patient_per_annum - Fixed array handling in convert_to_records for average_administered - Added required reference CSV files to data/ directory - Configured Snowflake connection (account, warehouse, user) Results: - Snowflake fetch: 656,695 records in ~7s - Transformations: 519,848 records after UPID/drug/directory - Pathway nodes: 293 for all_6mo (8 trusts, 14 directories) - Total processing time: ~6.2 minutes	2026-02-05 00:20:12 +00:00
Andrew Charlwood	092fdbba5a	feat: add CLI refresh command for pathway data (Task 2.1) Add cli/refresh_pathways.py with: - refresh_pathways() main function for full pipeline orchestration - insert_pathway_records() for SQLite insertion - log_refresh_start/complete/failed() for refresh tracking - CLI with --minimum-patients, --provider-codes, --dry-run, --verbose Uses existing pipeline functions: - fetch_and_transform_data() from pathway_pipeline.py - process_all_date_filters() for 6 date filter combinations - Schema helpers from data_processing/schema.py	2026-02-04 23:30:11 +00:00
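
The batch-size reasoning in commit c6e426e36c can be checked with a little arithmetic. The sketch below is not code from the repository: the function name `estimated_lookup_minutes` is hypothetical, the patient count of 36,628 is taken from commit 22222fe9ca, and it assumes every batch costs a flat ~40 s regardless of size, as the commit message reports.

```python
import math


def estimated_lookup_minutes(
    n_patients: int,
    batch_size: int,
    seconds_per_batch: float = 40.0,  # assumption: fixed per-query cost quoted in c6e426e36c
) -> float:
    """Estimate total GP lookup time when each batch takes a constant time."""
    # The final batch may be partial, so round the batch count up.
    n_batches = math.ceil(n_patients / batch_size)
    return n_batches * seconds_per_batch / 60.0


# Compare the old and new batch sizes for the patient count in 22222fe9ca.
for size in (500, 5000):
    minutes = estimated_lookup_minutes(36_628, size)
    print(f"batch_size={size}: about {minutes:.1f} min")
```

Under these assumptions the estimate is about 49 minutes for batches of 500 and about 5 minutes for batches of 5000, in line with the "~50min to ~6min" figures in the commit message. The total scales with the number of batches rather than the number of patients because the per-query overhead dominates.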

9 Commits
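
The CLI flags named in commits 092fdbba5a and 593d14c70f could be wired up roughly as follows. This is a minimal sketch, not the contents of cli/refresh_pathways.py: the flag names and the `--chart-type` choices come from the commit messages, while the defaults, argument types, help text, and the helper `chart_types_to_process` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the flags listed in the commit messages."""
    parser = argparse.ArgumentParser(
        prog="refresh_pathways",
        description="Refresh pre-computed pathway data (illustrative sketch).",
    )
    # Choices are from commit 593d14c70f; the default is an assumption.
    parser.add_argument(
        "--chart-type",
        choices=["directory", "indication", "all"],
        default="directory",
    )
    # Flags are from commit 092fdbba5a; types and defaults are assumptions.
    parser.add_argument("--minimum-patients", type=int, default=5)
    parser.add_argument("--provider-codes", nargs="*", default=None)
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    return parser


def chart_types_to_process(chart_type: str) -> list[str]:
    """Expand 'all' into the individual chart types (hypothetical helper)."""
    if chart_type == "all":
        return ["directory", "indication"]
    return [chart_type]


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(chart_types_to_process(args.chart_type))
```

Restricting `--chart-type` with `choices` lets argparse reject unknown values before any Snowflake query runs, and expanding `all` in one place keeps the per-chart-type processing loop simple.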