HighCostDrugsDemo

Author	SHA1	Message	Date
admin	09be4c2472	demo docker file	2026-02-17 11:32:25 +00:00
admin	fcbde7c689	Restructured src to more logical hierarchy	2026-02-09 16:22:05 +00:00
Andrew Charlwood	76838887e6	refactor: reorganize repository to src/ layout Move 6 packages (core, config, data_processing, analysis, visualization, cli) into src/ to reduce root clutter. Merge tools/data.py into data_processing/transforms.py. Move docs to docs/. Path resolution via .pth file (setup_dev.py), pytest pythonpath config, and sys.path bootstrap in rxconfig.py and CLI entry points. Clean up pyproject.toml deps (remove stale pins, add snowflake-connector-python). Fix tomllib import for Python 3.10 compatibility. All 113 tests pass.	2026-02-06 12:03:48 +00:00
Andrew Charlwood	778ed99ef6	refactor: slim pathways.db from 351 MB to 3.5 MB by removing unused tables Drop fact_interventions (440K rows), mv_patient_treatment_summary (35K rows), ref_drug_snomed_mapping (144K rows), and processed_files — all unused since the app moved to pre-computed pathway_nodes. Key changes: - Rewrite load_data() to source from pathway_nodes + pathway_refresh_log - Remove 7 dead methods and 8 dead state vars from pathways_app.py - Delete patient_data.py, load_snomed_mapping.py, test_large_dataset_performance.py - Remove SQLiteDataLoader (depended on fact_interventions) - Remove file tracking schema (processed_files tracked fact_interventions loads) - Remove legacy diagnosis functions from diagnosis_lookup.py - Add source_row_count migration for pathway_refresh_log - Clean all cross-references in __init__.py, data_source.py, migrate.py	2026-02-06 08:51:03 +00:00
Andrew Charlwood	c6e426e36c	fix: increase network timeout and batch size for GP lookup queries (Task 3.2) Dry run test revealed GP lookup queries timing out at 30s (connection_timeout in snowflake.toml). Increased to 600s. Also increased batch_size from 500 to 5000 — query time is ~40s regardless of batch size (CTE compilation overhead), so larger batches reduce total time from ~50min to ~6min for 36K patients. Dry run results: 91.8% GP match rate, 49.3% drug-indication match rate, 42,072 modified UPIDs, 1,846 pathway nodes across 6 date filters.	2026-02-05 23:55:12 +00:00
Andrew Charlwood	408976e001	feat: add assign_drug_indications() for drug-aware indication matching (Task 2.1 + 2.2)	2026-02-05 23:05:40 +00:00
Andrew Charlwood	c93417f0e7	feat: return ALL GP matches with code_frequency in get_patient_indication_groups (Task 1.1) - Replace QUALIFY ROW_NUMBER()=1 with GROUP BY + COUNT(*) to return all matching Search_Terms per patient instead of just the most recent - Add earliest_hcd_date parameter to restrict GP codes to HCD data window - Return code_frequency column (count of matching SNOMED codes per Search_Term) for use as tiebreaker in drug-aware indication matching - Update empty DataFrame returns to match new column format	2026-02-05 23:01:01 +00:00
Andrew Charlwood	b0a8a9de1c	feat: merge asthma Search_Term variants in CLUSTER_MAPPING_SQL and drug mapping (Task 1.2) Merge 'allergic asthma' and 'severe persistent allergic asthma' into canonical 'asthma' in both CLUSTER_MAPPING_SQL (Snowflake CTE) and load_drug_indication_mapping() (DimSearchTerm.csv loader). - CLUSTER_MAPPING_SQL: 3 Cluster_IDs (AST_COD, eFI2_Asthma, SEVAST_COD) now all map to Search_Term = 'asthma' - Added SEARCH_TERM_MERGE_MAP constant for reusable normalization - load_drug_indication_mapping() applies merge at CSV load time - urticaria (XSAL_COD) stays separate — not merged with asthma - Combined asthma drug list: BENRALIZUMAB, DUPILUMAB, INHALED, MEPOLIZUMAB, OMALIZUMAB, RESLIZUMAB	2026-02-05 22:56:29 +00:00
Andrew Charlwood	0779df78d1	feat: add drug-to-indication mapping from DimSearchTerm.csv (Task 1.2) Add load_drug_indication_mapping() and get_search_terms_for_drug() to diagnosis_lookup.py. Loads DimSearchTerm.csv to build bidirectional lookup between drug name fragments and Search_Terms. Uses substring matching for drug fragments (handles both exact names like ADALIMUMAB and partial fragments like PEGYLATED). Handles duplicate Search_Terms (e.g., diabetes appearing under two directorates) by combining fragments.	2026-02-05 22:48:09 +00:00
Andrew Charlwood	22222fe9ca	fix: resolve Snowflake column casing and UPID mapping issues (Task 3.1) Three issues identified and fixed during Task 3.1 testing: 1. Snowflake column name casing: - Unquoted columns in Snowflake are returned as UPPERCASE - Fixed by aliasing columns with quoted names: AS "Search_Term" - Now correctly populates 139 unique Search_Terms (was 0) 2. Duplicate UPID index error: - indication_df_for_chart could have duplicate UPIDs - Added drop_duplicates(subset=['UPID']) before set_index() - Keeps first occurrence (DIAGNOSIS over FALLBACK) 3. Missing UPIDs in indication lookup: - Old code: built indication_df from unique PseudoNHSNoLinked only - Problem: patients with multiple UPIDs (multi-provider) were missing - Fixed: now builds indication_df from ALL unique UPIDs in df - Also handles NaN values in Directory column safely Validation results from test run: - 36,628 patients queried - 34,006 (92.8%) had GP diagnosis matches - 139 unique Search_Terms found - Top 5: drug misuse (8602), influenza (6239), diabetes (2476) Still to verify: full pathway processing after these fixes.	2026-02-05 18:30:23 +00:00
Andrew Charlwood	1a817b8257	feat: add get_patient_indication_groups() for Snowflake-direct GP lookup (Task 1.1) - Add CLUSTER_MAPPING_SQL constant embedding full snomed_indication_mapping_query.sql - Add get_patient_indication_groups() function that queries Snowflake directly - Uses QUALIFY ROW_NUMBER() to get most recent diagnosis per patient - Returns DataFrame with PatientPseudonym, Search_Term, EventDateTime - Handles edge cases: empty list, Snowflake unavailable - Batch processing with configurable batch_size (default 500) - Comprehensive logging for match statistics	2026-02-05 17:03:12 +00:00
Andrew Charlwood	5b1569ed5c	fix: correct patient identifier for GP diagnosis lookup (Task 3.3) Two critical fixes for the indication-based pathway feature: 1. clean_snomed_code() now handles scientific notation (e.g., "1.06e+16") - CSV export from pandas/Excel converts large SNOMED codes to scientific notation - Without this fix, codes like "10629311000119108" were stored as "1.06e+16" - Now properly converts to full integer strings 2. batch_lookup_indication_groups() now uses PseudoNHSNoLinked instead of PersonKey - PersonKey is LocalPatientID (provider-specific like "J188448") - PseudoNHSNoLinked is the pseudonymised NHS number that matches PatientPseudonym in GP records - Without this fix, 0% of patients matched GP records - Test shows ~20% match rate for ADALIMUMAB patients with correct identifier	2026-02-05 15:49:24 +00:00
Andrew Charlwood	8952156798	feat: integrate batch GP diagnosis lookup for indication charts (Task 3.2) - Add batch_lookup_indication_groups() to diagnosis_lookup.py - Efficient batch Snowflake queries (500 patients per batch) - Returns UPID → Indication_Group mapping - Source tracking: DIAGNOSIS vs FALLBACK - Update cli/refresh_pathways.py indication processing - Call batch_lookup_indication_groups() before chart generation - Build indication_df for process_indication_pathway_for_date_filter() - Log diagnosis coverage statistics - Enables full --chart-type all functionality	2026-02-05 14:45:06 +00:00
Andrew Charlwood	7cbc648c6d	feat: add indication pathway processing functions (Task 2.3) - Add generate_icicle_chart_indication() to pathway_analyzer.py - Variant that uses indication_df instead of directory_df - Groups by Trust → Search_Term → Drug → Pathway - Accepts indication_df mapping UPID → Indication_Group - Add process_indication_pathway_for_date_filter() to pathway_pipeline.py - Processes indication-based pathway for a single date filter - Uses generate_icicle_chart_indication() for hierarchy building - Add extract_indication_fields() to pathway_pipeline.py - Extracts trust_name, search_term, drug_sequence from ids column - Similar to extract_denormalized_fields() but for indication charts - Update convert_to_records() with chart_type parameter - Includes chart_type column in output records - Supports "directory" and "indication" values - Add ChartType type alias (Literal["directory", "indication"]) - Update __all__ exports with new functions	2026-02-05 14:32:28 +00:00
Andrew Charlwood	19607d72b0	feat: add chart_type column to pathway_nodes schema (Task 2.2) - Add chart_type column (TEXT NOT NULL DEFAULT 'directory') - Update UNIQUE constraint to (date_filter_id, chart_type, ids) - Add idx_pathway_nodes_chart_type index for filtering - Add migrate_pathway_nodes_chart_type() function for existing databases - Update initialize_database() to run migration automatically - Existing rows default to 'directory' chart type	2026-02-05 14:24:57 +00:00
Andrew Charlwood	506769470d	feat: add get_directorate_from_diagnosis() function (Task 2.1) - Added DirectorateAssignment dataclass for return type - Added get_directorate_from_diagnosis() function to diagnosis_lookup.py - Logic: Try diagnosis-based lookup first (direct SNOMED match) - Returns FALLBACK source if no match found, letting caller handle fallback - Extracts PatientPseudonym from UPID (last part after provider code) - Updated __all__ exports with new dataclass and function - Tested: function handles no-match cases correctly	2026-02-05 14:19:18 +00:00
Andrew Charlwood	b44d22de2c	feat: add direct SNOMED lookup functions (Task 1.3) Add two new functions to diagnosis_lookup.py for direct SNOMED code matching: - get_drug_snomed_codes(drug_name): Query ref_drug_snomed_mapping for all SNOMED codes mapped to a drug. Returns list of DrugSnomedMapping with snomed_code, snomed_description, search_term, primary_directorate. Tested: ADALIMUMAB returns 1320 mappings across 10 Search_Terms. - patient_has_indication_direct(patient_pseudonym, mappings, connector): Query PrimaryCareClinicalCoding for exact SNOMED code matches. Returns most recent match by EventDateTime with DirectSnomedMatchResult. Both functions follow existing patterns in the module and are exported in __all__. The lookup is case-insensitive for drug names.	2026-02-05 14:14:55 +00:00
Andrew Charlwood	6d68b5eaa5	feat: add SNOMED mapping loader script (Task 1.2) - Create data_processing/load_snomed_mapping.py with: - migrate_drug_snomed_mapping() for CSV to SQLite migration - get_drug_snomed_mapping_counts() for statistics - verify_drug_snomed_mapping_migration() for validation - clean_snomed_code() to remove trailing .0 from SNOMED codes - CLI interface: python -m data_processing.load_snomed_mapping - Loaded 144,056 mappings from enriched CSV: - 707 unique drugs - 187 unique search terms - 21,265 unique SNOMED codes	2026-02-05 14:10:36 +00:00
Andrew Charlwood	9943e85761	feat: add ref_drug_snomed_mapping schema (Task 1.1) - Add REF_DRUG_SNOMED_MAPPING_SCHEMA with 11 columns for direct SNOMED mapping - Add 5 indexes for lookup performance (drug, cleaned_drug, snomed, search_term, composite) - Add create_drug_snomed_mapping_table() helper function - Update helper functions (drop, get_counts, verify_exists) to include new table - Table is included in REFERENCE_TABLES_SCHEMA and created by migration	2026-02-05 14:06:31 +00:00
Andrew Charlwood	adc1dbfc58	feat: complete Task 2.2 - test refresh pipeline with Snowflake data Tested full refresh pipeline end-to-end with real Snowflake data: - Fixed trust filter to read Name column from defaultTrusts.csv - Fixed Decimal type handling in calculate_cost_per_patient_per_annum - Fixed array handling in convert_to_records for average_administered - Added required reference CSV files to data/ directory - Configured Snowflake connection (account, warehouse, user) Results: - Snowflake fetch: 656,695 records in ~7s - Transformations: 519,848 records after UPID/drug/directory - Pathway nodes: 293 for all_6mo (8 trusts, 14 directories) - Total processing time: ~6.2 minutes	2026-02-05 00:20:12 +00:00
Andrew Charlwood	5945649ae3	feat: add pathway pipeline module (Task 1.2) Create data_processing/pathway_pipeline.py with: - DateFilterConfig dataclass for date filter configuration - DATE_FILTER_CONFIGS with 6 pre-defined combinations - compute_date_ranges() for computing actual dates from config - fetch_and_transform_data() for Snowflake fetch + transformations - process_pathway_for_date_filter() using existing generate_icicle_chart() - extract_denormalized_fields() to parse trust/directory/drugs from ids - convert_to_records() for SQLite insertion - process_all_date_filters() convenience function	2026-02-04 23:21:39 +00:00
Andrew Charlwood	34396fef5e	feat: add pathway data architecture schema (Task 1.1) Add three new tables to support pre-computed pathway data: - pathway_date_filters: 6 pre-defined date filter combinations - pathway_nodes: pre-computed pathway hierarchy with all visualization data - pathway_refresh_log: tracks data refresh status Includes: - 8 indexes for efficient filtering by date_filter_id, trust, directory, drug - Helper functions: create/drop/verify/get_counts for pathway tables - clear_pathway_nodes() for selective or full data clearing - get_pathway_refresh_status() for checking last refresh - Integration with existing ALL_TABLES_SCHEMA and combined helpers	2026-02-04 23:17:27 +00:00
Andrew Charlwood	fdd33a67af	Initial commit before Ralph loop	2026-02-04 13:04:29 +00:00

23 Commits
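
The batch-size reasoning in commit c6e426e36c (roughly 40s per query regardless of batch size, so fewer and larger batches finish sooner) can be checked with a short sketch. The helper names `chunked` and `estimated_minutes` are hypothetical and do not come from the repository. The 40-second figure is taken from that commit message, and the patient count of 36,628 is taken from the validation run reported in commit 22222fe9ca. This is an illustration of the arithmetic under those assumptions, not the project's actual batching code.

```python
import math


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def estimated_minutes(n_patients, batch_size, seconds_per_query=40.0):
    """Estimate total lookup time, assuming a flat cost per query.

    The flat ~40s per query is the figure quoted in commit c6e426e36c;
    real query times will vary.
    """
    n_batches = math.ceil(n_patients / batch_size)
    return n_batches * seconds_per_query / 60.0


if __name__ == "__main__":
    n_patients = 36_628  # from the validation run in commit 22222fe9ca
    for size in (500, 5000):
        n_batches = math.ceil(n_patients / size)
        minutes = estimated_minutes(n_patients, size)
        print(f"batch_size={size}: {n_batches} batches, ~{minutes:.1f} min")
    # batch_size=500: 74 batches, ~49.3 min
    # batch_size=5000: 8 batches, ~5.3 min
```

Under these assumptions the estimate lands close to the "~50min to ~6min" reduction reported in the commit message. The small gap at the larger batch size is plausibly connection and result-handling overhead, which this sketch does not model.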