112 Commits

Author SHA1 Message Date
Andrew Charlwood b71748fa7d feat: add shared pathway query functions for Dash data access (Task 1.1)
Extract load_data() and load_pathway_data() logic from Reflex AppState
into standalone functions in src/data_processing/pathway_queries.py.
Create thin dash_app/data/queries.py wrapper with DB_PATH resolution.
2026-02-06 13:02:34 +00:00
Andrew Charlwood 1c3ece6480 feat: create dash_app skeleton with nhs.css and MantineProvider (Phase 0)
- dash_app/ directory structure: app.py, assets/, data/, components/, callbacks/, utils/
- run_dash.py entry point at project root
- Added dash>=2.14.0 and dash-mantine-components>=0.14.0 to pyproject.toml
- app.py: Dash app with MantineProvider wrapper and 3 dcc.Store components
- nhs.css: extracted from 01_nhs_classic.html (sans mock icicle CSS)
- Validated: app starts cleanly at localhost:8050
2026-02-06 12:57:47 +00:00
Andrew Charlwood 76838887e6 refactor: reorganize repository to src/ layout
Move 6 packages (core, config, data_processing, analysis, visualization, cli)
into src/ to reduce root clutter. Merge tools/data.py into
data_processing/transforms.py. Move docs to docs/.

Path resolution via .pth file (setup_dev.py), pytest pythonpath config,
and sys.path bootstrap in rxconfig.py and CLI entry points.

Clean up pyproject.toml deps (remove stale pins, add snowflake-connector-python).
Fix tomllib import for Python 3.10 compatibility.

All 113 tests pass.
2026-02-06 12:03:48 +00:00
Andrew Charlwood 1581b1d3dd docs: update CLAUDE.md to reflect slimmed database architecture
Remove references to deleted tables (fact_interventions,
mv_patient_treatment_summary, ref_drug_snomed_mapping, processed_files),
deleted files (patient_data.py, load_snomed_mapping.py), and removed
classes (SQLiteDataLoader). Update package structure, data loaders,
database schema, fallback chain, and AppState descriptions.
2026-02-06 09:39:19 +00:00
Andrew Charlwood 778ed99ef6 refactor: slim pathways.db from 351 MB to 3.5 MB by removing unused tables
Drop fact_interventions (440K rows), mv_patient_treatment_summary (35K rows),
ref_drug_snomed_mapping (144K rows), and processed_files — all unused since
the app moved to pre-computed pathway_nodes.

Key changes:
- Rewrite load_data() to source from pathway_nodes + pathway_refresh_log
- Remove 7 dead methods and 8 dead state vars from pathways_app.py
- Delete patient_data.py, load_snomed_mapping.py, test_large_dataset_performance.py
- Remove SQLiteDataLoader (depended on fact_interventions)
- Remove file tracking schema (processed_files tracked fact_interventions loads)
- Remove legacy diagnosis functions from diagnosis_lookup.py
- Add source_row_count migration for pathway_refresh_log
- Clean all cross-references in __init__.py, data_source.py, migrate.py
2026-02-06 08:51:03 +00:00
Andrew Charlwood bb93c1673e chore: archive unused files and move legacy code to can_delete
archive/ — unused reference files (no active code references):
  - LookupSearchTermCleanedDrugName.csv, condition_directorate_mapping.csv
  - na_directory_rows.csv (diagnostic output), ta-recommendations.xlsx
  - snomed_indication_mapping_query.sql (source for embedded SQL)
  - IMPROVEMENT_RECOMMENDATIONS.md, power query.pq

archive/can_delete/ — legacy code and logs safe to remove:
  - dashboard_gui.py (replaced by Reflex app)
  - pathways_app_old.py.bak (old backup)
  - Ralph loop iteration logs (iterations 2-8)
2026-02-06 01:01:02 +00:00
Andrew Charlwood a31907aa1f feat: complete drug-aware indication matching and cleanup app_v2
- Remove app_v2.py (consolidated into pathways_app.py), fix __init__ import
- Add DimSearchTerm.csv, drug_indication_clusters.csv, drug_snomed_mapping_enriched.csv
  as reference data for SNOMED-based indication matching
- Add snomed_indication_mapping_query.sql (source for embedded cluster mapping)
- Update DESIGN_SYSTEM.md, RALPH_PROMPT.md, ralph.ps1, uv.lock
2026-02-06 00:33:29 +00:00
Andrew Charlwood f3bba6dfab docs: complete Phase 4 validation — full refresh and data verification (Task 4.1-4.3)
Full refresh: 2,947 nodes (1,101 directory + 1,846 indication) in 738s.
Validation: RA/asthma drugs correctly grouped, fallback labels present,
directory charts unchanged, Reflex compiles. All completion criteria met.
2026-02-06 00:12:53 +00:00
Andrew Charlwood b674543878 docs: update progress.txt with Iteration 6 results (Task 3.2) 2026-02-05 23:55:26 +00:00
Andrew Charlwood c6e426e36c fix: increase network timeout and batch size for GP lookup queries (Task 3.2)
Dry run test revealed GP lookup queries timing out at 30s (connection_timeout
in snowflake.toml). Increased to 600s. Also increased batch_size from 500 to
5000 — query time is ~40s regardless of batch size (CTE compilation overhead),
so larger batches reduce total time from ~50min to ~6min for 36K patients.

Dry run results: 91.8% GP match rate, 49.3% drug-indication match rate,
42,072 modified UPIDs, 1,846 pathway nodes across 6 date filters.
2026-02-05 23:55:12 +00:00
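The batch-size reasoning in c6e426e36c can be checked with quick arithmetic. This is a minimal sketch, assuming a flat ~40s per query regardless of batch size and roughly 36,000 patients (both figures come from the commit message); the function name `estimated_minutes` is hypothetical and not part of the repository.

```python
import math

SECONDS_PER_QUERY = 40   # approximate per-batch query time reported in the commit
PATIENTS = 36_000        # "36K patients", rounded for illustration


def estimated_minutes(batch_size: int) -> float:
    """Estimate total lookup time when every batch costs the same fixed time."""
    batches = math.ceil(PATIENTS / batch_size)
    return batches * SECONDS_PER_QUERY / 60


old_estimate = estimated_minutes(500)    # 72 batches
new_estimate = estimated_minutes(5000)   # 8 batches
print(f"{old_estimate:.0f} min -> {new_estimate:.1f} min")
```

Because the per-query cost is dominated by fixed overhead, total time scales with the number of batches rather than the number of patients, which is why a 10x larger batch gives close to a 10x speed-up.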
Andrew Charlwood 73088b063b docs: update progress.txt with Iteration 5 results (Task 3.1) 2026-02-05 23:11:14 +00:00
Andrew Charlwood 920570b437 feat: integrate drug-aware indication matching into refresh pipeline (Task 3.1)
Replace old per-patient indication matching in refresh_pathways.py with
drug-aware matching via assign_drug_indications(). Each drug is now
cross-referenced against both the patient's GP diagnoses AND the
DimSearchTerm.csv drug mapping. GP codes restricted to HCD data window
via earliest_hcd_date parameter.
2026-02-05 23:11:01 +00:00
Andrew Charlwood d9891c8991 docs: update progress.txt with Iteration 4 results (Task 2.1 + 2.2) 2026-02-05 23:06:27 +00:00
Andrew Charlwood 408976e001 feat: add assign_drug_indications() for drug-aware indication matching (Task 2.1 + 2.2) 2026-02-05 23:05:40 +00:00
Andrew Charlwood 947b87a331 docs: update progress.txt with Iteration 3 results (Task 1.1) 2026-02-05 23:01:15 +00:00
Andrew Charlwood c93417f0e7 feat: return ALL GP matches with code_frequency in get_patient_indication_groups (Task 1.1)
- Replace QUALIFY ROW_NUMBER()=1 with GROUP BY + COUNT(*) to return all matching
  Search_Terms per patient instead of just the most recent
- Add earliest_hcd_date parameter to restrict GP codes to HCD data window
- Return code_frequency column (count of matching SNOMED codes per Search_Term)
  for use as tiebreaker in drug-aware indication matching
- Update empty DataFrame returns to match new column format
2026-02-05 23:01:01 +00:00
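The query change in c93417f0e7 (all matches with a frequency count, rather than only the most recent row) can be illustrated with a small stand-in. This sketch uses an in-memory SQLite table instead of Snowflake; the table name `gp_matches` and the sample rows are hypothetical, and the real query also applies the `earliest_hcd_date` filter, which is omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gp_matches (PatientPseudonym TEXT, Search_Term TEXT)")
conn.executemany(
    "INSERT INTO gp_matches VALUES (?, ?)",
    [
        ("P1", "asthma"),
        ("P1", "asthma"),
        ("P1", "rheumatoid arthritis"),
        ("P2", "diabetes"),
    ],
)

# GROUP BY keeps one row per (patient, Search_Term) pair and counts how many
# matching coded rows support it; a ROW_NUMBER() = 1 filter would keep only
# a single Search_Term per patient.
rows = conn.execute(
    """
    SELECT PatientPseudonym, Search_Term, COUNT(*) AS code_frequency
    FROM gp_matches
    GROUP BY PatientPseudonym, Search_Term
    ORDER BY PatientPseudonym, code_frequency DESC, Search_Term
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Patient P1 keeps both Search_Terms, and `code_frequency` is available as a tiebreaker when more than one of them is valid for a given drug.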
Andrew Charlwood 4fed0e53df docs: update progress.txt with Iteration 2 results (Task 1.2) 2026-02-05 22:56:44 +00:00
Andrew Charlwood b0a8a9de1c feat: merge asthma Search_Term variants in CLUSTER_MAPPING_SQL and drug mapping (Task 1.2)
Merge 'allergic asthma' and 'severe persistent allergic asthma' into
canonical 'asthma' in both CLUSTER_MAPPING_SQL (Snowflake CTE) and
load_drug_indication_mapping() (DimSearchTerm.csv loader).

- CLUSTER_MAPPING_SQL: 3 Cluster_IDs (AST_COD, eFI2_Asthma, SEVAST_COD) now
  all map to Search_Term = 'asthma'
- Added SEARCH_TERM_MERGE_MAP constant for reusable normalization
- load_drug_indication_mapping() applies merge at CSV load time
- urticaria (XSAL_COD) stays separate — not merged with asthma
- Combined asthma drug list: BENRALIZUMAB, DUPILUMAB, INHALED, MEPOLIZUMAB,
  OMALIZUMAB, RESLIZUMAB
2026-02-05 22:56:29 +00:00
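A merge map of the kind described in b0a8a9de1c might look like the following. This is a sketch only: the constant name `SEARCH_TERM_MERGE_MAP` and the two merged variants come from the commit message, while the helper `normalize_search_term` and its case-insensitive lookup are assumptions.

```python
# Variants that should collapse into one canonical Search_Term.
SEARCH_TERM_MERGE_MAP = {
    "allergic asthma": "asthma",
    "severe persistent allergic asthma": "asthma",
}


def normalize_search_term(term: str) -> str:
    """Return the canonical Search_Term, leaving unmapped terms unchanged."""
    cleaned = term.strip()
    # Lookup is case-insensitive (an assumption); unmapped terms pass through.
    return SEARCH_TERM_MERGE_MAP.get(cleaned.lower(), cleaned)


print(normalize_search_term("allergic asthma"))
print(normalize_search_term("urticaria"))
```

Applying the same map in both the SQL mapping and the CSV loader keeps the two sources consistent, and terms such as urticaria that are absent from the map stay separate.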
Andrew Charlwood c85aae4f6a docs: update progress.txt with Iteration 1 results (Task 1.2) 2026-02-05 22:48:46 +00:00
Andrew Charlwood 0779df78d1 feat: add drug-to-indication mapping from DimSearchTerm.csv (Task 1.2)
Add load_drug_indication_mapping() and get_search_terms_for_drug() to
diagnosis_lookup.py. Loads DimSearchTerm.csv to build bidirectional
lookup between drug name fragments and Search_Terms. Uses substring
matching for drug fragments (handles both exact names like ADALIMUMAB
and partial fragments like PEGYLATED). Handles duplicate Search_Terms
(e.g., diabetes appearing under two directorates) by combining fragments.
2026-02-05 22:48:09 +00:00
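The substring matching described in 0779df78d1 can be sketched as below. The rows are illustrative placeholders rather than real DimSearchTerm.csv content, `build_mapping` is a hypothetical helper, and the signature shown for `get_search_terms_for_drug` is an assumption.

```python
from collections import defaultdict

# Illustrative (Search_Term, drug fragment) pairs; not real reference data.
ROWS = [
    ("rheumatoid arthritis", "ADALIMUMAB"),
    ("hepatitis c", "PEGYLATED"),
    ("diabetes", "DRUG_X"),   # same Search_Term listed twice, as if it
    ("diabetes", "DRUG_Y"),   # appeared under two directorates
]


def build_mapping(rows):
    """Combine fragments for duplicate Search_Terms into one set per term."""
    mapping = defaultdict(set)
    for search_term, fragment in rows:
        mapping[search_term].add(fragment.upper())
    return dict(mapping)


def get_search_terms_for_drug(drug_name, mapping):
    """Return every Search_Term with a fragment contained in the drug name."""
    name = drug_name.upper()
    return sorted(
        term
        for term, fragments in mapping.items()
        if any(fragment in name for fragment in fragments)
    )


mapping = build_mapping(ROWS)
print(get_search_terms_for_drug("Pegylated interferon alfa", mapping))
```

Substring matching lets a single fragment cover several product names, at the cost of possible false positives when a short fragment happens to occur inside an unrelated drug name.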
Andrew Charlwood 1c4d2c07ee docs: mark project complete - all tasks done, viewport testing blocked by env (Iteration 9) 2026-02-05 20:51:48 +00:00
Andrew Charlwood fed909481e docs: update CLAUDE.md with indication chart architecture and CLI docs (Task 5.2) 2026-02-05 20:50:01 +00:00
Andrew Charlwood 4884e0a8cc fix: recreate pathway_nodes with correct UNIQUE constraint and validate end-to-end (Task 5.1)
The UNIQUE constraint was UNIQUE(date_filter_id, ids) instead of
UNIQUE(date_filter_id, chart_type, ids), causing INSERT OR REPLACE
to overwrite directory chart root/trust nodes when indication nodes
were inserted. Dropped and recreated the table, re-ran full refresh.

Validation: both chart types have all hierarchy levels (0-5),
all 12 date filters produce valid icicle charts, KPIs correct.
2026-02-05 20:43:01 +00:00
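The overwrite described in 4884e0a8cc is easy to reproduce with SQLite's standard library bindings. This is a reduced sketch: the table has only three columns, and the `ids` value `"root"` and the helper name `count_after_inserts` are hypothetical.

```python
import sqlite3


def count_after_inserts(unique_columns: str) -> int:
    """Insert a directory and an indication root node, then count rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"""
        CREATE TABLE pathway_nodes (
            date_filter_id TEXT NOT NULL,
            chart_type     TEXT NOT NULL DEFAULT 'directory',
            ids            TEXT NOT NULL,
            UNIQUE({unique_columns})
        )
        """
    )
    insert = (
        "INSERT OR REPLACE INTO pathway_nodes (date_filter_id, chart_type, ids) "
        "VALUES (?, ?, ?)"
    )
    conn.execute(insert, ("all_6mo", "directory", "root"))
    conn.execute(insert, ("all_6mo", "indication", "root"))
    (count,) = conn.execute("SELECT COUNT(*) FROM pathway_nodes").fetchone()
    conn.close()
    return count


# Without chart_type in the key, the second insert replaces the first.
rows_with_old_constraint = count_after_inserts("date_filter_id, ids")
rows_with_new_constraint = count_after_inserts("date_filter_id, chart_type, ids")
print(rows_with_old_constraint, rows_with_new_constraint)
```

With the narrower key, `INSERT OR REPLACE` treats the indication root as a conflict with the directory root and silently deletes the older row; including `chart_type` in the key lets both survive.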
Andrew Charlwood 6331d44165 fix: prevent DataFrame mutation in prepare_data() causing indication charts to fail
prepare_data() mapped Provider Code → Name in-place. When called for directory
charts first, then indication charts, the second call re-mapped already-mapped
values to NaN, silently dropping all data. Added df.copy() to prevent mutation.

Also fixes directory charts only generating data for the first date filter.

Results: 3,633 pathway nodes now generated (1,101 directory + 2,532 indication)
across all 12 datasets (6 date filters × 2 chart types).
2026-02-05 20:10:12 +00:00
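The mutation bug fixed in 6331d44165 can be reproduced in a few lines of pandas. This is a sketch under assumptions: the provider codes, names, and the two function variants are hypothetical, and the real `prepare_data()` does more than this mapping step.

```python
import pandas as pd

# Hypothetical code-to-name lookup.
PROVIDER_NAMES = {"RXX": "Trust A", "RYY": "Trust B"}


def prepare_data_inplace(df: pd.DataFrame) -> pd.DataFrame:
    """Buggy variant: overwrites the caller's column with mapped names."""
    df["Provider Code"] = df["Provider Code"].map(PROVIDER_NAMES)
    return df.dropna(subset=["Provider Code"])


def prepare_data_copy(df: pd.DataFrame) -> pd.DataFrame:
    """Fixed variant: works on a copy, so the caller's frame is untouched."""
    df = df.copy()
    df["Provider Code"] = df["Provider Code"].map(PROVIDER_NAMES)
    return df.dropna(subset=["Provider Code"])


raw = pd.DataFrame({"Provider Code": ["RXX", "RYY"], "Drug": ["A", "B"]})
buggy_first = prepare_data_inplace(raw)    # codes map to names
buggy_second = prepare_data_inplace(raw)   # names are not keys, so all NaN

raw = pd.DataFrame({"Provider Code": ["RXX", "RYY"], "Drug": ["A", "B"]})
fixed_first = prepare_data_copy(raw)
fixed_second = prepare_data_copy(raw)

print(len(buggy_first), len(buggy_second), len(fixed_first), len(fixed_second))
```

On the second in-place call the column already holds names, none of which are keys in the lookup, so `Series.map` yields NaN for every row and all data is dropped without any error being raised.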
Andrew Charlwood 6f88a59978 feat: add chart type toggle for Directory/Indication views (Task 4.1, 4.2, 4.3)
- Add selected_chart_type state variable and set_chart_type() handler
- Add chart_type filter to load_pathway_data() WHERE clause
- Create segmented control toggle component in filter strip
- Add dynamic hierarchy label (Directorate vs Indication)
- Update chart title to include chart type prefix
2026-02-05 19:39:45 +00:00
Andrew Charlwood 2deaa2f6da docs: mark Task 3.1 complete - indication pipeline verified (Task 3.1)
Pipeline test results:
- 695 indication pathway nodes generated for all_6mo filter
- 92.8% GP diagnosis match rate (34,006/36,628 patients)
- 139 unique Search_Terms found
- Top indications: drug misuse, influenza, diabetes, sepsis, cardiovascular disease
- Full pipeline completes in ~10 minutes

Phase 3 complete, Phase 4 (Reflex UI) ready to begin.
2026-02-05 18:44:34 +00:00
Andrew Charlwood 0b5b462766 docs: update progress.txt with iteration 3, add new guardrails (Task 3.1) 2026-02-05 18:31:29 +00:00
Andrew Charlwood 22222fe9ca fix: resolve Snowflake column casing and UPID mapping issues (Task 3.1)
Three issues identified and fixed during Task 3.1 testing:

1. Snowflake column name casing:
   - Unquoted columns in Snowflake are returned as UPPERCASE
   - Fixed by aliasing columns with quoted names: AS "Search_Term"
   - Now correctly populates 139 unique Search_Terms (was 0)

2. Duplicate UPID index error:
   - indication_df_for_chart could have duplicate UPIDs
   - Added drop_duplicates(subset=['UPID']) before set_index()
   - Keeps first occurrence (DIAGNOSIS over FALLBACK)

3. Missing UPIDs in indication lookup:
   - Old code: built indication_df from unique PseudoNHSNoLinked only
   - Problem: patients with multiple UPIDs (multi-provider) were missing
   - Fixed: now builds indication_df from ALL unique UPIDs in df
   - Also handles NaN values in Directory column safely

Validation results from test run:
- 36,628 patients queried
- 34,006 (92.8%) had GP diagnosis matches
- 139 unique Search_Terms found
- Top 3: drug misuse (8602), influenza (6239), diabetes (2476)

Still to verify: full pathway processing after these fixes.
2026-02-05 18:30:23 +00:00
Andrew Charlwood f7166b38c8 docs: update progress.txt with iteration 2 completion (Task 1.2, 2.3) 2026-02-05 17:07:06 +00:00
Andrew Charlwood ad10b374cb feat: integrate Snowflake-direct indication lookup into CLI refresh (Task 1.2, 2.3)
Replace batch_lookup_indication_groups() with get_patient_indication_groups()
for indication chart processing. The new approach:

- Extracts unique PseudoNHSNoLinked values from HCD data
- Queries Snowflake directly using the cluster CTE
- Builds indication_df mapping UPID → Search_Term (matched) or Directory (fallback)
- Logs coverage statistics (diagnosis % vs fallback %)

This completes the integration of the new Snowflake-direct GP lookup approach.
2026-02-05 17:06:34 +00:00
Andrew Charlwood 1a817b8257 feat: add get_patient_indication_groups() for Snowflake-direct GP lookup (Task 1.1)
- Add CLUSTER_MAPPING_SQL constant embedding full snomed_indication_mapping_query.sql
- Add get_patient_indication_groups() function that queries Snowflake directly
- Uses QUALIFY ROW_NUMBER() to get most recent diagnosis per patient
- Returns DataFrame with PatientPseudonym, Search_Term, EventDateTime
- Handles edge cases: empty list, Snowflake unavailable
- Batch processing with configurable batch_size (default 500)
- Comprehensive logging for match statistics
2026-02-05 17:03:12 +00:00
Andrew Charlwood 99bab08402 docs: add guardrails for patient identifier and SNOMED code handling 2026-02-05 15:51:52 +00:00
Andrew Charlwood 843b4f23cc docs: update progress.txt with iteration 9 (Task 3.3 in progress)
Fixed two critical bugs preventing GP diagnosis matching:
1. SNOMED codes in scientific notation now converted to integers
2. Using PseudoNHSNoLinked (not PersonKey) for GP record lookup

Full refresh is running in background - next iteration should verify completion.
2026-02-05 15:51:17 +00:00
Andrew Charlwood 5b1569ed5c fix: correct patient identifier for GP diagnosis lookup (Task 3.3)
Two critical fixes for the indication-based pathway feature:

1. clean_snomed_code() now handles scientific notation (e.g., "1.06e+16")
   - CSV export from pandas/Excel converts large SNOMED codes to scientific notation
   - Without this fix, codes like "10629311000119108" were stored as "1.06e+16"
   - Now properly converts to full integer strings

2. batch_lookup_indication_groups() now uses PseudoNHSNoLinked instead of PersonKey
   - PersonKey is LocalPatientID (provider-specific like "J188448")
   - PseudoNHSNoLinked is the pseudonymised NHS number that matches PatientPseudonym in GP records
   - Without this fix, 0% of patients matched GP records
   - Test shows ~20% match rate for ADALIMUMAB patients with correct identifier
2026-02-05 15:49:24 +00:00
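A cleaning step of the kind described in 5b1569ed5c might be sketched as follows. The function name `clean_snomed_code` comes from the commit message, but this body is an assumption, not the repository's implementation. Note that digits already rounded away in the source file (for example `"1.06e+16"`) cannot be recovered; exact conversion only helps when the notation still carries every digit.

```python
from decimal import Decimal, InvalidOperation


def clean_snomed_code(raw) -> str:
    """Normalise a SNOMED code to a plain integer string where possible."""
    text = str(raw).strip()
    if not text:
        return ""
    try:
        value = Decimal(text)          # exact, unlike float()
    except InvalidOperation:
        return text                    # not numeric: leave unchanged
    if not value.is_finite() or value != value.to_integral_value():
        return text                    # NaN, infinity, or a true fraction
    return str(int(value))


print(clean_snomed_code("10629311000119108.0"))
print(clean_snomed_code("1.0629311000119108E+16"))
```

`Decimal` is used instead of `float` because a 17-digit code exceeds the precision a double can represent exactly, so a round trip through `float` could alter the final digits.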
Andrew Charlwood b9f4041670 docs: update progress.txt with iteration 8 completion (Task 3.2) 2026-02-05 14:45:57 +00:00
Andrew Charlwood 8952156798 feat: integrate batch GP diagnosis lookup for indication charts (Task 3.2)
- Add batch_lookup_indication_groups() to diagnosis_lookup.py
  - Efficient batch Snowflake queries (500 patients per batch)
  - Returns UPID → Indication_Group mapping
  - Source tracking: DIAGNOSIS vs FALLBACK
- Update cli/refresh_pathways.py indication processing
  - Call batch_lookup_indication_groups() before chart generation
  - Build indication_df for process_indication_pathway_for_date_filter()
  - Log diagnosis coverage statistics
- Enables full --chart-type all functionality
2026-02-05 14:45:06 +00:00
Andrew Charlwood 50b8548688 docs: update progress.txt with iteration 7 completion (Task 3.1) 2026-02-05 14:39:35 +00:00
Andrew Charlwood 593d14c70f feat: add chart_type argument to refresh command (Task 3.1)
- Add --chart-type argument with choices: directory, indication, all
- Update insert_pathway_records to include chart_type column
- Update refresh_pathways to process multiple chart types
- Update logging to show chart type counts
- Indication chart processing deferred to Task 3.2 (GP diagnosis integration)
2026-02-05 14:38:57 +00:00
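The `--chart-type` argument from 593d14c70f could be wired up with `argparse` roughly as shown. The three choices come from the commit message; the default value, the program name, and the helper `parse_chart_types` are assumptions for illustration.

```python
import argparse

CHART_TYPES = ("directory", "indication")


def parse_chart_types(argv):
    """Parse --chart-type and expand 'all' into the concrete chart types."""
    parser = argparse.ArgumentParser(prog="refresh_pathways")
    parser.add_argument(
        "--chart-type",
        choices=[*CHART_TYPES, "all"],
        default="directory",  # assumed default, not confirmed by the source
        help="which chart type(s) to refresh",
    )
    args = parser.parse_args(argv)
    if args.chart_type == "all":
        return list(CHART_TYPES)
    return [args.chart_type]


print(parse_chart_types(["--chart-type", "all"]))
```

Expanding `all` at parse time means the rest of the refresh loop only ever deals with concrete chart types.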
Andrew Charlwood 0d15000aa0 docs: update progress.txt with iteration 6 completion (Task 2.3) 2026-02-05 14:33:16 +00:00
Andrew Charlwood 7cbc648c6d feat: add indication pathway processing functions (Task 2.3)
- Add generate_icicle_chart_indication() to pathway_analyzer.py
  - Variant that uses indication_df instead of directory_df
  - Groups by Trust → Search_Term → Drug → Pathway
  - Accepts indication_df mapping UPID → Indication_Group

- Add process_indication_pathway_for_date_filter() to pathway_pipeline.py
  - Processes indication-based pathway for a single date filter
  - Uses generate_icicle_chart_indication() for hierarchy building

- Add extract_indication_fields() to pathway_pipeline.py
  - Extracts trust_name, search_term, drug_sequence from ids column
  - Similar to extract_denormalized_fields() but for indication charts

- Update convert_to_records() with chart_type parameter
  - Includes chart_type column in output records
  - Supports "directory" and "indication" values

- Add ChartType type alias (Literal["directory", "indication"])

- Update __all__ exports with new functions
2026-02-05 14:32:28 +00:00
Andrew Charlwood aabe4bf45d docs: update progress.txt with iteration 5 completion (Task 2.2) 2026-02-05 14:25:44 +00:00
Andrew Charlwood 19607d72b0 feat: add chart_type column to pathway_nodes schema (Task 2.2)
- Add chart_type column (TEXT NOT NULL DEFAULT 'directory')
- Update UNIQUE constraint to (date_filter_id, chart_type, ids)
- Add idx_pathway_nodes_chart_type index for filtering
- Add migrate_pathway_nodes_chart_type() function for existing databases
- Update initialize_database() to run migration automatically
- Existing rows default to 'directory' chart type
2026-02-05 14:24:57 +00:00
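An idempotent column migration like the one in 19607d72b0 can be sketched with the standard `sqlite3` module. The function, column, and index names follow the commit message, but the body is an assumption and the starting table is reduced to two hypothetical columns.

```python
import sqlite3


def migrate_pathway_nodes_chart_type(conn: sqlite3.Connection) -> bool:
    """Add chart_type if missing. Returns True when a change was made."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(pathway_nodes)")}
    if "chart_type" in columns:
        return False  # already migrated
    conn.execute(
        "ALTER TABLE pathway_nodes "
        "ADD COLUMN chart_type TEXT NOT NULL DEFAULT 'directory'"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_pathway_nodes_chart_type "
        "ON pathway_nodes (chart_type)"
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathway_nodes (date_filter_id TEXT, ids TEXT)")
conn.execute("INSERT INTO pathway_nodes VALUES ('all_6mo', 'root')")

first_run = migrate_pathway_nodes_chart_type(conn)
second_run = migrate_pathway_nodes_chart_type(conn)
chart_types = [row[0] for row in conn.execute("SELECT chart_type FROM pathway_nodes")]
print(first_run, second_run, chart_types)
```

This sketch covers only the column and index. SQLite cannot change a table-level UNIQUE constraint through `ALTER TABLE`, so widening the key on an existing database requires rebuilding the table.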
Andrew Charlwood 3db93a685b docs: update progress.txt with iteration 4 completion (Task 2.1) 2026-02-05 14:20:04 +00:00
Andrew Charlwood 506769470d feat: add get_directorate_from_diagnosis() function (Task 2.1)
- Added DirectorateAssignment dataclass for return type
- Added get_directorate_from_diagnosis() function to diagnosis_lookup.py
- Logic: Try diagnosis-based lookup first (direct SNOMED match)
- Returns FALLBACK source if no match found, letting caller handle fallback
- Extracts PatientPseudonym from UPID (last part after provider code)
- Updated __all__ exports with new dataclass and function
- Tested: function handles no-match cases correctly
2026-02-05 14:19:18 +00:00
Andrew Charlwood b44d22de2c feat: add direct SNOMED lookup functions (Task 1.3)
Add two new functions to diagnosis_lookup.py for direct SNOMED code matching:

- get_drug_snomed_codes(drug_name): Query ref_drug_snomed_mapping for all
  SNOMED codes mapped to a drug. Returns list of DrugSnomedMapping with
  snomed_code, snomed_description, search_term, primary_directorate.
  Tested: ADALIMUMAB returns 1320 mappings across 10 Search_Terms.

- patient_has_indication_direct(patient_pseudonym, mappings, connector):
  Query PrimaryCareClinicalCoding for exact SNOMED code matches.
  Returns most recent match by EventDateTime with DirectSnomedMatchResult.

Both functions follow existing patterns in the module and are exported
in __all__. The lookup is case-insensitive for drug names.
2026-02-05 14:14:55 +00:00
Andrew Charlwood 6d68b5eaa5 feat: add SNOMED mapping loader script (Task 1.2)
- Create data_processing/load_snomed_mapping.py with:
  - migrate_drug_snomed_mapping() for CSV to SQLite migration
  - get_drug_snomed_mapping_counts() for statistics
  - verify_drug_snomed_mapping_migration() for validation
  - clean_snomed_code() to remove trailing .0 from SNOMED codes
  - CLI interface: python -m data_processing.load_snomed_mapping

- Loaded 144,056 mappings from enriched CSV:
  - 707 unique drugs
  - 187 unique search terms
  - 21,265 unique SNOMED codes
2026-02-05 14:10:36 +00:00
Andrew Charlwood 9943e85761 feat: add ref_drug_snomed_mapping schema (Task 1.1)
- Add REF_DRUG_SNOMED_MAPPING_SCHEMA with 11 columns for direct SNOMED mapping
- Add 5 indexes for lookup performance (drug, cleaned_drug, snomed, search_term, composite)
- Add create_drug_snomed_mapping_table() helper function
- Update helper functions (drop, get_counts, verify_exists) to include new table
- Table is included in REFERENCE_TABLES_SCHEMA and created by migration
2026-02-05 14:06:31 +00:00
Andrew Charlwood fa72fb3098 docs: mark all tasks complete in IMPLEMENTATION_PLAN.md 2026-02-05 02:17:17 +00:00
Andrew Charlwood 139a71b752 docs: update progress.txt with iteration 17 completion (Task 5.6) 2026-02-05 02:16:28 +00:00
Andrew Charlwood 9b466b4e6c feat: add hover/focus states and clean up unused styles (Task 5.6)
- Add subtle hover states to KPI badges, dropdown triggers, tabs
- Add consistent focus rings for accessibility (2px Pale Blue)
- Update button styles with focus/active states
- Clean up unused styles: compact_kpi_* (Option B), unused imports
- All interactive elements now have appropriate hover/focus feedback
2026-02-05 02:16:01 +00:00