diff --git a/.claude/settings.json b/.claude/settings.json deleted file mode 100644 index 4d2dd1b..0000000 --- a/.claude/settings.json +++ /dev/null @@ -1,26 +0,0 @@ -{ - "permissions": { - "allow": [ - "Bash(python*)", - "Bash(git*)", - "Bash(cd*)", - "Bash(ls*)", - "Bash(cat*)", - "Bash(head*)", - "Bash(tail*)", - "Bash(mkdir*)", - "Bash(touch*)", - "Bash(rm*)", - "Bash(mv*)", - "Bash(cp*)", - "Bash(timeout*)", - "Bash(reflex*)", - "Read", - "Write", - "Edit", - "Glob", - "Grep" - ], - "deny": [] - } -} diff --git a/.claude/settings.local.json b/.claude/settings.local.json deleted file mode 100644 index efc7afc..0000000 --- a/.claude/settings.local.json +++ /dev/null @@ -1,22 +0,0 @@ -{ - "permissions": { - "allow": [ - "WebSearch", - "Bash(wc:*)", - "WebFetch(domain:flet.dev)", - "WebFetch(domain:github.com)", - "WebFetch(domain:docs.flet.dev)", - "Bash(python -c:*)", - "Bash(Remove-Item -Path \"C:\\\\Users\\\\charlwoodand\\\\Ralph local\\\\Tasks\\\\UI Redesign\\\\logs\\\\*\" -Force)", - "Bash(git commit -m \"$\\(cat <<''EOF''\ndocs: update CLAUDE.md to reflect slimmed database architecture\n\nRemove references to deleted tables \\(fact_interventions,\nmv_patient_treatment_summary, ref_drug_snomed_mapping, processed_files\\),\ndeleted files \\(patient_data.py, load_snomed_mapping.py\\), and removed\nclasses \\(SQLiteDataLoader\\). 
Update package structure, data loaders,\ndatabase schema, fallback chain, and AppState descriptions.\nEOF\n\\)\")", - "Bash(Get-ChildItem -Recurse -Force)", - "Bash(Select-Object FullName)", - "Bash(uv run python:*)", - "Bash(uv run:*)", - "Bash(PYTHONPATH=src uv run python:*)", - "Bash(uv pip install:*)", - "Bash(uv lock:*)", - "Bash(git commit:*)" - ] - } -} diff --git a/.gitignore b/.gitignore index a38a9db..116689d 100644 --- a/.gitignore +++ b/.gitignore @@ -9,6 +9,7 @@ build/ dist/ wheels/ *.egg-info +*claude* # Virtual environments .venv @@ -32,10 +33,27 @@ hcd_20250411.parquet logs/*.log logs/*.jsonl -# Reflex build artifacts (future) +# Reflex build artifacts (legacy) .web/ .states/ +# Legacy files (old Reflex/CustomTkinter app) +archive/ +images/ +run.bat + +# Ralph loop files +ralph.ps1 +RALPH_PROMPT.md +progress.txt + +# Planning/process docs (not used by application) +guardrails.md +IMPLEMENTATION_PLAN.md + +# Windows artifacts +NUL + # SQLite databases (except pathways.db which contains pre-computed data) *.db !data/pathways.db diff --git a/CLAUDE.md b/CLAUDE.md deleted file mode 100644 index f56c33b..0000000 --- a/CLAUDE.md +++ /dev/null @@ -1,655 +0,0 @@ -# CLAUDE.md - -This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. - -## Project Overview - -NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that analyzes secondary care patient treatment pathways. It processes clinical activity data to visualize hierarchical treatment patterns as interactive Plotly icicle charts. 
- -**Key Features:** -- **Dual chart types**: Directory-based (Trust → Directory → Drug → Pathway) and Indication-based (Trust → GP Diagnosis → Drug → Pathway) views with toggle -- **Pre-computed pathway architecture**: Treatment pathways pre-processed and stored in SQLite for instant filtering -- **GP diagnosis matching**: Patient indications matched from GP records using SNOMED cluster codes queried directly from Snowflake (~93% match rate) -- Data pipeline: Snowflake → pre-computed SQLite pathway nodes (CSV/Parquet file loading retained for legacy compatibility) -- Interactive browser-based UI using Dash (Plotly) + Dash Mantine Components -- 6 pre-defined date filter combinations × 2 chart types = 12 pre-computed datasets with sub-50ms response times - -## Running the Application - -```bash -# Install dependencies -uv sync - -# One-time dev setup: adds src/ to Python path via .pth file -uv run python setup_dev.py - -# Initialize/migrate the database (creates pathway tables) -python -m data_processing.migrate - -# Refresh pathway data from Snowflake (requires SSO auth) -python -m cli.refresh_pathways - -# Run the Dash web application -python run_dash.py -``` - -The application requires Python 3.10+ and runs on http://localhost:8050 by default. 
- -### CLI Commands - -**Refresh Pathway Data:** -```bash -# Full refresh — both chart types (directory + indication), all date filters -python -m cli.refresh_pathways --chart-type all - -# Directory charts only (faster, skips GP diagnosis lookup) -python -m cli.refresh_pathways --chart-type directory - -# Indication charts only -python -m cli.refresh_pathways --chart-type indication - -# Dry run (test without database changes) -python -m cli.refresh_pathways --chart-type all --dry-run -v - -# Custom minimum patient threshold -python -m cli.refresh_pathways --minimum-patients 10 - -# Help -python -m cli.refresh_pathways --help -``` - -The `--chart-type` argument controls which pathway types are processed: -- `all` (default) — generates both directory and indication charts (~15 minutes) -- `directory` — directory-based charts only (~5 minutes) -- `indication` — indication-based charts only (~12 minutes, includes GP lookup) - -The refresh command: -1. Fetches activity data from Snowflake (656K+ records, ~7 seconds) -2. Applies UPID, drug name, and directory transformations (~6 minutes) -3. For indication charts: queries GP records via SNOMED clusters (~9 minutes for 37K patients) -4. Processes 6 date filter combinations × selected chart types -5. Inserts pathway nodes to SQLite for fast Dash filtering - -## Architecture - -### Package Structure - -``` -. 
-├── src/ # All application library code -│ ├── core/ # Foundation: paths, models, logging -│ │ ├── config.py # PathConfig dataclass for file paths -│ │ ├── models.py # AnalysisFilters dataclass -│ │ └── logging_config.py # Structured logging setup -│ │ -│ ├── config/ # Service configuration -│ │ ├── __init__.py # SnowflakeConfig + loader -│ │ └── snowflake.toml # Connection settings (co-located with loader) -│ │ -│ ├── data_processing/ # Data layer -│ │ ├── database.py # SQLite connection management -│ │ ├── schema.py # Database schema (reference + pathway tables) -│ │ ├── pathway_pipeline.py # Pipeline: Snowflake → SQLite -│ │ ├── transforms.py # Data transformations (UPID, drug names, directory) -│ │ ├── loader.py # FileDataLoader for CSV/Parquet files -│ │ ├── reference_data.py # Reference data migration -│ │ ├── snowflake_connector.py # Snowflake integration -│ │ ├── cache.py # Query result caching -│ │ ├── data_source.py # Data source fallback chain -│ │ ├── diagnosis_lookup.py # GP diagnosis lookup (SNOMED clusters) -│ │ └── parsing.py # Parse average_spacing HTML, pathway drugs, retention rates -│ │ -│ ├── analysis/ # Analysis pipeline -│ │ ├── pathway_analyzer.py # prepare_data, calculate_statistics, build_hierarchy -│ │ └── statistics.py # Statistical calculation functions -│ │ -│ ├── visualization/ # Chart generation -│ │ └── plotly_generator.py # Icicle, market share, cost effectiveness, waterfall, Sankey, dosing, heatmap, duration figures -│ │ -│ └── cli/ # CLI tools -│ └── refresh_pathways.py # Data refresh command -│ -├── dash_app/ # Dash web application -│ ├── app.py # Dash app, layout root, dcc.Store, register_callbacks -│ ├── assets/ -│ │ └── nhs.css # NHS design system CSS -│ ├── data/ -│ │ ├── queries.py # Thin wrapper calling src/data_processing/pathway_queries.py -│ │ └── card_browser.py # DimSearchTerm.csv → directorate tree for modals -│ ├── components/ -│ │ ├── header.py # Top header bar with fraction KPIs + data freshness -│ │ ├── 
sidebar.py # Left nav: Patient Pathways + Trust Comparison -│ │ ├── sub_header.py # Global filter bar (date dropdowns + chart type toggle) -│ │ ├── filter_bar.py # Patient Pathways filter buttons (drugs, trusts, directorates) -│ │ ├── chart_card.py # Chart area with Icicle/Sankey tabs + dcc.Graph -│ │ ├── modals.py # dmc.Modal dialogs for drug/trust/directorate selection -│ │ ├── trust_comparison.py # Trust Comparison landing page + 6-chart dashboard -│ │ └── footer.py # Page footer -│ ├── callbacks/ -│ │ ├── __init__.py # register_callbacks(app) -│ │ ├── filters.py # Reference data loading + filter state management -│ │ ├── chart.py # Tab switching, pathway data loading, chart dispatch -│ │ ├── modals.py # Modal open/close + drug/trust/directorate selection -│ │ ├── navigation.py # Sidebar view switching + Trust Comparison navigation -│ │ ├── trust_comparison.py # 6 Trust Comparison chart callbacks -│ │ └── kpi.py # Header fraction KPI updates -│ └── utils/ -│ └── __init__.py -│ -├── run_dash.py # Entry point: python run_dash.py -├── tests/ # Test suite (113 tests) -├── data/ # Reference data + SQLite DB -├── docs/ # Documentation -├── assets/ # Static assets (logo, favicon) -├── archive/ # Historical/deprecated (includes old Reflex app) -└── logs/ # Runtime logs -``` - -**Path resolution**: `src/` is added to `sys.path` via a `.pth` file (created by `setup_dev.py`). -All imports use package names directly: `from core import ...`, `from data_processing import ...`, etc. - -### Pathway Data Architecture - -The application uses a pre-computed pathway architecture for performance: - -**Architecture:** `Snowflake → Pathway Processing → SQLite (pre-computed) → Dash (filter & view)` - -**Key Benefits:** -- **Performance**: Pathway calculation done once during data refresh, not on every filter change -- **Simplicity**: Dash callbacks filter pre-computed data with simple SQL WHERE clauses -- **Full Pathways**: Sequential treatment pathways (drug_0 → drug_1 → drug_2...) 
with statistics - -**Chart Types:** - -| Type | Hierarchy | Level 2 Source | -|------|-----------|----------------| -| `directory` | Trust → Directory → Drug → Pathway | Assigned directorate (5-level fallback) | -| `indication` | Trust → GP Diagnosis → Drug → Pathway | SNOMED cluster Search_Term from GP records | - -For indication charts, ~93% of patients are matched to a GP diagnosis (Search_Term). Unmatched patients use their directorate as a fallback label (e.g., "RHEUMATOLOGY (no GP dx)"). - -**Date Filter Combinations:** -| ID | Initiated | Last Seen | Default | -|----|-----------|-----------|---------| -| `all_6mo` | All years | Last 6 months | Yes | -| `all_12mo` | All years | Last 12 months | No | -| `1yr_6mo` | Last 1 year | Last 6 months | No | -| `1yr_12mo` | Last 1 year | Last 12 months | No | -| `2yr_6mo` | Last 2 years | Last 6 months | No | -| `2yr_12mo` | Last 2 years | Last 12 months | No | - -Total pre-computed datasets: 6 date filters × 2 chart types = 12 datasets (~3,600 pathway nodes). 
- -**Pathway Node Structure:** -Each node in `pathway_nodes` contains: -- Routing: `chart_type` ("directory" or "indication"), `date_filter_id` -- Hierarchy: `parents`, `ids`, `labels`, `level` (0=Root, 1=Trust, 2=Directory/Indication, 3=Drug, 4+=Pathway) -- Counts: `value` (patient count) -- Costs: `cost`, `costpp`, `cost_pp_pa` (per patient per annum) -- Dates: `first_seen`, `last_seen`, `first_seen_parent`, `last_seen_parent` -- Statistics: `average_spacing`, `average_administered`, `avg_days` -- Denormalized: `trust_name`, `directory`, `drug_sequence` (for efficient filtering) -- Unique constraint: `UNIQUE(date_filter_id, chart_type, ids)` - -### Core Module (`core/`) - -- **PathConfig** - Dataclass encapsulating all file paths, with `validate()` method -- **AnalysisFilters** - Dataclass for filter state (dates, drugs, trusts, directories) -- **logging_config** - Structured logging with file and console output - -### CLI Module (`cli/`) - -- **refresh_pathways.py** - Command-line tool to refresh pre-computed pathway data: - - `refresh_pathways()` - Main function orchestrating the full pipeline - - `insert_pathway_records()` - SQLite insertion with parameterized queries - - `log_refresh_start/complete/failed()` - Refresh tracking in `pathway_refresh_log` - - `get_default_filters()` - Load trusts/drugs/directories from CSV files - -### Data Processing Module (`data_processing/`) - -**Database Management:** -- `DatabaseManager` - SQLite connection pooling and transaction management -- **Reference Tables**: `ref_drug_names`, `ref_organizations`, `ref_directories`, `ref_drug_directory_map`, `ref_drug_indication_clusters` -- **Pathway Tables**: `pathway_date_filters`, `pathway_nodes`, `pathway_refresh_log` - -**Pathway Pipeline (`pathway_pipeline.py`):** -- `DateFilterConfig` - Dataclass for date filter configuration -- `DATE_FILTER_CONFIGS` - All 6 pre-defined date combinations -- `compute_date_ranges(config, max_date)` - Computes actual ISO dates from config -- 
`fetch_and_transform_data()` - Snowflake fetch + UPID/drug/directory transformations -- Directory chart functions: - - `process_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart()` - - `extract_denormalized_fields()` - Parses `ids` column to extract trust, directory, drug_sequence -- Indication chart functions: - - `process_indication_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart_indication()` - - `extract_indication_fields()` - Parses `ids` for indication charts (trust, search_term, drug_sequence) -- Shared functions: - - `convert_to_records(ice_df, chart_type)` - Converts ice_df to list of dicts with `chart_type` column - - `process_all_date_filters()` - Convenience function to process all 6 filters - -**Data Loaders:** -- `FileDataLoader` - Loads from CSV/Parquet files (used by legacy pipeline, not by Dash app) -- Factory function `get_loader()` creates a `FileDataLoader` - -**Snowflake Integration:** -- SSO authentication via `externalbrowser` authenticator -- `fetch_activity_data(start_date, end_date, provider_codes)` method -- Query caching with TTL-based invalidation - -**GP Diagnosis Lookup (`diagnosis_lookup.py`):** -- `CLUSTER_MAPPING_SQL` - Embedded SQL constant with ~148 Search_Term → Cluster_ID mappings plus explicit SNOMED codes -- `get_patient_indication_groups(patient_pseudonyms)` - Batch queries Snowflake to match patients to GP diagnoses: - - Embeds cluster mapping as CTE, joins with `PrimaryCareClinicalCoding` - - Uses `PseudoNHSNoLinked` (not PersonKey) to match `PatientPseudonym` in GP records - - Returns most recent match per patient via `QUALIFY ROW_NUMBER()` - - Batches 500 patients per query, returns DataFrame with PatientPseudonym, Search_Term, EventDateTime -- `patient_has_indication(patient_pseudonym, cluster_ids)` - Single-patient GP record check (legacy) -- `validate_indication(patient_pseudonym, drug_name)` - Full validation result with source tracking 
(legacy) - -### Analysis Module (`analysis/`) - -Refactored from the original 267-line `generate_graph()` function: - -- **prepare_data()** - Filter DataFrame by date range, trusts, drugs, directories (copies df to prevent mutation) -- **calculate_statistics()** - Compute frequency, cost, duration statistics -- **build_hierarchy()** - Create Trust → Directory → Drug → Pathway structure -- **prepare_chart_data()** - Format data for Plotly icicle chart -- **generate_icicle_chart_indication(df, indication_df, ...)** - Build indication-based hierarchy using Search_Term instead of Directory. Takes an `indication_df` (UPID → Search_Term mapping) alongside the main activity DataFrame. - -### Visualization Module (`visualization/`) - -- **create_icicle_figure(ice_df)** - Generate Plotly icicle chart from DataFrame (legacy/pipeline use) -- **create_icicle_from_nodes(nodes, title)** - Generate icicle chart from list-of-dicts (Dash use). Accepts JSON-serializable node dicts from `dcc.Store`. Uses NHS blue gradient colorscale, 10-field customdata, Source Sans 3 font. 
-- **create_market_share_figure(data, title)** - Horizontal stacked bar chart: drugs grouped by directorate/indication, bar length = % patients -- **create_cost_effectiveness_figure(data, retention, title)** - Lollipop chart: pathway cost_pp_pa with dot size = patient count, retention annotations -- **create_cost_waterfall_figure(data, title)** - Waterfall chart: directorate-level cost_pp_pa sorted highest to lowest -- **create_sankey_figure(data, title)** - Sankey diagram: drug switching flows across treatment lines (1st → 2nd → 3rd) -- **create_dosing_figure(data, title, group_by)** - Grouped horizontal bar chart: dosing intervals by drug or trust -- **create_heatmap_figure(data, title, metric)** - Matrix heatmap: directorate × drug with patient/cost/cost_pp_pa colouring -- **create_duration_figure(data, title, show_directory)** - Horizontal bar chart: average treatment duration in days per drug -- **create_trust_market_share_figure(data, title)** - Trust Comparison: horizontal stacked bars grouped by trust, drugs as segments -- **create_trust_heatmap_figure(data, title, metric)** - Trust Comparison: trust × drug matrix with NHS blue colorscale -- **create_trust_duration_figure(data, title)** - Trust Comparison: grouped horizontal bars with one trace per trust -- **save_figure_html()** - Save interactive HTML file -- **open_figure_in_browser()** - Open chart in default browser - -### Parsing Utilities (`data_processing/parsing.py`) - -- **parse_average_spacing(spacing_html)** - Extract drug_name, dose_count, weekly_interval, total_weeks from HTML string -- **parse_pathway_drugs(ids, level)** - Extract ordered drug list from ids column at level 4+ -- **calculate_retention_rate(nodes)** - For each N-drug pathway, calculate % not escalating to N+1 drugs - -### Shared Data Queries (`data_processing/pathway_queries.py`) - -Shared query functions used by the Dash app (via thin wrappers in `dash_app/data/queries.py`): -- **load_initial_data(db_path)** - Returns 
available drugs (42), directorates (14), indications (32), trusts (7), total_patients, last_updated -- **load_pathway_nodes(db_path, filter_id, chart_type, selected_drugs, selected_directorates, selected_trusts)** - Returns pathway nodes, unique_patients, total_drugs, total_cost, last_updated. Parameterized SQL with optional drug/directorate/trust filters. -- **get_drug_market_share(db_path, filter_id, chart_type, directory, trust)** - Level 3 nodes grouped by directory, returns drug, value, colour -- **get_pathway_costs(db_path, filter_id, chart_type, directory, trust)** - Level 4+ nodes with cost_pp_pa, pathway labels, patient counts -- **get_cost_waterfall(db_path, filter_id, chart_type, trust)** - Level 2 nodes with cost_pp_pa per directorate/indication -- **get_drug_transitions(db_path, filter_id, chart_type, directory, trust)** - Level 3+ nodes parsed into source→target drug transitions -- **get_dosing_intervals(db_path, filter_id, chart_type, drug, trust)** - Level 3 nodes with parsed average_spacing intervals -- **get_drug_directory_matrix(db_path, filter_id, chart_type, trust)** - Level 3 nodes pivoted as directory × drug matrix -- **get_treatment_durations(db_path, filter_id, chart_type, directory, trust)** - Level 3 nodes with avg_days by drug -- **get_trust_market_share(db_path, filter_id, chart_type, directory)** - Trust Comparison: drugs by trust within a single directorate -- **get_trust_cost_waterfall(db_path, filter_id, chart_type, directory)** - Trust Comparison: one bar per trust showing cost_pp within directorate -- **get_trust_dosing(db_path, filter_id, chart_type, directory)** - Trust Comparison: drug dosing intervals broken down by trust -- **get_trust_heatmap(db_path, filter_id, chart_type, directory)** - Trust Comparison: trust × drug matrix for one directorate -- **get_trust_durations(db_path, filter_id, chart_type, directory)** - Trust Comparison: drug durations by trust within directorate -- **get_directorate_summary(db_path, filter_id, 
chart_type, directory)** - Summary stats for a directorate (total patients, drugs, cost) - -### Dash Application (`dash_app/`) - -**Two-View Architecture:** -The application is split into two analytical perspectives, selectable via the sidebar: -- **Patient Pathways**: Pathway-focused analysis (Icicle + Sankey charts) with drug/trust/directorate filters -- **Trust Comparison**: Per-directorate analysis comparing drugs across trusts (6 charts for a selected directorate) - -**State Management** via 4 `dcc.Store` components: -- **app-state** (session): `chart_type`, `initiated`, `last_seen`, `date_filter_id`, `selected_drugs`, `selected_directorates`, `selected_trusts`, `active_view` ("patient-pathways" | "trust-comparison"), `selected_comparison_directorate` (null | directorate name) -- **chart-data** (memory): `nodes[]`, `unique_patients`, `total_drugs`, `total_cost`, `last_updated` -- **reference-data** (session): `available_drugs`, `available_directorates`, `available_indications`, `available_trusts`, `total_patients`, `last_updated` -- **active-tab** (memory): Currently selected chart tab within Patient Pathways ("icicle" | "sankey") - -**Callback Chain** (unidirectional): -``` -Page Load → load_reference_data → reference-data store + header indicators - → update_app_state → app-state store (default filters) - → load_pathway_data → chart-data store - ├→ update_kpis → header fraction KPIs - └→ update_chart → dcc.Graph (Icicle or Sankey) - -Filter change → update_app_state → app-state → load_pathway_data → (chain above) -Modal selection → drug/trust chips → update_app_state → (chain above) -Tab click → switch_tab → active-tab store → update_chart → dcc.Graph (lazy rendering) - -Sidebar click → switch_view → active_view in app-state → show/hide views -Trust Comparison: - Landing page → directorate button click → selected_comparison_directorate → 6 chart callbacks - Back button → clear selected_comparison_directorate → return to landing -``` - -**Key Components:** -- 
**Header** (`header.py`): NHS branding, fraction KPIs (X/X patients, X/X drugs, £X/£X cost), data freshness indicator -- **Sidebar** (`sidebar.py`): 2 navigation items — "Patient Pathways" (default), "Trust Comparison" -- **Sub-Header** (`sub_header.py`): Global filter bar — date dropdowns (Initiated, Last Seen) + chart type toggle pills (By Directory / By Indication). Constant across both views. -- **Filter Bar** (`filter_bar.py`): Patient Pathways-only filter buttons — Drugs (with count badge), Trusts (with count badge), Directorates (with count badge), Clear All. Only visible on Patient Pathways view. -- **Chart Card** (`chart_card.py`): 2-tab chart area (Icicle, Sankey) with `dcc.Loading` spinner, dynamic subtitle, and `dcc.Store(id="active-tab")` -- **Modals** (`modals.py`): 3 `dmc.Modal` dialogs for drug selection (ChipGroup), trust selection (ChipGroup), directorate browser (Accordion with indication sub-items and drug fragment badges) -- **Trust Comparison** (`trust_comparison.py`): Landing page (directorate/indication button grid) + 6-chart dashboard (Market Share, Cost Waterfall, Dosing, Heatmap, Duration, Cost Effectiveness) -- **Footer** (`footer.py`): NHS Norfolk and Waveney ICB branding - -**Filter Modals:** -- Drug Modal: flat `dmc.ChipGroup` with 42 drugs from pathway_nodes level 3 -- Trust Modal: `dmc.ChipGroup` with 7 trusts -- Directorate Modal: nested `dmc.Accordion` — 19 directorates → indications → drug fragment `dmc.Badge` items -- Clicking a drug fragment badge selects all full drug names containing that fragment (substring match) -- "Clear All Filters" button resets drug and trust selections - -**Trust Comparison Dashboard (6 Charts):** -All scoped to a single selected directorate, comparing drugs across trusts: -1. **Market Share**: Drug breakdown per trust (stacked bars per trust) -2. **Cost Waterfall**: Per-trust cost within directorate -3. **Dosing**: Drug dosing intervals by trust -4. **Heatmap**: Trust × drug matrix -5. 
**Duration**: Drug durations by trust -6. **Cost Effectiveness**: Pathway costs within directorate (NOT split by trust) - -### Data Transformations (`data_processing/transforms.py`) - -Core data transformation functions used by the pipeline: -- `patient_id()` - Creates UPID = Provider Code (first 3 chars) + PersonKey -- `drug_names()` - Standardizes via drugnames.csv lookup -- `department_identification()` - 5-level fallback chain for directory assignment - -### Data Flow - -**Pre-Computed Pathway Architecture (Current):** - -``` -[CLI: python -m cli.refresh_pathways --chart-type all] - - Snowflake Data Warehouse - │ - ▼ (fetch_and_transform_data) - ┌──────────────────────────────────────────┐ - │ Data Transformations (data_processing/transforms.py) │ - │ → patient_id() creates UPID │ - │ → drug_names() standardizes names │ - │ → department_identification() → Dir │ - └──────────────────────────────────────────┘ - │ - ├─── Directory Charts ──────────────────────────────────────┐ - │ │ - │ ┌──────────────────────────────────────────┐ │ - │ │ For each of 6 date filter combos: │ │ - │ │ → generate_icicle_chart() │ │ - │ │ → extract_denormalized_fields() │ │ - │ │ → convert_to_records("directory") │ │ - │ └──────────────────────────────────────────┘ │ - │ │ - ├─── Indication Charts ─────────────────────────────────────┤ - │ │ - │ ┌──────────────────────────────────────────┐ │ - │ │ GP Diagnosis Lookup (diagnosis_lookup.py)│ │ - │ │ → Extract PseudoNHSNoLinked from HCD │ │ - │ │ → get_patient_indication_groups() │ │ - │ │ (SNOMED cluster CTE + GP records) │ │ - │ │ → Build indication_df: UPID → Search │ │ - │ │ Term (matched) or Directorate (no GP)│ │ - │ └──────────────────────────────────────────┘ │ - │ │ │ - │ ▼ │ - │ ┌──────────────────────────────────────────┐ │ - │ │ For each of 6 date filter combos: │ │ - │ │ → generate_icicle_chart_indication() │ │ - │ │ → extract_indication_fields() │ │ - │ │ → convert_to_records("indication") │ │ - │ 
└──────────────────────────────────────────┘ │ - │ │ - └───────────────────────┬───────────────────────────────────┘ - │ - ▼ (insert_pathway_records) - ┌──────────────────────────────────────────┐ - │ SQLite: pathway_nodes table │ - │ → ~3,600 nodes across 12 datasets │ - │ → UNIQUE(date_filter_id, chart_type, │ - │ ids) prevents cross-type overwrites │ - │ → Indexed for fast filtering │ - └──────────────────────────────────────────┘ - - -[Dash App: python run_dash.py] - - ┌──────────────────────────────────────────┐ - │ Global Sub-Header (date dropdowns, │ - │ chart type toggle pills) │ - │ → Triggers update_app_state callback │ - └──────────────────────────────────────────┘ - │ - ├─── Patient Pathways View ─────────────────────────────┐ - │ │ - │ ┌──────────────────────────────────────────┐ │ - │ │ Filter Bar (Drugs/Trusts/Directorates) │ │ - │ │ → Modal selections → app-state │ │ - │ └──────────────────────────────────────────┘ │ - │ │ │ - │ ▼ │ - │ ┌──────────────────────────────────────────┐ │ - │ │ load_pathway_data callback │ │ - │ │ → chart-data store │ │ - │ └──────────────────────────────────────────┘ │ - │ │ │ - │ ├──────────────────────────────┐ │ - │ ▼ ▼ │ - │ ┌────────────────────┐ ┌──────────────────────┐ │ - │ │ update_kpis │ │ update_chart │ │ - │ │ → header KPIs │ │ → Icicle or Sankey │ │ - │ └────────────────────┘ └──────────────────────┘ │ - │ │ - ├─── Trust Comparison View ─────────────────────────────┤ - │ │ - │ ┌──────────────────────────────────────────┐ │ - │ │ Landing Page │ │ - │ │ → Directorate/Indication buttons │ │ - │ │ → Click → selected_comparison_dir │ │ - │ └──────────────────────────────────────────┘ │ - │ │ │ - │ ▼ │ - │ ┌──────────────────────────────────────────┐ │ - │ │ 6-Chart Dashboard │ │ - │ │ → Market Share, Cost Waterfall, Dosing │ │ - │ │ → Heatmap, Duration, Cost Effectiveness│ │ - │ │ → All per-trust within one directorate │ │ - │ └──────────────────────────────────────────┘ │ - │ │ - 
└────────────────────────────────────────────────────────┘ -``` - -### Reference Data Files (`data/`) - -| File | Purpose | -|------|---------| -| `include.csv` | Drug filter list with default selections (Include=1) | -| `defaultTrusts.csv` | NHS Trust list for filter | -| `directory_list.csv` | Medical specialties/directories | -| `drugnames.csv` | Drug name standardization mapping | -| `org_codes.csv` | Provider code to organization name mapping | -| `drug_directory_list.csv` | Valid drug-to-directory mappings (pipe-separated) | -| `treatment_function_codes.csv` | NHS treatment function code mappings | -| `drug_indication_clusters.csv` | Drug to SNOMED cluster mappings | -| `ta-recommendations.xlsx` | NICE TA recommendations | -| `pathways.db` | SQLite database (~3.5 MB: reference tables + pathway nodes) | - -### Key Patterns - -**Department Identification Fallback Chain:** -The `department_identification()` function has 5 levels of fallback: -1. **SINGLE_VALID_DIR** - Drug has only one valid directory -2. **EXTRACTED** - Extracted from Additional Detail/Description fields -3. **CALCULATED_MOST_FREQ** - Most frequent valid directory for UPID/Drug -4. **UPID_INFERENCE** - Inferred from other records with same UPID -5. **UNDEFINED** - No directory could be determined - -**Indication Lookup Workflow (for indication charts):** -1. Extract unique `PseudoNHSNoLinked` values from HCD activity data -2. Query Snowflake in batches of 500 patients: - - Embed `CLUSTER_MAPPING_SQL` (~148 Search_Term → Cluster_ID mappings) as CTE - - Join `ClinicalCodingClusterSnomedCodes` to get SNOMED codes per cluster - - Join `PrimaryCareClinicalCoding` on `PatientPseudonym` = `PseudoNHSNoLinked` - - Use `QUALIFY ROW_NUMBER() OVER (PARTITION BY PatientPseudonym ORDER BY EventDateTime DESC) = 1` for most recent match -3. Build `indication_df` mapping UPID → Search_Term (matched) or Directorate + " (no GP dx)" (unmatched) -4. 
Pass to `generate_icicle_chart_indication()` for pathway hierarchy building - -**Data Source Fallback Chain** (for raw data loading, not used by Dash app): -1. Query cache for recent results -2. Attempt Snowflake connection -3. Fall back to CSV/Parquet files - -## Database Schema (~3.5 MB) - -### Reference Tables -- `ref_drug_names` - Drug name standardization -- `ref_organizations` - Provider code to name mapping -- `ref_directories` - Valid directory names -- `ref_drug_directory_map` - Valid drug-directory pairs -- `ref_drug_indication_clusters` - Drug to SNOMED cluster mapping - -### Pathway Tables -- `pathway_date_filters` - 6 pre-defined date filter combinations - - Columns: `id`, `initiated`, `last_seen`, `is_default`, `description` - - Auto-populated via migration -- `pathway_nodes` - Pre-computed pathway hierarchy nodes (~3,600 rows for 12 datasets) - - Routing: `chart_type` ("directory" or "indication"), `date_filter_id` - - Hierarchy: `parents`, `ids`, `labels`, `level` - - Metrics: `value`, `cost`, `costpp`, `cost_pp_pa`, `colour` - - Dates: `first_seen`, `last_seen`, `first_seen_parent`, `last_seen_parent` - - Statistics: `average_spacing`, `average_administered`, `avg_days` - - Denormalized: `trust_name`, `directory`, `drug_sequence` - - Foreign key: `date_filter_id` → `pathway_date_filters.id` - - Unique constraint: `UNIQUE(date_filter_id, chart_type, ids)` — critical for INSERT OR REPLACE correctness - - Indexed for: date_filter_id, chart_type, trust_name, directory, level -- `pathway_refresh_log` - Tracks data refresh status - - Columns: `refresh_id`, `started_at`, `completed_at`, `status`, `records_processed`, `error_message`, `source_row_count` - -## Input Data Requirements - -The input data (CSV/Parquet) must contain columns including: -- `Provider Code`, `PersonKey` - Used to create UPID -- `PseudoNHSNoLinked` - NHS pseudonym for GP record matching (indication charts) -- `Drug Name`, `Intervention Date`, `Price Actual` -- `OrganisationName` -- 
Various `Additional Detail/Description` columns for directory extraction -- `Treatment Function Code` - -## Output - -Two-view Dash application with distinct analytical perspectives: - -**Patient Pathways View** (2 tabs): -1. **Icicle** — Hierarchical pathway view (Directory: Trust → Directorate → Drug → Pathway; Indication: Trust → GP Diagnosis → Drug → Pathway) -2. **Sankey** — Drug switching flows across 1st → 2nd → 3rd treatment lines - -Patient Pathways supports: -- Directory / Indication toggle -- Date filter combinations (6 options) -- Trust, drug, and directorate filters via modals -- Lazy rendering (only active tab computed) - -**Trust Comparison View** (6 charts in dashboard): -Landing page with directorate/indication buttons → 6-chart dashboard for selected directorate: -1. **Market Share** — Drug breakdown per trust (stacked bars) -2. **Cost Waterfall** — Per-trust cost within directorate -3. **Dosing** — Drug dosing intervals by trust -4. **Heatmap** — Trust × drug matrix -5. **Duration** — Drug durations by trust -6. 
**Cost Effectiveness** — Pathway costs within directorate (not split by trust) - -Trust Comparison supports: -- Directory / Indication toggle (changes landing page buttons) -- Date filter combinations (6 options) -- All 6 charts scoped to selected directorate - -## Testing - -```bash -# Run all tests with coverage -python -m pytest tests/ -v --cov=core --cov=analysis - -# Run specific test file -python -m pytest tests/test_config.py -v - -# Run specific test class -python -m pytest tests/test_data_transformations.py::TestPatientId -v -``` - -Test coverage includes: -- PathConfig validation (23 tests) -- AnalysisFilters validation (26 tests) -- Data transformation functions (23 tests) -- Directory assignment logic (19 tests) - -## Configuration - -### Snowflake Connection (`src/config/snowflake.toml`) - -```toml -[snowflake] -account = "your-account" -database = "DATA_HUB" -schema = "CDM" -warehouse = "your-warehouse" -authenticator = "externalbrowser" # Required for NHS SSO -``` - -### Logging - -Logs are written to `logs/` directory with structured format. -Configure via `src/core/logging_config.py`. 
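The contents of `src/core/logging_config.py` are not shown in this diff, so the following is only a rough sketch of what "structured" file logging into a `logs/` directory could look like. The names `JsonLineFormatter` and `build_logger`, the JSON field names, and the `.jsonl` file name are all assumptions made for illustration, not the project's actual configuration.

```python
import json
import logging
import tempfile
from pathlib import Path


class JsonLineFormatter(logging.Formatter):
    """Hypothetical formatter: renders each log record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(log_dir: Path, name: str = "pathways") -> logging.Logger:
    """Return a logger that appends JSON lines to <log_dir>/<name>.jsonl."""
    log_dir.mkdir(parents=True, exist_ok=True)
    handler = logging.FileHandler(log_dir / f"{name}.jsonl", encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.propagate = False  # keep records out of the root logger
    return logger


# Demo against a temporary directory rather than the real logs/ folder.
demo_dir = Path(tempfile.mkdtemp()) / "logs"
demo_logger = build_logger(demo_dir)
demo_logger.info("refresh started")

log_lines = (demo_dir / "pathways.jsonl").read_text(encoding="utf-8").splitlines()
first_record = json.loads(log_lines[0])
```

One JSON object per line keeps the log greppable while remaining machine-parseable, which is one common reading of "structured format"; the real module may use a different layout.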
- -## Breaking Changes from Original App - -The pre-computed pathway architecture introduces these changes: - -### Date Filters -- **Old**: Date pickers for arbitrary `start_date` and `end_date` -- **New**: Two dropdowns: - - "Treatment Initiated": All years, Last 2 years, Last 1 year - - "Last Seen": Last 6 months, Last 12 months -- **Reason**: Pre-computed pathways require fixed date combinations for performance - -### Data Refresh -- **Old**: Real-time pathway calculation on each filter change -- **New**: Pre-computed pathways stored in SQLite, refreshed via CLI command -- **Impact**: Data is as fresh as the last `python -m cli.refresh_pathways` run -- **Benefit**: Sub-50ms filter response time vs multi-minute calculations - -### State Management (Dash) -- State lives in 4 `dcc.Store` components: `app-state`, `chart-data`, `reference-data`, `active-tab` -- Filter state: `chart_type`, `initiated`, `last_seen`, `date_filter_id`, `selected_drugs`, `selected_directorates`, `selected_trusts` -- View state: `active_view` ("patient-pathways" | "trust-comparison"), `selected_comparison_directorate` (null | directorate name) -- Chart type toggle: "By Directory" / "By Indication" pills in global sub-header -- Drug/trust/directorate selection via `dmc.Modal` dialogs (Patient Pathways only) -- Fraction KPIs in header (X/X patients, X/X drugs, £X/£X cost) - -### Icicle Chart (Patient Pathways) -- Full 10-field customdata structure (value, colour, cost, costpp, first_seen, last_seen, first_seen_parent, last_seen_parent, average_spacing, cost_pp_pa) -- NHS blue gradient colorscale: Heritage Blue #003087 → Pale Blue #E3F2FD -- Treatment statistics (average_spacing, cost_pp_pa) in hover tooltips -- First/last seen dates for drug nodes -- `create_icicle_from_nodes()` in `src/visualization/plotly_generator.py` — shared function accepting list-of-dicts - -## Development - -### Adding New Analysis Features - -1. Add statistical functions to `src/analysis/statistics.py` -2. 
Integrate into pipeline in `src/analysis/pathway_analyzer.py` -3. Update visualization in `src/visualization/plotly_generator.py` - -### Adding New Reference Data - -1. Add CSV file to `data/` directory -2. Define schema in `src/data_processing/schema.py` -3. Create migration function in `src/data_processing/reference_data.py` -4. Add path to `PathConfig` in `src/core/config.py`
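
As an illustration of why the Database Schema section calls the `UNIQUE(date_filter_id, chart_type, ids)` constraint critical for `INSERT OR REPLACE`, here is a minimal sketch using an in-memory SQLite database. The table is a cut-down stand-in for `pathway_nodes` with only four of the documented columns; the column types, the helper names, and the sample values (such as `"all_6mo"`) are assumptions invented for this example.

```python
import sqlite3


def make_table(conn: sqlite3.Connection, with_unique: bool) -> None:
    """Create a reduced pathway_nodes table, optionally with the unique key."""
    constraint = ", UNIQUE (date_filter_id, chart_type, ids)" if with_unique else ""
    conn.execute(
        "CREATE TABLE pathway_nodes ("
        "date_filter_id TEXT NOT NULL, "  # assumed type
        "chart_type TEXT NOT NULL, "
        "ids TEXT NOT NULL, "
        "value INTEGER"
        f"{constraint})"
    )


def refresh_twice(with_unique: bool) -> int:
    """Write the same node twice and return the resulting row count."""
    conn = sqlite3.connect(":memory:")
    make_table(conn, with_unique)
    # Same logical node, with an updated patient count on the second refresh.
    first_run = ("all_6mo", "directory", "Trust A - Drug X", 10)
    second_run = ("all_6mo", "directory", "Trust A - Drug X", 12)
    for row in (first_run, second_run):
        conn.execute(
            "INSERT OR REPLACE INTO pathway_nodes VALUES (?, ?, ?, ?)", row
        )
    count = conn.execute("SELECT COUNT(*) FROM pathway_nodes").fetchone()[0]
    conn.close()
    return count


rows_with_constraint = refresh_twice(with_unique=True)      # node is replaced
rows_without_constraint = refresh_twice(with_unique=False)  # node is duplicated
```

`INSERT OR REPLACE` only replaces a row when the insert would violate a uniqueness constraint. Without the composite key, every refresh would silently append duplicate nodes instead of overwriting them.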