CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that analyzes secondary care patient treatment pathways. It processes clinical activity data to visualize hierarchical treatment patterns (Trust → Directory/Specialty → Drug → Patient pathway) as interactive Plotly icicle charts.
Key Features:
- Multi-source data loading: CSV/Parquet files, SQLite database, Snowflake data warehouse
- Pre-computed pathway architecture: Treatment pathways pre-processed and stored in SQLite for instant filtering
- GP diagnosis integration for indication validation via SNOMED clusters
- Interactive browser-based UI using Reflex framework
- 6 pre-defined date filter combinations with sub-50ms response times
Running the Application
# Install dependencies
pip install -r requirements.txt
# OR with uv
uv sync
# Initialize/migrate the database (creates pathway tables)
python -m data_processing.migrate
# Refresh pathway data from Snowflake (requires SSO auth)
python -m cli.refresh_pathways
# Run the Reflex web application
reflex run
The application requires Python 3.10+ and runs on http://localhost:3000 by default.
CLI Commands
Refresh Pathway Data:
# Full refresh with default filters (all trusts, default drugs)
python -m cli.refresh_pathways
# Dry run (test without database changes)
python -m cli.refresh_pathways --dry-run -v
# Custom minimum patient threshold
python -m cli.refresh_pathways --minimum-patients 10
# Help
python -m cli.refresh_pathways --help
The refresh command:
- Fetches activity data from Snowflake (656K+ records, ~7 seconds)
- Applies UPID, drug name, and directory transformations (~6 minutes)
- Processes 6 date filter combinations (all_6mo, all_12mo, 1yr_6mo, etc.)
- Inserts pathway nodes to SQLite for fast Reflex filtering
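As a rough illustration of how the flags in the usage examples could be declared, here is a minimal `argparse` sketch. It is not the actual contents of `cli/refresh_pathways.py`: the `--verbose` long form, the help strings, and the `None` default for `--minimum-patients` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser covering only the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="python -m cli.refresh_pathways",
        description="Refresh pre-computed pathway data (illustrative sketch).",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Run the pipeline without writing to the database.",
    )
    parser.add_argument(
        "-v",
        "--verbose",  # assumed long form of -v
        action="store_true",
        help="Enable verbose logging.",
    )
    parser.add_argument(
        "--minimum-patients",
        type=int,
        default=None,  # None stands for "use the pipeline's own default"
        help="Minimum patient count for a pathway to be kept.",
    )
    return parser


args = build_parser().parse_args(["--dry-run", "-v", "--minimum-patients", "10"])
print(args.dry_run, args.verbose, args.minimum_patients)  # True True 10
```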
Architecture
Package Structure
.
├── core/ # Core configuration and models
│ ├── config.py # PathConfig dataclass for file paths
│ ├── models.py # AnalysisFilters dataclass
│ └── logging_config.py # Structured logging setup
│
├── cli/ # Command-line interface tools
│ ├── __init__.py
│ └── refresh_pathways.py # CLI to refresh pre-computed pathway data
│
├── data_processing/ # Data layer
│ ├── database.py # SQLite connection management
│ ├── schema.py # Database schema (including pathway tables)
│ ├── pathway_pipeline.py # Pathway processing pipeline (Snowflake → SQLite)
│ ├── loader.py # DataLoader abstraction (CSV/SQLite)
│ ├── patient_data.py # Patient data migration and loading
│ ├── reference_data.py # Reference data migration
│ ├── snowflake_connector.py # Snowflake integration
│ ├── cache.py # Query result caching
│ ├── data_source.py # Data source fallback chain
│ └── diagnosis_lookup.py # GP diagnosis validation
│
├── analysis/ # Analysis pipeline
│ ├── pathway_analyzer.py # prepare_data, calculate_statistics, build_hierarchy
│ └── statistics.py # Statistical calculation functions
│
├── visualization/ # Chart generation
│ └── plotly_generator.py # create_icicle_figure, save_figure_html
│
├── pathways_app/ # Reflex web application
│ ├── pathways_app.py # State class and page components
│ └── components/ # Layout and navigation components
│
├── tools/ # Legacy modules
│ ├── dashboard_gui.py # Original analysis engine (being refactored)
│ └── data.py # Data transformations (UPID, drug names, directory)
│
├── config/ # Configuration files
│ └── snowflake.toml # Snowflake connection settings
│
├── data/ # Reference data and database
│ ├── pathways.db # SQLite database (includes pathway_nodes)
│ └── *.csv # Reference data files
│
└── tests/ # Test suite
    ├── conftest.py # Pytest fixtures
    └── test_*.py # Test modules
Pathway Data Architecture
The application uses a pre-computed pathway architecture for performance:
Architecture: Snowflake → Pathway Processing → SQLite (pre-computed) → Reflex (filter & view)
Key Benefits:
- Performance: Pathway calculation done once during data refresh, not on every filter change
- Simplicity: Reflex filters pre-computed data with simple SQL WHERE clauses
- Full Pathways: Sequential treatment pathways (drug_0 → drug_1 → drug_2...) with statistics
Date Filter Combinations:
| ID | Initiated | Last Seen | Default |
|---|---|---|---|
| `all_6mo` | All years | Last 6 months | Yes |
| `all_12mo` | All years | Last 12 months | No |
| `1yr_6mo` | Last 1 year | Last 6 months | No |
| `1yr_12mo` | Last 1 year | Last 12 months | No |
| `2yr_6mo` | Last 2 years | Last 6 months | No |
| `2yr_12mo` | Last 2 years | Last 12 months | No |
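To make the six combinations concrete, the sketch below derives them from the two dimensions and turns one into cutoff dates. `DateFilterSketch` and `cutoff_dates` are hypothetical stand-ins for `DateFilterConfig` and `compute_date_ranges()`; the field names and the 365-day/30-day approximations are assumptions, and the real pipeline may use calendar-aware arithmetic.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class DateFilterSketch:
    """Hypothetical stand-in for DateFilterConfig; field names are assumed."""

    id: str
    initiated_years: Optional[int]  # None means "All years"
    last_seen_months: int


# Build the six IDs from the two dimensions of the date filter table.
FILTERS = [
    DateFilterSketch(f"{label}_{months}mo", years, months)
    for label, years in (("all", None), ("1yr", 1), ("2yr", 2))
    for months in (6, 12)
]


def cutoff_dates(cfg: DateFilterSketch, max_date: date):
    """Return (initiated_from, last_seen_from) as ISO date strings.

    Assumes 365-day years and 30-day months for simplicity.
    """
    initiated_from = (
        None
        if cfg.initiated_years is None
        else (max_date - timedelta(days=365 * cfg.initiated_years)).isoformat()
    )
    last_seen_from = (max_date - timedelta(days=30 * cfg.last_seen_months)).isoformat()
    return initiated_from, last_seen_from


print([f.id for f in FILTERS])
print(cutoff_dates(FILTERS[2], date(2024, 12, 31)))  # the 1yr_6mo filter
```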
Pathway Node Structure:
Each node in pathway_nodes contains:
- Hierarchy: `parents`, `ids`, `labels`, `level` (0=Root, 1=Trust, 2=Directory, 3=Drug, 4+=Pathway)
- Counts: `value` (patient count)
- Costs: `cost`, `costpp`, `cost_pp_pa` (per patient per annum)
- Dates: `first_seen`, `last_seen`, `first_seen_parent`, `last_seen_parent`
- Statistics: `average_spacing`, `average_administered`, `avg_days`
- Denormalized: `trust_name`, `directory`, `drug_sequence` (for efficient filtering)
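The denormalized columns are what allow filtering with plain `WHERE` clauses. The following sketch shows the idea against an in-memory SQLite table holding a subset of the columns listed above. The rows, the `ids` string format, and the `load_nodes` helper are invented for illustration and are not taken from the application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pathway_nodes (
        date_filter_id TEXT, ids TEXT, parents TEXT, level INTEGER,
        value INTEGER, trust_name TEXT, directory TEXT, drug_sequence TEXT
    )
    """
)
# Invented rows for illustration only.
conn.executemany(
    "INSERT INTO pathway_nodes VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("all_6mo", "Root", "", 0, 30, None, None, None),
        ("all_6mo", "Root - Trust A", "Root", 1, 30, "Trust A", None, None),
        ("all_6mo", "Root - Trust A - DIR_X", "Root - Trust A", 2, 20, "Trust A", "DIR_X", None),
        ("all_6mo", "Root - Trust A - DIR_Y", "Root - Trust A", 2, 10, "Trust A", "DIR_Y", None),
        ("all_12mo", "Root", "", 0, 45, None, None, None),
    ],
)


def load_nodes(conn, date_filter_id, directories=None):
    """Return (ids, level, value) rows for one date filter.

    Nodes above directory level have a NULL directory and are always kept.
    """
    sql = "SELECT ids, level, value FROM pathway_nodes WHERE date_filter_id = ?"
    params = [date_filter_id]
    if directories:
        placeholders = ", ".join("?" for _ in directories)
        sql += f" AND (directory IS NULL OR directory IN ({placeholders}))"
        params.extend(directories)
    sql += " ORDER BY level, ids"
    return conn.execute(sql, params).fetchall()


rows = load_nodes(conn, "all_6mo", directories=["DIR_X"])
for row in rows:
    print(row)
```

Only the placeholders are interpolated into the SQL string; the filter values themselves are always bound as parameters.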
Core Module (core/)
- PathConfig - Dataclass encapsulating all file paths, with `validate()` method
- AnalysisFilters - Dataclass for filter state (dates, drugs, trusts, directories)
- logging_config - Structured logging with file and console output
CLI Module (cli/)
- refresh_pathways.py - Command-line tool to refresh pre-computed pathway data:
  - `refresh_pathways()` - Main function orchestrating the full pipeline
  - `insert_pathway_records()` - SQLite insertion with parameterized queries
  - `log_refresh_start/complete/failed()` - Refresh tracking in `pathway_refresh_log`
  - `get_default_filters()` - Load trusts/drugs/directories from CSV files
Data Processing Module (data_processing/)
Database Management:
- `DatabaseManager` - SQLite connection pooling and transaction management
- Tables: `ref_drug_names`, `ref_organizations`, `ref_directories`, `ref_drug_directory_map`, `ref_drug_indication_clusters`, `fact_interventions`, `mv_patient_treatment_summary`, `processed_files`
- Pathway Tables: `pathway_date_filters`, `pathway_nodes`, `pathway_refresh_log`
Pathway Pipeline (pathway_pipeline.py):
- `DateFilterConfig` - Dataclass for date filter configuration
- `DATE_FILTER_CONFIGS` - All 6 pre-defined date combinations
- `compute_date_ranges(config, max_date)` - Computes actual ISO dates from config
- `fetch_and_transform_data()` - Snowflake fetch + UPID/drug/directory transformations
- `process_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart()`
- `extract_denormalized_fields()` - Parses `ids` column to extract trust, directory, drug_sequence
- `convert_to_records()` - Converts ice_df to list of dicts for SQLite insertion
- `process_all_date_filters()` - Convenience function to process all 6 filters
Data Loaders:
- `FileDataLoader` - Loads from CSV/Parquet files
- `SQLiteDataLoader` - Queries `fact_interventions` table
- Factory function `get_loader()` selects appropriate loader
Snowflake Integration:
- SSO authentication via `externalbrowser` authenticator
- `fetch_activity_data(start_date, end_date, provider_codes)` method
- Query caching with TTL-based invalidation
- Fallback chain: cache → Snowflake → local files
GP Diagnosis Validation:
- Uses pre-built SNOMED clusters from `ClinicalCodingClusterSnomedCodes`
- `patient_has_indication(patient_pseudonym, cluster_ids)` checks GP records
- `validate_indication(patient_pseudonym, drug_name)` returns full validation result
- Adds `Indication_Source` column: "GP_SNOMED" | "HCD_SNOMED" | "NONE"
Analysis Module (analysis/)
Refactored from the original 267-line generate_graph() function:
- prepare_data() - Filter DataFrame by date range, trusts, drugs, directories
- calculate_statistics() - Compute frequency, cost, duration statistics
- build_hierarchy() - Create Trust → Directory → Drug → Pathway structure
- prepare_chart_data() - Format data for Plotly icicle chart
Visualization Module (visualization/)
- create_icicle_figure() - Generate Plotly icicle chart figure
- save_figure_html() - Save interactive HTML file
- open_figure_in_browser() - Open chart in default browser
Reflex Application (pathways_app/)
The State class manages all application state:
- Filter variables: dates, drugs, trusts, directories
- Reference data: available options loaded from CSV/SQLite
- Analysis state: running flag, status messages, chart data
- Data source state: file path, source type, row counts
Legacy Modules (tools/)
Still used during transition:
- tools/data.py - Data transformation functions:
  - `patient_id()` - Creates UPID = Provider Code (first 3 chars) + PersonKey
  - `drug_names()` - Standardizes via drugnames.csv lookup
  - `department_identification()` - 5-level fallback chain for directory assignment
- tools/dashboard_gui.py - Original analysis engine (being replaced by `analysis/` module)
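The first two transformations are simple enough to sketch in a few lines. The helpers below are hypothetical, not the real `patient_id()` and `drug_names()`: they assume the UPID is a plain concatenation with no separator, and that the drug lookup is a dict keyed by upper-cased raw name (the actual layout of drugnames.csv is not described here).

```python
def make_upid(provider_code: str, person_key) -> str:
    """Concatenate the first three characters of the provider code with PersonKey."""
    return f"{str(provider_code)[:3]}{person_key}"


def standardize_drug_name(raw_name: str, lookup: dict) -> str:
    """Map a raw drug name to its standard form, falling back to the cleaned input."""
    key = raw_name.strip().upper()
    return lookup.get(key, key)


# Invented inputs for illustration.
lookup = {"HUMIRA": "ADALIMUMAB"}
print(make_upid("ABC01", 12345))
print(standardize_drug_name(" humira ", lookup))
```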
Data Flow
Pre-Computed Pathway Architecture (Current):
[CLI: python -m cli.refresh_pathways]
Snowflake Data Warehouse
│
▼ (fetch_and_transform_data)
┌──────────────────────────────────────────┐
│ Data Transformations (tools/data.py) │
│ → patient_id() creates UPID │
│ → drug_names() standardizes names │
│ → department_identification() → Dir │
└──────────────────────────────────────────┘
│
▼ (process_all_date_filters)
┌──────────────────────────────────────────┐
│ Pathway Pipeline (pathway_pipeline.py) │
│ → For each of 6 date filter combos: │
│ → generate_icicle_chart() │
│ → extract_denormalized_fields() │
│ → convert_to_records() │
└──────────────────────────────────────────┘
│
▼ (insert_pathway_records)
┌──────────────────────────────────────────┐
│ SQLite: pathway_nodes table │
│ → 293 nodes for all_6mo filter │
│ → Indexed for fast filtering │
└──────────────────────────────────────────┘
[Reflex App: reflex run]
┌──────────────────────────────────────────┐
│ AppState.load_pathway_data() │
│ → Query pathway_nodes WHERE date_filter│
│ → Apply drug/directory filters │
│ → recalculate_parent_totals() │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ AppState.icicle_figure │
│ → Plotly icicle chart │
│ → 10-field customdata structure │
│ → Full hover/text templates │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Reflex UI (rx.plotly component) │
│ → <50ms filter response time │
│ → Treatment statistics in tooltips │
└──────────────────────────────────────────┘
Legacy Data Flow (Original):
Data Sources:
CSV/Parquet file upload
OR SQLite database query
OR Snowflake fetch (with caching)
│
▼
┌──────────────────────────────────────────┐
│ Data Transformations (tools/data.py) │
│ → patient_id() creates UPID │
│ → drug_names() standardizes names │
│ → department_identification() → Dir │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Analysis Pipeline (analysis/) │
│ → prepare_data() - filter by criteria │
│ → calculate_statistics() │
│ → build_hierarchy() │
│ → prepare_chart_data() │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Visualization (visualization/) │
│ → create_icicle_figure() │
│ → Display in rx.plotly() component │
└──────────────────────────────────────────┘
Reference Data Files (data/)
| File | Purpose |
|---|---|
| `include.csv` | Drug filter list with default selections (Include=1) |
| `defaultTrusts.csv` | NHS Trust list for filter |
| `directory_list.csv` | Medical specialties/directories |
| `drugnames.csv` | Drug name standardization mapping |
| `org_codes.csv` | Provider code to organization name mapping |
| `drug_directory_list.csv` | Valid drug-to-directory mappings (pipe-separated) |
| `treatment_function_codes.csv` | NHS treatment function code mappings |
| `drug_indication_clusters.csv` | Drug to SNOMED cluster mappings |
| `ta-recommendations.xlsx` | NICE TA recommendations |
| `pathways.db` | SQLite database with all tables |
Key Patterns
Department Identification Fallback Chain:
The department_identification() function has 5 levels of fallback:
- SINGLE_VALID_DIR - Drug has only one valid directory
- EXTRACTED - Extracted from Additional Detail/Description fields
- CALCULATED_MOST_FREQ - Most frequent valid directory for UPID/Drug
- UPID_INFERENCE - Inferred from other records with same UPID
- UNDEFINED - No directory could be determined
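The fallback chain can be pictured as an ordered list of resolvers where the first one to return a directory wins. The sketch below implements only the first two levels with invented reference data; the record keys (`drug`, `detail`), the `VALID_DIRS` mapping, and the returned `"Undefined"` label are assumptions, not the behaviour of the real `department_identification()`.

```python
from typing import Callable, Optional

Resolver = Callable[[dict], Optional[str]]

# Invented reference data: which directories are valid for each drug.
VALID_DIRS = {"DRUG_A": ["DIR_X"], "DRUG_B": ["DIR_X", "DIR_Y"]}


def single_valid_dir(record: dict) -> Optional[str]:
    """Level 1: the drug has exactly one valid directory."""
    dirs = VALID_DIRS.get(record["drug"], [])
    return dirs[0] if len(dirs) == 1 else None


def extracted(record: dict) -> Optional[str]:
    """Level 2: a valid directory name appears in the free-text detail."""
    detail = record.get("detail", "").upper()
    for directory in VALID_DIRS.get(record["drug"], []):
        if directory in detail:
            return directory
    return None


# The frequency-based and UPID-inference levels would slot in here.
RESOLVERS: list[tuple[str, Resolver]] = [
    ("SINGLE_VALID_DIR", single_valid_dir),
    ("EXTRACTED", extracted),
]


def assign_directory(record: dict) -> tuple[str, str]:
    """Return (directory, source), trying each resolver in order."""
    for source, resolve in RESOLVERS:
        directory = resolve(record)
        if directory is not None:
            return directory, source
    return "Undefined", "UNDEFINED"


print(assign_directory({"drug": "DRUG_A"}))
print(assign_directory({"drug": "DRUG_B", "detail": "seen in dir_y clinic"}))
print(assign_directory({"drug": "DRUG_B"}))
```

Returning the source label alongside the directory keeps it visible which level of the chain produced each assignment.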
Indication Validation Workflow:
- Map drug → SNOMED cluster IDs (e.g., ADALIMUMAB → RARTH_COD, PSORIASIS_COD)
- Get all SNOMED codes for those clusters
- Check GP records (PrimaryCareClinicalCoding) for matching codes
- Report match/no-match with source tracking
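The four steps above amount to a set intersection. The sketch below uses in-memory dictionaries in place of the Snowflake tables; the codes are placeholders rather than real SNOMED codes, `validate_indication_sketch` is a hypothetical name, and the `"HCD_SNOMED"` source is left out because its logic is not described here.

```python
# Invented lookup tables; the codes are placeholders, not real SNOMED codes.
DRUG_CLUSTERS = {"ADALIMUMAB": ["RARTH_COD", "PSORIASIS_COD"]}
CLUSTER_CODES = {
    "RARTH_COD": {"code-ra-1", "code-ra-2"},
    "PSORIASIS_COD": {"code-ps-1"},
}
GP_RECORDS = {
    "patient-1": {"code-ra-2", "code-other"},
    "patient-2": {"code-other"},
}


def validate_indication_sketch(patient_pseudonym: str, drug_name: str) -> dict:
    """Check whether any GP code for the patient falls in the drug's clusters."""
    # Step 1: drug -> cluster IDs
    cluster_ids = DRUG_CLUSTERS.get(drug_name.upper(), [])
    # Step 2: cluster IDs -> all codes in those clusters
    wanted = set().union(*(CLUSTER_CODES.get(c, set()) for c in cluster_ids))
    # Step 3: intersect with the patient's GP record codes
    matched = wanted & GP_RECORDS.get(patient_pseudonym, set())
    # Step 4: report the result with its source
    return {
        "drug": drug_name,
        "clusters": cluster_ids,
        "matched_codes": sorted(matched),
        "Indication_Source": "GP_SNOMED" if matched else "NONE",
    }


print(validate_indication_sketch("patient-1", "ADALIMUMAB"))
print(validate_indication_sketch("patient-2", "ADALIMUMAB"))
```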
Data Source Fallback Chain:
- Query cache for recent results
- Attempt Snowflake connection
- Fall back to SQLite database
- Fall back to CSV/Parquet files
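A minimal sketch of this chain, combined with a TTL cache, is shown below. `TTLCache` and `fetch_with_fallback` are hypothetical names, and the two sources are stubs standing in for the real Snowflake and SQLite loaders; the clock is injectable only to make expiry easy to demonstrate.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (self._clock(), value)


def fetch_with_fallback(key, cache, sources):
    """Return (data, source_name), trying the cache first and then each source."""
    cached = cache.get(key)
    if cached is not None:
        return cached, "cache"
    for name, fetch in sources:
        try:
            data = fetch(key)
        except Exception:
            continue  # this source is unavailable; try the next one
        cache.put(key, data)
        return data, name
    raise RuntimeError(f"No data source could serve {key!r}")


def snowflake_stub(key):
    raise ConnectionError("SSO login not available")


def sqlite_stub(key):
    return [{"query": key, "rows": 3}]


fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
sources = [("snowflake", snowflake_stub), ("sqlite", sqlite_stub)]

first = fetch_with_fallback("activity", cache, sources)
second = fetch_with_fallback("activity", cache, sources)
fake_now[0] = 61.0  # advance past the TTL
third = fetch_with_fallback("activity", cache, sources)
print(first[1], second[1], third[1])  # sqlite cache sqlite
```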
Database Schema
Reference Tables
- `ref_drug_names` - Drug name standardization
- `ref_organizations` - Provider code to name mapping
- `ref_directories` - Valid directory names
- `ref_drug_directory_map` - Valid drug-directory pairs
- `ref_drug_indication_clusters` - Drug to SNOMED cluster mapping
Fact Tables
- `fact_interventions` - Patient intervention records (UPID, drug, date, cost, directory)
Materialized Views
- `mv_patient_treatment_summary` - Pre-aggregated patient statistics
File Tracking
- `processed_files` - Hash-based tracking for incremental loading
Pathway Tables (New)
- `pathway_date_filters` - 6 pre-defined date filter combinations
  - Columns: `id`, `initiated`, `last_seen`, `is_default`, `description`
  - Auto-populated via migration
- `pathway_nodes` - Pre-computed pathway hierarchy nodes
  - Hierarchy: `parents`, `ids`, `labels`, `level`
  - Metrics: `value`, `cost`, `costpp`, `cost_pp_pa`, `colour`
  - Dates: `first_seen`, `last_seen`, `first_seen_parent`, `last_seen_parent`
  - Statistics: `average_spacing`, `average_administered`, `avg_days`
  - Denormalized: `trust_name`, `directory`, `drug_sequence`
  - Foreign key: `date_filter_id` → `pathway_date_filters.id`
  - Indexed for: date_filter_id, trust_name, directory, level
- `pathway_refresh_log` - Tracks data refresh status
  - Columns: `refresh_id`, `started_at`, `completed_at`, `status`, `records_processed`, `error_message`
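One way to drive the `pathway_refresh_log` columns is to wrap a refresh in a context manager that records the start, then marks the row completed or failed. This is a sketch only: the column types, the status strings (`running`, `completed`, `failed`), and the `tracked_refresh` helper are assumptions rather than the project's actual schema or logging functions.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timezone


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def create_log_table(conn):
    # Column names follow the list above; the types are assumed.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pathway_refresh_log (
            refresh_id INTEGER PRIMARY KEY,
            started_at TEXT NOT NULL,
            completed_at TEXT,
            status TEXT NOT NULL,
            records_processed INTEGER,
            error_message TEXT
        )
        """
    )


@contextmanager
def tracked_refresh(conn):
    """Log a refresh as running, then as completed or failed."""
    cur = conn.execute(
        "INSERT INTO pathway_refresh_log (started_at, status) VALUES (?, 'running')",
        (_now(),),
    )
    conn.commit()
    refresh_id = cur.lastrowid
    progress = {"records_processed": 0}
    try:
        yield progress
    except Exception as exc:
        conn.execute(
            "UPDATE pathway_refresh_log SET completed_at = ?, status = 'failed', "
            "error_message = ? WHERE refresh_id = ?",
            (_now(), str(exc), refresh_id),
        )
        conn.commit()
        raise
    else:
        conn.execute(
            "UPDATE pathway_refresh_log SET completed_at = ?, status = 'completed', "
            "records_processed = ? WHERE refresh_id = ?",
            (_now(), progress["records_processed"], refresh_id),
        )
        conn.commit()


conn = sqlite3.connect(":memory:")
create_log_table(conn)
with tracked_refresh(conn) as progress:
    progress["records_processed"] = 293  # e.g. nodes inserted for one filter
```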
Input Data Requirements
The input data (CSV/Parquet) must contain columns including:
- `Provider Code`, `PersonKey` - Used to create UPID
- `Drug Name`, `Intervention Date`, `Price Actual`
- `OrganisationName`
- Various `Additional Detail/Description` columns for directory extraction
- `Treatment Function Code`
Output
Interactive Plotly icicle chart showing:
- Patient counts and percentages at each hierarchy level
- Total and average costs
- Treatment duration and dosing frequency information
- Color gradient based on patient volume
Testing
# Run all tests with coverage
python -m pytest tests/ -v --cov=core --cov=analysis
# Run specific test file
python -m pytest tests/test_config.py -v
# Run specific test class
python -m pytest tests/test_data_transformations.py::TestPatientId -v
Test coverage includes:
- PathConfig validation (23 tests)
- AnalysisFilters validation (26 tests)
- Data transformation functions (23 tests)
- Directory assignment logic (19 tests)
Configuration
Snowflake Connection (config/snowflake.toml)
[snowflake]
account = "your-account"
database = "DATA_HUB"
schema = "CDM"
warehouse = "your-warehouse"
authenticator = "externalbrowser" # Required for NHS SSO
Logging
Logs are written to the logs/ directory in a structured format.
Configure logging via core/logging_config.py.
Breaking Changes from Original App
The pre-computed pathway architecture introduces these changes:
Date Filters
- Old: Date pickers for arbitrary `start_date` and `end_date`
- New: Two dropdowns:
  - "Treatment Initiated": All years, Last 2 years, Last 1 year
  - "Last Seen": Last 6 months, Last 12 months
- Reason: Pre-computed pathways require fixed date combinations for performance
Data Refresh
- Old: Real-time pathway calculation on each filter change
- New: Pre-computed pathways stored in SQLite, refreshed via CLI command
- Impact: Data is as fresh as the last `python -m cli.refresh_pathways` run
- Benefit: Sub-50ms filter response time vs multi-minute calculations
State Variables
- Removed: `start_date`, `end_date`, `set_start_date()`, `set_end_date()`
- Added: `selected_initiated`, `selected_last_seen`, `date_filter_id`
- Added: `load_pathway_data()` - queries pre-computed `pathway_nodes`
- Added: `recalculate_parent_totals()` - adjusts hierarchy after filtering
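Filtering out child nodes leaves their ancestors with stale counts, which is the problem `recalculate_parent_totals()` addresses. The sketch below shows one possible approach and is not the application's implementation: it assumes every parent's `value` should equal the sum of its remaining children, which may not hold for pathway-level nodes where some patients stop at the parent step. The node dictionaries and the `ids` format are invented.

```python
from collections import defaultdict


def recalc_parent_totals(nodes):
    """Recompute each parent's value as the sum of its remaining children.

    Nodes are dicts with 'ids', 'parents', 'level' and 'value' keys.
    Leaf nodes keep their stored value.
    """
    children = defaultdict(list)
    for node in nodes:
        children[node["parents"]].append(node)
    # Visit the deepest levels first so parents sum already-updated children.
    for node in sorted(nodes, key=lambda n: n["level"], reverse=True):
        kids = children.get(node["ids"])
        if kids:
            node["value"] = sum(kid["value"] for kid in kids)
    return nodes


# Invented example: a second directory (10 patients) was filtered out,
# so the trust and root still carry the old total of 30.
nodes = [
    {"ids": "Root", "parents": "", "level": 0, "value": 30},
    {"ids": "Root - Trust A", "parents": "Root", "level": 1, "value": 30},
    {"ids": "Root - Trust A - DIR_X", "parents": "Root - Trust A", "level": 2, "value": 20},
]
recalc_parent_totals(nodes)
print([(n["ids"], n["value"]) for n in nodes])
```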
Icicle Chart
- Enhanced: Now includes full 10-field customdata structure
- Added: Treatment statistics (average_spacing, cost_pp_pa) in hover tooltips
- Added: First/last seen dates for drug nodes
Development
Adding New Data Sources
- Create loader class implementing `DataLoader` protocol in `data_processing/loader.py`
- Add to factory function `get_loader()`
- Update `DataSourceManager` fallback chain if needed
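The first two steps might look like the sketch below. The shape of the `DataLoader` protocol is assumed here (a single `load()` method returning a list of dicts); the real protocol in `data_processing/loader.py` may differ. `CsvTextLoader` and `get_loader_sketch` are hypothetical names.

```python
import csv
import io
from typing import Protocol, runtime_checkable


@runtime_checkable
class DataLoader(Protocol):
    """Assumed shape of the protocol: a single load() method."""

    def load(self) -> list[dict]: ...


class CsvTextLoader:
    """Hypothetical new source: CSV content already held in memory."""

    def __init__(self, text: str) -> None:
        self._text = text

    def load(self) -> list[dict]:
        return list(csv.DictReader(io.StringIO(self._text)))


# Registry consulted by the factory; new loaders are added here.
_LOADERS = {"csv_text": CsvTextLoader}


def get_loader_sketch(source_type: str, **kwargs) -> DataLoader:
    """Pick a loader class by source type and construct it."""
    try:
        loader_cls = _LOADERS[source_type]
    except KeyError:
        raise ValueError(f"Unknown source type: {source_type!r}") from None
    return loader_cls(**kwargs)


loader = get_loader_sketch("csv_text", text="Provider Code,PersonKey\nABC01,42\n")
rows = loader.load()
print(rows)
```

Because the protocol is structural, the new class does not need to inherit from anything; it only has to provide the expected method.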
Adding New Analysis Features
- Add statistical functions to `analysis/statistics.py`
- Integrate into pipeline in `analysis/pathway_analyzer.py`
- Update visualization in `visualization/plotly_generator.py`
Adding New Reference Data
- Add CSV file to `data/` directory
- Define schema in `data_processing/schema.py`
- Create migration function in `data_processing/reference_data.py`
- Add path to `PathConfig` in `core/config.py`