Move 6 packages (core, config, data_processing, analysis, visualization, cli) into src/ to reduce root clutter. Merge tools/data.py into data_processing/transforms.py. Move docs to docs/. Path resolution via .pth file (setup_dev.py), pytest pythonpath config, and sys.path bootstrap in rxconfig.py and CLI entry points. Clean up pyproject.toml deps (remove stale pins, add snowflake-connector-python). Fix tomllib import for Python 3.10 compatibility. All 113 tests pass.
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that analyzes secondary care patient treatment pathways. It processes clinical activity data to visualize hierarchical treatment patterns as interactive Plotly icicle charts.
Key Features:
- Dual chart types: Directory-based (Trust → Directory → Drug → Pathway) and Indication-based (Trust → GP Diagnosis → Drug → Pathway) views with toggle
- Pre-computed pathway architecture: Treatment pathways pre-processed and stored in SQLite for instant filtering
- GP diagnosis matching: Patient indications matched from GP records using SNOMED cluster codes queried directly from Snowflake (~93% match rate)
- Data pipeline: Snowflake → pre-computed SQLite pathway nodes (CSV/Parquet file loading retained for legacy compatibility)
- Interactive browser-based UI using Reflex framework
- 6 pre-defined date filter combinations × 2 chart types = 12 pre-computed datasets with sub-50ms response times
Running the Application
# Install dependencies
uv sync
# One-time dev setup: adds src/ to Python path via .pth file
uv run python setup_dev.py
# Initialize/migrate the database (creates pathway tables)
python -m data_processing.migrate
# Refresh pathway data from Snowflake (requires SSO auth)
python -m cli.refresh_pathways
# Run the Reflex web application
reflex run
The application requires Python 3.10+ and runs on http://localhost:3000 by default.
CLI Commands
Refresh Pathway Data:
# Full refresh — both chart types (directory + indication), all date filters
python -m cli.refresh_pathways --chart-type all
# Directory charts only (faster, skips GP diagnosis lookup)
python -m cli.refresh_pathways --chart-type directory
# Indication charts only
python -m cli.refresh_pathways --chart-type indication
# Dry run (test without database changes)
python -m cli.refresh_pathways --chart-type all --dry-run -v
# Custom minimum patient threshold
python -m cli.refresh_pathways --minimum-patients 10
# Help
python -m cli.refresh_pathways --help
The --chart-type argument controls which pathway types are processed:
- all (default): generates both directory and indication charts (~15 minutes)
- directory: directory-based charts only (~5 minutes)
- indication: indication-based charts only (~12 minutes, includes GP lookup)
The refresh command:
- Fetches activity data from Snowflake (656K+ records, ~7 seconds)
- Applies UPID, drug name, and directory transformations (~6 minutes)
- For indication charts: queries GP records via SNOMED clusters (~9 minutes for 37K patients)
- Processes 6 date filter combinations × selected chart types
- Inserts pathway nodes to SQLite for fast Reflex filtering
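The flags shown in the commands above map naturally onto argparse. The following is a minimal sketch, not the contents of cli/refresh_pathways.py: the function name build_parser, the --verbose long form, the help strings, and the --minimum-patients default of 5 are assumptions made for illustration. Only the flag names and the three --chart-type choices come from the commands listed above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented refresh flags."""
    parser = argparse.ArgumentParser(
        prog="refresh_pathways",
        description="Refresh pre-computed pathway data (illustrative sketch).",
    )
    parser.add_argument(
        "--chart-type",
        choices=["all", "directory", "indication"],
        default="all",  # documented default
        help="Which pathway chart types to process.",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Run the pipeline without writing to the database.",
    )
    parser.add_argument(
        "--minimum-patients",
        type=int,
        default=5,  # assumption: the real default is not stated in this file
        help="Minimum patient threshold.",
    )
    parser.add_argument(
        "-v", "--verbose",  # assumption: only the short form -v is documented
        action="store_true",
        help="Verbose logging.",
    )
    return parser


# Parse the documented dry-run invocation, restricted to directory charts.
args = build_parser().parse_args(["--chart-type", "directory", "--dry-run", "-v"])
defaults = build_parser().parse_args([])
```

Restricting --chart-type with choices means a typo such as "indications" fails at parse time rather than after the Snowflake fetch has already run.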
Architecture
Package Structure
.
├── src/ # All application library code
│ ├── core/ # Foundation: paths, models, logging
│ │ ├── config.py # PathConfig dataclass for file paths
│ │ ├── models.py # AnalysisFilters dataclass
│ │ └── logging_config.py # Structured logging setup
│ │
│ ├── config/ # Service configuration
│ │ ├── __init__.py # SnowflakeConfig + loader
│ │ └── snowflake.toml # Connection settings (co-located with loader)
│ │
│ ├── data_processing/ # Data layer
│ │ ├── database.py # SQLite connection management
│ │ ├── schema.py # Database schema (reference + pathway tables)
│ │ ├── pathway_pipeline.py # Pipeline: Snowflake → SQLite
│ │ ├── transforms.py # Data transformations (UPID, drug names, directory)
│ │ ├── loader.py # FileDataLoader for CSV/Parquet files
│ │ ├── reference_data.py # Reference data migration
│ │ ├── snowflake_connector.py # Snowflake integration
│ │ ├── cache.py # Query result caching
│ │ ├── data_source.py # Data source fallback chain
│ │ └── diagnosis_lookup.py # GP diagnosis lookup (SNOMED clusters)
│ │
│ ├── analysis/ # Analysis pipeline
│ │ ├── pathway_analyzer.py # prepare_data, calculate_statistics, build_hierarchy
│ │ └── statistics.py # Statistical calculation functions
│ │
│ ├── visualization/ # Chart generation
│ │ └── plotly_generator.py # create_icicle_figure, save_figure_html
│ │
│ └── cli/ # CLI tools
│ └── refresh_pathways.py # Data refresh command
│
├── pathways_app/ # Reflex web app (stays at root — framework requirement)
│ ├── pathways_app.py # AppState + page components
│ └── components/ # Layout and navigation components
│
├── tests/ # Test suite (113 tests)
├── data/ # Reference data + SQLite DB
├── docs/ # Documentation
├── assets/ # Static assets (logo, favicon)
├── archive/ # Historical/deprecated
└── logs/ # Runtime logs
Path resolution: src/ is added to sys.path via a .pth file (created by setup_dev.py).
All imports use package names directly: from core import ..., from data_processing import ..., etc.
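For context on the .pth mechanism: at startup, Python's site module reads every .pth file in site-packages and appends each listed directory that exists to sys.path. The sketch below shows one way a setup script could write such a file. It is not the actual setup_dev.py; the function name write_pth and the file name pathways_src.pth are hypothetical, and the demo writes into a temporary directory rather than a real environment.

```python
import tempfile
from pathlib import Path


def write_pth(site_packages: Path, src_dir: Path,
              name: str = "pathways_src.pth") -> Path:
    """Write a .pth file whose single line is the absolute path to src/."""
    pth_file = site_packages / name
    pth_file.write_text(str(src_dir.resolve()) + "\n", encoding="utf-8")
    return pth_file


# Demo against a throwaway directory so no real environment is modified.
with tempfile.TemporaryDirectory() as tmp:
    fake_site = Path(tmp) / "site-packages"
    fake_src = Path(tmp) / "src"
    fake_site.mkdir()
    fake_src.mkdir()
    created = write_pth(fake_site, fake_src)
    written_line = created.read_text(encoding="utf-8").strip()
```

In a real environment the target directory would typically be sysconfig.get_paths()["purelib"] for the active virtual environment.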
Pathway Data Architecture
The application uses a pre-computed pathway architecture for performance:
Architecture: Snowflake → Pathway Processing → SQLite (pre-computed) → Reflex (filter & view)
Key Benefits:
- Performance: Pathway calculation done once during data refresh, not on every filter change
- Simplicity: Reflex filters pre-computed data with simple SQL WHERE clauses
- Full Pathways: Sequential treatment pathways (drug_0 → drug_1 → drug_2...) with statistics
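To illustrate what "simple SQL WHERE clauses" can look like in practice, here is a minimal sketch using an in-memory SQLite database. The table is cut down to a handful of columns, the ids format and sample rows are invented, and load_nodes is a hypothetical helper, not the application's load_pathway_data().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pathway_nodes (
        date_filter_id TEXT,
        chart_type     TEXT,
        ids            TEXT,
        labels         TEXT,
        value          INTEGER,
        directory      TEXT
    )
    """
)
# Invented sample rows; real ids and labels will differ.
conn.executemany(
    "INSERT INTO pathway_nodes VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("all_6mo", "directory", "root/TrustA/RHEUMATOLOGY", "RHEUMATOLOGY", 40, "RHEUMATOLOGY"),
        ("all_6mo", "directory", "root/TrustA/DERMATOLOGY", "DERMATOLOGY", 15, "DERMATOLOGY"),
        ("all_6mo", "indication", "root/TrustA/psoriasis", "psoriasis", 12, "psoriasis"),
        ("1yr_6mo", "directory", "root/TrustA/RHEUMATOLOGY", "RHEUMATOLOGY", 9, "RHEUMATOLOGY"),
    ],
)


def load_nodes(conn, date_filter_id, chart_type, directories=None):
    """Select one pre-computed dataset, optionally narrowed by directory."""
    sql = (
        "SELECT ids, labels, value FROM pathway_nodes "
        "WHERE date_filter_id = ? AND chart_type = ?"
    )
    params = [date_filter_id, chart_type]
    if directories:
        # One placeholder per value keeps the query parameterized.
        placeholders = ", ".join("?" for _ in directories)
        sql += f" AND directory IN ({placeholders})"
        params.extend(directories)
    sql += " ORDER BY ids"
    return conn.execute(sql, params).fetchall()


whole_dataset = load_nodes(conn, "all_6mo", "directory")
selected = load_nodes(conn, "all_6mo", "directory", ["RHEUMATOLOGY"])
```

Because every dataset is keyed by date_filter_id and chart_type, switching filters only changes the parameters of the query; no pathway calculation is repeated.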
Chart Types:
| Type | Hierarchy | Level 2 Source |
|---|---|---|
| directory | Trust → Directory → Drug → Pathway | Assigned directorate (5-level fallback) |
| indication | Trust → GP Diagnosis → Drug → Pathway | SNOMED cluster Search_Term from GP records |
For indication charts, ~93% of patients are matched to a GP diagnosis (Search_Term). Unmatched patients use their directorate as a fallback label (e.g., "RHEUMATOLOGY (no GP dx)").
Date Filter Combinations:
| ID | Initiated | Last Seen | Default |
|---|---|---|---|
| all_6mo | All years | Last 6 months | Yes |
| all_12mo | All years | Last 12 months | No |
| 1yr_6mo | Last 1 year | Last 6 months | No |
| 1yr_12mo | Last 1 year | Last 12 months | No |
| 2yr_6mo | Last 2 years | Last 6 months | No |
| 2yr_12mo | Last 2 years | Last 12 months | No |
Total pre-computed datasets: 6 date filters × 2 chart types = 12 datasets (~3,600 pathway nodes).
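Each date filter ID encodes two look-back windows. The sketch below shows one way such an ID could be turned into concrete ISO dates. It assumes that both windows are measured back in whole calendar months from the most recent date in the data; the real pipeline's exact semantics are not stated here, and the names DateFilter, months_before and date_range are hypothetical.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class DateFilter:
    id: str
    initiated_years: Optional[int]  # None means "All years"
    last_seen_months: int


FILTERS = [
    DateFilter("all_6mo", None, 6),
    DateFilter("all_12mo", None, 12),
    DateFilter("1yr_6mo", 1, 6),
    DateFilter("1yr_12mo", 1, 12),
    DateFilter("2yr_6mo", 2, 6),
    DateFilter("2yr_12mo", 2, 12),
]


def months_before(d: date, months: int) -> date:
    """Step back whole calendar months, clamping the day to the month length."""
    total = d.year * 12 + (d.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def date_range(f: DateFilter, max_date: date) -> Tuple[Optional[str], str]:
    """Return (initiated_from, last_seen_from) as ISO strings."""
    initiated_from = (
        months_before(max_date, 12 * f.initiated_years).isoformat()
        if f.initiated_years is not None
        else None  # no lower bound for "All years"
    )
    last_seen_from = months_before(max_date, f.last_seen_months).isoformat()
    return initiated_from, last_seen_from


ranges = {f.id: date_range(f, date(2025, 3, 31)) for f in FILTERS}
```

Clamping the day matters for month ends: six months before 31 March is 30 September, since September has no 31st.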
Pathway Node Structure:
Each node in pathway_nodes contains:
- Routing: chart_type ("directory" or "indication"), date_filter_id
- Hierarchy: parents, ids, labels, level (0=Root, 1=Trust, 2=Directory/Indication, 3=Drug, 4+=Pathway)
- Counts: value (patient count)
- Costs: cost, costpp, cost_pp_pa (per patient per annum)
- Dates: first_seen, last_seen, first_seen_parent, last_seen_parent
- Statistics: average_spacing, average_administered, avg_days
- Denormalized: trust_name, directory, drug_sequence (for efficient filtering)
- Unique constraint: UNIQUE(date_filter_id, chart_type, ids)
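The role of chart_type in the unique constraint is easiest to see with a small experiment. The sketch below uses a cut-down table (four columns only, invented ids) and a hypothetical upsert helper. With INSERT OR REPLACE, a row is replaced only when all three key columns collide, so a directory node and an indication node that share the same ids value can coexist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pathway_nodes (
        date_filter_id TEXT NOT NULL,
        chart_type     TEXT NOT NULL,
        ids            TEXT NOT NULL,
        value          INTEGER,
        UNIQUE (date_filter_id, chart_type, ids)
    )
    """
)


def upsert(conn, date_filter_id, chart_type, ids, value):
    """Insert a node, replacing any row with the same three-column key."""
    conn.execute(
        "INSERT OR REPLACE INTO pathway_nodes "
        "(date_filter_id, chart_type, ids, value) VALUES (?, ?, ?, ?)",
        (date_filter_id, chart_type, ids, value),
    )


upsert(conn, "all_6mo", "directory", "root/TrustA", 100)
upsert(conn, "all_6mo", "indication", "root/TrustA", 100)  # same ids, other chart type
upsert(conn, "all_6mo", "directory", "root/TrustA", 120)   # a later refresh

rows = conn.execute(
    "SELECT chart_type, value FROM pathway_nodes ORDER BY chart_type"
).fetchall()
```

If the constraint covered only date_filter_id and ids, the second call would have overwritten the first row and one chart type would silently lose its Trust node.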
Core Module (core/)
- PathConfig - Dataclass encapsulating all file paths, with validate() method
- AnalysisFilters - Dataclass for filter state (dates, drugs, trusts, directories)
- logging_config - Structured logging with file and console output
CLI Module (cli/)
- refresh_pathways.py - Command-line tool to refresh pre-computed pathway data:
  - refresh_pathways() - Main function orchestrating the full pipeline
  - insert_pathway_records() - SQLite insertion with parameterized queries
  - log_refresh_start/complete/failed() - Refresh tracking in pathway_refresh_log
  - get_default_filters() - Load trusts/drugs/directories from CSV files
Data Processing Module (data_processing/)
Database Management:
- DatabaseManager - SQLite connection pooling and transaction management
- Reference Tables: ref_drug_names, ref_organizations, ref_directories, ref_drug_directory_map, ref_drug_indication_clusters
- Pathway Tables: pathway_date_filters, pathway_nodes, pathway_refresh_log
Pathway Pipeline (pathway_pipeline.py):
- DateFilterConfig - Dataclass for date filter configuration
- DATE_FILTER_CONFIGS - All 6 pre-defined date combinations
- compute_date_ranges(config, max_date) - Computes actual ISO dates from config
- fetch_and_transform_data() - Snowflake fetch + UPID/drug/directory transformations
- Directory chart functions:
  - process_pathway_for_date_filter() - Processes single date filter using generate_icicle_chart()
  - extract_denormalized_fields() - Parses ids column to extract trust, directory, drug_sequence
- Indication chart functions:
  - process_indication_pathway_for_date_filter() - Processes single date filter using generate_icicle_chart_indication()
  - extract_indication_fields() - Parses ids for indication charts (trust, search_term, drug_sequence)
- Shared functions:
  - convert_to_records(ice_df, chart_type) - Converts ice_df to list of dicts with chart_type column
  - process_all_date_filters() - Convenience function to process all 6 filters
Data Loaders:
- FileDataLoader - Loads from CSV/Parquet files (used by legacy pipeline, not by Reflex app)
- Factory function get_loader() creates a FileDataLoader
Snowflake Integration:
- SSO authentication via externalbrowser authenticator
- fetch_activity_data(start_date, end_date, provider_codes) method
- Query caching with TTL-based invalidation
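As an illustration of TTL-based invalidation, here is a minimal in-memory sketch. It is not the cache.py implementation: the TTLCache class, its method names, and the 60-second lifetime are hypothetical, and the real cache may well persist results to disk. The clock is injectable so expiry can be demonstrated without sleeping.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Keep values for a fixed lifetime, then treat them as missing."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        # Record the time of insertion alongside the value.
        self._store[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # expired: invalidate on read
            return None
        return value


# Fake clock: a one-element list that the lambda reads on every call.
now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("activity:all", ["row-1", "row-2"])
fresh = cache.get("activity:all")
now[0] = 61.0  # advance past the TTL
stale = cache.get("activity:all")
```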
GP Diagnosis Lookup (diagnosis_lookup.py):
- CLUSTER_MAPPING_SQL - Embedded SQL constant with ~148 Search_Term → Cluster_ID mappings plus explicit SNOMED codes
- get_patient_indication_groups(patient_pseudonyms) - Batch queries Snowflake to match patients to GP diagnoses:
  - Embeds cluster mapping as CTE, joins with PrimaryCareClinicalCoding
  - Uses PseudoNHSNoLinked (not PersonKey) to match PatientPseudonym in GP records
  - Returns most recent match per patient via QUALIFY ROW_NUMBER()
  - Batches 500 patients per query, returns DataFrame with PatientPseudonym, Search_Term, EventDateTime
- patient_has_indication(patient_pseudonym, cluster_ids) - Single-patient GP record check (legacy)
- validate_indication(patient_pseudonym, drug_name) - Full validation result with source tracking (legacy)
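The batching step can be sketched independently of Snowflake. The helpers below (unique_in_order and batched, both hypothetical names) de-duplicate pseudonyms and split them into groups of 500, the batch size stated above. How each batch is then bound into the query is left out, since that depends on details of diagnosis_lookup.py not shown here.

```python
from typing import Iterable, Iterator, List, Optional, Sequence

BATCH_SIZE = 500  # patients per query, as documented


def unique_in_order(values: Iterable[Optional[str]]) -> List[str]:
    """Drop missing values and duplicates while keeping first-seen order."""
    return list(dict.fromkeys(v for v in values if v))


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# 1,200 distinct pseudonyms plus one duplicate and one missing value.
pseudonyms = [f"P{i:05d}" for i in range(1200)] + ["P00000", None]
unique = unique_in_order(pseudonyms)
batch_sizes = [len(batch) for batch in batched(unique)]
```

De-duplicating first matters because activity data has many rows per patient, and each pseudonym only needs to be looked up once.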
Analysis Module (analysis/)
Refactored from the original 267-line generate_graph() function:
- prepare_data() - Filter DataFrame by date range, trusts, drugs, directories (copies df to prevent mutation)
- calculate_statistics() - Compute frequency, cost, duration statistics
- build_hierarchy() - Create Trust → Directory → Drug → Pathway structure
- prepare_chart_data() - Format data for Plotly icicle chart
- generate_icicle_chart_indication(df, indication_df, ...) - Build indication-based hierarchy using Search_Term instead of Directory. Takes an indication_df (UPID → Search_Term mapping) alongside the main activity DataFrame.
Visualization Module (visualization/)
- create_icicle_figure() - Generate Plotly icicle chart figure
- save_figure_html() - Save interactive HTML file
- open_figure_in_browser() - Open chart in default browser
Reflex Application (pathways_app/)
The AppState class manages all application state:
- Chart type: selected_chart_type ("directory" or "indication"), toggled via set_chart_type()
- Computed vars: chart_hierarchy_label (dynamic "Trust → Directorate → ..." or "Trust → Indication → ..."), chart_type_label
- Filter variables: dates, drugs, trusts, directories
- Reference data: available options loaded from pathway_nodes and CSV files
- Analysis state: running flag, status messages, chart data
- load_data() sources available drugs/directorates from pathway_nodes and total_records from pathway_refresh_log.source_row_count
Chart Type Toggle (chart_type_toggle() component):
- Segmented control with "By Directory" and "By Indication" pill buttons
- Placed first in the filter strip before date filters
- Switching reloads pathway data from SQLite filtered by chart_type
- Note: Directory filter only applies to directory charts (indication charts store Search_Terms in the directory column)
Data Transformations (data_processing/transforms.py)
Core data transformation functions used by the pipeline:
- patient_id() - Creates UPID = Provider Code (first 3 chars) + PersonKey
- drug_names() - Standardizes via drugnames.csv lookup
- department_identification() - 5-level fallback chain for directory assignment
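The UPID rule is simple enough to show directly. This is a sketch of the stated formula only: the function name make_upid is hypothetical, the sample values are invented, and plain concatenation with no separator is an assumption. The real patient_id() operates on the activity DataFrame and may handle types and missing values differently.

```python
def make_upid(provider_code: str, person_key) -> str:
    """UPID = first 3 characters of Provider Code + PersonKey."""
    # Assumption: the two parts are joined with no separator.
    return f"{provider_code[:3]}{person_key}"


# Invented sample values for illustration.
upid = make_upid("RGP10", 123456)
short_code_upid = make_upid("RG", "77")
```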
Data Flow
Pre-Computed Pathway Architecture (Current):
[CLI: python -m cli.refresh_pathways --chart-type all]
Snowflake Data Warehouse
│
▼ (fetch_and_transform_data)
┌──────────────────────────────────────────┐
│ Data Transformations (data_processing/transforms.py) │
│ → patient_id() creates UPID │
│ → drug_names() standardizes names │
│ → department_identification() → Dir │
└──────────────────────────────────────────┘
│
├─── Directory Charts ──────────────────────────────────────┐
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ For each of 6 date filter combos: │ │
│ │ → generate_icicle_chart() │ │
│ │ → extract_denormalized_fields() │ │
│ │ → convert_to_records("directory") │ │
│ └──────────────────────────────────────────┘ │
│ │
├─── Indication Charts ─────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ GP Diagnosis Lookup (diagnosis_lookup.py)│ │
│ │ → Extract PseudoNHSNoLinked from HCD │ │
│ │ → get_patient_indication_groups() │ │
│ │ (SNOMED cluster CTE + GP records) │ │
│ │ → Build indication_df: UPID → Search │ │
│ │ Term (matched) or Directorate (no GP)│ │
│ └──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ For each of 6 date filter combos: │ │
│ │ → generate_icicle_chart_indication() │ │
│ │ → extract_indication_fields() │ │
│ │ → convert_to_records("indication") │ │
│ └──────────────────────────────────────────┘ │
│ │
└───────────────────────┬───────────────────────────────────┘
│
▼ (insert_pathway_records)
┌──────────────────────────────────────────┐
│ SQLite: pathway_nodes table │
│ → ~3,600 nodes across 12 datasets │
│ → UNIQUE(date_filter_id, chart_type, │
│ ids) prevents cross-type overwrites │
│ → Indexed for fast filtering │
└──────────────────────────────────────────┘
[Reflex App: reflex run]
┌──────────────────────────────────────────┐
│ Chart Type Toggle (segmented control) │
│ → "By Directory" | "By Indication" │
│ → Triggers set_chart_type() handler │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ AppState.load_pathway_data() │
│ → Query pathway_nodes WHERE │
│ date_filter AND chart_type │
│ → Apply drug/directory filters │
│ → recalculate_parent_totals() │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ AppState.icicle_figure │
│ → Plotly icicle chart │
│ → 10-field customdata structure │
│ → Full hover/text templates │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Reflex UI (rx.plotly component) │
│ → <50ms filter response time │
│ → Treatment statistics in tooltips │
│ → Dynamic hierarchy label updates │
└──────────────────────────────────────────┘
Reference Data Files (data/)
| File | Purpose |
|---|---|
| include.csv | Drug filter list with default selections (Include=1) |
| defaultTrusts.csv | NHS Trust list for filter |
| directory_list.csv | Medical specialties/directories |
| drugnames.csv | Drug name standardization mapping |
| org_codes.csv | Provider code to organization name mapping |
| drug_directory_list.csv | Valid drug-to-directory mappings (pipe-separated) |
| treatment_function_codes.csv | NHS treatment function code mappings |
| drug_indication_clusters.csv | Drug to SNOMED cluster mappings |
| ta-recommendations.xlsx | NICE TA recommendations |
| pathways.db | SQLite database (~3.5 MB: reference tables + pathway nodes) |
Key Patterns
Department Identification Fallback Chain:
The department_identification() function has 5 levels of fallback:
- SINGLE_VALID_DIR - Drug has only one valid directory
- EXTRACTED - Extracted from Additional Detail/Description fields
- CALCULATED_MOST_FREQ - Most frequent valid directory for UPID/Drug
- UPID_INFERENCE - Inferred from other records with same UPID
- UNDEFINED - No directory could be determined
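The chain above is a first-match-wins sequence, which can be sketched as an ordered list of resolver functions. Everything below is illustrative: the record fields (valid_dirs, extracted_dir, most_frequent_dir, upid_dir), the resolver bodies, and the "Undefined" placeholder label are assumptions, and the real department_identification() works on DataFrames rather than single records. Only the five source names come from the list above.

```python
from typing import Callable, List, Optional, Tuple

Resolver = Callable[[dict], Optional[str]]


def single_valid_dir(record: dict) -> Optional[str]:
    valid = record.get("valid_dirs", [])
    return valid[0] if len(valid) == 1 else None


def extracted(record: dict) -> Optional[str]:
    return record.get("extracted_dir")


def most_frequent(record: dict) -> Optional[str]:
    return record.get("most_frequent_dir")


def upid_inference(record: dict) -> Optional[str]:
    return record.get("upid_dir")


# Order matters: earlier resolvers take priority over later ones.
CHAIN: List[Tuple[str, Resolver]] = [
    ("SINGLE_VALID_DIR", single_valid_dir),
    ("EXTRACTED", extracted),
    ("CALCULATED_MOST_FREQ", most_frequent),
    ("UPID_INFERENCE", upid_inference),
]


def assign_directory(record: dict) -> Tuple[str, str]:
    """Return (directory, source) from the first resolver that yields a value."""
    for source, resolve in CHAIN:
        directory = resolve(record)
        if directory:
            return directory, source
    return "Undefined", "UNDEFINED"  # placeholder label is an assumption


one_option = assign_directory({"valid_dirs": ["RHEUMATOLOGY"]})
by_frequency = assign_directory(
    {"valid_dirs": ["RHEUMATOLOGY", "DERMATOLOGY"], "most_frequent_dir": "DERMATOLOGY"}
)
nothing_known = assign_directory({"valid_dirs": []})
```

Returning the source alongside the directory makes it possible to audit how many assignments came from each level of the chain.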
Indication Lookup Workflow (for indication charts):
- Extract unique PseudoNHSNoLinked values from HCD activity data
- Query Snowflake in batches of 500 patients:
  - Embed CLUSTER_MAPPING_SQL (~148 Search_Term → Cluster_ID mappings) as CTE
  - Join ClinicalCodingClusterSnomedCodes to get SNOMED codes per cluster
  - Join PrimaryCareClinicalCoding on PatientPseudonym = PseudoNHSNoLinked
  - Use QUALIFY ROW_NUMBER() OVER (PARTITION BY PatientPseudonym ORDER BY EventDateTime DESC) = 1 for most recent match
- Build indication_df mapping UPID → Search_Term (matched) or Directorate + " (no GP dx)" (unmatched)
- Pass to generate_icicle_chart_indication() for pathway hierarchy building
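The step that builds the UPID → label mapping, including the "(no GP dx)" fallback, can be sketched with plain dictionaries. This is an illustration rather than the pipeline code, which produces a DataFrame: the function name build_indication_labels, the record field names, and the sample patients are all invented.

```python
from typing import Dict, List


def build_indication_labels(patients: List[dict],
                            gp_matches: Dict[str, str]) -> Dict[str, str]:
    """Map each UPID to its Search_Term, or to a directorate fallback label."""
    labels: Dict[str, str] = {}
    for patient in patients:
        term = gp_matches.get(patient["pseudonym"])
        if term:
            labels[patient["upid"]] = term
        else:
            # Unmatched patients keep their directorate, visibly flagged.
            labels[patient["upid"]] = f'{patient["directorate"]} (no GP dx)'
    return labels


# Invented sample data: two of three patients have a GP match.
patients = [
    {"upid": "RGP1001", "pseudonym": "A1", "directorate": "RHEUMATOLOGY"},
    {"upid": "RGP1002", "pseudonym": "B2", "directorate": "RHEUMATOLOGY"},
    {"upid": "RGP1003", "pseudonym": "C3", "directorate": "DERMATOLOGY"},
]
gp_matches = {"A1": "rheumatoid arthritis", "C3": "psoriasis"}

labels = build_indication_labels(patients, gp_matches)
matched = sum(p["pseudonym"] in gp_matches for p in patients)
```

Note that the lookup is keyed by pseudonym while the result is keyed by UPID, mirroring the distinction between PseudoNHSNoLinked for GP matching and UPID for pathway building.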
Data Source Fallback Chain (for raw data loading, not used by Reflex app):
- Query cache for recent results
- Attempt Snowflake connection
- Fall back to CSV/Parquet files
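A fallback chain of this kind can be expressed as an ordered list of loader callables. The sketch below is a generic illustration, not data_source.py: load_with_fallback and the three stand-in loaders are hypothetical. It assumes a loader signals "nothing available" either by returning None (a cache miss) or by raising (a failed connection).

```python
from typing import Any, Callable, List, Tuple


def load_with_fallback(sources: List[Tuple[str, Callable[[], Any]]]) -> Tuple[str, Any]:
    """Try each source in order; return (name, data) from the first that succeeds."""
    errors: List[str] = []
    for name, loader in sources:
        try:
            result = loader()
        except Exception as exc:  # a failing source must not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if result is not None:
            return name, result
        errors.append(f"{name}: no data")
    raise RuntimeError("All data sources failed: " + "; ".join(errors))


def from_cache():
    return None  # simulate a cache miss


def from_snowflake():
    raise ConnectionError("SSO login unavailable")  # simulate an outage


def from_files():
    return [{"Drug Name": "EXAMPLE"}]  # simulate rows read from CSV/Parquet


used_source, data = load_with_fallback(
    [("cache", from_cache), ("snowflake", from_snowflake), ("files", from_files)]
)
```

Collecting the error from each failed source means that when every source fails, the final exception explains why each one was skipped.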
Database Schema (~3.5 MB)
Reference Tables
- ref_drug_names - Drug name standardization
- ref_organizations - Provider code to name mapping
- ref_directories - Valid directory names
- ref_drug_directory_map - Valid drug-directory pairs
- ref_drug_indication_clusters - Drug to SNOMED cluster mapping
Pathway Tables
- pathway_date_filters - 6 pre-defined date filter combinations
  - Columns: id, initiated, last_seen, is_default, description
  - Auto-populated via migration
- pathway_nodes - Pre-computed pathway hierarchy nodes (~3,600 rows for 12 datasets)
  - Routing: chart_type ("directory" or "indication"), date_filter_id
  - Hierarchy: parents, ids, labels, level
  - Metrics: value, cost, costpp, cost_pp_pa, colour
  - Dates: first_seen, last_seen, first_seen_parent, last_seen_parent
  - Statistics: average_spacing, average_administered, avg_days
  - Denormalized: trust_name, directory, drug_sequence
  - Foreign key: date_filter_id → pathway_date_filters.id
  - Unique constraint: UNIQUE(date_filter_id, chart_type, ids), critical for INSERT OR REPLACE correctness
  - Indexed for: date_filter_id, chart_type, trust_name, directory, level
- pathway_refresh_log - Tracks data refresh status
  - Columns: refresh_id, started_at, completed_at, status, records_processed, error_message, source_row_count
Input Data Requirements
The input data (CSV/Parquet) must contain columns including:
- Provider Code, PersonKey - Used to create UPID
- PseudoNHSNoLinked - NHS pseudonym for GP record matching (indication charts)
- Drug Name, Intervention Date, Price Actual
- OrganisationName
- Various Additional Detail/Description columns for directory extraction
- Treatment Function Code
Output
Interactive Plotly icicle chart with toggleable views:
- Directory view: Trust → Directorate → Drug → Patient Pathway
- Indication view: Trust → GP Diagnosis (Search_Term) → Drug → Patient Pathway
- Patient counts and percentages at each hierarchy level
- Total and average costs
- Treatment duration and dosing frequency information
- Color gradient based on patient volume
Testing
# Run all tests with coverage
python -m pytest tests/ -v --cov=core --cov=analysis
# Run specific test file
python -m pytest tests/test_config.py -v
# Run specific test class
python -m pytest tests/test_data_transformations.py::TestPatientId -v
Test coverage includes:
- PathConfig validation (23 tests)
- AnalysisFilters validation (26 tests)
- Data transformation functions (23 tests)
- Directory assignment logic (19 tests)
Configuration
Snowflake Connection (src/config/snowflake.toml)
[snowflake]
account = "your-account"
database = "DATA_HUB"
schema = "CDM"
warehouse = "your-warehouse"
authenticator = "externalbrowser" # Required for NHS SSO
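Since the application supports Python 3.10 and tomllib only joined the standard library in Python 3.11, a loader for this file needs a fallback import. The sketch below shows the usual pattern. It assumes the tomli backport is installed on 3.10, and the function load_snowflake_settings and its set of required keys are hypothetical rather than taken from src/config/__init__.py.

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib  # standard library from Python 3.11 onwards
else:
    import tomli as tomllib  # assumption: the tomli backport is installed on 3.10

SAMPLE_TOML = """
[snowflake]
account = "your-account"
database = "DATA_HUB"
schema = "CDM"
warehouse = "your-warehouse"
authenticator = "externalbrowser"
"""

# Assumption: these are the keys a connection cannot do without.
REQUIRED_KEYS = {"account", "warehouse", "authenticator"}


def load_snowflake_settings(text: str) -> dict:
    """Parse TOML text and return the [snowflake] table after a basic check."""
    settings = tomllib.loads(text)["snowflake"]
    missing = REQUIRED_KEYS - settings.keys()
    if missing:
        raise KeyError(f"snowflake.toml is missing: {sorted(missing)}")
    return settings


settings = load_snowflake_settings(SAMPLE_TOML)
```

When reading from disk rather than a string, tomllib.load() expects a file opened in binary mode ("rb").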
Logging
Logs are written to the logs/ directory in a structured format. Configure logging via src/core/logging_config.py.
Breaking Changes from Original App
The pre-computed pathway architecture introduces these changes:
Date Filters
- Old: Date pickers for arbitrary start_date and end_date
- New: Two dropdowns:
  - "Treatment Initiated": All years, Last 2 years, Last 1 year
  - "Last Seen": Last 6 months, Last 12 months
- Reason: Pre-computed pathways require fixed date combinations for performance
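The two dropdowns together select exactly one of the six pre-computed date filter IDs. A minimal sketch of that mapping follows; the dictionary names and the to_date_filter_id function are hypothetical, while the option labels and the resulting IDs (all_6mo, 1yr_12mo, and so on) are the ones documented in this file.

```python
INITIATED_OPTIONS = {
    "All years": "all",
    "Last 2 years": "2yr",
    "Last 1 year": "1yr",
}
LAST_SEEN_OPTIONS = {
    "Last 6 months": "6mo",
    "Last 12 months": "12mo",
}


def to_date_filter_id(initiated: str, last_seen: str) -> str:
    """Combine the two dropdown selections into a date_filter_id."""
    try:
        return f"{INITIATED_OPTIONS[initiated]}_{LAST_SEEN_OPTIONS[last_seen]}"
    except KeyError as exc:
        raise ValueError(f"Unknown dropdown option: {exc.args[0]!r}") from None


default_id = to_date_filter_id("All years", "Last 6 months")
all_ids = sorted(
    to_date_filter_id(i, s) for i in INITIATED_OPTIONS for s in LAST_SEEN_OPTIONS
)
```

Three "Treatment Initiated" options times two "Last Seen" options gives the six combinations, which is why arbitrary date pickers could not be kept.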
Data Refresh
- Old: Real-time pathway calculation on each filter change
- New: Pre-computed pathways stored in SQLite, refreshed via CLI command
- Impact: Data is as fresh as the last python -m cli.refresh_pathways run
- Benefit: Sub-50ms filter response time vs multi-minute calculations
State Variables
- Removed: start_date, end_date, set_start_date(), set_end_date()
- Added: selected_initiated, selected_last_seen, date_filter_id
- Added: selected_chart_type ("directory" or "indication"), chart_type_options
- Added: set_chart_type() - switches chart type and reloads data
- Added: chart_hierarchy_label, chart_type_label - computed vars for dynamic UI text
- Added: load_pathway_data() - queries pre-computed pathway_nodes filtered by date_filter_id AND chart_type
- Added: recalculate_parent_totals() - adjusts hierarchy after filtering
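To make "adjusts hierarchy after filtering" concrete: once a drug or directory filter removes some nodes, the pre-computed totals on their ancestors no longer match what remains. The sketch below shows one plausible adjustment, re-summing each parent's value from its surviving children. It is an illustration under that assumption, not the app's recalculate_parent_totals(); the function name resum_parents and the sample nodes are invented, and only the value field is handled.

```python
from typing import Dict, List


def resum_parents(nodes: List[dict]) -> List[dict]:
    """Recompute each parent's value as the sum of its remaining children."""
    by_id: Dict[str, dict] = {n["ids"]: dict(n) for n in nodes}  # copy, do not mutate input
    children: Dict[str, List[str]] = {}
    for node in by_id.values():
        if node["parents"] in by_id:
            children.setdefault(node["parents"], []).append(node["ids"])

    def total(node_id: str) -> int:
        kids = children.get(node_id)
        if kids:  # leaves keep their stored value
            by_id[node_id]["value"] = sum(total(kid) for kid in kids)
        return by_id[node_id]["value"]

    # Start from every node whose parent is absent (the roots).
    for node in list(by_id.values()):
        if node["parents"] not in by_id:
            total(node["ids"])
    return list(by_id.values())


nodes = [
    {"ids": "root", "parents": "", "value": 100},
    {"ids": "root/TrustA", "parents": "root", "value": 100},
    {"ids": "root/TrustA/RHEUMATOLOGY", "parents": "root/TrustA", "value": 60},
    {"ids": "root/TrustA/DERMATOLOGY", "parents": "root/TrustA", "value": 40},
]
# Simulate a filter that removes one directory.
kept = [n for n in nodes if not n["ids"].endswith("DERMATOLOGY")]
totals = {n["ids"]: n["value"] for n in resum_parents(kept)}
```

Keeping parent totals consistent with the visible children matters for icicle charts, where each parent's size is compared against the combined size of its children.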
Chart Type Toggle
- New: Segmented control ("By Directory" | "By Indication") in filter strip
- Added: selected_chart_type state variable, set_chart_type() handler
- Added: Dynamic hierarchy label ("Trust → Directorate → ..." or "Trust → Indication → ...")
- Note: Directory filter only applies to directory charts; for indication charts the directory column stores Search_Terms
Icicle Chart
- Enhanced: Now includes full 10-field customdata structure
- Added: Treatment statistics (average_spacing, cost_pp_pa) in hover tooltips
- Added: First/last seen dates for drug nodes
- Added: Indication chart uses generate_icicle_chart_indication() with Search_Term hierarchy
Development
Adding New Analysis Features
- Add statistical functions to src/analysis/statistics.py
- Integrate into pipeline in src/analysis/pathway_analyzer.py
- Update visualization in src/visualization/plotly_generator.py
Adding New Reference Data
- Add CSV file to data/ directory
- Define schema in src/data_processing/schema.py
- Create migration function in src/data_processing/reference_data.py
- Add path to PathConfig in src/core/config.py