docs: update CLAUDE.md with indication chart architecture and CLI docs (Task 5.2)

This commit is contained in:
Andrew Charlwood
2026-02-05 20:49:49 +00:00
parent 4884e0a8cc
commit fed909481e
3 changed files with 185 additions and 44 deletions
+141 -40
View File
@@ -4,14 +4,15 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
## Project Overview
NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that analyzes secondary care patient treatment pathways. It processes clinical activity data to visualize hierarchical treatment patterns (Trust → Directory/Specialty → Drug → Patient pathway) as interactive Plotly icicle charts.
NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that analyzes secondary care patient treatment pathways. It processes clinical activity data to visualize hierarchical treatment patterns as interactive Plotly icicle charts.
**Key Features:**
- Multi-source data loading: CSV/Parquet files, SQLite database, Snowflake data warehouse
- **Dual chart types**: Directory-based (Trust → Directory → Drug → Pathway) and Indication-based (Trust → GP Diagnosis → Drug → Pathway) views with toggle
- **Pre-computed pathway architecture**: Treatment pathways pre-processed and stored in SQLite for instant filtering
- GP diagnosis integration for indication validation via SNOMED clusters
- **GP diagnosis matching**: Patient indications matched from GP records using SNOMED cluster codes queried directly from Snowflake (~93% match rate)
- Multi-source data loading: CSV/Parquet files, SQLite database, Snowflake data warehouse
- Interactive browser-based UI using Reflex framework
- 6 pre-defined date filter combinations with sub-50ms response times
- 6 pre-defined date filter combinations × 2 chart types = 12 pre-computed datasets with sub-50ms response times
## Running the Application
@@ -37,11 +38,17 @@ The application requires Python 3.10+ and runs on http://localhost:3000 by defau
**Refresh Pathway Data:**
```bash
# Full refresh with default filters (all trusts, default drugs)
python -m cli.refresh_pathways
# Full refresh — both chart types (directory + indication), all date filters
python -m cli.refresh_pathways --chart-type all
# Directory charts only (faster, skips GP diagnosis lookup)
python -m cli.refresh_pathways --chart-type directory
# Indication charts only
python -m cli.refresh_pathways --chart-type indication
# Dry run (test without database changes)
python -m cli.refresh_pathways --dry-run -v
python -m cli.refresh_pathways --chart-type all --dry-run -v
# Custom minimum patient threshold
python -m cli.refresh_pathways --minimum-patients 10
@@ -50,11 +57,17 @@ python -m cli.refresh_pathways --minimum-patients 10
python -m cli.refresh_pathways --help
```
The `--chart-type` argument controls which pathway types are processed:
- `all` (default) — generates both directory and indication charts (~15 minutes)
- `directory` — directory-based charts only (~5 minutes)
- `indication` — indication-based charts only (~12 minutes, includes GP lookup)
The refresh command:
1. Fetches activity data from Snowflake (656K+ records, ~7 seconds)
2. Applies UPID, drug name, and directory transformations (~6 minutes)
3. Processes 6 date filter combinations (all_6mo, all_12mo, 1yr_6mo, etc.)
4. Inserts pathway nodes to SQLite for fast Reflex filtering
3. For indication charts: queries GP records via SNOMED clusters (~9 minutes for 37K patients)
4. Processes 6 date filter combinations × selected chart types
5. Inserts pathway nodes to SQLite for fast Reflex filtering
## Architecture
@@ -121,6 +134,15 @@ The application uses a pre-computed pathway architecture for performance:
- **Simplicity**: Reflex filters pre-computed data with simple SQL WHERE clauses
- **Full Pathways**: Sequential treatment pathways (drug_0 → drug_1 → drug_2...) with statistics
**Chart Types:**
| Type | Hierarchy | Level 2 Source |
|------|-----------|----------------|
| `directory` | Trust → Directory → Drug → Pathway | Assigned directorate (5-level fallback) |
| `indication` | Trust → GP Diagnosis → Drug → Pathway | SNOMED cluster Search_Term from GP records |
For indication charts, ~93% of patients are matched to a GP diagnosis (Search_Term). Unmatched patients use their directorate as a fallback label (e.g., "RHEUMATOLOGY (no GP dx)").
**Date Filter Combinations:**
| ID | Initiated | Last Seen | Default |
|----|-----------|-----------|---------|
@@ -131,14 +153,18 @@ The application uses a pre-computed pathway architecture for performance:
| `2yr_6mo` | Last 2 years | Last 6 months | No |
| `2yr_12mo` | Last 2 years | Last 12 months | No |
Total pre-computed datasets: 6 date filters × 2 chart types = 12 datasets (~3,600 pathway nodes).
**Pathway Node Structure:**
Each node in `pathway_nodes` contains:
- Hierarchy: `parents`, `ids`, `labels`, `level` (0=Root, 1=Trust, 2=Directory, 3=Drug, 4+=Pathway)
- Routing: `chart_type` ("directory" or "indication"), `date_filter_id`
- Hierarchy: `parents`, `ids`, `labels`, `level` (0=Root, 1=Trust, 2=Directory/Indication, 3=Drug, 4+=Pathway)
- Counts: `value` (patient count)
- Costs: `cost`, `costpp`, `cost_pp_pa` (per patient per annum)
- Dates: `first_seen`, `last_seen`, `first_seen_parent`, `last_seen_parent`
- Statistics: `average_spacing`, `average_administered`, `avg_days`
- Denormalized: `trust_name`, `directory`, `drug_sequence` (for efficient filtering)
- Unique constraint: `UNIQUE(date_filter_id, chart_type, ids)`
### Core Module (`core/`)
@@ -166,10 +192,15 @@ Each node in `pathway_nodes` contains:
- `DATE_FILTER_CONFIGS` - All 6 pre-defined date combinations
- `compute_date_ranges(config, max_date)` - Computes actual ISO dates from config
- `fetch_and_transform_data()` - Snowflake fetch + UPID/drug/directory transformations
- `process_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart()`
- `extract_denormalized_fields()` - Parses `ids` column to extract trust, directory, drug_sequence
- `convert_to_records()` - Converts ice_df to list of dicts for SQLite insertion
- `process_all_date_filters()` - Convenience function to process all 6 filters
- Directory chart functions:
- `process_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart()`
- `extract_denormalized_fields()` - Parses `ids` column to extract trust, directory, drug_sequence
- Indication chart functions:
- `process_indication_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart_indication()`
- `extract_indication_fields()` - Parses `ids` for indication charts (trust, search_term, drug_sequence)
- Shared functions:
- `convert_to_records(ice_df, chart_type)` - Converts ice_df to list of dicts with `chart_type` column
- `process_all_date_filters()` - Convenience function to process all 6 filters
**Data Loaders:**
- `FileDataLoader` - Loads from CSV/Parquet files
@@ -182,20 +213,25 @@ Each node in `pathway_nodes` contains:
- Query caching with TTL-based invalidation
- Fallback chain: cache → Snowflake → local files
**GP Diagnosis Validation:**
- Uses pre-built SNOMED clusters from `ClinicalCodingClusterSnomedCodes`
- `patient_has_indication(patient_pseudonym, cluster_ids)` checks GP records
- `validate_indication(patient_pseudonym, drug_name)` returns full validation result
- Adds `Indication_Source` column: "GP_SNOMED" | "HCD_SNOMED" | "NONE"
**GP Diagnosis Lookup (`diagnosis_lookup.py`):**
- `CLUSTER_MAPPING_SQL` - Embedded SQL constant with ~148 Search_Term → Cluster_ID mappings plus explicit SNOMED codes
- `get_patient_indication_groups(patient_pseudonyms)` - Batch queries Snowflake to match patients to GP diagnoses:
- Embeds cluster mapping as CTE, joins with `PrimaryCareClinicalCoding`
- Uses `PseudoNHSNoLinked` (not PersonKey) to match `PatientPseudonym` in GP records
- Returns most recent match per patient via `QUALIFY ROW_NUMBER()`
- Batches 500 patients per query, returns DataFrame with PatientPseudonym, Search_Term, EventDateTime
- `patient_has_indication(patient_pseudonym, cluster_ids)` - Single-patient GP record check (legacy)
- `validate_indication(patient_pseudonym, drug_name)` - Full validation result with source tracking (legacy)
### Analysis Module (`analysis/`)
Refactored from the original 267-line `generate_graph()` function:
- **prepare_data()** - Filter DataFrame by date range, trusts, drugs, directories
- **prepare_data()** - Filter DataFrame by date range, trusts, drugs, directories (copies df to prevent mutation)
- **calculate_statistics()** - Compute frequency, cost, duration statistics
- **build_hierarchy()** - Create Trust → Directory → Drug → Pathway structure
- **prepare_chart_data()** - Format data for Plotly icicle chart
- **generate_icicle_chart_indication(df, indication_df, ...)** - Build indication-based hierarchy using Search_Term instead of Directory. Takes an `indication_df` (UPID → Search_Term mapping) alongside the main activity DataFrame.
### Visualization Module (`visualization/`)
@@ -205,12 +241,20 @@ Refactored from the original 267-line `generate_graph()` function:
### Reflex Application (`pathways_app/`)
The `State` class manages all application state:
The `AppState` class manages all application state:
- **Chart type**: `selected_chart_type` ("directory" or "indication"), toggled via `set_chart_type()`
- **Computed vars**: `chart_hierarchy_label` (dynamic "Trust → Directorate → ..." or "Trust → Indication → ..."), `chart_type_label`
- Filter variables: dates, drugs, trusts, directories
- Reference data: available options loaded from CSV/SQLite
- Analysis state: running flag, status messages, chart data
- Data source state: file path, source type, row counts
**Chart Type Toggle** (`chart_type_toggle()` component):
- Segmented control with "By Directory" and "By Indication" pill buttons
- Placed first in the filter strip before date filters
- Switching reloads pathway data from SQLite filtered by `chart_type`
- Note: Directory filter only applies to directory charts (indication charts store Search_Terms in the directory column)
### Legacy Modules (`tools/`)
Still used during transition:
@@ -227,7 +271,7 @@ Still used during transition:
**Pre-Computed Pathway Architecture (Current):**
```
[CLI: python -m cli.refresh_pathways]
[CLI: python -m cli.refresh_pathways --chart-type all]
Snowflake Data Warehouse
@@ -239,19 +283,42 @@ Still used during transition:
│ → department_identification() → Dir │
└──────────────────────────────────────────┘
▼ (process_all_date_filters)
┌──────────────────────────────────────────┐
│ Pathway Pipeline (pathway_pipeline.py)
For each of 6 date filter combos: │
│ → generate_icicle_chart() │
│ → extract_denormalized_fields() │
│ → convert_to_records() │
└──────────────────────────────────────────┘
▼ (insert_pathway_records)
├─── Directory Charts ──────────────────────────────────────┐
│ │
│ ┌──────────────────────────────────────────┐
For each of 6 date filter combos:
→ generate_icicle_chart()
→ extract_denormalized_fields()
→ convert_to_records("directory")
└──────────────────────────────────────────┘
├─── Indication Charts ─────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ GP Diagnosis Lookup (diagnosis_lookup.py)│ │
│ │ → Extract PseudoNHSNoLinked from HCD │ │
│ │ → get_patient_indication_groups() │ │
│ │ (SNOMED cluster CTE + GP records) │ │
│ │ → Build indication_df: UPID → Search │ │
│ │ Term (matched) or Directorate (no GP)│ │
│ └──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ For each of 6 date filter combos: │ │
│ │ → generate_icicle_chart_indication() │ │
│ │ → extract_indication_fields() │ │
│ │ → convert_to_records("indication") │ │
│ └──────────────────────────────────────────┘ │
│ │
└───────────────────────┬───────────────────────────────────┘
▼ (insert_pathway_records)
┌──────────────────────────────────────────┐
│ SQLite: pathway_nodes table │
│ → 293 nodes for all_6mo filter
│ → ~3,600 nodes across 12 datasets
│ → UNIQUE(date_filter_id, chart_type, │
│ ids) prevents cross-type overwrites │
│ → Indexed for fast filtering │
└──────────────────────────────────────────┘
@@ -259,8 +326,16 @@ Still used during transition:
[Reflex App: reflex run]
┌──────────────────────────────────────────┐
│ Chart Type Toggle (segmented control) │
│ → "By Directory" | "By Indication" │
│ → Triggers set_chart_type() handler │
└──────────────────────────────────────────┘
┌──────────────────────────────────────────┐
│ AppState.load_pathway_data() │
│ → Query pathway_nodes WHERE date_filter
│ → Query pathway_nodes WHERE
│ date_filter AND chart_type │
│ → Apply drug/directory filters │
│ → recalculate_parent_totals() │
└──────────────────────────────────────────┘
@@ -278,6 +353,7 @@ Still used during transition:
│ Reflex UI (rx.plotly component) │
│ → <50ms filter response time │
│ → Treatment statistics in tooltips │
│ → Dynamic hierarchy label updates │
└──────────────────────────────────────────┘
```
@@ -339,7 +415,17 @@ The `department_identification()` function has 5 levels of fallback:
4. **UPID_INFERENCE** - Inferred from other records with same UPID
5. **UNDEFINED** - No directory could be determined
**Indication Validation Workflow:**
**Indication Lookup Workflow (for indication charts):**
1. Extract unique `PseudoNHSNoLinked` values from HCD activity data
2. Query Snowflake in batches of 500 patients:
- Embed `CLUSTER_MAPPING_SQL` (~148 Search_Term → Cluster_ID mappings) as CTE
- Join `ClinicalCodingClusterSnomedCodes` to get SNOMED codes per cluster
- Join `PrimaryCareClinicalCoding` on `PatientPseudonym` = `PseudoNHSNoLinked`
- Use `QUALIFY ROW_NUMBER() OVER (PARTITION BY PatientPseudonym ORDER BY EventDateTime DESC) = 1` for most recent match
3. Build `indication_df` mapping UPID → Search_Term (matched) or Directorate + " (no GP dx)" (unmatched)
4. Pass to `generate_icicle_chart_indication()` for pathway hierarchy building
**Indication Validation Workflow (legacy, per-patient):**
1. Map drug → SNOMED cluster IDs (e.g., ADALIMUMAB → RARTH_COD, PSORIASIS_COD)
2. Get all SNOMED codes for those clusters
3. Check GP records (PrimaryCareClinicalCoding) for matching codes
@@ -369,18 +455,20 @@ The `department_identification()` function has 5 levels of fallback:
### File Tracking
- `processed_files` - Hash-based tracking for incremental loading
### Pathway Tables (New)
### Pathway Tables
- `pathway_date_filters` - 6 pre-defined date filter combinations
- Columns: `id`, `initiated`, `last_seen`, `is_default`, `description`
- Auto-populated via migration
- `pathway_nodes` - Pre-computed pathway hierarchy nodes
- `pathway_nodes` - Pre-computed pathway hierarchy nodes (~3,600 rows for 12 datasets)
- Routing: `chart_type` ("directory" or "indication"), `date_filter_id`
- Hierarchy: `parents`, `ids`, `labels`, `level`
- Metrics: `value`, `cost`, `costpp`, `cost_pp_pa`, `colour`
- Dates: `first_seen`, `last_seen`, `first_seen_parent`, `last_seen_parent`
- Statistics: `average_spacing`, `average_administered`, `avg_days`
- Denormalized: `trust_name`, `directory`, `drug_sequence`
- Foreign key: `date_filter_id``pathway_date_filters.id`
- Indexed for: date_filter_id, trust_name, directory, level
- Unique constraint: `UNIQUE(date_filter_id, chart_type, ids)` — critical for INSERT OR REPLACE correctness
- Indexed for: date_filter_id, chart_type, trust_name, directory, level
- `pathway_refresh_log` - Tracks data refresh status
- Columns: `refresh_id`, `started_at`, `completed_at`, `status`, `records_processed`, `error_message`
@@ -388,6 +476,7 @@ The `department_identification()` function has 5 levels of fallback:
The input data (CSV/Parquet) must contain columns including:
- `Provider Code`, `PersonKey` - Used to create UPID
- `PseudoNHSNoLinked` - NHS pseudonym for GP record matching (indication charts)
- `Drug Name`, `Intervention Date`, `Price Actual`
- `OrganisationName`
- Various `Additional Detail/Description` columns for directory extraction
@@ -395,7 +484,9 @@ The input data (CSV/Parquet) must contain columns including:
## Output
Interactive Plotly icicle chart showing:
Interactive Plotly icicle chart with toggleable views:
- **Directory view**: Trust → Directorate → Drug → Patient Pathway
- **Indication view**: Trust → GP Diagnosis (Search_Term) → Drug → Patient Pathway
- Patient counts and percentages at each hierarchy level
- Total and average costs
- Treatment duration and dosing frequency information
@@ -458,13 +549,23 @@ The pre-computed pathway architecture introduces these changes:
### State Variables
- **Removed**: `start_date`, `end_date`, `set_start_date()`, `set_end_date()`
- **Added**: `selected_initiated`, `selected_last_seen`, `date_filter_id`
- **Added**: `load_pathway_data()` - queries pre-computed `pathway_nodes`
- **Added**: `selected_chart_type` ("directory" or "indication"), `chart_type_options`
- **Added**: `set_chart_type()` - switches chart type and reloads data
- **Added**: `chart_hierarchy_label`, `chart_type_label` - computed vars for dynamic UI text
- **Added**: `load_pathway_data()` - queries pre-computed `pathway_nodes` filtered by `date_filter_id` AND `chart_type`
- **Added**: `recalculate_parent_totals()` - adjusts hierarchy after filtering
### Chart Type Toggle
- **New**: Segmented control ("By Directory" | "By Indication") in filter strip
- **Added**: `selected_chart_type` state variable, `set_chart_type()` handler
- **Added**: Dynamic hierarchy label ("Trust → Directorate → ..." or "Trust → Indication → ...")
- **Note**: Directory filter only applies to directory charts; for indication charts the `directory` column stores Search_Terms
### Icicle Chart
- **Enhanced**: Now includes full 10-field customdata structure
- **Added**: Treatment statistics (average_spacing, cost_pp_pa) in hover tooltips
- **Added**: First/last seen dates for drug nodes
- **Added**: Indication chart uses `generate_icicle_chart_indication()` with Search_Term hierarchy
## Development