From fed909481e4f6681d18895db5db33ee4200f4baa Mon Sep 17 00:00:00 2001 From: Andrew Charlwood Date: Thu, 5 Feb 2026 20:49:49 +0000 Subject: [PATCH] docs: update CLAUDE.md with indication chart architecture and CLI docs (Task 5.2) --- CLAUDE.md | 181 ++++++++++++++++++++++++++++++++--------- IMPLEMENTATION_PLAN.md | 8 +- progress.txt | 40 +++++++++ 3 files changed, 185 insertions(+), 44 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index c991afb..b0daee8 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,14 +4,15 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co ## Project Overview -NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that analyzes secondary care patient treatment pathways. It processes clinical activity data to visualize hierarchical treatment patterns (Trust → Directory/Specialty → Drug → Patient pathway) as interactive Plotly icicle charts. +NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that analyzes secondary care patient treatment pathways. It processes clinical activity data to visualize hierarchical treatment patterns as interactive Plotly icicle charts. **Key Features:** -- Multi-source data loading: CSV/Parquet files, SQLite database, Snowflake data warehouse +- **Dual chart types**: Directory-based (Trust → Directory → Drug → Pathway) and Indication-based (Trust → GP Diagnosis → Drug → Pathway) views with toggle - **Pre-computed pathway architecture**: Treatment pathways pre-processed and stored in SQLite for instant filtering -- GP diagnosis integration for indication validation via SNOMED clusters +- **GP diagnosis matching**: Patient indications matched from GP records using SNOMED cluster codes queried directly from Snowflake (~93% match rate) +- Multi-source data loading: CSV/Parquet files, SQLite database, Snowflake data warehouse - Interactive browser-based UI using Reflex framework -- 6 pre-defined date filter combinations with sub-50ms response times +- 6 pre-defined date filter combinations × 2 chart types = 12 pre-computed datasets with sub-50ms response times ## Running the Application @@ -37,11 +38,17 @@ The application requires Python 3.10+ and runs on http://localhost:3000 by defau **Refresh Pathway Data:** ```bash -# Full refresh with default filters (all trusts, default drugs) -python -m cli.refresh_pathways +# Full refresh — both chart types (directory + indication), all date filters +python -m cli.refresh_pathways --chart-type all + +# Directory charts only (faster, skips GP diagnosis lookup) +python -m cli.refresh_pathways --chart-type directory + +# Indication charts only +python -m cli.refresh_pathways --chart-type indication # Dry run (test without database changes) -python -m cli.refresh_pathways --dry-run -v +python -m cli.refresh_pathways --chart-type all --dry-run -v # Custom minimum patient threshold python -m cli.refresh_pathways --minimum-patients 10 @@ -50,11 +57,17 @@ python -m cli.refresh_pathways --minimum-patients 10 python -m cli.refresh_pathways --help ``` +The `--chart-type` argument controls which pathway types are processed: +- `all` (default) — generates both directory and indication charts (~15 minutes) +- `directory` — directory-based charts only (~5 minutes) +- `indication` — indication-based charts only (~12 minutes, includes GP lookup) + The refresh command: 1. Fetches activity data from Snowflake (656K+ records, ~7 seconds) 2. Applies UPID, drug name, and directory transformations (~6 minutes) -3. Processes 6 date filter combinations (all_6mo, all_12mo, 1yr_6mo, etc.) -4. Inserts pathway nodes to SQLite for fast Reflex filtering +3. For indication charts: queries GP records via SNOMED clusters (~9 minutes for 37K patients) +4. Processes 6 date filter combinations × selected chart types +5. Inserts pathway nodes to SQLite for fast Reflex filtering ## Architecture @@ -121,6 +134,15 @@ The application uses a pre-computed pathway architecture for performance: - **Simplicity**: Reflex filters pre-computed data with simple SQL WHERE clauses - **Full Pathways**: Sequential treatment pathways (drug_0 → drug_1 → drug_2...) with statistics +**Chart Types:** + +| Type | Hierarchy | Level 2 Source | +|------|-----------|----------------| +| `directory` | Trust → Directory → Drug → Pathway | Assigned directorate (5-level fallback) | +| `indication` | Trust → GP Diagnosis → Drug → Pathway | SNOMED cluster Search_Term from GP records | + +For indication charts, ~93% of patients are matched to a GP diagnosis (Search_Term). Unmatched patients use their directorate as a fallback label (e.g., "RHEUMATOLOGY (no GP dx)"). + **Date Filter Combinations:** | ID | Initiated | Last Seen | Default | |----|-----------|-----------|---------| @@ -131,14 +153,18 @@ The application uses a pre-computed pathway architecture for performance: | `2yr_6mo` | Last 2 years | Last 6 months | No | | `2yr_12mo` | Last 2 years | Last 12 months | No | +Total pre-computed datasets: 6 date filters × 2 chart types = 12 datasets (~3,600 pathway nodes). + **Pathway Node Structure:** Each node in `pathway_nodes` contains: -- Hierarchy: `parents`, `ids`, `labels`, `level` (0=Root, 1=Trust, 2=Directory, 3=Drug, 4+=Pathway) +- Routing: `chart_type` ("directory" or "indication"), `date_filter_id` +- Hierarchy: `parents`, `ids`, `labels`, `level` (0=Root, 1=Trust, 2=Directory/Indication, 3=Drug, 4+=Pathway) - Counts: `value` (patient count) - Costs: `cost`, `costpp`, `cost_pp_pa` (per patient per annum) - Dates: `first_seen`, `last_seen`, `first_seen_parent`, `last_seen_parent` - Statistics: `average_spacing`, `average_administered`, `avg_days` - Denormalized: `trust_name`, `directory`, `drug_sequence` (for efficient filtering) +- Unique constraint: `UNIQUE(date_filter_id, chart_type, ids)` ### Core Module (`core/`) @@ -166,10 +192,15 @@ Each node in `pathway_nodes` contains: - `DATE_FILTER_CONFIGS` - All 6 pre-defined date combinations - `compute_date_ranges(config, max_date)` - Computes actual ISO dates from config - `fetch_and_transform_data()` - Snowflake fetch + UPID/drug/directory transformations -- `process_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart()` -- `extract_denormalized_fields()` - Parses `ids` column to extract trust, directory, drug_sequence -- `convert_to_records()` - Converts ice_df to list of dicts for SQLite insertion -- `process_all_date_filters()` - Convenience function to process all 6 filters +- Directory chart functions: + - `process_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart()` + - `extract_denormalized_fields()` - Parses `ids` column to extract trust, directory, drug_sequence +- Indication chart functions: + - `process_indication_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart_indication()` + - `extract_indication_fields()` - Parses `ids` for indication charts (trust, search_term, drug_sequence) +- Shared functions: + - `convert_to_records(ice_df, chart_type)` - Converts ice_df to list of dicts with `chart_type` column + - `process_all_date_filters()` - Convenience function to process all 6 filters **Data Loaders:** - `FileDataLoader` - Loads from CSV/Parquet files @@ -182,20 +213,25 @@ Each node in `pathway_nodes` contains: - Query caching with TTL-based invalidation - Fallback chain: cache → Snowflake → local files -**GP Diagnosis Validation:** -- Uses pre-built SNOMED clusters from `ClinicalCodingClusterSnomedCodes` -- `patient_has_indication(patient_pseudonym, cluster_ids)` checks GP records -- `validate_indication(patient_pseudonym, drug_name)` returns full validation result -- Adds `Indication_Source` column: "GP_SNOMED" | "HCD_SNOMED" | "NONE" +**GP Diagnosis Lookup (`diagnosis_lookup.py`):** +- `CLUSTER_MAPPING_SQL` - Embedded SQL constant with ~148 Search_Term → Cluster_ID mappings plus explicit SNOMED codes +- `get_patient_indication_groups(patient_pseudonyms)` - Batch queries Snowflake to match patients to GP diagnoses: + - Embeds cluster mapping as CTE, joins with `PrimaryCareClinicalCoding` + - Uses `PseudoNHSNoLinked` (not PersonKey) to match `PatientPseudonym` in GP records + - Returns most recent match per patient via `QUALIFY ROW_NUMBER()` + - Batches 500 patients per query, returns DataFrame with PatientPseudonym, Search_Term, EventDateTime +- `patient_has_indication(patient_pseudonym, cluster_ids)` - Single-patient GP record check (legacy) +- `validate_indication(patient_pseudonym, drug_name)` - Full validation result with source tracking (legacy) ### Analysis Module (`analysis/`) Refactored from the original 267-line `generate_graph()` function: -- **prepare_data()** - Filter DataFrame by date range, trusts, drugs, directories +- **prepare_data()** - Filter DataFrame by date range, trusts, drugs, directories (copies df to prevent mutation) - **calculate_statistics()** - Compute frequency, cost, duration statistics - **build_hierarchy()** - Create Trust → Directory → Drug → Pathway structure - **prepare_chart_data()** - Format data for Plotly icicle chart +- **generate_icicle_chart_indication(df, indication_df, ...)** - Build indication-based hierarchy using Search_Term instead of Directory. Takes an `indication_df` (UPID → Search_Term mapping) alongside the main activity DataFrame. ### Visualization Module (`visualization/`) @@ -205,12 +241,20 @@ Refactored from the original 267-line `generate_graph()` function: ### Reflex Application (`pathways_app/`) -The `State` class manages all application state: +The `AppState` class manages all application state: +- **Chart type**: `selected_chart_type` ("directory" or "indication"), toggled via `set_chart_type()` +- **Computed vars**: `chart_hierarchy_label` (dynamic "Trust → Directorate → ..." or "Trust → Indication → ..."), `chart_type_label` - Filter variables: dates, drugs, trusts, directories - Reference data: available options loaded from CSV/SQLite - Analysis state: running flag, status messages, chart data - Data source state: file path, source type, row counts +**Chart Type Toggle** (`chart_type_toggle()` component): +- Segmented control with "By Directory" and "By Indication" pill buttons +- Placed first in the filter strip before date filters +- Switching reloads pathway data from SQLite filtered by `chart_type` +- Note: Directory filter only applies to directory charts (indication charts store Search_Terms in the directory column) + ### Legacy Modules (`tools/`) Still used during transition: @@ -227,7 +271,7 @@ Still used during transition: **Pre-Computed Pathway Architecture (Current):** ``` -[CLI: python -m cli.refresh_pathways] +[CLI: python -m cli.refresh_pathways --chart-type all] Snowflake Data Warehouse │ @@ -239,19 +283,42 @@ Still used during transition: │ → department_identification() → Dir │ └──────────────────────────────────────────┘ │ - ▼ (process_all_date_filters) - ┌──────────────────────────────────────────┐ - │ Pathway Pipeline (pathway_pipeline.py) │ - │ → For each of 6 date filter combos: │ - │ → generate_icicle_chart() │ - │ → extract_denormalized_fields() │ - │ → convert_to_records() │ - └──────────────────────────────────────────┘ - │ - ▼ (insert_pathway_records) + ├─── Directory Charts ──────────────────────────────────────┐ + │ │ + │ ┌──────────────────────────────────────────┐ │ + │ │ For each of 6 date filter combos: │ │ + │ │ → generate_icicle_chart() │ │ + │ │ → extract_denormalized_fields() │ │ + │ │ → convert_to_records("directory") │ │ + │ └──────────────────────────────────────────┘ │ + │ │ + ├─── Indication Charts ─────────────────────────────────────┤ + │ │ + │ ┌──────────────────────────────────────────┐ │ + │ │ GP Diagnosis Lookup (diagnosis_lookup.py)│ │ + │ │ → Extract PseudoNHSNoLinked from HCD │ │ + │ │ → get_patient_indication_groups() │ │ + │ │ (SNOMED cluster CTE + GP records) │ │ + │ │ → Build indication_df: UPID → Search │ │ + │ │ Term (matched) or Directorate (no GP)│ │ + │ └──────────────────────────────────────────┘ │ + │ │ │ + │ ▼ │ + │ ┌──────────────────────────────────────────┐ │ + │ │ For each of 6 date filter combos: │ │ + │ │ → generate_icicle_chart_indication() │ │ + │ │ → extract_indication_fields() │ │ + │ │ → convert_to_records("indication") │ │ + │ └──────────────────────────────────────────┘ │ + │ │ + └───────────────────────┬───────────────────────────────────┘ + │ + ▼ (insert_pathway_records) ┌──────────────────────────────────────────┐ │ SQLite: pathway_nodes table │ - │ → 293 nodes for all_6mo filter │ + │ → ~3,600 nodes across 12 datasets │ + │ → UNIQUE(date_filter_id, chart_type, │ + │ ids) prevents cross-type overwrites │ │ → Indexed for fast filtering │ └──────────────────────────────────────────┘ @@ -259,8 +326,16 @@ Still used during transition: [Reflex App: reflex run] ┌──────────────────────────────────────────┐ + │ Chart Type Toggle (segmented control) │ + │ → "By Directory" | "By Indication" │ + │ → Triggers set_chart_type() handler │ + └──────────────────────────────────────────┘ + │ + ▼ + ┌──────────────────────────────────────────┐ │ AppState.load_pathway_data() │ - │ → Query pathway_nodes WHERE date_filter│ + │ → Query pathway_nodes WHERE │ + │ date_filter AND chart_type │ │ → Apply drug/directory filters │ │ → recalculate_parent_totals() │ └──────────────────────────────────────────┘ @@ -278,6 +353,7 @@ Still used during transition: │ Reflex UI (rx.plotly component) │ │ → <50ms filter response time │ │ → Treatment statistics in tooltips │ + │ → Dynamic hierarchy label updates │ └──────────────────────────────────────────┘ ``` @@ -339,7 +415,17 @@ The `department_identification()` function has 5 levels of fallback: 4. **UPID_INFERENCE** - Inferred from other records with same UPID 5. **UNDEFINED** - No directory could be determined -**Indication Validation Workflow:** +**Indication Lookup Workflow (for indication charts):** +1. Extract unique `PseudoNHSNoLinked` values from HCD activity data +2. Query Snowflake in batches of 500 patients: + - Embed `CLUSTER_MAPPING_SQL` (~148 Search_Term → Cluster_ID mappings) as CTE + - Join `ClinicalCodingClusterSnomedCodes` to get SNOMED codes per cluster + - Join `PrimaryCareClinicalCoding` on `PatientPseudonym` = `PseudoNHSNoLinked` + - Use `QUALIFY ROW_NUMBER() OVER (PARTITION BY PatientPseudonym ORDER BY EventDateTime DESC) = 1` for most recent match +3. Build `indication_df` mapping UPID → Search_Term (matched) or Directorate + " (no GP dx)" (unmatched) +4. Pass to `generate_icicle_chart_indication()` for pathway hierarchy building + +**Indication Validation Workflow (legacy, per-patient):** 1. Map drug → SNOMED cluster IDs (e.g., ADALIMUMAB → RARTH_COD, PSORIASIS_COD) 2. Get all SNOMED codes for those clusters 3. Check GP records (PrimaryCareClinicalCoding) for matching codes @@ -369,18 +455,20 @@ The `department_identification()` function has 5 levels of fallback: ### File Tracking - `processed_files` - Hash-based tracking for incremental loading -### Pathway Tables (New) +### Pathway Tables - `pathway_date_filters` - 6 pre-defined date filter combinations - Columns: `id`, `initiated`, `last_seen`, `is_default`, `description` - Auto-populated via migration -- `pathway_nodes` - Pre-computed pathway hierarchy nodes +- `pathway_nodes` - Pre-computed pathway hierarchy nodes (~3,600 rows for 12 datasets) + - Routing: `chart_type` ("directory" or "indication"), `date_filter_id` - Hierarchy: `parents`, `ids`, `labels`, `level` - Metrics: `value`, `cost`, `costpp`, `cost_pp_pa`, `colour` - Dates: `first_seen`, `last_seen`, `first_seen_parent`, `last_seen_parent` - Statistics: `average_spacing`, `average_administered`, `avg_days` - Denormalized: `trust_name`, `directory`, `drug_sequence` - Foreign key: `date_filter_id` → `pathway_date_filters.id` - - Indexed for: date_filter_id, trust_name, directory, level + - Unique constraint: `UNIQUE(date_filter_id, chart_type, ids)` — critical for INSERT OR REPLACE correctness + - Indexed for: date_filter_id, chart_type, trust_name, directory, level - `pathway_refresh_log` - Tracks data refresh status - Columns: `refresh_id`, `started_at`, `completed_at`, `status`, `records_processed`, `error_message` @@ -388,6 +476,7 @@ The `department_identification()` function has 5 levels of fallback: The input data (CSV/Parquet) must contain columns including: - `Provider Code`, `PersonKey` - Used to create UPID +- `PseudoNHSNoLinked` - NHS pseudonym for GP record matching (indication charts) - `Drug Name`, `Intervention Date`, `Price Actual` - `OrganisationName` - Various `Additional Detail/Description` columns for directory extraction @@ -395,7 +484,9 @@ The input data (CSV/Parquet) must contain columns including: ## Output -Interactive Plotly icicle chart showing: +Interactive Plotly icicle chart with toggleable views: +- **Directory view**: Trust → Directorate → Drug → Patient Pathway +- **Indication view**: Trust → GP Diagnosis (Search_Term) → Drug → Patient Pathway - Patient counts and percentages at each hierarchy level - Total and average costs - Treatment duration and dosing frequency information @@ -458,13 +549,23 @@ The pre-computed pathway architecture introduces these changes: ### State Variables - **Removed**: `start_date`, `end_date`, `set_start_date()`, `set_end_date()` - **Added**: `selected_initiated`, `selected_last_seen`, `date_filter_id` -- **Added**: `load_pathway_data()` - queries pre-computed `pathway_nodes` +- **Added**: `selected_chart_type` ("directory" or "indication"), `chart_type_options` +- **Added**: `set_chart_type()` - switches chart type and reloads data +- **Added**: `chart_hierarchy_label`, `chart_type_label` - computed vars for dynamic UI text +- **Added**: `load_pathway_data()` - queries pre-computed `pathway_nodes` filtered by `date_filter_id` AND `chart_type` - **Added**: `recalculate_parent_totals()` - adjusts hierarchy after filtering +### Chart Type Toggle +- **New**: Segmented control ("By Directory" | "By Indication") in filter strip +- **Added**: `selected_chart_type` state variable, `set_chart_type()` handler +- **Added**: Dynamic hierarchy label ("Trust → Directorate → ..." or "Trust → Indication → ...") +- **Note**: Directory filter only applies to directory charts; for indication charts the `directory` column stores Search_Terms + ### Icicle Chart - **Enhanced**: Now includes full 10-field customdata structure - **Added**: Treatment statistics (average_spacing, cost_pp_pa) in hover tooltips - **Added**: First/last seen dates for drug nodes +- **Added**: Indication chart uses `generate_icicle_chart_indication()` with Search_Term hierarchy ## Development diff --git a/IMPLEMENTATION_PLAN.md b/IMPLEMENTATION_PLAN.md index c09842a..088a201 100644 --- a/IMPLEMENTATION_PLAN.md +++ b/IMPLEMENTATION_PLAN.md @@ -158,10 +158,10 @@ python -m reflex compile - reflex run crashes on Windows due to Granian/watchfiles FileNotFoundError (environment issue, not code) ### 5.2 Update Documentation -- [ ] Update CLAUDE.md with new architecture -- [ ] Document new CLI arguments -- [ ] Document chart_type toggle behavior -- [ ] Update data flow diagrams +- [x] Update CLAUDE.md with new architecture +- [x] Document new CLI arguments +- [x] Document chart_type toggle behavior +- [x] Update data flow diagrams --- diff --git a/progress.txt b/progress.txt index 70f7ec2..f3554c3 100644 --- a/progress.txt +++ b/progress.txt @@ -416,3 +416,43 @@ The previous `batch_lookup_indication_groups()` function in `diagnosis_lookup.py - The app is feature-complete — only documentation and optional visual polish remain ### Blocked items: - Visual testing at multiple viewport sizes blocked by Granian/watchfiles Windows crash + +## Iteration 8 — 2026-02-05 +### Task: 5.2 Update Documentation +### Why this task: +- This is the only remaining non-blocked task in the implementation plan +- Previous iteration explicitly recommended completing Task 5.2 next +- All code is complete and validated — only documentation remains +### Status: COMPLETE +### What was done: +Updated CLAUDE.md comprehensively to document all indication chart features: + +1. **Project Overview** — Added dual chart types as key feature, GP diagnosis matching with ~93% match rate +2. **CLI Commands** — Documented `--chart-type` argument (all/directory/indication) with timing estimates +3. **Pathway Data Architecture** — Added chart types table, indication hierarchy, chart_type in node structure, UNIQUE constraint +4. **Data Processing Module** — Documented `get_patient_indication_groups()`, `CLUSTER_MAPPING_SQL`, indication pipeline functions +5. **Analysis Module** — Added `generate_icicle_chart_indication()` documentation +6. **Reflex Application** — Documented chart type state, toggle component, known limitation (directory filter) +7. **Data Flow Diagram** — Rewrote to show dual chart type processing with GP diagnosis lookup branch +8. **Database Schema** — Added `chart_type` column, updated UNIQUE constraint, indexes +9. **Key Patterns** — Added indication lookup workflow, kept legacy validation workflow +10. **Breaking Changes** — Added chart type toggle section, updated state variables +11. **Input Data** — Added `PseudoNHSNoLinked` column requirement +12. **Output** — Updated to describe both chart views + +### Validation results: +- Tier 1 (Code): N/A (documentation only) +- Tier 2 (Data): N/A +- Tier 3 (Functional): N/A +### Files changed: +- `CLAUDE.md` — comprehensive documentation updates for indication chart features +- `IMPLEMENTATION_PLAN.md` — marked Task 5.2 items complete +### Committed: 315065c "docs: update CLAUDE.md with indication chart architecture and CLI docs (Task 5.2)" +### Patterns discovered: +- None (documentation task) +### Next iteration should: +- All implementation plan tasks are now complete except the deferred viewport testing (blocked by Granian crash) +- Check if `COMPLETE` signal can be issued (all tasks [x] except one deferred sub-item) +- If the deferred viewport testing sub-item prevents completion, consider whether it should be marked [B] (blocked) +### Blocked items: +- Task 5.1 sub-item "Test at multiple viewport sizes" still blocked by Granian/watchfiles Windows crash