From fed909481e4f6681d18895db5db33ee4200f4baa Mon Sep 17 00:00:00 2001
From: Andrew Charlwood <andrew.charlwood@nhs.net>
Date: Thu, 5 Feb 2026 20:49:49 +0000
Subject: [PATCH] docs: update CLAUDE.md with indication chart architecture and
 CLI docs (Task 5.2)

---
 CLAUDE.md              | 181 ++++++++++++++++++++++++++++++++---------
 IMPLEMENTATION_PLAN.md |   8 +-
 progress.txt           |  40 +++++++++
 3 files changed, 185 insertions(+), 44 deletions(-)
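
Reviewer note: the `UNIQUE(date_filter_id, chart_type, ids)` constraint documented in this patch matters because some nodes (the root, for example) can carry the same `ids` value in both chart types. Below is a minimal sketch of that behaviour. It assumes a cut-down, hypothetical version of `pathway_nodes` (four columns, an in-memory database, made-up `ids` and counts); the real schema and insert code may differ.

```python
import sqlite3

# Hypothetical, reduced schema: only the columns needed to show the constraint.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pathway_nodes (
        date_filter_id TEXT NOT NULL,
        chart_type     TEXT NOT NULL,
        ids            TEXT NOT NULL,
        value          INTEGER,
        UNIQUE (date_filter_id, chart_type, ids)
    )
    """
)

# Illustrative rows only; the ids and patient counts are invented.
rows = [
    ("all_6mo", "directory", "ROOT", 100),
    ("all_6mo", "indication", "ROOT", 93),   # same ids, different chart type: kept
    ("all_6mo", "directory", "ROOT", 120),   # same key as the first row: replaces it
]
conn.executemany(
    "INSERT OR REPLACE INTO pathway_nodes VALUES (?, ?, ?, ?)", rows
)


def load_nodes(date_filter_id, chart_type):
    """Return (ids, value) pairs for one pre-computed dataset."""
    cur = conn.execute(
        "SELECT ids, value FROM pathway_nodes "
        "WHERE date_filter_id = ? AND chart_type = ? ORDER BY ids",
        (date_filter_id, chart_type),
    )
    return cur.fetchall()


directory_nodes = load_nodes("all_6mo", "directory")
indication_nodes = load_nodes("all_6mo", "indication")
total_rows = conn.execute("SELECT COUNT(*) FROM pathway_nodes").fetchone()[0]
```

If `chart_type` were left out of the constraint, each insert in this sketch would replace the previous row and `total_rows` would be 1 instead of 2.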

diff --git a/CLAUDE.md b/CLAUDE.md
index c991afb..b0daee8 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,14 +4,15 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 ## Project Overview
 
-NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that analyzes secondary care patient treatment pathways. It processes clinical activity data to visualize hierarchical treatment patterns (Trust → Directory/Specialty → Drug → Patient pathway) as interactive Plotly icicle charts.
+NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that analyzes secondary care patient treatment pathways. It processes clinical activity data to visualize hierarchical treatment patterns as interactive Plotly icicle charts.
 
 **Key Features:**
-- Multi-source data loading: CSV/Parquet files, SQLite database, Snowflake data warehouse
+- **Dual chart types**: Directory-based (Trust → Directory → Drug → Pathway) and Indication-based (Trust → GP Diagnosis → Drug → Pathway) views with toggle
 - **Pre-computed pathway architecture**: Treatment pathways pre-processed and stored in SQLite for instant filtering
-- GP diagnosis integration for indication validation via SNOMED clusters
+- **GP diagnosis matching**: Patient indications matched from GP records using SNOMED cluster codes queried directly from Snowflake (~93% match rate)
+- Multi-source data loading: CSV/Parquet files, SQLite database, Snowflake data warehouse
 - Interactive browser-based UI using Reflex framework
-- 6 pre-defined date filter combinations with sub-50ms response times
+- 6 pre-defined date filter combinations × 2 chart types = 12 pre-computed datasets with sub-50ms response times
 
 ## Running the Application
 
@@ -37,11 +38,17 @@ The application requires Python 3.10+ and runs on http://localhost:3000 by defau
 
 **Refresh Pathway Data:**
 ```bash
-# Full refresh with default filters (all trusts, default drugs)
-python -m cli.refresh_pathways
+# Full refresh — both chart types (directory + indication), all date filters
+python -m cli.refresh_pathways --chart-type all
+
+# Directory charts only (faster, skips GP diagnosis lookup)
+python -m cli.refresh_pathways --chart-type directory
+
+# Indication charts only
+python -m cli.refresh_pathways --chart-type indication
 
 # Dry run (test without database changes)
-python -m cli.refresh_pathways --dry-run -v
+python -m cli.refresh_pathways --chart-type all --dry-run -v
 
 # Custom minimum patient threshold
 python -m cli.refresh_pathways --minimum-patients 10
@@ -50,11 +57,17 @@ python -m cli.refresh_pathways --minimum-patients 10
 python -m cli.refresh_pathways --help
 ```
 
+The `--chart-type` argument controls which pathway types are processed:
+- `all` (default) — generates both directory and indication charts (~15 minutes)
+- `directory` — directory-based charts only (~5 minutes)
+- `indication` — indication-based charts only (~12 minutes, includes GP lookup)
+
 The refresh command:
 1. Fetches activity data from Snowflake (656K+ records, ~7 seconds)
 2. Applies UPID, drug name, and directory transformations (~6 minutes)
-3. Processes 6 date filter combinations (all_6mo, all_12mo, 1yr_6mo, etc.)
-4. Inserts pathway nodes to SQLite for fast Reflex filtering
+3. For indication charts, queries GP records via SNOMED clusters (~9 minutes for 37K patients)
+4. Processes 6 date filter combinations × selected chart types
+5. Inserts pathway nodes to SQLite for fast Reflex filtering
 
 ## Architecture
 
@@ -121,6 +134,15 @@ The application uses a pre-computed pathway architecture for performance:
 - **Simplicity**: Reflex filters pre-computed data with simple SQL WHERE clauses
 - **Full Pathways**: Sequential treatment pathways (drug_0 → drug_1 → drug_2...) with statistics
 
+**Chart Types:**
+
+| Type | Hierarchy | Level 2 Source |
+|------|-----------|----------------|
+| `directory` | Trust → Directory → Drug → Pathway | Assigned directorate (5-level fallback) |
+| `indication` | Trust → GP Diagnosis → Drug → Pathway | SNOMED cluster Search_Term from GP records |
+
+For indication charts, ~93% of patients are matched to a GP diagnosis (Search_Term). Unmatched patients use their directorate as a fallback label (e.g., "RHEUMATOLOGY (no GP dx)").
+
 **Date Filter Combinations:**
 | ID | Initiated | Last Seen | Default |
 |----|-----------|-----------|---------|
@@ -131,14 +153,18 @@ The application uses a pre-computed pathway architecture for performance:
 | `2yr_6mo` | Last 2 years | Last 6 months | No |
 | `2yr_12mo` | Last 2 years | Last 12 months | No |
 
+Total pre-computed datasets: 6 date filters × 2 chart types = 12 datasets (~3,600 pathway nodes).
+
 **Pathway Node Structure:**
 Each node in `pathway_nodes` contains:
-- Hierarchy: `parents`, `ids`, `labels`, `level` (0=Root, 1=Trust, 2=Directory, 3=Drug, 4+=Pathway)
+- Routing: `chart_type` ("directory" or "indication"), `date_filter_id`
+- Hierarchy: `parents`, `ids`, `labels`, `level` (0=Root, 1=Trust, 2=Directory/Indication, 3=Drug, 4+=Pathway)
 - Counts: `value` (patient count)
 - Costs: `cost`, `costpp`, `cost_pp_pa` (per patient per annum)
 - Dates: `first_seen`, `last_seen`, `first_seen_parent`, `last_seen_parent`
 - Statistics: `average_spacing`, `average_administered`, `avg_days`
 - Denormalized: `trust_name`, `directory`, `drug_sequence` (for efficient filtering)
+- Unique constraint: `UNIQUE(date_filter_id, chart_type, ids)`
 
 ### Core Module (`core/`)
 
@@ -166,10 +192,15 @@ Each node in `pathway_nodes` contains:
 - `DATE_FILTER_CONFIGS` - All 6 pre-defined date combinations
 - `compute_date_ranges(config, max_date)` - Computes actual ISO dates from config
 - `fetch_and_transform_data()` - Snowflake fetch + UPID/drug/directory transformations
-- `process_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart()`
-- `extract_denormalized_fields()` - Parses `ids` column to extract trust, directory, drug_sequence
-- `convert_to_records()` - Converts ice_df to list of dicts for SQLite insertion
-- `process_all_date_filters()` - Convenience function to process all 6 filters
+- Directory chart functions:
+  - `process_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart()`
+  - `extract_denormalized_fields()` - Parses `ids` column to extract trust, directory, drug_sequence
+- Indication chart functions:
+  - `process_indication_pathway_for_date_filter()` - Processes single date filter using `generate_icicle_chart_indication()`
+  - `extract_indication_fields()` - Parses `ids` for indication charts (trust, search_term, drug_sequence)
+- Shared functions:
+  - `convert_to_records(ice_df, chart_type)` - Converts ice_df to list of dicts with `chart_type` column
+  - `process_all_date_filters()` - Convenience function to process all 6 filters
 
 **Data Loaders:**
 - `FileDataLoader` - Loads from CSV/Parquet files
@@ -182,20 +213,25 @@ Each node in `pathway_nodes` contains:
 - Query caching with TTL-based invalidation
 - Fallback chain: cache → Snowflake → local files
 
-**GP Diagnosis Validation:**
-- Uses pre-built SNOMED clusters from `ClinicalCodingClusterSnomedCodes`
-- `patient_has_indication(patient_pseudonym, cluster_ids)` checks GP records
-- `validate_indication(patient_pseudonym, drug_name)` returns full validation result
-- Adds `Indication_Source` column: "GP_SNOMED" | "HCD_SNOMED" | "NONE"
+**GP Diagnosis Lookup (`diagnosis_lookup.py`):**
+- `CLUSTER_MAPPING_SQL` - Embedded SQL constant with ~148 Search_Term → Cluster_ID mappings plus explicit SNOMED codes
+- `get_patient_indication_groups(patient_pseudonyms)` - Batch queries Snowflake to match patients to GP diagnoses:
+  - Embeds cluster mapping as CTE, joins with `PrimaryCareClinicalCoding`
+  - Uses `PseudoNHSNoLinked` (not PersonKey) to match `PatientPseudonym` in GP records
+  - Returns most recent match per patient via `QUALIFY ROW_NUMBER()`
+  - Batches 500 patients per query; returns a DataFrame with `PatientPseudonym`, `Search_Term`, and `EventDateTime` columns
+- `patient_has_indication(patient_pseudonym, cluster_ids)` - Single-patient GP record check (legacy)
+- `validate_indication(patient_pseudonym, drug_name)` - Full validation result with source tracking (legacy)
 
 ### Analysis Module (`analysis/`)
 
 Refactored from the original 267-line `generate_graph()` function:
 
-- **prepare_data()** - Filter DataFrame by date range, trusts, drugs, directories
+- **prepare_data()** - Filter DataFrame by date range, trusts, drugs, directories (copies df to prevent mutation)
 - **calculate_statistics()** - Compute frequency, cost, duration statistics
 - **build_hierarchy()** - Create Trust → Directory → Drug → Pathway structure
 - **prepare_chart_data()** - Format data for Plotly icicle chart
+- **generate_icicle_chart_indication(df, indication_df, ...)** - Build indication-based hierarchy using Search_Term instead of Directory. Takes an `indication_df` (UPID → Search_Term mapping) alongside the main activity DataFrame.
 
 ### Visualization Module (`visualization/`)
 
@@ -205,12 +241,20 @@ Refactored from the original 267-line `generate_graph()` function:
 
 ### Reflex Application (`pathways_app/`)
 
-The `State` class manages all application state:
+The `AppState` class manages all application state:
+- **Chart type**: `selected_chart_type` ("directory" or "indication"), toggled via `set_chart_type()`
+- **Computed vars**: `chart_hierarchy_label` (dynamic "Trust → Directorate → ..." or "Trust → Indication → ..."), `chart_type_label`
 - Filter variables: dates, drugs, trusts, directories
 - Reference data: available options loaded from CSV/SQLite
 - Analysis state: running flag, status messages, chart data
 - Data source state: file path, source type, row counts
 
+**Chart Type Toggle** (`chart_type_toggle()` component):
+- Segmented control with "By Directory" and "By Indication" pill buttons
+- Placed first in the filter strip before date filters
+- Switching reloads pathway data from SQLite filtered by `chart_type`
+- Note: The directory filter applies only to directory charts, because indication charts store Search_Term values in the `directory` column
+
 ### Legacy Modules (`tools/`)
 
 Still used during transition:
@@ -227,7 +271,7 @@ Still used during transition:
 **Pre-Computed Pathway Architecture (Current):**
 
 ```
-[CLI: python -m cli.refresh_pathways]
+[CLI: python -m cli.refresh_pathways --chart-type all]
 
     Snowflake Data Warehouse
            │
@@ -239,19 +283,42 @@ Still used during transition:
     │   → department_identification() → Dir    │
     └──────────────────────────────────────────┘
            │
-           ▼ (process_all_date_filters)
-    ┌──────────────────────────────────────────┐
-    │ Pathway Pipeline (pathway_pipeline.py)   │
-    │   → For each of 6 date filter combos:    │
-    │     → generate_icicle_chart()            │
-    │     → extract_denormalized_fields()      │
-    │     → convert_to_records()               │
-    └──────────────────────────────────────────┘
-           │
-           ▼ (insert_pathway_records)
+           ├─── Directory Charts ──────────────────────────────────────┐
+           │                                                           │
+           │    ┌──────────────────────────────────────────┐           │
+           │    │ For each of 6 date filter combos:        │           │
+           │    │   → generate_icicle_chart()              │           │
+           │    │   → extract_denormalized_fields()        │           │
+           │    │   → convert_to_records("directory")      │           │
+           │    └──────────────────────────────────────────┘           │
+           │                                                           │
+           ├─── Indication Charts ─────────────────────────────────────┤
+           │                                                           │
+           │    ┌──────────────────────────────────────────┐           │
+           │    │ GP Diagnosis Lookup (diagnosis_lookup.py)│           │
+           │    │   → Extract PseudoNHSNoLinked from HCD   │           │
+           │    │   → get_patient_indication_groups()      │           │
+           │    │     (SNOMED cluster CTE + GP records)    │           │
+           │    │   → Build indication_df: UPID → Search   │           │
+           │    │     Term (matched) or Directorate (no GP)│           │
+           │    └──────────────────────────────────────────┘           │
+           │                        │                                  │
+           │                        ▼                                  │
+           │    ┌──────────────────────────────────────────┐           │
+           │    │ For each of 6 date filter combos:        │           │
+           │    │   → generate_icicle_chart_indication()   │           │
+           │    │   → extract_indication_fields()          │           │
+           │    │   → convert_to_records("indication")     │           │
+           │    └──────────────────────────────────────────┘           │
+           │                                                           │
+           └───────────────────────┬───────────────────────────────────┘
+                                   │
+                                   ▼ (insert_pathway_records)
     ┌──────────────────────────────────────────┐
     │ SQLite: pathway_nodes table              │
-    │   → 293 nodes for all_6mo filter         │
+    │   → ~3,600 nodes across 12 datasets      │
+    │   → UNIQUE(date_filter_id, chart_type,   │
+    │     ids) prevents cross-type overwrites  │
     │   → Indexed for fast filtering           │
     └──────────────────────────────────────────┘
 
@@ -259,8 +326,16 @@ Still used during transition:
 [Reflex App: reflex run]
 
     ┌──────────────────────────────────────────┐
+    │ Chart Type Toggle (segmented control)    │
+    │   → "By Directory" | "By Indication"     │
+    │   → Triggers set_chart_type() handler    │
+    └──────────────────────────────────────────┘
+           │
+           ▼
+    ┌──────────────────────────────────────────┐
     │ AppState.load_pathway_data()             │
-    │   → Query pathway_nodes WHERE date_filter│
+    │   → Query pathway_nodes WHERE            │
+    │     date_filter AND chart_type           │
     │   → Apply drug/directory filters         │
     │   → recalculate_parent_totals()          │
     └──────────────────────────────────────────┘
@@ -278,6 +353,7 @@ Still used during transition:
     │ Reflex UI (rx.plotly component)          │
     │   → <50ms filter response time           │
     │   → Treatment statistics in tooltips     │
+    │   → Dynamic hierarchy label updates      │
     └──────────────────────────────────────────┘
 ```
 
@@ -339,7 +415,17 @@ The `department_identification()` function has 5 levels of fallback:
 4. **UPID_INFERENCE** - Inferred from other records with same UPID
 5. **UNDEFINED** - No directory could be determined
 
-**Indication Validation Workflow:**
+**Indication Lookup Workflow (for indication charts):**
+1. Extract unique `PseudoNHSNoLinked` values from HCD activity data
+2. Query Snowflake in batches of 500 patients:
+   - Embed `CLUSTER_MAPPING_SQL` (~148 Search_Term → Cluster_ID mappings) as CTE
+   - Join `ClinicalCodingClusterSnomedCodes` to get SNOMED codes per cluster
+   - Join `PrimaryCareClinicalCoding` on `PatientPseudonym` = `PseudoNHSNoLinked`
+   - Use `QUALIFY ROW_NUMBER() OVER (PARTITION BY PatientPseudonym ORDER BY EventDateTime DESC) = 1` for most recent match
+3. Build `indication_df` mapping UPID → Search_Term (matched) or Directorate + " (no GP dx)" (unmatched)
+4. Pass to `generate_icicle_chart_indication()` for pathway hierarchy building
+
+**Indication Validation Workflow (legacy, per-patient):**
 1. Map drug → SNOMED cluster IDs (e.g., ADALIMUMAB → RARTH_COD, PSORIASIS_COD)
 2. Get all SNOMED codes for those clusters
 3. Check GP records (PrimaryCareClinicalCoding) for matching codes
@@ -369,18 +455,20 @@ The `department_identification()` function has 5 levels of fallback:
 ### File Tracking
 - `processed_files` - Hash-based tracking for incremental loading
 
-### Pathway Tables (New)
+### Pathway Tables
 - `pathway_date_filters` - 6 pre-defined date filter combinations
   - Columns: `id`, `initiated`, `last_seen`, `is_default`, `description`
   - Auto-populated via migration
-- `pathway_nodes` - Pre-computed pathway hierarchy nodes
+- `pathway_nodes` - Pre-computed pathway hierarchy nodes (~3,600 rows for 12 datasets)
+  - Routing: `chart_type` ("directory" or "indication"), `date_filter_id`
   - Hierarchy: `parents`, `ids`, `labels`, `level`
   - Metrics: `value`, `cost`, `costpp`, `cost_pp_pa`, `colour`
   - Dates: `first_seen`, `last_seen`, `first_seen_parent`, `last_seen_parent`
   - Statistics: `average_spacing`, `average_administered`, `avg_days`
   - Denormalized: `trust_name`, `directory`, `drug_sequence`
   - Foreign key: `date_filter_id` → `pathway_date_filters.id`
-  - Indexed for: date_filter_id, trust_name, directory, level
+  - Unique constraint: `UNIQUE(date_filter_id, chart_type, ids)` — critical for INSERT OR REPLACE correctness
+  - Indexed for: date_filter_id, chart_type, trust_name, directory, level
 - `pathway_refresh_log` - Tracks data refresh status
   - Columns: `refresh_id`, `started_at`, `completed_at`, `status`, `records_processed`, `error_message`
 
@@ -388,6 +476,7 @@ The `department_identification()` function has 5 levels of fallback:
 
 The input data (CSV/Parquet) must contain columns including:
 - `Provider Code`, `PersonKey` - Used to create UPID
+- `PseudoNHSNoLinked` - NHS pseudonym for GP record matching (indication charts)
 - `Drug Name`, `Intervention Date`, `Price Actual`
 - `OrganisationName`
 - Various `Additional Detail/Description` columns for directory extraction
@@ -395,7 +484,9 @@ The input data (CSV/Parquet) must contain columns including:
 
 ## Output
 
-Interactive Plotly icicle chart showing:
+Interactive Plotly icicle chart with toggleable views:
+- **Directory view**: Trust → Directorate → Drug → Patient Pathway
+- **Indication view**: Trust → GP Diagnosis (Search_Term) → Drug → Patient Pathway
 - Patient counts and percentages at each hierarchy level
 - Total and average costs
 - Treatment duration and dosing frequency information
@@ -458,13 +549,23 @@ The pre-computed pathway architecture introduces these changes:
 ### State Variables
 - **Removed**: `start_date`, `end_date`, `set_start_date()`, `set_end_date()`
 - **Added**: `selected_initiated`, `selected_last_seen`, `date_filter_id`
-- **Added**: `load_pathway_data()` - queries pre-computed `pathway_nodes`
+- **Added**: `selected_chart_type` ("directory" or "indication"), `chart_type_options`
+- **Added**: `set_chart_type()` - switches chart type and reloads data
+- **Added**: `chart_hierarchy_label`, `chart_type_label` - computed vars for dynamic UI text
+- **Added**: `load_pathway_data()` - queries pre-computed `pathway_nodes` filtered by `date_filter_id` AND `chart_type`
 - **Added**: `recalculate_parent_totals()` - adjusts hierarchy after filtering
 
+### Chart Type Toggle
+- **New**: Segmented control ("By Directory" | "By Indication") in filter strip
+- **Added**: `selected_chart_type` state variable, `set_chart_type()` handler
+- **Added**: Dynamic hierarchy label ("Trust → Directorate → ..." or "Trust → Indication → ...")
+- **Note**: Directory filter only applies to directory charts; for indication charts the `directory` column stores Search_Terms
+
 ### Icicle Chart
 - **Enhanced**: Now includes full 10-field customdata structure
 - **Added**: Treatment statistics (average_spacing, cost_pp_pa) in hover tooltips
 - **Added**: First/last seen dates for drug nodes
+- **Added**: Indication chart uses `generate_icicle_chart_indication()` with Search_Term hierarchy
 
 ## Development
 
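
Reviewer note: the `--chart-type` behaviour documented in the CLAUDE.md changes above (choices `all`, `directory`, `indication`, with `all` as the default) could be wired up roughly as follows. This is a sketch using `argparse`, not the actual `cli.refresh_pathways` code; `build_parser`, `resolve_chart_types`, and `N_DATE_FILTERS` are hypothetical names introduced for illustration.

```python
import argparse

CHART_TYPES = ("directory", "indication")
N_DATE_FILTERS = 6  # number of pre-defined date filter combinations per the docs


def build_parser():
    """Hypothetical parser exposing only the two flags discussed here."""
    parser = argparse.ArgumentParser(prog="refresh_pathways")
    parser.add_argument(
        "--chart-type",
        choices=("all",) + CHART_TYPES,
        default="all",
        help="which pathway chart types to process",
    )
    parser.add_argument("--dry-run", action="store_true")
    return parser


def resolve_chart_types(choice):
    """Expand 'all' into the concrete chart types; pass others through."""
    return list(CHART_TYPES) if choice == "all" else [choice]


args = build_parser().parse_args(["--chart-type", "indication", "--dry-run"])
selected = resolve_chart_types(args.chart_type)

default_args = build_parser().parse_args([])
default_selected = resolve_chart_types(default_args.chart_type)
dataset_count = N_DATE_FILTERS * len(default_selected)
```

Under these assumptions, running with no arguments selects both chart types and yields the 12 datasets quoted in the documentation.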
diff --git a/IMPLEMENTATION_PLAN.md b/IMPLEMENTATION_PLAN.md
index c09842a..088a201 100644
--- a/IMPLEMENTATION_PLAN.md
+++ b/IMPLEMENTATION_PLAN.md
@@ -158,10 +158,10 @@ python -m reflex compile
   - reflex run crashes on Windows due to Granian/watchfiles FileNotFoundError (environment issue, not code)
 
 ### 5.2 Update Documentation
-- [ ] Update CLAUDE.md with new architecture
-- [ ] Document new CLI arguments
-- [ ] Document chart_type toggle behavior
-- [ ] Update data flow diagrams
+- [x] Update CLAUDE.md with new architecture
+- [x] Document new CLI arguments
+- [x] Document chart_type toggle behavior
+- [x] Update data flow diagrams
 
 ---
 
diff --git a/progress.txt b/progress.txt
index 70f7ec2..f3554c3 100644
--- a/progress.txt
+++ b/progress.txt
@@ -416,3 +416,43 @@ The previous `batch_lookup_indication_groups()` function in `diagnosis_lookup.py
 - The app is feature-complete — only documentation and optional visual polish remain
 ### Blocked items:
 - Visual testing at multiple viewport sizes blocked by Granian/watchfiles Windows crash
+
+## Iteration 8 — 2026-02-05
+### Task: 5.2 Update Documentation
+### Why this task:
+- This is the only remaining non-blocked task in the implementation plan
+- Previous iteration explicitly recommended completing Task 5.2 next
+- All code is complete and validated — only documentation remains
+### Status: COMPLETE
+### What was done:
+Updated CLAUDE.md comprehensively to document all indication chart features:
+
+1. **Project Overview** — Added dual chart types as key feature, GP diagnosis matching with ~93% match rate
+2. **CLI Commands** — Documented `--chart-type` argument (all/directory/indication) with timing estimates
+3. **Pathway Data Architecture** — Added chart types table, indication hierarchy, chart_type in node structure, UNIQUE constraint
+4. **Data Processing Module** — Documented `get_patient_indication_groups()`, `CLUSTER_MAPPING_SQL`, indication pipeline functions
+5. **Analysis Module** — Added `generate_icicle_chart_indication()` documentation
+6. **Reflex Application** — Documented chart type state, toggle component, known limitation (directory filter)
+7. **Data Flow Diagram** — Rewrote to show dual chart type processing with GP diagnosis lookup branch
+8. **Database Schema** — Added `chart_type` column, updated UNIQUE constraint, indexes
+9. **Key Patterns** — Added indication lookup workflow, kept legacy validation workflow
+10. **Breaking Changes** — Added chart type toggle section, updated state variables
+11. **Input Data** — Added `PseudoNHSNoLinked` column requirement
+12. **Output** — Updated to describe both chart views
+
+### Validation results:
+- Tier 1 (Code): N/A (documentation only)
+- Tier 2 (Data): N/A
+- Tier 3 (Functional): N/A
+### Files changed:
+- `CLAUDE.md` — comprehensive documentation updates for indication chart features
+- `IMPLEMENTATION_PLAN.md` — marked Task 5.2 items complete
+### Committed: 315065c "docs: update CLAUDE.md with indication chart architecture and CLI docs (Task 5.2)"
+### Patterns discovered:
+- None (documentation task)
+### Next iteration should:
+- All implementation plan tasks are now complete except the deferred viewport testing (blocked by Granian crash)
+- Check if `<promise>COMPLETE</promise>` signal can be issued (all tasks [x] except one deferred sub-item)
+- If the deferred viewport testing sub-item prevents completion, consider whether it should be marked [B] (blocked)
+### Blocked items:
+- Task 5.1 sub-item "Test at multiple viewport sizes" still blocked by Granian/watchfiles Windows crash
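
Reviewer note: the indication lookup workflow documented in this patch (batches of 500 patients, most recent GP match per patient, directorate fallback label) can be sketched in plain Python. This only illustrates the logic on in-memory tuples, assuming ISO-formatted date strings; the documented implementation runs the matching in Snowflake with `QUALIFY ROW_NUMBER()` and returns DataFrames. All function names and sample values here are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

BATCH_SIZE = 500  # batch size quoted in the documentation


def batched(items: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def latest_search_term(gp_rows: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Keep the Search_Term of the most recent event for each patient.

    Rows are (patient_pseudonym, search_term, event_datetime). ISO date
    strings in one format compare correctly as plain strings.
    """
    latest: Dict[str, Tuple[str, str]] = {}
    for pseudonym, search_term, event_dt in gp_rows:
        if pseudonym not in latest or event_dt > latest[pseudonym][0]:
            latest[pseudonym] = (event_dt, search_term)
    return {pseudonym: term for pseudonym, (_, term) in latest.items()}


def build_indication_labels(
    patients: Dict[str, Tuple[str, str]], matches: Dict[str, str]
) -> Dict[str, str]:
    """Map UPID to Search_Term, or to 'DIRECTORATE (no GP dx)' when unmatched."""
    labels = {}
    for upid, (pseudonym, directorate) in patients.items():
        labels[upid] = matches.get(pseudonym, f"{directorate} (no GP dx)")
    return labels


# Invented sample data: P1 has two GP matches, P2 has none.
gp_rows = [
    ("P1", "psoriasis", "2023-01-10"),
    ("P1", "rheumatoid arthritis", "2024-06-01"),
]
patients = {
    "RX1_001": ("P1", "RHEUMATOLOGY"),
    "RX1_002": ("P2", "RHEUMATOLOGY"),
}
labels = build_indication_labels(patients, latest_search_term(gp_rows))
batch_sizes = [len(b) for b in batched([f"P{i}" for i in range(1200)])]
```

Keeping the fallback label distinct from any real Search_Term means unmatched patients still appear at level 2 of the indication chart without being mistaken for a GP diagnosis.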