From 1581b1d3dd4e19cdd55686638a18e00dc5243a7a Mon Sep 17 00:00:00 2001
From: Andrew Charlwood
Date: Fri, 6 Feb 2026 09:39:19 +0000
Subject: [PATCH] docs: update CLAUDE.md to reflect slimmed database architecture

Remove references to deleted tables (fact_interventions,
mv_patient_treatment_summary, ref_drug_snomed_mapping, processed_files),
deleted files (patient_data.py, load_snomed_mapping.py), and removed
classes (SQLiteDataLoader). Update package structure, data loaders,
database schema, fallback chain, and AppState descriptions.
---
 CLAUDE.md | 88 ++++++++++---------------------------------------------
 1 file changed, 15 insertions(+), 73 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index b0daee8..d376ec6 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -10,7 +10,7 @@ NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that
 - **Dual chart types**: Directory-based (Trust → Directory → Drug → Pathway) and Indication-based (Trust → GP Diagnosis → Drug → Pathway) views with toggle
 - **Pre-computed pathway architecture**: Treatment pathways pre-processed and stored in SQLite for instant filtering
 - **GP diagnosis matching**: Patient indications matched from GP records using SNOMED cluster codes queried directly from Snowflake (~93% match rate)
-- Multi-source data loading: CSV/Parquet files, SQLite database, Snowflake data warehouse
+- Data pipeline: Snowflake → pre-computed SQLite pathway nodes (CSV/Parquet file loading retained for legacy compatibility)
 - Interactive browser-based UI using Reflex framework
 - 6 pre-defined date filter combinations × 2 chart types = 12 pre-computed datasets with sub-50ms response times
 
@@ -86,15 +86,14 @@ The refresh command:
 │
 ├── data_processing/ # Data layer
 │ ├── database.py # SQLite connection management
-│ ├── schema.py # Database schema (including pathway tables)
+│ ├── schema.py # Database schema (reference + pathway tables)
 │ ├── pathway_pipeline.py # Pathway processing pipeline (Snowflake → SQLite)
-│ ├── loader.py # DataLoader abstraction (CSV/SQLite)
-│ ├── patient_data.py # Patient data migration and loading
+│ ├── loader.py # FileDataLoader for CSV/Parquet files
 │ ├── reference_data.py # Reference data migration
 │ ├── snowflake_connector.py # Snowflake integration
 │ ├── cache.py # Query result caching
-│ ├── data_source.py # Data source fallback chain
-│ └── diagnosis_lookup.py # GP diagnosis validation
+│ ├── data_source.py # Data source fallback chain (Snowflake/file)
+│ └── diagnosis_lookup.py # GP diagnosis lookup and drug-indication mapping
 │
 ├── analysis/ # Analysis pipeline
 │ ├── pathway_analyzer.py # prepare_data, calculate_statistics, build_hierarchy
@@ -184,7 +183,7 @@ Each node in `pathway_nodes` contains:
 
 **Database Management:**
 - `DatabaseManager` - SQLite connection pooling and transaction management
-- Tables: `ref_drug_names`, `ref_organizations`, `ref_directories`, `ref_drug_directory_map`, `ref_drug_indication_clusters`, `fact_interventions`, `mv_patient_treatment_summary`, `processed_files`
+- **Reference Tables**: `ref_drug_names`, `ref_organizations`, `ref_directories`, `ref_drug_directory_map`, `ref_drug_indication_clusters`
 - **Pathway Tables**: `pathway_date_filters`, `pathway_nodes`, `pathway_refresh_log`
 
 **Pathway Pipeline (`pathway_pipeline.py`):**
@@ -203,15 +202,13 @@ Each node in `pathway_nodes` contains:
 - `process_all_date_filters()` - Convenience function to process all 6 filters
 
 **Data Loaders:**
-- `FileDataLoader` - Loads from CSV/Parquet files
-- `SQLiteDataLoader` - Queries fact_interventions table
-- Factory function `get_loader()` selects appropriate loader
+- `FileDataLoader` - Loads from CSV/Parquet files (used by legacy pipeline, not by Reflex app)
+- Factory function `get_loader()` creates a `FileDataLoader`
 
 **Snowflake Integration:**
 - SSO authentication via `externalbrowser` authenticator
 - `fetch_activity_data(start_date, end_date, provider_codes)` method
 - Query caching with TTL-based invalidation
-- Fallback chain: cache → Snowflake → local files
 
 **GP Diagnosis Lookup (`diagnosis_lookup.py`):**
 - `CLUSTER_MAPPING_SQL` - Embedded SQL constant with ~148 Search_Term → Cluster_ID mappings plus explicit SNOMED codes
@@ -245,9 +242,9 @@ The `AppState` class manages all application state:
 - **Chart type**: `selected_chart_type` ("directory" or "indication"), toggled via `set_chart_type()`
 - **Computed vars**: `chart_hierarchy_label` (dynamic "Trust → Directorate → ..." or "Trust → Indication → ..."), `chart_type_label`
 - Filter variables: dates, drugs, trusts, directories
-- Reference data: available options loaded from CSV/SQLite
+- Reference data: available options loaded from pathway_nodes and CSV files
 - Analysis state: running flag, status messages, chart data
-- Data source state: file path, source type, row counts
+- `load_data()` sources available drugs/directorates from `pathway_nodes` and `total_records` from `pathway_refresh_log.source_row_count`
 
 **Chart Type Toggle** (`chart_type_toggle()` component):
 - Segmented control with "By Directory" and "By Indication" pill buttons
@@ -357,39 +354,6 @@ Still used during transition:
 └──────────────────────────────────────────┘
 ```
 
-**Legacy Data Flow (Original):**
-
-```
-Data Sources:
- CSV/Parquet file upload
- OR SQLite database query
- OR Snowflake fetch (with caching)
- │
- ▼
- ┌──────────────────────────────────────────┐
- │ Data Transformations (tools/data.py) │
- │ → patient_id() creates UPID │
- │ → drug_names() standardizes names │
- │ → department_identification() → Dir │
- └──────────────────────────────────────────┘
- │
- ▼
- ┌──────────────────────────────────────────┐
- │ Analysis Pipeline (analysis/) │
- │ → prepare_data() - filter by criteria │
- │ → calculate_statistics() │
- │ → build_hierarchy() │
- │ → prepare_chart_data() │
- └──────────────────────────────────────────┘
- │
- ▼
- ┌──────────────────────────────────────────┐
- │ Visualization (visualization/) │
- │ → create_icicle_figure() │
- │ → Display in rx.plotly() component │
- └──────────────────────────────────────────┘
-```
-
 ### Reference Data Files (`data/`)
 
 | File | Purpose |
@@ -403,7 +367,7 @@ Data Sources:
 | `treatment_function_codes.csv` | NHS treatment function code mappings |
 | `drug_indication_clusters.csv` | Drug to SNOMED cluster mappings |
 | `ta-recommendations.xlsx` | NICE TA recommendations |
-| `pathways.db` | SQLite database with all tables |
+| `pathways.db` | SQLite database (~3.5 MB: reference tables + pathway nodes) |
 
 ### Key Patterns
 
@@ -425,19 +389,12 @@ The `department_identification()` function has 5 levels of fallback:
 3. Build `indication_df` mapping UPID → Search_Term (matched) or Directorate + " (no GP dx)" (unmatched)
 4. Pass to `generate_icicle_chart_indication()` for pathway hierarchy building
 
-**Indication Validation Workflow (legacy, per-patient):**
-1. Map drug → SNOMED cluster IDs (e.g., ADALIMUMAB → RARTH_COD, PSORIASIS_COD)
-2. Get all SNOMED codes for those clusters
-3. Check GP records (PrimaryCareClinicalCoding) for matching codes
-4. Report match/no-match with source tracking
-
-**Data Source Fallback Chain:**
+**Data Source Fallback Chain** (for raw data loading, not used by Reflex app):
 1. Query cache for recent results
 2. Attempt Snowflake connection
-3. Fall back to SQLite database
-4. Fall back to CSV/Parquet files
+3. Fall back to CSV/Parquet files
 
-## Database Schema
+## Database Schema (~3.5 MB)
 
 ### Reference Tables
 - `ref_drug_names` - Drug name standardization
@@ -446,15 +403,6 @@ The `department_identification()` function has 5 levels of fallback:
 - `ref_drug_directory_map` - Valid drug-directory pairs
 - `ref_drug_indication_clusters` - Drug to SNOMED cluster mapping
 
-### Fact Tables
-- `fact_interventions` - Patient intervention records (UPID, drug, date, cost, directory)
-
-### Materialized Views
-- `mv_patient_treatment_summary` - Pre-aggregated patient statistics
-
-### File Tracking
-- `processed_files` - Hash-based tracking for incremental loading
-
 ### Pathway Tables
 - `pathway_date_filters` - 6 pre-defined date filter combinations
   - Columns: `id`, `initiated`, `last_seen`, `is_default`, `description`
@@ -470,7 +418,7 @@ The `department_identification()` function has 5 levels of fallback:
   - Unique constraint: `UNIQUE(date_filter_id, chart_type, ids)` — critical for INSERT OR REPLACE correctness
   - Indexed for: date_filter_id, chart_type, trust_name, directory, level
 - `pathway_refresh_log` - Tracks data refresh status
-  - Columns: `refresh_id`, `started_at`, `completed_at`, `status`, `records_processed`, `error_message`
+  - Columns: `refresh_id`, `started_at`, `completed_at`, `status`, `records_processed`, `error_message`, `source_row_count`
 
 ## Input Data Requirements
 
@@ -569,12 +517,6 @@ The pre-computed pathway architecture introduces these changes:
 
 ## Development
 
-### Adding New Data Sources
-
-1. Create loader class implementing `DataLoader` protocol in `data_processing/loader.py`
-2. Add to factory function `get_loader()`
-3. Update `DataSourceManager` fallback chain if needed
-
 ### Adding New Analysis Features
 
 1. Add statistical functions to `analysis/statistics.py`