docs: update CLAUDE.md to reflect slimmed database architecture

Remove references to deleted tables (fact_interventions, mv_patient_treatment_summary, ref_drug_snomed_mapping, processed_files), deleted files (patient_data.py, load_snomed_mapping.py), and removed classes (SQLiteDataLoader). Update package structure, data loaders, database schema, fallback chain, and AppState descriptions.
2026-02-06 09:39:19 +00:00
parent 778ed99ef6
commit 1581b1d3dd
1 changed file with 15 additions and 73 deletions
@@ -10,7 +10,7 @@ NHS High-Cost Drug Patient Pathway Analysis Tool - a web-based application that
 - **Dual chart types**: Directory-based (Trust → Directory → Drug → Pathway) and Indication-based (Trust → GP Diagnosis → Drug → Pathway) views with toggle
 - **Pre-computed pathway architecture**: Treatment pathways pre-processed and stored in SQLite for instant filtering
 - **GP diagnosis matching**: Patient indications matched from GP records using SNOMED cluster codes queried directly from Snowflake (~93% match rate)
-- Multi-source data loading: CSV/Parquet files, SQLite database, Snowflake data warehouse
+- Data pipeline: Snowflake → pre-computed SQLite pathway nodes (CSV/Parquet file loading retained for legacy compatibility)
 - Interactive browser-based UI using Reflex framework
 - 6 pre-defined date filter combinations × 2 chart types = 12 pre-computed datasets with sub-50ms response times

@@ -86,15 +86,14 @@ The refresh command:
 │
 ├── data_processing/         # Data layer
 │   ├── database.py         # SQLite connection management
-│   ├── schema.py           # Database schema (including pathway tables)
+│   ├── schema.py           # Database schema (reference + pathway tables)
 │   ├── pathway_pipeline.py # Pathway processing pipeline (Snowflake → SQLite)
-│   ├── loader.py           # DataLoader abstraction (CSV/SQLite)
-│   ├── patient_data.py     # Patient data migration and loading
+│   ├── loader.py           # FileDataLoader for CSV/Parquet files
 │   ├── reference_data.py   # Reference data migration
 │   ├── snowflake_connector.py  # Snowflake integration
 │   ├── cache.py            # Query result caching
-│   ├── data_source.py      # Data source fallback chain
-│   └── diagnosis_lookup.py # GP diagnosis validation
+│   ├── data_source.py      # Data source fallback chain (Snowflake/file)
+│   └── diagnosis_lookup.py # GP diagnosis lookup and drug-indication mapping
 │
 ├── analysis/                # Analysis pipeline
 │   ├── pathway_analyzer.py # prepare_data, calculate_statistics, build_hierarchy
@@ -184,7 +183,7 @@ Each node in `pathway_nodes` contains:

 **Database Management:**
 - `DatabaseManager` - SQLite connection pooling and transaction management
-- Tables: `ref_drug_names`, `ref_organizations`, `ref_directories`, `ref_drug_directory_map`, `ref_drug_indication_clusters`, `fact_interventions`, `mv_patient_treatment_summary`, `processed_files`
+- **Reference Tables**: `ref_drug_names`, `ref_organizations`, `ref_directories`, `ref_drug_directory_map`, `ref_drug_indication_clusters`
 - **Pathway Tables**: `pathway_date_filters`, `pathway_nodes`, `pathway_refresh_log`

 **Pathway Pipeline (`pathway_pipeline.py`):**
@@ -203,15 +202,13 @@ Each node in `pathway_nodes` contains:
  - `process_all_date_filters()` - Convenience function to process all 6 filters

 **Data Loaders:**
-- `FileDataLoader` - Loads from CSV/Parquet files
-- `SQLiteDataLoader` - Queries fact_interventions table
-- Factory function `get_loader()` selects appropriate loader
+- `FileDataLoader` - Loads from CSV/Parquet files (used by legacy pipeline, not by Reflex app)
+- Factory function `get_loader()` creates a `FileDataLoader`

 **Snowflake Integration:**
 - SSO authentication via `externalbrowser` authenticator
 - `fetch_activity_data(start_date, end_date, provider_codes)` method
 - Query caching with TTL-based invalidation
-- Fallback chain: cache → Snowflake → local files

 **GP Diagnosis Lookup (`diagnosis_lookup.py`):**
 - `CLUSTER_MAPPING_SQL` - Embedded SQL constant with ~148 Search_Term → Cluster_ID mappings plus explicit SNOMED codes
@@ -245,9 +242,9 @@ The `AppState` class manages all application state:
 - **Chart type**: `selected_chart_type` ("directory" or "indication"), toggled via `set_chart_type()`
 - **Computed vars**: `chart_hierarchy_label` (dynamic "Trust → Directorate → ..." or "Trust → Indication → ..."), `chart_type_label`
 - Filter variables: dates, drugs, trusts, directories
-- Reference data: available options loaded from CSV/SQLite
+- Reference data: available options loaded from pathway_nodes and CSV files
 - Analysis state: running flag, status messages, chart data
-- Data source state: file path, source type, row counts
+- `load_data()` sources available drugs/directorates from `pathway_nodes` and `total_records` from `pathway_refresh_log.source_row_count`

 **Chart Type Toggle** (`chart_type_toggle()` component):
 - Segmented control with "By Directory" and "By Indication" pill buttons
@@ -357,39 +354,6 @@ Still used during transition:
    └──────────────────────────────────────────┘
 ```

-**Legacy Data Flow (Original):**
-
-```
-Data Sources:
-    CSV/Parquet file upload
-    OR SQLite database query
-    OR Snowflake fetch (with caching)
-           │
-           ▼
-    ┌──────────────────────────────────────────┐
-    │ Data Transformations (tools/data.py)     │
-    │   → patient_id() creates UPID            │
-    │   → drug_names() standardizes names      │
-    │   → department_identification() → Dir    │
-    └──────────────────────────────────────────┘
-           │
-           ▼
-    ┌──────────────────────────────────────────┐
-    │ Analysis Pipeline (analysis/)            │
-    │   → prepare_data() - filter by criteria  │
-    │   → calculate_statistics()               │
-    │   → build_hierarchy()                    │
-    │   → prepare_chart_data()                 │
-    └──────────────────────────────────────────┘
-           │
-           ▼
-    ┌──────────────────────────────────────────┐
-    │ Visualization (visualization/)           │
-    │   → create_icicle_figure()               │
-    │   → Display in rx.plotly() component     │
-    └──────────────────────────────────────────┘
-```
-
 ### Reference Data Files (`data/`)

 | File | Purpose |
@@ -403,7 +367,7 @@ Data Sources:
 | `treatment_function_codes.csv` | NHS treatment function code mappings |
 | `drug_indication_clusters.csv` | Drug to SNOMED cluster mappings |
 | `ta-recommendations.xlsx` | NICE TA recommendations |
-| `pathways.db` | SQLite database with all tables |
+| `pathways.db` | SQLite database (~3.5 MB: reference tables + pathway nodes) |

 ### Key Patterns

@@ -425,19 +389,12 @@ The `department_identification()` function has 5 levels of fallback:
 3. Build `indication_df` mapping UPID → Search_Term (matched) or Directorate + " (no GP dx)" (unmatched)
 4. Pass to `generate_icicle_chart_indication()` for pathway hierarchy building

-**Indication Validation Workflow (legacy, per-patient):**
-1. Map drug → SNOMED cluster IDs (e.g., ADALIMUMAB → RARTH_COD, PSORIASIS_COD)
-2. Get all SNOMED codes for those clusters
-3. Check GP records (PrimaryCareClinicalCoding) for matching codes
-4. Report match/no-match with source tracking
-
-**Data Source Fallback Chain:**
+**Data Source Fallback Chain** (for raw data loading, not used by Reflex app):
 1. Query cache for recent results
 2. Attempt Snowflake connection
-3. Fall back to SQLite database
-4. Fall back to CSV/Parquet files
+3. Fall back to CSV/Parquet files

-## Database Schema
+## Database Schema (~3.5 MB)

 ### Reference Tables
 - `ref_drug_names` - Drug name standardization
@@ -446,15 +403,6 @@ The `department_identification()` function has 5 levels of fallback:
 - `ref_drug_directory_map` - Valid drug-directory pairs
 - `ref_drug_indication_clusters` - Drug to SNOMED cluster mapping

-### Fact Tables
-- `fact_interventions` - Patient intervention records (UPID, drug, date, cost, directory)
-
-### Materialized Views
-- `mv_patient_treatment_summary` - Pre-aggregated patient statistics
-
-### File Tracking
-- `processed_files` - Hash-based tracking for incremental loading
-
 ### Pathway Tables
 - `pathway_date_filters` - 6 pre-defined date filter combinations
  - Columns: `id`, `initiated`, `last_seen`, `is_default`, `description`
@@ -470,7 +418,7 @@ The `department_identification()` function has 5 levels of fallback:
  - Unique constraint: `UNIQUE(date_filter_id, chart_type, ids)` — critical for INSERT OR REPLACE correctness
  - Indexed for: date_filter_id, chart_type, trust_name, directory, level
 - `pathway_refresh_log` - Tracks data refresh status
-  - Columns: `refresh_id`, `started_at`, `completed_at`, `status`, `records_processed`, `error_message`
+  - Columns: `refresh_id`, `started_at`, `completed_at`, `status`, `records_processed`, `error_message`, `source_row_count`

 ## Input Data Requirements

@@ -569,12 +517,6 @@ The pre-computed pathway architecture introduces these changes:

 ## Development

-### Adding New Data Sources
-
-1. Create loader class implementing `DataLoader` protocol in `data_processing/loader.py`
-2. Add to factory function `get_loader()`
-3. Update `DataSourceManager` fallback chain if needed
-
 ### Adding New Analysis Features

 1. Add statistical functions to `analysis/statistics.py`