Initial commit before Ralph loop

2026-02-04 13:04:29 +00:00
commit fdd33a67af
89 changed files with 20660 additions and 0 deletions
@@ -0,0 +1,859 @@
+# Patient Pathway Analysis - Improvement Recommendations
+
+This document outlines recommended improvements to modernize the Patient Pathway Analysis application, based on multi-domain expert analysis.
+
+---
+
+## Executive Summary
+
+| Area | Current State | Recommended Change | Priority |
+|------|--------------|-------------------|----------|
+| **GUI Framework** | CustomTkinter | **Reflex** (browser-based, native Plotly) | High |
+| **Data Storage** | CSV files (90MB+) | SQLite with caching | High |
+| **Data Source** | Manual CSV export | Direct Snowflake connection | Medium |
+| **Directory Assignment** | Multi-stage fallback | GP diagnosis codes as primary | Medium |
+| **Code Quality** | Monolithic, no types | Modular, typed, tested | Low |
+
+---
+
+## 1. GUI Framework: Replace CustomTkinter with Reflex or Flet
+
+### What
+Replace the CustomTkinter-based GUI with a modern Python framework. Two strong options:
+- **[Reflex](https://reflex.dev)** - React-based, runs in browser
+- **[Flet](https://flet.dev)** - Flutter-based, native desktop or browser
+
+### Why
+
+Since Python is approved and standalone `.exe` distribution isn't required, **both frameworks are viable**.
+
+| Criterion | CustomTkinter | Reflex | Flet |
+|-----------|---------------|--------|------|
+| UI paradigm | Native desktop | Browser (localhost) | Desktop or browser |
+| Component richness | Limited | 60+ React components | Material Design |
+| Styling | Manual/limited | Full CSS/Tailwind | Flutter theming |
+| Plotly integration | External HTML | **Native embed** | WebView needed |
+| State management | Manual | Automatic re-render | Manual updates |
+| Learning curve | Low | Moderate (React-like) | Low-moderate |
+| Community | Small | 22k+ GitHub stars | 12k+ GitHub stars |
+| Maturity | Stable | Active (v0.6+) | Active (v0.80+) |
+
+### Recommendation: **Reflex**
+
+Given that:
+1. Python is approved for users
+2. Standalone `.exe` not required
+3. **Interactive Plotly is required** (Reflex has native `rx.plotly()` component)
+
+Reflex is now the better choice because:
+- **Native Plotly support** - no need to open external browser windows
+- **Modern React-based UI** - cleaner, more customizable
+- **Simpler state management** - automatic re-rendering on state changes
+- **Better for data apps** - designed for dashboards and data visualization
+
+### How (Reflex)
+
+**Basic app structure:**
+
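+```python
+import plotly.graph_objects as go
+import reflex as rx
+
+# Placeholder: the real layout dict is supplied by the app
+chart_layout: dict = {}
+
+
+class State(rx.State):
+    """Application state."""
+    start_date: str = "2019-04-01"
+    end_date: str = "2025-04-30"
+    drug_options: list[str] = []  # populated from the drug reference data
+    selected_drugs: list[str] = []
+    selected_trusts: list[str] = []
+    analysis_running: bool = False
+    chart_data: go.Figure = go.Figure()  # rx.plotly expects a Plotly figure
+
+    @rx.event
+    def set_start_date(self, value: str):
+        self.start_date = value
+
+    @rx.event
+    def toggle_drug(self, drug: str, checked: bool):
+        # Keep selected_drugs in sync with the individual checkboxes
+        if checked and drug not in self.selected_drugs:
+            self.selected_drugs.append(drug)
+        elif not checked and drug in self.selected_drugs:
+            self.selected_drugs.remove(drug)
+
+    @rx.event
+    async def run_analysis(self):
+        self.analysis_running = True
+        yield  # Update UI
+
+        try:
+            # load_and_process_data() and generate_plotly_figure() are the
+            # app's own helpers (not shown here)
+            df = await self.load_and_process_data()
+            self.chart_data = generate_plotly_figure(df)
+        finally:
+            self.analysis_running = False
+
+
+def drug_checkbox(drug: rx.Var[str]) -> rx.Component:
+    return rx.checkbox(
+        drug,
+        checked=State.selected_drugs.contains(drug),
+        on_change=lambda checked: State.toggle_drug(drug, checked),
+    )
+
+
+def index() -> rx.Component:
+    return rx.box(
+        rx.hstack(
+            # Sidebar with filters
+            rx.vstack(
+                # Core Reflex has no dedicated date picker component;
+                # a date-typed input gives the browser's native picker
+                rx.input(
+                    type="date",
+                    value=State.start_date,
+                    on_change=State.set_start_date,
+                ),
+                rx.foreach(State.drug_options, drug_checkbox),
+                rx.button(
+                    "Run Analysis",
+                    on_click=State.run_analysis,
+                    loading=State.analysis_running,
+                ),
+                width="300px",
+            ),
+            # Main content - interactive Plotly chart
+            rx.plotly(data=State.chart_data, layout=chart_layout),
+            width="100%",
+        )
+    )
+
+app = rx.App()
+app.add_page(index)
+```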
+
+**Key components mapping:**
+
+| Current Component | Reflex Equivalent |
+|-------------------|-------------------|
+| `CTkFrame` | `rx.box`, `rx.vstack`, `rx.hstack` |
+| `CTkButton` | `rx.button` |
+| `CTkCheckBox` | `rx.checkbox` |
+| `CTkSlider` | `rx.slider` |
+| `DateEntry` | `rx.input(type="date")` |
+| `CTkScrollableFrame` | `rx.scroll_area` |
+| `filedialog` | `rx.upload` |
+| Plotly HTML file | **`rx.plotly()`** - native embed! |
+
+**Running the app:**
+
+```bash
+# Install
+pip install reflex
+
+# Initialize (first time)
+reflex init
+
+# Run development server
+reflex run
+# Serves the app at http://localhost:3000 (open it in a browser)
+```
+
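+**Background tasks with progress:**
+
+```python
+import asyncio
+
+import reflex as rx
+
+
+class State(rx.State):
+    progress: int = 0
+    status: str = ""
+
+    @rx.event(background=True)
+    async def run_analysis(self):
+        # Background handlers may only modify state inside `async with self`
+        async with self:
+            self.status = "Loading data..."
+            self.progress = 10
+
+        # load_data() and process_data() are the app's own blocking helpers;
+        # running them in a thread keeps the event loop responsive
+        df = await asyncio.to_thread(load_data)
+        async with self:
+            self.status = "Processing..."
+            self.progress = 50
+
+        result = await asyncio.to_thread(process_data, df)
+        async with self:
+            self.status = "Complete"
+            self.progress = 100
+```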
+
+### Alternative: Flet
+
+If you prefer a more desktop-like feel, Flet remains a good option:
+
+```python
+import flet as ft
+
+def main(page: ft.Page):
+    page.title = "HCD Analysis"
+
+    async def run_analysis(e):
+        # Background task
+        page.run_task(do_analysis)
+
+    page.add(
+        ft.Row([
+            # Sidebar
+            ft.Column([
+                ft.DatePicker(),
+                ft.ElevatedButton("Run", on_click=run_analysis),
+            ]),
+            # Chart area (opens in browser for interactivity)
+            ft.ElevatedButton("View Chart", on_click=open_chart),
+        ])
+    )
+
+# Pick one launch mode (calling ft.app twice would start the app twice):
+ft.app(target=main)  # Desktop window
+# ft.app(target=main, view=ft.AppView.WEB_BROWSER)  # Browser
+```
+
+### Effort Estimate
+- Learning Reflex basics: 2-3 days
+- Rewriting GUI: 1-2 weeks
+- Testing and polish: 3-5 days
+
+---
+
+## 2. Data Storage: SQLite Architecture
+
+### What
+Replace CSV-based data loading with a SQLite database that stores reference data in normalized tables and caches processed patient data.
+
+### Why
+
+| Aspect | Current (CSV) | SQLite |
+|--------|---------------|--------|
+| Startup time | 90MB+ file read + full processing | Load reference data once (< 1MB) |
+| Memory usage | Entire dataset in memory | Incremental queries |
+| Incremental updates | Full reprocess required | Only process new/changed records |
+| Query performance | Pandas groupby/merge | Indexed SQL with CTEs |
+| Data consistency | Multiple CSVs can drift | Single source of truth with FK constraints |
+| Caching | None | Summary tables (emulated materialized views) |
+
+**Expected improvements:**
+- 60-80% faster startup
+- 50-70% memory reduction
+- 90%+ time savings on incremental updates
+
+### How
+
+**Recommended schema (simplified):**
+
+```sql
+-- Reference tables
+CREATE TABLE ref_drug_names (
+    drug_name_raw TEXT PRIMARY KEY,
+    drug_name_std TEXT NOT NULL
+);
+
+CREATE TABLE ref_organizations (
+    org_code TEXT PRIMARY KEY,
+    org_name TEXT NOT NULL
+);
+
+CREATE TABLE ref_directories (
+    directory_id INTEGER PRIMARY KEY,
+    directory_name TEXT UNIQUE NOT NULL
+);
+
+CREATE TABLE ref_drug_directory_map (
+    drug_name_std TEXT,
+    directory_id INTEGER,
+    is_single_valid BOOLEAN DEFAULT FALSE,
+    PRIMARY KEY (drug_name_std, directory_id)
+);
+
+-- Patient data (fact table)
+CREATE TABLE fact_interventions (
+    intervention_id INTEGER PRIMARY KEY,
+    upid TEXT NOT NULL,
+    provider_code TEXT,
+    drug_name_std TEXT NOT NULL,
+    intervention_date DATE NOT NULL,
+    price_actual REAL,
+    directory_id INTEGER,
+    directory_assignment_method TEXT,
+    data_load_batch_id INTEGER
+);
+
+-- Critical indexes
+CREATE INDEX idx_upid ON fact_interventions(upid);
+CREATE INDEX idx_upid_drug ON fact_interventions(upid, drug_name_std);
+CREATE INDEX idx_intervention_date ON fact_interventions(intervention_date);
+
+-- Summary table acting as a materialized view for patient summaries.
+-- SQLite has no native materialized views, so it must be refreshed explicitly.
+CREATE TABLE mv_patient_treatment_summary (
+    upid TEXT PRIMARY KEY,
+    first_seen DATE,
+    last_seen DATE,
+    total_cost REAL,
+    drug_count INTEGER,
+    last_refresh TIMESTAMP
+);
+
+-- File tracking for incremental updates
+CREATE TABLE processed_files (
+    file_path TEXT PRIMARY KEY,
+    file_hash TEXT NOT NULL,
+    last_processed TIMESTAMP
+);
+```
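
Because `mv_patient_treatment_summary` is an ordinary table, something has to rebuild it after each load. The following is a minimal sketch of such a refresh using the standard-library `sqlite3` module. The function name `refresh_patient_summary`, the trimmed-down demo schema, and the sample rows are illustrative assumptions, not part of the existing codebase.

```python
import sqlite3


def refresh_patient_summary(conn: sqlite3.Connection) -> int:
    """Rebuild the summary table from fact_interventions; return rows written."""
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute("DELETE FROM mv_patient_treatment_summary")
        cur = conn.execute(
            """
            INSERT INTO mv_patient_treatment_summary
                (upid, first_seen, last_seen, total_cost, drug_count, last_refresh)
            SELECT upid,
                   MIN(intervention_date),
                   MAX(intervention_date),
                   SUM(price_actual),
                   COUNT(DISTINCT drug_name_std),
                   CURRENT_TIMESTAMP
            FROM fact_interventions
            GROUP BY upid
            """
        )
        return cur.rowcount


# Demo on an in-memory database with a reduced schema (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact_interventions (
        intervention_id INTEGER PRIMARY KEY,
        upid TEXT NOT NULL,
        drug_name_std TEXT NOT NULL,
        intervention_date DATE NOT NULL,
        price_actual REAL
    );
    CREATE TABLE mv_patient_treatment_summary (
        upid TEXT PRIMARY KEY,
        first_seen DATE,
        last_seen DATE,
        total_cost REAL,
        drug_count INTEGER,
        last_refresh TIMESTAMP
    );
    """
)
conn.executemany(
    "INSERT INTO fact_interventions"
    " (upid, drug_name_std, intervention_date, price_actual) VALUES (?, ?, ?, ?)",
    [
        ("P1", "DRUG A", "2024-01-01", 100.0),
        ("P1", "DRUG B", "2024-03-01", 200.0),
        ("P2", "DRUG A", "2024-02-01", 50.0),
    ],
)
rows_written = refresh_patient_summary(conn)
```

A full rebuild is the simplest option; if it proves too slow on real volumes, the refresh could be limited to patients touched by the latest `data_load_batch_id`.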
+
+**Migration strategy:**
+
+1. **Phase 1**: Create schema, load reference tables from existing CSVs
+2. **Phase 2**: Develop incremental load scripts for patient data
+3. **Phase 3**: Build summary tables (emulated materialized views) for aggregations
+4. **Phase 4**: Modify `dashboard_gui.py` to query SQLite instead of processing CSVs
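
For the incremental load in Phase 2, the `processed_files` table can decide whether a file needs reprocessing by comparing content hashes. This is a sketch under assumptions: the helper names `file_sha256`, `needs_processing`, and `mark_processed` are hypothetical, and SHA-256 is one reasonable choice of hash rather than a stated requirement.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash the file in 1 MB chunks so large exports are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_processing(conn: sqlite3.Connection, path: Path) -> bool:
    """True if the file is new or its content changed since the last load."""
    row = conn.execute(
        "SELECT file_hash FROM processed_files WHERE file_path = ?", (str(path),)
    ).fetchone()
    return row is None or row[0] != file_sha256(path)


def mark_processed(conn: sqlite3.Connection, path: Path) -> None:
    """Record the file's current hash after a successful load."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO processed_files"
            " (file_path, file_hash, last_processed)"
            " VALUES (?, ?, CURRENT_TIMESTAMP)",
            (str(path), file_sha256(path)),
        )


# Demo with a temporary file standing in for a CSV export
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_files ("
    " file_path TEXT PRIMARY KEY, file_hash TEXT NOT NULL, last_processed TIMESTAMP)"
)
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "activity.csv"
    export.write_text("upid,drug\nP1,DRUG A\n")
    new_file = needs_processing(conn, export)    # never seen before
    mark_processed(conn, export)
    unchanged = needs_processing(conn, export)   # same content, skip
    export.write_text("upid,drug\nP1,DRUG A\nP2,DRUG B\n")
    changed = needs_processing(conn, export)     # content differs, reload
```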
+
+**Key query replacing pandas aggregation:**
+
+```sql
+-- Replaces ~200 lines of pandas groupby/merge
+WITH patient_drugs AS (
+    SELECT
+        upid,
+        drug_name_std,
+        MIN(intervention_date) as first_date,
+        MAX(intervention_date) as last_date,
+        COUNT(*) as intervention_count,
+        SUM(price_actual) as drug_cost
+    FROM fact_interventions
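+    WHERE intervention_date BETWEEN :start_date AND :end_date
+        -- :trust_filters stands for one placeholder per selected trust;
+        -- a single bound parameter cannot carry a list in SQLite
+        AND provider_code IN (:trust_filters)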
+    GROUP BY upid, drug_name_std
+)
+SELECT * FROM patient_drugs;
+```
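
SQLite cannot bind a Python list to a single parameter, so the trust filter in the query above has to be expanded into one placeholder per value before execution. A minimal sketch of that expansion with `sqlite3` follows; the function name `fetch_patient_drugs` and the sample rows are assumptions for illustration.

```python
import sqlite3


def fetch_patient_drugs(conn, start_date, end_date, trust_codes):
    """Run the per-patient, per-drug aggregation for the selected trusts."""
    if not trust_codes:
        return []
    # Only "?" markers are interpolated into the SQL text; values stay bound
    placeholders = ", ".join("?" for _ in trust_codes)
    sql = f"""
        SELECT upid,
               drug_name_std,
               MIN(intervention_date) AS first_date,
               MAX(intervention_date) AS last_date,
               COUNT(*) AS intervention_count,
               SUM(price_actual) AS drug_cost
        FROM fact_interventions
        WHERE intervention_date BETWEEN ? AND ?
          AND provider_code IN ({placeholders})
        GROUP BY upid, drug_name_std
        ORDER BY upid, drug_name_std
    """
    return conn.execute(sql, [start_date, end_date, *trust_codes]).fetchall()


# Demo on an in-memory table (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_interventions ("
    " upid TEXT, provider_code TEXT, drug_name_std TEXT,"
    " intervention_date DATE, price_actual REAL)"
)
conn.executemany(
    "INSERT INTO fact_interventions VALUES (?, ?, ?, ?, ?)",
    [
        ("P1", "RX1", "DRUG A", "2024-01-10", 100.0),
        ("P1", "RX1", "DRUG A", "2024-02-10", 150.0),
        ("P2", "RX2", "DRUG B", "2024-01-15", 80.0),
        ("P3", "RX3", "DRUG A", "2024-01-20", 60.0),
    ],
)
rows = fetch_patient_drugs(conn, "2024-01-01", "2024-12-31", ["RX1", "RX2"])
```

Dates are stored as ISO-8601 text here, which keeps `BETWEEN` comparisons correct because the strings sort chronologically.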
+
+### Effort Estimate
+- Schema design and setup: 2-3 days
+- Migration scripts: 3-4 days
+- Query optimization: 2-3 days
+- Integration testing: 2-3 days
+
+---
+
+## 3. Snowflake Integration
+
+### What
+Enable direct download of HCD activity data from Snowflake servers, replacing manual CSV exports.
+
+### Why
+- Eliminates manual export step
+- Enables date-range filtering at query level (faster)
+- Automatic caching with TTL
+- Graceful fallback to local files if Snowflake unavailable
+
+### How
+
+**Authentication: SSO Browser Login**
+
+The `externalbrowser` authenticator opens the system browser for SSO authentication:
+
+```python
+import snowflake.connector
+
+conn = snowflake.connector.connect(
+    account="your_account.region",
+    user="your.email@nhs.net",
+    authenticator="externalbrowser",
+    warehouse="ANALYTICS_WH",
+    database="data_hub",
+    schema="dwh"
+)
+```
+
+**Note**: The user will see a browser pop-up on the first connection of each session.
+
+**Configuration (`config/snowflake.toml`):**
+
+```toml
+[snowflake]
+account = "your_account.region"
+warehouse = "ANALYTICS_WH"
+database = "DataWarehouse"
+schema = "dwh"
+
+[query]
+default_timeout = 300
+chunk_size = 100000
+
+[cache]
+enabled = true
+ttl_hours = 24
+directory = "./data/cache"
+```
+
+**Core connector pattern:**
+
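+```python
+import snowflake.connector
+
+
+class SnowflakeConnector:
+    def __init__(self, **connect_kwargs):
+        # account, user, authenticator, warehouse, database, schema
+        self.connect_kwargs = connect_kwargs
+
+    def fetch_activity_data(self, start_date, end_date, provider_codes=None):
+        # The connector's default paramstyle is pyformat, so named binds
+        # are written %(name)s rather than :name
+        query = """
+        SELECT
+            "Provider Code",
+            "PersonKey",
+            "ProductDescription" as "Drug Name",
+            "Intervention Date",
+            "Price Actual",
+            -- ... other columns
+        FROM DataWarehouse.dwh.FactHighCostDrugs
+        WHERE "Intervention Date" BETWEEN %(start_date)s AND %(end_date)s
+        """
+        params = {'start_date': start_date, 'end_date': end_date}
+        if provider_codes:
+            # pyformat binding expands a list into a comma-separated IN list
+            query += ' AND "Provider Code" IN (%(provider_codes)s)'
+            params['provider_codes'] = list(provider_codes)
+
+        conn = snowflake.connector.connect(**self.connect_kwargs)
+        try:
+            cursor = conn.cursor()
+            cursor.execute(query, params)
+            return cursor.fetch_pandas_all()
+        finally:
+            conn.close()
+```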
+
+**Caching strategy:**
+
+| Scenario | Action |
+|----------|--------|
+| Same date range within 24 hours | Use cache |
+| Date range includes today | Query Snowflake (data may be updating) |
+| User clicks "Refresh" | Query Snowflake |
+| Snowflake unavailable | Fallback to local CSV/Parquet |
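
The first three rows of this table reduce to a small decision function. The sketch below is one possible encoding; the name `should_query_snowflake` and its parameters are assumptions, and the fourth row (fallback when Snowflake is unavailable) belongs to the loader rather than to this check.

```python
from datetime import date, datetime, timedelta
from typing import Optional


def should_query_snowflake(
    start_date: date,
    end_date: date,
    cached_at: Optional[datetime],
    now: datetime,
    ttl_hours: int = 24,
    force_refresh: bool = False,
) -> bool:
    """Return True when Snowflake should be queried instead of the cache."""
    if force_refresh:          # user clicked "Refresh"
        return True
    if cached_at is None:      # nothing cached for this date range
        return True
    if start_date <= now.date() <= end_date:
        return True            # range includes today: data may still be updating
    return now - cached_at >= timedelta(hours=ttl_hours)  # cache expired


# Demo with a fixed "now" so the outcomes are reproducible
now = datetime(2025, 6, 1, 12, 0)
past_range = (date(2019, 4, 1), date(2025, 4, 30))
open_range = (date(2025, 1, 1), date(2025, 12, 31))

fresh_cache = should_query_snowflake(*past_range, now - timedelta(hours=2), now)
stale_cache = should_query_snowflake(*past_range, now - timedelta(hours=30), now)
includes_today = should_query_snowflake(*open_range, now - timedelta(hours=2), now)
forced = should_query_snowflake(
    *past_range, now - timedelta(hours=2), now, force_refresh=True
)
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, keeps the rule easy to unit test.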
+
+**Data loader with fallback:**
+
+```python
+import pandas as pd
+from snowflake.connector.errors import Error as SnowflakeError
+
+
+class DataLoader:
+    def load_data(self, start_date, end_date, force_refresh=False):
+        # 1. Try cache
+        if self.cache and not force_refresh:
+            cached = self.cache.get(start_date, end_date)
+            if cached is not None:
+                return cached, "cache"
+
+        # 2. Try Snowflake
+        try:
+            df = self.snowflake.fetch_activity_data(start_date, end_date)
+        except SnowflakeError:
+            pass  # Snowflake unavailable: fall through to local files
+        else:
+            if self.cache:  # caching may be disabled
+                self.cache.set(df, start_date, end_date)
+            return df, "snowflake"
+
+        # 3. Fallback to local files
+        if self.fallback_file.exists():
+            return pd.read_parquet(self.fallback_file), "local_file"
+
+        raise RuntimeError("No data source available")
+```
+
+**Dependencies to add:**
+
+```toml
+dependencies = [
+    "snowflake-connector-python[pandas]>=3.12.0",
+    "cryptography>=42.0.0",
+]
+```
+
+### Effort Estimate
+- Snowflake connector setup: 2-3 days
+- Caching layer: 1-2 days
+- GUI integration (data source selector): 1-2 days
+- Testing with real data: 2-3 days
+
+---
+
+## 4. GP Diagnosis Code Integration
+
+### What
+Use GP diagnosis codes as the **primary source** for directory/specialty assignment, with existing logic as fallback.
+
+### Why
+- More accurate: Diagnosis directly indicates specialty
+- Reduces "Undefined" assignments
+- Leverages existing NHS data linkage
+- Maintains current logic as safety net
+
+### How
+
+**NHS diagnosis code landscape:**
+
+| Code System | Usage | Notes |
+|-------------|-------|-------|
+| **SNOMED CT** | GP systems (mandatory since 2018) | Primary source |
+| **ICD-10** | Secondary care | Maps FROM SNOMED CT |
+| **Read Codes** | Legacy only | Historical records |
+
+**New priority chain:**
+
+```
+1. Drug has single valid directory → use that (unchanged)
+2. [NEW] GP diagnosis available → map SNOMED/ICD-10 to directory
+3. Extract from clinical data fields (existing)
+4. Most frequent for same patient/drug (existing)
+5. UPID-based inference (existing)
+6. Default to "Undefined" (existing)
+```
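
The chain above can be expressed as an ordered list of resolver functions, where the first one to return a directory wins and its label is recorded for audit. This is a sketch only: the record fields (`valid_directories`, `gp_directory`), the two stand-in resolvers, and the `DEFAULT` label are hypothetical, and the existing steps 3 to 5 would be added to the list in the same way.

```python
from typing import Callable, Optional

Resolver = Callable[[dict], Optional[str]]


def from_single_valid_directory(record: dict) -> Optional[str]:
    """Step 1: the drug maps to exactly one valid directory."""
    valid = record.get("valid_directories", [])
    return valid[0] if len(valid) == 1 else None


def from_gp_diagnosis(record: dict) -> Optional[str]:
    """Step 2: a directory already resolved from GP diagnosis codes, if any."""
    return record.get("gp_directory")


# Order matters: earlier entries take priority
RESOLVERS: list[tuple[str, Resolver]] = [
    ("DRUG_SINGLE", from_single_valid_directory),
    ("GP_DIAGNOSIS", from_gp_diagnosis),
]


def assign_directory(record: dict, resolvers=RESOLVERS) -> tuple[str, str]:
    """Return (directory, source) from the first resolver that succeeds."""
    for source, resolver in resolvers:
        directory = resolver(record)
        if directory:
            return directory, source
    return "Undefined", "DEFAULT"


single = assign_directory({"valid_directories": ["OPHTHALMOLOGY"]})
by_diagnosis = assign_directory(
    {"valid_directories": ["RHEUMATOLOGY", "NEPHROLOGY"], "gp_directory": "NEPHROLOGY"}
)
unresolved = assign_directory({"valid_directories": []})
```

Returning the source label alongside the directory means the audit column can be filled in the same pass, without a separate mask per step.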
+
+**ICD-10 to Directory mapping (examples):**
+
+```python
+ICD10_TO_DIRECTORY = {
+    # Neoplasms (Chapter II)
+    "C": ["MEDICAL ONCOLOGY", "CLINICAL ONCOLOGY", "CLINICAL HAEMATOLOGY"],
+
+    # Blood diseases (Chapter III)
+    "D5": ["CLINICAL HAEMATOLOGY"],
+    "D6": ["CLINICAL HAEMATOLOGY"],
+
+    # Endocrine (Chapter IV)
+    "E10": ["DIABETIC MEDICINE"],  # Type 1 diabetes
+    "E11": ["DIABETIC MEDICINE"],  # Type 2 diabetes
+
+    # Eye (Chapter VII)
+    "H0": ["OPHTHALMOLOGY"],
+    "H1": ["OPHTHALMOLOGY"],
+    "H2": ["OPHTHALMOLOGY"],
+    "H3": ["OPHTHALMOLOGY"],
+
+    # Musculoskeletal (Chapter XIII)
+    "M05": ["RHEUMATOLOGY"],  # Rheumatoid arthritis
+    "M06": ["RHEUMATOLOGY"],
+    "M32": ["RHEUMATOLOGY"],  # SLE
+
+    # Genitourinary (Chapter XIV)
+    "N0": ["NEPHROLOGY"],
+    "N1": ["NEPHROLOGY"],
+    "N18": ["NEPHROLOGY"],  # CKD
+}
+```
+
+**Multi-diagnosis resolution:**
+
+```python
+def lookup_directories(icd10_code):
+    """Return the directories for the longest matching ICD-10 prefix."""
+    # Keys in ICD10_TO_DIRECTORY vary in length ("C", "D5", "E10"),
+    # so try the most specific prefix first
+    for length in range(len(icd10_code), 0, -1):
+        directories = ICD10_TO_DIRECTORY.get(icd10_code[:length])
+        if directories:
+            return directories
+    return []
+
+
+def resolve_directory_from_diagnoses(diagnoses, drug_valid_dirs):
+    """
+    When patient has multiple diagnoses:
+    1. Filter to diagnoses mapping to directories valid for this drug
+    2. Oncology diagnoses take priority (ICD-10 chapter C)
+    3. Use most recent active diagnosis
+    4. Default to first alphabetically (deterministic)
+    """
+    valid_matches = []
+
+    for dx in diagnoses:
+        possible_dirs = lookup_directories(dx.icd10_code)
+        matching = set(possible_dirs) & set(drug_valid_dirs)
+
+        if matching:
+            valid_matches.append({
+                'directories': matching,
+                'is_oncology': dx.icd10_code.startswith('C'),
+                'date': dx.diagnosis_date
+            })
+
+    if not valid_matches:
+        return None  # Fall back to existing logic
+
+    # Most recent first, so both branches below pick the latest diagnosis
+    valid_matches.sort(key=lambda x: x['date'], reverse=True)
+
+    # Oncology priority
+    oncology = [m for m in valid_matches if m['is_oncology']]
+    if oncology:
+        return sorted(oncology[0]['directories'])[0]
+
+    # Most recent
+    return sorted(valid_matches[0]['directories'])[0]
+```
+
+**Data source options:**
+
+1. **Snowflake linked data** (recommended): Query `data_hub.dwh.DimClinicalCoding` joined via `PatientPseudo`
+2. **Local CSV cache**: Pre-extracted GP diagnosis data for offline use
+3. **Hybrid**: Cache with Snowflake refresh
+
+**GP Diagnosis Query (confirm column names via Snowflake MCP):**
+
+```sql
+SELECT
+    PatientPseudo,
+    SNOMEDCode,           -- or similar
+    ICD10Code,            -- may need mapping from SNOMED
+    DiagnosisDate,
+    DiagnosisStatus       -- Active/Resolved if available
+FROM data_hub.dwh.DimClinicalCoding
+WHERE PatientPseudo IN (:patient_pseudo_list)
+ORDER BY DiagnosisDate DESC
+```
+
+**New reference file needed (`./data/diagnosis_directory_map.csv`):**
+
+```csv
+icd10_prefix,directory,priority,notes
+C,MEDICAL ONCOLOGY,1,All malignancies
+C81,CLINICAL HAEMATOLOGY,1,Hodgkin lymphoma
+C90,CLINICAL HAEMATOLOGY,1,Multiple myeloma
+E10,DIABETIC MEDICINE,1,Type 1 diabetes
+E11,DIABETIC MEDICINE,1,Type 2 diabetes
+G35,NEUROLOGY,1,Multiple sclerosis
+H0,OPHTHALMOLOGY,1,Eye disorders
+M05,RHEUMATOLOGY,1,Rheumatoid arthritis
+N18,NEPHROLOGY,1,Chronic kidney disease
+```
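
Because the prefixes in this file differ in length (`C` versus `C81`), a lookup should prefer the most specific match. The sketch below loads the rows with the standard `csv` module and resolves a code by longest prefix. The function names are hypothetical, and treating a lower `priority` number as higher priority is an assumption about how that column is meant to be used.

```python
import csv
import io

MAP_CSV = """icd10_prefix,directory,priority,notes
C,MEDICAL ONCOLOGY,1,All malignancies
C81,CLINICAL HAEMATOLOGY,1,Hodgkin lymphoma
C90,CLINICAL HAEMATOLOGY,1,Multiple myeloma
E10,DIABETIC MEDICINE,1,Type 1 diabetes
E11,DIABETIC MEDICINE,1,Type 2 diabetes
G35,NEUROLOGY,1,Multiple sclerosis
H0,OPHTHALMOLOGY,1,Eye disorders
M05,RHEUMATOLOGY,1,Rheumatoid arthritis
N18,NEPHROLOGY,1,Chronic kidney disease
"""


def load_prefix_map(csv_text: str) -> dict[str, list[str]]:
    """Map each ICD-10 prefix to its directories, ordered by priority."""
    rows: dict[str, list[tuple[int, str]]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.setdefault(row["icd10_prefix"], []).append(
            (int(row["priority"]), row["directory"])
        )
    return {prefix: [d for _, d in sorted(items)] for prefix, items in rows.items()}


def directories_for(icd10_code: str, prefix_map: dict[str, list[str]]) -> list[str]:
    """Return directories for the longest prefix that matches the code."""
    code = icd10_code.replace(".", "").upper()  # "C81.9" -> "C819"
    for length in range(len(code), 0, -1):
        if code[:length] in prefix_map:
            return prefix_map[code[:length]]
    return []


prefix_map = load_prefix_map(MAP_CSV)
hodgkin = directories_for("C81.9", prefix_map)   # matches "C81" before "C"
breast = directories_for("C50.1", prefix_map)    # falls back to chapter "C"
eyelid = directories_for("H02", prefix_map)      # matches the two-character "H0"
unmapped = directories_for("Z00", prefix_map)
```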
+
+**Tracking assignment source (for audit):**
+
+```python
+df['Directory_Source'] = pd.NA  # New column
+
+# After each assignment step:
+df.loc[assigned_mask, 'Directory_Source'] = 'DRUG_SINGLE'      # Step 1
+df.loc[assigned_mask, 'Directory_Source'] = 'GP_DIAGNOSIS'     # Step 2 (NEW)
+df.loc[assigned_mask, 'Directory_Source'] = 'CLINICAL_EXTRACT' # Step 3
+# ... etc
+```
+
+### Prerequisites
+- Explore `data_hub.dwh.DimClinicalCoding` schema to confirm exact column names (use Snowflake MCP)
+- Map `PatientPseudo` to your HCD data (may need to add PatientPseudo to your data extract)
+- Obtain SNOMED CT to ICD-10 mapping table from NHS TRUD (if DimClinicalCoding only has SNOMED)
+
+### Effort Estimate
+- Mapping table creation: 2-3 days
+- Snowflake GP query development: 2-3 days
+- Integration with existing logic: 2-3 days
+- Validation and testing: 3-5 days
+
+---
+
+## 5. Code Quality Improvements
+
+### What
+Modernize the codebase with better structure, type hints, error handling, and testing.
+
+### Why
+- `generate_graph()` is 267 lines long, with cyclomatic complexity above 30
+- Zero type hints across entire codebase
+- Global variables create hidden state
+- No automated tests
+- Print statements instead of logging
+
+### How
+
+**Quick wins (implement first):**
+
+1. **Replace global variables** with dataclass:
+```python
+from dataclasses import dataclass
+from datetime import date
+
+@dataclass
+class AnalysisFilters:
+    start_date: date
+    end_date: date
+    last_seen: date
+    minimum_patients: int
+    selected_trusts: list[str]
+    selected_drugs: list[str]
+    selected_directories: list[str]
+    custom_title: str = ""
+
+    def validate(self) -> list[str]:
+        errors = []
+        if self.start_date >= self.end_date:
+            errors.append("Start date must be before end date")
+        return errors
+```
+
+2. **Externalize configuration:**
+```python
+from dataclasses import dataclass
+from pathlib import Path
+
+@dataclass
+class PathConfig:
+    data_dir: Path = Path("./data")
+
+    @property
+    def drug_names_file(self) -> Path:
+        return self.data_dir / "include.csv"
+
+    @property
+    def org_codes_file(self) -> Path:
+        return self.data_dir / "org_codes.csv"
+
+    # ... etc for all 7 reference files
+
+    def validate(self) -> list[str]:
+        """Check all required files exist at startup."""
+        errors = []
+        required = [
+            self.drug_names_file,
+            self.org_codes_file,
+            # ... etc for all 7 reference files
+        ]
+        for file_path in required:
+            if not file_path.exists():
+                errors.append(f"Required file not found: {file_path}")
+        return errors
+```
+
+3. **Add logging:**
+```python
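+import logging
+from pathlib import Path
+
+# FileHandler does not create missing directories
+Path("./logs").mkdir(parents=True, exist_ok=True)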
+
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+    handlers=[
+        logging.FileHandler("./logs/analysis.log"),
+        logging.StreamHandler()
+    ]
+)
+logger = logging.getLogger("PatientPathway")
+
+# Replace all print() with:
+logger.info("Starting analysis...")
+logger.error(f"Failed to load file: {e}")
+```
+
+4. **Extract `generate_graph()` into smaller functions:**
+```python
+def generate_graph(df, filters: AnalysisFilters, config: PathConfig):
+    df = prepare_data(df, filters)           # ~50 lines
+    stats = calculate_statistics(df)          # ~80 lines
+    hierarchy = build_hierarchy(df, stats)    # ~60 lines
+    chart_data = prepare_chart_data(hierarchy) # ~40 lines
+    return render_icicle_chart(chart_data, filters.custom_title)  # ~40 lines
+```
+
+**Recommended project structure:**
+
+```
+project/
+├── gui.py                    # Entry point only
+├── core/
+│   ├── config.py            # PathConfig, AnalysisFilters
+│   ├── models.py            # Data models
+│   └── exceptions.py        # Custom exceptions
+├── data_processing/
+│   ├── loader.py            # File/Snowflake loading
+│   ├── transformer.py       # Data transformations
+│   └── validator.py         # Data validation
+├── analysis/
+│   ├── pathway_analyzer.py  # Patient pathway calculations
+│   └── statistics.py        # Statistical calculations
+├── visualization/
+│   └── plotly_generator.py  # Graph generation
+└── tests/
+    ├── test_data_processing.py
+    ├── test_analysis.py
+    └── test_config.py
+```
+
+**Add development dependencies:**
+
+```toml
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.1.0",
+    "mypy>=1.8.0",
+    "black>=24.0.0",
+    "ruff>=0.2.0",
+]
+```
+
+**Priority tests to write:**
+
+```python
+# tests/test_data_processing.py
+def test_drop_duplicate_treatments_ascending():
+    """Verify first intervention kept when ascending=True."""
+    # ...
+
+def test_drop_duplicate_treatments_descending():
+    """Verify last intervention kept when ascending=False."""
+    # ...
+
+# tests/test_config.py
+def test_path_config_validates_missing_files():
+    """Verify validation catches missing reference files."""
+    # ...
+
+def test_analysis_filters_validates_date_range():
+    """Verify start date must be before end date."""
+    # ...
+```
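
As an illustration of how one of these stubs might be filled in, the sketch below tests the date-range check. It uses a trimmed stand-in for `AnalysisFilters` with only the two date fields so that it runs on its own; in the project the tests would import the real class, and `test_analysis_filters_accepts_valid_range` is an extra hypothetical case.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class AnalysisFilters:
    """Trimmed stand-in for the real dataclass (date fields only)."""
    start_date: date
    end_date: date

    def validate(self) -> list[str]:
        errors = []
        if self.start_date >= self.end_date:
            errors.append("Start date must be before end date")
        return errors


def test_analysis_filters_validates_date_range():
    """A reversed range produces exactly one validation error."""
    reversed_range = AnalysisFilters(date(2025, 4, 30), date(2019, 4, 1))
    assert reversed_range.validate() == ["Start date must be before end date"]


def test_analysis_filters_accepts_valid_range():
    """A start date before the end date produces no errors."""
    valid_range = AnalysisFilters(date(2019, 4, 1), date(2025, 4, 30))
    assert valid_range.validate() == []
```

Functions named `test_*` are collected automatically by `pytest`, so no runner code is needed in the test module.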
+
+### Effort Estimate
+- Dataclasses and config: 1-2 days
+- Logging setup: 0.5 days
+- Extract `generate_graph()`: 2-3 days
+- Add type hints (public API): 1-2 days
+- Basic test coverage: 2-3 days
+
+---
+
+## Implementation Roadmap
+
+### Phase 1: Foundation (2-3 weeks)
+1. Create `PathConfig` and `AnalysisFilters` dataclasses
+2. Set up logging infrastructure
+3. Design and create SQLite schema
+4. Migrate reference data CSVs to SQLite
+
+### Phase 2: Data Layer (2-3 weeks)
+1. Implement Snowflake connector with SSO browser auth
+2. Build caching layer with TTL
+3. Create data loader with fallback chain
+4. Migrate `dashboard_gui.py` to use SQLite queries
+
+### Phase 3: Diagnosis Integration (2-3 weeks)
+1. Explore `data_hub.dwh.DimClinicalCoding` schema via Snowflake MCP
+2. Create ICD-10 to directory mapping table
+3. Implement GP diagnosis lookup using `PatientPseudo` linkage
+4. Integrate into `department_identification()` as step 2
+5. Add `Directory_Source` tracking column
+
+### Phase 4: GUI Modernization (3-4 weeks)
+1. Learn Reflex fundamentals
+2. Recreate main window and navigation with `rx.vstack`/`rx.hstack`
+3. Implement filter panels (date pickers, checkbox groups)
+4. Integrate Plotly charts with native `rx.plotly()` component
+5. Test with `reflex run`
+
+### Phase 5: Quality & Polish (1-2 weeks)
+1. Add type hints to public API
+2. Write priority unit tests
+3. Extract `generate_graph()` into smaller functions
+4. Documentation and cleanup
+
+---
+
+## Configuration Decisions
+
+Based on requirements, the following decisions have been made:
+
+| Question | Decision |
+|----------|----------|
+| **Snowflake auth** | SSO browser login (`authenticator='externalbrowser'`) |
+| **GP diagnosis data** | `data_hub.dwh.DimClinicalCoding` |
+| **Patient linkage** | Use `PatientPseudo` (anonymized identifier) - NOT UPID |
+| **Plotly interactivity** | Must be interactive - **Reflex has native `rx.plotly()` component** |
+| **Distribution** | Python script (`reflex run`) - no .exe needed |
+
+### Implications
+
+**Snowflake SSO**: Connection code becomes:
+```python
+import os
+
+import snowflake.connector
+
+conn = snowflake.connector.connect(
+    account="your_account.region",
+    user=os.environ.get("SNOWFLAKE_USER"),
+    authenticator="externalbrowser",  # Opens browser for SSO
+    warehouse="ANALYTICS_WH",
+    database="data_hub",
+    schema="dwh"
+)
+```
+
+**Patient Linkage**: The GP diagnosis query needs to join on `PatientPseudo`, not UPID:
+```sql
+SELECT
+    cc.PatientPseudo,
+    cc.SNOMEDCode,      -- Confirm actual column names
+    cc.ICD10Code,
+    cc.DiagnosisDate
+FROM data_hub.dwh.DimClinicalCoding cc
+WHERE cc.PatientPseudo IN (:patient_list)
+```
+
+**Note**: You'll need to confirm the exact column names in `DimClinicalCoding` - explore via Snowflake MCP or SQL client.
+
+**Plotly Interactivity**: Reflex solves this elegantly with native embedding:
+```python
+# Interactive Plotly chart directly in the Reflex app
+rx.plotly(data=State.chart_data, layout=chart_layout)
+```
+Full interactivity (zoom, pan, hover tooltips) works in the browser-based app - no external HTML files needed.
+
+---
+
+## References
+
+- [Reflex Documentation](https://reflex.dev/docs/)
+- [Reflex Plotly Component](https://reflex.dev/docs/library/graphing/plotly/)
+- [Flet Documentation](https://flet.dev/docs/) (alternative)
+- [Snowflake Python Connector](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector)
+- [NHS SNOMED CT](https://digital.nhs.uk/services/terminology-and-classifications/snomed-ct)
+- [NHS ICD-10 Classifications](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/28)