# Patient Pathway Analysis - Improvement Recommendations

This document outlines recommended improvements to modernize the Patient Pathway Analysis application, based on multi-domain expert analysis.

---

## Executive Summary

| Area | Current State | Recommended Change | Priority |
|------|--------------|-------------------|----------|
| **GUI Framework** | CustomTkinter | **Reflex** (browser-based, native Plotly) | High |
| **Data Storage** | CSV files (90MB+) | SQLite with caching | High |
| **Data Source** | Manual CSV export | Direct Snowflake connection | Medium |
| **Directory Assignment** | Multi-stage fallback | GP diagnosis codes as primary | Medium |
| **Code Quality** | Monolithic, no types | Modular, typed, tested | Low |

---

## 1. GUI Framework: Replace CustomTkinter with Reflex or Flet

### What

Replace the CustomTkinter-based GUI with a modern Python framework. Two strong options:

- **[Reflex](https://reflex.dev)** - React-based, runs in browser
- **[Flet](https://flet.dev)** - Flutter-based, native desktop or browser

### Why

Since Python is approved and standalone `.exe` distribution isn't required, **both frameworks are viable**.

| Criterion | CustomTkinter | Reflex | Flet |
|-----------|---------------|--------|------|
| UI paradigm | Native desktop | Browser (localhost) | Desktop or browser |
| Component richness | Limited | 60+ React components | Material Design |
| Styling | Manual/limited | Full CSS/Tailwind | Flutter theming |
| Plotly integration | External HTML | **Native embed** | WebView needed |
| State management | Manual | Automatic re-render | Manual updates |
| Learning curve | Low | Moderate (React-like) | Low-moderate |
| Community | Small | 22k+ GitHub stars | 12k+ GitHub stars |
| Maturity | Stable | Active (v0.6+) | Active (v0.80+) |

### Recommendation: **Reflex**

Given that:

1. Python is approved for users
2. Standalone `.exe` not required
3. **Interactive Plotly is required** (Reflex has native `rx.plotly()` component)

Reflex is the better choice because:

- **Native Plotly support** - no need to open external browser windows
- **Modern React-based UI** - cleaner, more customizable
- **Simpler state management** - automatic re-rendering on state changes
- **Better for data apps** - designed for dashboards and data visualization

### How (Reflex)

**Basic app structure:**

```python
import plotly.graph_objects as go
import reflex as rx

# Plotly layout overrides passed to rx.plotly (empty = Plotly defaults)
chart_layout: dict = {}


class State(rx.State):
    """Application state."""

    start_date: str = "2019-04-01"
    end_date: str = "2025-04-30"
    available_drugs: list[str] = []  # Populated from the drug reference data
    selected_drugs: list[str] = []
    selected_trusts: list[str] = []
    analysis_running: bool = False
    chart_data: go.Figure = go.Figure()

    def set_start_date(self, value: str):
        self.start_date = value

    def toggle_drug(self, drug: str, checked: bool):
        """Add or remove a drug from the selection."""
        if checked and drug not in self.selected_drugs:
            self.selected_drugs.append(drug)
        elif not checked and drug in self.selected_drugs:
            self.selected_drugs.remove(drug)

    async def run_analysis(self):
        self.analysis_running = True
        yield  # Push the "running" state to the UI

        # load_and_process_data() and generate_plotly_figure() are placeholders
        # for the existing analysis code
        df = await self.load_and_process_data()
        self.chart_data = generate_plotly_figure(df)
        self.analysis_running = False


def index() -> rx.Component:
    return rx.box(
        rx.hstack(
            # Sidebar with filters
            rx.vstack(
                # Native date input; check component names against the installed Reflex version
                rx.input(
                    type="date",
                    value=State.start_date,
                    on_change=State.set_start_date,
                ),
                # One checkbox per available drug
                rx.foreach(
                    State.available_drugs,
                    lambda drug: rx.checkbox(
                        drug,
                        on_change=lambda checked: State.toggle_drug(drug, checked),
                    ),
                ),
                rx.button(
                    "Run Analysis",
                    on_click=State.run_analysis,
                    loading=State.analysis_running,
                ),
                width="300px",
            ),
            # Main content - interactive Plotly chart
            rx.plotly(data=State.chart_data, layout=chart_layout),
            width="100%",
        )
    )


app = rx.App()
app.add_page(index)
```

**Key components mapping:**

| Current Component | Reflex Equivalent |
|-------------------|-------------------|
| `CTkFrame` | `rx.box`, `rx.vstack`, `rx.hstack` |
| `CTkButton` | `rx.button` |
| `CTkCheckBox` | `rx.checkbox` |
| `CTkSlider` | `rx.slider` |
| `DateEntry` | `rx.input(type="date")` |
| `CTkScrollableFrame` | `rx.scroll_area` |
| `filedialog` | `rx.upload` |
| Plotly HTML file | **`rx.plotly()`** - native embed! |

**Running the app:**

```bash
# Install
pip install reflex

# Initialize (first time)
reflex init

# Run development server
reflex run
# Opens http://localhost:3000 in browser
```

**Progress updates during analysis:**

```python
class State(rx.State):
    progress: int = 0
    status: str = ""

    async def run_analysis(self):
        self.status = "Loading data..."
        self.progress = 10
        yield  # Each yield pushes the current state to the browser

        df = load_data()  # Placeholder for the existing loader

        self.status = "Processing..."
        self.progress = 50
        yield

        result = process_data(df)  # Placeholder for the existing processing

        self.status = "Complete"
        self.progress = 100
        yield
```

`load_data()` and `process_data()` are blocking calls in this handler. For work that runs longer than a few seconds, recent Reflex versions offer background events (`@rx.event(background=True)`, with state changes made inside `async with self:`) so the UI stays responsive.

### Alternative: Flet

If you prefer a more desktop-like feel, Flet remains a good option:

```python
import flet as ft


def main(page: ft.Page):
    page.title = "HCD Analysis"

    async def do_analysis():
        ...  # Placeholder for the existing analysis code

    def run_analysis(e):
        # Schedule the coroutine as a background task
        page.run_task(do_analysis)

    def open_chart(e):
        ...  # Placeholder: open the generated Plotly HTML in the browser

    page.add(
        ft.Row([
            # Sidebar
            ft.Column([
                # DatePicker is a dialog control; how it is opened depends on the Flet version
                ft.DatePicker(),
                ft.ElevatedButton("Run", on_click=run_analysis),
            ]),
            # Chart area (opens in browser for interactivity)
            ft.ElevatedButton("View Chart", on_click=open_chart),
        ])
    )


ft.app(target=main)  # Desktop window
# OR
# ft.app(target=main, view=ft.AppView.WEB_BROWSER)  # Browser
```

### Effort Estimate

- Learning Reflex basics: 2-3 days
- Rewriting GUI: 1-2 weeks
- Testing and polish: 3-5 days

---

## 2. Data Storage: SQLite Architecture

### What

Replace CSV-based data loading with a SQLite database that stores reference data in normalized tables and caches processed patient data.
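A minimal sketch of the caching idea, assuming incremental loads are driven by a `processed_files` tracking table like the one in the schema later in this section. The helper names (`init_tracking`, `file_sha256`, `needs_processing`, `mark_processed`) are hypothetical and only illustrate how unchanged export files could be skipped:

```python
import hashlib
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def init_tracking(conn: sqlite3.Connection) -> None:
    """Create the file-tracking table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS processed_files (
            file_path TEXT PRIMARY KEY,
            file_hash TEXT NOT NULL,
            last_processed TIMESTAMP
        )
        """
    )


def file_sha256(path: Path) -> str:
    """Hash the file in 1 MB chunks so large exports are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_processing(conn: sqlite3.Connection, path: Path) -> bool:
    """True when the file is new or its content hash differs from the stored one."""
    row = conn.execute(
        "SELECT file_hash FROM processed_files WHERE file_path = ?",
        (str(path),),
    ).fetchone()
    return row is None or row[0] != file_sha256(path)


def mark_processed(conn: sqlite3.Connection, path: Path) -> None:
    """Record the current hash after a successful load."""
    conn.execute(
        "INSERT OR REPLACE INTO processed_files (file_path, file_hash, last_processed) "
        "VALUES (?, ?, ?)",
        (str(path), file_sha256(path), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```

A loader would call `needs_processing()` before parsing each export and `mark_processed()` once its rows are committed, so a restart only reprocesses files whose content has changed.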
### Why

| Aspect | Current (CSV) | SQLite |
|--------|---------------|--------|
| Startup time | 90MB+ file read + full processing | Load reference data once (< 1MB) |
| Memory usage | Entire dataset in memory | Incremental queries |
| Incremental updates | Full reprocess required | Only process new/changed records |
| Query performance | Pandas groupby/merge | Indexed SQL with CTEs |
| Data consistency | Multiple CSVs can drift | Single source of truth with FK constraints |
| Caching | None | Summary tables (emulated materialized views) |

**Expected improvements (estimates):**

- 60-80% faster startup
- 50-70% memory reduction
- 90%+ time savings on incremental updates

### How

**Recommended schema (simplified):**

```sql
-- Foreign keys are only enforced after: PRAGMA foreign_keys = ON;

-- Reference tables
CREATE TABLE ref_drug_names (
    drug_name_raw TEXT PRIMARY KEY,
    drug_name_std TEXT NOT NULL
);

CREATE TABLE ref_organizations (
    org_code TEXT PRIMARY KEY,
    org_name TEXT NOT NULL
);

CREATE TABLE ref_directories (
    directory_id INTEGER PRIMARY KEY,
    directory_name TEXT UNIQUE NOT NULL
);

CREATE TABLE ref_drug_directory_map (
    drug_name_std TEXT,
    directory_id INTEGER REFERENCES ref_directories(directory_id),
    is_single_valid BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (drug_name_std, directory_id)
);

-- Patient data (fact table)
CREATE TABLE fact_interventions (
    intervention_id INTEGER PRIMARY KEY,
    upid TEXT NOT NULL,
    provider_code TEXT,
    drug_name_std TEXT NOT NULL,
    intervention_date DATE NOT NULL,
    price_actual REAL,
    directory_id INTEGER REFERENCES ref_directories(directory_id),
    directory_assignment_method TEXT,
    data_load_batch_id INTEGER
);

-- Critical indexes
CREATE INDEX idx_upid ON fact_interventions(upid);
CREATE INDEX idx_upid_drug ON fact_interventions(upid, drug_name_std);
CREATE INDEX idx_intervention_date ON fact_interventions(intervention_date);

-- "Materialized view" for patient summaries (cached aggregations).
-- SQLite has no native materialized views, so this is a plain table
-- that must be refreshed explicitly after each load.
CREATE TABLE mv_patient_treatment_summary (
    upid TEXT PRIMARY KEY,
    first_seen DATE,
    last_seen DATE,
    total_cost REAL,
    drug_count INTEGER,
    last_refresh TIMESTAMP
);

-- File tracking for incremental updates
CREATE TABLE processed_files (
    file_path TEXT PRIMARY KEY,
    file_hash TEXT NOT NULL,
    last_processed TIMESTAMP
);
```

**Migration strategy:**

1. **Phase 1**: Create schema, load reference tables from existing CSVs
2. **Phase 2**: Develop incremental load scripts for patient data
3. **Phase 3**: Build summary tables (emulated materialized views) for aggregations
4. **Phase 4**: Modify `dashboard_gui.py` to query SQLite instead of processing CSVs

**Key query replacing pandas aggregation:**

```sql
-- Replaces ~200 lines of pandas groupby/merge
WITH patient_drugs AS (
    SELECT
        upid,
        drug_name_std,
        MIN(intervention_date) as first_date,
        MAX(intervention_date) as last_date,
        COUNT(*) as intervention_count,
        SUM(price_actual) as drug_cost
    FROM fact_interventions
    WHERE intervention_date BETWEEN :start_date AND :end_date
      -- A list cannot be bound to one parameter in SQLite:
      -- generate one placeholder per provider code when building the query
      AND provider_code IN (:trust_filters)
    GROUP BY upid, drug_name_std
)
SELECT * FROM patient_drugs;
```

### Effort Estimate

- Schema design and setup: 2-3 days
- Migration scripts: 3-4 days
- Query optimization: 2-3 days
- Integration testing: 2-3 days

---

## 3. Snowflake Integration

### What

Enable direct download of HCD activity data from Snowflake servers, replacing manual CSV exports.

### Why

- Eliminates manual export step
- Enables date-range filtering at query level (faster)
- Automatic caching with TTL
- Graceful fallback to local files if Snowflake unavailable

### How

**Authentication: SSO Browser Login**

The `externalbrowser` authenticator opens the system browser for SSO authentication:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account.region",
    user="your.email@nhs.net",
    authenticator="externalbrowser",
    warehouse="ANALYTICS_WH",
    database="data_hub",
    schema="dwh"
)
```

**Note**: The user will see a browser popup on the first connection of each session.
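Since the SSO popup is typically triggered when a new connection is opened (unless credential caching is configured), one way to keep it to once per session is to create the connection lazily and reuse it. The sketch below assumes this approach; `SessionConnection` is a hypothetical helper, not part of the Snowflake connector, and the factory passed in would wrap the `snowflake.connector.connect(...)` call shown above.

```python
from typing import Any, Callable, Optional


class SessionConnection:
    """Hypothetical helper: open the connection on first use, then reuse it."""

    def __init__(self, factory: Callable[[], Any]):
        # factory is any zero-argument callable that returns an open connection,
        # e.g. lambda: snowflake.connector.connect(..., authenticator="externalbrowser")
        self._factory = factory
        self._conn: Optional[Any] = None

    def get(self) -> Any:
        """Return the shared connection, creating it only if none is open."""
        if self._conn is None:
            self._conn = self._factory()
        return self._conn

    def close(self) -> None:
        """Close and forget the connection; the next get() opens a new one."""
        if self._conn is not None:
            self._conn.close()
            self._conn = None
```

Every query in a session would then go through `session.get()`, and the application would call `session.close()` on exit.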
**Configuration (`config/snowflake.toml`):**

```toml
[snowflake]
account = "your_account.region"
warehouse = "ANALYTICS_WH"
database = "DataWarehouse"
schema = "dwh"

[query]
default_timeout = 300
chunk_size = 100000

[cache]
enabled = true
ttl_hours = 24
directory = "./data/cache"
```

**Core connector pattern:**

```python
from snowflake.connector import connect


class SnowflakeConnector:
    def fetch_activity_data(self, start_date, end_date, provider_codes=None):
        # The connector's default paramstyle is pyformat, so named binds use %(name)s.
        # Filtering by provider_codes is not shown here.
        query = """
            SELECT
                "Provider Code",
                "PersonKey",
                "ProductDescription" as "Drug Name",
                "Intervention Date",
                "Price Actual",
                -- ... other columns
            FROM DataWarehouse.dwh.FactHighCostDrugs
            WHERE "Intervention Date" BETWEEN %(start_date)s AND %(end_date)s
        """
        # self.connect() is assumed to wrap connect(...) with the settings above
        with self.connect() as conn:
            cursor = conn.cursor()
            cursor.execute(query, {'start_date': start_date, 'end_date': end_date})
            return cursor.fetch_pandas_all()
```

**Caching strategy:**

| Scenario | Action |
|----------|--------|
| Same date range within 24 hours | Use cache |
| Date range includes today | Query Snowflake (data may be updating) |
| User clicks "Refresh" | Query Snowflake |
| Snowflake unavailable | Fallback to local CSV/Parquet |

**Data loader with fallback:**

```python
import pandas as pd


class DataLoader:
    def load_data(self, start_date, end_date, force_refresh=False):
        # 1. Try cache
        if self.cache and not force_refresh:
            cached = self.cache.get(start_date, end_date)
            if cached is not None:
                return cached, "cache"

        # 2. Try Snowflake
        try:
            df = self.snowflake.fetch_activity_data(start_date, end_date)
            if self.cache:  # The cache may be disabled in the config
                self.cache.set(df, start_date, end_date)
            return df, "snowflake"
        except SnowflakeConnectionError:  # Project-defined exception
            pass

        # 3. Fallback to local files
        if self.fallback_file.exists():
            return pd.read_parquet(self.fallback_file), "local_file"

        raise RuntimeError("No data source available")
```

**Dependencies to add:**

```toml
dependencies = [
    "snowflake-connector-python[pandas]>=3.12.0",
    "cryptography>=42.0.0",
]
```

### Effort Estimate

- Snowflake connector setup: 2-3 days
- Caching layer: 1-2 days
- GUI integration (data source selector): 1-2 days
- Testing with real data: 2-3 days

---

## 4. GP Diagnosis Code Integration

### What

Use GP diagnosis codes as the **primary source** for directory/specialty assignment, with existing logic as fallback.

### Why

- More accurate: Diagnosis directly indicates specialty
- Reduces "Undefined" assignments
- Leverages existing NHS data linkage
- Maintains current logic as safety net

### How

**NHS diagnosis code landscape:**

| Code System | Usage | Notes |
|-------------|-------|-------|
| **SNOMED CT** | GP systems (mandatory since 2018) | Primary source |
| **ICD-10** | Secondary care | Maps FROM SNOMED CT |
| **Read Codes** | Legacy only | Historical records |

**New priority chain:**

```
1. Drug has single valid directory → use that (unchanged)
2. [NEW] GP diagnosis available → map SNOMED/ICD-10 to directory
3. Extract from clinical data fields (existing)
4. Most frequent for same patient/drug (existing)
5. UPID-based inference (existing)
6. Default to "Undefined" (existing)
```

**ICD-10 to Directory mapping (examples):**

```python
ICD10_TO_DIRECTORY = {
    # Neoplasms (Chapter II)
    "C": ["MEDICAL ONCOLOGY", "CLINICAL ONCOLOGY", "CLINICAL HAEMATOLOGY"],

    # Blood diseases (Chapter III)
    "D5": ["CLINICAL HAEMATOLOGY"],
    "D6": ["CLINICAL HAEMATOLOGY"],

    # Endocrine (Chapter IV)
    "E10": ["DIABETIC MEDICINE"],  # Type 1 diabetes
    "E11": ["DIABETIC MEDICINE"],  # Type 2 diabetes

    # Eye (Chapter VII)
    "H0": ["OPHTHALMOLOGY"],
    "H1": ["OPHTHALMOLOGY"],
    "H2": ["OPHTHALMOLOGY"],
    "H3": ["OPHTHALMOLOGY"],

    # Musculoskeletal (Chapter XIII)
    "M05": ["RHEUMATOLOGY"],  # Rheumatoid arthritis
    "M06": ["RHEUMATOLOGY"],
    "M32": ["RHEUMATOLOGY"],  # SLE

    # Genitourinary (Chapter XIV)
    "N0": ["NEPHROLOGY"],
    "N1": ["NEPHROLOGY"],
    "N18": ["NEPHROLOGY"],  # CKD
}
```

**Multi-diagnosis resolution:**

```python
def lookup_directories(icd10_code):
    """Longest-prefix match: "N18" wins over "N1", and "C" matches any C code."""
    for length in (3, 2, 1):
        dirs = ICD10_TO_DIRECTORY.get(icd10_code[:length])
        if dirs:
            return dirs
    return []


def resolve_directory_from_diagnoses(diagnoses, drug_valid_dirs):
    """
    When patient has multiple diagnoses:
    1. Filter to diagnoses mapping to directories valid for this drug
    2. Oncology diagnoses take priority (ICD-10 chapter C)
    3. Use most recent active diagnosis
    4. Default to first alphabetically (deterministic)
    """
    valid_matches = []

    for dx in diagnoses:
        # The mapping keys are 1-3 characters long, so a fixed [:3] slice
        # would never match keys such as "C" or "H0"
        possible_dirs = lookup_directories(dx.icd10_code)

        matching = set(possible_dirs) & set(drug_valid_dirs)
        if matching:
            valid_matches.append({
                'directories': matching,
                'is_oncology': dx.icd10_code.startswith('C'),
                'date': dx.diagnosis_date
            })

    if not valid_matches:
        return None  # Fall back to existing logic

    # Most recent first, so the oncology pick below is also the most recent one
    valid_matches.sort(key=lambda x: x['date'], reverse=True)

    # Oncology priority
    oncology = [m for m in valid_matches if m['is_oncology']]
    if oncology:
        return sorted(oncology[0]['directories'])[0]

    # Most recent
    return sorted(valid_matches[0]['directories'])[0]
```

**Data source options:**

1. **Snowflake linked data** (recommended): Query `data_hub.dwh.DimClinicalCoding` joined via `PatientPseudo`
2. **Local CSV cache**: Pre-extracted GP diagnosis data for offline use
3. **Hybrid**: Cache with Snowflake refresh

**GP Diagnosis Query (confirm column names via Snowflake MCP):**

```sql
SELECT
    PatientPseudo,
    SNOMEDCode,           -- or similar
    ICD10Code,            -- may need mapping from SNOMED
    DiagnosisDate,
    DiagnosisStatus       -- Active/Resolved if available
FROM data_hub.dwh.DimClinicalCoding
WHERE PatientPseudo IN (:patient_pseudo_list)
ORDER BY DiagnosisDate DESC
```

**New reference file needed (`./data/diagnosis_directory_map.csv`):**

```csv
icd10_prefix,directory,priority,notes
C,MEDICAL ONCOLOGY,1,All malignancies
C81,CLINICAL HAEMATOLOGY,1,Hodgkin lymphoma
C90,CLINICAL HAEMATOLOGY,1,Multiple myeloma
E10,DIABETIC MEDICINE,1,Type 1 diabetes
E11,DIABETIC MEDICINE,1,Type 2 diabetes
G35,NEUROLOGY,1,Multiple sclerosis
H0,OPHTHALMOLOGY,1,Eye disorders
M05,RHEUMATOLOGY,1,Rheumatoid arthritis
N18,NEPHROLOGY,1,Chronic kidney disease
```

**Tracking assignment source (for audit):**

```python
df['Directory_Source'] = pd.NA  # New column

# After each assignment step:
df.loc[assigned_mask, 'Directory_Source'] = 'DRUG_SINGLE'      # Step 1
df.loc[assigned_mask, 'Directory_Source'] = 'GP_DIAGNOSIS'     # Step 2 (NEW)
df.loc[assigned_mask, 'Directory_Source'] = 'CLINICAL_EXTRACT' # Step 3
# ... etc
```

### Prerequisites

- Explore `data_hub.dwh.DimClinicalCoding` schema to confirm exact column names (use Snowflake MCP)
- Map `PatientPseudo` to your HCD data (may need to add PatientPseudo to your data extract)
- Obtain SNOMED CT to ICD-10 mapping table from NHS TRUD (if DimClinicalCoding only has SNOMED)

### Effort Estimate

- Mapping table creation: 2-3 days
- Snowflake GP query development: 2-3 days
- Integration with existing logic: 2-3 days
- Validation and testing: 3-5 days

---

## 5. Code Quality Improvements

### What

Modernize the codebase with better structure, type hints, error handling, and testing.
### Why

- `generate_graph()` is 267 lines with complexity >30
- Zero type hints across entire codebase
- Global variables create hidden state
- No automated tests
- Print statements instead of logging

### How

**Quick wins (implement first):**

1. **Replace global variables** with dataclass:

   ```python
   from dataclasses import dataclass
   from datetime import date


   @dataclass
   class AnalysisFilters:
       start_date: date
       end_date: date
       last_seen: date
       minimum_patients: int
       selected_trusts: list[str]
       selected_drugs: list[str]
       selected_directories: list[str]
       custom_title: str = ""

       def validate(self) -> list[str]:
           errors = []
           if self.start_date >= self.end_date:
               errors.append("Start date must be before end date")
           return errors
   ```

2. **Externalize configuration:**

   ```python
   from dataclasses import dataclass
   from pathlib import Path


   @dataclass
   class PathConfig:
       data_dir: Path = Path("./data")

       @property
       def drug_names_file(self) -> Path:
           return self.data_dir / "include.csv"

       @property
       def org_codes_file(self) -> Path:
           return self.data_dir / "org_codes.csv"

       # ... etc for all 7 reference files

       def validate(self) -> list[str]:
           """Check all required files exist at startup."""
           errors = []
           for file_path in [self.drug_names_file, self.org_codes_file, ...]:
               if not file_path.exists():
                   errors.append(f"Required file not found: {file_path}")
           return errors
   ```

3. **Add logging:**

   ```python
   import logging
   from pathlib import Path

   # FileHandler fails if the log directory does not exist
   Path("./logs").mkdir(exist_ok=True)

   logging.basicConfig(
       level=logging.INFO,
       format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
       handlers=[
           logging.FileHandler("./logs/analysis.log"),
           logging.StreamHandler()
       ]
   )
   logger = logging.getLogger("PatientPathway")

   # Replace all print() with:
   logger.info("Starting analysis...")
   logger.error(f"Failed to load file: {e}")
   ```

4. **Extract `generate_graph()` into smaller functions:**

   ```python
   def generate_graph(df, filters: AnalysisFilters, config: PathConfig):
       df = prepare_data(df, filters)                    # ~50 lines
       stats = calculate_statistics(df)                  # ~80 lines
       hierarchy = build_hierarchy(df, stats)            # ~60 lines
       chart_data = prepare_chart_data(hierarchy)        # ~40 lines
       return render_icicle_chart(chart_data, filters.custom_title)  # ~40 lines
   ```

**Recommended project structure:**

```
project/
├── gui.py                    # Entry point only
├── core/
│   ├── config.py             # PathConfig, AnalysisFilters
│   ├── models.py             # Data models
│   └── exceptions.py         # Custom exceptions
├── data_processing/
│   ├── loader.py             # File/Snowflake loading
│   ├── transformer.py        # Data transformations
│   └── validator.py          # Data validation
├── analysis/
│   ├── pathway_analyzer.py   # Patient pathway calculations
│   └── statistics.py         # Statistical calculations
├── visualization/
│   └── plotly_generator.py   # Graph generation
└── tests/
    ├── test_data_processing.py
    ├── test_analysis.py
    └── test_config.py
```

**Add development dependencies:**

```toml
[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-cov>=4.1.0",
    "mypy>=1.8.0",
    "black>=24.0.0",
    "ruff>=0.2.0",
]
```

**Priority tests to write:**

```python
# tests/test_data_processing.py
def test_drop_duplicate_treatments_ascending():
    """Verify first intervention kept when ascending=True."""
    # ...

def test_drop_duplicate_treatments_descending():
    """Verify last intervention kept when ascending=False."""
    # ...

# tests/test_config.py
def test_path_config_validates_missing_files():
    """Verify validation catches missing reference files."""
    # ...

def test_analysis_filters_validates_date_range():
    """Verify start date must be before end date."""
    # ...
```

### Effort Estimate

- Dataclasses and config: 1-2 days
- Logging setup: 0.5 days
- Extract `generate_graph()`: 2-3 days
- Add type hints (public API): 1-2 days
- Basic test coverage: 2-3 days

---

## Implementation Roadmap

### Phase 1: Foundation (2-3 weeks)

1. Create `PathConfig` and `AnalysisFilters` dataclasses
2. Set up logging infrastructure
3. Design and create SQLite schema
4. Migrate reference data CSVs to SQLite

### Phase 2: Data Layer (2-3 weeks)

1. Implement Snowflake connector with SSO browser auth
2. Build caching layer with TTL
3. Create data loader with fallback chain
4. Migrate `dashboard_gui.py` to use SQLite queries

### Phase 3: Diagnosis Integration (2-3 weeks)

1. Explore `data_hub.dwh.DimClinicalCoding` schema via Snowflake MCP
2. Create ICD-10 to directory mapping table
3. Implement GP diagnosis lookup using `PatientPseudo` linkage
4. Integrate into `department_identification()` as step 2
5. Add `Directory_Source` tracking column

### Phase 4: GUI Modernization (3-4 weeks)

1. Learn Reflex fundamentals
2. Recreate main window and navigation with `rx.vstack`/`rx.hstack`
3. Implement filter panels (date inputs, checkbox lists)
4. Integrate Plotly charts with native `rx.plotly()` component
5. Test with `reflex run`

### Phase 5: Quality & Polish (1-2 weeks)

1. Add type hints to public API
2. Write priority unit tests
3. Extract `generate_graph()` into smaller functions
4. Documentation and cleanup

---

## Configuration Decisions

Based on requirements, the following decisions have been made:

| Question | Decision |
|----------|----------|
| **Snowflake auth** | SSO browser login (`authenticator='externalbrowser'`) |
| **GP diagnosis data** | `data_hub.dwh.DimClinicalCoding` |
| **Patient linkage** | Use `PatientPseudo` (anonymized identifier) - NOT UPID |
| **Plotly interactivity** | Must be interactive - **Reflex has native `rx.plotly()` component** |
| **Distribution** | Python script (`reflex run`) - no .exe needed |

### Implications

**Snowflake SSO**: Connection code becomes:

```python
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account.region",
    user=os.environ.get("SNOWFLAKE_USER"),
    authenticator="externalbrowser",  # Opens browser for SSO
    warehouse="ANALYTICS_WH",
    database="data_hub",
    schema="dwh"
)
```

**Patient Linkage**: The GP diagnosis query needs to join on `PatientPseudo`, not UPID:

```sql
SELECT
    cc.PatientPseudo,
    cc.SNOMEDCode,  -- Confirm actual column names
    cc.ICD10Code,
    cc.DiagnosisDate
FROM data_hub.dwh.DimClinicalCoding cc
WHERE cc.PatientPseudo IN (:patient_list)
```

**Note**: You'll need to confirm the exact column names in `DimClinicalCoding` - explore via Snowflake MCP or SQL client.

**Plotly Interactivity**: Reflex handles this with native embedding:

```python
# Interactive Plotly chart directly in the Reflex app
rx.plotly(data=State.chart_data, layout=chart_layout)
```

Full interactivity (zoom, pan, hover tooltips) works in the browser-based app - no external HTML files needed.
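To make the chart data more concrete, here is a rough sketch of how an icicle trace could be assembled as a plain dictionary before being handed to Plotly. The trust → directory → drug hierarchy, the `build_icicle_trace` name, and the input tuple layout are assumptions for illustration, not the application's actual logic; `ids`, `labels`, `parents`, `values`, and `branchvalues` are standard attributes of Plotly's icicle trace.

```python
from collections import defaultdict


def build_icicle_trace(pathways):
    """Build a Plotly icicle trace dict.

    pathways: iterable of (trust, directory, drug, patient_count) tuples
    (hypothetical layout). Names are assumed not to contain "/".
    """
    totals = defaultdict(int)  # node id -> summed patient count
    parents = {}               # node id -> parent id ("" for the root)

    for trust, directory, drug, count in pathways:
        path = ("All", trust, directory, drug)
        # Register every ancestor of the leaf and add the count at each level
        for depth in range(1, len(path) + 1):
            node_id = "/".join(path[:depth])
            parents[node_id] = "/".join(path[:depth - 1])
            totals[node_id] += count

    ids = list(parents)
    return {
        "type": "icicle",
        "ids": ids,
        "labels": [node_id.rsplit("/", 1)[-1] for node_id in ids],
        "parents": [parents[node_id] for node_id in ids],
        "values": [totals[node_id] for node_id in ids],
        # "total" tells Plotly that each parent value already includes its children
        "branchvalues": "total",
    }
```

The resulting dictionary could be wrapped with `plotly.graph_objects.Figure(data=[trace])` and stored in the state variable that `rx.plotly()` renders.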
---

## References

- [Reflex Documentation](https://reflex.dev/docs/)
- [Reflex Plotly Component](https://reflex.dev/docs/library/graphing/plotly/)
- [Flet Documentation](https://flet.dev/docs/) (alternative)
- [Snowflake Python Connector](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector)
- [NHS SNOMED CT](https://digital.nhs.uk/services/terminology-and-classifications/snomed-ct)
- [NHS ICD-10 Classifications](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/28)