Rewrote README.md, USER_GUIDE.md, and DEPLOYMENT.md to reflect
the Dash application. Updated RALPH_PROMPT.md, guardrails.md, and
DESIGN_SYSTEM.md to remove Reflex references. All non-archive
documentation now reflects the current Dash + DMC architecture.
Extract load_data() and load_pathway_data() logic from Reflex AppState
into standalone functions in src/data_processing/pathway_queries.py.
Create thin dash_app/data/queries.py wrapper with DB_PATH resolution.
Dry run test revealed GP lookup queries timing out at 30s (connection_timeout
in snowflake.toml). Increased to 600s. Also increased batch_size from 500 to
5000 — query time is ~40s regardless of batch size (CTE compilation overhead),
so larger batches reduce total time from ~50min to ~6min for 36K patients.
Dry run results: 91.8% GP match rate, 49.3% drug-indication match rate,
42,072 modified UPIDs, 1,846 pathway nodes across 6 date filters.
Replace old per-patient indication matching in refresh_pathways.py with
drug-aware matching via assign_drug_indications(). Each drug is now
cross-referenced against both the patient's GP diagnoses AND the
DimSearchTerm.csv drug mapping. GP codes restricted to HCD data window
via earliest_hcd_date parameter.
- Replace QUALIFY ROW_NUMBER()=1 with GROUP BY + COUNT(*) to return all matching
Search_Terms per patient instead of just the most recent
- Add earliest_hcd_date parameter to restrict GP codes to HCD data window
- Return code_frequency column (count of matching SNOMED codes per Search_Term)
for use as tiebreaker in drug-aware indication matching
- Update empty DataFrame returns to match new column format
Merge 'allergic asthma' and 'severe persistent allergic asthma' into
canonical 'asthma' in both CLUSTER_MAPPING_SQL (Snowflake CTE) and
load_drug_indication_mapping() (DimSearchTerm.csv loader).
- CLUSTER_MAPPING_SQL: 3 Cluster_IDs (AST_COD, eFI2_Asthma, SEVAST_COD) now
all map to Search_Term = 'asthma'
- Added SEARCH_TERM_MERGE_MAP constant for reusable normalization
- load_drug_indication_mapping() applies merge at CSV load time
- urticaria (XSAL_COD) stays separate — not merged with asthma
- Combined asthma drug list: BENRALIZUMAB, DUPILUMAB, INHALED, MEPOLIZUMAB,
OMALIZUMAB, RESLIZUMAB
The UNIQUE constraint was UNIQUE(date_filter_id, ids) instead of
UNIQUE(date_filter_id, chart_type, ids), causing INSERT OR REPLACE
to overwrite directory chart root/trust nodes when indication nodes
were inserted. Dropped and recreated the table, re-ran full refresh.
Validation: both chart types have all hierarchy levels (0-5),
all 12 date filters produce valid icicle charts, KPIs correct.
prepare_data() mapped Provider Code → Name in-place. When called for directory
charts first, then indication charts, the second call re-mapped already-mapped
values to NaN, silently dropping all data. Added df.copy() to prevent mutation.
Also fixes directory charts only generating data for the first date filter.
Results: 3,633 pathway nodes now generated (1,101 directory + 2,532 indication)
across all 12 datasets (6 date filters × 2 chart types).
- Add selected_chart_type state variable and set_chart_type() handler
- Add chart_type filter to load_pathway_data() WHERE clause
- Create segmented control toggle component in filter strip
- Add dynamic hierarchy label (Directorate vs Indication)
- Update chart title to include chart type prefix
- Add CLUSTER_MAPPING_SQL constant embedding full snomed_indication_mapping_query.sql
- Add get_patient_indication_groups() function that queries Snowflake directly
- Uses QUALIFY ROW_NUMBER() to get most recent diagnosis per patient
- Returns DataFrame with PatientPseudonym, Search_Term, EventDateTime
- Handles edge cases: empty list, Snowflake unavailable
- Batch processing with configurable batch_size (default 500)
- Comprehensive logging for match statistics