feat: add get_patient_indication_groups() for Snowflake-direct GP lookup (Task 1.1)
- Add CLUSTER_MAPPING_SQL constant embedding full snomed_indication_mapping_query.sql
- Add get_patient_indication_groups() function that queries Snowflake directly
- Uses QUALIFY ROW_NUMBER() to get most recent diagnosis per patient
- Returns DataFrame with PatientPseudonym, Search_Term, EventDateTime
- Handles edge cases: empty list, Snowflake unavailable
- Batch processing with configurable batch_size (default 500)
- Comprehensive logging for match statistics
+92
-111
@@ -1,25 +1,25 @@
|
|||||||
# Implementation Plan - Direct SNOMED Indication Mapping
|
# Implementation Plan - Indication-Based Pathway Charts
|
||||||
|
|
||||||
## Project Overview
|
## Project Overview
|
||||||
|
|
||||||
Extend the pathway analysis application to use direct SNOMED code matching from GP records to:
|
Extend the pathway analysis application to show indication-based icicle charts alongside directory-based charts. Patient diagnoses are matched from GP records using SNOMED cluster codes.
|
||||||
1. **Improve directorate assignment** - Use diagnosis-based directorate as primary method
|
|
||||||
2. **Add indication-based icicle chart** - New chart type showing Trust → Search_Term → Drug → Pathway
|
|
||||||
|
|
||||||
### Data Source
|
|
||||||
`data/drug_snomed_mapping_enriched.csv` - 163K rows mapping:
|
|
||||||
- Drug → Indication → TA_ID → Search_Term → SNOMEDCode → PrimaryDirectorate
|
|
||||||
|
|
||||||
### Key Design Decisions
|
### Key Design Decisions
|
||||||
| Aspect | Decision |
|
| Aspect | Decision |
|
||||||
|--------|----------|
|
|--------|----------|
|
||||||
| Primary directorate method | Diagnosis-based (SNOMED match → PrimaryDirectorate) |
|
| SNOMED source | Query `ClinicalCodingClusterSnomedCodes` clusters directly in Snowflake |
|
||||||
| Fallback | department_identification() chain |
|
| Grouping level | `Search_Term` from cluster mapping (~148 conditions) |
|
||||||
| Grouping level | `Search_Term` column (187 unique values) |
|
| Chart types | Two: "By Directory" (existing) and "By Indication" (new toggle) |
|
||||||
| Chart types | Two: "By Directory" and "By Indication" (user toggle) |
|
|
||||||
| No-match display | Show assigned directorate in indication chart (mixed labels) |
|
| No-match display | Show assigned directorate in indication chart (mixed labels) |
|
||||||
| Multiple matches | Use most recent SNOMED code by GP record date |
|
| Multiple matches | Use most recent SNOMED code by GP record date |
|
||||||
| Data storage | SQLite table `ref_drug_snomed_mapping`, accessed at ingestion |
|
| Data storage | No local SNOMED mapping — query Snowflake at refresh time |
|
||||||
|
|
||||||
|
### SNOMED Cluster Query
|
||||||
|
The `snomed_indication_mapping_query.sql` file contains the master query:
|
||||||
|
- Maps Search_Term → Cluster_ID for ~148 conditions
|
||||||
|
- Joins `ClinicalCodingClusterSnomedCodes` to get SNOMED codes per cluster
|
||||||
|
- Includes explicit manual mappings for conditions not in clusters
|
||||||
|
- Returns: Search_Term, SNOMEDCode, SNOMEDDescription
|
||||||
|
|
||||||
## Quality Checks
|
## Quality Checks
|
||||||
|
|
||||||
@@ -39,101 +39,62 @@ python -m reflex compile
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Phase 1: Data Infrastructure
|
## Phase 1: Snowflake Integration
|
||||||
|
|
||||||
### 1.1 Create SQLite Table for SNOMED Mapping
|
### 1.1 Create Indication Lookup Query
|
||||||
- [x] Add `REF_DRUG_SNOMED_MAPPING_SCHEMA` to `data_processing/schema.py`:
|
- [x] Add `get_patient_indication_groups()` function to `data_processing/diagnosis_lookup.py`:
|
||||||
- Columns: drug_name, indication, ta_id, search_term, snomed_code, snomed_description, cleaned_drug_name, primary_directorate, all_directorates
|
- Takes: list of patient pseudonyms (PseudoNHSNoLinked values)
|
||||||
- Index on: cleaned_drug_name, snomed_code, search_term
|
- Uses the cluster query from `snomed_indication_mapping_query.sql` as a CTE
|
||||||
- [x] Add `create_drug_snomed_mapping_table()` helper function
|
- Joins with `PrimaryCareClinicalCoding` to find patients with matching diagnoses
|
||||||
- [x] Add to `ALL_TABLES_SCHEMA` and migration
|
- Returns: DataFrame with PatientPseudonym, Search_Term, EventDateTime
|
||||||
- [x] Verify: `python -m data_processing.migrate` creates table
|
- Uses most recent match per patient (ORDER BY EventDateTime DESC)
|
||||||
|
- [x] Handle edge cases: Snowflake unavailable, empty patient list
|
||||||
|
- [ ] Verify: Function returns expected Search_Terms for test patients
|
||||||
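For the "Verify" step in 1.1, the "most recent match per patient" rule can be checked offline without Snowflake. The following is a minimal sketch, assuming a pandas DataFrame with the same three columns the function returns; `most_recent_match` and the sample rows are hypothetical and only mirror what `QUALIFY ROW_NUMBER() ... = 1` is expected to do.

```python
import pandas as pd


def most_recent_match(matches: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per patient: the match with the latest EventDateTime."""
    return (
        matches.sort_values("EventDateTime", ascending=False)
        .drop_duplicates(subset="PatientPseudonym", keep="first")
        .sort_values("PatientPseudonym")
        .reset_index(drop=True)
    )


# Hypothetical GP matches: P1 has two diagnoses, P2 has one.
sample = pd.DataFrame(
    {
        "PatientPseudonym": ["P1", "P1", "P2"],
        "Search_Term": ["asthma", "rheumatoid arthritis", "psoriasis"],
        "EventDateTime": pd.to_datetime(["2020-01-05", "2023-06-01", "2021-03-10"]),
    }
)
latest = most_recent_match(sample)
print(latest)
```

Comparing this output with the Snowflake result for a handful of test patients is one way to confirm that the window ordering behaves as intended.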
|
|
||||||
### 1.2 Load Enriched Mapping Data
|
### 1.2 Update Data Pipeline to Include Indications
|
||||||
- [x] Create `data_processing/load_snomed_mapping.py` script:
|
- [ ] Modify `cli/refresh_pathways.py` to call indication lookup during refresh:
|
||||||
- Read `data/drug_snomed_mapping_enriched.csv`
|
- After fetching HCD data, extract unique PseudoNHSNoLinked values
|
||||||
- Insert into `ref_drug_snomed_mapping` table
|
- Call `get_patient_indication_groups()` with patient list
|
||||||
- Log: row count, unique drugs, unique search terms
|
- Create `indication_df` mapping UPID → Indication_Group
|
||||||
- [x] Add CLI entry point: `python -m data_processing.load_snomed_mapping`
|
- For patients with no GP match: Indication_Group = fallback directorate
|
||||||
- [x] Verify: Query confirms 163K+ rows, 187 search terms
|
- [ ] Log coverage: X% diagnosis-matched, Y% fallback
|
||||||
|
- [ ] Verify: indication_df has correct structure for pathway processing
|
||||||
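The fallback and coverage steps in 1.2 could look roughly like the sketch below. It assumes the HCD data exposes `UPID`, `PseudoNHSNoLinked` and a fallback directorate column (called `Directory` here, which is an assumption), and that the GP matches hold one row per patient. `build_indication_df` is a hypothetical helper, not the pipeline's actual code.

```python
import pandas as pd


def build_indication_df(hcd_patients: pd.DataFrame, gp_matches: pd.DataFrame) -> pd.DataFrame:
    """Map each UPID to an Indication_Group, falling back to the directorate."""
    merged = hcd_patients.merge(
        gp_matches[["PatientPseudonym", "Search_Term"]],
        how="left",
        left_on="PseudoNHSNoLinked",
        right_on="PatientPseudonym",
    )
    matched = merged["Search_Term"].notna()
    # Matched patients keep their Search_Term; the rest take the fallback directorate.
    merged["Indication_Group"] = merged["Search_Term"].where(matched, merged["Directory"])
    merged["Source"] = matched.map({True: "DIAGNOSIS", False: "FALLBACK"})
    return merged[["UPID", "Indication_Group", "Source"]]


# Hypothetical sample data: three of four patients have a GP match.
hcd_patients = pd.DataFrame(
    {
        "UPID": ["U1", "U2", "U3", "U4"],
        "PseudoNHSNoLinked": ["P1", "P2", "P3", "P4"],
        "Directory": ["RHEUMATOLOGY", "DERMATOLOGY", "OPHTHALMOLOGY", "GASTROENTEROLOGY"],
    }
)
gp_matches = pd.DataFrame(
    {
        "PatientPseudonym": ["P1", "P2", "P3"],
        "Search_Term": ["rheumatoid arthritis", "psoriasis", "macular degeneration"],
    }
)

indication_df = build_indication_df(hcd_patients, gp_matches)
coverage = indication_df["Source"].value_counts(normalize=True) * 100
print(f"{coverage.get('DIAGNOSIS', 0.0):.1f}% diagnosis-matched, "
      f"{coverage.get('FALLBACK', 0.0):.1f}% fallback")
```

A left merge keeps every HCD patient, so unmatched patients stay in the frame and appear in the indication chart under their directorate label.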
### 1.3 Extend Diagnosis Lookup Module
|
|
||||||
- [x] Add `get_drug_snomed_codes(drug_name)` to `diagnosis_lookup.py`:
|
|
||||||
- Query `ref_drug_snomed_mapping` for all SNOMED codes for a drug
|
|
||||||
- Return list of DrugSnomedMapping(snomed_code, snomed_description, search_term, primary_directorate, indication, ta_id)
|
|
||||||
- [x] Add `patient_has_indication_direct(patient_pseudonym, snomed_codes, connector)`:
|
|
||||||
- Query `PrimaryCareClinicalCoding` directly for exact SNOMED code matches
|
|
||||||
- Return most recent match by EventDateTime
|
|
||||||
- Return: DirectSnomedMatchResult(matched_code, search_term, primary_directorate, event_date) or unmatched
|
|
||||||
- [x] Verify: Tested with ADALIMUMAB (1320 mappings, 10 Search_Terms), RANIBIZUMAB (104 mappings), case-insensitivity
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Phase 2: Pathway Processing Updates
|
## Phase 2: Schema & Processing Updates
|
||||||
|
|
||||||
### 2.1 Update Directorate Assignment Logic
|
### 2.1 Add Chart Type Support to Schema
|
||||||
- [x] Modify `tools/data.py` `department_identification()` or create wrapper:
|
- [x] Add `chart_type` column to `pathway_nodes` table (ALREADY DONE)
|
||||||
- Add `get_directorate_from_diagnosis(upid, drug_name, connector)` function
|
- [x] Update UNIQUE constraint to include chart_type (ALREADY DONE)
|
||||||
- Logic: Try diagnosis-based first → fallback to department_identification()
|
- [x] Add indexes for chart_type filtering (ALREADY DONE)
|
||||||
- Return: (directorate, source) where source is "DIAGNOSIS" or "FALLBACK"
|
- [ ] Verify: Existing migration works correctly
|
||||||
- [x] Track assignment source for metrics (how many diagnosis-based vs fallback)
|
|
||||||
- [x] Verify: Test with sample patient data
|
|
||||||
|
|
||||||
### 2.2 Add Chart Type Support to Schema
|
### 2.2 Create Indication Pathway Processing
|
||||||
- [x] Add `chart_type` column to `pathway_nodes` table:
|
- [x] Add `generate_icicle_chart_indication()` to `pathway_analyzer.py` (ALREADY DONE)
|
||||||
- Values: "directory" (existing), "indication" (new)
|
- [x] Add `process_indication_pathway_for_date_filter()` to `pathway_pipeline.py` (ALREADY DONE)
|
||||||
- Update schema in `data_processing/schema.py`
|
- [x] Add `extract_indication_fields()` for denormalized columns (ALREADY DONE)
|
||||||
- [x] Update UNIQUE constraint to include chart_type: `UNIQUE(date_filter_id, chart_type, ids)`
|
- [x] Update `convert_to_records()` with `chart_type` parameter (ALREADY DONE)
|
||||||
- [x] Add `idx_pathway_nodes_chart_type` index for filtering by chart type
|
- [ ] Verify: Code compiles, imports work correctly
|
||||||
- [x] Add `migrate_pathway_nodes_chart_type()` function for existing databases
|
|
||||||
- [x] Update `initialize_database()` to run migration automatically
|
|
||||||
- [x] Verify: Migration adds column, existing data defaults to "directory"
|
|
||||||
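As an illustration of the `chart_type` migration described above, here is a minimal sketch against an in-memory SQLite database. The helper name `add_chart_type_column`, the reduced table layout and the sample row are assumptions for the example; the real `migrate_pathway_nodes_chart_type()` may differ, in particular because changing a UNIQUE constraint in SQLite requires rebuilding the table.

```python
import sqlite3


def add_chart_type_column(conn: sqlite3.Connection) -> bool:
    """Add chart_type to pathway_nodes if missing; return True when a change was made."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(pathway_nodes)")}
    if "chart_type" in columns:
        return False
    conn.execute(
        "ALTER TABLE pathway_nodes "
        "ADD COLUMN chart_type TEXT NOT NULL DEFAULT 'directory'"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_pathway_nodes_chart_type "
        "ON pathway_nodes (chart_type)"
    )
    conn.commit()
    return True


# Simulate a pre-migration database with one existing row (placeholder values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathway_nodes (date_filter_id TEXT, ids TEXT, value INTEGER)")
conn.execute("INSERT INTO pathway_nodes VALUES ('filter_a', 'N&W ICS', 10)")

first_run = add_chart_type_column(conn)   # performs the migration
second_run = add_chart_type_column(conn)  # no-op: column already present
existing = conn.execute("SELECT chart_type FROM pathway_nodes").fetchall()
print(first_run, second_run, existing)
```

Checking for the column first keeps the migration idempotent, which matters if `initialize_database()` runs it on every start.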
|
|
||||||
### 2.3 Create Indication Pathway Processing
|
### 2.3 Update Refresh Command for Dual Charts
|
||||||
- [x] Add `process_indication_pathway_for_date_filter()` to `pathway_pipeline.py`:
|
- [x] Add `--chart-type` argument: "all", "directory", "indication" (ALREADY DONE)
|
||||||
- Group by: Trust → Search_Term → Drug → Pathway
|
- [ ] Update indication processing to use new `get_patient_indication_groups()`:
|
||||||
- For unmatched patients: use directorate name as Search_Term fallback
|
- Replace `batch_lookup_indication_groups()` with the new Snowflake-direct approach
|
||||||
- Output: Same structure as directory pathways but with indication grouping
|
- Pass indication_df to `process_indication_pathway_for_date_filter()`
|
||||||
- [x] Add `generate_icicle_chart_indication()` to `pathway_analyzer.py`:
|
- [ ] Process all 6 date filters for both chart types
|
||||||
- Variant of `generate_icicle_chart()` that uses indication_df instead of directory_df
|
- [ ] Verify: Both chart types generate pathway data
|
||||||
- Takes `indication_df` parameter mapping UPID → Indication_Group
|
|
||||||
- [x] Add `extract_indication_fields()` for denormalized columns:
|
|
||||||
- Extract: trust_name, search_term (or fallback_directorate), drug_sequence
|
|
||||||
- [x] Update `convert_to_records()` to include `chart_type` parameter
|
|
||||||
- [x] Add `ChartType` type alias ("directory" | "indication")
|
|
||||||
- [x] Verify: Code compiles, imports work correctly
|
|
||||||
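The Trust → Search_Term → Drug grouping can be sketched as a set of icicle nodes with `ids`, `parents` and `value` columns. This is only an illustration: the `" - "` separator, the `build_icicle_nodes` name, the column names and the sample rows are assumptions, and the final Pathway level is omitted for brevity.

```python
import pandas as pd

LEVELS = ["Trust", "Indication_Group", "Drug"]
ROOT = "N&W ICS"


def build_icicle_nodes(patients: pd.DataFrame, sep: str = " - ") -> pd.DataFrame:
    """Count patients at each hierarchy level and label every node with its parent."""
    frames = []
    for depth in range(1, len(LEVELS) + 1):
        keys = LEVELS[:depth]
        counts = patients.groupby(keys).size().reset_index(name="value")
        parts = counts[keys].astype(str)
        counts["ids"] = ROOT + sep + parts.agg(sep.join, axis=1)
        if depth == 1:
            counts["parents"] = ROOT
        else:
            counts["parents"] = ROOT + sep + parts[keys[:-1]].agg(sep.join, axis=1)
        frames.append(counts[["ids", "parents", "value"]])
    root = pd.DataFrame([{"ids": ROOT, "parents": "", "value": len(patients)}])
    return pd.concat([root, *frames], ignore_index=True)


# Hypothetical patients; an unmatched patient would carry a directorate name
# in Indication_Group instead of a Search_Term.
patients = pd.DataFrame(
    {
        "Trust": ["Trust A", "Trust A", "Trust A"],
        "Indication_Group": ["rheumatoid arthritis", "rheumatoid arthritis", "psoriasis"],
        "Drug": ["ADALIMUMAB", "ADALIMUMAB", "ADALIMUMAB"],
    }
)
nodes = build_icicle_nodes(patients)
print(nodes)
```

Because unmatched patients reuse the same `Indication_Group` column, the mixed labels from the design table need no special handling at this stage.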
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Phase 3: CLI & Data Refresh Updates
|
## Phase 3: Test Full Pipeline
|
||||||
|
|
||||||
### 3.1 Update Refresh Command for Dual Chart Types
|
### 3.1 Test Refresh with Real Data
|
||||||
- [x] Modify `cli/refresh_pathways.py`:
|
- [ ] Run `python -m cli.refresh_pathways --chart-type all` with Snowflake
|
||||||
- Process both "directory" and "indication" chart types
|
- [ ] Verify pathway_nodes table has both chart_type values:
|
||||||
- For each of 6 date filters: generate 2 chart datasets
|
- `SELECT chart_type, COUNT(*) FROM pathway_nodes GROUP BY chart_type`
|
||||||
- Total: 12 pathway datasets (6 dates × 2 chart types)
|
- [ ] Verify indication hierarchy: Trust → Search_Term → Drug → Pathway
|
||||||
- [x] Add `--chart-type` argument: "all" (default), "directory", "indication"
|
- [ ] Verify unmatched patients show with directorate fallback label
|
||||||
- [x] Update progress logging to show both chart types
|
|
||||||
- [x] Verify: Dry run shows both chart types being processed (Task 3.2 complete)
|
|
||||||
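The `chart_type` check in 3.1 can also be scripted. The sketch below runs the same `GROUP BY` query through Python's `sqlite3` module against an in-memory stand-in for `pathway_nodes`; the helper names and the sample rows are hypothetical.

```python
import sqlite3

EXPECTED_CHART_TYPES = {"directory", "indication"}


def chart_type_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Return the number of pathway nodes stored per chart_type."""
    rows = conn.execute(
        "SELECT chart_type, COUNT(*) FROM pathway_nodes GROUP BY chart_type"
    ).fetchall()
    return dict(rows)


def missing_chart_types(conn: sqlite3.Connection) -> set[str]:
    """Return expected chart types that have no rows at all."""
    return EXPECTED_CHART_TYPES - set(chart_type_counts(conn))


# In-memory stand-in for the real database, with placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathway_nodes (date_filter_id TEXT, chart_type TEXT, ids TEXT)")
conn.executemany(
    "INSERT INTO pathway_nodes VALUES (?, ?, ?)",
    [
        ("filter_a", "directory", "root"),
        ("filter_a", "directory", "root - trust"),
        ("filter_a", "indication", "root"),
    ],
)
counts = chart_type_counts(conn)
missing = missing_chart_types(conn)
print(counts, missing)
```

An empty `missing` set after a full refresh indicates that both chart types were written; grouping by `date_filter_id` as well would extend the check to all 12 datasets.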
|
|
||||||
### 3.2 Integrate Diagnosis-Based Directorate in Pipeline
|
|
||||||
- [x] Add `batch_lookup_indication_groups()` to `diagnosis_lookup.py`:
|
|
||||||
- Batch lookup SNOMED matches for all patients (500 patients per batch)
|
|
||||||
- Returns DataFrame with UPID, Indication_Group, Source columns
|
|
||||||
- Source is "DIAGNOSIS" (GP match found) or "FALLBACK" (no match)
|
|
||||||
- [x] Update `cli/refresh_pathways.py` indication processing:
|
|
||||||
- Call `batch_lookup_indication_groups()` before processing indication charts
|
|
||||||
- Build `indication_df` for use with `process_indication_pathway_for_date_filter()`
|
|
||||||
- Process all 6 date filters with indication grouping
|
|
||||||
- [x] Handle Snowflake connection for GP record queries (batched for performance)
|
|
||||||
- [x] Log coverage: X% diagnosis-matched, Y% fallback
|
|
||||||
- [ ] Verify: Test refresh with --dry-run, check coverage stats
|
|
||||||
|
|
||||||
### 3.3 Test Full Refresh Pipeline
|
|
||||||
- [~] Run `python -m cli.refresh_pathways` with real data
|
|
||||||
- [ ] Verify pathway_nodes table has both chart_type values
|
|
||||||
- [ ] Verify indication chart has expected hierarchy (Trust → SearchTerm → Drug)
|
|
||||||
- [ ] Verify unmatched patients appear with directorate fallback label
|
|
||||||
- [ ] Document: Processing time, record counts, coverage percentages
|
- [ ] Document: Processing time, record counts, coverage percentages
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -166,20 +127,14 @@ python -m reflex compile
|
|||||||
|
|
||||||
## Phase 5: Validation & Documentation
|
## Phase 5: Validation & Documentation
|
||||||
|
|
||||||
### 5.1 Measure Coverage Improvement
|
### 5.1 End-to-End Validation
|
||||||
- [ ] Compare match rates: cluster-only vs cluster+direct SNOMED
|
|
||||||
- [ ] Generate report: % of patients with diagnosis-based directorate
|
|
||||||
- [ ] Identify drugs with best/worst coverage improvement
|
|
||||||
- [ ] Document results in progress.txt
|
|
||||||
|
|
||||||
### 5.2 End-to-End Validation
|
|
||||||
- [ ] Run full app with both chart types
|
- [ ] Run full app with both chart types
|
||||||
- [ ] Verify chart toggle works correctly
|
- [ ] Verify chart toggle works correctly
|
||||||
- [ ] Verify filter interactions (drugs, directorates) work for both types
|
- [ ] Verify filter interactions (drugs, directorates) work for both types
|
||||||
- [ ] Verify KPIs update correctly for both chart types
|
- [ ] Verify KPIs update correctly for both chart types
|
||||||
- [ ] Test at multiple viewport sizes
|
- [ ] Test at multiple viewport sizes
|
||||||
|
|
||||||
### 5.3 Update Documentation
|
### 5.2 Update Documentation
|
||||||
- [ ] Update CLAUDE.md with new architecture
|
- [ ] Update CLAUDE.md with new architecture
|
||||||
- [ ] Document new CLI arguments
|
- [ ] Document new CLI arguments
|
||||||
- [ ] Document chart_type toggle behavior
|
- [ ] Document chart_type toggle behavior
|
||||||
@@ -193,7 +148,7 @@ All tasks marked `[x]` AND:
|
|||||||
- [ ] App compiles without errors (`reflex compile` succeeds)
|
- [ ] App compiles without errors (`reflex compile` succeeds)
|
||||||
- [ ] Both chart types generate pathway data (12 total: 6 dates × 2 types)
|
- [ ] Both chart types generate pathway data (12 total: 6 dates × 2 types)
|
||||||
- [ ] Chart type toggle switches between Directory and Indication views
|
- [ ] Chart type toggle switches between Directory and Indication views
|
||||||
- [ ] Diagnosis-based directorate is primary method with fallback working
|
- [ ] GP diagnosis matching works via Snowflake cluster query
|
||||||
- [ ] Unmatched patients show in indication chart with directorate fallback label
|
- [ ] Unmatched patients show in indication chart with directorate fallback label
|
||||||
- [ ] Coverage metrics logged (% diagnosis-matched vs fallback)
|
- [ ] Coverage metrics logged (% diagnosis-matched vs fallback)
|
||||||
- [ ] All filters work correctly for both chart types
|
- [ ] All filters work correctly for both chart types
|
||||||
@@ -203,6 +158,35 @@ All tasks marked `[x]` AND:
|
|||||||
|
|
||||||
## Reference
|
## Reference
|
||||||
|
|
||||||
### SNOMED Cluster Query Structure

```sql
-- From snomed_indication_mapping_query.sql
WITH SearchTermClusters AS (
    SELECT Search_Term, Cluster_ID FROM (VALUES
        ('rheumatoid arthritis', 'eFI2_InflammatoryArthritis'),
        ('macular degeneration', 'CUST_ICB_VISUAL_IMPAIRMENT'),
        -- ... ~148 mappings
    ) AS t(Search_Term, Cluster_ID)
),
ClusterCodes AS (
    SELECT stc.Search_Term, c."SNOMEDCode", c."SNOMEDDescription"
    FROM SearchTermClusters stc
    JOIN DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes" c
        ON stc.Cluster_ID = c."Cluster_ID"
    WHERE c."SNOMEDCode" IS NOT NULL
),
ExplicitCodes AS (
    -- Manual mappings for conditions not in clusters
    SELECT Search_Term, SNOMEDCode, SNOMEDDescription FROM (VALUES
        ('ankylosing spondylitis', '162930007', 'Manual mapping'),
        -- ...
    ) AS t(Search_Term, SNOMEDCode, SNOMEDDescription)
)
SELECT * FROM ClusterCodes
UNION ALL
SELECT * FROM ExplicitCodes
```
|
|
||||||
### Current Pathway Hierarchy (Directory-based)
|
### Current Pathway Hierarchy (Directory-based)
|
||||||
```
|
```
|
||||||
Root (N&W ICS)
|
Root (N&W ICS)
|
||||||
@@ -226,20 +210,17 @@ Root (N&W ICS)
|
|||||||
|
|
||||||
| File | Purpose |
|
| File | Purpose |
|
||||||
|------|---------|
|
|------|---------|
|
||||||
| `data_processing/schema.py` | SQLite schema for ref_drug_snomed_mapping |
|
| `snomed_indication_mapping_query.sql` | Master SNOMED cluster query |
|
||||||
| `data_processing/diagnosis_lookup.py` | Direct SNOMED lookup functions |
|
| `data_processing/diagnosis_lookup.py` | GP diagnosis lookup functions |
|
||||||
| `data_processing/pathway_pipeline.py` | Indication pathway processing |
|
| `data_processing/pathway_pipeline.py` | Indication pathway processing |
|
||||||
| `cli/refresh_pathways.py` | CLI for dual chart type refresh |
|
| `cli/refresh_pathways.py` | CLI for dual chart type refresh |
|
||||||
| `pathways_app/pathways_app.py` | Reflex UI with chart type toggle |
|
| `pathways_app/pathways_app.py` | Reflex UI with chart type toggle |
|
||||||
| `data/drug_snomed_mapping_enriched.csv` | Source mapping data |
|
|
||||||
|
|
||||||
### Expected Data Volumes
|
### Expected Data Volumes
|
||||||
|
|
||||||
| Metric | Expected |
|
| Metric | Expected |
|
||||||
|--------|----------|
|
|--------|----------|
|
||||||
| SNOMED mapping rows | ~163K |
|
| Search_Term conditions | ~148 (from cluster mapping) |
|
||||||
| Unique Search_Terms | 187 |
|
|
||||||
| Unique drugs | ~364 |
|
|
||||||
| Pathway nodes (directory, per date filter) | ~300 |
|
| Pathway nodes (directory, per date filter) | ~300 |
|
||||||
| Pathway nodes (indication, per date filter) | ~400-600 (more granular) |
|
| Pathway nodes (indication, per date filter) | ~400-600 (more granular) |
|
||||||
| Total pathway nodes (6 dates × 2 types) | ~4,000-5,000 |
|
| Total pathway nodes (6 dates × 2 types) | ~4,000-5,000 |
|
||||||
|
|||||||
@@ -1087,6 +1087,321 @@ def batch_lookup_indication_groups(
|
|||||||
return result_df
|
return result_df
|
||||||
|
|
||||||
|
|
||||||
|
# === NEW APPROACH: Query Snowflake directly using cluster CTE ===

# The cluster query mapping (embedded from snomed_indication_mapping_query.sql)
# This maps Search_Term -> Cluster_ID for ~148 clinical conditions
CLUSTER_MAPPING_SQL = """
WITH SearchTermClusters AS (
    SELECT Search_Term, Cluster_ID FROM (VALUES
        ('acute lymphoblastic leukaemia', 'HAEMCANMORPH_COD'),
        ('acute myeloid leukaemia', 'C19HAEMCAN_COD'),
        ('acute promyelocytic leukaemia', 'HAEMCANMORPH_COD'),
        ('allergic asthma', 'AST_COD'),
        ('allergic rhinitis', 'MILDINTAST_COD'),
        ('alzheimer''s disease', 'DEMALZ_COD'),
        ('amyloidosis', 'AMYLOID_COD'),
        ('anaemia', 'eFI2_AnaemiaTimeSensitive'),
        ('anaplastic large cell lymphoma', 'C19HAEMCAN_COD'),
        ('apixaban', 'DOACCON_COD'),
        ('aplastic anaemia', 'eFI2_AnaemiaEver'),
        ('arthritis', 'eFI2_InflammatoryArthritis'),
        ('asthma', 'eFI2_Asthma'),
        ('atopic dermatitis', 'ATOPDERM_COD'),
        ('atrial fibrillation', 'eFI2_AtrialFibrillation'),
        ('attention deficit hyperactivity disorder', 'ADHD_COD'),
        ('bipolar disorder', 'MH_COD'),
        ('bladder', 'eFI2_UrinaryIncontinence'),
        ('breast cancer', 'BRCANSCR_COD'),
        ('cardiomyopathy', 'eFI2_HarmfulDrinking'),
        ('cardiovascular disease', 'CVDRISKASS_COD'),
        ('cervical cancer', 'CSDEC_COD'),
        ('cholangiocarcinoma', 'eFI2_Cancer'),
        ('chronic kidney disease', 'CKD_COD'),
        ('chronic liver disease', 'eFI2_LiverProblems'),
        ('chronic lymphocytic leukaemia', 'EPPHAEMCAN_COD'),
        ('chronic myeloid leukaemia', 'EPPHAEMCAN_COD'),
        ('chronic obstructive pulmonary disease', 'eFI2_COPD'),
        ('colon cancer', 'eFI2_Cancer'),
        ('colorectal cancer', 'GICANREF_COD'),
        ('constipation', 'CHRONCONSTIP_COD'),
        ('covid-19', 'POSSPOSTCOVID_COD'),
        ('crohn''s disease', 'eFI2_InflammatoryBowelDisease'),
        ('cutaneous t-cell lymphoma', 'C19HAEMCAN_COD'),
        ('cystic fibrosis', 'CUST_ICB_CYSTIC_FIBROSIS'),
        ('deep vein thrombosis', 'VTE_COD'),
        ('depression', 'eFI2_Depression'),
        ('diabetes', 'eFI2_DiabetesEver'),
        ('diabetic retinopathy', 'DRSELIGIBILITY_COD'),
        ('diffuse large b-cell lymphoma', 'C19HAEMCAN_COD'),
        ('dravet syndrome', 'EPIL_COD'),
        ('drug misuse', 'ILLSUBINT_COD'),
        ('dyspepsia', 'eFI2_AbdominalPain'),
        ('epilepsy', 'eFI2_Seizures'),
        ('fallopian tube', 'STERIL_COD'),
        ('follicular lymphoma', 'C19HAEMCAN_COD'),
        ('gastric cancer', 'eFI2_Cancer'),
        ('giant cell arteritis', 'GCA_COD'),
        ('glioma', 'NHAEMCANMORPH_COD'),
        ('gout', 'eFI2_InflammatoryArthritis'),
        ('graft versus host disease', 'GVHD_COD'),
        ('granulomatosis with polyangiitis', 'WEGENERVASC_COD'),
        ('growth hormone deficiency', 'HYPOPITUITARY_COD'),
        ('hand eczema', 'ECZEMA_COD'),
        ('heart failure', 'eFI2_HeartFailure'),
        ('hepatitis b', 'HEPBCVAC_COD'),
        ('hepatocellular carcinoma', 'eFI2_Cancer'),
        ('hiv', 'PREFLANG_COD'),
        ('hodgkin lymphoma', 'HAEMCANMORPH_COD'),
        ('hormone receptor', 'eFI2_ThyroidProblems'),
        ('hypercholesterolaemia', 'CLASSFH_COD'),
        ('immune thrombocytopenia', 'ITP_COD'),
        ('influenza', 'FLUINVITE_COD'),
        ('insomnia', 'eFI2_SleepProblems'),
        ('irritable bowel syndrome', 'IBS_COD'),
        ('ischaemic stroke', 'OSTR_COD'),
        ('juvenile idiopathic arthritis', 'RARTHAD_COD'),
        ('kidney transplant', 'RENALTRANSP_COD'),
        ('leukaemia', 'eFI2_Cancer'),
        ('lung cancer', 'FTCANREF_COD'),
        ('lymphoma', 'C19HAEMCAN_COD'),
        ('macular degeneration', 'CUST_ICB_VISUAL_IMPAIRMENT'),
        ('macular oedema', 'CUST_ICB_VISUAL_IMPAIRMENT'),
        ('major depressive episodes', 'eFI2_Depression'),
        ('malignant melanoma', 'eFI2_Cancer'),
        ('malignant pleural mesothelioma', 'LUNGCAN_COD'),
        ('manic episode', 'MH_COD'),
        ('mantle cell lymphoma', 'HAEMCANMORPH_COD'),
        ('melanoma', 'eFI2_Cancer'),
        ('merkel cell carcinoma', 'C19CAN_COD'),
        ('migraine', 'eFI2_Headache'),
        ('motor neurone disease', 'MND_COD'),
        ('multiple myeloma', 'C19HAEMCAN_COD'),
        ('multiple sclerosis', 'MS_COD'),
        ('myelodysplastic', 'eFI2_AnaemiaEver'),
        ('myelofibrosis', 'MDS_COD'),
        ('myocardial infarction', 'eFI2_IschaemicHeartDisease'),
        ('myotonia', 'CNDATRISK2_COD'),
        ('narcolepsy', 'LD_COD'),
        ('neuroendocrine tumour', 'LUNGCAN_COD'),
        ('non-small cell lung cancer', 'LUNGCAN_COD'),
        ('non-small-cell lung cancer', 'FTCANREF_COD'),
        ('obesity', 'BMI30_COD'),
        ('osteoarthritis', 'CUST_ICB_OSTEOARTHRITIS'),
        ('osteoporosis', 'eFI2_Osteoporosis'),
        ('osteosarcoma', 'NHAEMCANMORPH_COD'),
        ('ovarian cancer', 'C19CAN_COD'),
        ('peripheral arterial disease', 'PADEXC_COD'),
        ('plaque psoriasis', 'PSORIASIS_COD'),
        ('polycystic kidney disease', 'EPPCONGMALF_COD'),
        ('polycythaemia vera', 'C19HAEMCAN_COD'),
        ('pregnancy', 'C19PREG_COD'),
        ('primary biliary cholangitis', 'eFI2_LiverProblems'),
        ('primary hypercholesterolaemia', 'FNFHYP_COD'),
        ('prostate cancer', 'EPPSOLIDCAN_COD'),
        ('psoriasis', 'PSORIASIS_COD'),
        ('psoriatic arthritis', 'RARTHAD_COD'),
        ('pulmonary embolism', 'eFI2_RespiratoryDiseaseTimeSensitive'),
        ('pulmonary fibrosis', 'ILD_COD'),
        ('relapsing multiple sclerosis', 'MS_COD'),
        ('renal cell carcinoma', 'C19CAN_COD'),
        ('renal transplantation', 'RENALTRANSP_COD'),
        ('retinal vein occlusion', 'CUST_ICB_VISUAL_IMPAIRMENT'),
        ('rheumatoid arthritis', 'eFI2_InflammatoryArthritis'),
        ('rivaroxaban', 'DOACCON_COD'),
        ('schizophrenia', 'MH_COD'),
        ('seizures', 'LSZFREQ_COD'),
        ('sepsis', 'C19ACTIVITY_COD'),
        ('severe persistent allergic asthma', 'SEVAST_COD'),
        ('sickle cell disease', 'SICKLE_COD'),
        ('sleep apnoea', 'CUST_ICB_NON_SEVERE_LDA'),
        ('smoking cessation', 'SMOKINGINT_COD'),
        ('soft tissue sarcoma', 'NHAEMCANMORPH_COD'),
        ('spinal muscular atrophy', 'MND_COD'),
        ('squamous cell', 'C19CAN_COD'),
        ('squamous cell carcinoma', 'C19CAN_COD'),
        ('stem cell transplant', 'ALLOTRANSP_COD'),
        ('stroke', 'eFI2_Stroke'),
        ('systemic lupus erythematosus', 'SLUPUS_COD'),
        ('systemic mastocytosis', 'HAEMCANMORPH_COD'),
        ('thrombocytopenic purpura', 'TTP_COD'),
        ('thrombotic thrombocytopenic purpura', 'TTP_COD'),
        ('thyroid cancer', 'C19CAN_COD'),
        ('tophaceous gout', 'CUST_ICB_OSTEOARTHRITIS'),
        ('transitional cell carcinoma', 'C19CAN_COD'),
        ('type 1 diabetes', 'DMTYPE1_COD'),
        ('type 2 diabetes', 'DMTYPE2_COD'),
        ('ulcerative colitis', 'eFI2_InflammatoryBowelDisease'),
        ('urothelial carcinoma', 'NHAEMCANMORPH_COD'),
        ('urticaria', 'XSAL_COD'),
        ('uveitis', 'CUST_ICB_VISUAL_IMPAIRMENT'),
        ('vascular disease', 'CVDINVITE_COD'),
        ('vasculitis', 'CRYOGLOBVASC_COD')
    ) AS t(Search_Term, Cluster_ID)
),

ClusterCodes AS (
    SELECT
        stc.Search_Term,
        c."SNOMEDCode",
        c."SNOMEDDescription"
    FROM SearchTermClusters stc
    JOIN DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes" c
        ON stc.Cluster_ID = c."Cluster_ID"
    WHERE c."SNOMEDCode" IS NOT NULL
),

ExplicitCodes AS (
    SELECT Search_Term, SNOMEDCode, SNOMEDDescription FROM (VALUES
        ('acute coronary syndrome', '837091000000100', 'Manual mapping'),
        ('ankylosing spondylitis', '162930007', 'Manual mapping'),
        ('ankylosing spondylitis', '239805001', 'Manual mapping'),
        ('ankylosing spondylitis', '239810002', 'Manual mapping'),
        ('ankylosing spondylitis', '239811003', 'Manual mapping'),
        ('ankylosing spondylitis', '394990003', 'Manual mapping'),
        ('ankylosing spondylitis', '429712009', 'Manual mapping'),
        ('ankylosing spondylitis', '441562009', 'Manual mapping'),
        ('ankylosing spondylitis', '441680005', 'Manual mapping'),
        ('ankylosing spondylitis', '441930001', 'Manual mapping'),
        ('axial spondyloarthritis', '723116002', 'Manual mapping'),
        ('choroidal neovascularisation', '380621000000102', 'Manual mapping'),
        ('choroidal neovascularisation', '733124000', 'Manual mapping')
    ) AS t(Search_Term, SNOMEDCode, SNOMEDDescription)
),

AllIndicationCodes AS (
    SELECT Search_Term, "SNOMEDCode" AS SNOMEDCode, "SNOMEDDescription" AS SNOMEDDescription
    FROM ClusterCodes
    UNION ALL
    SELECT Search_Term, SNOMEDCode, SNOMEDDescription
    FROM ExplicitCodes
)
"""
|
|
||||||
|
|
||||||
|
def get_patient_indication_groups(
    patient_pseudonyms: list[str],
    connector: Optional[SnowflakeConnector] = None,
    batch_size: int = 500,
) -> "pd.DataFrame":
    """
    Batch lookup GP diagnosis-based indication groups using Snowflake cluster query.

    This function queries Snowflake directly using the embedded cluster CTE
    (from snomed_indication_mapping_query.sql) to find patients with matching
    GP diagnoses. This is the NEW approach replacing the old SQLite-based lookup.

    The query:
    1. Uses the cluster mapping CTE to get all Search_Term -> SNOMED code mappings
    2. Joins with PrimaryCareClinicalCoding to find patients with matching codes
    3. Returns the most recent match per patient (by EventDateTime)

    Args:
        patient_pseudonyms: List of PseudoNHSNoLinked values (matches PatientPseudonym in GP records)
        connector: Optional SnowflakeConnector (defaults to singleton)
        batch_size: Number of patients per Snowflake query batch (default 500)

    Returns:
        DataFrame with columns:
        - PatientPseudonym: The patient identifier (PseudoNHSNoLinked value)
        - Search_Term: The matched indication (e.g., "rheumatoid arthritis")
        - EventDateTime: Date of the GP diagnosis record

        Patients not found in results have no matching GP diagnosis.
    """
    import pandas as pd

    # Shared column list so every return path yields the same schema
    result_columns = ['PatientPseudonym', 'Search_Term', 'EventDateTime']

    logger.info(f"Starting Snowflake-direct indication lookup for {len(patient_pseudonyms)} patients...")

    # Handle edge case: empty patient list
    if not patient_pseudonyms:
        logger.warning("Empty patient list provided")
        return pd.DataFrame(columns=result_columns)

    # A non-positive batch size would either raise in range() or skip every patient
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Check Snowflake availability
    if not SNOWFLAKE_AVAILABLE:
        logger.error("Snowflake connector not available - cannot lookup GP records")
        return pd.DataFrame(columns=result_columns)

    if not is_snowflake_configured():
        logger.error("Snowflake not configured - cannot lookup GP records")
        return pd.DataFrame(columns=result_columns)

    if connector is None:
        connector = get_connector()

    # Results list to collect all matches
    all_results: list[dict] = []

    # Process patients in batches
    total_patients = len(patient_pseudonyms)
    total_batches = (total_patients + batch_size - 1) // batch_size
    for batch_start in range(0, total_patients, batch_size):
        batch_end = min(batch_start + batch_size, total_patients)
        batch_pseudonyms = patient_pseudonyms[batch_start:batch_end]
        batch_num = batch_start // batch_size + 1

        logger.info(f"Batch {batch_num}/{total_batches}: patients {batch_start + 1} to {batch_end}")

        # Build patient IN clause placeholders (one %s per pseudonym; values are bound, not inlined)
        patient_placeholders = ", ".join(["%s"] * len(batch_pseudonyms))

        # Build the full query with cluster CTE.
        # This finds the most recent matching diagnosis for each patient.
        # Search_Term gets a quoted alias because Snowflake upper-cases unquoted
        # identifiers, which would otherwise change the key read back below.
        query = f"""
        {CLUSTER_MAPPING_SQL}
        SELECT
            pc."PatientPseudonym",
            aic.Search_Term AS "Search_Term",
            pc."EventDateTime"
        FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pc
        INNER JOIN AllIndicationCodes aic
            ON pc."SNOMEDCode" = aic.SNOMEDCode
        WHERE pc."PatientPseudonym" IN ({patient_placeholders})
        QUALIFY ROW_NUMBER() OVER (
            PARTITION BY pc."PatientPseudonym"
            ORDER BY pc."EventDateTime" DESC
        ) = 1
        """

        try:
            results = connector.execute_dict(query, tuple(batch_pseudonyms))

            for row in results:
                all_results.append({
                    'PatientPseudonym': row.get('PatientPseudonym'),
                    'Search_Term': row.get('Search_Term'),
                    'EventDateTime': row.get('EventDateTime'),
                })

            logger.debug(f"Batch {batch_num}: found {len(results)} matches")

        except Exception as e:
            logger.error(f"Error querying GP records for batch {batch_num}: {e}")
            # Continue with other batches - partial results are better than none.
            # Patients in a failed batch are counted as unmatched in the summary below.

    # Build result DataFrame (explicit columns keep the schema stable when there are no matches)
    result_df = pd.DataFrame(all_results, columns=result_columns)

    # Log summary statistics
    if len(result_df) > 0:
        matched_count = len(result_df)
        match_rate = 100 * matched_count / total_patients
        unique_terms = result_df['Search_Term'].nunique()
        logger.info("Indication lookup complete:")
        logger.info(f"  Total patients queried: {total_patients}")
        logger.info(f"  Patients with GP match: {matched_count} ({match_rate:.1f}%)")
        logger.info(f"  Unique Search_Terms found: {unique_terms}")

        # Log top Search_Terms
        top_terms = result_df['Search_Term'].value_counts().head(5)
        logger.info(f"  Top 5 indications: {dict(top_terms)}")
    else:
        logger.info(f"Indication lookup complete: 0 matches from {total_patients} patients")

    return result_df
|
|
||||||
 # Export public API
 __all__ = [
     # Dataclasses
@@ -1112,4 +1427,7 @@ __all__ = [
     "get_directorate_from_diagnosis",
     # Batch lookup for indication groups
     "batch_lookup_indication_groups",
+    # Snowflake-direct indication lookup (new approach)
+    "get_patient_indication_groups",
+    "CLUSTER_MAPPING_SQL",
 ]
+64
-499
@@ -1,45 +1,49 @@
-# Progress Log - Direct SNOMED Indication Mapping
+# Progress Log - Indication-Based Pathway Charts

 ## Project Context

-This project extends the existing HCD Pathway Analysis application with direct SNOMED code matching from GP records. The previous project (Phases 1-5) established the pre-computed pathway architecture and modern UI. This phase adds:
+This project adds indication-based icicle charts alongside the existing directory-based charts. Patient diagnoses are matched from GP records using SNOMED cluster codes queried directly from Snowflake.

-1. **Diagnosis-based directorate assignment** - Primary method using GP SNOMED codes
-2. **Indication-based icicle chart** - New chart type showing Trust → Search_Term → Drug → Pathway
+**Key Change from Previous Approach**: Instead of maintaining a local CSV/SQLite mapping of SNOMED codes, we now query the `ClinicalCodingClusterSnomedCodes` clusters directly in Snowflake during the data refresh. This simplifies the architecture and ensures we always use the latest cluster definitions.

 ## Key Files Reference

 **Existing (reuse these):**
-- `data_processing/schema.py` - SQLite schema (add new table)
-- `data_processing/diagnosis_lookup.py` - Existing cluster-based lookup (extend with direct SNOMED)
-- `data_processing/pathway_pipeline.py` - Pathway processing (add indication type)
-- `cli/refresh_pathways.py` - CLI refresh command (add chart type support)
+- `data_processing/schema.py` - SQLite schema (chart_type column already added)
+- `data_processing/diagnosis_lookup.py` - Extend with new Snowflake query
+- `data_processing/pathway_pipeline.py` - Pathway processing (indication functions exist)
+- `cli/refresh_pathways.py` - CLI refresh command (chart_type arg exists)
 - `pathways_app/pathways_app.py` - Reflex app (add chart type toggle)
 - `tools/data.py` - Data transformations including department_identification()

-**New data:**
-- `data/drug_snomed_mapping_enriched.csv` - 163K rows, 187 Search_Terms, 364 drugs
+**New/Key:**
+- `snomed_indication_mapping_query.sql` - Master SNOMED cluster query to embed in Snowflake calls

 ## Known Patterns

-### SNOMED Mapping Structure
-The enriched mapping CSV has columns:
-- Drug, Indication, TA_ID (from NICE TAs)
-- Search_Term (simplified grouping, 187 unique values)
-- SNOMEDCode, SNOMEDDescription
-- CleanedDrugName, PrimaryDirectorate, AllDirectorates
+### SNOMED Cluster Query Approach
+The `snomed_indication_mapping_query.sql` contains the Search_Term → Cluster_ID mappings:
+- ~148 conditions mapped to clinical coding clusters
+- Joins with `DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"` to get SNOMED codes
+- Includes explicit manual mappings for conditions not in clusters
+- Returns: Search_Term, SNOMEDCode, SNOMEDDescription

-### Direct SNOMED Lookup Logic
-For a patient on drug X:
-1. Get all SNOMED codes for that drug from ref_drug_snomed_mapping
-2. Query PrimaryCareClinicalCoding for those codes (patient's GP record)
-3. If match found → use Search_Term and PrimaryDirectorate from matched row
-4. If no match → fall back to department_identification()
-5. Use most recent SNOMED code by EventDateTime if multiple matches
+### GP Record Matching
+To find a patient's indication:
+1. Use the cluster query as a CTE
+2. Join with `PrimaryCareClinicalCoding` on SNOMEDCode
+3. Filter by PatientPseudonym (use PseudoNHSNoLinked from HCD data)
+4. Use most recent match by EventDateTime
+5. Return Search_Term for matched patients

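The "most recent match by EventDateTime" step in GP Record Matching can be illustrated in plain Python. This is only a sketch of the selection rule, not the Snowflake query itself: `latest_match_per_patient` is a hypothetical helper name, and the rows are made-up sample data using the three columns the lookup returns.

```python
from datetime import datetime


def latest_match_per_patient(rows):
    """Keep, for each PatientPseudonym, the row with the latest EventDateTime."""
    latest = {}
    for row in rows:
        key = row["PatientPseudonym"]
        current = latest.get(key)
        # Replace the stored row only when this one is strictly more recent.
        if current is None or row["EventDateTime"] > current["EventDateTime"]:
            latest[key] = row
    return latest


# Illustrative sample rows (not real patient data).
rows = [
    {"PatientPseudonym": "P1", "Search_Term": "ulcerative colitis", "EventDateTime": datetime(2020, 1, 1)},
    {"PatientPseudonym": "P1", "Search_Term": "uveitis", "EventDateTime": datetime(2023, 6, 1)},
    {"PatientPseudonym": "P2", "Search_Term": "psoriatic arthritis", "EventDateTime": datetime(2021, 3, 15)},
]
result = latest_match_per_patient(rows)
```

In the query this rule is applied server-side, so only one row per patient comes back from Snowflake.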
+### Patient Identifier Mapping
+- HCD data has `PseudoNHSNoLinked` column - this matches `PatientPseudonym` in GP records
+- DO NOT use `PersonKey` (LocalPatientID) - this is provider-specific and won't match GP records
+- UPID = Provider Code (3 chars) + PersonKey
+
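A minimal sketch of the identifier mapping described above, assuming HCD rows are available as dictionaries with `UPID` and `PseudoNHSNoLinked` keys. The helper name `map_indications_to_upid` and all sample values are hypothetical; the point is that the GP lookup is keyed on `PseudoNHSNoLinked`, and results are mapped back to UPID afterwards.

```python
def map_indications_to_upid(hcd_rows, indication_by_pseudonym):
    """Attach a Search_Term to each UPID via PseudoNHSNoLinked (None if unmatched)."""
    return {
        row["UPID"]: indication_by_pseudonym.get(row["PseudoNHSNoLinked"])
        for row in hcd_rows
    }


# Illustrative rows: UPID is the 3-character provider code followed by PersonKey.
hcd_rows = [
    {"UPID": "ABC" + "1001", "PseudoNHSNoLinked": "PSEUDO-A"},
    {"UPID": "XYZ" + "2002", "PseudoNHSNoLinked": "PSEUDO-B"},
]

# The unique pseudonyms are what would be passed to the GP lookup.
pseudonyms = sorted({row["PseudoNHSNoLinked"] for row in hcd_rows})

# Stand-in for the lookup result: only one patient has a GP match.
indication_by_pseudonym = {"PSEUDO-A": "rheumatoid arthritis"}
upid_to_indication = map_indications_to_upid(hcd_rows, indication_by_pseudonym)
```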
 ### Chart Type Architecture
 - `chart_type` column in pathway_nodes: "directory" or "indication"
-- 12 total pathway datasets: 6 date filters × 2 chart types
+- 12 total pathway datasets: 6 date filters x 2 chart types
 - Indication chart: mixed labels (Search_Term for matched, Directorate for unmatched)

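The "mixed labels" rule for the indication chart can be sketched as below. `indication_label` is a hypothetical helper and the directorate names are illustrative only; the sketch assumes an unmatched patient is represented by a missing (`None`) Search_Term.

```python
def indication_label(search_term, directorate):
    """Use Search_Term when a GP match exists, otherwise fall back to the directorate."""
    return search_term if search_term else directorate


# Illustrative patients: one matched in GP records, one unmatched.
patients = [
    {"UPID": "ABC1001", "Search_Term": "uveitis", "Directorate": "OPHTHALMOLOGY"},
    {"UPID": "ABC1002", "Search_Term": None, "Directorate": "RHEUMATOLOGY"},
]
labels = {
    p["UPID"]: indication_label(p["Search_Term"], p["Directorate"])
    for p in patients
}
```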
 ### Date Filter Combinations
@@ -52,493 +56,54 @@ For a patient on drug X:
 | `2yr_6mo` | Last 2 years | Last 6 months | No |
 | `2yr_12mo` | Last 2 years | Last 12 months | No |

-### Expected Volumes
-- SNOMED mapping: 163K rows
-- Search_Terms: 187 unique
-- Pathway nodes per date filter: ~300 (directory), ~400-600 (indication)
+### Previous Work (Reusable)
+These components from the previous approach are still valid:
+- `chart_type` column and schema migration (Task 2.1 - complete)
+- `generate_icicle_chart_indication()` function (Task 2.2 - complete)
+- `process_indication_pathway_for_date_filter()` function (Task 2.2 - complete)
+- `extract_indication_fields()` function (Task 2.2 - complete)
+- `--chart-type` CLI argument (Task 2.3 - complete)
+
+### What Needs Replacement
+The previous `batch_lookup_indication_groups()` function in `diagnosis_lookup.py` used a local SQLite table. This needs to be replaced with a new function that queries Snowflake directly using the cluster query.

 ---

 ## Iteration Log

+<!-- Each iteration appends a structured entry below -->
+
 ## Iteration 1 — 2026-02-05
-### Task: 1.1 Create SQLite Table for SNOMED Mapping
+### Task: 1.1 Create Indication Lookup Query
 ### Why this task:
-- First task in Phase 1 (Data Infrastructure) — all other phases depend on having the data layer in place
-- No external dependencies — pure schema definition work
-- Follows "data infrastructure first" principle
+- This is the foundation task — other tasks (1.2 CLI integration, 2.3 refresh command) depend on this function
+- The progress.txt explicitly noted the old approach needs replacement
+- Logical flow: data query function must exist before pipeline integration
 ### Status: COMPLETE
 ### What was done:
-- Added `REF_DRUG_SNOMED_MAPPING_SCHEMA` to `data_processing/schema.py` with 11 columns:
-  - id, drug_name, indication, ta_id, search_term, snomed_code, snomed_description
-  - cleaned_drug_name, primary_directorate, all_directorates, created_at
-- Added 5 custom indexes for lookup performance:
-  - idx_ref_drug_snomed_mapping_drug (drug_name)
-  - idx_ref_drug_snomed_mapping_cleaned (cleaned_drug_name)
-  - idx_ref_drug_snomed_mapping_snomed (snomed_code)
-  - idx_ref_drug_snomed_mapping_search_term (search_term)
-  - idx_ref_drug_snomed_mapping_drug_snomed (composite: cleaned_drug_name, snomed_code)
-- Added `create_drug_snomed_mapping_table()` helper function
-- Added schema to `REFERENCE_TABLES_SCHEMA` (included in `ALL_TABLES_SCHEMA`)
-- Updated helper functions to include new table:
-  - `drop_reference_tables()` — drops new table
-  - `get_reference_table_counts()` — counts new table (with try/except for safety)
-  - `verify_reference_tables_exist()` — checks for new table
+- Created `get_patient_indication_groups()` function in `data_processing/diagnosis_lookup.py`
+- Embedded the full cluster mapping SQL (from snomed_indication_mapping_query.sql) as `CLUSTER_MAPPING_SQL` constant
+- Function takes list of PseudoNHSNoLinked values and queries Snowflake directly
+- Uses QUALIFY ROW_NUMBER() OVER (PARTITION BY PatientPseudonym ORDER BY EventDateTime DESC) = 1 to get most recent match
+- Returns DataFrame with PatientPseudonym, Search_Term, EventDateTime columns
+- Handles edge cases: empty patient list, Snowflake unavailable/unconfigured
+- Added batch processing (default 500 patients per batch) for large datasets
+- Added logging for match statistics (match rate, unique Search_Terms, top 5 indications)
+- Added both function and CLUSTER_MAPPING_SQL to __all__ exports
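The batch processing and `IN (...)` placeholder construction noted above can be sketched in isolation. The helper names `iter_batches` and `in_clause_placeholders` are hypothetical (the committed function does this inline), and no Snowflake connection is involved; the sketch only shows how 500-patient batches and `%s` bind placeholders are derived.

```python
def iter_batches(items, batch_size=500):
    """Yield (batch_number, batch) pairs of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield start // batch_size + 1, items[start:start + batch_size]


def in_clause_placeholders(batch):
    """Build one %s bind placeholder per value for a parameterised IN (...) clause."""
    return ", ".join(["%s"] * len(batch))


# Illustrative input: 1201 made-up pseudonyms split into batches of 500.
pseudonyms = [f"P{i}" for i in range(1, 1202)]
batches = list(iter_batches(pseudonyms, batch_size=500))
```

Binding values through placeholders, rather than formatting them into the SQL string, keeps the patient identifiers out of the query text.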
 ### Validation results:
-- Tier 1 (Code): `python -m py_compile data_processing/schema.py` — PASSED
-- Tier 1 (Code): Import check — PASSED
-- Tier 2 (Data): Migration created table with 0 rows — PASSED
-- Tier 2 (Data): All 11 columns and 6 indexes present — PASSED
+- Tier 1 (Code): ✅ `python -m py_compile` passed, import check passed
+- Tier 2 (Data): ✅ Empty list returns correct empty DataFrame with expected columns
+- Tier 3 (Functional): N/A (not a UI task)
 ### Files changed:
-- `data_processing/schema.py` — added schema, updated helpers
-### Committed: cf35937 "feat: add ref_drug_snomed_mapping schema (Task 1.1)"
+- `data_processing/diagnosis_lookup.py` — added CLUSTER_MAPPING_SQL constant and get_patient_indication_groups() function
+- `IMPLEMENTATION_PLAN.md` — marked Task 1.1 items complete
+### Committed: 052256c "feat: add get_patient_indication_groups() for Snowflake-direct GP lookup (Task 1.1)"
 ### Patterns discovered:
-- Existing schema pattern: CREATE TABLE + CREATE INDEX in single SQL string
-- Helper functions follow consistent pattern: create, drop, get_counts, verify_exists
-- `get_reference_table_counts` needed try/except for tables that might not exist yet
+- Snowflake's QUALIFY clause is cleaner than subquery for row_number filtering
+- The cluster CTE has 148 Search_Term mappings plus 13 explicit SNOMED codes
 ### Next iteration should:
-- Start Task 1.2: Create `data_processing/load_snomed_mapping.py` script to load the CSV
-- The CSV has columns: Drug, Indication, TA_ID, Search_Term, SNOMEDCode, SNOMEDDescription, CleanedDrugName, PrimaryDirectorate, AllDirectorates
-- Note: SNOMEDCode in CSV has `.0` suffix (e.g., "156370009.0") — may need cleanup
+- Test the function with real patient data to verify it returns expected Search_Terms (Task 1.1 verification)
+- OR proceed to Task 1.2 (integrate with CLI refresh command) if confident in the function
+- The key integration point: extract unique PseudoNHSNoLinked values from HCD data, call this function, map results back to UPID for indication_df
 ### Blocked items:
 - None

-## Iteration 2 — 2026-02-05
-### Task: 1.2 Load Enriched Mapping Data
-### Why this task:
-- Task 1.1 (schema creation) is complete — table `ref_drug_snomed_mapping` exists
-- Task 1.2 is the next logical step in Phase 1 (Data Infrastructure)
-- Task 1.3 (Extend Diagnosis Lookup) depends on having data in the table to query
-- Following the "data infrastructure first" principle
-### Status: COMPLETE
-### What was done:
-- Created `data_processing/load_snomed_mapping.py` with:
-  - `migrate_drug_snomed_mapping()` - reads CSV, cleans SNOMED codes, inserts to SQLite
-  - `get_drug_snomed_mapping_counts()` - statistics function
-  - `verify_drug_snomed_mapping_migration()` - validation function
-  - `clean_snomed_code()` - removes trailing .0 from SNOMED codes (e.g., "156370009.0" → "156370009")
-  - CLI interface with `--verify-only` and `-v` options
-- Added CLI entry point: `python -m data_processing.load_snomed_mapping`
-- Fixed Unicode encoding issue on Windows console (emojis → ASCII text)
-### Validation results:
-- Tier 1 (Code): `python -m py_compile data_processing/load_snomed_mapping.py` — PASSED
-- Tier 1 (Code): Import check — PASSED
-- Tier 2 (Data): Migration loaded 144,056 rows from 163,937 CSV rows (19,881 duplicates skipped)
-- Tier 2 (Data): Verified: 707 unique drugs, 187 unique search terms, 21,265 SNOMED codes
-- Tier 2 (Data): SNOMED codes cleaned — 0 rows with .0 suffix remaining
-### Files changed:
-- `data_processing/load_snomed_mapping.py` — new file
-### Committed: 6d68b5e "feat: add SNOMED mapping loader script (Task 1.2)"
-### Patterns discovered:
-- CSV has 163,937 rows but only 144,056 unique (drug_name, indication, snomed_code) combinations
-- SNOMED codes in CSV have `.0` suffix from float formatting — need `clean_snomed_code()` function
-- Windows console can't print Unicode emojis — use ASCII text like `[OK]` instead of ✅
-### Next iteration should:
-- Start Task 1.3: Extend Diagnosis Lookup Module
-- Add `get_drug_snomed_codes(drug_name)` to `data_processing/diagnosis_lookup.py`
-- Add `patient_has_indication_direct(patient_pseudonym, snomed_codes, connector)` function
-- Look at existing `diagnosis_lookup.py` for patterns to follow
-### Blocked items:
-- None
-
-## Iteration 3 — 2026-02-05
-### Task: 1.3 Extend Diagnosis Lookup Module
-### Why this task:
-- Tasks 1.1 and 1.2 are complete — the data infrastructure (schema + loaded data) is in place
-- Task 1.3 is the next logical step — functions to query the data before using it in the pipeline
-- Phase 2 (Pathway Processing) depends on having these lookup functions
-- Following the "data infrastructure first" principle
-### Status: COMPLETE
-### What was done:
-- Added two new dataclasses to `data_processing/diagnosis_lookup.py`:
-  - `DrugSnomedMapping`: Holds SNOMED code mapping with snomed_code, snomed_description, search_term, primary_directorate, indication, ta_id
-  - `DirectSnomedMatchResult`: Result of direct SNOMED lookup with matched flag, snomed_code, search_term, primary_directorate, event_date, source
-- Added `get_drug_snomed_codes(drug_name)` function:
-  - Queries `ref_drug_snomed_mapping` table for all SNOMED codes for a drug
-  - Case-insensitive matching on both `cleaned_drug_name` and `drug_name` columns
-  - Returns list of DrugSnomedMapping dataclass instances
-- Added `patient_has_indication_direct(patient_pseudonym, drug_snomed_mappings, connector)` function:
-  - Queries `PrimaryCareClinicalCoding` directly for exact SNOMED code matches
-  - Returns most recent match by EventDateTime (ORDER BY DESC LIMIT 1)
-  - Handles Snowflake unavailability gracefully
-- Updated `__all__` exports to include new dataclasses and functions
-### Validation results:
-- Tier 1 (Code): `python -m py_compile data_processing/diagnosis_lookup.py` — PASSED
-- Tier 1 (Code): Import check — PASSED
-- Tier 2 (Data): ADALIMUMAB returns 1320 SNOMED mappings across 10 Search_Terms
-- Tier 2 (Data): RANIBIZUMAB returns 104 SNOMED mappings
-- Tier 2 (Data): Case insensitivity verified (upper/lower/mixed all return same results)
-- Tier 2 (Data): Empty mappings returns unmatched result correctly
-### Files changed:
-- `data_processing/diagnosis_lookup.py` — added 2 dataclasses, 2 functions, updated __all__
-- `IMPLEMENTATION_PLAN.md` — marked Task 1.3 complete
-### Committed: b44d22d "feat: add direct SNOMED lookup functions (Task 1.3)"
-### Patterns discovered:
-- ADALIMUMAB has 10 unique Search_Terms with varying SNOMED code counts:
-  - rheumatoid arthritis: 867 codes (largest)
-  - ankylosing spondylitis: 96 codes
-  - uveitis: 124 codes
-  - ulcerative colitis: 78 codes
-  - juvenile idiopathic arthritis: 57 codes
-  - crohn's disease: 50 codes
-  - psoriatic arthritis: 17 codes
-  - plaque psoriasis: 16 codes
-  - hidradenitis suppurativa: 8 codes
-  - axial spondyloarthritis: 7 codes (smallest)
-- The query uses DISTINCT to avoid duplicate rows
-- Function handles edge cases: empty mappings, Snowflake unavailable
-### Next iteration should:
-- Phase 1 is now complete! Start Phase 2: Pathway Processing Updates
-- Begin with Task 2.1: Update Directorate Assignment Logic
-- Create `get_directorate_from_diagnosis(upid, drug_name, connector)` function
-- Logic: Try diagnosis-based first (using get_drug_snomed_codes + patient_has_indication_direct)
-- Fallback to department_identification() if no match
-- Return: (directorate, source) where source is "DIAGNOSIS" or "FALLBACK"
-- Key consideration: This will need to integrate with the existing data flow in `fetch_and_transform_data()`
-### Blocked items:
-- None
-
-## Iteration 4 — 2026-02-05
-### Task: 2.1 Update Directorate Assignment Logic
-### Why this task:
-- Phase 1 (Data Infrastructure) is complete — all 3 tasks marked [x]
-- Task 2.1 is the first task in Phase 2 (Pathway Processing Updates)
-- Previous iteration explicitly recommended starting with Task 2.1
-- Tasks 2.2 (schema) and 2.3 (indication processing) logically depend on having the directorate assignment logic
-- Following "data infrastructure first, then pipeline" principle
-### Status: COMPLETE
-### What was done:
-- Added `DirectorateAssignment` dataclass to `data_processing/diagnosis_lookup.py`:
-  - Fields: upid, drug_name, directorate, search_term, source, snomed_code, event_date
-  - source is "DIAGNOSIS" (SNOMED match found) or "FALLBACK" (no match)
-- Added `get_directorate_from_diagnosis(upid, drug_name, connector, db_manager, before_date)` function:
-  - Gets all SNOMED codes for drug from ref_drug_snomed_mapping
-  - Queries patient's GP records (via patient_has_indication_direct)
-  - Returns diagnosis-based directorate and search_term if match found
-  - Returns FALLBACK result if no match (caller handles fallback logic)
-  - Extracts PatientPseudonym from UPID by removing first 3 characters (ProviderCode)
-- Updated `__all__` exports to include new dataclass and function
-### Validation results:
-- Tier 1 (Code): `python -m py_compile data_processing/diagnosis_lookup.py` — PASSED
-- Tier 1 (Code): Import check for DirectorateAssignment, get_directorate_from_diagnosis — PASSED
-- Tier 2 (Data): Test with ADALIMUMAB returns 1320 SNOMED mappings across 10 search_terms — PASSED
-- Tier 2 (Data): Test with unknown drug returns FALLBACK source — PASSED
-- Tier 2 (Data): Test without Snowflake returns FALLBACK source correctly — PASSED
-### Files changed:
-- `data_processing/diagnosis_lookup.py` — added DirectorateAssignment dataclass, get_directorate_from_diagnosis function, updated __all__
-- `IMPLEMENTATION_PLAN.md` — marked Task 2.1 complete
-### Committed: 5067694 "feat: add get_directorate_from_diagnosis() function (Task 2.1)"
-### Patterns discovered:
-- UPID format: Provider Code (3 chars) + PersonKey, where PersonKey = PatientPseudonym
-- The function is designed to be called at the DataFrame level during pipeline processing
-- For batch processing, the caller will need to iterate over rows and collect DIAGNOSIS vs FALLBACK statistics
-- The function handles edge cases: no SNOMED mappings, Snowflake unavailable, no GP record match
-### Next iteration should:
-- Start Task 2.2: Add Chart Type Support to Schema
-- Add `chart_type` column to `pathway_nodes` table (values: "directory", "indication")
-- Update schema in `data_processing/schema.py`
-- Consider: may need ALTER TABLE migration for existing data
-- Alternative: add to pathway_date_filters or create pathway_chart_types reference table
-- Key consideration: The indication chart will group by Search_Term (from SNOMED match) or Directorate (fallback)
-- The chart_type column allows filtering pathway_nodes by chart type when user toggles in UI
-### Blocked items:
-- None
-
-## Iteration 5 — 2026-02-05
-### Task: 2.2 Add Chart Type Support to Schema
-### Why this task:
-- Task 2.1 (directorate assignment logic) is complete
-- Task 2.2 is the next task in Phase 2 — schema changes before pipeline changes
-- Task 2.3 (Indication Pathway Processing) requires the chart_type column to store indication pathways
-- Following "data infrastructure first" principle — schema changes before pipeline logic
-### Status: COMPLETE
-### What was done:
-- Modified `PATHWAY_NODES_SCHEMA` in `data_processing/schema.py`:
-  - Added `chart_type TEXT NOT NULL DEFAULT 'directory'` column
-  - Updated UNIQUE constraint from `(date_filter_id, ids)` to `(date_filter_id, chart_type, ids)`
-  - Added `idx_pathway_nodes_chart_type` index on `(date_filter_id, chart_type)`
-  - Updated `idx_pathway_nodes_filter_composite` index to include `chart_type`
-- Added `migrate_pathway_nodes_chart_type(conn)` function:
-  - Checks if chart_type column exists (idempotent)
-  - Adds column with ALTER TABLE if missing
-  - Creates/updates indexes
-  - Returns (success: bool, message: str)
-- Updated `data_processing/migrate.py`:
-  - Added import for `migrate_pathway_nodes_chart_type`
-  - Called migration in `initialize_database()` after creating tables
-- Ran migration on existing database: 293 rows updated with chart_type='directory'
-### Validation results:
-- Tier 1 (Code): `python -m py_compile data_processing/schema.py` — PASSED
-- Tier 1 (Code): `python -m py_compile data_processing/migrate.py` — PASSED
-- Tier 1 (Code): Import check for migrate_pathway_nodes_chart_type — PASSED
-- Tier 2 (Data): Migration ran successfully on existing database — PASSED
-- Tier 2 (Data): All 293 existing rows have chart_type='directory' — PASSED
-- Tier 2 (Data): Both indexes created (idx_pathway_nodes_chart_type, idx_pathway_nodes_filter_composite) — PASSED
-- Tier 2 (Data): Database status check shows all tables intact — PASSED
-### Files changed:
-- `data_processing/schema.py` — added chart_type column, updated UNIQUE constraint, added indexes, added migration function
-- `data_processing/migrate.py` — import and call migration function
-- `IMPLEMENTATION_PLAN.md` — marked Task 2.2 complete with full checklist
-### Committed: 19607d7 "feat: add chart_type column to pathway_nodes schema (Task 2.2)"
-### Patterns discovered:
-- SQLite ALTER TABLE ADD COLUMN works with DEFAULT values — existing rows get the default
-- SQLite doesn't allow modifying UNIQUE constraints after table creation, but since all existing rows have the same chart_type value, the new constraint is satisfied
-- Migration function is idempotent — safe to run multiple times
-- Composite indexes including chart_type allow efficient filtering by (date_filter_id, chart_type)
-### Next iteration should:
-- Start Task 2.3: Create Indication Pathway Processing
-- Add `process_indication_pathways()` to `pathway_pipeline.py`
-- Group by: Trust → Search_Term → Drug → Pathway
-- For unmatched patients: use directorate name as Search_Term fallback
-- Add `extract_indication_fields()` for denormalized columns
-- Key consideration: Need to modify `generate_icicle_chart()` or create parallel function for indication hierarchy
-- The indication chart uses Search_Term (from SNOMED match) or Directorate (from fallback) as the level-2 grouping
-- Output structure should match directory chart but with different grouping labels
-### Blocked items:
-- None
-
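The idempotent `chart_type` migration recorded in Iteration 5 follows a common SQLite pattern, sketched here against an in-memory database. The function name `add_chart_type_column` and the two-column `pathway_nodes` table are simplified assumptions for illustration, not the project's actual schema or migration code.

```python
import sqlite3


def add_chart_type_column(conn):
    """Add chart_type to pathway_nodes if missing; safe to call repeatedly."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(pathway_nodes)")]
    if "chart_type" in columns:
        return True, "chart_type column already present"
    # Existing rows receive the DEFAULT value when the column is added.
    conn.execute(
        "ALTER TABLE pathway_nodes "
        "ADD COLUMN chart_type TEXT NOT NULL DEFAULT 'directory'"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_pathway_nodes_chart_type "
        "ON pathway_nodes (date_filter_id, chart_type)"
    )
    conn.commit()
    return True, "chart_type column added"


# Simplified stand-in table with one pre-existing row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathway_nodes (date_filter_id TEXT, ids TEXT)")
conn.execute("INSERT INTO pathway_nodes VALUES ('2yr_6mo', 'root')")

first = add_chart_type_column(conn)
second = add_chart_type_column(conn)
existing = conn.execute("SELECT chart_type FROM pathway_nodes").fetchall()
```

Checking for the column before altering the table is what makes a second run a no-op instead of an error.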
-## Iteration 6 — 2026-02-05
-### Task: 2.3 Create Indication Pathway Processing
-### Why this task:
-- Tasks 2.1 and 2.2 are complete — directorate assignment logic and schema are in place
-- Task 2.3 is the next logical step in Phase 2 — processing logic before Phase 3 (CLI updates)
-- Previous iteration explicitly recommended starting Task 2.3
-- Phase 3 depends on having the indication pathway processing functions
-- Following "pipeline before UI" principle
-### Status: COMPLETE
-### What was done:
-- Added `generate_icicle_chart_indication()` to `analysis/pathway_analyzer.py`:
-  - Variant of generate_icicle_chart() that uses indication_df instead of directory_df
-  - Takes `indication_df` parameter mapping UPID → Indication_Group
-  - The indication_df must have 'Directory' column (renamed from Indication_Group for compatibility)
-  - Hierarchy: Trust → Indication_Group → Drug → Pathway
-
-- Added `process_indication_pathway_for_date_filter()` to `data_processing/pathway_pipeline.py`:
-  - Wrapper function that calls generate_icicle_chart_indication()
-  - Takes indication_df parameter (UPID → Indication_Group mapping)
-  - Computes date ranges and passes to the chart generator
-
-- Added `extract_indication_fields()` to `data_processing/pathway_pipeline.py`:
-  - Similar to extract_denormalized_fields() but for indication charts
-  - Extracts: trust_name, directory (stores search_term), drug_sequence
-  - Uses 'directory' column for schema compatibility
-
-- Updated `convert_to_records()` with `chart_type` parameter:
-  - Added chart_type to the record dictionary
-  - Supports "directory" and "indication" values
-  - Logs chart_type in output message
-
-- Added `ChartType` type alias: `Literal["directory", "indication"]`
-
-- Updated `__all__` exports to include new functions and type
-
-### Validation results:
-- Tier 1 (Code): `python -m py_compile data_processing/pathway_pipeline.py` — PASSED
-- Tier 1 (Code): `python -m py_compile analysis/pathway_analyzer.py` — PASSED
-- Tier 1 (Code): Import check for all new functions — PASSED
-  - ChartType, process_indication_pathway_for_date_filter, extract_indication_fields all exported
-  - generate_icicle_chart_indication importable from pathway_analyzer
-### Files changed:
-- `analysis/pathway_analyzer.py` — added generate_icicle_chart_indication() function
-- `data_processing/pathway_pipeline.py` — added ChartType, process_indication_pathway_for_date_filter(), extract_indication_fields(), updated convert_to_records()
-- `IMPLEMENTATION_PLAN.md` — marked Task 2.3 complete with full checklist
-### Committed: 7cbc648 "feat: add indication pathway processing functions (Task 2.3)"
-### Patterns discovered:
-- The build_hierarchy() function uses directory_df to map UPID → Directory for level-2 grouping
-- For indication charts, we pass indication_df with 'Directory' column (renamed from Indication_Group) to build_hierarchy()
-- The indication_df must be indexed by UPID (same as directory_df)
-- Schema compatibility is maintained by storing search_term in the 'directory' column
-- The chart_type column allows filtering pathway_nodes by chart type in Reflex UI
-### Next iteration should:
-- Phase 2 is now complete! Start Phase 3: CLI & Data Refresh Updates
-- Begin with Task 3.1: Update Refresh Command for Dual Chart Types
-- Modify `cli/refresh_pathways.py` to process both "directory" and "indication" chart types
-- Add `--chart-type` argument: "all" (default), "directory", "indication"
-- For each of 6 date filters: generate 2 chart datasets
-- Total: 12 pathway datasets (6 dates × 2 chart types)
-- Key consideration: Need to create indication_df by looking up GP diagnoses for all patients
-- This involves batching Snowflake queries for performance
-- Use get_directorate_from_diagnosis() function from diagnosis_lookup.py
-- For unmatched patients: indication_group = "Directory (no GP dx)"
-### Blocked items:
-- None
-
## Iteration 7 - 2026-02-05
### Task: 3.1 Update Refresh Command for Dual Chart Types
### Why this task:
- Phase 2 complete - all data infrastructure and processing functions ready
- Task 3.1 is first in Phase 3 - CLI command is the entry point for the pipeline
- Previous iteration recommended starting Task 3.1
- Processing functions exist but weren't integrated into refresh workflow yet
### Status: COMPLETE
### What was done:
- Modified `cli/refresh_pathways.py`:
  - Added `--chart-type` argument with choices: "directory", "indication", "all"
  - Default is "directory" to maintain backward compatibility
  - Updated `insert_pathway_records` to include `chart_type` column (required for new schema)
  - Added `chart_type` parameter to `refresh_pathways()` function
  - Updated Step 2 to loop through chart_types_to_process list
  - For "directory": uses existing `process_all_date_filters()`
  - For "indication": placeholder with warning (requires Task 3.2 for GP diagnosis)
  - Updated logging to show per-chart-type counts
  - Updated help text with new examples
- Added imports for new pathway_pipeline functions:
  - `ChartType`, `DATE_FILTER_CONFIGS`, `process_indication_pathway_for_date_filter`
  - `extract_indication_fields`, `convert_to_records`
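
The argument handling described above could be sketched like this. The flag name, choices and default come from the log; `build_parser` and `resolve_chart_types` are hypothetical names, and the real `cli/refresh_pathways.py` may be organised differently.

```python
import argparse

ALL_CHART_TYPES = ["directory", "indication"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="refresh_pathways")
    parser.add_argument(
        "--chart-type",
        choices=["directory", "indication", "all"],
        default="directory",  # keeps the pre-existing behaviour when the flag is omitted
        help="Which chart type(s) to refresh",
    )
    return parser


def resolve_chart_types(choice: str) -> list[str]:
    """Expand "all" into the concrete chart types to process."""
    return list(ALL_CHART_TYPES) if choice == "all" else [choice]


args = build_parser().parse_args(["--chart-type", "all"])
chart_types_to_process = resolve_chart_types(args.chart_type)
```

Expanding "all" once, up front, means the processing loop only ever sees concrete chart types.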
### Validation results:
- Tier 1 (Code): `python -m py_compile cli/refresh_pathways.py` - PASSED
- Tier 1 (Code): Import check for `refresh_pathways`, `main` - PASSED
- Tier 1 (Code): `python -m cli.refresh_pathways --help` - Shows new argument - PASSED
- Tier 2 (Data): Not fully testable until Task 3.2 (GP diagnosis integration)
### Files changed:
- `cli/refresh_pathways.py` - added chart_type support
- `IMPLEMENTATION_PLAN.md` - marked Task 3.1 items complete
### Committed: 593d14c "feat: add chart_type argument to refresh command (Task 3.1)"
### Patterns discovered:
- The `convert_to_records()` function already has `chart_type` parameter with default "directory"
- Results dictionary now keyed by "date_filter_id:chart_type" (e.g., "all_6mo:directory")
- Stats now include `chart_type_counts` for summary by chart type
### Next iteration should:
- Start Task 3.2: Integrate Diagnosis-Based Directorate in Pipeline
- This is the key task that enables indication chart processing
- Need to add batch GP diagnosis lookup during `fetch_and_transform_data()`
- Create `indication_df` mapping UPID → Indication_Group (Search_Term or fallback)
- Call `process_indication_pathway_for_date_filter()` with the indication_df
- Key consideration: Batch Snowflake queries for performance (don't query per patient)
- Expected: ~35K patients, need to query in batches of ~1000
- Coverage logging: "X% diagnosis-matched, Y% fallback"
### Blocked items:
- None

## Iteration 8 - 2026-02-05
### Task: 3.2 Integrate Diagnosis-Based Directorate in Pipeline
### Why this task:
- Task 3.1 complete - CLI argument added but indication processing was placeholder
- Task 3.2 is the key task that enables actual indication chart processing
- Previous iteration explicitly recommended starting Task 3.2
- Task 3.3 (full pipeline test) and Phase 4 (UI) depend on this being complete
- Following "pipeline before UI" principle
### Status: COMPLETE
### What was done:
- Added `batch_lookup_indication_groups()` to `data_processing/diagnosis_lookup.py`:
  - Efficient batch function to look up GP diagnoses for all patients
  - Queries Snowflake in batches of 500 patients (configurable batch_size)
  - Gets all SNOMED codes for drugs from local SQLite (fast)
  - Builds single query per batch checking all patient-SNOMED combinations
  - Returns DataFrame with: UPID, Indication_Group, Source
  - Indication_Group is Search_Term (if matched) or "Directory (no GP dx)" (if fallback)
  - Source is "DIAGNOSIS" or "FALLBACK"
  - Logs coverage statistics: X% diagnosis-matched, Y% fallback
- Updated `cli/refresh_pathways.py` indication chart processing:
  - Import batch_lookup_indication_groups
  - When processing indication chart type:
    1. Call batch_lookup_indication_groups(df) to create indication_df
    2. Log coverage statistics to stats dict
    3. Rename Indication_Group → Directory for compatibility with generate_icicle_chart_indication
    4. Set index to UPID for lookup during chart generation
    5. Process all 6 date filters with process_indication_pathway_for_date_filter()
    6. Extract indication fields and convert to records with chart_type="indication"
  - Added error handling with fallback to empty results if GP lookup fails
- Added TYPE_CHECKING import for pandas type hints
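
The batching described above could look roughly like this. It is a sketch under assumptions: `chunked` and `build_batch_query` are hypothetical helpers, `gp_events` is a placeholder table name, and the `%s` placeholders assume a pyformat-style DB-API driver. Only the 500-patient batch size and the idea of one query per batch come from the log.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], batch_size: int = 500) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_batch_query(n_patients: int, n_codes: int) -> str:
    """Build one parameterised query for a batch (table name is an assumption)."""
    patient_marks = ", ".join(["%s"] * n_patients)
    code_marks = ", ".join(["%s"] * n_codes)
    return (
        "SELECT PatientPseudonym, SNOMEDCode, EventDateTime "
        "FROM gp_events "  # hypothetical table name
        f"WHERE PatientPseudonym IN ({patient_marks}) "
        f"AND SNOMEDCode IN ({code_marks}) "
        "ORDER BY EventDateTime DESC"
    )


patients = [f"P{i}" for i in range(1200)]
batches = list(chunked(patients))
sql = build_batch_query(len(batches[-1]), n_codes=3)
```

Binding the identifiers as parameters, rather than formatting them into the SQL string, keeps the query text free of patient identifiers and avoids quoting problems.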
### Validation results:
- Tier 1 (Code): `python -m py_compile data_processing/diagnosis_lookup.py` - PASSED
- Tier 1 (Code): `python -m py_compile cli/refresh_pathways.py` - PASSED
- Tier 1 (Code): Import check for batch_lookup_indication_groups - PASSED
- Tier 1 (Code): `python -m cli.refresh_pathways --help` - Shows all arguments - PASSED
- Tier 2 (Data): Not fully testable without Snowflake connection (requires --dry-run with SSO)
### Files changed:
- `data_processing/diagnosis_lookup.py` - added batch_lookup_indication_groups(), TYPE_CHECKING import
- `cli/refresh_pathways.py` - integrated batch lookup, added full indication processing flow
- `IMPLEMENTATION_PLAN.md` - marked Task 3.2 items complete
### Committed: 8952156 "feat: integrate batch GP diagnosis lookup for indication charts (Task 3.2)"
### Patterns discovered:
- Batch Snowflake queries: Build one query with IN clauses for both patients AND SNOMED codes
- ORDER BY EventDateTime DESC in query lets us pick first result = most recent in Python
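- PersonKey column = PatientPseudonym (used directly for Snowflake lookup). Iteration 9 found this assumption to be wrong: `PersonKey` is `LocalPatientID`, and GP records match on `PseudoNHSNoLinked`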
- indication_df must be indexed by UPID and have 'Directory' column (renamed from Indication_Group)
- Fallback label format: "Directory (no GP dx)" distinguishes matched vs unmatched in chart
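
The "first result is the most recent" pattern and the fallback label can be illustrated in plain Python. The rows, patient IDs and variable names are made up for the example; only the descending sort, the `DIAGNOSIS`/`FALLBACK` sources and the fallback label come from the log.

```python
from datetime import datetime

FALLBACK_LABEL = "Directory (no GP dx)"

# Rows as a query would return them, already sorted by EventDateTime DESC.
rows = [
    ("P1", "rheumatoid arthritis", datetime(2025, 6, 1)),
    ("P2", "psoriasis", datetime(2025, 3, 9)),
    ("P1", "psoriatic arthritis", datetime(2024, 1, 15)),
]

most_recent: dict[str, str] = {}
for patient, search_term, _event_time in rows:
    # setdefault keeps the first (most recent) term seen for each patient.
    most_recent.setdefault(patient, search_term)

all_patients = ["P1", "P2", "P3"]
indication_groups = {
    p: (most_recent[p], "DIAGNOSIS") if p in most_recent else (FALLBACK_LABEL, "FALLBACK")
    for p in all_patients
}
matched = sum(1 for _, source in indication_groups.values() if source == "DIAGNOSIS")
coverage = 100 * matched / len(all_patients)
```

This only works if the rows really arrive in descending date order, so the `ORDER BY` in the query is part of the contract rather than a cosmetic detail.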
### Next iteration should:
- Start Task 3.3: Test Full Refresh Pipeline
- Run `python -m cli.refresh_pathways --chart-type all` with real data (requires Snowflake SSO)
- Verify pathway_nodes table has both chart_type="directory" and chart_type="indication"
- Verify indication chart hierarchy: Trust → Search_Term → Drug → Pathway
- Verify unmatched patients show with "Directory (no GP dx)" labels
- Document: Processing time, record counts, coverage percentages
- If no Snowflake access, skip to Phase 4 (UI) and note as blocked
### Blocked items:
- Task 3.3 verification requires Snowflake connection (NHS SSO)

## Iteration 9 - 2026-02-05
### Task: 3.3 Test Full Refresh Pipeline
### Why this task:
- Previous iteration completed Task 3.2 but noted verification was pending
- Task 3.3 is the final task in Phase 3 to validate all pipeline work
- Phase 4 (UI) depends on having working indication chart data
- Following "validate before moving forward" principle
### Status: IN PROGRESS
### What was done:
1. **First refresh attempt**: Ran `python -m cli.refresh_pathways --chart-type all --verbose`
   - Directory chart: Processed successfully (293 nodes for all_6mo)
   - Indication chart: **0% diagnosis matches** - all 37,257 patients were assigned the FALLBACK source
   - This resulted in no indication pathway data (charts empty)

2. **Diagnosed root cause #1**: SNOMED codes stored in scientific notation
   - CSV has codes like "1.0629311000119108e+16" due to pandas/Excel export
   - The `clean_snomed_code()` function only handled ".0" suffix removal
   - Codes were stored as "1.06e+16", which never match Snowflake data
   - **Fix**: Updated `clean_snomed_code()` to convert scientific notation to integers
   - Reloaded 144,056 SNOMED mappings with properly formatted codes

3. **Diagnosed root cause #2**: Wrong patient identifier used for GP lookup
   - `batch_lookup_indication_groups()` was using `PersonKey` column
   - `PersonKey` = `LocalPatientID` (provider-specific like "J188448")
   - GP records use `PatientPseudonym` which matches `PseudoNHSNoLinked` (SHA-256 hash)
   - **Fix**: Changed to use `PseudoNHSNoLinked` column for GP record matching
   - Test showed ~20% match rate for ADALIMUMAB patients with correct identifier

4. **Committed fixes**: `5b1569e` "fix: correct patient identifier for GP diagnosis lookup (Task 3.3)"

5. **Started second refresh**: Running in background (task ID: be9b9e7)
   - Processing time expected: ~15-20 minutes total
   - Should now show non-zero GP matches

### Validation results:
- Tier 1 (Code): Syntax check passed for both modified files
- Tier 1 (Code): Import check passed
- Tier 2 (Data): SNOMED codes now properly formatted (0 scientific notation entries)
- Tier 2 (Data): GP record matching test: 20 matches found in 100 ADALIMUMAB patients
- Tier 2 (Data): Full refresh still running (started 15:XX) - pending final verification
### Files changed:
- `data_processing/load_snomed_mapping.py` - fixed clean_snomed_code() for scientific notation
- `data_processing/diagnosis_lookup.py` - changed to use PseudoNHSNoLinked for GP lookup
- `IMPLEMENTATION_PLAN.md` - marked Task 3.3 as in progress
### Committed: 5b1569e "fix: correct patient identifier for GP diagnosis lookup (Task 3.3)"
### Patterns discovered:
- **Critical**: PersonKey ≠ PatientPseudonym. HCD data has two patient identifiers:
  - `LocalPatientID` (aliased as PersonKey) - provider-specific, NOT in GP records
  - `PseudoNHSNoLinked` - pseudonymised NHS number, matches `PatientPseudonym` in GP records
- SNOMED codes can exceed 15 digits (the `1.0629311000119108e+16` example has 17), causing float precision issues in pandas/Excel exports
- Scientific notation must be converted back to integers for string matching
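
One possible shape for such a conversion is sketched below, using `decimal` so that no float rounding is introduced by the clean-up itself. This is an illustration, not the code committed in `5b1569e`; the function body and its handling of non-numeric input are assumptions.

```python
from decimal import Decimal, InvalidOperation


def clean_snomed_code(raw: object) -> str:
    """Normalise a SNOMED code read from CSV to a plain digit string.

    Handles a trailing ".0" and scientific notation such as "1.06e+16".
    """
    text = str(raw).strip()
    try:
        value = Decimal(text)
    except InvalidOperation:
        return text  # not numeric: leave untouched
    if not value.is_finite() or value != value.to_integral_value():
        return text  # NaN/Infinity or a genuine fraction: leave untouched
    return str(int(value))
```

This restores only the digits that are still present in the text. If the export already rounded the value (a 64-bit float cannot represent every integer above 2^53), the lost digits cannot be recovered, so reading and writing the column as text is the safer long-term fix.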
### Next iteration should:
1. **Check refresh completion**: Read output from task be9b9e7
   - Look for "DIAGNOSIS matches: X%" line in batch lookup output
   - Should now show non-zero percentage (expected 10-30% based on ADALIMUMAB test)
   - Look for "indication: X nodes total" confirming indication charts generated

2. **If refresh succeeded**: Verify database state
   - `SELECT chart_type, COUNT(*) FROM pathway_nodes GROUP BY chart_type`
   - Should show both "directory" (293) and "indication" (expected 300-600) rows
   - `SELECT DISTINCT directory FROM pathway_nodes WHERE chart_type='indication' LIMIT 20`
   - Should show Search_Term values like "rheumatoid arthritis", "macular degeneration"

3. **Mark Task 3.3 complete** with validation evidence:
   - Processing time
   - Record counts per chart type
   - Coverage percentage (diagnosis vs fallback)

4. **If refresh still running**: Wait or check `tail -50` of output file

5. **Start Phase 4**: If 3.3 passes, begin Task 4.1 (Add Chart Type State to Reflex)
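
The database checks in step 2 could be scripted along these lines. The sketch assumes `pathway_nodes` lives in a SQLite database, which the log does not state explicitly, and uses an in-memory table with made-up rows in place of the real file.

```python
import sqlite3

# In-memory stand-in for the real database; rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathway_nodes (directory TEXT, chart_type TEXT)")
conn.executemany(
    "INSERT INTO pathway_nodes VALUES (?, ?)",
    [
        ("RHEUMATOLOGY", "directory"),
        ("rheumatoid arthritis", "indication"),
        ("macular degeneration", "indication"),
    ],
)

# Row count per chart type, as in the first verification query.
counts = dict(
    conn.execute("SELECT chart_type, COUNT(*) FROM pathway_nodes GROUP BY chart_type")
)
# Distinct level-2 labels for indication charts, as in the second query.
terms = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT directory FROM pathway_nodes "
        "WHERE chart_type = 'indication' ORDER BY directory LIMIT 20"
    )
]
# Any chart type with no rows at all would show up here.
missing = {"directory", "indication"} - counts.keys()
conn.close()
```

An empty `missing` set is the quick signal that both chart types were written; the `terms` list is then worth a manual look to confirm it holds Search_Term values rather than fallback labels only.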
### Blocked items:
- None (Snowflake connection established)