# Progress Log - Indication-Based Pathway Charts

## Project Context

This project adds indication-based icicle charts alongside the existing directory-based charts. Patient diagnoses are matched from GP records using SNOMED cluster codes queried directly from Snowflake.

**Key Change from Previous Approach**: Instead of maintaining a local CSV/SQLite mapping of SNOMED codes, we now query the `ClinicalCodingClusterSnomedCodes` clusters directly in Snowflake during the data refresh. This simplifies the architecture and ensures we always use the latest cluster definitions.

## Key Files Reference

**Existing (reuse these):**
- `data_processing/schema.py` - SQLite schema (chart_type column already added)
- `data_processing/diagnosis_lookup.py` - Extend with new Snowflake query
- `data_processing/pathway_pipeline.py` - Pathway processing (indication functions exist)
- `cli/refresh_pathways.py` - CLI refresh command (chart_type arg exists)
- `pathways_app/pathways_app.py` - Reflex app (add chart type toggle)
- `tools/data.py` - Data transformations including `department_identification()`

**New/Key:**
- `snomed_indication_mapping_query.sql` - Master SNOMED cluster query to embed in Snowflake calls

## Known Patterns

### SNOMED Cluster Query Approach
The `snomed_indication_mapping_query.sql` file contains the Search_Term → Cluster_ID mappings:
- ~148 conditions mapped to clinical coding clusters
- Joins with `DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"` to get SNOMED codes
- Includes explicit manual mappings for conditions not in clusters
- Returns: Search_Term, SNOMEDCode, SNOMEDDescription

### GP Record Matching
To find a patient's indication:
1. Use the cluster query as a CTE
2. Join with `PrimaryCareClinicalCoding` on SNOMEDCode
3. Filter by PatientPseudonym (use PseudoNHSNoLinked from HCD data)
4. Use most recent match by EventDateTime
5. Return Search_Term for matched patients

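The five steps above could be assembled roughly as follows. This is a minimal sketch, not the query in `diagnosis_lookup.py`: the helper name `build_indication_query`, the `pc` alias, the unqualified `PrimaryCareClinicalCoding` reference, the quoted column names, and the `%s` bind placeholders are all assumptions. Only the `aic` alias and the `QUALIFY` clause appear in the iteration notes.

```python
def build_indication_query(cluster_sql: str, n_patients: int) -> str:
    """Return SQL that picks the most recent matching Search_Term per patient.

    Hypothetical helper: table qualification, column quoting and the
    placeholder style are assumptions, not the project's actual query.
    """
    if n_patients < 1:
        raise ValueError("need at least one patient pseudonym")
    # One bind placeholder per pseudonym (step 3: filter by PatientPseudonym).
    placeholders = ", ".join(["%s"] * n_patients)
    return f"""
WITH aic AS (  -- step 1: cluster query as a CTE
{cluster_sql}
)
SELECT
    pc."PatientPseudonym" AS "PatientPseudonym",
    aic.Search_Term AS "Search_Term",  -- step 5: return Search_Term
    pc."EventDateTime" AS "EventDateTime"
FROM PrimaryCareClinicalCoding pc
JOIN aic ON pc."SNOMEDCode" = aic.SNOMEDCode  -- step 2: join on SNOMEDCode
WHERE pc."PatientPseudonym" IN ({placeholders})
QUALIFY ROW_NUMBER() OVER (  -- step 4: most recent match only
    PARTITION BY pc."PatientPseudonym"
    ORDER BY pc."EventDateTime" DESC
) = 1
"""
```

The patient pseudonyms would then be passed as bind parameters when the query is executed, rather than formatted into the SQL string.
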
### Patient Identifier Mapping
- HCD data has `PseudoNHSNoLinked` column - this matches `PatientPseudonym` in GP records
- DO NOT use `PersonKey` (LocalPatientID) - this is provider-specific and won't match GP records
- UPID = Provider Code (3 chars) + PersonKey

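A small sketch of the identifier rules above, assuming rows are plain dicts and that the provider code lives in a field called `ProviderCode` (an assumed name). `build_upid` and `upids_by_pseudonym` are illustrative helpers, and the values in the usage note are made up.

```python
from collections import defaultdict


def build_upid(provider_code: str, person_key: str) -> str:
    """UPID = first 3 characters of the provider code + PersonKey."""
    return provider_code[:3] + str(person_key)


def upids_by_pseudonym(rows):
    """Group UPIDs under the GP-matchable key, PseudoNHSNoLinked.

    One pseudonym can own several UPIDs because PersonKey is
    provider-specific.
    """
    grouped = defaultdict(set)
    for row in rows:
        upid = build_upid(row["ProviderCode"], row["PersonKey"])
        grouped[row["PseudoNHSNoLinked"]].add(upid)
    return dict(grouped)
```

For example, two rows sharing pseudonym `"P1"` with provider codes `"RM1"` and `"RGP01"` yield two distinct UPIDs for the same person, which is why GP lookups key on `PseudoNHSNoLinked` and only the final chart mapping keys on UPID.
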
### Chart Type Architecture
- `chart_type` column in pathway_nodes: "directory" or "indication"
- 12 total pathway datasets: 6 date filters x 2 chart types
- Indication chart: mixed labels (Search_Term for matched, Directorate for unmatched)

### Date Filter Combinations
| ID | Initiated | Last Seen | Default |
|----|-----------|-----------|---------|
| `all_6mo` | All years | Last 6 months | Yes |
| `all_12mo` | All years | Last 12 months | No |
| `1yr_6mo` | Last 1 year | Last 6 months | No |
| `1yr_12mo` | Last 1 year | Last 12 months | No |
| `2yr_6mo` | Last 2 years | Last 6 months | No |
| `2yr_12mo` | Last 2 years | Last 12 months | No |

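The 6 x 2 combination of date filters and chart types can be enumerated as below. This is a sketch only: the `DATE_FILTERS` structure (initiated window in years, last-seen window in months, `None` meaning all years) and the `pathway_datasets` helper are assumptions about how the table might be represented in code.

```python
from itertools import product

# filter ID -> (initiated window in years or None for all years,
#               last-seen window in months), transcribed from the table
DATE_FILTERS = {
    "all_6mo": (None, 6),
    "all_12mo": (None, 12),
    "1yr_6mo": (1, 6),
    "1yr_12mo": (1, 12),
    "2yr_6mo": (2, 6),
    "2yr_12mo": (2, 12),
}
CHART_TYPES = ("directory", "indication")
DEFAULT_FILTER = "all_6mo"


def pathway_datasets():
    """Return every (date_filter_id, chart_type) pair to generate."""
    return list(product(DATE_FILTERS, CHART_TYPES))
```

Calling `pathway_datasets()` gives 12 pairs, one per pathway dataset stored with its `chart_type`.
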
### Previous Work (Reusable)
These components from the previous approach are still valid:
- `chart_type` column and schema migration (Task 2.1 - complete)
- `generate_icicle_chart_indication()` function (Task 2.2 - complete)
- `process_indication_pathway_for_date_filter()` function (Task 2.2 - complete)
- `extract_indication_fields()` function (Task 2.2 - complete)
- `--chart-type` CLI argument (Task 2.3 - complete)

### What Needs Replacement
The previous `batch_lookup_indication_groups()` function in `diagnosis_lookup.py` used a local SQLite table. This needs to be replaced with a new function that queries Snowflake directly using the cluster query.

---

## Iteration Log

<!-- Each iteration appends a structured entry below -->

## Iteration 1 - 2026-02-05
### Task: 1.1 Create Indication Lookup Query
### Why this task:
- This is the foundation task - other tasks (1.2 CLI integration, 2.3 refresh command) depend on this function
- The progress.txt explicitly noted the old approach needs replacement
- Logical flow: data query function must exist before pipeline integration
### Status: COMPLETE
### What was done:
- Created `get_patient_indication_groups()` function in `data_processing/diagnosis_lookup.py`
- Embedded the full cluster mapping SQL (from snomed_indication_mapping_query.sql) as `CLUSTER_MAPPING_SQL` constant
- Function takes list of PseudoNHSNoLinked values and queries Snowflake directly
- Uses `QUALIFY ROW_NUMBER() OVER (PARTITION BY PatientPseudonym ORDER BY EventDateTime DESC) = 1` to get most recent match
- Returns DataFrame with PatientPseudonym, Search_Term, EventDateTime columns
- Handles edge cases: empty patient list, Snowflake unavailable/unconfigured
- Added batch processing (default 500 patients per batch) for large datasets
- Added logging for match statistics (match rate, unique Search_Terms, top 5 indications)
- Added both function and `CLUSTER_MAPPING_SQL` to `__all__` exports

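The batching and empty-list handling described above might look like the sketch below. `iter_batches` and `lookup_in_batches` are hypothetical names, and `run_query` stands in for whatever executes one Snowflake query per batch; the real `get_patient_indication_groups()` returns a DataFrame rather than a list.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int = 500) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def lookup_in_batches(
    patient_ids: Sequence[str],
    run_query: Callable[[Sequence[str]], List[dict]],
    batch_size: int = 500,
) -> List[dict]:
    """Run run_query once per batch and concatenate the result rows."""
    # dict.fromkeys drops duplicate pseudonyms while keeping input order.
    unique_ids = list(dict.fromkeys(patient_ids))
    if not unique_ids:
        return []  # empty patient list: no query is issued
    results: List[dict] = []
    for batch in iter_batches(unique_ids, batch_size):
        results.extend(run_query(batch))
    return results
```

Keeping each batch at 500 pseudonyms bounds the size of the `IN (...)` list sent to Snowflake per query.
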
### Validation results:
- Tier 1 (Code): ✅ `python -m py_compile` passed, import check passed
- Tier 2 (Data): ✅ Empty list returns correct empty DataFrame with expected columns
- Tier 3 (Functional): N/A (not a UI task)
### Files changed:
- `data_processing/diagnosis_lookup.py` - added CLUSTER_MAPPING_SQL constant and get_patient_indication_groups() function
- `IMPLEMENTATION_PLAN.md` - marked Task 1.1 items complete
### Committed: 052256c "feat: add get_patient_indication_groups() for Snowflake-direct GP lookup (Task 1.1)"
### Patterns discovered:
- Snowflake's QUALIFY clause is cleaner than subquery for row_number filtering
- The cluster CTE has 148 Search_Term mappings plus 13 explicit SNOMED codes
### Next iteration should:
- Test the function with real patient data to verify it returns expected Search_Terms (Task 1.1 verification)
- OR proceed to Task 1.2 (integrate with CLI refresh command) if confident in the function
- The key integration point: extract unique PseudoNHSNoLinked values from HCD data, call this function, map results back to UPID for indication_df
### Blocked items:
- None

## Iteration 2 - 2026-02-05
### Task: 1.2 Update Data Pipeline to Include Indications & 2.3 Update Refresh Command
### Why this task:
- Task 1.1 verification would naturally happen when running the full pipeline with real data
- Task 1.2 is the logical next step - integrates the new function into the CLI
- Task 2.3 overlaps with 1.2 - both involve updating refresh_pathways.py to use the new approach
- Combined these tasks since they're tightly coupled
### Status: COMPLETE
### What was done:
- Modified `cli/refresh_pathways.py` to use `get_patient_indication_groups()` instead of `batch_lookup_indication_groups()`
- Updated import statement to use the new function
- Replaced the indication chart processing section (lines 361-441) with new logic:
  1. Extracts unique PseudoNHSNoLinked values from df
  2. Calls `get_patient_indication_groups()` with patient list
  3. Builds indication_df mapping UPID → Indication_Group:
     - For matched patients: Search_Term (from GP record)
     - For unmatched patients: Directory + " (no GP dx)"
  4. Logs coverage statistics and top indications
  5. Passes indication_df to existing `process_indication_pathway_for_date_filter()`

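The UPID → Indication_Group mapping in step 3 can be sketched in plain Python as below. The actual code works on pandas DataFrames; `build_indication_labels` and the tuple layout are assumptions made for illustration, and the values in the usage note are invented.

```python
def build_indication_labels(upid_rows, match_lookup):
    """Map each UPID to its indication label.

    upid_rows:    iterable of (UPID, PseudoNHSNoLinked, Directory) tuples
    match_lookup: dict of PatientPseudonym -> Search_Term from the GP lookup
    """
    labels = {}
    for upid, pseudonym, directory in upid_rows:
        search_term = match_lookup.get(pseudonym)
        if search_term:
            # Matched patient: use the Search_Term from the GP record.
            labels[upid] = search_term
        else:
            # Unmatched patient: fall back to the directory label.
            labels[upid] = f"{directory} (no GP dx)"
    return labels
```

With `match_lookup = {"P1": "diabetes"}`, a row for pseudonym `"P1"` is labelled `"diabetes"`, while a row for an unmatched pseudonym in a directory called `"CARDIOLOGY"` is labelled `"CARDIOLOGY (no GP dx)"`.
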
### Validation results:
- Tier 1 (Code): ✅ `python -m py_compile cli/refresh_pathways.py` passed
- Tier 1 (Import): ✅ `from cli.refresh_pathways import refresh_pathways` works
- Tier 1 (Import): ✅ `from data_processing.diagnosis_lookup import get_patient_indication_groups` works
- Tier 2 (Data): Pending - needs live Snowflake test with `--chart-type indication`
- Tier 3 (Functional): Pending - needs full pipeline test
### Files changed:
- `cli/refresh_pathways.py` - replaced batch_lookup_indication_groups with get_patient_indication_groups integration
- `IMPLEMENTATION_PLAN.md` - marked Task 1.2 and 2.3 subtasks complete
### Committed: ad10b37 "feat: integrate Snowflake-direct indication lookup into CLI refresh (Task 1.2, 2.3)"
### Patterns discovered:
- The indication processing follows the same flow as before, just with a different data source
- patient_lookup DataFrame helps map PseudoNHSNoLinked → UPID for the final indication_df
- match_lookup dict (PatientPseudonym → Search_Term) makes joining simple
### Next iteration should:
- Run a live test with `python -m cli.refresh_pathways --chart-type indication --dry-run` to verify the full pipeline
- This will test Task 1.1 verification (function returns expected Search_Terms) and Task 3.1 (full pipeline test)
- Alternatively, proceed to Phase 4 (Reflex UI) if confident
- Key verification points: coverage statistics logged, indication_df structure correct
### Blocked items:
- None

## Iteration 3 - 2026-02-05
### Task: 3.1 Test Refresh with Real Data
### Why this task:
- Previous iteration recommended testing the full pipeline with Snowflake
- Task 3.1 validates Tasks 1.1, 1.2, 2.1-2.3 in one comprehensive test
- Must verify data layer works before building UI (Phase 4)
### Status: IN PROGRESS (bugs identified and fixed, need another test run)
### What was done:
1. Ran `python -m cli.refresh_pathways --chart-type indication --dry-run -v`
2. Identified and fixed THREE bugs:

**Bug 1: Snowflake column name casing**
- Issue: `Search_Term` returned as `SEARCH_TERM` (uppercase) from Snowflake
- Symptom: "Unique Search_Terms found: 0" despite 34,006 patient matches
- Root cause: Unquoted column aliases in SQL are uppercased by Snowflake
- Fix: Added quoted aliases: `aic.Search_Term AS "Search_Term"`

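The fix applied for Bug 1 was the quoted alias in SQL. As a complementary guard (not part of the commit), returned column names could also be mapped back to the expected casing before use. `restore_column_case` is a hypothetical helper sketched under that assumption.

```python
EXPECTED_COLUMNS = ("PatientPseudonym", "Search_Term", "EventDateTime")


def restore_column_case(columns, expected=EXPECTED_COLUMNS):
    """Map names such as SEARCH_TERM back to their expected mixed case.

    Columns with no case-insensitive match in `expected` are left unchanged.
    """
    by_upper = {name.upper(): name for name in expected}
    return [by_upper.get(col.upper(), col) for col in columns]
```

A guard like this would turn a silent "0 unique Search_Terms" result into correctly named columns even if a future query omits the quoted alias.
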
**Bug 2: Duplicate UPID index in indication_df**
- Issue: `indication_df_for_chart.set_index('UPID')` failed with non-unique index
- Symptom: `InvalidIndexError: Reindexing only valid with uniquely valued Index objects`
- Root cause: The same UPID could appear more than once in indication_df when the data had edge cases
- Fix: Added `drop_duplicates(subset=['UPID'], keep='first')` before `set_index()`

**Bug 3: Missing UPIDs in indication mapping**
- Issue: Old code built indication_df from unique PseudoNHSNoLinked, not unique UPIDs
- Symptom: `TypeError: can only concatenate str (not "float") to str` in build_hierarchy
- Root cause: Patients with multiple UPIDs (from different providers) had some UPIDs unmapped
- Fix: Changed to build indication_df from ALL unique UPIDs, with NaN handling

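The logic behind the fixes for Bugs 2 and 3 can be illustrated without pandas. This is a sketch only: the real fix uses `drop_duplicates` on a DataFrame, and the `"Unknown"` placeholder for a missing Directory is an assumption, since the notes say only "with NaN handling".

```python
import math


def is_missing(value) -> bool:
    """True for None or a float NaN (how pandas represents missing values)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def label_for(search_term, directory, unknown="Unknown"):
    """Build a label that is always a str, so later concatenation cannot fail."""
    if not is_missing(search_term):
        return str(search_term)
    base = unknown if is_missing(directory) else str(directory)
    return base + " (no GP dx)"


def dedupe_keep_first(rows, key="UPID"):
    """Keep the first row per UPID, mirroring drop_duplicates(keep='first')."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row[key] in seen:
            continue
        seen.add(row[key])
        unique_rows.append(row)
    return unique_rows
```

Building labels per UPID rather than per pseudonym means every UPID gets a string label, and deduplicating first guarantees the UPID index is unique.
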
### Validation results:
- Tier 1 (Code): ✅ Both files compile, imports work
- Tier 2 (Data):
  - ✅ 36,628 patients queried
  - ✅ 34,006 (92.8%) matched GP diagnoses
  - ✅ 139 unique Search_Terms found (was 0 before fix)
  - ✅ Top 5 indications: drug misuse (8602), influenza (6239), diabetes (2476), sepsis (1980), cardiovascular disease (940)
- Tier 3 (Functional): ❌ Pipeline still fails after indication lookup - need another test run
### Files changed:
- `data_processing/diagnosis_lookup.py` - fixed column aliasing in SQL query
- `cli/refresh_pathways.py` - fixed UPID mapping logic, added deduplication, NaN handling
- `IMPLEMENTATION_PLAN.md` - marked Task 3.1 as in progress
### Committed: 22222fe "fix: resolve Snowflake column casing and UPID mapping issues (Task 3.1)"
### Patterns discovered:
- Snowflake ALWAYS uppercases unquoted identifiers - must use `AS "column"` for mixed case
- Patients can have multiple UPIDs if they visited different providers (UPID = ProviderCode[:3] + PersonKey)
- Must handle NaN values in Directory column or get TypeError in string concatenation
- ~92.8% of patients have matching GP diagnoses - this is excellent coverage!
### Next iteration should:
- Run another `python -m cli.refresh_pathways --chart-type indication --dry-run -v` to verify fixes work end-to-end
- The indication lookup now works (139 Search_Terms found) - need to confirm pathway processing also works
- If successful, mark Task 3.1 complete and proceed to Phase 4 (Reflex UI)
- Test run takes ~35 minutes total (7 min data fetch/transform, 25 min indication lookup, 3 min pathway processing)
### Blocked items:
- None