HighCostDrugsDemo/progress.txt

# Progress Log - Indication-Based Pathway Charts

## Project Context

This project adds indication-based icicle charts alongside the existing directory-based charts. Patient diagnoses are matched from GP records using SNOMED cluster codes queried directly from Snowflake.

**Key Change from Previous Approach**: Instead of maintaining a local CSV/SQLite mapping of SNOMED codes, we now query the `ClinicalCodingClusterSnomedCodes` clusters directly in Snowflake during the data refresh. This simplifies the architecture and ensures we always use the latest cluster definitions.

## Key Files Reference

**Existing (reuse these):**
- `data_processing/schema.py` - SQLite schema (chart_type column already added)
- `data_processing/diagnosis_lookup.py` - Extend with new Snowflake query
- `data_processing/pathway_pipeline.py` - Pathway processing (indication functions exist)
- `cli/refresh_pathways.py` - CLI refresh command (chart_type arg exists)
- `pathways_app/pathways_app.py` - Reflex app (add chart type toggle)
- `tools/data.py` - Data transformations including department_identification()

**New/Key:**
- `snomed_indication_mapping_query.sql` - Master SNOMED cluster query to embed in Snowflake calls

## Known Patterns

### SNOMED Cluster Query Approach
The `snomed_indication_mapping_query.sql` contains the Search_Term → Cluster_ID mappings:
- 148 conditions mapped to clinical coding clusters
- Joins with `DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"` to get SNOMED codes
- Includes explicit manual mappings for conditions not in clusters
- Returns: Search_Term, SNOMEDCode, SNOMEDDescription

### GP Record Matching
To find a patient's indication:
1. Use the cluster query as a CTE
2. Join with `PrimaryCareClinicalCoding` on SNOMEDCode
3. Filter by PatientPseudonym (use PseudoNHSNoLinked from HCD data)
4. Keep only the most recent match per patient, ordered by EventDateTime
5. Return Search_Term for matched patients
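
The steps above run as a single Snowflake query in practice. As a rough illustration of the same logic in plain Python, the sketch below filters GP events to the requested patients, keeps only codes that appear in the cluster mapping, and retains the latest event per patient. The function name `most_recent_indication`, the in-memory data shapes, and the assumption that each SNOMED code maps to exactly one Search_Term are all hypothetical simplifications, not the project's actual implementation.

```python
from datetime import datetime


def most_recent_indication(cluster_codes, gp_events, patient_ids):
    """Return {PatientPseudonym: Search_Term} using each patient's latest match.

    cluster_codes: dict of SNOMEDCode -> Search_Term (assumed one term per code).
    gp_events: iterable of (PatientPseudonym, SNOMEDCode, EventDateTime) tuples.
    patient_ids: PseudoNHSNoLinked values taken from the HCD data.
    """
    wanted = set(patient_ids)
    latest = {}  # PatientPseudonym -> (EventDateTime, Search_Term)
    for patient, code, event_dt in gp_events:
        # Steps 2 and 3: join on SNOMEDCode and filter to requested patients.
        if patient not in wanted or code not in cluster_codes:
            continue
        # Step 4: keep the most recent event per patient.
        if patient not in latest or event_dt > latest[patient][0]:
            latest[patient] = (event_dt, cluster_codes[code])
    # Step 5: return only the Search_Term for matched patients.
    return {patient: term for patient, (_, term) in latest.items()}
```

Patients with no matching code are simply absent from the result, which mirrors how unmatched patients fall back to a Directorate label later in the pipeline.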

### Patient Identifier Mapping
- HCD data has `PseudoNHSNoLinked` column - this matches `PatientPseudonym` in GP records
- DO NOT use `PersonKey` (LocalPatientID) - this is provider-specific and won't match GP records
- UPID = Provider Code (3 chars) + PersonKey
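
A minimal sketch of the UPID rule, assuming the provider code may arrive longer than three characters and should be truncated to the first three. The helper name `build_upid` and the truncation behaviour are assumptions for illustration; the real logic lives in the project's data transformation code.

```python
def build_upid(provider_code: str, person_key: str) -> str:
    """Build a UPID as the 3-character provider code followed by PersonKey.

    Truncating to the first three characters is an assumption; the note
    above only states that the provider code part is 3 chars long.
    """
    if len(provider_code) < 3:
        raise ValueError("provider_code must have at least 3 characters")
    return f"{provider_code[:3]}{person_key}"
```

The UPID identifies a patient within the HCD data only. GP lookups should still use `PseudoNHSNoLinked`, because `PersonKey` differs between providers.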

### Chart Type Architecture
- `chart_type` column in pathway_nodes: "directory" or "indication"
- 12 total pathway datasets: 6 date filters × 2 chart types
- Indication chart: mixed labels (Search_Term for matched, Directorate for unmatched)
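
The mixed-label rule for the indication chart can be sketched as a simple fallback. The function name `pathway_label` is hypothetical, and treating an empty string the same as a missing match is an assumption.

```python
def pathway_label(search_term, directorate):
    """Pick the indication chart label for one patient.

    Matched patients carry a Search_Term; unmatched patients
    (None or empty string) fall back to their Directorate.
    """
    return search_term if search_term else directorate
```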

### Date Filter Combinations
| ID | Initiated | Last Seen | Default |
|----|-----------|-----------|---------|
| `all_6mo` | All years | Last 6 months | Yes |
| `all_12mo` | All years | Last 12 months | No |
| `1yr_6mo` | Last 1 year | Last 6 months | No |
| `1yr_12mo` | Last 1 year | Last 12 months | No |
| `2yr_6mo` | Last 2 years | Last 6 months | No |
| `2yr_12mo` | Last 2 years | Last 12 months | No |
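
To make the 6 × 2 structure concrete, the sketch below enumerates the dataset combinations and decodes a filter ID into its two windows. The constant names and the `parse_filter_id` helper are illustrative assumptions; only the IDs and chart type values come from the notes above.

```python
from itertools import product

DATE_FILTER_IDS = ["all_6mo", "all_12mo", "1yr_6mo", "1yr_12mo", "2yr_6mo", "2yr_12mo"]
CHART_TYPES = ["directory", "indication"]
DEFAULT_FILTER_ID = "all_6mo"

# Every (date filter, chart type) pair gets its own pathway dataset.
DATASETS = list(product(DATE_FILTER_IDS, CHART_TYPES))


def parse_filter_id(filter_id):
    """Split an ID such as '2yr_12mo' into (initiated_years, last_seen_months).

    initiated_years is None for 'all', meaning no limit on initiation date.
    """
    initiated, last_seen = filter_id.split("_")
    years = None if initiated == "all" else int(initiated[:-2])  # strip 'yr'
    months = int(last_seen[:-2])  # strip 'mo'
    return years, months
```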

### Previous Work (Reusable)
These components from the previous approach are still valid:
- `chart_type` column and schema migration (Task 2.1 - complete)
- `generate_icicle_chart_indication()` function (Task 2.2 - complete)
- `process_indication_pathway_for_date_filter()` function (Task 2.2 - complete)
- `extract_indication_fields()` function (Task 2.2 - complete)
- `--chart-type` CLI argument (Task 2.3 - complete)

### What Needs Replacement
The previous `batch_lookup_indication_groups()` function in `diagnosis_lookup.py` used a local SQLite table. This needs to be replaced with a new function that queries Snowflake directly using the cluster query.

---

## Iteration Log

<!-- Each iteration appends a structured entry below -->

## Iteration 1 — 2026-02-05
### Task: 1.1 Create Indication Lookup Query
### Why this task:
- This is the foundation task — other tasks (1.2 CLI integration, 2.3 refresh command) depend on this function
- progress.txt explicitly notes that the old approach needs replacement
- Logical flow: data query function must exist before pipeline integration
### Status: COMPLETE
### What was done:
- Created `get_patient_indication_groups()` function in `data_processing/diagnosis_lookup.py`
- Embedded the full cluster mapping SQL (from `snomed_indication_mapping_query.sql`) as the `CLUSTER_MAPPING_SQL` constant
- The function takes a list of `PseudoNHSNoLinked` values and queries Snowflake directly
- Uses `QUALIFY ROW_NUMBER() OVER (PARTITION BY PatientPseudonym ORDER BY EventDateTime DESC) = 1` to keep only the most recent match per patient
- Returns DataFrame with PatientPseudonym, Search_Term, EventDateTime columns
- Handles edge cases: empty patient list, Snowflake unavailable/unconfigured
- Added batch processing (default 500 patients per batch) for large datasets
- Added logging for match statistics (match rate, unique Search_Terms, top 5 indications)
- Added both the function and `CLUSTER_MAPPING_SQL` to the `__all__` exports
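
For reference, the batching behaviour can be sketched as below. This is not the code in `diagnosis_lookup.py`; `iter_batches` is a hypothetical helper that only shows how a patient list might be split into chunks of 500 before each Snowflake call.

```python
def iter_batches(values, batch_size=500):
    """Yield consecutive slices of `values`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]
```

An empty patient list yields no batches, so no query is issued in that case.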
### Validation results:
- Tier 1 (Code): ✅ `python -m py_compile` passed, import check passed
- Tier 2 (Data): ✅ Empty list returns correct empty DataFrame with expected columns
- Tier 3 (Functional): N/A (not a UI task)
### Files changed:
- `data_processing/diagnosis_lookup.py` — added CLUSTER_MAPPING_SQL constant and get_patient_indication_groups() function
- `IMPLEMENTATION_PLAN.md` — marked Task 1.1 items complete
### Committed: 052256c "feat: add get_patient_indication_groups() for Snowflake-direct GP lookup (Task 1.1)"
### Patterns discovered:
- Snowflake's `QUALIFY` clause is cleaner than a subquery for `ROW_NUMBER()` filtering
- The cluster CTE has 148 Search_Term mappings plus 13 explicit SNOMED codes
### Next iteration should:
- Test the function with real patient data to verify it returns expected Search_Terms (Task 1.1 verification)
- OR proceed to Task 1.2 (integrate with CLI refresh command) if confident in the function
- The key integration point: extract unique PseudoNHSNoLinked values from HCD data, call this function, map results back to UPID for indication_df
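
A possible shape for that integration step, sketched with plain dicts rather than DataFrames. The name `map_indications_to_upid`, the row layout, and the `lookup` callable (a stand-in that returns `{PatientPseudonym: Search_Term}`) are assumptions; the real `get_patient_indication_groups()` returns a DataFrame and would need adapting.

```python
def map_indications_to_upid(hcd_rows, lookup):
    """Return {UPID: Search_Term} for HCD rows whose patient has a GP match.

    hcd_rows: list of dicts with 'UPID' and 'PseudoNHSNoLinked' keys.
    lookup: callable taking a list of pseudonyms and returning
            a dict of PatientPseudonym -> Search_Term (hypothetical stand-in).
    """
    # De-duplicate pseudonyms so each patient is looked up once.
    pseudonyms = sorted(
        {row["PseudoNHSNoLinked"] for row in hcd_rows if row.get("PseudoNHSNoLinked")}
    )
    matches = lookup(pseudonyms) if pseudonyms else {}
    # One patient can appear under several UPIDs (one per provider).
    return {
        row["UPID"]: matches[row["PseudoNHSNoLinked"]]
        for row in hcd_rows
        if row.get("PseudoNHSNoLinked") in matches
    }
```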
### Blocked items:
- None