docs: update progress.txt with iteration 3, add new guardrails (Task 3.1)
This commit is contained in:
+29
-26
@@ -135,13 +135,11 @@ def filtered_count(self) -> int:
|
||||
|
||||
### Check existing code for patterns
|
||||
- **When**: Unsure how to implement something in Reflex or pathway processing
|
||||
- **Rule**: Look at `pathways_app/app_v2.py`, `analysis/pathway_analyzer.py`, `visualization/plotly_generator.py`
|
||||
- **Rule**: Look at `pathways_app/pathways_app.py`, `analysis/pathway_analyzer.py`, `visualization/plotly_generator.py`
|
||||
- **Why**: The existing codebase has solved many quirks already
|
||||
|
||||
---
|
||||
|
||||
---
|
||||
|
||||
## UI Redesign Guardrails
|
||||
|
||||
### Clear Reflex cache before running
|
||||
@@ -176,47 +174,52 @@ def filtered_count(self) -> int:
|
||||
|
||||
---
|
||||
|
||||
## SNOMED Mapping Guardrails
|
||||
## Snowflake Query Guardrails
|
||||
|
||||
### Use PseudoNHSNoLinked for GP record matching
|
||||
- **When**: Querying GP records (PrimaryCareClinicalCoding) for patient diagnoses
|
||||
- **Rule**: Use `PseudoNHSNoLinked` column from HCD data, NOT `PersonKey` (LocalPatientID)
|
||||
- **Why**: PersonKey is provider-specific local ID. Only PseudoNHSNoLinked matches PatientPseudonym in GP records.
|
||||
|
||||
### Use Search_Term for grouping, not Indication
|
||||
- **When**: Creating indication-based pathway hierarchy
|
||||
- **Rule**: Group patients by `Search_Term` column, NOT `Indication` column
|
||||
- **Why**: Indication has 603 granular values; Search_Term has 187 broader categories suitable for chart grouping
|
||||
- **Rule**: Group patients by `Search_Term` from the cluster query
|
||||
- **Why**: Search_Term provides meaningful clinical groupings (~148 values)
|
||||
|
||||
### Handle unmatched patients in indication chart
|
||||
- **When**: Patient has no GP diagnosis matching their drug's SNOMED codes
|
||||
- **When**: Patient has no GP diagnosis matching cluster SNOMED codes
|
||||
- **Rule**: Use their assigned directorate (from fallback logic) as the grouping label, not "Unknown"
|
||||
- **Why**: User wants mixed labels - indication Search_Terms for matched patients, directorate names for unmatched
|
||||
- **Why**: User wants mixed labels - Search_Terms for matched patients, directorate names for unmatched
|
||||
|
||||
### Use most recent SNOMED code for multiple matches
|
||||
- **When**: Patient has GP records matching multiple SNOMED codes for their drug
|
||||
- **When**: Patient has GP records matching multiple SNOMED codes
|
||||
- **Rule**: Use the match with the most recent `EventDateTime` from PrimaryCareClinicalCoding
|
||||
- **Why**: Most recent diagnosis reflects current clinical state
|
||||
|
||||
### Batch Snowflake queries for performance
|
||||
- **When**: Looking up GP records for many patients
|
||||
- **Rule**: Batch SNOMED lookups (e.g., 1000 patients at a time) rather than one query per patient
|
||||
- **Why**: Individual queries for 35K+ patients would be extremely slow
|
||||
|
||||
### Track diagnosis match source
|
||||
- **When**: Assigning directorate to a patient
|
||||
- **Rule**: Track whether assignment came from "DIAGNOSIS" (SNOMED match) or "FALLBACK" (department_identification)
|
||||
- **Why**: Needed for coverage metrics and debugging
|
||||
### Embed cluster query as CTE in Snowflake
|
||||
- **When**: Looking up patient indications during data refresh
|
||||
- **Rule**: Use the `snomed_indication_mapping_query.sql` content as a WITH clause in the patient lookup query
|
||||
- **Why**: This ensures we always use the complete cluster mapping and don't need local storage
|
||||
|
||||
### Chart type column in pathway_nodes
|
||||
- **When**: Inserting pathway records to SQLite
|
||||
- **Rule**: Include `chart_type` column with value "directory" or "indication"
|
||||
- **Why**: Needed to filter pathways when user toggles chart type in UI
|
||||
|
||||
### Use PseudoNHSNoLinked for GP record matching
|
||||
- **When**: Querying GP records (PrimaryCareClinicalCoding) for patient diagnoses
|
||||
- **Rule**: Use `PseudoNHSNoLinked` column, NOT `PersonKey` (LocalPatientID)
|
||||
- **Why**: PersonKey is provider-specific local ID. Only PseudoNHSNoLinked matches PatientPseudonym in GP records. Using PersonKey caused 0% GP match rate.
|
||||
### Quote mixed-case column aliases in Snowflake SQL
|
||||
- **When**: Writing SELECT queries that return results to Python code
|
||||
- **Rule**: Use `AS "ColumnName"` (quoted) for any column alias you'll access by name in Python
|
||||
- **Why**: Snowflake uppercases unquoted identifiers. `SELECT foo AS Search_Term` returns `SEARCH_TERM`, so `row.get('Search_Term')` returns None. Fix: `SELECT foo AS "Search_Term"`
|
||||
|
||||
### Handle scientific notation in SNOMED codes
|
||||
- **When**: Loading SNOMED codes from CSV files
|
||||
- **Rule**: Convert scientific notation (e.g., "1.06e+16") back to full integers before storing
|
||||
- **Why**: Large SNOMED codes (15-16 digits) exceed float precision. Pandas/Excel exports them as scientific notation. String matching will fail unless converted.
|
||||
### Build indication_df from all unique UPIDs, not PseudoNHSNoLinked
|
||||
- **When**: Creating the indication mapping DataFrame for pathway processing
|
||||
- **Rule**: Use `df.drop_duplicates(subset=['UPID'])` not `drop_duplicates(subset=['PseudoNHSNoLinked'])`
|
||||
- **Why**: A patient visiting multiple providers has multiple UPIDs (UPID = ProviderCode[:3] + PersonKey). Using unique PseudoNHSNoLinked only maps one UPID per patient, leaving others as NaN and causing TypeError in build_hierarchy.
|
||||
|
||||
### Handle NaN in Directory when building fallback labels
|
||||
- **When**: Creating fallback indication labels for patients without GP diagnosis match
|
||||
- **Rule**: Check `pd.notna(directory)` before concatenating to string. Use `"UNKNOWN (no GP dx)"` for NaN cases.
|
||||
- **Why**: `str(nan) + " (no GP dx)"` doesn't cause error, but `nan + " (no GP dx)"` causes TypeError. Always be explicit about NaN handling.
|
||||
|
||||
<!--
|
||||
ADD NEW GUARDRAILS BELOW as failures are observed during the loop.
|
||||
|
||||
@@ -148,3 +148,58 @@ The previous `batch_lookup_indication_groups()` function in `diagnosis_lookup.py
|
||||
- Key verification points: coverage statistics logged, indication_df structure correct
|
||||
### Blocked items:
|
||||
- None
|
||||
|
||||
## Iteration 3 — 2026-02-05
|
||||
### Task: 3.1 Test Refresh with Real Data
|
||||
### Why this task:
|
||||
- Previous iteration recommended testing the full pipeline with Snowflake
|
||||
- Task 3.1 validates Tasks 1.1, 1.2, 2.1-2.3 in one comprehensive test
|
||||
- Must verify data layer works before building UI (Phase 4)
|
||||
### Status: IN PROGRESS (bugs identified and fixed, need another test run)
|
||||
### What was done:
|
||||
1. Ran `python -m cli.refresh_pathways --chart-type indication --dry-run -v`
|
||||
2. Identified and fixed THREE bugs:
|
||||
|
||||
**Bug 1: Snowflake column name casing**
|
||||
- Issue: `Search_Term` returned as `SEARCH_TERM` (uppercase) from Snowflake
|
||||
- Symptom: "Unique Search_Terms found: 0" despite 34,006 patient matches
|
||||
- Root cause: Unquoted column aliases in SQL are uppercased by Snowflake
|
||||
- Fix: Added quoted aliases: `aic.Search_Term AS "Search_Term"`
|
||||
|
||||
**Bug 2: Duplicate UPID index in indication_df**
|
||||
- Issue: `indication_df_for_chart.set_index('UPID')` failed with non-unique index
|
||||
- Symptom: `InvalidIndexError: Reindexing only valid with uniquely valued Index objects`
|
||||
- Root cause: Same patient could appear multiple times if data had edge cases
|
||||
- Fix: Added `drop_duplicates(subset=['UPID'], keep='first')` before set_index()
|
||||
|
||||
**Bug 3: Missing UPIDs in indication mapping**
|
||||
- Issue: Old code built indication_df from unique PseudoNHSNoLinked, not unique UPIDs
|
||||
- Symptom: `TypeError: can only concatenate str (not "float") to str` in build_hierarchy
|
||||
- Root cause: Patients with multiple UPIDs (from different providers) had some UPIDs unmapped
|
||||
- Fix: Changed to build indication_df from ALL unique UPIDs, with NaN handling
|
||||
|
||||
### Validation results:
|
||||
- Tier 1 (Code): ✅ Both files compile, imports work
|
||||
- Tier 2 (Data):
|
||||
- ✅ 36,628 patients queried
|
||||
- ✅ 34,006 (92.8%) matched GP diagnoses
|
||||
- ✅ 139 unique Search_Terms found (was 0 before fix)
|
||||
- ✅ Top 5 indications: drug misuse (8602), influenza (6239), diabetes (2476), sepsis (1980), cardiovascular disease (940)
|
||||
- Tier 3 (Functional): ❌ Pipeline still fails after indication lookup — need another test run
|
||||
### Files changed:
|
||||
- `data_processing/diagnosis_lookup.py` — fixed column aliasing in SQL query
|
||||
- `cli/refresh_pathways.py` — fixed UPID mapping logic, added deduplication, NaN handling
|
||||
- `IMPLEMENTATION_PLAN.md` — marked Task 3.1 as in progress
|
||||
### Committed: 22222fe "fix: resolve Snowflake column casing and UPID mapping issues (Task 3.1)"
|
||||
### Patterns discovered:
|
||||
- Snowflake ALWAYS uppercases unquoted identifiers — must use AS "column" for mixed case
|
||||
- Patients can have multiple UPIDs if they visited different providers (UPID = ProviderCode[:3] + PersonKey)
|
||||
- Must handle NaN values in Directory column or get TypeError in string concatenation
|
||||
- ~92.8% of patients have matching GP diagnoses — this is excellent coverage!
|
||||
### Next iteration should:
|
||||
- Run another `python -m cli.refresh_pathways --chart-type indication --dry-run -v` to verify fixes work end-to-end
|
||||
- The indication lookup now works (139 Search_Terms found) — need to confirm pathway processing also works
|
||||
- If successful, mark Task 3.1 complete and proceed to Phase 4 (Reflex UI)
|
||||
- Test run takes ~35 minutes total (7 min data fetch/transform, 25 min indication lookup, 3 min pathway processing)
|
||||
### Blocked items:
|
||||
- None
|
||||
|
||||
Reference in New Issue
Block a user