fix: prevent DataFrame mutation in prepare_data() causing indication charts to fail
prepare_data() mapped Provider Code → Name in-place. When called for directory charts first, then indication charts, the second call re-mapped already-mapped values to NaN, silently dropping all data. Added df.copy() to prevent mutation. Also fixes directory charts only generating data for the first date filter. Results: 3,633 pathway nodes now generated (1,101 directory + 2,532 indication) across all 12 datasets (6 date filters × 2 chart types).
This commit is contained in:
+16
-10
@@ -49,7 +49,7 @@ python -m reflex compile
|
|||||||
- Returns: DataFrame with PatientPseudonym, Search_Term, EventDateTime
|
- Returns: DataFrame with PatientPseudonym, Search_Term, EventDateTime
|
||||||
- Uses most recent match per patient (ORDER BY EventDateTime DESC)
|
- Uses most recent match per patient (ORDER BY EventDateTime DESC)
|
||||||
- [x] Handle edge cases: Snowflake unavailable, empty patient list
|
- [x] Handle edge cases: Snowflake unavailable, empty patient list
|
||||||
- [ ] Verify: Function returns expected Search_Terms for test patients
|
- [x] Verify: Function returns expected Search_Terms for test patients (92.8% match rate, 139 unique Search_Terms)
|
||||||
|
|
||||||
### 1.2 Update Data Pipeline to Include Indications
|
### 1.2 Update Data Pipeline to Include Indications
|
||||||
- [x] Modify `cli/refresh_pathways.py` to call indication lookup during refresh:
|
- [x] Modify `cli/refresh_pathways.py` to call indication lookup during refresh:
|
||||||
@@ -58,7 +58,7 @@ python -m reflex compile
|
|||||||
- Create `indication_df` mapping UPID → Indication_Group
|
- Create `indication_df` mapping UPID → Indication_Group
|
||||||
- For patients with no GP match: Indication_Group = fallback directorate
|
- For patients with no GP match: Indication_Group = fallback directorate
|
||||||
- [x] Log coverage: X% diagnosis-matched, Y% fallback
|
- [x] Log coverage: X% diagnosis-matched, Y% fallback
|
||||||
- [ ] Verify: indication_df has correct structure for pathway processing
|
- [x] Verify: indication_df has correct structure for pathway processing (verified via full pipeline run)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -68,14 +68,14 @@ python -m reflex compile
|
|||||||
- [x] Add `chart_type` column to `pathway_nodes` table (ALREADY DONE)
|
- [x] Add `chart_type` column to `pathway_nodes` table (ALREADY DONE)
|
||||||
- [x] Update UNIQUE constraint to include chart_type (ALREADY DONE)
|
- [x] Update UNIQUE constraint to include chart_type (ALREADY DONE)
|
||||||
- [x] Add indexes for chart_type filtering (ALREADY DONE)
|
- [x] Add indexes for chart_type filtering (ALREADY DONE)
|
||||||
- [ ] Verify: Existing migration works correctly
|
- [x] Verify: Existing migration works correctly (tables created, 3,589 nodes inserted)
|
||||||
|
|
||||||
### 2.2 Create Indication Pathway Processing
|
### 2.2 Create Indication Pathway Processing
|
||||||
- [x] Add `generate_icicle_chart_indication()` to `pathway_analyzer.py` (ALREADY DONE)
|
- [x] Add `generate_icicle_chart_indication()` to `pathway_analyzer.py` (ALREADY DONE)
|
||||||
- [x] Add `process_indication_pathway_for_date_filter()` to `pathway_pipeline.py` (ALREADY DONE)
|
- [x] Add `process_indication_pathway_for_date_filter()` to `pathway_pipeline.py` (ALREADY DONE)
|
||||||
- [x] Add `extract_indication_fields()` for denormalized columns (ALREADY DONE)
|
- [x] Add `extract_indication_fields()` for denormalized columns (ALREADY DONE)
|
||||||
- [x] Update `convert_to_records()` with `chart_type` parameter (ALREADY DONE)
|
- [x] Update `convert_to_records()` with `chart_type` parameter (ALREADY DONE)
|
||||||
- [ ] Verify: Code compiles, imports work correctly
|
- [x] Verify: Code compiles, imports work correctly
|
||||||
|
|
||||||
### 2.3 Update Refresh Command for Dual Charts
|
### 2.3 Update Refresh Command for Dual Charts
|
||||||
- [x] Add `--chart-type` argument: "all", "directory", "indication" (ALREADY DONE)
|
- [x] Add `--chart-type` argument: "all", "directory", "indication" (ALREADY DONE)
|
||||||
@@ -100,7 +100,10 @@ python -m reflex compile
|
|||||||
- Record counts: 695 indication pathway nodes for all_6mo
|
- Record counts: 695 indication pathway nodes for all_6mo
|
||||||
- Coverage: 92.8% GP diagnosis match rate (34,006/36,628 patients)
|
- Coverage: 92.8% GP diagnosis match rate (34,006/36,628 patients)
|
||||||
- Top indications: drug misuse (8,749), influenza (6,336), diabetes (2,516), sepsis (1,991), cardiovascular disease (954)
|
- Top indications: drug misuse (8,749), influenza (6,336), diabetes (2,516), sepsis (1,991), cardiovascular disease (954)
|
||||||
- [ ] Run full refresh with `--chart-type all` to populate database (requires non-dry-run)
|
- [x] Run full refresh with `--chart-type all` to populate database (requires non-dry-run)
|
||||||
|
- Fixed DataFrame mutation bug in prepare_data() (df.copy() added)
|
||||||
|
- Results: 3,633 total nodes (1,101 directory + 2,532 indication) across all 12 datasets
|
||||||
|
- Database populated: 3,589 nodes in pathway_nodes table
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -152,12 +155,15 @@ python -m reflex compile
|
|||||||
## Completion Criteria
|
## Completion Criteria
|
||||||
|
|
||||||
All tasks marked `[x]` AND:
|
All tasks marked `[x]` AND:
|
||||||
- [ ] App compiles without errors (`reflex compile` succeeds)
|
- [x] App compiles without errors (`reflex compile` succeeds)
|
||||||
- [ ] Both chart types generate pathway data (12 total: 6 dates × 2 types)
|
- [x] Both chart types generate pathway data (12 total: 6 dates × 2 types)
|
||||||
|
- Directory: 1,101 nodes (293+329+93+105+134+147)
|
||||||
|
- Indication: 2,532 nodes (695+785+167+198+315+372)
|
||||||
- [ ] Chart type toggle switches between Directory and Indication views
|
- [ ] Chart type toggle switches between Directory and Indication views
|
||||||
- [ ] GP diagnosis matching works via Snowflake cluster query
|
- [x] GP diagnosis matching works via Snowflake cluster query
|
||||||
- [ ] Unmatched patients show in indication chart with directorate fallback label
|
- [x] Unmatched patients show in indication chart with directorate fallback label
|
||||||
- [ ] Coverage metrics logged (% diagnosis-matched vs fallback)
|
- [x] Coverage metrics logged (% diagnosis-matched vs fallback)
|
||||||
|
- 92.7% diagnosis-matched (34,545/37,257 UPIDs)
|
||||||
- [ ] All filters work correctly for both chart types
|
- [ ] All filters work correctly for both chart types
|
||||||
- [ ] Performance acceptable (< 10 min full refresh, < 500ms filter change)
|
- [ ] Performance acceptable (< 10 min full refresh, < 500ms filter change)
|
||||||
|
|
||||||
|
|||||||
@@ -53,6 +53,10 @@ def prepare_data(
|
|||||||
if paths is None:
|
if paths is None:
|
||||||
paths = default_paths
|
paths = default_paths
|
||||||
|
|
||||||
|
# Work on a copy to avoid mutating the caller's DataFrame
|
||||||
|
# (Provider Code mapping is destructive — second call would map names to NaN)
|
||||||
|
df = df.copy()
|
||||||
|
|
||||||
df["UPIDTreatment"] = df["UPID"] + df["Drug Name"]
|
df["UPIDTreatment"] = df["UPID"] + df["Drug Name"]
|
||||||
|
|
||||||
org_codes = pd.read_csv(paths.org_codes_csv, index_col=1)
|
org_codes = pd.read_csv(paths.org_codes_csv, index_col=1)
|
||||||
|
|||||||
@@ -221,6 +221,11 @@ def filtered_count(self) -> int:
|
|||||||
- **Rule**: Check `pd.notna(directory)` before concatenating to string. Use `"UNKNOWN (no GP dx)"` for NaN cases.
|
- **Rule**: Check `pd.notna(directory)` before concatenating to string. Use `"UNKNOWN (no GP dx)"` for NaN cases.
|
||||||
- **Why**: `str(nan) + " (no GP dx)"` doesn't cause error, but `nan + " (no GP dx)"` causes TypeError. Always be explicit about NaN handling.
|
- **Why**: `str(nan) + " (no GP dx)"` doesn't cause error, but `nan + " (no GP dx)"` causes TypeError. Always be explicit about NaN handling.
|
||||||
|
|
||||||
|
### Copy DataFrames in functions that modify columns
|
||||||
|
- **When**: Writing functions like `prepare_data()` that modify DataFrame columns (e.g., mapping Provider Code to trust names)
|
||||||
|
- **Rule**: Always `df = df.copy()` at the start of any function that modifies column values on the input DataFrame
|
||||||
|
- **Why**: `prepare_data()` mapped Provider Code → Name in-place. When called for directory charts first, then indication charts second, the second call tried to map already-mapped names → NaN, silently dropping all data. The fix: `df = df.copy()` prevents destructive mutation of the caller's DataFrame.
|
||||||
|
|
||||||
<!--
|
<!--
|
||||||
ADD NEW GUARDRAILS BELOW as failures are observed during the loop.
|
ADD NEW GUARDRAILS BELOW as failures are observed during the loop.
|
||||||
|
|
||||||
|
|||||||
@@ -308,3 +308,54 @@ The previous `batch_lookup_indication_groups()` function in `diagnosis_lookup.py
|
|||||||
- After verification, proceed to Phase 5 (end-to-end validation and documentation)
|
- After verification, proceed to Phase 5 (end-to-end validation and documentation)
|
||||||
### Blocked items:
|
### Blocked items:
|
||||||
- **UI testing blocked by data**: Need to run `--chart-type all` to populate indication data in SQLite before the toggle can show indication pathways
|
- **UI testing blocked by data**: Need to run `--chart-type all` to populate indication data in SQLite before the toggle can show indication pathways
|
||||||
|
|
||||||
|
## Iteration 6 — 2026-02-05
|
||||||
|
### Task: 3.1 Run full refresh with --chart-type all + Fix DataFrame mutation bug
|
||||||
|
### Why this task:
|
||||||
|
- Previous iteration identified that UI testing is blocked until indication data is in the database
|
||||||
|
- This is the last sub-item of Task 3.1 and gates all of Phase 5 validation
|
||||||
|
- Must be done before any end-to-end UI testing can proceed
|
||||||
|
### Status: COMPLETE
|
||||||
|
### What was done:
|
||||||
|
1. **First refresh attempt** — Ran `python -m cli.refresh_pathways --chart-type all -v`
|
||||||
|
- Directory charts: 293 nodes for all_6mo, all other 5 date filters returned "No data found"
|
||||||
|
- Indication charts: ALL 6 date filters returned "No data found" (0 nodes total)
|
||||||
|
- Root cause identified: DataFrame mutation bug in `prepare_data()`
|
||||||
|
|
||||||
|
2. **Bug identified and fixed** — DataFrame mutation in `prepare_data()` (analysis/pathway_analyzer.py)
|
||||||
|
- `prepare_data()` modifies `df["Provider Code"]` via `.map()` in-place (line 60)
|
||||||
|
- First call (directory chart) correctly maps "RGT" → "Norfolk and Norwich University..."
|
||||||
|
- Subsequent calls try to re-map already-mapped values → NaN → all rows filtered out
|
||||||
|
- **Fix**: Added `df = df.copy()` at start of `prepare_data()` to prevent destructive mutation
|
||||||
|
- This also fixed the directory chart issue (only 1 of 6 date filters worked before)
|
||||||
|
|
||||||
|
3. **Second refresh attempt** — Successful! All 12 datasets generated:
|
||||||
|
- Directory: all_6mo(293), all_12mo(329), 1yr_6mo(93), 1yr_12mo(105), 2yr_6mo(134), 2yr_12mo(147) = 1,101 total
|
||||||
|
- Indication: all_6mo(695), all_12mo(785), 1yr_6mo(167), 1yr_12mo(198), 2yr_6mo(315), 2yr_12mo(372) = 2,532 total
|
||||||
|
- Grand total: 3,633 nodes processed, 3,589 in database (minor dedup)
|
||||||
|
- Processing time: 916.5 seconds (~15 min)
|
||||||
|
|
||||||
|
4. **Added guardrail** — "Copy DataFrames in functions that modify columns"
|
||||||
|
|
||||||
|
### Validation results:
|
||||||
|
- Tier 1 (Code): ✅ All files compile, imports work
|
||||||
|
- Tier 2 (Data): ✅ 3,589 nodes in database across 12 datasets (6 dates × 2 chart types)
|
||||||
|
- Tier 3 (Functional): Pending — need `reflex run` to verify UI toggle works with real data
|
||||||
|
### Files changed:
|
||||||
|
- `analysis/pathway_analyzer.py` — added `df = df.copy()` in `prepare_data()` to fix mutation bug
|
||||||
|
- `guardrails.md` — added "Copy DataFrames in functions that modify columns" guardrail
|
||||||
|
- `IMPLEMENTATION_PLAN.md` — marked Task 3.1 fully complete, updated completion criteria
|
||||||
|
### Committed: pending
|
||||||
|
### Patterns discovered:
|
||||||
|
- `prepare_data()` is called 12+ times on the same DataFrame during `--chart-type all` processing
|
||||||
|
- The `.map()` operation is destructive — it replaces values, so second mapping produces NaN
|
||||||
|
- This bug was hidden when running `--chart-type indication` alone (only 6 calls, no prior directory processing)
|
||||||
|
- The bug also explains why only all_6mo worked for directory — it was the first call in the loop
|
||||||
|
### Next iteration should:
|
||||||
|
- Run `reflex run` and verify the chart toggle works end-to-end with real data
|
||||||
|
- Verify filter interactions (drugs, directorates) work for both chart types
|
||||||
|
- Verify KPIs update correctly when switching chart types
|
||||||
|
- Complete Phase 5.1 (end-to-end validation) and 5.2 (documentation)
|
||||||
|
- The database is now fully populated — UI testing should be unblocked
|
||||||
|
### Blocked items:
|
||||||
|
- None — all data is in the database, ready for UI validation
|
||||||
|
|||||||
Reference in New Issue
Block a user