From 6331d441657d8b6f59abddddd2b83ce95a73fcaf Mon Sep 17 00:00:00 2001 From: Andrew Charlwood Date: Thu, 5 Feb 2026 20:10:12 +0000 Subject: [PATCH] fix: prevent DataFrame mutation in prepare_data() causing indication charts to fail MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit prepare_data() mapped Provider Code → Name in-place. When called for directory charts first, then indication charts, the second call re-mapped already-mapped values to NaN, silently dropping all data. Added df.copy() to prevent mutation. Also fixes directory charts only generating data for the first date filter. Results: 3,633 pathway nodes now generated (1,101 directory + 2,532 indication) across all 12 datasets (6 date filters × 2 chart types). --- IMPLEMENTATION_PLAN.md | 26 +++++++++++------- analysis/pathway_analyzer.py | 4 +++ guardrails.md | 5 ++++ progress.txt | 51 ++++++++++++++++++++++++++++++++++++ 4 files changed, 76 insertions(+), 10 deletions(-) diff --git a/IMPLEMENTATION_PLAN.md b/IMPLEMENTATION_PLAN.md index 6a693c8..0466a3c 100644 --- a/IMPLEMENTATION_PLAN.md +++ b/IMPLEMENTATION_PLAN.md @@ -49,7 +49,7 @@ python -m reflex compile - Returns: DataFrame with PatientPseudonym, Search_Term, EventDateTime - Uses most recent match per patient (ORDER BY EventDateTime DESC) - [x] Handle edge cases: Snowflake unavailable, empty patient list -- [ ] Verify: Function returns expected Search_Terms for test patients +- [x] Verify: Function returns expected Search_Terms for test patients (92.8% match rate, 139 unique Search_Terms) ### 1.2 Update Data Pipeline to Include Indications - [x] Modify `cli/refresh_pathways.py` to call indication lookup during refresh: @@ -58,7 +58,7 @@ python -m reflex compile - Create `indication_df` mapping UPID → Indication_Group - For patients with no GP match: Indication_Group = fallback directorate - [x] Log coverage: X% diagnosis-matched, Y% fallback -- [ ] Verify: indication_df has correct structure for pathway processing +- [x] Verify: indication_df has correct structure for pathway processing (verified via full pipeline run) --- @@ -68,14 +68,14 @@ python -m reflex compile - [x] Add `chart_type` column to `pathway_nodes` table (ALREADY DONE) - [x] Update UNIQUE constraint to include chart_type (ALREADY DONE) - [x] Add indexes for chart_type filtering (ALREADY DONE) -- [ ] Verify: Existing migration works correctly +- [x] Verify: Existing migration works correctly (tables created, 3,589 nodes inserted) ### 2.2 Create Indication Pathway Processing - [x] Add `generate_icicle_chart_indication()` to `pathway_analyzer.py` (ALREADY DONE) - [x] Add `process_indication_pathway_for_date_filter()` to `pathway_pipeline.py` (ALREADY DONE) - [x] Add `extract_indication_fields()` for denormalized columns (ALREADY DONE) - [x] Update `convert_to_records()` with `chart_type` parameter (ALREADY DONE) -- [ ] Verify: Code compiles, imports work correctly +- [x] Verify: Code compiles, imports work correctly ### 2.3 Update Refresh Command for Dual Charts - [x] Add `--chart-type` argument: "all", "directory", "indication" (ALREADY DONE) @@ -100,7 +100,10 @@ python -m reflex compile - Record counts: 695 indication pathway nodes for all_6mo - Coverage: 92.8% GP diagnosis match rate (34,006/36,628 patients) - Top indications: drug misuse (8,749), influenza (6,336), diabetes (2,516), sepsis (1,991), cardiovascular disease (954) -- [ ] Run full refresh with `--chart-type all` to populate database (requires non-dry-run) +- [x] Run full refresh with `--chart-type all` to populate database (requires non-dry-run) + - Fixed DataFrame mutation bug in prepare_data() (df.copy() added) + - Results: 3,633 total nodes (1,101 directory + 2,532 indication) across all 12 datasets + - Database populated: 3,589 nodes in pathway_nodes table --- @@ -152,12 +155,15 @@ python -m reflex compile ## Completion Criteria All tasks marked `[x]` AND: -- [ ] App compiles without errors (`reflex compile` succeeds) -- [ ] Both chart types generate pathway data (12 total: 6 dates × 2 types) +- [x] App compiles without errors (`reflex compile` succeeds) +- [x] Both chart types generate pathway data (12 total: 6 dates × 2 types) + - Directory: 1,101 nodes (293+329+93+105+134+147) + - Indication: 2,532 nodes (695+785+167+198+315+372) - [ ] Chart type toggle switches between Directory and Indication views -- [ ] GP diagnosis matching works via Snowflake cluster query -- [ ] Unmatched patients show in indication chart with directorate fallback label -- [ ] Coverage metrics logged (% diagnosis-matched vs fallback) +- [x] GP diagnosis matching works via Snowflake cluster query +- [x] Unmatched patients show in indication chart with directorate fallback label +- [x] Coverage metrics logged (% diagnosis-matched vs fallback) + - 92.7% diagnosis-matched (34,545/37,257 UPIDs) - [ ] All filters work correctly for both chart types - [ ] Performance acceptable (< 10 min full refresh, < 500ms filter change) diff --git a/analysis/pathway_analyzer.py b/analysis/pathway_analyzer.py index 053f7e6..59687b6 100644 --- a/analysis/pathway_analyzer.py +++ b/analysis/pathway_analyzer.py @@ -53,6 +53,10 @@ def prepare_data( if paths is None: paths = default_paths + # Work on a copy to avoid mutating the caller's DataFrame + # (Provider Code mapping is destructive — second call would map names to NaN) + df = df.copy() + df["UPIDTreatment"] = df["UPID"] + df["Drug Name"] org_codes = pd.read_csv(paths.org_codes_csv, index_col=1) diff --git a/guardrails.md b/guardrails.md index d597d53..feff652 100644 --- a/guardrails.md +++ b/guardrails.md @@ -221,6 +221,11 @@ def filtered_count(self) -> int: - **Rule**: Check `pd.notna(directory)` before concatenating to string. Use `"UNKNOWN (no GP dx)"` for NaN cases. - **Why**: `str(nan) + " (no GP dx)"` doesn't cause error, but `nan + " (no GP dx)"` causes TypeError. Always be explicit about NaN handling. +### Copy DataFrames in functions that modify columns +- **When**: Writing functions like `prepare_data()` that modify DataFrame columns (e.g., mapping Provider Code to trust names) +- **Rule**: Always `df = df.copy()` at the start of any function that modifies column values on the input DataFrame +- **Why**: `prepare_data()` mapped Provider Code → Name in-place. When called for directory charts first, then indication charts second, the second call tried to map already-mapped names → NaN, silently dropping all data. The fix: `df = df.copy()` prevents destructive mutation of the caller's DataFrame. +