docs: update progress.txt with iteration 5 completion (Task 2.2 Pipeline Test)

This commit is contained in:
Andrew Charlwood
2026-02-05 00:21:08 +00:00
parent adc1dbfc58
commit 0a13ba550e
+57
View File
@@ -199,3 +199,60 @@ The `ids` column in ice_df contains hierarchical paths like:
- Compare patient counts with original app to validate correctness - Compare patient counts with original app to validate correctness
### Blocked items: ### Blocked items:
- None - None
## Iteration 5 — 2026-02-05
### Task: 2.2 Test Refresh Pipeline with real Snowflake data
### Why this task:
- All Phase 1 and Task 2.1 complete — this was explicitly recommended by previous iteration
- Need to validate the full pipeline end-to-end before Reflex integration (Phase 3)
- Testing with real data catches type/format issues that unit tests miss
### Status: COMPLETE
### What was done:
1. **Configuration fixes**:
- Added Snowflake account identifier: `ZK91403.uk-south.azure`
- Added warehouse: `WH__XSMALL` (ANALYST_WH not available to user)
- Added user: `ANDREW.CHARLWOOD@NHS.NET`
2. **Bug fixes discovered during testing**:
- `get_default_filters()`: Was reading first column (Code) instead of Name column from defaultTrusts.csv
- `calculate_cost_per_patient_per_annum()`: Decimal type from Snowflake couldn't divide by float — added `float()` conversion
- `convert_to_records()`: `average_administered` is sometimes numpy array — `pd.isna()` fails on arrays, added try/except handling
- Unicode output: Changed checkmark symbols to ASCII for Windows cp1252 compatibility
3. **Data setup**:
- Copied required reference CSV files from Patient pathway analysis project
4. **Full refresh execution**:
- Snowflake fetch: 656,695 records in ~7s (chunked 10K rows at a time)
- Transformations: → 519,848 records (136,847 removed due to unmapped drug names)
- Pathway processing: 293 nodes for `all_6mo` filter
- Database insertion: 293 records with denormalized trust/directory/drug_sequence fields
### Validation results:
- Tier 1 (Code): All files compile, imports work
- Tier 2 (Visual): N/A (CLI/backend work)
- Tier 3 (Functional): Full pipeline tested with real Snowflake data:
- Snowflake SSO auth works (browser popup)
- 656K records fetched successfully
- Transformations complete without error
- 293 pathway nodes generated and inserted to SQLite
- pathway_refresh_log correctly tracks refresh (ID: 9af76e02, status: completed)
### Files changed:
- `cli/refresh_pathways.py` — Fixed trust filter column selection
- `analysis/statistics.py` — Fixed Decimal/float division
- `data_processing/pathway_pipeline.py` — Fixed array handling in convert_to_records
- `config/snowflake.toml` — Added account, warehouse, user settings
- `IMPLEMENTATION_PLAN.md` — Marked Task 2.2 complete with notes
- `data/*.csv` — Added 7 reference CSV files
### Committed: adc1dbf "feat: complete Task 2.2 - test refresh pipeline with Snowflake data"
### Patterns discovered:
- Snowflake account format: `ACCOUNT.uk-south.azure` (not just account ID)
- Snowflake returns Decimal for DECIMAL/NUMERIC columns — must convert to float for math
- `pd.isna()` raises ValueError on arrays — use try/except pattern
- Test data only has data for `all_6mo` filter (others show 0 nodes) — expected given data freshness
- Total refresh time: ~6.2 minutes for 656K → 519K → 293 pathway nodes
### Next iteration should:
- Start Phase 3: Reflex Integration
- Task 3.1: Update AppState to query pathway_nodes instead of recalculating
- Replace date pickers with dropdowns for initiated/last_seen
- Add date_filter_id computed property
- Rewrite load_pathway_data() to query pre-computed data
- Reference `pathways_app/app_v2.py` for existing state structure
### Blocked items:
- None