# Progress Log - Pathway Data Architecture

## Project Context

This project extends the existing Reflex UI redesign (`pathways_app/app_v2.py`) with pre-computed pathway data from Snowflake. The current app uses a simplified `prepare_chart_data()` that only does Trust → Directory → Drug aggregation. The goal is to support full sequential patient treatment pathways with treatment statistics.

## Key Files Reference

**Existing (reuse these):**
- `analysis/pathway_analyzer.py` - Has `prepare_data()`, `calculate_statistics()`, `build_hierarchy()`, `generate_icicle_chart()`
- `visualization/plotly_generator.py` - Has chart generation with full customdata structure
- `data_processing/snowflake_connector.py` - Snowflake connection with SSO auth
- `tools/data.py` - `patient_id()`, `drug_names()`, `department_identification()`
- `data_processing/schema.py` - Existing SQLite schema

**To create:**
- `data_processing/pathway_pipeline.py` - New pathway processing pipeline
- `cli/refresh_pathways.py` - CLI command for data refresh

## Known Patterns

### Pathway ids format
The `ids` column in `ice_df` contains hierarchical paths. The examples below are written with "|" as the separator; Iteration 2 found that the actual delimiter is " - " (space-hyphen-space).
- "Norfolk & Waveney ICS" (root)
- "Norfolk & Waveney ICS|NNUH" (trust)
- "Norfolk & Waveney ICS|NNUH|OPHTHALMOLOGY" (directory)
- "Norfolk & Waveney ICS|NNUH|OPHTHALMOLOGY|RANIBIZUMAB" (drug)
- "Norfolk & Waveney ICS|NNUH|OPHTHALMOLOGY|RANIBIZUMAB|AFLIBERCEPT" (pathway)

### Date filter combinations
6 pre-defined combinations stored in the `pathway_date_filters` table:
- all_6mo (default), all_12mo, 1yr_6mo, 1yr_12mo, 2yr_6mo, 2yr_12mo

### Expected data volumes
- ~440K intervention records
- ~35K patients
- ~6-12 minutes for full refresh (6 date combinations)

---

## Iteration Log

## Iteration 1 - 2026-02-04
### Task: 1.1 Extend Database Schema
### Why this task:
- Foundation task with no dependencies: everything else needs the schema first
- Task 1.2 (Pipeline Module) and 1.3 (Migration Script) both depend on having schema constants defined
- Logical starting point for a new project
### Status: COMPLETE
### What was done:
- Added `PATHWAY_DATE_FILTERS_SCHEMA` with 6 pre-defined date combinations (all_6mo, all_12mo, 1yr_6mo, 1yr_12mo, 2yr_6mo, 2yr_12mo)
- Added `PATHWAY_NODES_SCHEMA` with all required columns:
  - Hierarchy: parents, ids, labels, level
  - Counts: value (patient count)
  - Costs: cost, costpp, cost_pp_pa
  - Dates: first_seen, last_seen, first_seen_parent, last_seen_parent
  - Statistics: average_spacing, average_administered, avg_days
  - Denormalized filters: trust_name, directory, drug_sequence
  - Metadata: date_filter_id (FK), created_at, data_refresh_id
- Added `PATHWAY_REFRESH_LOG_SCHEMA` for tracking refresh status
- Created 8 indexes for efficient filtering
- Added helper functions:
  - `create_pathway_tables()` / `drop_pathway_tables()`
  - `get_pathway_table_counts()` / `verify_pathway_tables_exist()`
  - `clear_pathway_nodes(date_filter_id=None)`: selective or full clearing
  - `get_pathway_refresh_status()`: returns latest refresh status
- Updated `ALL_TABLES_SCHEMA` and combined helpers to include pathway tables
### Validation results:
- Tier 1 (Code): ✅ `python -m py_compile` passed, all imports successful
- Tier 2 (Visual): N/A (schema work, no UI)
- Tier 3 (Functional): ✅ Created test database, verified all 6 date filters populated, all 8 indexes created, all helper functions work correctly
### Files changed:
- `data_processing/schema.py`: added ~300 lines (3 new schema constants, 6 new helper functions)
- `IMPLEMENTATION_PLAN.md`: marked Task 1.1 subtasks complete
### Committed: 34396fe "feat: add pathway data architecture schema (Task 1.1)"
### Patterns discovered:
- The INSERT OR REPLACE pattern works well for pre-populating date filters
- Using `tempfile.TemporaryDirectory()` for test databases avoids polluting the workspace
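The two patterns above can be sketched together. This is a minimal illustration, not the project's actual schema: only the table name `pathway_date_filters`, the six filter ids and the default (`all_6mo`) come from this log. The `is_default` column and the helpers `seed_date_filters()` and `count_filters_in_temp_db()` are assumptions made for the example.

```python
import sqlite3
import tempfile
from pathlib import Path

# The six ids come from the log; the columns below are assumed for illustration.
DATE_FILTER_IDS = ["all_6mo", "all_12mo", "1yr_6mo", "1yr_12mo", "2yr_6mo", "2yr_12mo"]

SCHEMA = """
CREATE TABLE IF NOT EXISTS pathway_date_filters (
    id TEXT PRIMARY KEY,
    is_default INTEGER NOT NULL DEFAULT 0
);
"""


def seed_date_filters(conn: sqlite3.Connection) -> None:
    """Create the table and (re)insert the pre-defined rows; safe to call repeatedly."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO pathway_date_filters (id, is_default) VALUES (?, ?)",
        [(fid, int(fid == "all_6mo")) for fid in DATE_FILTER_IDS],
    )
    conn.commit()


def count_filters_in_temp_db() -> int:
    """Seed twice in a throwaway database and return the resulting row count."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(Path(tmp) / "test.db")
        try:
            seed_date_filters(conn)
            seed_date_filters(conn)  # second run replaces rows instead of duplicating them
            (count,) = conn.execute(
                "SELECT COUNT(*) FROM pathway_date_filters"
            ).fetchone()
            return count
        finally:
            conn.close()  # close before the temporary directory is removed
```

Because `id` is the primary key, re-running the seed replaces the existing rows, so the table stays at 6 rows however many times the schema is applied.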
### Next iteration should:
- Start Task 1.2: Create Pathway Pipeline Module (`data_processing/pathway_pipeline.py`)
- Reference `analysis/pathway_analyzer.py` for existing pathway logic to reuse
- The pipeline needs to provide: `fetch_and_transform_data()`, `process_pathway_for_date_filter()`, `extract_denormalized_fields()`, `convert_to_records()`
### Blocked items:
- None

## Iteration 2 - 2026-02-04
### Task: 1.2 Create Pathway Pipeline Module
### Why this task:
- Task 1.1 (Schema) is complete, so the pipeline module is the logical next step
- Task 1.3 (Migration Script) and Task 2.1 (CLI Refresh Command) both depend on having the pipeline
- The pipeline module provides the data processing layer between Snowflake and SQLite
### Status: COMPLETE
### What was done:
- Created `data_processing/pathway_pipeline.py` with:
  - `DateFilterConfig` dataclass for date filter configuration
  - `DATE_FILTER_CONFIGS` constant with all 6 pre-defined combinations
  - `compute_date_ranges(config, max_date)`: computes actual ISO dates from config
  - `fetch_and_transform_data(start_date, end_date, provider_codes, paths)`: Snowflake fetch + UPID/drug/directory transformations
  - `process_pathway_for_date_filter(df, config, trust_filter, drug_filter, directory_filter, ...)`: processes a single date filter using the existing `generate_icicle_chart()`
  - `extract_denormalized_fields(ice_df)`: parses the ids column to extract trust_name, directory, drug_sequence
  - `convert_to_records(ice_df, date_filter_id, refresh_id)`: converts ice_df to a list of dicts for SQLite insertion
  - `process_all_date_filters(df, ...)`: convenience function to process all 6 filters
- Integrated with existing `analysis/pathway_analyzer.py` via `generate_icicle_chart()`
- Integrated with `data_processing/snowflake_connector.py` via `fetch_activity_data()`
- Integrated with `tools/data.py` transformations (`patient_id`, `drug_names`, `department_identification`)
### Validation results:
- Tier 1 (Code): ✅ `python -m py_compile` passed, all imports successful
- Tier 2 (Visual): N/A (backend module, no UI)
- Tier 3 (Functional): ✅ Verified all 6 `DATE_FILTER_CONFIGS`, tested that `compute_date_ranges()` returns correct dates
### Files changed:
- `data_processing/pathway_pipeline.py`: new file (~380 lines)
- `IMPLEMENTATION_PLAN.md`: marked Task 1.2 subtasks complete
### Committed: 5945649 "feat: add pathway pipeline module (Task 1.2)"
### Patterns discovered:
- The ids format uses a " - " delimiter (space-hyphen-space), not "|" as noted in progress.txt; this is corrected in `extract_denormalized_fields()`
- The `avg_days` column from `pathway_analyzer` can be a timedelta, so it needs converting to days with `.total_seconds() / 86400`
- Some `ice_df` columns may be NaN at certain hierarchy levels, so defensive None checks are needed
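The three patterns above can be illustrated with a small standard-library sketch. It is not the project's `extract_denormalized_fields()` or `convert_to_records()`: the helper names `split_ids()` and `to_days()` are hypothetical, and joining the drug sequence with "|" is an assumption, since the log does not say how `drug_sequence` is stored. pandas-specific missing values such as `NaT` are not handled here.

```python
import math
from datetime import timedelta
from typing import Any, Dict, Optional

IDS_DELIMITER = " - "  # space-hyphen-space, as found in Iteration 2


def split_ids(ids: str) -> Dict[str, Optional[str]]:
    """Split a hierarchical id into trust_name, directory and drug_sequence."""
    parts = ids.split(IDS_DELIMITER)
    # parts[0] is the root; deeper levels are absent on shallower nodes.
    drugs = parts[3:]
    return {
        "trust_name": parts[1] if len(parts) > 1 else None,
        "directory": parts[2] if len(parts) > 2 else None,
        # Assumed storage format: drugs joined with "|".
        "drug_sequence": "|".join(drugs) if drugs else None,
    }


def to_days(value: Any) -> Optional[float]:
    """Convert a timedelta (or plain number) to float days; None and NaN become None."""
    if value is None:
        return None
    if isinstance(value, timedelta):
        return value.total_seconds() / 86400
    if isinstance(value, float) and math.isnan(value):
        return None
    return float(value)
```

One limitation of splitting on " - ": a trust, directory or drug name that itself contains space-hyphen-space would be split incorrectly, so real data would need checking for that case.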
### Next iteration should:
- Start Task 1.3: Create Migration Script
- OR jump to Task 2.1: Create Refresh Command (which can implicitly handle table creation)
- The refresh command needs: `DATE_FILTER_CONFIGS`, `compute_date_ranges()`, and pipeline functions
### Blocked items:
- None

## Iteration 3 - 2026-02-04
### Task: 1.3 Create Migration Script
### Why this task:
- Task 1.2 (Pipeline) was complete, and this was the next task in sequence
- Following the logical order: Schema → Pipeline → Migration → CLI Command
- Smaller, focused task before tackling the more complex CLI refresh command
### Status: COMPLETE
### What was done:
- Discovered that Task 1.3 is already satisfied by work done in Task 1.1:
  - `python -m data_processing.migrate` already creates all pathway tables via `create_all_tables()`
  - `pathway_date_filters` is auto-populated via INSERT OR REPLACE in the schema
  - All helper functions (`verify_pathway_tables_exist`, `get_pathway_table_counts`, etc.) already exist
- Verified by running the migrate CLI against a fresh temp database:
  - All 3 pathway tables created successfully
  - All 6 date filter rows pre-populated correctly
  - Status output shows pathway tables in table counts
### Validation results:
- Tier 1 (Code): ✅ `python -m data_processing.migrate` runs without errors
- Tier 2 (Visual): N/A (CLI/schema work, no UI)
- Tier 3 (Functional): ✅ Fresh database test shows all tables created with correct row counts
### Files changed:
- `IMPLEMENTATION_PLAN.md`: marked Task 1.3 subtasks complete with notes
### Committed: f976324 "docs: mark Task 1.3 complete (migration already handled by schema)"
### Patterns discovered:
- Including auto-population in the Task 1.1 schema made Task 1.3 trivial
- The INSERT OR REPLACE pattern in the schema removes the need for separate population scripts
### Next iteration should:
- Start Task 2.1: Create CLI Refresh Command (`cli/refresh_pathways.py`)
- This is the first task that requires new code rather than verification of existing work
- Reference `data_processing/pathway_pipeline.py` for `DATE_FILTER_CONFIGS` and `compute_date_ranges()`
- The CLI needs to: parse args, fetch Snowflake data, process all 6 filters, insert to SQLite, log status
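The steps listed for the CLI could be wired together roughly as follows. This is a planning sketch only, since `cli/refresh_pathways.py` does not exist yet: the `--db-path` and `--dry-run` flags, the `run_refresh()` function and its injected `fetch`/`process` callables are all hypothetical. The `pathway_nodes` table name is inferred from `PATHWAY_NODES_SCHEMA`, and only three of its columns are used.

```python
import argparse
import logging
import sqlite3
from typing import Any, Callable, Dict, List, Sequence

logger = logging.getLogger("refresh_pathways")

FILTER_IDS = ["all_6mo", "all_12mo", "1yr_6mo", "1yr_12mo", "2yr_6mo", "2yr_12mo"]


def build_parser() -> argparse.ArgumentParser:
    """Step 1: parse args (flag names are placeholders, not an agreed interface)."""
    parser = argparse.ArgumentParser(description="Refresh pre-computed pathway data")
    parser.add_argument("--db-path", default="pathways.db")
    parser.add_argument("--dry-run", action="store_true")
    return parser


def run_refresh(
    conn: sqlite3.Connection,
    fetch: Callable[[], Any],
    process: Callable[[Any, str], List[Dict[str, Any]]],
    filter_ids: Sequence[str] = FILTER_IDS,
    dry_run: bool = False,
) -> Dict[str, int]:
    """Steps 2-5: fetch once, process each filter, insert records, log status."""
    df = fetch()  # stands in for the Snowflake fetch + transformations
    counts: Dict[str, int] = {}
    for fid in filter_ids:
        records = process(df, fid)  # stands in for the per-filter pipeline call
        counts[fid] = len(records)
        if not dry_run:
            # Clear the old rows for this filter so a re-run does not duplicate them.
            conn.execute("DELETE FROM pathway_nodes WHERE date_filter_id = ?", (fid,))
            conn.executemany(
                "INSERT INTO pathway_nodes (date_filter_id, ids, value) "
                "VALUES (:date_filter_id, :ids, :value)",
                records,
            )
        logger.info("filter %s: %d records", fid, len(records))
    if not dry_run:
        conn.commit()
    return counts
```

Passing `fetch` and `process` in as callables keeps the orchestration testable against an in-memory SQLite database without a Snowflake connection; the real command would pass the pipeline functions from `data_processing/pathway_pipeline.py`.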
### Blocked items:
- None