feat: add pathway pipeline module (Task 1.2)

Create data_processing/pathway_pipeline.py with:
- DateFilterConfig dataclass for date filter configuration
- DATE_FILTER_CONFIGS with 6 pre-defined combinations
- compute_date_ranges() for computing actual dates from config
- fetch_and_transform_data() for Snowflake fetch + transformations
- process_pathway_for_date_filter() using existing generate_icicle_chart()
- extract_denormalized_fields() to parse trust/directory/drugs from ids
- convert_to_records() for SQLite insertion
- process_all_date_filters() convenience function
Andrew Charlwood
2026-02-04 23:21:27 +00:00
parent f2717a2219
commit 5945649ae3
3 changed files with 518 additions and 3 deletions
@@ -82,3 +82,42 @@ The `ids` column in ice_df contains hierarchical paths like:
- The pipeline needs to provide: `fetch_and_transform_data()`, `process_pathway_for_date_filter()`, `extract_denormalized_fields()`, `convert_to_records()`
### Blocked items:
- None
## Iteration 2 — 2026-02-04
### Task: 1.2 Create Pathway Pipeline Module
### Why this task:
- Task 1.1 (Schema) is complete — pipeline module is the logical next step
- Task 1.3 (Migration Script) and Task 2.1 (CLI Refresh Command) both depend on having the pipeline
- The pipeline module provides the data processing layer between Snowflake and SQLite
### Status: COMPLETE
### What was done:
- Created `data_processing/pathway_pipeline.py` with:
  - `DateFilterConfig` dataclass for date filter configuration
  - `DATE_FILTER_CONFIGS` constant with all 6 pre-defined combinations
  - `compute_date_ranges(config, max_date)`: computes actual ISO dates from a config
  - `fetch_and_transform_data(start_date, end_date, provider_codes, paths)`: Snowflake fetch plus UPID/drug/directory transformations
  - `process_pathway_for_date_filter(df, config, trust_filter, drug_filter, directory_filter, ...)`: processes a single date filter using the existing `generate_icicle_chart()`
  - `extract_denormalized_fields(ice_df)`: parses the `ids` column to extract trust_name, directory, drug_sequence
  - `convert_to_records(ice_df, date_filter_id, refresh_id)`: converts ice_df to a list of dicts for SQLite insertion
  - `process_all_date_filters(df, ...)`: convenience function to process all 6 filters
- Integrated with existing `analysis/pathway_analyzer.py` via `generate_icicle_chart()`
- Integrated with `data_processing/snowflake_connector.py` via `fetch_activity_data()`
- Integrated with `tools/data.py` transformations (patient_id, drug_names, department_identification)
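
A minimal sketch of how a date filter config could be turned into ISO dates anchored on `max_date`. The fields `filter_id` and `lookback_days` are hypothetical and only illustrate the idea; the real `DateFilterConfig` fields and the 6 combinations in `DATE_FILTER_CONFIGS` are not shown here and may differ.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class DateFilterConfig:
    # Hypothetical fields for illustration; the real dataclass may differ.
    filter_id: str
    lookback_days: Optional[int]  # None means "no lower bound"


def compute_date_ranges(
    config: DateFilterConfig, max_date: date
) -> Tuple[Optional[str], str]:
    """Return (start, end) as ISO date strings, with end fixed at max_date."""
    end = max_date.isoformat()
    if config.lookback_days is None:
        return None, end
    start = (max_date - timedelta(days=config.lookback_days)).isoformat()
    return start, end


if __name__ == "__main__":
    cfg = DateFilterConfig(filter_id="last_year", lookback_days=365)
    print(compute_date_ranges(cfg, date(2026, 2, 4)))
```

Anchoring on `max_date` rather than on today's date keeps the ranges reproducible for a given data snapshot.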
### Validation results:
- Tier 1 (Code): ✅ `python -m py_compile` passed, all imports successful
- Tier 2 (Visual): N/A (backend module, no UI)
- Tier 3 (Functional): ✅ Verified all 6 DATE_FILTER_CONFIGS, tested compute_date_ranges() returns correct dates
### Files changed:
- `data_processing/pathway_pipeline.py` — new file (~380 lines)
- `IMPLEMENTATION_PLAN.md` — marked Task 1.2 subtasks complete
### Committed: 3c68478 "feat: add pathway pipeline module (Task 1.2)"
### Patterns discovered:
- The `ids` format uses a " - " delimiter (space-hyphen-space), not "|" as noted in progress.txt; corrected in `extract_denormalized_fields()`
- The `avg_days` column from pathway_analyzer can be a timedelta; convert it to days with `.total_seconds() / 86400`
- Some ice_df columns may be NaN at certain hierarchy levels — defensive None checks needed
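
The three patterns above can be sketched as small helpers. This is an illustration only, not the code in `pathway_pipeline.py`: the helper names `is_missing`, `split_ids`, and `to_days` are hypothetical, and the assumed segment order (trust, then directory, then drugs) is a guess at the `ids` layout.

```python
import math
from datetime import timedelta
from typing import Any, Dict, Optional

ID_DELIMITER = " - "  # space-hyphen-space, not "|"


def is_missing(value: Any) -> bool:
    """True for None and float NaN, the two 'empty' cases guarded here."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_ids(ids: Any) -> Dict[str, Any]:
    """Split an ids path; assumes the order trust, directory, drugs."""
    if is_missing(ids):
        return {"trust_name": None, "directory": None, "drug_sequence": None}
    parts = str(ids).split(ID_DELIMITER)
    return {
        "trust_name": parts[0],
        "directory": parts[1] if len(parts) > 1 else None,
        "drug_sequence": parts[2:] if len(parts) > 2 else None,
    }


def to_days(value: Any) -> Optional[float]:
    """Convert a timedelta or a number to float days; missing values become None."""
    if is_missing(value):
        return None
    if isinstance(value, timedelta):
        return value.total_seconds() / 86400
    return float(value)
```

Note that a name which itself contains " - " would be split incorrectly by this approach, so the real parser may need extra handling.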
### Next iteration should:
- Start Task 1.3: Create Migration Script
- OR jump to Task 2.1: Create Refresh Command (which can implicitly handle table creation)
- The refresh command needs: DATE_FILTER_CONFIGS, compute_date_ranges(), and pipeline functions
### Blocked items:
- None