fix: increase network timeout and batch size for GP lookup queries (Task 3.2)
Dry run test revealed GP lookup queries timing out at 30s (connection_timeout in snowflake.toml). Increased to 600s. Also increased batch_size from 500 to 5000 — query time is ~40s regardless of batch size (CTE compilation overhead), so larger batches reduce total time from ~50min to ~6min for 36K patients. Dry run results: 91.8% GP match rate, 49.3% drug-indication match rate, 42,072 modified UPIDs, 1,846 pathway nodes across 6 date filters.
@@ -299,3 +299,63 @@ This project extends the indication-based pathway charts (Phase 1-5 complete) wi
- No errors in indication pathway processing
### Blocked items:
- None
## Iteration 6 — 2026-02-05
### Task: 3.2 — Test with dry run
### Why this task:
- All Phase 1-3.1 dependencies complete (query, drug mapping, matching function, pipeline integration)
- 3.2 validates the integrated pipeline end-to-end before Phase 4 (full refresh)
- Must pass before moving to production refresh
### Status: COMPLETE
### What was done:
- **Discovered**: GP lookup queries were timing out at 30 seconds, so every batch failed
  - Root cause: `connection_timeout=30` in config/snowflake.toml sets the Snowflake Python client `network_timeout`
  - The client aborts any query that takes longer than 30s, regardless of the server-side STATEMENT_TIMEOUT (300s)
  - The GROUP BY + COUNT(*) query takes ~40s per batch (even for 5 patients)
  - The old QUALIFY ROW_NUMBER() query took ~20s (borderline but usually OK with caching)
- **Fixed timeout**: Changed `connection_timeout` from 30 → 600 in snowflake.toml and in the `config/__init__.py` fallback
  - Safe because query_timeout (300s) still controls server-side statement limits
  - All existing queries still work (activity data fetch: 7s, chunked)
- **Optimized batch size**: Changed from 500 → 5000 patients per batch
  - Query time is roughly constant (~40s) regardless of batch size; the bottleneck is CTE compilation, not data volume
  - 500-patient batches: 74 batches × 40s = ~50 minutes for GP lookup
  - 5000-patient batches: 8 batches × 45s = ~6 minutes for GP lookup
  - Updated both the default in get_patient_indication_groups() and the caller in refresh_pathways.py
- **Dry run results** (successful):
  - GP Lookup: 36,628 patients, 33,642 matched (91.8%), 8 batches in ~5.5 min
  - Drug-Indication Matching: 50,797 UPID-Drug pairs → 25,059 matched (49.3%), 15,238 tiebreakers, 25,738 fallback
  - Modified UPIDs: 42,072 (up from 36,628 original patients, because some patients are split across indications)
  - Pathway nodes per date filter: all_6mo=438, all_12mo=484, 1yr_6mo=181, 1yr_12mo=199, 2yr_6mo=257, 2yr_12mo=287
  - Total: 1,846 indication nodes across 6 date filters
  - No errors during pathway processing
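The batch-size arithmetic above can be reproduced with a short sketch. The helper names `batch_plan` and `chunked` are hypothetical and are not taken from the project; the patient count and per-batch timings are the ones reported in this entry.

```python
import math


def batch_plan(n_patients: int, batch_size: int, secs_per_batch: float) -> tuple[int, float]:
    """Return (number of batches, estimated total minutes) for a fixed per-batch cost."""
    n_batches = math.ceil(n_patients / batch_size)
    return n_batches, n_batches * secs_per_batch / 60


def chunked(items: list, size: int):
    """Yield consecutive slices of at most `size` items (illustrative batching helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Figures from the dry run: 36,628 patients, ~40s per 500-patient batch,
# ~45s per 5000-patient batch.
old_plan = batch_plan(36_628, 500, 40)
new_plan = batch_plan(36_628, 5_000, 45)

print(f"500/batch:  {old_plan[0]} batches, ~{old_plan[1]:.0f} min")
print(f"5000/batch: {new_plan[0]} batches, ~{new_plan[1]:.0f} min")
```

Because the per-batch cost is close to fixed, total time scales with the number of batches rather than the number of patients, which is why the larger batch size pays off.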
### Validation results:
- Tier 1 (Code): py_compile PASSED for diagnosis_lookup.py, refresh_pathways.py, `config/__init__.py`
- Tier 2 (Data): Dry run completed successfully with correct log output:
  - Modified UPIDs appear (42,072 unique)
  - Match/fallback rates logged (49.3% / 50.7%)
  - Tiebreaker count logged (15,238)
  - Top indications: macular degeneration, diabetes, rheumatoid arthritis
  - Pathway node counts reasonable (181-484 per date filter)
- Tier 3 (Functional): Dry run completed, no insertion (as expected)
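As a cross-check on the logged figures, the rates and totals can be recomputed from the raw counts reported in this entry. This is a standalone sketch with illustrative variable names, not part of the pipeline.

```python
# Raw counts copied from the dry run log in this entry.
gp_total, gp_matched = 36_628, 33_642
pairs_total, pairs_matched, pairs_fallback = 50_797, 25_059, 25_738
nodes = {
    "all_6mo": 438, "all_12mo": 484,
    "1yr_6mo": 181, "1yr_12mo": 199,
    "2yr_6mo": 257, "2yr_12mo": 287,
}

gp_rate = round(100 * gp_matched / gp_total, 1)
match_rate = round(100 * pairs_matched / pairs_total, 1)
fallback_rate = round(100 * pairs_fallback / pairs_total, 1)
total_nodes = sum(nodes.values())

# Matched and fallback pairs should account for every UPID-Drug pair.
assert pairs_matched + pairs_fallback == pairs_total

print(gp_rate, match_rate, fallback_rate, total_nodes)
print(min(nodes.values()), max(nodes.values()))
```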
### Files changed:
- config/snowflake.toml (connection_timeout 30 → 600)
- `config/__init__.py` (fallback connection_timeout 30 → 600)
- data_processing/diagnosis_lookup.py (batch_size default 500 → 5000)
- cli/refresh_pathways.py (batch_size 500 → 5000)
- IMPLEMENTATION_PLAN.md (marked 3.2 subtasks [x])
### Committed: [pending]
### Patterns discovered:
- Snowflake Python connector `network_timeout` (set via `connection_timeout` in config) controls the client-side wait for ALL query responses, not just connection establishment. It must be higher than the slowest expected query.
- PrimaryCareClinicalCoding query performance is dominated by CTE compilation (~40s fixed cost), not by patient count. Larger batches (5000 vs 500) cut the GP lookup from 74 batches (~50 min) to 8 batches (~6 min).
- The 49.3% match rate means about half of UPID-Drug pairs have both a drug mapping in DimSearchTerm AND a matching GP diagnosis. The 50.7% fallback is expected since not all HCD drugs are in DimSearchTerm.csv.
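The timeout pattern above can be sketched as a mapping from config keys to connector arguments. This is a minimal illustration under assumptions: `build_connect_kwargs`, the `account`/`user` placeholder values, and the use of `session_parameters` for the server-side limit are hypothetical and may not match how `config/__init__.py` actually wires things up. `network_timeout` and `session_parameters` are real keyword arguments of `snowflake.connector.connect`, and `STATEMENT_TIMEOUT_IN_SECONDS` is a real Snowflake session parameter.

```python
def build_connect_kwargs(cfg: dict) -> dict:
    """Map project-style config keys to Snowflake connector keyword arguments.

    Assumed layout: `connection_timeout` feeds the client-side `network_timeout`,
    and `query_timeout` feeds the server-side statement timeout.
    """
    return {
        "account": cfg["account"],
        "user": cfg["user"],
        # Client-side: how long the connector waits for any response.
        "network_timeout": cfg.get("connection_timeout", 600),
        # Server-side: Snowflake cancels statements that run longer than this.
        "session_parameters": {
            "STATEMENT_TIMEOUT_IN_SECONDS": cfg.get("query_timeout", 300),
        },
    }


def client_gives_up(query_secs: float, network_timeout: int) -> bool:
    """True if the client-side limit is hit before the query finishes."""
    return query_secs > network_timeout


cfg = {
    "account": "example_account",  # placeholder value
    "user": "example_user",        # placeholder value
    "connection_timeout": 600,
    "query_timeout": 300,
}
kwargs = build_connect_kwargs(cfg)

# A ~40s GP lookup batch fails with the old 30s limit but not with 600s.
print(client_gives_up(40, 30), client_gives_up(40, kwargs["network_timeout"]))

# With the connector installed, the kwargs would be passed as:
#   snowflake.connector.connect(**kwargs, password=...)
```

Keeping the client-side limit above the server-side one means a slow statement is cancelled by Snowflake's own timeout rather than by the client dropping the response.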
### Next iteration should:
- Work on Task 4.1: Full refresh with both chart types
  - Run `python -m cli.refresh_pathways --chart-type all` (no --dry-run)
  - This will insert ~1,846 indication nodes + ~1,800 directory nodes into pathway_nodes table
  - Verify both chart types generate data, directory charts unchanged
  - Takes ~15 minutes total (7s Snowflake + 6min transforms + 6min GP lookup + 2min pathways)
- After 4.1, Tasks 4.2 and 4.3 can be done together:
  - 4.2: Validate indication chart correctness (spot-check drug grouping)
  - 4.3: Validate Reflex UI compiles and chart type toggle works
### Blocked items:
- None