fix: increase network timeout and batch size for GP lookup queries (Task 3.2)

Dry run test revealed GP lookup queries timing out at 30s (connection_timeout
in snowflake.toml). Increased to 600s. Also increased batch_size from 500 to
5000 — query time is ~40s regardless of batch size (CTE compilation overhead),
so larger batches reduce total time from ~50min to ~6min for 36K patients.

Dry run results: 91.8% GP match rate, 49.3% drug-indication match rate,
42,072 modified UPIDs, 1,846 pathway nodes across 6 date filters.
Andrew Charlwood
2026-02-05 23:55:12 +00:00
parent 73088b063b
commit c6e426e36c
7 changed files with 197 additions and 207 deletions
@@ -299,3 +299,63 @@ This project extends the indication-based pathway charts (Phase 1-5 complete) wi
- No errors in indication pathway processing
### Blocked items:
- None
## Iteration 6 — 2026-02-05
### Task: 3.2 — Test with dry run
### Why this task:
- All Phase 1-3.1 dependencies complete (query, drug mapping, matching function, pipeline integration)
- 3.2 validates the integrated pipeline end-to-end before Phase 4 (full refresh)
- Must pass before moving to production refresh
### Status: COMPLETE
### What was done:
- **Discovered**: GP lookup queries were timing out at 30 seconds, so every batch failed
  - Root cause: `connection_timeout=30` in config/snowflake.toml sets the Snowflake Python connector's `network_timeout`
  - This kills any query taking >30s, regardless of the server-side STATEMENT_TIMEOUT (300s)
  - The GROUP BY + COUNT(*) query takes ~40s per batch (even for 5 patients)
  - The old QUALIFY ROW_NUMBER() query took ~20s (borderline but usually OK with caching)
- **Fixed timeout**: Changed `connection_timeout` from 30 → 600 in snowflake.toml and in the config/__init__.py fallback
  - Safe because query_timeout (300s) still controls server-side statement limits
  - All existing queries still work fine (activity data fetch: 7s, chunked)
- **Optimized batch size**: Changed from 500 → 5000 patients per batch
  - Query time is roughly constant (~40s) regardless of batch size; the bottleneck is CTE compilation, not data volume
  - 500-patient batches: 74 batches × 40s = ~50 minutes for GP lookup
  - 5000-patient batches: 8 batches × 45s = ~6 minutes for GP lookup
  - Updated both the default in get_patient_indication_groups() and the caller in refresh_pathways.py
- **Dry run results** (successful):
  - GP Lookup: 36,628 patients, 33,642 matched (91.8%), 8 batches in ~5.5 min
  - Drug-Indication Matching: 50,797 UPID-Drug pairs → 25,059 matched (49.3%), 15,238 tiebreakers, 25,738 fallback
  - Modified UPIDs: 42,072 (up from 36,628 original patients, because some patients are split across indications)
  - Pathway nodes per date filter: all_6mo=438, all_12mo=484, 1yr_6mo=181, 1yr_12mo=199, 2yr_6mo=257, 2yr_12mo=287
  - Total: 1,846 indication nodes across 6 date filters
  - No errors during pathway processing
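
The batch-size arithmetic above can be reproduced with a short sketch. It assumes a fixed per-batch cost, as the timings in this log suggest; `iter_batches` and `estimated_minutes` are hypothetical helpers written for illustration and are not the project's actual functions.

```python
import math
from typing import Iterator, Sequence


def iter_batches(ids: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most batch_size IDs."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def estimated_minutes(n_patients: int, batch_size: int, secs_per_batch: float) -> float:
    """Estimate total lookup time when every batch has a roughly fixed cost."""
    n_batches = math.ceil(n_patients / batch_size)
    return n_batches * secs_per_batch / 60


N_PATIENTS = 36_628  # patient count reported by the dry run

small = estimated_minutes(N_PATIENTS, 500, 40)   # 74 batches at ~40s each
large = estimated_minutes(N_PATIENTS, 5000, 45)  # 8 batches at ~45s each
print(f"500/batch: {small:.1f} min, 5000/batch: {large:.1f} min")
```

Because the per-batch cost barely changes with batch size, total time scales with the number of batches rather than the number of patients, which is why the tenfold larger batch gives close to a tenfold speed-up.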
### Validation results:
- Tier 1 (Code): py_compile PASSED for diagnosis_lookup.py, refresh_pathways.py, config/__init__.py
- Tier 2 (Data): Dry run completed successfully with correct log output:
  - Modified UPIDs appear (42,072 unique)
  - Match/fallback rates logged (49.3% / 50.7%)
  - Tiebreaker count logged (15,238)
  - Top indications: macular degeneration, diabetes, rheumatoid arthritis
  - Pathway node counts reasonable (181-484 per date filter)
- Tier 3 (Functional): Dry run completed, no insertion (as expected)
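
As a cross-check on the Tier 2 figures, the logged counts can be recomputed from each other. This is a minimal sketch using only numbers copied from this log; the variable names are illustrative and do not come from the pipeline code.

```python
# Figures copied from the dry-run results in this log.
node_counts = {
    "all_6mo": 438, "all_12mo": 484,
    "1yr_6mo": 181, "1yr_12mo": 199,
    "2yr_6mo": 257, "2yr_12mo": 287,
}
pairs_total, matched, fallback = 50_797, 25_059, 25_738
patients, gp_matched = 36_628, 33_642

total_nodes = sum(node_counts.values())
match_pct = round(100 * matched / pairs_total, 1)
fallback_pct = round(100 * fallback / pairs_total, 1)
gp_pct = round(100 * gp_matched / patients, 1)

# Matched and fallback pairs should account for every UPID-Drug pair.
assert matched + fallback == pairs_total
# Per-filter node counts should add up to the reported total.
assert total_nodes == 1_846

print(total_nodes, match_pct, fallback_pct, gp_pct)
```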
### Files changed:
- config/snowflake.toml (connection_timeout 30 → 600)
- config/__init__.py (fallback connection_timeout 30 → 600)
- data_processing/diagnosis_lookup.py (batch_size default 500 → 5000)
- cli/refresh_pathways.py (batch_size 500 → 5000)
- IMPLEMENTATION_PLAN.md (marked 3.2 subtasks [x])
### Committed: [pending]
### Patterns discovered:
- Snowflake Python connector `network_timeout` (set via connection_timeout in config) controls client-side wait time for ALL query responses, not just connection establishment. Must be high enough for slow queries.
- PrimaryCareClinicalCoding query performance is dominated by CTE compilation (~40s fixed cost), not by patient count. Larger batches (5000 vs 500) are dramatically more efficient.
- The 49.3% match rate means about half of UPID-Drug pairs have both a drug mapping in DimSearchTerm.csv AND a matching GP diagnosis. The 50.7% fallback is expected since not all HCD drugs are in DimSearchTerm.csv.
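
The relationship between the two timeouts in the first pattern can be sketched as follows. `network_timeout` and the `STATEMENT_TIMEOUT_IN_SECONDS` session parameter are real Snowflake connector and session settings, but `build_connect_kwargs` and the shape of the `cfg` dict are assumptions for illustration; the project's config/__init__.py may map these values differently.

```python
from typing import Any


def build_connect_kwargs(cfg: dict[str, Any]) -> dict[str, Any]:
    """Map project-level timeout settings onto Snowflake connector arguments.

    The cfg keys (connection_timeout, query_timeout) mirror the names used in
    this log; how the real config module maps them is an assumption.
    """
    return {
        # Client-side: how long the connector waits on network requests,
        # which per the pattern above includes waiting for query responses.
        "network_timeout": int(cfg.get("connection_timeout", 600)),
        # Server-side: Snowflake cancels statements running longer than this.
        "session_parameters": {
            "STATEMENT_TIMEOUT_IN_SECONDS": int(cfg.get("query_timeout", 300)),
        },
    }


kwargs = build_connect_kwargs({"connection_timeout": 600, "query_timeout": 300})
# The client-side wait must exceed the slowest expected query (~40s here).
assert kwargs["network_timeout"] > 40
print(kwargs)
```

In this sketch the resulting dict would be merged with credentials and passed to `snowflake.connector.connect(**kwargs)`. Keeping the client-side value above the server-side limit means the server's statement timeout, not the client, decides when a long query is cancelled.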
### Next iteration should:
- Work on Task 4.1: Full refresh with both chart types
  - Run `python -m cli.refresh_pathways --chart-type all` (no --dry-run)
  - This will insert ~1,846 indication nodes + ~1,800 directory nodes into pathway_nodes table
  - Verify both chart types generate data, directory charts unchanged
  - Takes ~15 minutes total (7s Snowflake + 6min transforms + 6min GP lookup + 2min pathways)
- After 4.1, Tasks 4.2 and 4.3 can be done together:
  - 4.2: Validate indication chart correctness (spot-check drug grouping)
  - 4.3: Validate Reflex UI compiles and chart type toggle works
### Blocked items:
- None