fix: increase network timeout and batch size for GP lookup queries (Task 3.2)

Dry run test revealed GP lookup queries timing out at 30s (connection_timeout in snowflake.toml). Increased to 600s. Also increased batch_size from 500 to 5000 — query time is ~40s regardless of batch size (CTE compilation overhead), so larger batches reduce total time from ~50min to ~6min for 36K patients. Dry run results: 91.8% GP match rate, 49.3% drug-indication match rate, 42,072 modified UPIDs, 1,846 pathway nodes across 6 date filters.
2026-02-05 23:55:12 +00:00
parent 73088b063b
commit c6e426e36c
7 changed files with 197 additions and 207 deletions
@@ -299,3 +299,63 @@ This project extends the indication-based pathway charts (Phase 1-5 complete) wi
  - No errors in indication pathway processing
 ### Blocked items:
 - None
+
+## Iteration 6 — 2026-02-05
+### Task: 3.2 — Test with dry run
+### Why this task:
+- All Phase 1-3.1 dependencies complete (query, drug mapping, matching function, pipeline integration)
+- 3.2 validates the integrated pipeline end-to-end before Phase 4 (full refresh)
+- Must pass before moving to production refresh
+### Status: COMPLETE
+### What was done:
+- **Discovered**: GP lookup queries were timing out at 30 seconds — every batch failed
+  - Root cause: `connection_timeout=30` in config/snowflake.toml is passed to the Snowflake Python connector as `network_timeout`
+  - This kills any query taking >30s, regardless of server-side STATEMENT_TIMEOUT (300s)
+  - The GROUP BY + COUNT(*) query takes ~40s per batch (even for 5 patients)
+  - The old QUALIFY ROW_NUMBER() query took ~20s (borderline but usually OK with caching)
+- **Fixed timeout**: Changed `connection_timeout` from 30 → 600 in snowflake.toml and config/__init__.py fallback
+  - Safe because query_timeout (300s) still controls server-side statement limits
+  - All existing queries still work fine (activity data fetch: 7s, chunked)
+- **Optimized batch size**: Changed from 500 → 5000 patients per batch
+  - Query time is roughly constant regardless of batch size (~40s at 500 patients, ~45s at 5000); the bottleneck is CTE compilation, not data volume
+  - 500-patient batches: 74 batches × 40s = ~50 minutes for GP lookup
+  - 5000-patient batches: 8 batches × 45s = ~6 minutes for GP lookup
+  - Updated both default in get_patient_indication_groups() and caller in refresh_pathways.py
+- **Dry run results** (successful):
+  - GP Lookup: 36,628 patients, 33,642 matched (91.8%), 8 batches in ~5.5 min
+  - Drug-Indication Matching: 50,797 UPID-Drug pairs → 25,059 matched (49.3%), 15,238 tiebreakers, 25,738 fallback
+  - Modified UPIDs: 42,072 (up from 36,628 original patients, because some patients are split across indications)
+  - Pathway nodes per date filter: all_6mo=438, all_12mo=484, 1yr_6mo=181, 1yr_12mo=199, 2yr_6mo=257, 2yr_12mo=287
+  - Total: 1,846 indication nodes across 6 date filters
+  - No errors during pathway processing
+### Validation results:
+- Tier 1 (Code): py_compile PASSED for diagnosis_lookup.py, refresh_pathways.py, config/__init__.py
+- Tier 2 (Data): Dry run completed successfully with correct log output:
+  - Modified UPIDs appear (42,072 unique)
+  - Match/fallback rates logged (49.3% / 50.7%)
+  - Tiebreaker count logged (15,238)
+  - Top indications: macular degeneration, diabetes, rheumatoid arthritis
+  - Pathway node counts reasonable (181-484 per date filter)
+- Tier 3 (Functional): Dry run completed, no insertion (as expected)
+### Files changed:
+- config/snowflake.toml (connection_timeout 30 → 600)
+- config/__init__.py (fallback connection_timeout 30 → 600)
+- data_processing/diagnosis_lookup.py (batch_size default 500 → 5000)
+- cli/refresh_pathways.py (batch_size 500 → 5000)
+- IMPLEMENTATION_PLAN.md (marked 3.2 subtasks [x])
+### Committed: [pending]
+### Patterns discovered:
+- Snowflake Python connector `network_timeout` (set via connection_timeout in config) controls the client-side wait time for ALL query responses, not just connection establishment. It must be set higher than the slowest expected query.
+- PrimaryCareClinicalCoding query performance is dominated by CTE compilation (~40s fixed cost), not by patient count. Larger batches (5000 vs 500) are dramatically more efficient.
+- The 49.3% match rate means about half of UPID-Drug pairs have both a drug mapping in DimSearchTerm AND a matching GP diagnosis. The 50.7% fallback is expected since not all HCD drugs are in DimSearchTerm.csv.
+### Next iteration should:
+- Work on Task 4.1: Full refresh with both chart types
+  - Run `python -m cli.refresh_pathways --chart-type all` (no --dry-run)
+  - This will insert ~1,846 indication nodes + ~1,800 directory nodes into pathway_nodes table
+  - Verify both chart types generate data, directory charts unchanged
+  - Expected to take ~15 minutes total (7s Snowflake activity fetch + 6min transforms + 6min GP lookup + 2min pathways)
+- After 4.1, Tasks 4.2 and 4.3 can be done together:
+  - 4.2: Validate indication chart correctness (spot-check drug grouping)
+  - 4.3: Validate Reflex UI compiles and chart type toggle works
+### Blocked items:
+- None
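
The batch-size arithmetic in the log entry above (74 batches × 40s versus 8 batches × 45s) can be reproduced with a short sketch. This is only an illustration: the helper name `estimate_lookup_minutes` is hypothetical and is not part of the repository, and the per-batch timings are the approximate figures quoted in the log, not new measurements.

```python
import math


def estimate_lookup_minutes(
    n_patients: int, batch_size: int, seconds_per_batch: float
) -> tuple[int, float]:
    """Return (number of batches, estimated total minutes).

    Assumes every batch costs roughly the same wall-clock time, which
    matches the log's observation that query time is dominated by a
    fixed CTE compilation cost rather than by patient count.
    """
    n_batches = math.ceil(n_patients / batch_size)
    return n_batches, n_batches * seconds_per_batch / 60


# Figures taken from the dry run log: 36,628 patients,
# ~40s per 500-patient batch, ~45s per 5000-patient batch.
small = estimate_lookup_minutes(36_628, 500, 40)
large = estimate_lookup_minutes(36_628, 5000, 45)

print(f"batch_size=500:  {small[0]} batches, ~{small[1]:.1f} min")
print(f"batch_size=5000: {large[0]} batches, ~{large[1]:.1f} min")
```

Because the per-batch cost is close to fixed, total time scales with the number of batches, so a tenfold larger batch cuts the estimate from roughly 49 minutes to 6 minutes, in line with the ~5.5 minutes observed in the dry run.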