fix: increase network timeout and batch size for GP lookup queries (Task 3.2)
Dry run test revealed GP lookup queries timing out at 30s (connection_timeout in snowflake.toml). Increased the timeout to 600s.

Also increased batch_size from 500 to 5000. Query time is ~40s per batch regardless of batch size (CTE compilation overhead), so larger batches reduce total time from ~50min to ~6min for 36K patients.

Dry run results:
- 91.8% GP match rate
- 49.3% drug-indication match rate
- 42,072 modified UPIDs
- 1,846 pathway nodes across 6 date filters
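The batch arithmetic behind this change can be checked with a rough sketch. It assumes a fixed ~40s cost per query and uses the 36,628-patient count from the dry run as the lookup population; the helper name `estimate_lookup_time` is hypothetical and is not part of the pipeline.

```python
import math


def estimate_lookup_time(n_patients: int, batch_size: int,
                         seconds_per_batch: float = 40.0) -> tuple[int, float]:
    """Return (number of batches, total minutes).

    Assumes every batch costs the same wall-clock time regardless of its
    size, which is what the ~40s-per-query observation suggests.
    """
    n_batches = math.ceil(n_patients / batch_size)
    return n_batches, n_batches * seconds_per_batch / 60


# Compare the old and new batch sizes for the dry-run patient count.
for size in (500, 5000):
    batches, minutes = estimate_lookup_time(36_628, size)
    print(f"batch_size={size}: {batches} batches, ~{minutes:.1f} min")
```

Under these assumptions the sketch gives 74 batches (about 49 minutes) at batch_size=500 and 8 batches (about 5 minutes) at batch_size=5000, in line with the ~50min and ~6min figures above.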
@@ -153,12 +153,14 @@ Only assign a drug to an indication if BOTH conditions are met. If a patient's d
 - [x] Verify: Pipeline compiles, `python -m py_compile cli/refresh_pathways.py`
 
 ### 3.2 Test with dry run
-- [ ] Run `python -m cli.refresh_pathways --chart-type indication --dry-run -v`
-- [ ] Verify:
-  - Modified UPIDs appear in pipeline log (e.g., `RMV12345|rheumatoid arthritis`)
-  - Patient counts are reasonable (will be higher than before since same patient can appear under multiple indications)
-  - Drug-indication matching is logged (match rate, fallback rate)
-  - Pathway hierarchy shows drug-specific grouping under correct indications
+- [x] Run `python -m cli.refresh_pathways --chart-type indication --dry-run -v`
+- [x] Verify:
+  - Modified UPIDs appear in pipeline log (42,072 unique modified UPIDs)
+  - Patient counts are reasonable (42,072 modified UPIDs vs 36,628 original patients)
+  - Drug-indication matching is logged (49.3% match, 50.7% fallback, 15,238 tiebreakers)
+  - Pathway hierarchy shows drug-specific grouping under correct indications (1,846 total nodes)
+- [x] Fixed: network_timeout increased from 30→600 (was killing GP lookup queries)
+- [x] Fixed: batch_size increased from 500→5000 (reduces CTE compilation overhead from 74 to 8 batches)
 
 ---
 