fix: increase network timeout and batch size for GP lookup queries (Task 3.2)

A dry run showed GP lookup queries timing out at the 30s connection_timeout
set in snowflake.toml; raised it to 600s. Also raised batch_size from 500 to
5000: each query takes ~40s regardless of batch size (CTE compilation
overhead), so larger batches cut total time from ~50min to ~6min for 36K
patients.
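
The batch arithmetic can be checked with a rough sketch. It assumes a flat
~40s per query and exactly 36,000 patients, and ignores any per-batch overhead
beyond the query itself; estimated_minutes is a hypothetical helper, not part
of this codebase.

```python
import math


def estimated_minutes(
    n_patients: int,
    batch_size: int,
    seconds_per_query: float = 40.0,  # assumed flat cost per batch query
) -> float:
    """Estimate total lookup time in minutes for a given batch size."""
    # Each batch issues one query; the last batch may be partially filled.
    n_batches = math.ceil(n_patients / batch_size)
    return n_batches * seconds_per_query / 60


if __name__ == "__main__":
    for size in (500, 5000):
        print(f"batch_size={size}: ~{estimated_minutes(36_000, size):.1f} min")
```

With these assumptions, batch_size=500 means 72 queries (about 48 minutes) and
batch_size=5000 means 8 queries (a little over 5 minutes), in line with the
~50min and ~6min figures above.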

Dry run results: 91.8% GP match rate, 49.3% drug-indication match rate,
42,072 modified UPIDs, 1,846 pathway nodes across 6 date filters.
Andrew Charlwood
2026-02-05 23:55:12 +00:00
parent 73088b063b
commit c6e426e36c
7 changed files with 197 additions and 207 deletions
+1 -1
@@ -1568,7 +1568,7 @@ AllIndicationCodes AS (
def get_patient_indication_groups(
patient_pseudonyms: list[str],
connector: Optional[SnowflakeConnector] = None,
-    batch_size: int = 500,
+    batch_size: int = 5000,
earliest_hcd_date: Optional[str] = None,
) -> "pd.DataFrame":
"""