feat: complete Task 2.2 - test refresh pipeline with Snowflake data

Tested full refresh pipeline end-to-end with real Snowflake data:
- Fixed trust filter to read Name column from defaultTrusts.csv
- Fixed Decimal type handling in calculate_cost_per_patient_per_annum
- Fixed array handling in convert_to_records for average_administered
- Added required reference CSV files to data/ directory
- Configured Snowflake connection (account, warehouse, user)

Results:
- Snowflake fetch: 656,695 records in ~7s
- Transformations: 519,848 records after the UPID, drug and directory steps
- Pathway nodes: 293 for all_6mo (8 trusts, 14 directories)
- Total processing time: ~6.2 minutes
Andrew Charlwood
2026-02-05 00:20:12 +00:00
parent 8b65dfd9a8
commit adc1dbfc58
12 changed files with 1708 additions and 21 deletions
+3 -2
@@ -153,7 +153,7 @@ def calculate_cost_per_patient_per_annum(
     patients with different treatment durations.
     Args:
-        total_cost: Total cost for the patient
+        total_cost: Total cost for the patient (can be Decimal or float)
         days_treated: Treatment duration as timedelta
     Returns:
@@ -171,7 +171,8 @@ def calculate_cost_per_patient_per_annum(
     if days <= 0:
         return None
-    return total_cost / (days / 365)
+    # Convert total_cost to float to handle Decimal from Snowflake
+    return float(total_cost) / (days / 365)
 def calculate_treatment_duration(
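
Note on the Decimal fix above: Python's decimal.Decimal does not support arithmetic with float, so dividing a Decimal cost by the float (days / 365) raises TypeError. The sketch below reproduces that failure and shows the float() conversion from this commit. It is a minimal stand-in, not the project's function: the name cost_per_annum_sketch is hypothetical, and reading the duration via days_treated.days is an assumption, since the diff does not show how days is derived.

```python
from datetime import timedelta
from decimal import Decimal
from typing import Optional, Union


def cost_per_annum_sketch(
    total_cost: Union[Decimal, float],
    days_treated: timedelta,
) -> Optional[float]:
    """Hypothetical stand-in for calculate_cost_per_patient_per_annum."""
    # Assumption: whole days are taken from the timedelta.
    days = days_treated.days
    if days <= 0:
        return None
    # float() accepts both Decimal and float, so the division is float / float.
    return float(total_cost) / (days / 365)


def decimal_division_fails() -> bool:
    """Return True if Decimal / float raises TypeError (the pre-fix behaviour)."""
    try:
        Decimal("1000.00") / (730 / 365)
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    print(decimal_division_fails())
    print(cost_per_annum_sketch(Decimal("1000.00"), timedelta(days=730)))
```

Converting to float trades Decimal's exact arithmetic for compatibility with the rest of the calculation, which seems acceptable for an annualised average but would be worth revisiting if exact currency totals are ever needed.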