fix: correct patient identifier for GP diagnosis lookup (Task 3.3)

Two critical fixes for the indication-based pathway feature:

1. clean_snomed_code() now handles scientific notation (e.g., "1.06e+16")
   - CSV export from pandas/Excel converts large SNOMED codes to scientific notation
   - Without this fix, codes like "10629311000119108" were stored as "1.06e+16"
   - Now converts such values back to plain integer strings
   - Digits already dropped by the export (as in "1.06e+16") cannot be recovered

2. batch_lookup_indication_groups() now uses PseudoNHSNoLinked instead of PersonKey
   - PersonKey is LocalPatientID (provider-specific like "J188448")
   - PseudoNHSNoLinked is the pseudonymised NHS number that matches PatientPseudonym in GP records
   - Without this fix, 0% of patients matched GP records
   - Test shows ~20% match rate for ADALIMUMAB patients with correct identifier
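
The identifier mismatch in fix 2 can be illustrated with a small, self-contained sketch. It is not the project's batch_lookup_indication_groups(); the helper name match_rate and all rows below are hypothetical, and only the field names PersonKey, PseudoNHSNoLinked and PatientPseudonym come from this commit message.

```python
from typing import Iterable, Mapping, Set


def match_rate(patients: Iterable[Mapping[str, str]],
               gp_pseudonyms: Set[str],
               id_field: str) -> float:
    """Fraction of patient rows whose `id_field` value appears in the GP set.

    Hypothetical helper for illustration; not part of the repository.
    """
    rows = list(patients)
    if not rows:
        return 0.0
    matched = sum(1 for row in rows if row.get(id_field) in gp_pseudonyms)
    return matched / len(rows)


# Made-up rows: PersonKey is a provider-local ID, PseudoNHSNoLinked is the
# pseudonymised NHS number.
patients = [
    {"PersonKey": "J188448", "PseudoNHSNoLinked": "PSEUDO-A"},
    {"PersonKey": "K000001", "PseudoNHSNoLinked": "PSEUDO-B"},
    {"PersonKey": "K000002", "PseudoNHSNoLinked": "PSEUDO-C"},
    {"PersonKey": "K000003", "PseudoNHSNoLinked": "PSEUDO-D"},
]

# Made-up PatientPseudonym values as they might appear in GP records.
gp_pseudonyms = {"PSEUDO-A", "PSEUDO-C", "PSEUDO-X"}

rate_with_person_key = match_rate(patients, gp_pseudonyms, "PersonKey")
rate_with_pseudonym = match_rate(patients, gp_pseudonyms, "PseudoNHSNoLinked")
```

With these made-up rows, joining on PersonKey matches nothing, while joining on PseudoNHSNoLinked matches half of the patients. The real figures (0% and ~20%) are the ones reported in the commit message, not something this sketch reproduces.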
Andrew Charlwood
2026-02-05 15:49:24 +00:00
parent b9f4041670
commit 5b1569ed5c
3 changed files with 42 additions and 24 deletions
+22 -5
@@ -36,23 +36,40 @@ DEFAULT_CSV_PATH = Path("./data/drug_snomed_mapping_enriched.csv")
 def clean_snomed_code(snomed_code: str) -> str:
     """
-    Clean SNOMED code by removing trailing .0 suffix.
+    Clean SNOMED code by removing trailing .0 suffix and handling scientific notation.
-    The enriched CSV has SNOMED codes with decimal notation (e.g., "156370009.0")
-    that need to be converted to clean integer strings.
+    The enriched CSV has SNOMED codes that may be in decimal notation (e.g., "156370009.0")
+    or scientific notation (e.g., "1.0629311000119108e+16") due to pandas/Excel export.
+    These need to be converted to clean integer strings.
     Args:
         snomed_code: Raw SNOMED code from CSV.
     Returns:
-        Cleaned SNOMED code as string (e.g., "156370009").
+        Cleaned SNOMED code as string (e.g., "156370009" or "10629311000119108").
     """
     if not snomed_code:
         return ""
     code = snomed_code.strip()
-    # Remove trailing .0 if present
+    # Handle scientific notation (e.g., "1.0629311000119108e+16")
+    if 'e' in code.lower():
+        try:
+            # Convert to float first, then to int, then to string.
+            # Caveat: a float holds only 53 bits of mantissa, so codes above 2**53
+            # are recovered exactly only when the float can represent them.
+            value = float(code)
+            # Check if it's a whole number (no decimal part)
+            if value == int(value):
+                return str(int(value))
+            else:
+                # Has a fractional part, so it cannot be a valid SNOMED code;
+                # return the float's string form unchanged
+                return str(value)
+        except (ValueError, OverflowError):
+            # If conversion fails, return as-is but cleaned
+            return code
+    # Remove trailing .0 if present (for non-scientific notation)
     if code.endswith(".0"):
         code = code[:-2]
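
The float-based conversion in this hunk relies on the value being exactly representable as a double. One possible alternative, sketched here under assumptions and not taken from the repository, parses the string with the standard-library decimal module so that every digit present in the input survives. The function name clean_snomed_code_exact and the MAX_SCTID_DIGITS guard are hypothetical; the guard reflects the 18-digit upper bound on SNOMED CT identifiers.

```python
from decimal import Decimal, InvalidOperation

# SNOMED CT identifiers are at most 18 digits long; longer values are left alone.
MAX_SCTID_DIGITS = 18


def clean_snomed_code_exact(snomed_code: str) -> str:
    """Hypothetical variant of clean_snomed_code() that avoids float rounding."""
    if not snomed_code:
        return ""
    code = snomed_code.strip()
    try:
        # Decimal parses "156370009.0" and "1.0629311000119108e+16" exactly.
        value = Decimal(code)
    except InvalidOperation:
        # Not numeric at all: return the stripped input unchanged.
        return code
    # Reject NaN/Infinity and anything too large to be a SNOMED identifier.
    if not value.is_finite() or value.adjusted() >= MAX_SCTID_DIGITS:
        return code
    # Only whole numbers are turned into integer strings.
    if value == value.to_integral_value():
        return str(int(value))
    return code


exact = clean_snomed_code_exact("1.0629311000119107e+16")
via_float = str(int(float("1.0629311000119107e+16")))
```

For this made-up odd-valued input, `exact` keeps the final digit while `via_float` does not, because integers above 2**53 that are odd have no exact double representation. Neither approach can restore digits that the export already discarded.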