fix: correct patient identifier for GP diagnosis lookup (Task 3.3)

Two critical fixes for the indication-based pathway feature:

1. clean_snomed_code() now handles scientific notation (e.g., "1.06e+16")
   - CSV export from pandas/Excel converts large SNOMED codes to scientific notation
   - Without this fix, codes like "10629311000119108" were stored as "1.06e+16"
   - Now properly converts to full integer strings

2. batch_lookup_indication_groups() now uses PseudoNHSNoLinked instead of PersonKey
   - PersonKey is LocalPatientID (provider-specific, like "J188448")
   - PseudoNHSNoLinked is the pseudonymised NHS number that matches PatientPseudonym in GP records
   - Without this fix, 0% of patients matched GP records
   - Test shows ~20% match rate for ADALIMUMAB patients with the correct identifier
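
A note on the first fix: expanding scientific notation can only restore digits that are still present in the CSV value. The sketch below illustrates this with a hypothetical helper, normalize_snomed_code(), which is not part of this commit. It assumes the standard-library decimal module is acceptable here and uses Decimal rather than float so that long codes are expanded without binary rounding.

```python
from decimal import Decimal, InvalidOperation


def normalize_snomed_code(raw: str) -> str:
    """Hypothetical helper: expand decimal or scientific notation to an integer string."""
    text = raw.strip() if raw else ""
    if not text:
        return ""
    try:
        value = Decimal(text)  # exact decimal arithmetic, no binary float rounding
    except InvalidOperation:
        return text  # not numeric: hand back the stripped input
    if not value.is_finite() or value != value.to_integral_value():
        return text  # NaN, infinity, or a genuine fraction: leave untouched
    return str(int(value))


print(normalize_snomed_code("156370009.0"))             # 156370009
print(normalize_snomed_code("1.0629311000119108e+16"))  # 10629311000119108
print(normalize_snomed_code("1.06e+16"))                # 10600000000000000
```

The last call shows the limit: a value already truncated to "1.06e+16" expands to 10600000000000000, not the original code, so such rows would need to be re-exported from the source with the column read as a string.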
2026-02-05 15:49:24 +00:00
parent b9f4041670
commit 5b1569ed5c
3 changed files with 42 additions and 24 deletions
@@ -36,23 +36,40 @@ DEFAULT_CSV_PATH = Path("./data/drug_snomed_mapping_enriched.csv")

 def clean_snomed_code(snomed_code: str) -> str:
    """
-    Clean SNOMED code by removing trailing .0 suffix.
+    Clean SNOMED code by removing trailing .0 suffix and handling scientific notation.

-    The enriched CSV has SNOMED codes with decimal notation (e.g., "156370009.0")
-    that need to be converted to clean integer strings.
+    The enriched CSV has SNOMED codes that may be in decimal notation (e.g., "156370009.0")
+    or scientific notation (e.g., "1.0629311000119108e+16") due to pandas/Excel export.
+    These need to be converted to clean integer strings.

    Args:
        snomed_code: Raw SNOMED code from CSV.

    Returns:
-        Cleaned SNOMED code as string (e.g., "156370009").
+        Cleaned SNOMED code as string (e.g., "156370009" or "10629311000119108").
    """
    if not snomed_code:
        return ""

    code = snomed_code.strip()

-    # Remove trailing .0 if present
+    # Handle scientific notation (e.g., "1.0629311000119108e+16")
+    if 'e' in code.lower():
+        try:
+            # Convert to float first, then to int, then to string
+            # Exact only if the CSV kept every digit; floats above 2**53 cannot hold every integer
+            value = float(code)
+            # Check if it's a whole number (no decimal part)
+            if value == int(value):
+                return str(int(value))
+            else:
+                # Has decimal part - return the float as-is (a blanket replace of '.0' would turn "10.05" into "105")
+                return str(value)
+        except (ValueError, OverflowError):
+            # If conversion fails, return as-is but cleaned
+            return code
+
+    # Remove trailing .0 if present (for non-scientific notation)
    if code.endswith(".0"):
        code = code[:-2]