c6e426e36c
Dry run test revealed GP lookup queries timing out at 30s (connection_timeout in snowflake.toml). Increased to 600s. Also increased batch_size from 500 to 5000 — query time is ~40s regardless of batch size (CTE compilation overhead), so larger batches reduce total time from ~50min to ~6min for 36K patients. Dry run results: 91.8% GP match rate, 49.3% drug-indication match rate, 42,072 modified UPIDs, 1,846 pathway nodes across 6 date filters.
12 KiB
12 KiB
Guardrails
Known failure patterns. Read EVERY iteration. Follow ALL of these rules. If you discover a new failure pattern during your work, add it to this file.
Drug-Indication Matching Guardrails
Match drugs to indications, not just patients to indications
- When: Building the indication mapping for pathway charts
- Rule: Each drug must be validated against BOTH the patient's GP diagnoses AND the drug-to-indication mapping from DimSearchTerm.csv. A patient being diagnosed with rheumatoid arthritis does NOT mean all their drugs are for rheumatoid arthritis.
- Why: The previous approach assigned ONE indication per patient (most recent GP dx), ignoring which drugs actually treat which conditions. This produced misleading pathways.
Use DimSearchTerm.csv for drug-to-Search_Term mapping
- When: Determining which Search_Term a drug belongs to
- Rule: Load
data/DimSearchTerm.csv. TheCleanedDrugNamecolumn has pipe-separated drug name fragments. Match HCD drug names against these fragments using substring matching (case-insensitive). - Why: This CSV is the authoritative mapping of which drugs are used for which clinical indications.
Use substring matching for drug fragments
- When: Matching HCD drug names against DimSearchTerm CleanedDrugName fragments
- Rule: Check if any fragment from DimSearchTerm is a SUBSTRING of the HCD drug name (case-insensitive). E.g., "PEGYLATED" should match "PEGYLATED LIPOSOMAL DOXORUBICIN".
- Why: DimSearchTerm contains both full drug names (ADALIMUMAB) and partial fragments (PEGYLATED, INHALED). Exact match would miss the partial ones.
Modified UPID uses pipe delimiter
- When: Creating indication-aware UPIDs
- Rule: Format is
{original_UPID}|{search_term}. Use pipe|as delimiter. Do NOT use-(hyphen with spaces) as that's used for pathway hierarchy levels in theidscolumn. - Why: The
idscolumn uses " - " to separate hierarchy levels (e.g., "N&WICS - NNUH - rheumatoid arthritis - ADALIMUMAB"). Using the same delimiter in UPIDs would break hierarchy parsing.
Return ALL GP matches per patient, not just most recent
- When: Querying Snowflake for patient GP diagnoses
- Rule: Remove
QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY EventDateTime DESC) = 1. Return ALL matching Search_Terms per patient withGROUP BY + COUNT(*)for code_frequency. - Why: A patient may have GP diagnoses for both rheumatoid arthritis AND asthma. We need ALL matches to cross-reference with their drugs.
Restrict GP code lookup to HCD data window
- When: Building the WHERE clause for the GP record query
- Rule: Add
AND pc."EventDateTime" >= :earliest_hcd_datewhereearliest_hcd_dateisMIN(Intervention Date)from the HCD DataFrame. Pass this as a parameter toget_patient_indication_groups(). - Why: Old GP codes from years before treatment started add noise. A diagnosis coded 10 years ago may no longer be relevant. Restricting to the HCD window ensures code_frequency reflects recent clinical activity for the conditions being actively treated.
Tiebreaker: highest GP code frequency when a drug matches multiple indications
- When: A single drug maps to multiple Search_Terms AND the patient has GP dx for multiple
- Rule: Use
code_frequency(COUNT of matching SNOMED codes per Search_Term per patient) from the GP query. The Search_Term with the most matching codes in the patient's GP record wins. If tied, use alphabetical Search_Term for determinism. - Why: E.g., ADALIMUMAB is listed under rheumatoid arthritis, crohn's disease, psoriatic arthritis, etc. A patient with 47 RA codes and 2 crohn's codes is almost certainly on ADALIMUMAB for RA. Frequency of GP coding is a much stronger signal of clinical intent than recency — a recent one-off asthma check doesn't mean ADALIMUMAB is for asthma.
Same patient, different indications = separate modified UPIDs
- When: A patient's drugs map to different Search_Terms
- Rule: Create separate modified UPIDs for each indication. E.g.,
RMV12345|rheumatoid arthritisandRMV12345|asthma. These are treated as separate "patients" by the pathway analyzer. - Why: This is the core design — drugs for different indications should create separate treatment pathways, even for the same physical patient.
Fallback to directory for unmatched drugs
- When: A drug doesn't match any Search_Term OR the patient has no GP dx for any of the drug's Search_Terms
- Rule: Use fallback format:
{UPID}|{Directory} (no GP dx). The indication_df maps this to"{Directory} (no GP dx)". - Why: Maintains consistent behavior with the previous approach for patients/drugs without GP diagnosis matches.
Merge asthma Search_Terms but keep urticaria separate
- When: Working with asthma-related Search_Terms from CLUSTER_MAPPING_SQL or DimSearchTerm.csv
- Rule: Merge "allergic asthma", "asthma", and "severe persistent allergic asthma" into a single "asthma" Search_Term. Keep "urticaria" as a separate Search_Term — do NOT merge it with asthma.
- Why: These are clinically the same condition at different severity levels. Splitting them fragments the data. Urticaria is a distinct dermatological condition that happens to share OMALIZUMAB.
Don't modify directory chart processing
- When: Making changes to the indication matching logic
- Rule: Only modify the indication chart path (
elif current_chart_type == "indication":). Directory charts use unmodified UPIDs and directory-based grouping. - Why: Directory charts work correctly and should not be affected by indication matching changes.
Snowflake Query Guardrails
Use PseudoNHSNoLinked for GP record matching
- When: Querying GP records (PrimaryCareClinicalCoding) for patient diagnoses
- Rule: Use
PseudoNHSNoLinkedcolumn from HCD data, NOTPersonKey(LocalPatientID) - Why: PersonKey is provider-specific local ID. Only PseudoNHSNoLinked matches PatientPseudonym in GP records.
Embed cluster query as CTE in Snowflake
- When: Looking up patient indications during data refresh
- Rule: Use the
CLUSTER_MAPPING_SQLcontent as a WITH clause in the patient lookup query - Why: This ensures we always use the complete cluster mapping and don't need local storage
Quote mixed-case column aliases in Snowflake SQL
- When: Writing SELECT queries that return results to Python code
- Rule: Use
AS "ColumnName"(quoted) for any column alias you'll access by name in Python - Why: Snowflake uppercases unquoted identifiers.
SELECT foo AS Search_TermreturnsSEARCH_TERM, sorow.get('Search_Term')returns None. Fix:SELECT foo AS "Search_Term"
Build indication_df from all unique UPIDs, not PseudoNHSNoLinked
- When: Creating the indication mapping DataFrame for pathway processing
- Rule: Use
df.drop_duplicates(subset=['UPID'])notdrop_duplicates(subset=['PseudoNHSNoLinked']) - Why: A patient visiting multiple providers has multiple UPIDs. Using unique PseudoNHSNoLinked only maps one UPID per patient, leaving others as NaN.
Data Processing Guardrails
Copy DataFrames in functions that modify columns
- When: Writing functions like
prepare_data()that modify DataFrame columns - Rule: Always
df = df.copy()at the start of any function that modifies column values on the input DataFrame - Why:
prepare_data()mapped Provider Code → Name in-place. When called multiple times on the same DataFrame, only the first call worked. The fix:df.copy()prevents destructive mutation.
Include chart_type in UNIQUE constraints for pathway_nodes
- When: Creating or modifying the pathway_nodes table schema
- Rule: The UNIQUE constraint MUST include
chart_type:UNIQUE(date_filter_id, chart_type, ids) - Why: Without
chart_type,INSERT OR REPLACEsilently overwrites directory chart nodes when indication chart nodes are inserted.
Handle NaN in Directory when building fallback labels
- When: Creating fallback indication labels for patients without GP diagnosis match
- Rule: Check
pd.notna(directory)before concatenating to string. Use"UNKNOWN (no GP dx)"for NaN cases. - Why: NaN handling prevents TypeError and ensures meaningful fallback labels.
Use parameterized queries for SQLite
- When: Building WHERE clauses with user-selected filters
- Rule: Use
?placeholders and pass params tuple — never string interpolation - Why: Prevents SQL injection and handles special characters in drug/directory names
Use existing pathway_analyzer functions
- When: Processing pathway data for the icicle chart
- Rule: Reuse functions from
analysis/pathway_analyzer.py— don't reinvent - Why: The existing code handles edge cases (empty groups, statistics calculation, color mapping)
Reflex Guardrails
Use .to() methods for Var operations in rx.foreach
- When: Working with items inside
rx.foreachrender functions - Rule: Use
item.to(int)for numeric comparisons,item.to_string()for text operations - Why: Items from rx.foreach are Var objects, not plain Python values.
Use rx.cond for conditional rendering, not Python if
- When: Conditionally showing/hiding components or changing styles based on state
- Rule: Use
rx.cond(condition, true_component, false_component)— not Pythonif - Why: Python
ifevaluates at definition time;rx.condevaluates reactively at render time
Process Guardrails
One task per iteration
- When: Temptation to do additional tasks after completing the current one
- Rule: Complete ONE task, validate it, commit it, update progress, then stop
- Why: Multiple tasks increase error risk and make failures harder to diagnose
Never mark complete without validation
- When: Task feels "done" but hasn't been tested
- Rule: All validation tiers must pass before marking
[x] - Why: "Feels done" is not "is done"
Write explicit handoff notes
- When: Every iteration, before stopping
- Rule: The "Next iteration should" section must contain specific, actionable guidance
- Why: The next iteration has zero memory. If you don't write it down, it's lost.
Check existing code for patterns
- When: Unsure how to implement something
- Rule: Look at
pathways_app/pathways_app.py,analysis/pathway_analyzer.py,cli/refresh_pathways.py - Why: The existing codebase has solved many quirks already
Snowflake connection_timeout must be high enough for GP lookup queries
- When: GP record queries against PrimaryCareClinicalCoding time out
- Rule: Ensure
connection_timeoutin config/snowflake.toml is at least 600 (currently set to 600). This controls the Python client'snetwork_timeout, which is how long the client waits for ANY Snowflake response. Do NOT lower this value. - Why: GP lookup queries take ~40s per batch due to CTE compilation overhead. With connection_timeout=30, every batch timed out silently (error 000604/57014).
Use large batch sizes (5000+) for GP record lookups
- When: Calling
get_patient_indication_groups()with patient batches - Rule: Use batch_size=5000 or larger. The query time is ~40s regardless of batch size (5 patients ≈ 500 patients ≈ 5000 patients). Smaller batches just multiply the fixed overhead.
- Why: With batch_size=500, 36K patients needed 74 batches × 40s = ~50 min. With batch_size=5000, only 8 batches × 45s = ~6 min. The bottleneck is CTE compilation, not data volume.