feat: complete drug-aware indication matching and cleanup app_v2

- Remove app_v2.py (consolidated into pathways_app.py), fix __init__ import
- Add DimSearchTerm.csv, drug_indication_clusters.csv, drug_snomed_mapping_enriched.csv as reference data for SNOMED-based indication matching
- Add snomed_indication_mapping_query.sql (source for embedded cluster mapping)
- Update DESIGN_SYSTEM.md, RALPH_PROMPT.md, ralph.ps1, uv.lock
2026-02-06 00:33:29 +00:00
parent f3bba6dfab
commit a31907aa1f
11 changed files with 41902 additions and 2306 deletions
@@ -1,6 +1,8 @@
-# Ralph Wiggum Loop - Reflex UI Redesign
+# Ralph Wiggum Loop - Drug-Aware Indication Matching

-You are operating inside an automated loop building a Reflex frontend application. Each iteration you receive fresh context — you have NO memory of previous iterations. Your only memory is the filesystem.
+You are operating inside an automated loop extending a pathway analysis application with drug-aware indication matching. Each iteration you receive fresh context — you have NO memory of previous iterations. Your only memory is the filesystem.
+
+**Current Focus**: Update indication charts so that patient indications are matched **per drug**, not just per patient. Each drug must be validated against the patient's GP diagnoses AND the drug-to-indication mapping from DimSearchTerm.csv.

 ## First Actions Every Iteration

@@ -9,7 +11,7 @@ Read these files in this order before doing anything else:
 1. `progress.txt` — What previous iterations accomplished, what's blocked, and what to do next. The most recent entry is most important.
 2. `IMPLEMENTATION_PLAN.md` — Task list with status markers, project overview, and completion criteria.
 3. `guardrails.md` — Known failure patterns to avoid. You MUST read and follow these.
-4. `DESIGN_SYSTEM.md` — Color palette, typography, spacing, and component specifications.
+4. `CLAUDE.md` — Project architecture and code patterns.

 Then run `git log --oneline -5` to see recent commits.

@@ -18,48 +20,109 @@ Then run `git log --oneline -5` to see recent commits.
 Narrate your work as you go. Your output is the only visibility the operator has into what's happening. For every significant action, explain what you're doing and why:

 - **Reading files**: "Reading progress.txt to check what the last iteration accomplished..."
-- **Creating components**: "Creating the top_bar() component with logo, title, and chart tabs..."
-- **Debugging**: "Reflex compilation failed with TypeError. Checking the error — looks like rx.foreach issue..."
-- **Testing**: "Running reflex compile to verify the component renders..."
-- **Making decisions**: "The design system specifies Primary Blue #0066CC for buttons. Using that."
-- **Committing**: "Committing styles.py — design token module complete."
+- **Creating code**: "Adding assign_drug_indications() function to diagnosis_lookup.py..."
+- **Debugging**: "Drug matching returned 0 results for ADALIMUMAB. Checking DimSearchTerm lookup..."
+- **Testing**: "Running import check to verify the new function is accessible..."
+- **Making decisions**: "The guardrails say to use substring matching for drug fragments."
+- **Committing**: "Committing drug-indication matching logic."

 Do NOT just output a summary at the end. Narrate throughout. Think of this as a live log of your reasoning.

 ## Task Selection

-Pick the highest-priority task that is READY to work on:
+You have flexibility to choose which task to work on. Use your judgement, but document your reasoning.

 1. Read ALL tasks in IMPLEMENTATION_PLAN.md — understand the full picture
 2. Skip any marked `[x]` (complete) or `[B]` (blocked)
-3. Check progress.txt for guidance — if the previous iteration recommended a specific next task, prefer that unless it's blocked
-4. If no guidance exists, pick the first `[ ]` (ready) task in the first incomplete phase
-5. Mark your chosen task `[~]` (in progress) in IMPLEMENTATION_PLAN.md
+3. Check progress.txt for guidance — the previous iteration may have recommendations
+4. **Choose a task** based on:
+   - Dependencies (some tasks require others to be done first)
+   - Logical flow (query changes before matching logic, matching before pipeline integration)
+   - Your assessment of what would be most valuable to tackle next
+   - Previous iteration's recommendations (consider but don't blindly follow)
+5. **Document your reasoning**: Before starting work, briefly explain WHY you chose this task over others
+6. Mark your chosen task `[~]` (in progress) in IMPLEMENTATION_PLAN.md

 If your chosen task turns out to be blocked during work:
 - Mark it `[B]` with a reason in IMPLEMENTATION_PLAN.md
 - Document the blocker in progress.txt
-- Move to the next ready task within this same iteration
+- Move to a different ready task within this same iteration

 ## Development

 Work on ONE task per iteration. Build incrementally and verify as you go.

+### Key Concepts
+
+**Drug-Indication Matching Flow:**
+1. Get patient's GP-matched Search_Terms from Snowflake (ALL matches, not just most recent, with code_frequency)
+   - Only count GP codes from MIN(Intervention Date) onwards (the HCD data window)
+2. Load DimSearchTerm.csv to get which drugs belong to which Search_Terms
+3. For each patient-drug pair: intersection of (Search_Terms listing this drug) AND (patient's GP matches)
+   - If multiple matches: pick highest code_frequency (most GP coding = most likely indication)
+4. Modify UPID to include matched indication: `{UPID}|{search_term}`
+5. Drugs sharing the same indication for the same patient → same modified UPID → same pathway
+6. Drugs under different indications → different modified UPIDs → separate pathways
+
+**DimSearchTerm.csv:**
+- `Search_Term`: Clinical condition (e.g., "rheumatoid arthritis")
+- `CleanedDrugName`: Pipe-separated drug fragments (e.g., "ADALIMUMAB|GOLIMUMAB|...")
+- `PrimaryDirectorate`: The directorate for this condition
+- Drug matching: check if any fragment is a substring of the HCD drug name (case-insensitive)
+
+**Modified UPID Format:**
+- Original: `RMV12345` (Provider Code[:3] + PersonKey)
+- Modified: `RMV12345|rheumatoid arthritis`
+- Fallback: `RMV12345|RHEUMATOLOGY (no GP dx)`
+- The existing pathway analyzer treats UPID as an opaque identifier — this works transparently
+
 ### Code Patterns

-- **Use design tokens**: Import from `pathways_app/styles.py` — never hardcode colors/spacing
-- **Reflex Vars in rx.foreach**: Use `.to(int)` for comparisons, `.to_string()` for text interpolation
-- **Component functions**: Each component should be a function returning `rx.Component`
-- **State class**: All reactive state goes in the `AppState` class
-- **Computed properties**: Use `@rx.var` decorator for derived values
+- **Snowflake queries**: Use parameterized queries, embed the cluster CTE from CLUSTER_MAPPING_SQL
+- **GP record matching**: Return ALL matches per patient (not just most recent)
+- **Drug mapping**: Load from `data/DimSearchTerm.csv`, match drug name fragments
+- **Pathway pipeline**: Use existing functions — modified UPIDs flow through naturally
+- **Reflex state**: No changes expected — indication charts already work, just with better matching
+
+### Key Data Structures
+
+**GP Matches (from Snowflake) — updated to return ALL matches with frequency:**
+```python
+# Multiple rows per patient (one per matched Search_Term)
+# code_frequency = COUNT of matching SNOMED codes (used as tiebreaker)
+# Only counts codes from MIN(Intervention Date) onwards
+# DataFrame columns: PatientPseudonym, Search_Term, code_frequency
+```
+
+**Drug-to-Indication Mapping (from DimSearchTerm.csv):**
+```python
+# search_term → list of drug fragments
+{"rheumatoid arthritis": ["ABATACEPT", "ADALIMUMAB", "ANAKINRA", ...]}
+```
+
+**Modified HCD Data:**
+```python
+# Original UPID replaced with indication-aware UPID
+df["UPID"] = "RMV12345|rheumatoid arthritis"  # for matched drugs
+df["UPID"] = "RMV12345|RHEUMATOLOGY (no GP dx)"  # for unmatched drugs
+```
+
+**Indication DataFrame:**
+```python
+# Maps modified UPID → Search_Term (for pathway hierarchy level 2)
+indication_df = pd.DataFrame({
+    'Directory': ['rheumatoid arthritis', 'asthma', 'CARDIOLOGY (no GP dx)']
+}, index=['RMV12345|rheumatoid arthritis', 'RMV12345|asthma', 'RMV67890|CARDIOLOGY (no GP dx)'])
+```

 ### Verification Steps

 After writing code, ALWAYS verify:

-1. **Syntax check**: `python -m py_compile pathways_app/app_v2.py`
-2. **Import check**: `python -c "from pathways_app.app_v2 import app"`
-3. **Reflex compile**: Run `reflex run` briefly to check for compilation errors
+1. **Syntax check**: `python -m py_compile <file.py>`
+2. **Import check**: `python -c "from module import function"`
+3. **For database changes**: Test with query against pathways.db
+4. **For Reflex changes**: `python -m reflex compile`

 If any step fails, fix the issue before proceeding.

@@ -69,18 +132,19 @@ Every task MUST pass validation before being marked complete:

 ### Tier 1: Code Validation (MANDATORY)
 - Code compiles without Python syntax errors
-- Reflex compiles the app without errors
+- Imports work without errors
 - No TypeErrors, ImportErrors, or AttributeErrors

-### Tier 2: Visual Validation (MANDATORY for UI tasks)
-- Component renders in the browser
-- Styling matches DESIGN_SYSTEM.md specifications
-- Responsive behavior works (if applicable)
+### Tier 2: Data Validation (for data/pipeline tasks)
+- Queries return expected row counts
+- Data structures have correct columns/types
+- Drug-indication matching produces valid results
+- Modified UPIDs have correct format

-### Tier 3: Functional Validation (MANDATORY for state/logic tasks)
-- State changes trigger expected UI updates
-- Computed properties return correct values
-- Filters produce expected data transformations
+### Tier 3: Functional Validation (for UI/integration tasks)
+- Reflex compiles the app without errors
+- State changes trigger expected behavior
+- Both chart types render correctly

 ### Validation Failure

@@ -97,8 +161,7 @@ Before marking ANY task `[x]`, ALL of these must be true:
 1. Code is saved to the appropriate file(s)
 2. Tier 1 code validation passed
 3. Tier 2/3 validation passed (as applicable)
-4. Design tokens used — no hardcoded colors, fonts, or spacing
-5. All changes committed to git with a descriptive message
+4. All changes committed to git with a descriptive message

 These are non-negotiable. A task that "feels done" but hasn't passed all gates is NOT done.

@@ -109,18 +172,21 @@ After completing your work (whether the task succeeded, failed, or was blocked),
 ```
 ## Iteration [N] — [YYYY-MM-DD]
 ### Task: [which task you worked on]
+### Why this task:
+- [Brief explanation of why you chose this task over others]
+- [What dependencies or logical flow led to this choice]
 ### Status: COMPLETE | BLOCKED | IN PROGRESS
 ### What was done:
 - [Specific actions taken]
 ### Validation results:
-- Tier 1 (Code): [syntax check, import check, reflex compile]
-- Tier 2 (Visual): [what was checked visually, or N/A]
-- Tier 3 (Functional): [what logic was tested, or N/A]
+- Tier 1 (Code): [syntax check, import check]
+- Tier 2 (Data): [query results, row counts]
+- Tier 3 (Functional): [reflex compile, UI check]
 ### Files changed:
 - [list of files created/modified]
 ### Committed: [git hash] "[commit message]"
 ### Patterns discovered:
-- [Any reusable learnings — Reflex quirks, component patterns]
+- [Any reusable learnings — query patterns, matching logic quirks]
 ### Next iteration should:
 - [Explicit guidance for what the next fresh instance should do first]
 - [Note any context that would be lost without writing it here]
@@ -132,8 +198,8 @@ If you discover a failure pattern that future iterations should avoid, add it to

 ## Commit Changes

-1. Stage changed files (styles.py, app_v2.py, etc.)
-2. Use a descriptive commit message referencing the task (e.g., "feat: create design tokens module")
+1. Stage changed files
+2. Use a descriptive commit message referencing the task (e.g., "feat: add drug-indication matching function (Task 2.1)")
 3. Commit after your task is validated and complete — one commit per logical unit of work
 4. If you updated progress.txt with a blocked status, commit that too

@@ -141,7 +207,7 @@ If you discover a failure pattern that future iterations should avoid, add it to

 If ALL tasks in IMPLEMENTATION_PLAN.md are marked `[x]`:

-1. Run `reflex run` and verify the app works end-to-end
+1. Run `reflex compile` to verify that the app compiles
 2. Verify all completion criteria at the bottom of IMPLEMENTATION_PLAN.md are satisfied
 3. Only then output the completion signal on its own line:

@@ -156,10 +222,15 @@ DO NOT paraphrase, vary, or conditionally output this string.
 ## Rules

 - Complete ONE task per iteration, then update progress and stop
-- ALWAYS read progress.txt, guardrails.md, and DESIGN_SYSTEM.md before starting work
-- **Use design tokens** — never hardcode hex colors, pixel values, or font names
-- **Reflex Var safety** — use `.to()` methods when working with Vars from rx.foreach or computed properties
+- ALWAYS read progress.txt and guardrails.md before starting work
+- **Match drugs to indications** — not just patients to indications
+- **Use DimSearchTerm.csv** for drug-to-Search_Term mapping
+- **Return ALL GP matches** — not just most recent (remove QUALIFY ROW_NUMBER = 1)
+- **Modified UPID format**: `{UPID}|{search_term}` — pipe delimiter is safe
+- **Use PseudoNHSNoLinked** — NOT PersonKey for GP record matching
+- **Substring matching** for drug fragments from DimSearchTerm.csv
 - Keep commits atomic and well-described
 - If stuck on the same issue for more than 2 attempts within one iteration, document it in progress.txt and move to the next ready task
-- When in doubt, check the existing `pathways_app.py` for patterns that work
-- The goal is a working, beautiful app — correctness and visual quality matter equally
+- When in doubt, check existing code for patterns that work
+- **Pipeline before UI** — processing logic before Reflex changes
+- **Don't change directory charts** — only indication chart matching changes
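
Note on the matching flow: the per-drug matching that the updated prompt describes (fragment substring match, intersection with the patient's GP matches, highest `code_frequency` wins, fallback to a directorate label) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in this commit: the function names `parse_drug_fragments`, `candidate_terms`, and `modified_upid` are hypothetical, the GP matches are assumed to arrive as a plain dict of Search_Term to code_frequency for one patient, and the directorate used in the fallback is assumed to be passed in by the caller. Tie handling between equal frequencies is not specified in the prompt; here the first candidate encountered wins.

```python
from typing import Dict, List


def parse_drug_fragments(cleaned_drug_name: str) -> List[str]:
    """Split a pipe-separated CleanedDrugName value into upper-case fragments."""
    return [f.strip().upper() for f in cleaned_drug_name.split("|") if f.strip()]


def candidate_terms(drug_name: str, mapping: Dict[str, List[str]]) -> List[str]:
    """Search_Terms whose fragments occur in the drug name (case-insensitive substring)."""
    name = drug_name.upper()
    return [term for term, frags in mapping.items() if any(f in name for f in frags)]


def modified_upid(
    upid: str,
    drug_name: str,
    directorate: str,                 # assumption: supplied by the caller for the fallback
    mapping: Dict[str, List[str]],    # search_term -> drug fragments
    gp_matches: Dict[str, int],       # search_term -> code_frequency for this patient
) -> str:
    """Build '{UPID}|{search_term}', or the '(no GP dx)' fallback when nothing matches."""
    candidates = [t for t in candidate_terms(drug_name, mapping) if t in gp_matches]
    if not candidates:
        return f"{upid}|{directorate} (no GP dx)"
    # Highest code_frequency is treated as the most likely indication.
    best = max(candidates, key=lambda t: gp_matches[t])
    return f"{upid}|{best}"


# Made-up example data, for illustration only.
mapping = {
    "rheumatoid arthritis": parse_drug_fragments("ADALIMUMAB|GOLIMUMAB"),
    "psoriasis": parse_drug_fragments("ADALIMUMAB|SECUKINUMAB"),
}
gp_matches = {"rheumatoid arthritis": 7, "asthma": 3}

matched = modified_upid("RMV12345", "Adalimumab 40mg injection", "RHEUMATOLOGY", mapping, gp_matches)
unmatched = modified_upid("RMV12345", "Secukinumab 150mg", "DERMATOLOGY", mapping, gp_matches)
```

With this example data, `matched` is `RMV12345|rheumatoid arthritis` and `unmatched` is `RMV12345|DERMATOLOGY (no GP dx)`, so the two drugs for the same patient would land in separate pathways, as the prompt intends.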