chore: archive unused files and move legacy code to can_delete

archive/ - unused reference files (no active code references):
- LookupSearchTermCleanedDrugName.csv, condition_directorate_mapping.csv
- na_directory_rows.csv (diagnostic output), ta-recommendations.xlsx
- snomed_indication_mapping_query.sql (source for embedded SQL)
- IMPROVEMENT_RECOMMENDATIONS.md, power query.pq

archive/can_delete/ - legacy code and logs safe to remove:
- dashboard_gui.py (replaced by Reflex app)
- pathways_app_old.py.bak (old backup)
- Ralph loop iteration logs (iterations 2-8)
2026-02-06 01:01:02 +00:00
parent a31907aa1f
commit bb93c1673e
23 changed files with 121509 additions and 0 deletions
@@ -0,0 +1,80 @@
+
+
+I'll start by reading the required files in order to understand the current state of the project.
+Now checking recent commits and reading the current state of the key file I'll be modifying.
+Good. I've read all the required files. Let me understand the current state:
+
+**Phase 1 is complete** (Tasks 1.1, 1.2, 1.3 all done). The next task is **Task 2.1: Create `assign_drug_indications()` function**, which is the core drug-aware matching logic.
+
+**Why Task 2.1**: 
+- All Phase 1 dependencies are complete (Snowflake query returns ALL matches with code_frequency, drug mapping loaded from CSV, asthma merged)
+- Task 2.1 is the foundation for Phase 2; Task 2.2 (tiebreaker logic) is naturally part of 2.1
+- This can be built and tested with mock data (no Snowflake needed)
+- Phase 3 (pipeline integration) depends on this function existing
+
+Let me read the current refresh_pathways.py to understand the integration point and how gp_matches_df is currently used.
+Now I have the full picture. I can see the critical issue in `refresh_pathways.py` lines 424-428: `dict(zip(...))` only keeps the LAST match per patient, which loses data now that the new query returns multiple rows per patient. This will be fixed in Task 3.1.
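A minimal sketch of that overwrite problem, assuming a small stand-in for `gp_matches_df`. The rows and the `Search_Term` column name are hypothetical; only `PseudoNHSNoLinked` and `code_frequency` are named in this log. It is not the code at lines 424-428, just the same pattern in isolation.

```python
import pandas as pd

# Hypothetical stand-in for gp_matches_df: patient P1 has two GP matches.
gp_matches_df = pd.DataFrame({
    "PseudoNHSNoLinked": ["P1", "P1", "P2"],
    "Search_Term": ["rheumatoid arthritis", "psoriatic arthritis", "uveitis"],
    "code_frequency": [47, 3, 5],
})

# dict(zip(...)) silently overwrites earlier rows that share a key,
# so only the last match per patient survives.
last_only = dict(zip(gp_matches_df["PseudoNHSNoLinked"],
                     gp_matches_df["Search_Term"]))

# Grouping by patient keeps every match instead.
all_matches = (
    gp_matches_df.groupby("PseudoNHSNoLinked")["Search_Term"]
    .apply(list)
    .to_dict()
)

print(last_only)
print(all_matches)
```

With the mock rows above, P1's higher-frequency match is the one dropped by `dict(zip(...))`, which is why a per-patient collection is needed before any drug-aware matching.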
+
+Now I'm ready to build Task 2.1: `assign_drug_indications()`. Let me mark it in-progress and start implementing.
+Now implementing the `assign_drug_indications()` function. The key logic:
+
+1. For each UPID + Drug Name pair in the HCD data:
+   - Look up patient's GP-matched Search_Terms (from gp_matches_df via PseudoNHSNoLinked)
+   - Look up which Search_Terms list this drug (from drug mapping)
+   - Intersect → valid indications
+   - Pick highest code_frequency as tiebreaker (or alphabetical if tied)
+   - Modify UPID: `{UPID}|{search_term}` or fallback `{UPID}|{Directory} (no GP dx)`
+
+2. Build indication_df mapping modified UPID → Search_Term
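The steps above can be sketched for a single UPID + Drug Name pair as follows. This is an illustration under stated assumptions, not the function added to `diagnosis_lookup.py`: the helper name `pick_indication`, its parameters, and the plain-dict inputs are all hypothetical.

```python
def pick_indication(upid, directory, patient_terms, drug_terms):
    """Choose one indication for a single UPID + drug pair (hypothetical helper).

    patient_terms: dict of GP-matched Search_Term -> code_frequency
    drug_terms:    set of Search_Terms that list this drug
    Returns (modified_upid, indication_label).
    """
    # Intersect the patient's GP matches with the drug's Search_Terms.
    valid = {t: f for t, f in patient_terms.items() if t in drug_terms}

    if not valid:
        # Fallback when no GP diagnosis matches this drug.
        label = f"{directory} (no GP dx)"
        return f"{upid}|{label}", label

    # Highest code_frequency wins; ties fall back to alphabetical order.
    best = min(valid, key=lambda t: (-valid[t], t))
    return f"{upid}|{best}", best


# Assumed mock inputs for illustration only.
drug_terms = {"psoriatic arthritis", "rheumatoid arthritis", "uveitis"}
by_freq = pick_indication(
    "U1", "RHEUMATOLOGY",
    {"rheumatoid arthritis": 47, "psoriatic arthritis": 3}, drug_terms)
tied = pick_indication(
    "U1", "RHEUMATOLOGY",
    {"rheumatoid arthritis": 5, "psoriatic arthritis": 5}, drug_terms)
unmatched = pick_indication("U1", "RHEUMATOLOGY", {}, drug_terms)
```

Negating the frequency inside the key tuple lets one `min` call handle both the primary ordering and the alphabetical tiebreak.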
+
+I'll add this function to `diagnosis_lookup.py` since that's where all the related functions live.
+Now I'll add the `assign_drug_indications()` function. I'll insert it after `get_search_terms_for_drug()` (line 1204) and before the `CLUSTER_MAPPING_SQL` section (line 1207).
+Now I need to update the `__all__` exports to include `assign_drug_indications`.
+Now let me validate the code. Running py_compile and import checks.
+Syntax check passed. Now import check.
+Import check passed. Now let me test with mock data to verify the function works correctly. This covers Task 2.2 (tiebreaker logic) as well.
+All tests pass. Let me also test the tiebreaker with equal frequencies to verify alphabetical fallback.
+The alphabetical tiebreaker sorts by `(-freq, term)`, so for equal freq, "crohn's disease" < "psoriatic arthritis" alphabetically, which means crohn's should come first. But the apostrophe in "crohn's disease" might be causing a sorting issue. Let me check.
+Hmm, "crohn's disease" IS alphabetically first. Let me debug what's happening in the function.
+I see the issue. "crohn's disease" from DimSearchTerm.csv is not in the ADALIMUMAB Search_Terms list. Looking at the output, ADALIMUMAB matches: ankylosing spondylitis, axial spondyloarthritis, plaque psoriasis, psoriatic arthritis, rheumatoid arthritis, ulcerative colitis, uveitis. "crohn's disease" is NOT in that list (ADALIMUMAB doesn't match crohn's in DimSearchTerm.csv). So only psoriatic arthritis is in the intersection, giving a single match, not a tiebreaker.
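A small sketch of why that test never reached the tiebreaker, using the ADALIMUMAB Search_Terms listed above. The patient frequencies are made-up values, and the variable names are hypothetical.

```python
# ADALIMUMAB Search_Terms as reported in the debug output above.
adalimumab_terms = {
    "ankylosing spondylitis", "axial spondyloarthritis", "plaque psoriasis",
    "psoriatic arthritis", "rheumatoid arthritis", "ulcerative colitis",
    "uveitis",
}

# Assumed mock GP matches with equal frequencies (illustrative values).
patient_matches = {"crohn's disease": 10, "psoriatic arthritis": 10}


def sort_key(term):
    # Same ordering idea as (-freq, term): frequency descending, then A-Z.
    return (-patient_matches[term], term)


# Sorting alone puts "crohn's disease" first; the apostrophe is not a problem.
ranked_all = sorted(patient_matches, key=sort_key)

# But the intersection with the drug's terms runs first and removes it,
# leaving one candidate and therefore no tie to break.
ranked_valid = sorted(
    (t for t in patient_matches if t in adalimumab_terms), key=sort_key)

print(ranked_all)
print(ranked_valid)
```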
+
+My test was wrong about the data. Let me fix the test to use two Search_Terms that actually overlap with ADALIMUMAB.
+All tests pass:
+- Equal frequency → alphabetical tiebreaker works ("psoriatic arthritis" < "rheumatoid arthritis")
+- Higher frequency wins (47 RA codes vs 3 psoriatic codes → RA)
+- Empty GP matches → fallback to directory
+
+Now let me update the IMPLEMENTATION_PLAN.md to mark Task 2.1 and 2.2 complete, then commit and update progress.txt.
+Now commit the changes.
+Now update progress.txt with the iteration log.
+**Iteration 4 complete.** Here's the summary:
+
+**Task 2.1 + 2.2: `assign_drug_indications()`: COMPLETE**
+
+Added the core drug-aware indication matching function to `diagnosis_lookup.py`. For each UPID + Drug Name pair:
+1. Gets patient's GP-matched Search_Terms (via PseudoNHSNoLinked)
+2. Gets which Search_Terms list this drug (from DimSearchTerm.csv mapping)
+3. Intersects to find valid indications
+4. Tiebreaker: highest code_frequency wins, alphabetical if tied
+5. Fallback: `{Directory} (no GP dx)` for unmatched drugs
+6. Modifies UPID: `{UPID}|{search_term}`
+
+All mock data tests pass. Commits: `408976e` (function) and `d9891c8` (progress).
+
+**Next iteration**: Task 3.1, which integrates `assign_drug_indications()` into `refresh_pathways.py`, replacing the old `dict(zip())` approach that only kept one match per patient.
+