chore: archive unused files and move legacy code to can_delete

archive/ - unused reference files (no active code references):
- LookupSearchTermCleanedDrugName.csv, condition_directorate_mapping.csv
- na_directory_rows.csv (diagnostic output), ta-recommendations.xlsx
- snomed_indication_mapping_query.sql (source for embedded SQL)
- IMPROVEMENT_RECOMMENDATIONS.md, power query.pq

archive/can_delete/ - legacy code and logs safe to remove:
- dashboard_gui.py (replaced by Reflex app)
- pathways_app_old.py.bak (old backup)
- Ralph loop iteration logs (iterations 2-8)
2026-02-06 01:01:02 +00:00
parent a31907aa1f
commit bb93c1673e
23 changed files with 121509 additions and 0 deletions
@@ -0,0 +1,100 @@
+
+
+I'll start by reading the required files in order, as specified by the Ralph loop instructions.
+Good, I've read the three critical files. Now let me check recent commits and understand the current state.
+
+Reading progress.txt: Last iteration (2) completed Task 1.2 (merge asthma Search_Terms). The recommended next task is **Task 1.1** (update `get_patient_indication_groups()` to return ALL matches with code_frequency) or **Task 2.1** (create `assign_drug_indications()` function).
+
+Reading IMPLEMENTATION_PLAN.md: Tasks 1.2 and 1.3 are `[x]`. Task 1.1 is `[ ]` (ready). Tasks 2.1, 2.2, 3.1, 3.2, 4.1-4.3 are all `[ ]`.
+
+Let me check recent commits and read the current diagnosis_lookup.py to understand the code I need to modify.
+Now I have full context. Let me make my task selection decision.
+
+**Task Selection Reasoning:**
+
+Looking at the remaining tasks:
+- **Task 1.1** (Update `get_patient_indication_groups()` to return ALL matches with code_frequency): This is the next logical dependency. Tasks 2.1 and 2.2 need the GP matches data to be in the right format (multiple rows per patient with code_frequency). The previous iteration explicitly recommended this as the next task.
+- **Task 2.1** (Create `assign_drug_indications()`): Could be built with mock data, but it's cleaner to do 1.1 first so 2.1 can use the real return type.
+
+**Decision: I'll work on Task 1.1**: Update `get_patient_indication_groups()` to return ALL matches with `code_frequency` instead of just the most recent match per patient.
+
+The changes needed:
+1. Add `earliest_hcd_date` parameter
+2. Replace `QUALIFY ROW_NUMBER()` with `GROUP BY + COUNT(*)` 
+3. Change return columns from `(PatientPseudonym, Search_Term, EventDateTime)` to `(PatientPseudonym, Search_Term, code_frequency)`
+4. Update empty DataFrame columns to match
+5. Update docstring
+
+Let me start implementing.
+First, let me mark Task 1.1 as in-progress in IMPLEMENTATION_PLAN.md.
+Now implementing the changes to `get_patient_indication_groups()`. The key changes are:
+
+1. **Add `earliest_hcd_date` parameter**: restricts GP codes to the HCD data window
+2. **Replace `QUALIFY ROW_NUMBER()` with `GROUP BY + COUNT(*)`**: returns ALL matches with frequency
+3. **Update return columns**: `code_frequency` instead of `EventDateTime`
+4. **Update empty DataFrame columns** to match new return type
+5. **Update docstring**
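The difference between the two query shapes listed above can be sketched locally in pandas. This is only an illustration of the semantics, not the project's SQL: the toy `events` frame and its values are hypothetical stand-ins for the Snowflake result, and only the column names come from the log.

```python
import pandas as pd

# Hypothetical GP code events; the real rows come from Snowflake.
events = pd.DataFrame(
    {
        "PatientPseudonym": ["p1", "p1", "p1", "p2"],
        "Search_Term": ["asthma", "asthma", "copd", "asthma"],
        "EventDateTime": pd.to_datetime(
            ["2024-01-01", "2024-03-01", "2024-02-01", "2024-05-01"]
        ),
    }
)

# Old shape: keep only the most recent row per patient, which is the
# effect of QUALIFY ROW_NUMBER() ... ORDER BY EventDateTime DESC = 1.
latest = (
    events.sort_values("EventDateTime", ascending=False)
    .drop_duplicates("PatientPseudonym")
    .sort_values("PatientPseudonym")
    .reset_index(drop=True)
)

# New shape: one row per (patient, Search_Term) with a count, which is
# the effect of GROUP BY + COUNT(*).
all_matches = (
    events.groupby(["PatientPseudonym", "Search_Term"])
    .size()
    .reset_index(name="code_frequency")
)

print(latest[["PatientPseudonym", "Search_Term"]])
print(all_matches)
```

In this toy data, patient `p1` has both an asthma and a COPD code. The old shape keeps one row for `p1`; the new shape keeps both Search_Terms together with how often each was coded.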
+Good, the core function is updated. Let me verify the changes compile correctly.
+Running syntax and import checks to validate the code.
+Syntax and imports both pass. Let me also verify the function signature is correct by checking the inspect output.
+The function signature is correct. Now let me check if any callers of this function need to be updated. The function has a new `earliest_hcd_date` parameter with `None` default, so existing callers remain backward-compatible. But I should verify the callers handle the new return format (multiple rows per patient, `code_frequency` instead of `EventDateTime`).
+The main caller is in `cli/refresh_pathways.py`. Let me check how it uses the result to understand if the caller needs updating (though per the plan, caller changes are Task 3.1, not 1.1).
+I can see the caller at lines 424-428 does `dict(zip(gp_matches_df['PatientPseudonym'], gp_matches_df['Search_Term']))`. This creates a 1:1 mapping. With the old query this was fine (one row per patient), but with the new query returning multiple rows per patient, `dict(zip(...))` would only keep the last match.
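A small sketch of why `dict(zip(...))` loses data once a patient has several rows. The frame below is hypothetical sample data in the new return format, and the `groupby` alternative is just one possible way to keep every match, not the approach planned for Task 3.1.

```python
import pandas as pd

# Hypothetical multi-row result in the new return format.
gp_matches_df = pd.DataFrame(
    {
        "PatientPseudonym": ["p1", "p1", "p2"],
        "Search_Term": ["asthma", "copd", "asthma"],
        "code_frequency": [2, 1, 1],
    }
)

# The existing caller pattern: a later row for the same patient
# silently overwrites the earlier one.
one_to_one = dict(
    zip(gp_matches_df["PatientPseudonym"], gp_matches_df["Search_Term"])
)

# One possible alternative (an assumption, not the project's code):
# collect every Search_Term per patient into a list.
one_to_many = (
    gp_matches_df.groupby("PatientPseudonym")["Search_Term"].agg(list).to_dict()
)

print(one_to_one)
print(one_to_many)
```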
+
+However, this is the **current** caller code and updating it is **Task 3.1** (Pipeline Integration), not Task 1.1. Task 1.1 only covers updating the function itself. The caller changes will be done in a future iteration when we integrate `assign_drug_indications()`.
+
+The key thing is that the signature change is backward-compatible: the new parameter `earliest_hcd_date` defaults to `None`, so the existing call still runs. The return format change (multiple rows per patient, `code_frequency` instead of `EventDateTime`) means the caller's logic needs updating, but that's explicitly Task 3.1.
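As a rough illustration of how an optional parameter that defaults to `None` leaves existing calls untouched, here is a minimal sketch. `build_date_filter` is a hypothetical helper, and the `pc."EventDateTime"` column reference and `%s` placeholder style are assumptions; the real function's other parameters and query text are not shown in this log.

```python
from typing import List, Optional, Tuple


def build_date_filter(
    earliest_hcd_date: Optional[str] = None,
) -> Tuple[str, List[str]]:
    """Hypothetical helper: return an extra SQL condition and its bind values.

    When no date is given, nothing is added, so callers that never pass
    the argument get the same query as before.
    """
    if earliest_hcd_date is None:
        return "", []
    # Assumed column reference and placeholder style, for illustration only.
    return 'AND pc."EventDateTime" >= %s', [earliest_hcd_date]


# Existing call style: no argument, no extra filter.
print(build_date_filter())

# New call style: restrict to the HCD data window.
print(build_date_filter("2019-04-01"))
```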
+
+Now let me also verify the empty DataFrame columns match the new return type throughout.
+All validation passes. Let me update the IMPLEMENTATION_PLAN.md to mark Task 1.1 subtasks complete.
+The last subtask "Verify: Query returns more rows than before" requires a live Snowflake connection, so it will be verified during Phase 3/4 integration testing. The code change itself is complete and validated locally. Let me note this in the implementation plan.
+Now let me update progress.txt with the iteration results.
+There are two instances of "### Blocked items: - None". Let me use more context to uniquely identify the last one.
+Now let me commit the changes.
+I need to commit only the files I changed in this iteration: `data_processing/diagnosis_lookup.py`, `IMPLEMENTATION_PLAN.md`, and `progress.txt`.
+Commit successful. Let me update progress.txt with the commit hash.
+## Iteration 3 Summary
+
+**Task completed: 1.1 - Update `get_patient_indication_groups()` to return ALL matches with code_frequency**
+
+**Changes made to `data_processing/diagnosis_lookup.py`:**
+1. **Removed** `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY EventDateTime DESC) = 1`, which was returning only the single most recent match per patient
+2. **Replaced** with `GROUP BY pc."PatientPseudonym", aic.Search_Term` + `COUNT(*) AS "code_frequency"`, which now returns ALL matching Search_Terms per patient with a frequency count
+3. **Added** `earliest_hcd_date: Optional[str] = None` parameter, which restricts GP codes to the HCD data window when provided
+4. **Updated** return columns: `code_frequency` replaces `EventDateTime`
+5. **Updated** all empty DataFrame returns and docstring
+
+**Validation:**
+- py_compile: PASSED
+- Import check: PASSED
+- Function signature: Verified (4 params, new `earliest_hcd_date` defaults to None)
+- Empty DataFrame columns: `['PatientPseudonym', 'Search_Term', 'code_frequency']` verified
+
+**Commits:**
+- `c93417f` - feat: return ALL GP matches with code_frequency
+- `947b87a` - docs: update progress.txt
+
+**Next iteration should work on Task 2.1:** Create `assign_drug_indications()` function, the core logic that cross-references GP matches with drug mappings to produce modified UPIDs.
+