chore: archive unused files and move legacy code to can_delete

archive/ — unused reference files (no active code references):
  - LookupSearchTermCleanedDrugName.csv, condition_directorate_mapping.csv
  - na_directory_rows.csv (diagnostic output), ta-recommendations.xlsx
  - snomed_indication_mapping_query.sql (source for embedded SQL)
  - IMPROVEMENT_RECOMMENDATIONS.md, power query.pq

archive/can_delete/ — legacy code and logs safe to remove:
  - dashboard_gui.py (replaced by Reflex app)
  - pathways_app_old.py.bak (old backup)
  - Ralph loop iteration logs (iterations 2-8)
Andrew Charlwood
2026-02-06 01:01:02 +00:00
parent a31907aa1f
commit bb93c1673e
23 changed files with 121509 additions and 0 deletions
@@ -0,0 +1,100 @@
I'll start by reading the required files in order, as specified by the Ralph loop instructions.
Good, I've read the three critical files. Now let me check recent commits and understand the current state.
Reading progress.txt: Last iteration (2) completed Task 1.2 (merge asthma Search_Terms). The recommended next task is **Task 1.1** (update `get_patient_indication_groups()` to return ALL matches with code_frequency) or **Task 2.1** (create `assign_drug_indications()` function).
Reading IMPLEMENTATION_PLAN.md: Tasks 1.2 and 1.3 are `[x]`. Task 1.1 is `[ ]` (ready). Tasks 2.1, 2.2, 3.1, 3.2, 4.1-4.3 are all `[ ]`.
Let me check recent commits and read the current diagnosis_lookup.py to understand the code I need to modify.
Now I have full context. Let me make my task selection decision.
**Task Selection Reasoning:**
Looking at the remaining tasks:
- **Task 1.1** (Update `get_patient_indication_groups()` to return ALL matches with code_frequency) - This is the next logical dependency. Tasks 2.1 and 2.2 need the GP matches data to be in the right format (multiple rows per patient with code_frequency). The previous iteration explicitly recommended this as the next task.
- **Task 2.1** (Create `assign_drug_indications()`) - Could be built with mock data, but it's cleaner to do 1.1 first so 2.1 can use the real return type.
**Decision: I'll work on Task 1.1** - Update `get_patient_indication_groups()` to return ALL matches with `code_frequency` instead of just the most recent match per patient.
The changes needed:
1. Add `earliest_hcd_date` parameter
2. Replace `QUALIFY ROW_NUMBER()` with `GROUP BY + COUNT(*)`
3. Change return columns from `(PatientPseudonym, Search_Term, EventDateTime)` to `(PatientPseudonym, Search_Term, code_frequency)`
4. Update empty DataFrame columns to match
5. Update docstring
Let me start implementing.
First, let me mark Task 1.1 as in-progress in IMPLEMENTATION_PLAN.md.
Now implementing the changes to `get_patient_indication_groups()`. The key changes are:
1. **Add `earliest_hcd_date` parameter** - restricts GP codes to the HCD data window
2. **Replace `QUALIFY ROW_NUMBER()` with `GROUP BY + COUNT(*)`** - returns ALL matches with frequency
3. **Update return columns** - `code_frequency` instead of `EventDateTime`
4. **Update empty DataFrame columns** to match new return type
5. **Update docstring**
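As a rough illustration of changes 1-3, here is a minimal sketch using an in-memory SQLite table rather than the real Snowflake query. The table `gp_codes`, its columns, the helper `gp_matches()`, and the sample rows are all hypothetical stand-ins, and treating `earliest_hcd_date` as an inclusive lower bound on the event date is an assumption about what "restricts GP codes to the HCD data window" means.

```python
import sqlite3
from typing import Optional


def gp_matches(conn: sqlite3.Connection, earliest_hcd_date: Optional[str] = None):
    """Return one (patient, search_term, code_frequency) row per patient/term pair."""
    sql = "SELECT patient, search_term, COUNT(*) AS code_frequency FROM gp_codes "
    params = []
    if earliest_hcd_date is not None:
        # Assumed semantics: keep only codes on or after the given date.
        # ISO-8601 date strings compare correctly as plain text.
        sql += "WHERE event_date >= ? "
        params.append(earliest_hcd_date)
    # GROUP BY keeps every matching term per patient, with a frequency count,
    # instead of a single most-recent row per patient.
    sql += "GROUP BY patient, search_term ORDER BY patient, search_term"
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gp_codes (patient TEXT, search_term TEXT, event_date TEXT)")
conn.executemany(
    "INSERT INTO gp_codes VALUES (?, ?, ?)",
    [
        ("P1", "asthma", "2019-03-01"),
        ("P1", "asthma", "2021-06-15"),
        ("P1", "psoriasis", "2022-01-10"),
        ("P2", "asthma", "2020-11-30"),
    ],
)

all_rows = gp_matches(conn)
windowed = gp_matches(conn, earliest_hcd_date="2020-01-01")
print(all_rows)
print(windowed)
```

With no date filter, patient P1 gets two rows (one per term); with the filter, the 2019 asthma code no longer counts towards the frequency.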
Good, the core function is updated. Let me verify the changes compile correctly.
Running syntax and import checks to validate the code.
Syntax and imports both pass. Let me also verify the function signature is correct by checking the inspect output.
The function signature is correct. Now let me check if any callers of this function need to be updated. The function has a new `earliest_hcd_date` parameter with `None` default, so existing callers remain backward-compatible. But I should verify the callers handle the new return format (multiple rows per patient, `code_frequency` instead of `EventDateTime`).
The main caller is in `cli/refresh_pathways.py`. Let me check how it uses the result to understand if the caller needs updating (though per the plan, caller changes are Task 3.1, not 1.1).
I can see the caller at lines 424-428 does `dict(zip(gp_matches_df['PatientPseudonym'], gp_matches_df['Search_Term']))`. This creates a 1:1 mapping. With the old query this was fine (one row per patient), but with the new query returning multiple rows per patient, `dict(zip(...))` would only keep the last match.
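To make the overwrite behaviour concrete, here is a small sketch with plain lists standing in for the DataFrame columns. The sample values are made up, and the grouped structure at the end is only one possible shape for illustration, not what Task 3.1 will necessarily use.

```python
from collections import defaultdict

# Hypothetical rows in the new format: several Search_Terms for one patient.
patients = ["P1", "P1", "P2"]
terms = ["asthma", "psoriasis", "asthma"]
frequencies = [2, 1, 1]

# The existing caller pattern: later rows silently overwrite earlier ones.
one_to_one = dict(zip(patients, terms))
print(one_to_one)

# One way to keep every match: collect (term, frequency) pairs per patient.
all_matches = defaultdict(list)
for patient, term, freq in zip(patients, terms, frequencies):
    all_matches[patient].append((term, freq))
print(dict(all_matches))
```

In `one_to_one`, P1's asthma match is lost because the psoriasis row comes later and reuses the same key.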
However, this is the **current** caller code and updating it is **Task 3.1** (Pipeline Integration), not Task 1.1. Task 1.1 only covers updating the function itself. The caller changes will be done in a future iteration when we integrate `assign_drug_indications()`.
The key thing is that the signature change is backward-compatible: the new parameter `earliest_hcd_date` defaults to `None`, so the existing call still runs. The return format change (multiple rows per patient, `code_frequency` instead of `EventDateTime`) means the caller's logic needs updating, but that's explicitly Task 3.1.
Now let me also verify the empty DataFrame columns match the new return type throughout.
All validation passes. Let me update the IMPLEMENTATION_PLAN.md to mark Task 1.1 subtasks complete.
The last subtask "Verify: Query returns more rows than before" requires a live Snowflake connection, so that will be verified during Phase 3/4 integration testing. The code change itself is complete and validated locally. Let me note this in the implementation plan.
Now let me update progress.txt with the iteration results.
There are two instances of "### Blocked items: - None". Let me use more context to uniquely identify the last one.
Now let me commit the changes.
I need to commit only the files I changed in this iteration: `data_processing/diagnosis_lookup.py`, `IMPLEMENTATION_PLAN.md`, and `progress.txt`.
Commit successful. Let me update progress.txt with the commit hash.
## Iteration 3 Summary
**Task completed: 1.1 - Update `get_patient_indication_groups()` to return ALL matches with code_frequency**
**Changes made to `data_processing/diagnosis_lookup.py`:**
1. **Removed** `QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY EventDateTime DESC) = 1` - was returning only the single most recent match per patient
2. **Replaced** with `GROUP BY pc."PatientPseudonym", aic.Search_Term` + `COUNT(*) AS "code_frequency"` - now returns ALL matching Search_Terms per patient with a frequency count
3. **Added** `earliest_hcd_date: Optional[str] = None` parameter - restricts GP codes to the HCD data window when provided
4. **Updated** return columns: `code_frequency` replaces `EventDateTime`
5. **Updated** all empty DataFrame returns and docstring
**Validation:**
- py_compile: PASSED
- Import check: PASSED
- Function signature: Verified (4 params, new `earliest_hcd_date` defaults to None)
- Empty DataFrame columns: `['PatientPseudonym', 'Search_Term', 'code_frequency']` verified
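The signature check can be approximated with `inspect.signature`. This is a sketch against a stub, not the real function: the stub name and its first three parameter names are hypothetical, and only the parameter count and the `earliest_hcd_date` default come from the notes above.

```python
import inspect
from typing import Optional


def get_patient_indication_groups_stub(
    connector=None,        # hypothetical parameter name
    patient_ids=None,      # hypothetical parameter name
    batch_size=500,        # hypothetical parameter name and default
    earliest_hcd_date: Optional[str] = None,
):
    """Stand-in with the same shape as the updated function (4 params)."""
    return None


signature = inspect.signature(get_patient_indication_groups_stub)
param_count = len(signature.parameters)
new_param_default = signature.parameters["earliest_hcd_date"].default

print(signature)
print(param_count, new_param_default)
```

Because the new parameter comes last and has a default, a call that passes only the earlier arguments still binds without error.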
**Commits:**
- `c93417f` - feat: return ALL GP matches with code_frequency
- `947b87a` - docs: update progress.txt
**Next iteration should work on Task 2.1:** Create `assign_drug_indications()` function, the core logic that cross-references GP matches with drug mappings to produce modified UPIDs.