refactor: reorganize repository to src/ layout

Move 6 packages (core, config, data_processing, analysis, visualization, cli) into src/ to reduce root clutter. Merge tools/data.py into data_processing/transforms.py. Move docs to docs/. Path resolution via .pth file (setup_dev.py), pytest pythonpath config, and sys.path bootstrap in rxconfig.py and CLI entry points. Clean up pyproject.toml deps (remove stale pins, add snowflake-connector-python). Fix tomllib import for Python 3.10 compatibility. All 113 tests pass.
2026-02-06 12:03:48 +00:00
parent 1581b1d3dd
commit 76838887e6
40 changed files with 589 additions and 214 deletions
@@ -0,0 +1,192 @@
+# Snowflake Reference
+
+Essential database context for querying NHS data. Read this every iteration when working with Snowflake.
+
+---
+
+## Snowflake MCP Server
+
+Use `mcp__snowflake-mcp__*` functions to explore schema and test queries.
+
+### Schema Discovery (USE THESE FIRST)
+- `test_connection()` - Verify connectivity
+- `list_databases()` - List accessible databases
+- `list_schemas(database_name)` - List schemas in a database
+- `list_tables(database, schema)` - List tables with descriptions
+- `list_views(schema_name, database)` - List views with descriptions
+- `describe_table(table_name, database)` - Get detailed table schema
+- `describe_query(query, database)` - Preview query output columns without execution
+
+### Query Execution
+- `read_data(query, database, max_rows)` - Execute SELECT queries with row limits
+- `read_data_paginated(query, database, page_size, page)` - Paginated results with total count
+- `read_data_pandas(query, database, max_rows, output_format)` - Results in pandas-friendly formats
+
+### Async Query Support (long-running queries)
+- `execute_async(query, database)` - Submit asynchronously, returns query_id
+- `get_query_status(query_id, database)` - Check status
+- `get_async_results(query_id, database, max_rows)` - Retrieve results
+
+### Usage Guidelines
+- **ALWAYS** verify table structures and column names via MCP before writing queries
+- Test with small result sets (`LIMIT 20`) before full execution
+- Use `describe_query` to preview complex query outputs before running
+- Use async queries for operations expected to take >30 seconds
+
+---
+
+## Database Overview
+
+| Database | Purpose |
+|----------|---------|
+| `DATA_HUB` | **Analyst-curated** data warehouse - primary source for most queries |
+| `PRIMARY_CARE` | Raw extracts from EMIS and TPP clinical systems |
+| `NATIONAL` | NHS England national datasets (SUS, ECDS, MHSDS, etc.) |
+| `FACTS_AND_DIMENSIONS_ALL_DATA` | External reference data (BNF, SNOMED, QOF clusters) |
+| `REPORTING_DATASETS_ICB` | Reporting outputs and analyst workspaces (includes SCRATCHPAD) |
+
+**Avoid**: `SYSTEM` database.
+
+---
+
+## Key Tables and Views
+
+### DATA_HUB.DWH (Dimensions)
+
+| View | Purpose | Key Columns |
+|------|---------|-------------|
+| `DimMedicineAndDevice` | Master medication/device reference | `ProductSnomedCode`, `TherapeuticMoietySnomedCode` (VTM), `BNFParagraphCode`, `StrengthDescription`, `ProductDescription` |
+| `DimPerson` | Patient demographics | `PatientPseudonym`, `PersonKey`, `CurrentGeneralPractice`, `IsCurrentNWRegistered`, `YearMonthBirth` |
+| `DimSnomedCode` | SNOMED code descriptions | `SnomedCode`, `SnomedDescription` |
+| `DimOrganisationAndSite` | GP practices and NHS orgs | `SiteCode`, `OrganisationName`, `OrganisationSubType`, `IsSiteNorfolkAndWaveney`, `IsSiteActive` |
+| `DimDate` | Date dimension | |
+| `DimCondition` | Clinical conditions | Long-term condition flags |
+| `DimDeprivation` | Deprivation rankings by area | |
+
+**CRITICAL**:
+- `ProductDescription` is the correct column for product names. `ProductName` does NOT exist.
+- `IsLatest` does NOT exist in `DimMedicineAndDevice`.
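+
+A minimal lookup sketch using only the columns listed above. The search term `adalimumab` is an illustrative assumption, not a value confirmed in the warehouse:
+
+```sql
+-- Look up products by name via ProductDescription (ProductName does not exist)
+SELECT
+    "ProductSnomedCode",
+    "ProductDescription",
+    "StrengthDescription",
+    "BNFParagraphCode"
+FROM DATA_HUB.DWH."DimMedicineAndDevice"
+WHERE "ProductDescription" ILIKE '%adalimumab%'  -- hypothetical search term
+LIMIT 20
+```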
+
+### DATA_HUB.CDM (Common Data Model)
+
+| View | Purpose | Key Columns |
+|------|---------|-------------|
+| `Acute__Conmon__PatientLevelDrugs` | HCD activity data | `PseudoNHSNoLinked`, `InterventionDate`, `DrugName`, `Price Actual` |
+
+**Note**: `PseudoNHSNoLinked` in the HCD data holds the same pseudonym as `PatientPseudonym` in the GP data, so join on these two columns to link patients.
+
+### DATA_HUB.PHM (Population Health Management)
+
+| View | Purpose | Key Columns |
+|------|---------|-------------|
+| `PrimaryCareClinicalCoding` | **Unified** clinical coding (EMIS + TPP, no duplicates) | `PatientPseudonym`, `SNOMEDCode`, `EventDateTime`, `NumericValue` |
+| `PrimaryCareMedication` | **Unified** medication data (EMIS + TPP, no duplicates) | `PatientPseudonym`, `SNOMEDCode`, `DateMedicationStart`, `Quantity` |
+| `ClinicalCodingClusterSnomedCodes` | SNOMED codes grouped by cluster | `ClusterId`, `SnomedCode` |
+| `PersonCohort` | Pre-defined patient cohorts | |
+
+**Prefer DATA_HUB.PHM unified views** over raw PRIMARY_CARE tables.
+
+---
+
+## Patient Identifiers
+
+| Identifier | Source | Usage |
+|------------|--------|-------|
+| `PatientPseudonym` | DATA_HUB, NATIONAL | Primary - use for most joins |
+| `PseudoNHSNoLinked` | DATA_HUB.CDM (HCD data) | Links to PatientPseudonym |
+| `PersonKey` | DATA_HUB.DWH.DimPerson | Integer key for person dimension |
+
+### Standard Join Patterns
+```sql
+-- HCD Activity to GP Diagnosis
+FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
+LEFT JOIN DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
+  ON hcd."PseudoNHSNoLinked" = pcc."PatientPseudonym"
+
+-- Activity to Person Demographics
+FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
+INNER JOIN DATA_HUB.DWH."DimPerson" dp
+  ON hcd."PseudoNHSNoLinked" = dp."PatientPseudonym"
+```
+
+---
+
+## CRITICAL: Registered Population Filter
+
+**ALWAYS** apply when counting patients:
+
+```sql
+WHERE dp."IsCurrentNWRegistered" = 'Yes'
+  AND dp."CurrentGeneralPractice" <> '*'
+```
+
+Without this filter, patient counts are inflated roughly 2x, because they include deceased, deregistered, and out-of-area patients.
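+
+As a sketch, a registered-population headcount built only from the `DimPerson` columns listed earlier might look like this (the output alias `RegisteredPatients` is an assumption):
+
+```sql
+-- Count currently registered patients, excluding unknown practices
+SELECT COUNT(DISTINCT dp."PatientPseudonym") AS "RegisteredPatients"
+FROM DATA_HUB.DWH."DimPerson" dp
+WHERE dp."IsCurrentNWRegistered" = 'Yes'
+  AND dp."CurrentGeneralPractice" <> '*'
+```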
+
+---
+
+## Query Development Patterns
+
+### Clinical Condition Detection (GP SNOMED Clusters)
+```sql
+-- Get all SNOMED codes for a clinical cluster (defined as a CTE so it can be reused)
+WITH cluster_codes AS (
+    SELECT "SnomedCode"
+    FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"
+    WHERE "ClusterId" = 'RARTH_COD'  -- Rheumatoid arthritis
+)
+-- Check if patient has condition
+SELECT DISTINCT pcc."PatientPseudonym"
+FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
+WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)
+  AND pcc."PatientPseudonym" IS NOT NULL
+```
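+
+One possible way to combine condition detection with the HCD join and the registered population filter is sketched below. It assumes the documented join keys and filter apply unchanged; the CTE names and the `PatientCount` alias are illustrative:
+
+```sql
+-- Registered HCD patients with a rheumatoid arthritis code, counted by drug
+WITH cluster_codes AS (
+    SELECT "SnomedCode"
+    FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"
+    WHERE "ClusterId" = 'RARTH_COD'
+),
+ra_patients AS (
+    SELECT DISTINCT pcc."PatientPseudonym"
+    FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
+    WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)
+      AND pcc."PatientPseudonym" IS NOT NULL
+)
+SELECT
+    hcd."DrugName",
+    COUNT(DISTINCT hcd."PseudoNHSNoLinked") AS "PatientCount"
+FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
+INNER JOIN ra_patients rp
+  ON hcd."PseudoNHSNoLinked" = rp."PatientPseudonym"
+INNER JOIN DATA_HUB.DWH."DimPerson" dp
+  ON hcd."PseudoNHSNoLinked" = dp."PatientPseudonym"
+WHERE dp."IsCurrentNWRegistered" = 'Yes'
+  AND dp."CurrentGeneralPractice" <> '*'
+GROUP BY hcd."DrugName"
+ORDER BY "PatientCount" DESC
+LIMIT 20
+```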
+
+### Available SNOMED Clusters for HCD Indications
+- `RARTH_COD` (155 codes) - Rheumatoid arthritis
+- `PSORIASIS_COD` (116 codes) - Psoriasis
+- `CROHNS_COD` (93 codes) - Crohn's disease
+- `ULCCOLITIS_COD` (62 codes) - Ulcerative colitis
+- `MS_COD` (44 codes) - Multiple sclerosis
+- `DM_COD` / `DMTYPE1_COD` / `DMTYPE2AUDIT_COD` - Diabetes
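+
+The code counts listed here may drift as clusters are updated. A sketch for re-checking them, assuming the cluster IDs above are still current (the `CodeCount` alias is illustrative):
+
+```sql
+-- Recount SNOMED codes per cluster
+SELECT
+    "ClusterId",
+    COUNT(DISTINCT "SnomedCode") AS "CodeCount"
+FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"
+WHERE "ClusterId" IN (
+    'RARTH_COD', 'PSORIASIS_COD', 'CROHNS_COD',
+    'ULCCOLITIS_COD', 'MS_COD'
+)
+GROUP BY "ClusterId"
+ORDER BY "ClusterId"
+```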
+
+### Sample HCD Activity Query
+```sql
+SELECT
+    hcd."PseudoNHSNoLinked" AS "PatientPseudonym",
+    hcd."DrugName",
+    hcd."InterventionDate",
+    hcd."Provider Code",
+    hcd."OrganisationName"
+FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
+WHERE hcd."InterventionDate" >= '2024-01-01'
+LIMIT 20
+```
+
+---
+
+## Snowflake SQL Syntax
+
+- Double-quote identifiers: `"PatientPseudonym"`
+- Date literals: `'2025-04-01'::DATE`
+- Date functions: `DATEADD('MONTH', -3, date)`, `DATEDIFF('YEAR', d1, d2)`, `LAST_DAY(date)`
+- Boolean: `TRUE`/`FALSE`
+- No `TOP N` - use `LIMIT N`
+- `COALESCE()`, `NULLIF()`, `GREATEST()` are available; note that `GREATEST()` returns `NULL` if any argument is `NULL`
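+
+A short sketch that exercises several of these rules together (quoted identifiers, a cast date literal, `DATEADD`, and `LIMIT`). The three-month window ending on `2025-04-01` is an arbitrary illustrative choice:
+
+```sql
+-- HCD activity in the three months before 1 April 2025
+SELECT
+    hcd."DrugName",
+    hcd."InterventionDate"
+FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
+WHERE hcd."InterventionDate" >= DATEADD('MONTH', -3, '2025-04-01'::DATE)
+  AND hcd."InterventionDate" < '2025-04-01'::DATE
+LIMIT 20
+```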
+
+---
+
+## Troubleshooting
+
+### Column not found errors
+1. Use `describe_table(table_name, database)` to get actual column names
+2. Remember: Snowflake identifiers are case-sensitive when quoted; unquoted identifiers are folded to uppercase
+3. Common mistakes: `ProductName` (wrong) vs `ProductDescription` (correct)
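+
+If the MCP tools are unavailable, column names can usually be listed in SQL as well. This sketch assumes your role can read `DATA_HUB.INFORMATION_SCHEMA`, which has not been verified for this environment:
+
+```sql
+-- List the actual column names of a view, in definition order
+SELECT COLUMN_NAME, DATA_TYPE
+FROM DATA_HUB.INFORMATION_SCHEMA.COLUMNS
+WHERE TABLE_SCHEMA = 'DWH'
+  AND TABLE_NAME = 'DimMedicineAndDevice'
+ORDER BY ORDINAL_POSITION
+```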
+
+### Empty results
+1. Check patient identifier filtering (`IS NOT NULL`)
+2. Check date ranges
+3. Test with `LIMIT 20` first to see sample data
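+
+The first two checks can be sketched as a single profiling query. It uses the HCD view as an example, and the output aliases are illustrative:
+
+```sql
+-- Compare total rows with rows that have a patient identifier,
+-- and show the date range actually present in the data
+SELECT
+    COUNT(*) AS "TotalRows",
+    COUNT(hcd."PseudoNHSNoLinked") AS "RowsWithIdentifier",
+    MIN(hcd."InterventionDate") AS "EarliestDate",
+    MAX(hcd."InterventionDate") AS "LatestDate"
+FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
+```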
+
+### Slow queries
+1. Add `LIMIT` during development
+2. Use `describe_query` to validate structure before execution
+3. Consider async execution for large result sets