refactor: reorganize repository to src/ layout

Move 6 packages (core, config, data_processing, analysis, visualization, cli)
into src/ to reduce root clutter. Merge tools/data.py into
data_processing/transforms.py. Move docs to docs/.

Path resolution via .pth file (setup_dev.py), pytest pythonpath config,
and sys.path bootstrap in rxconfig.py and CLI entry points.

Clean up pyproject.toml deps (remove stale pins, add snowflake-connector-python).
Fix tomllib import for Python 3.10 compatibility.

All 113 tests pass.
Andrew Charlwood
2026-02-06 12:03:48 +00:00
parent 1581b1d3dd
commit 76838887e6
40 changed files with 589 additions and 214 deletions
@@ -0,0 +1,192 @@
# Snowflake Reference
Essential database context for querying NHS data. Read this every iteration when working with Snowflake.
---
## Snowflake MCP Server
Use `mcp__snowflake-mcp__*` functions to explore schema and test queries.
### Schema Discovery (USE THESE FIRST)
- `test_connection()` - Verify connectivity
- `list_databases()` - List accessible databases
- `list_schemas(database_name)` - List schemas in a database
- `list_tables(database, schema)` - List tables with descriptions
- `list_views(schema_name, database)` - List views with descriptions
- `describe_table(table_name, database)` - Get detailed table schema
- `describe_query(query, database)` - Preview query output columns without execution
### Query Execution
- `read_data(query, database, max_rows)` - Execute SELECT queries with row limits
- `read_data_paginated(query, database, page_size, page)` - Paginated results with total count
- `read_data_pandas(query, database, max_rows, output_format)` - Results in pandas-friendly formats
### Async Query Support (long-running queries)
- `execute_async(query, database)` - Submit asynchronously, returns query_id
- `get_query_status(query_id, database)` - Check status
- `get_async_results(query_id, database, max_rows)` - Retrieve results
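The submit, poll, retrieve flow can be sketched as a small polling loop. This is a minimal sketch under stated assumptions: `wait_for_query` is a hypothetical helper, `get_status` stands in for whatever wrapper you put around `get_query_status`, and the status strings `"SUCCESS"` and `"FAILED"` are guesses, so check what the server actually returns before relying on them.

```python
import time
from typing import Callable


def wait_for_query(
    query_id: str,
    get_status: Callable[[str], str],
    done_states: frozenset = frozenset({"SUCCESS"}),   # assumed status value
    failed_states: frozenset = frozenset({"FAILED"}),  # assumed status value
    poll_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status(query_id) until the query finishes, fails, or times out."""
    waited = 0.0
    while True:
        status = get_status(query_id)
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"Query {query_id} ended with status {status!r}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Query {query_id} still {status!r} after {waited:.0f}s")
        # Wait before asking again so the server is not hammered with status checks.
        sleep(poll_seconds)
        waited += poll_seconds
```

Once the helper returns, the results would be fetched with `get_async_results(query_id, database, max_rows)`.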
### Usage Guidelines
- **ALWAYS** verify table structures and column names via MCP before writing queries
- Test with small result sets (`LIMIT 20`) before full execution
- Use `describe_query` to preview complex query outputs before running
- Use async queries for operations expected to take >30 seconds
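One way to follow the "small result set first" guideline is to wrap any candidate query in an outer `LIMIT` before sending it to `read_data`. The helper below is a hypothetical sketch (`with_limit` and the `preview` alias are not part of the MCP server); it only builds the SQL text and does not execute anything.

```python
def with_limit(query: str, limit: int = 20) -> str:
    """Wrap a SELECT statement in an outer query that caps the rows returned."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Drop surrounding whitespace and a trailing semicolon so the statement
    # can be nested as a subquery.
    inner = query.strip().rstrip(";").rstrip()
    return f"SELECT * FROM (\n{inner}\n) AS preview LIMIT {limit}"
```

Wrapping rather than appending `LIMIT` keeps the helper safe for queries that already end in their own `ORDER BY` or `LIMIT` clause.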
---
## Database Overview
| Database | Purpose |
|----------|---------|
| `DATA_HUB` | **Analyst-curated** data warehouse - primary source for most queries |
| `PRIMARY_CARE` | Raw extracts from EMIS and TPP clinical systems |
| `NATIONAL` | NHS England national datasets (SUS, ECDS, MHSDS, etc.) |
| `FACTS_AND_DIMENSIONS_ALL_DATA` | External reference data (BNF, SNOMED, QOF clusters) |
| `REPORTING_DATASETS_ICB` | Reporting outputs and analyst workspaces (includes SCRATCHPAD) |
**Avoid**: `SYSTEM` database.
---
## Key Tables and Views
### DATA_HUB.DWH (Dimensions)
| View | Purpose | Key Columns |
|------|---------|-------------|
| `DimMedicineAndDevice` | Master medication/device reference | `ProductSnomedCode`, `TherapeuticMoietySnomedCode` (VTM), `BNFParagraphCode`, `StrengthDescription`, `ProductDescription` |
| `DimPerson` | Patient demographics | `PatientPseudonym`, `PersonKey`, `CurrentGeneralPractice`, `IsCurrentNWRegistered`, `YearMonthBirth` |
| `DimSnomedCode` | SNOMED code descriptions | `SnomedCode`, `SnomedDescription` |
| `DimOrganisationAndSite` | GP practices and NHS orgs | `SiteCode`, `OrganisationName`, `OrganisationSubType`, `IsSiteNorfolkAndWaveney`, `IsSiteActive` |
| `DimDate` | Date dimension | |
| `DimCondition` | Clinical conditions | Long-term condition flags |
| `DimDeprivation` | Deprivation rankings by area | |
**CRITICAL**:
- `ProductDescription` is the correct column for product names. `ProductName` does NOT exist.
- `IsLatest` does NOT exist in `DimMedicineAndDevice`.
### DATA_HUB.CDM (Common Data Model)
| View | Purpose | Key Columns |
|------|---------|-------------|
| `Acute__Conmon__PatientLevelDrugs` | HCD activity data | `PseudoNHSNoLinked`, `InterventionDate`, `DrugName`, `Price Actual` |
**Note**: HCD `PseudoNHSNoLinked` = GP `PatientPseudonym` for patient linkage.
### DATA_HUB.PHM (Population Health Management)
| View | Purpose | Key Columns |
|------|---------|-------------|
| `PrimaryCareClinicalCoding` | **Unified** clinical coding (EMIS + TPP, no duplicates) | `PatientPseudonym`, `SNOMEDCode`, `EventDateTime`, `NumericValue` |
| `PrimaryCareMedication` | **Unified** medication data (EMIS + TPP, no duplicates) | `PatientPseudonym`, `SNOMEDCode`, `DateMedicationStart`, `Quantity` |
| `ClinicalCodingClusterSnomedCodes` | SNOMED codes grouped by cluster | `ClusterId`, `SnomedCode` |
| `PersonCohort` | Pre-defined patient cohorts | |
**Prefer DATA_HUB.PHM unified views** over raw PRIMARY_CARE tables.
---
## Patient Identifiers
| Identifier | Source | Usage |
|------------|--------|-------|
| `PatientPseudonym` | DATA_HUB, NATIONAL | Primary - use for most joins |
| `PseudoNHSNoLinked` | DATA_HUB.CDM (HCD data) | Links to PatientPseudonym |
| `PersonKey` | DATA_HUB.DWH.DimPerson | Integer key for person dimension |
### Standard Join Patterns
```sql
-- HCD activity to GP diagnosis
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
LEFT JOIN DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
    ON hcd."PseudoNHSNoLinked" = pcc."PatientPseudonym"

-- HCD activity to person demographics
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
INNER JOIN DATA_HUB.DWH."DimPerson" dp
    ON hcd."PseudoNHSNoLinked" = dp."PatientPseudonym"
```
---
## CRITICAL: Registered Population Filter
**ALWAYS** apply when counting patients:
```sql
WHERE dp."IsCurrentNWRegistered" = 'Yes'
AND dp."CurrentGeneralPractice" <> '*'
```
Without this filter, counts are inflated roughly 2x because they include deceased, deregistered, and out-of-area patients.
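To make the filter hard to forget, it can be kept in one place and attached to every patient count. The sketch below assumes counts are taken from `DATA_HUB.DWH."DimPerson"` aliased as `dp`; `registered_patient_count_sql` and the `"PatientCount"` alias are hypothetical names used only for illustration.

```python
# The registered-population filter, written once so every count reuses it.
REGISTERED_FILTER = (
    'dp."IsCurrentNWRegistered" = \'Yes\'\n'
    '  AND dp."CurrentGeneralPractice" <> \'*\''
)


def registered_patient_count_sql(extra_conditions: tuple = ()) -> str:
    """Build a DimPerson patient count that always includes the registered filter."""
    where = REGISTERED_FILTER
    # extra_conditions are trusted SQL fragments written by the analyst,
    # never raw user input.
    for condition in extra_conditions:
        where += f"\n  AND {condition}"
    return (
        'SELECT COUNT(DISTINCT dp."PatientPseudonym") AS "PatientCount"\n'
        'FROM DATA_HUB.DWH."DimPerson" dp\n'
        f"WHERE {where}"
    )
```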
---
## Query Development Patterns
### Clinical Condition Detection (GP SNOMED Clusters)
```sql
-- Collect all SNOMED codes for a clinical cluster, then find patients
-- with at least one matching code
WITH cluster_codes AS (
    SELECT "SnomedCode"
    FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"
    WHERE "ClusterId" = 'RARTH_COD'  -- Rheumatoid arthritis
)
SELECT DISTINCT pcc."PatientPseudonym"
FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)
  AND pcc."PatientPseudonym" IS NOT NULL
```
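When the cluster changes between runs, the cluster lookup and patient match above can be generated as a single statement with the cluster ID passed as a bind parameter. This is a sketch only: `cluster_patients_query` is a hypothetical helper, the `%s` placeholder assumes the default `pyformat` parameter style of `snowflake-connector-python`, and the ID pattern (uppercase letters, digits, underscores) is inferred from the cluster names listed in this document.

```python
import re

# Assumed shape of a cluster ID, based on examples such as RARTH_COD.
_CLUSTER_ID = re.compile(r"[A-Z0-9_]+")


def cluster_patients_query(cluster_id: str) -> tuple:
    """Return (sql, params) selecting patients with any code in the cluster."""
    if not _CLUSTER_ID.fullmatch(cluster_id):
        raise ValueError(f"Unexpected cluster id: {cluster_id!r}")
    sql = (
        "WITH cluster_codes AS (\n"
        '    SELECT "SnomedCode"\n'
        '    FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"\n'
        '    WHERE "ClusterId" = %s\n'
        ")\n"
        'SELECT DISTINCT pcc."PatientPseudonym"\n'
        'FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc\n'
        'WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)\n'
        '  AND pcc."PatientPseudonym" IS NOT NULL'
    )
    # The value travels separately from the SQL text, so it is never
    # interpolated into the statement.
    return sql, (cluster_id,)
```

With an open connector cursor, the pair would be passed as `cursor.execute(sql, params)`.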
### Available SNOMED Clusters for HCD Indications
- `RARTH_COD` (155 codes) - Rheumatoid arthritis
- `PSORIASIS_COD` (116 codes) - Psoriasis
- `CROHNS_COD` (93 codes) - Crohn's disease
- `ULCCOLITIS_COD` (62 codes) - Ulcerative colitis
- `MS_COD` (44 codes) - Multiple sclerosis
- `DM_COD` / `DMTYPE1_COD` / `DMTYPE2AUDIT_COD` - Diabetes
### Sample HCD Activity Query
```sql
SELECT
    hcd."PseudoNHSNoLinked" AS "PatientPseudonym",
    hcd."DrugName",
    hcd."InterventionDate",
    hcd."Provider Code",
    hcd."OrganisationName"
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
WHERE hcd."InterventionDate" >= '2024-01-01'
LIMIT 20
```
---
## Snowflake SQL Syntax
- Double-quote identifiers: `"PatientPseudonym"`
- Date literals: `'2025-04-01'::DATE`
- Date functions: `DATEADD('MONTH', -3, date)`, `DATEDIFF('YEAR', d1, d2)`, `LAST_DAY(date)`
- Boolean: `TRUE`/`FALSE`
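- Row limits: prefer `LIMIT N` (Snowflake also accepts `SELECT TOP N`)
- `COALESCE()`, `NULLIF()`, `GREATEST()` are available; note that `GREATEST()` returns NULL if any argument is NULL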
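The quoting rule matters most when SQL is assembled in code, because column names such as `Price Actual` contain spaces. A minimal sketch of a quoting helper follows; `quote_ident` and `qualified_name` are hypothetical names, and the only Snowflake rule relied on is that a double quote inside a quoted identifier is escaped by doubling it.

```python
def quote_ident(name: str) -> str:
    """Return name as a double-quoted Snowflake identifier."""
    if not name:
        raise ValueError("identifier must not be empty")
    # Embedded double quotes are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


def qualified_name(database: str, schema: str, obj: str) -> str:
    """Join database, schema, and object into a fully quoted dotted name."""
    return ".".join(quote_ident(part) for part in (database, schema, obj))
```

For example, `quote_ident("Price Actual")` yields `"Price Actual"`, and `qualified_name("DATA_HUB", "DWH", "DimPerson")` yields `"DATA_HUB"."DWH"."DimPerson"`.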
---
## Troubleshooting
### Column not found errors
1. Use `describe_table(table_name, database)` to get actual column names
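2. Remember: Snowflake identifiers are case-sensitive when quoted; unquoted identifiers resolve to uppercase, so `PatientPseudonym` without quotes is looked up as `PATIENTPSEUDONYM`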
3. Common mistakes: `ProductName` (wrong) vs `ProductDescription` (correct)
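The checks above can be automated once the real column names are in hand. The sketch below assumes the names have already been extracted from the `describe_table` output into a plain list (the shape of that output is not specified here), and `check_columns` is a hypothetical helper.

```python
def check_columns(required: list, actual: list) -> dict:
    """Map each missing column to a case-insensitive match, or None if there is none."""
    by_lower = {name.lower(): name for name in actual}
    actual_set = set(actual)
    return {
        name: by_lower.get(name.lower())
        for name in required
        if name not in actual_set  # exact, case-sensitive comparison
    }
```

A result of `None` means the column does not exist under any casing (as with `ProductName`); a non-`None` value means the name exists but was written with the wrong case.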
### Empty results
1. Check patient identifier filtering (`IS NOT NULL`)
2. Check date ranges
3. Test with `LIMIT 20` first to see sample data
### Slow queries
1. Add `LIMIT` during development
2. Use `describe_query` to validate structure before execution
3. Consider async execution for large result sets