refactor: reorganize repository to src/ layout
Move 6 packages (core, config, data_processing, analysis, visualization, cli) into src/ to reduce root clutter. Merge tools/data.py into data_processing/transforms.py. Move docs to docs/. Path resolution via .pth file (setup_dev.py), pytest pythonpath config, and sys.path bootstrap in rxconfig.py and CLI entry points. Clean up pyproject.toml deps (remove stale pins, add snowflake-connector-python). Fix tomllib import for Python 3.10 compatibility. All 113 tests pass.
@@ -0,0 +1,192 @@
# Snowflake Reference

Essential database context for querying NHS data. Read this every iteration when working with Snowflake.

---

## Snowflake MCP Server

Use `mcp__snowflake-mcp__*` functions to explore schema and test queries.

### Schema Discovery (USE THESE FIRST)
- `test_connection()` - Verify connectivity
- `list_databases()` - List accessible databases
- `list_schemas(database_name)` - List schemas in a database
- `list_tables(database, schema)` - List tables with descriptions
- `list_views(schema_name, database)` - List views with descriptions
- `describe_table(table_name, database)` - Get detailed table schema
- `describe_query(query, database)` - Preview query output columns without execution

### Query Execution
- `read_data(query, database, max_rows)` - Execute SELECT queries with row limits
- `read_data_paginated(query, database, page_size, page)` - Paginated results with total count
- `read_data_pandas(query, database, max_rows, output_format)` - Results in pandas-friendly formats

### Async Query Support (long-running queries)
- `execute_async(query, database)` - Submit asynchronously, returns query_id
- `get_query_status(query_id, database)` - Check status
- `get_async_results(query_id, database, max_rows)` - Retrieve results

### Usage Guidelines
- **ALWAYS** verify table structures and column names via MCP before writing queries
- Test with small result sets (`LIMIT 20`) before full execution
- Use `describe_query` to preview complex query outputs before running
- Use async queries for operations expected to take more than 30 seconds
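
The async workflow above (submit, poll, fetch) can be sketched in Python as follows. This is a minimal sketch, not the MCP server's API: the three callables are hypothetical stand-ins for `execute_async`, `get_query_status`, and `get_async_results`, and the status strings `SUCCESS` and `FAILED` are assumptions, since the values the server actually reports are not documented here.

```python
import time
from typing import Any, Callable


def run_async_query(
    execute_async: Callable[[str, str], str],
    get_query_status: Callable[[str, str], str],
    get_async_results: Callable[[str, str, int], Any],
    query: str,
    database: str,
    max_rows: int = 20,
    poll_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    done_statuses: frozenset = frozenset({"SUCCESS"}),   # assumed value
    failed_statuses: frozenset = frozenset({"FAILED"}),  # assumed value
) -> Any:
    """Submit a query, poll until it finishes, then fetch its results."""
    query_id = execute_async(query, database)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_query_status(query_id, database)
        if status in done_statuses:
            return get_async_results(query_id, database, max_rows)
        if status in failed_statuses:
            raise RuntimeError(f"query {query_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"query {query_id} still {status} after timeout")
        time.sleep(poll_seconds)
```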

---

## Database Overview

| Database | Purpose |
|----------|---------|
| `DATA_HUB` | **Analyst-curated** data warehouse - primary source for most queries |
| `PRIMARY_CARE` | Raw extracts from EMIS and TPP clinical systems |
| `NATIONAL` | NHS England national datasets (SUS, ECDS, MHSDS, etc.) |
| `FACTS_AND_DIMENSIONS_ALL_DATA` | External reference data (BNF, SNOMED, QOF clusters) |
| `REPORTING_DATASETS_ICB` | Reporting outputs and analyst workspaces (includes SCRATCHPAD) |

**Avoid**: `SYSTEM` database.

---

## Key Tables and Views

### DATA_HUB.DWH (Dimensions)

| View | Purpose | Key Columns |
|------|---------|-------------|
| `DimMedicineAndDevice` | Master medication/device reference | `ProductSnomedCode`, `TherapeuticMoietySnomedCode` (VTM), `BNFParagraphCode`, `StrengthDescription`, `ProductDescription` |
| `DimPerson` | Patient demographics | `PatientPseudonym`, `PersonKey`, `CurrentGeneralPractice`, `IsCurrentNWRegistered`, `YearMonthBirth` |
| `DimSnomedCode` | SNOMED code descriptions | `SnomedCode`, `SnomedDescription` |
| `DimOrganisationAndSite` | GP practices and NHS orgs | `SiteCode`, `OrganisationName`, `OrganisationSubType`, `IsSiteNorfolkAndWaveney`, `IsSiteActive` |
| `DimDate` | Date dimension | |
| `DimCondition` | Clinical conditions | Long-term condition flags |
| `DimDeprivation` | Deprivation rankings by area | |

**CRITICAL**:
- `ProductDescription` is the correct column for product names. `ProductName` does NOT exist.
- `IsLatest` does NOT exist in `DimMedicineAndDevice`.

### DATA_HUB.CDM (Common Data Model)

| View | Purpose | Key Columns |
|------|---------|-------------|
| `Acute__Conmon__PatientLevelDrugs` | HCD activity data | `PseudoNHSNoLinked`, `InterventionDate`, `DrugName`, `Price Actual` |

**Note**: HCD `PseudoNHSNoLinked` = GP `PatientPseudonym` for patient linkage.

### DATA_HUB.PHM (Population Health Management)

| View | Purpose | Key Columns |
|------|---------|-------------|
| `PrimaryCareClinicalCoding` | **Unified** clinical coding (EMIS + TPP, no duplicates) | `PatientPseudonym`, `SNOMEDCode`, `EventDateTime`, `NumericValue` |
| `PrimaryCareMedication` | **Unified** medication data (EMIS + TPP, no duplicates) | `PatientPseudonym`, `SNOMEDCode`, `DateMedicationStart`, `Quantity` |
| `ClinicalCodingClusterSnomedCodes` | SNOMED codes grouped by cluster | `ClusterId`, `SnomedCode` |
| `PersonCohort` | Pre-defined patient cohorts | |

**Prefer DATA_HUB.PHM unified views** over raw PRIMARY_CARE tables.

---

## Patient Identifiers

| Identifier | Source | Usage |
|------------|--------|-------|
| `PatientPseudonym` | DATA_HUB, NATIONAL | Primary - use for most joins |
| `PseudoNHSNoLinked` | DATA_HUB.CDM (HCD data) | Links to PatientPseudonym |
| `PersonKey` | DATA_HUB.DWH.DimPerson | Integer key for person dimension |

### Standard Join Patterns
```sql
-- HCD Activity to GP Diagnosis
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
LEFT JOIN DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
    ON hcd."PseudoNHSNoLinked" = pcc."PatientPseudonym"

-- Activity to Person Demographics
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
INNER JOIN DATA_HUB.DWH."DimPerson" dp
    ON hcd."PseudoNHSNoLinked" = dp."PatientPseudonym"
```

---

## CRITICAL: Registered Population Filter

**ALWAYS** apply when counting patients:

```sql
WHERE dp."IsCurrentNWRegistered" = 'Yes'
  AND dp."CurrentGeneralPractice" <> '*'
```

Without this filter, counts are inflated roughly 2x because they include deceased, deregistered, and out-of-area patients.
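
When queries are assembled in Python, one way to avoid forgetting the filter is to generate it from a single helper. The sketch below is illustrative only: the function names and the `RegisteredPatients` output alias are hypothetical, while the table and column names are the ones listed in this reference.

```python
def registered_population_filter(alias: str = "dp") -> str:
    """Return the registered-population conditions for a DimPerson alias."""
    # Reject anything that is not a plain identifier so the alias cannot
    # smuggle extra SQL into the generated text.
    if not alias.isidentifier():
        raise ValueError(f"unexpected table alias: {alias!r}")
    return (
        f"{alias}.\"IsCurrentNWRegistered\" = 'Yes'\n"
        f"  AND {alias}.\"CurrentGeneralPractice\" <> '*'"
    )


def count_registered_patients_sql(alias: str = "dp") -> str:
    """Build a patient count over DimPerson with the filter applied."""
    return (
        f'SELECT COUNT(DISTINCT {alias}."PatientPseudonym") AS "RegisteredPatients"\n'
        f'FROM DATA_HUB.DWH."DimPerson" {alias}\n'
        f"WHERE {registered_population_filter(alias)}"
    )


if __name__ == "__main__":
    print(count_registered_patients_sql())
```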

---

## Query Development Patterns

### Clinical Condition Detection (GP SNOMED Clusters)
```sql
-- Get all SNOMED codes for a clinical cluster
WITH cluster_codes AS (
    SELECT "SnomedCode"
    FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"
    WHERE "ClusterId" = 'RARTH_COD' -- Rheumatoid arthritis
)
-- Check if patient has condition
SELECT DISTINCT pcc."PatientPseudonym"
FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)
  AND pcc."PatientPseudonym" IS NOT NULL
```

### Available SNOMED Clusters for HCD Indications
- `RARTH_COD` (155 codes) - Rheumatoid arthritis
- `PSORIASIS_COD` (116 codes) - Psoriasis
- `CROHNS_COD` (93 codes) - Crohn's disease
- `ULCCOLITIS_COD` (62 codes) - Ulcerative colitis
- `MS_COD` (44 codes) - Multiple sclerosis
- `DM_COD` / `DMTYPE1_COD` / `DMTYPE2AUDIT_COD` - Diabetes
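
To reuse the cluster lookup for any of the cluster IDs listed above, the query text can be generated from a small Python helper. This is a sketch under stated assumptions: the function name is hypothetical, and the check that cluster IDs contain only uppercase letters, digits, and underscores is inferred from the IDs shown here rather than from a documented rule.

```python
import re

# Assumed shape of a cluster ID, based on the examples in this reference.
_CLUSTER_ID = re.compile(r"[A-Z0-9_]+")


def patients_in_cluster_sql(cluster_id: str) -> str:
    """Build a query returning patients with any code in the given cluster."""
    # Validate before interpolating, because the ID is placed in a string literal.
    if not _CLUSTER_ID.fullmatch(cluster_id):
        raise ValueError(f"unexpected cluster id: {cluster_id!r}")
    return (
        "WITH cluster_codes AS (\n"
        '    SELECT "SnomedCode"\n'
        '    FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"\n'
        f"    WHERE \"ClusterId\" = '{cluster_id}'\n"
        ")\n"
        'SELECT DISTINCT pcc."PatientPseudonym"\n'
        'FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc\n'
        'WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)\n'
        '  AND pcc."PatientPseudonym" IS NOT NULL'
    )


if __name__ == "__main__":
    print(patients_in_cluster_sql("PSORIASIS_COD"))
```

The generated text can then be passed to `describe_query` or run with a small `max_rows` before full execution, in line with the usage guidelines.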

### Sample HCD Activity Query
```sql
SELECT
    hcd."PseudoNHSNoLinked" AS "PatientPseudonym",
    hcd."DrugName",
    hcd."InterventionDate",
    hcd."Provider Code",
    hcd."OrganisationName"
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
WHERE hcd."InterventionDate" >= '2024-01-01'
LIMIT 20
```

---

## Snowflake SQL Syntax

- Double-quote mixed-case identifiers: `"PatientPseudonym"` (unquoted identifiers resolve to uppercase)
- Date literals: `'2025-04-01'::DATE`
- Date functions: `DATEADD('MONTH', -3, date)`, `DATEDIFF('YEAR', d1, d2)`, `LAST_DAY(date)`
- Boolean: `TRUE`/`FALSE`
- Use `LIMIT N` for row limits (Snowflake also accepts `TOP N`, but the queries here use `LIMIT`)
- `COALESCE()`, `NULLIF()`, `GREATEST()` work as expected
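
For Python code that builds queries, the quoting rule above can be captured in a small helper. This is a minimal sketch with hypothetical function names; it relies on the Snowflake rule that a double quote inside a quoted identifier is written as two double quotes.

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    if not name:
        raise ValueError("identifier must not be empty")
    return '"' + name.replace('"', '""') + '"'


def qualified_name(*parts: str) -> str:
    """Join database, schema, and object names into one quoted path."""
    return ".".join(quote_ident(part) for part in parts)


if __name__ == "__main__":
    # Column names with spaces only work when quoted.
    print(quote_ident("Price Actual"))
    print(qualified_name("DATA_HUB", "CDM", "Acute__Conmon__PatientLevelDrugs"))
```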

---

## Troubleshooting

### Column not found errors
1. Use `describe_table(table_name, database)` to get actual column names
2. Remember: Snowflake identifiers are case-sensitive when quoted
3. Common mistakes: `ProductName` (wrong) vs `ProductDescription` (correct)
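
The checks above can be run before a query is sent. The sketch below assumes the caller has already extracted a plain list of column names from the `describe_table` output (its return format is not specified here); both function names are hypothetical.

```python
from typing import Dict, Iterable, List


def missing_columns(required: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return required column names that do not exist with that exact case."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


def case_mismatches(required: Iterable[str], available: Iterable[str]) -> Dict[str, str]:
    """Map wrongly-cased names to the actual column that differs only by case."""
    available_list = list(available)
    by_lower = {name.lower(): name for name in available_list}
    return {
        name: by_lower[name.lower()]
        for name in required
        if name not in available_list and name.lower() in by_lower
    }


if __name__ == "__main__":
    # Key columns of DimMedicineAndDevice as listed in this reference.
    columns = [
        "ProductSnomedCode",
        "TherapeuticMoietySnomedCode",
        "BNFParagraphCode",
        "StrengthDescription",
        "ProductDescription",
    ]
    print(missing_columns(["ProductName", "ProductDescription"], columns))
    print(case_mismatches(["productdescription"], columns))
```

A name reported by `case_mismatches` points to a quoting or casing problem, while a name reported only by `missing_columns` is likely a wrong column name such as `ProductName`.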

### Empty results
1. Check patient identifier filtering (`IS NOT NULL`)
2. Check date ranges
3. Test with `LIMIT 20` first to see sample data

### Slow queries
1. Add `LIMIT` during development
2. Use `describe_query` to validate structure before execution
3. Consider async execution for large result sets