Snowflake Reference
Essential database context for querying NHS data. Read this every iteration when working with Snowflake.
Snowflake MCP Server
Use mcp__snowflake-mcp__* functions to explore schema and test queries.
Schema Discovery (USE THESE FIRST)
- test_connection() - Verify connectivity
- list_databases() - List accessible databases
- list_schemas(database_name) - List schemas in a database
- list_tables(database, schema) - List tables with descriptions
- list_views(schema_name, database) - List views with descriptions
- describe_table(table_name, database) - Get detailed table schema
- describe_query(query, database) - Preview query output columns without execution
Query Execution
- read_data(query, database, max_rows) - Execute SELECT queries with row limits
- read_data_paginated(query, database, page_size, page) - Paginated results with total count
- read_data_pandas(query, database, max_rows, output_format) - Results in pandas-friendly formats
Async Query Support (long-running queries)
- execute_async(query, database) - Submit asynchronously, returns query_id
- get_query_status(query_id, database) - Check status
- get_async_results(query_id, database, max_rows) - Retrieve results
Usage Guidelines
- ALWAYS verify table structures and column names via MCP before writing queries
- Test with small result sets (LIMIT 20) before full execution
- Use describe_query to preview complex query outputs before running
- Use async queries for operations expected to take >30 seconds
Database Overview
| Database | Purpose |
|---|---|
| DATA_HUB | Analyst-curated data warehouse - primary source for most queries |
| PRIMARY_CARE | Raw extracts from EMIS and TPP clinical systems |
| NATIONAL | NHS England national datasets (SUS, ECDS, MHSDS, etc.) |
| FACTS_AND_DIMENSIONS_ALL_DATA | External reference data (BNF, SNOMED, QOF clusters) |
| REPORTING_DATASETS_ICB | Reporting outputs and analyst workspaces (includes SCRATCHPAD) |
Avoid: SYSTEM database.
Key Tables and Views
DATA_HUB.DWH (Dimensions)
| View | Purpose | Key Columns |
|---|---|---|
| DimMedicineAndDevice | Master medication/device reference | ProductSnomedCode, TherapeuticMoietySnomedCode (VTM), BNFParagraphCode, StrengthDescription, ProductDescription |
| DimPerson | Patient demographics | PatientPseudonym, PersonKey, CurrentGeneralPractice, IsCurrentNWRegistered, YearMonthBirth |
| DimSnomedCode | SNOMED code descriptions | SnomedCode, SnomedDescription |
| DimOrganisationAndSite | GP practices and NHS orgs | SiteCode, OrganisationName, OrganisationSubType, IsSiteNorfolkAndWaveney, IsSiteActive |
| DimDate | Date dimension | |
| DimCondition | Clinical conditions | Long-term condition flags |
| DimDeprivation | Deprivation rankings by area | |
CRITICAL: ProductDescription is the correct column for product names. ProductName does NOT exist. IsLatest does NOT exist in DimMedicineAndDevice.
DATA_HUB.CDM (Common Data Model)
| View | Purpose | Key Columns |
|---|---|---|
| Acute__Conmon__PatientLevelDrugs | HCD activity data | PseudoNHSNoLinked, InterventionDate, DrugName, Price Actual |
Note: PseudoNHSNoLinked in the HCD data holds the same value as PatientPseudonym in the GP data, so join on these two columns to link patients.
DATA_HUB.PHM (Population Health Management)
| View | Purpose | Key Columns |
|---|---|---|
| PrimaryCareClinicalCoding | Unified clinical coding (EMIS + TPP, no duplicates) | PatientPseudonym, SNOMEDCode, EventDateTime, NumericValue |
| PrimaryCareMedication | Unified medication data (EMIS + TPP, no duplicates) | PatientPseudonym, SNOMEDCode, DateMedicationStart, Quantity |
| ClinicalCodingClusterSnomedCodes | SNOMED codes grouped by cluster | ClusterId, SnomedCode |
| PersonCohort | Pre-defined patient cohorts | |
Prefer DATA_HUB.PHM unified views over raw PRIMARY_CARE tables.
Patient Identifiers
| Identifier | Source | Usage |
|---|---|---|
| PatientPseudonym | DATA_HUB, NATIONAL | Primary - use for most joins |
| PseudoNHSNoLinked | DATA_HUB.CDM (HCD data) | Links to PatientPseudonym |
| PersonKey | DATA_HUB.DWH.DimPerson | Integer key for person dimension |
Standard Join Patterns
-- HCD Activity to GP Diagnosis
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
LEFT JOIN DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
    ON hcd."PseudoNHSNoLinked" = pcc."PatientPseudonym"

-- Activity to Person Demographics
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
INNER JOIN DATA_HUB.DWH."DimPerson" dp
    ON hcd."PseudoNHSNoLinked" = dp."PatientPseudonym"
CRITICAL: Registered Population Filter
ALWAYS apply when counting patients:
WHERE dp."IsCurrentNWRegistered" = 'Yes'
  AND dp."CurrentGeneralPractice" <> '*'
Without this filter, patient counts are inflated roughly twofold because they include deceased, deregistered, and out-of-area patients.
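As an illustration of where the filter sits in a complete query, the sketch below counts distinct registered patients with HCD activity. It only combines the join pattern and columns listed above; the date cut-off and the output alias "RegisteredPatients" are arbitrary choices for the example, and the column names should still be confirmed with describe_table before running.

```sql
-- Sketch only: count distinct registered patients with HCD activity.
-- The date cut-off and the "RegisteredPatients" alias are illustrative.
SELECT COUNT(DISTINCT dp."PatientPseudonym") AS "RegisteredPatients"
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
INNER JOIN DATA_HUB.DWH."DimPerson" dp
    ON hcd."PseudoNHSNoLinked" = dp."PatientPseudonym"
WHERE dp."IsCurrentNWRegistered" = 'Yes'      -- registered population filter
  AND dp."CurrentGeneralPractice" <> '*'      -- registered population filter
  AND hcd."InterventionDate" >= '2024-01-01'::DATE;
```

Running the same query with the two filter lines removed gives a quick way to see how much the unfiltered count differs.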
Query Development Patterns
Clinical Condition Detection (GP SNOMED Clusters)
-- Get all SNOMED codes for a clinical cluster
WITH cluster_codes AS (
    SELECT "SnomedCode"
    FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"
    WHERE "ClusterId" = 'RARTH_COD'  -- Rheumatoid arthritis
)
-- Check if patient has condition
SELECT DISTINCT pcc."PatientPseudonym"
FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)
  AND pcc."PatientPseudonym" IS NOT NULL
Available SNOMED Clusters for HCD Indications
- RARTH_COD (155 codes) - Rheumatoid arthritis
- PSORIASIS_COD (116 codes) - Psoriasis
- CROHNS_COD (93 codes) - Crohn's disease
- ULCCOLITIS_COD (62 codes) - Ulcerative colitis
- MS_COD (44 codes) - Multiple sclerosis
- DM_COD / DMTYPE1_COD / DMTYPE2AUDIT_COD - Diabetes
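One way these clusters might be combined with HCD activity is sketched below: it restricts activity to patients who have a GP-coded diagnosis in a chosen cluster and counts patients per drug. The CTE names (cluster_codes, diagnosed) and the "Patients" alias are made up for the example, and the registered population filter is left out for brevity, so treat the result as exploratory only.

```sql
-- Sketch only: HCD drugs given to patients with a GP-coded diagnosis
-- in one cluster. CTE names and the "Patients" alias are illustrative.
WITH cluster_codes AS (
    SELECT "SnomedCode"
    FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"
    WHERE "ClusterId" = 'PSORIASIS_COD'
),
diagnosed AS (
    SELECT DISTINCT pcc."PatientPseudonym"
    FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
    WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)
      AND pcc."PatientPseudonym" IS NOT NULL
)
SELECT
    hcd."DrugName",
    COUNT(DISTINCT hcd."PseudoNHSNoLinked") AS "Patients"
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
INNER JOIN diagnosed d
    ON hcd."PseudoNHSNoLinked" = d."PatientPseudonym"
GROUP BY hcd."DrugName"
ORDER BY "Patients" DESC
LIMIT 20;
```

Swapping the ClusterId value for another entry in the list above targets a different indication without changing the rest of the query.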
Sample HCD Activity Query
SELECT
    hcd."PseudoNHSNoLinked" AS "PatientPseudonym",  -- quoted so the alias keeps its case
    hcd."DrugName",
    hcd."InterventionDate",
    hcd."Provider Code",
    hcd."OrganisationName"
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
WHERE hcd."InterventionDate" >= '2024-01-01'
LIMIT 20
Snowflake SQL Syntax
- Double-quote identifiers: "PatientPseudonym"
- Date literals: '2025-04-01'::DATE
- Date functions: DATEADD('MONTH', -3, date), DATEDIFF('YEAR', d1, d2), LAST_DAY(date)
- Boolean: TRUE / FALSE
- Row limits: use LIMIT N rather than TOP N
- COALESCE(), NULLIF(), GREATEST() are available (note that GREATEST() returns NULL if any argument is NULL)
Troubleshooting
Column not found errors
- Use describe_table(table_name, database) to get actual column names
- Remember: Snowflake identifiers are case-sensitive when quoted
- Common mistakes: ProductName (wrong) vs ProductDescription (correct)
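If describe_table is not to hand, column names can usually also be listed from Snowflake's standard INFORMATION_SCHEMA. The sketch below assumes your role is allowed to read DATA_HUB.INFORMATION_SCHEMA and that the view name is stored in mixed case, as the quoted identifiers elsewhere in this reference suggest.

```sql
-- Sketch only: list the real column names of a view.
-- Assumes read access to DATA_HUB.INFORMATION_SCHEMA and a mixed-case view name.
SELECT COLUMN_NAME, DATA_TYPE
FROM DATA_HUB.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'DWH'
  AND TABLE_NAME = 'DimMedicineAndDevice'
ORDER BY ORDINAL_POSITION;
```

Copy the names exactly as returned and wrap them in double quotes; an unquoted name such as ProductDescription is resolved as PRODUCTDESCRIPTION and will not match.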
Empty results
- Check patient identifier filtering (IS NOT NULL)
- Check date ranges
- Test with LIMIT 20 first to see sample data
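A quick profiling query can cover the first two checks in one pass. The sketch below uses the HCD view as the example; the aliases are illustrative, and the same shape should work for other views once the identifier and date columns are substituted.

```sql
-- Sketch only: profile identifier completeness and date coverage.
-- COUNT(column) ignores NULLs, so the gap between the first two
-- figures is the number of rows with no patient identifier.
SELECT
    COUNT(*)                        AS "TotalRows",
    COUNT(hcd."PseudoNHSNoLinked")  AS "RowsWithIdentifier",
    MIN(hcd."InterventionDate")     AS "EarliestDate",
    MAX(hcd."InterventionDate")     AS "LatestDate"
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd;
```

If the date range returned does not overlap the range in your WHERE clause, the empty result is explained by the filter rather than by the join.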
Slow queries
- Add LIMIT during development
- Use describe_query to validate structure before execution
- Consider async execution for large result sets