HighCostDrugsDemo/docs/SNOWFLAKE_REFERENCE.md
Andrew Charlwood 76838887e6 refactor: reorganize repository to src/ layout
Move 6 packages (core, config, data_processing, analysis, visualization, cli)
into src/ to reduce root clutter. Merge tools/data.py into
data_processing/transforms.py. Move docs to docs/.

Path resolution via .pth file (setup_dev.py), pytest pythonpath config,
and sys.path bootstrap in rxconfig.py and CLI entry points.

Clean up pyproject.toml deps (remove stale pins, add snowflake-connector-python).
Fix tomllib import for Python 3.10 compatibility.

All 113 tests pass.
2026-02-06 12:03:48 +00:00

Snowflake Reference

Essential database context for querying NHS data. Read this every iteration when working with Snowflake.


Snowflake MCP Server

Use mcp__snowflake-mcp__* functions to explore schema and test queries.

Schema Discovery (USE THESE FIRST)

  • test_connection() - Verify connectivity
  • list_databases() - List accessible databases
  • list_schemas(database_name) - List schemas in a database
  • list_tables(database, schema) - List tables with descriptions
  • list_views(schema_name, database) - List views with descriptions
  • describe_table(table_name, database) - Get detailed table schema
  • describe_query(query, database) - Preview query output columns without execution

Query Execution

  • read_data(query, database, max_rows) - Execute SELECT queries with row limits
  • read_data_paginated(query, database, page_size, page) - Paginated results with total count
  • read_data_pandas(query, database, max_rows, output_format) - Results in pandas-friendly formats

Async Query Support (long-running queries)

  • execute_async(query, database) - Submit asynchronously, returns query_id
  • get_query_status(query_id, database) - Check status
  • get_async_results(query_id, database, max_rows) - Retrieve results
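The submit, poll, and fetch cycle described by these three functions can be sketched as a small polling loop. This is a minimal sketch under stated assumptions, not the MCP server's own client: the three callables are injected stand-ins for execute_async, get_query_status, and get_async_results, and the status strings "SUCCESS" and "FAILED" are assumptions, so check what get_query_status actually returns before relying on them.

```python
import time
from typing import Any, Callable


def run_async_query(
    submit: Callable[[], str],
    get_status: Callable[[str], str],
    get_results: Callable[[str], Any],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    done_status: str = "SUCCESS",   # assumed status value
    failed_status: str = "FAILED",  # assumed status value
) -> Any:
    """Submit a query, poll until it finishes, then fetch its results."""
    query_id = submit()
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(query_id)
        if status == done_status:
            return get_results(query_id)
        if status == failed_status:
            raise RuntimeError(f"Query {query_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Query {query_id} still {status} after timeout")
        time.sleep(poll_seconds)
```

Injecting the callables keeps the loop independent of how the MCP tools are invoked, and makes it easy to exercise with fakes before pointing it at a real query.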

Usage Guidelines

  • ALWAYS verify table structures and column names via MCP before writing queries
  • Test with small result sets (LIMIT 20) before full execution
  • Use describe_query to preview complex query outputs before running
  • Use async queries for operations expected to take >30 seconds

Database Overview

| Database | Purpose |
|---|---|
| DATA_HUB | Analyst-curated data warehouse - primary source for most queries |
| PRIMARY_CARE | Raw extracts from EMIS and TPP clinical systems |
| NATIONAL | NHS England national datasets (SUS, ECDS, MHSDS, etc.) |
| FACTS_AND_DIMENSIONS_ALL_DATA | External reference data (BNF, SNOMED, QOF clusters) |
| REPORTING_DATASETS_ICB | Reporting outputs and analyst workspaces (includes SCRATCHPAD) |

Avoid the SYSTEM database.


Key Tables and Views

DATA_HUB.DWH (Dimensions)

| View | Purpose | Key Columns |
|---|---|---|
| DimMedicineAndDevice | Master medication/device reference | ProductSnomedCode, TherapeuticMoietySnomedCode (VTM), BNFParagraphCode, StrengthDescription, ProductDescription |
| DimPerson | Patient demographics | PatientPseudonym, PersonKey, CurrentGeneralPractice, IsCurrentNWRegistered, YearMonthBirth |
| DimSnomedCode | SNOMED code descriptions | SnomedCode, SnomedDescription |
| DimOrganisationAndSite | GP practices and NHS orgs | SiteCode, OrganisationName, OrganisationSubType, IsSiteNorfolkAndWaveney, IsSiteActive |
| DimDate | Date dimension | |
| DimCondition | Clinical conditions | Long-term condition flags |
| DimDeprivation | Deprivation rankings by area | |

CRITICAL:

  • ProductDescription is the correct column for product names. ProductName does NOT exist.
  • IsLatest does NOT exist in DimMedicineAndDevice.

DATA_HUB.CDM (Common Data Model)

| View | Purpose | Key Columns |
|---|---|---|
| Acute__Conmon__PatientLevelDrugs | HCD activity data | PseudoNHSNoLinked, InterventionDate, DrugName, Price Actual |

Note: PseudoNHSNoLinked in the HCD data corresponds to PatientPseudonym in the GP data, so join on these two columns for patient linkage.

DATA_HUB.PHM (Population Health Management)

| View | Purpose | Key Columns |
|---|---|---|
| PrimaryCareClinicalCoding | Unified clinical coding (EMIS + TPP, no duplicates) | PatientPseudonym, SNOMEDCode, EventDateTime, NumericValue |
| PrimaryCareMedication | Unified medication data (EMIS + TPP, no duplicates) | PatientPseudonym, SNOMEDCode, DateMedicationStart, Quantity |
| ClinicalCodingClusterSnomedCodes | SNOMED codes grouped by cluster | ClusterId, SnomedCode |
| PersonCohort | Pre-defined patient cohorts | |

Prefer DATA_HUB.PHM unified views over raw PRIMARY_CARE tables.


Patient Identifiers

| Identifier | Source | Usage |
|---|---|---|
| PatientPseudonym | DATA_HUB, NATIONAL | Primary - use for most joins |
| PseudoNHSNoLinked | DATA_HUB.CDM (HCD data) | Links to PatientPseudonym |
| PersonKey | DATA_HUB.DWH.DimPerson | Integer key for person dimension |

Standard Join Patterns

-- HCD Activity to GP Diagnosis
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
LEFT JOIN DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
  ON hcd."PseudoNHSNoLinked" = pcc."PatientPseudonym"

-- Activity to Person Demographics
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
INNER JOIN DATA_HUB.DWH."DimPerson" dp
  ON hcd."PseudoNHSNoLinked" = dp."PatientPseudonym"

CRITICAL: Registered Population Filter

ALWAYS apply when counting patients:

WHERE dp."IsCurrentNWRegistered" = 'Yes'
  AND dp."CurrentGeneralPractice" <> '*'

Without this filter, patient counts are inflated roughly 2x because they include deceased, deregistered, and out-of-area patients.
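One way to avoid forgetting the filter is to keep it in a single constant and build patient counts through a helper. The sketch below is illustrative only: the constant name, the function name, and the "RegisteredPatients" alias are hypothetical, while the tables, columns, and filter come from this document.

```python
from datetime import date

# The registered population filter from this section, kept in one place.
REGISTERED_POPULATION_FILTER = """dp."IsCurrentNWRegistered" = 'Yes'
  AND dp."CurrentGeneralPractice" <> '*'"""


def registered_hcd_patient_count_sql(since: str) -> str:
    """Build SQL counting distinct registered patients with HCD activity since a date."""
    # Parsing rejects anything that is not a real date before it reaches the SQL text.
    since_iso = date.fromisoformat(since).isoformat()
    return f"""SELECT COUNT(DISTINCT dp."PatientPseudonym") AS "RegisteredPatients"
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
INNER JOIN DATA_HUB.DWH."DimPerson" dp
  ON hcd."PseudoNHSNoLinked" = dp."PatientPseudonym"
WHERE hcd."InterventionDate" >= '{since_iso}'::DATE
  AND {REGISTERED_POPULATION_FILTER}"""


if __name__ == "__main__":
    print(registered_hcd_patient_count_sql("2024-01-01"))
```

The generated statement can then be passed to read_data once the column names have been confirmed with describe_table.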


Query Development Patterns

Clinical Condition Detection (GP SNOMED Clusters)

-- Get all SNOMED codes for a clinical cluster, named as a CTE
-- so the patient query below can reference it
WITH cluster_codes AS (
    SELECT "SnomedCode"
    FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"
    WHERE "ClusterId" = 'RARTH_COD'  -- Rheumatoid arthritis
)
-- Check if patient has condition
SELECT DISTINCT pcc."PatientPseudonym"
FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)
  AND pcc."PatientPseudonym" IS NOT NULL
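To reuse this pattern across clusters, the cluster lookup and the patient check can be generated from one parameterized helper. This is a sketch, assuming cluster ids only ever contain uppercase letters, digits, and underscores (true of the ids listed in this document); the function name and the validation rule are hypothetical.

```python
import re

# Assumption: cluster ids look like RARTH_COD or DMTYPE2AUDIT_COD.
_CLUSTER_ID_PATTERN = re.compile(r"[A-Z0-9_]+")


def patients_in_cluster_sql(cluster_id: str) -> str:
    """Build SQL returning distinct patients coded with any SNOMED code in a cluster."""
    # Validate before interpolating, since the id is placed inside a SQL string literal.
    if not _CLUSTER_ID_PATTERN.fullmatch(cluster_id):
        raise ValueError(f"Unexpected cluster id: {cluster_id!r}")
    return f"""WITH cluster_codes AS (
    SELECT "SnomedCode"
    FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"
    WHERE "ClusterId" = '{cluster_id}'
)
SELECT DISTINCT pcc."PatientPseudonym"
FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)
  AND pcc."PatientPseudonym" IS NOT NULL"""


if __name__ == "__main__":
    print(patients_in_cluster_sql("RARTH_COD"))
```

Note the two spellings in the generated SQL: the cluster view uses "SnomedCode" while the clinical coding view uses "SNOMEDCode", and both must be quoted exactly.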

Available SNOMED Clusters for HCD Indications

  • RARTH_COD (155 codes) - Rheumatoid arthritis
  • PSORIASIS_COD (116 codes) - Psoriasis
  • CROHNS_COD (93 codes) - Crohn's disease
  • ULCCOLITIS_COD (62 codes) - Ulcerative colitis
  • MS_COD (44 codes) - Multiple sclerosis
  • DM_COD / DMTYPE1_COD / DMTYPE2AUDIT_COD - Diabetes

Sample HCD Activity Query

SELECT
    hcd."PseudoNHSNoLinked" AS "PatientPseudonym",
    hcd."DrugName",
    hcd."InterventionDate",
    hcd."Provider Code",
    hcd."OrganisationName"
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
WHERE hcd."InterventionDate" >= '2024-01-01'::DATE
LIMIT 20

Snowflake SQL Syntax

  • Double-quote identifiers: "PatientPseudonym"
  • Date literals: '2025-04-01'::DATE
  • Date functions: DATEADD('MONTH', -3, date), DATEDIFF('YEAR', d1, d2), LAST_DAY(date)
  • Boolean: TRUE/FALSE
  • No TOP N - use LIMIT N
  • COALESCE(), NULLIF(), GREATEST() work as expected
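When queries are assembled in Python, the quoting and date-literal rules above can be captured in two small helpers. This is a minimal sketch; the helper names are hypothetical and not part of this repository.

```python
from datetime import date


def quote_ident(name: str) -> str:
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def date_literal(value: date) -> str:
    """Render a date as a Snowflake literal with an explicit DATE cast."""
    return f"'{value.isoformat()}'::DATE"


if __name__ == "__main__":
    column = quote_ident("InterventionDate")
    print(f"WHERE hcd.{column} >= {date_literal(date(2025, 4, 1))}")
```

Quoting matters most for column names that contain spaces, such as "Provider Code", which cannot be referenced unquoted.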

Troubleshooting

Column not found errors

  1. Use describe_table(table_name, database) to get actual column names
  2. Remember: quoted Snowflake identifiers are case-sensitive, and unquoted identifiers are folded to uppercase
  3. Common mistakes: ProductName (wrong) vs ProductDescription (correct)
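The checks above can be partly automated by comparing the columns a query uses against the names returned by describe_table. The sketch below assumes the actual column names have already been collected into a list; the function name is hypothetical. It flags missing columns and points out case-only mismatches, which are easy to hit here because some views use "SnomedCode" and others "SNOMEDCode".

```python
from typing import Dict, Iterable, Optional


def check_columns(
    requested: Iterable[str], actual: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Map each requested column that is not an exact match to a case-only match, or None."""
    actual_names = list(actual)
    exact = set(actual_names)
    by_lower = {name.lower(): name for name in actual_names}
    problems: Dict[str, Optional[str]] = {}
    for column in requested:
        if column in exact:
            continue  # exact, case-sensitive match: nothing to report
        problems[column] = by_lower.get(column.lower())
    return problems


if __name__ == "__main__":
    view_columns = ["PatientPseudonym", "SNOMEDCode", "EventDateTime", "NumericValue"]
    print(check_columns(["PatientPseudonym", "SnomedCode", "ProductName"], view_columns))
```

A value of None means the column does not exist under any casing, so the name itself is wrong rather than just its capitalization.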

Empty results

  1. Check patient identifier filtering (IS NOT NULL)
  2. Check date ranges
  3. Test with LIMIT 20 first to see sample data

Slow queries

  1. Add LIMIT during development
  2. Use describe_query to validate structure before execution
  3. Consider async execution for large result sets
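The advice to add LIMIT during development can be applied mechanically by wrapping the query under test in an outer SELECT. This is a sketch only: the function name and the dev_sample alias are hypothetical, and the default of 20 rows follows the guideline used elsewhere in this document.

```python
def sample_query(query: str, limit: int = 20) -> str:
    """Wrap a SELECT statement so that at most `limit` rows are returned."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Drop surrounding whitespace and a trailing semicolon so the query nests cleanly.
    inner = query.strip().rstrip(";").strip()
    return f"SELECT *\nFROM (\n{inner}\n) AS dev_sample\nLIMIT {limit}"


if __name__ == "__main__":
    print(sample_query('SELECT "DrugName" FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs";'))
```

Wrapping leaves the original query text untouched, so the LIMIT can be removed for the full run without editing the query itself.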