193 lines
7.0 KiB
Markdown
193 lines
7.0 KiB
Markdown
# Snowflake Reference
|
|
|
|
Essential database context for querying NHS data. Read this every iteration when working with Snowflake.
|
|
|
|
---
|
|
|
|
## Snowflake MCP Server
|
|
|
|
Use `mcp__snowflake-mcp__*` functions to explore schema and test queries.
|
|
|
|
### Schema Discovery (USE THESE FIRST)
|
|
- `test_connection()` - Verify connectivity
|
|
- `list_databases()` - List accessible databases
|
|
- `list_schemas(database_name)` - List schemas in a database
|
|
- `list_tables(database, schema)` - List tables with descriptions
|
|
- `list_views(schema_name, database)` - List views with descriptions
|
|
- `describe_table(table_name, database)` - Get detailed table schema
|
|
- `describe_query(query, database)` - Preview query output columns without execution
|
|
|
|
### Query Execution
|
|
- `read_data(query, database, max_rows)` - Execute SELECT queries with row limits
|
|
- `read_data_paginated(query, database, page_size, page)` - Paginated results with total count
|
|
- `read_data_pandas(query, database, max_rows, output_format)` - Results in pandas-friendly formats
|
|
|
|
### Async Query Support (long-running queries)
|
|
- `execute_async(query, database)` - Submit asynchronously, returns query_id
|
|
- `get_query_status(query_id, database)` - Check status
|
|
- `get_async_results(query_id, database, max_rows)` - Retrieve results
|
|
|
|
### Usage Guidelines
|
|
- **ALWAYS** verify table structures and column names via MCP before writing queries
|
|
- Test with small result sets (`LIMIT 20`) before full execution
|
|
- Use `describe_query` to preview complex query outputs before running
|
|
- Use async queries for operations expected to take >30 seconds
|
|
|
|
---
|
|
|
|
## Database Overview
|
|
|
|
| Database | Purpose |
|
|
|----------|---------|
|
|
| `DATA_HUB` | **Analyst-curated** data warehouse - primary source for most queries |
|
|
| `PRIMARY_CARE` | Raw extracts from EMIS and TPP clinical systems |
|
|
| `NATIONAL` | NHS England national datasets (SUS, ECDS, MHSDS, etc.) |
|
|
| `FACTS_AND_DIMENSIONS_ALL_DATA` | External reference data (BNF, SNOMED, QOF clusters) |
|
|
| `REPORTING_DATASETS_ICB` | Reporting outputs and analyst workspaces (includes SCRATCHPAD) |
|
|
|
|
**Avoid**: `SYSTEM` database.
|
|
|
|
---
|
|
|
|
## Key Tables and Views
|
|
|
|
### DATA_HUB.DWH (Dimensions)
|
|
|
|
| View | Purpose | Key Columns |
|
|
|------|---------|-------------|
|
|
| `DimMedicineAndDevice` | Master medication/device reference | `ProductSnomedCode`, `TherapeuticMoietySnomedCode` (VTM), `BNFParagraphCode`, `StrengthDescription`, `ProductDescription` |
|
|
| `DimPerson` | Patient demographics | `PatientPseudonym`, `PersonKey`, `CurrentGeneralPractice`, `IsCurrentNWRegistered`, `YearMonthBirth` |
|
|
| `DimSnomedCode` | SNOMED code descriptions | `SnomedCode`, `SnomedDescription` |
|
|
| `DimOrganisationAndSite` | GP practices and NHS orgs | `SiteCode`, `OrganisationName`, `OrganisationSubType`, `IsSiteNorfolkAndWaveney`, `IsSiteActive` |
|
|
| `DimDate` | Date dimension | |
|
|
| `DimCondition` | Clinical conditions | Long-term condition flags |
|
|
| `DimDeprivation` | Deprivation rankings by area | |
|
|
|
|
**CRITICAL**:
|
|
- `ProductDescription` is the correct column for product names. `ProductName` does NOT exist.
|
|
- `IsLatest` does NOT exist in `DimMedicineAndDevice`.
|
|
|
|
### DATA_HUB.CDM (Common Data Model)
|
|
|
|
| View | Purpose | Key Columns |
|
|
|------|---------|-------------|
|
|
| `Acute__Conmon__PatientLevelDrugs` | HCD activity data | `PseudoNHSNoLinked`, `InterventionDate`, `DrugName`, `Price Actual` |
|
|
|
|
**Note**: HCD `PseudoNHSNoLinked` = GP `PatientPseudonym` for patient linkage.
|
|
|
|
### DATA_HUB.PHM (Population Health Management)
|
|
|
|
| View | Purpose | Key Columns |
|
|
|------|---------|-------------|
|
|
| `PrimaryCareClinicalCoding` | **Unified** clinical coding (EMIS + TPP, no duplicates) | `PatientPseudonym`, `SNOMEDCode`, `EventDateTime`, `NumericValue` |
|
|
| `PrimaryCareMedication` | **Unified** medication data (EMIS + TPP, no duplicates) | `PatientPseudonym`, `SNOMEDCode`, `DateMedicationStart`, `Quantity` |
|
|
| `ClinicalCodingClusterSnomedCodes` | SNOMED codes grouped by cluster | `ClusterId`, `SnomedCode` |
|
|
| `PersonCohort` | Pre-defined patient cohorts | |
|
|
|
|
**Prefer DATA_HUB.PHM unified views** over raw PRIMARY_CARE tables.
|
|
|
|
---
|
|
|
|
## Patient Identifiers
|
|
|
|
| Identifier | Source | Usage |
|
|
|------------|--------|-------|
|
|
| `PatientPseudonym` | DATA_HUB, NATIONAL | Primary - use for most joins |
|
|
| `PseudoNHSNoLinked` | DATA_HUB.CDM (HCD data) | Links to PatientPseudonym |
|
|
| `PersonKey` | DATA_HUB.DWH.DimPerson | Integer key for person dimension |
|
|
|
|
### Standard Join Patterns
|
|
```sql
|
|
-- HCD Activity to GP Diagnosis
|
|
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
|
|
LEFT JOIN DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
|
|
ON hcd."PseudoNHSNoLinked" = pcc."PatientPseudonym"
|
|
|
|
-- Activity to Person Demographics
|
|
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
|
|
INNER JOIN DATA_HUB.DWH."DimPerson" dp
|
|
ON hcd."PseudoNHSNoLinked" = dp."PatientPseudonym"
|
|
```
|
|
|
|
---
|
|
|
|
## CRITICAL: Registered Population Filter
|
|
|
|
**ALWAYS** apply when counting patients:
|
|
|
|
```sql
|
|
WHERE dp."IsCurrentNWRegistered" = 'Yes'
|
|
AND dp."CurrentGeneralPractice" <> '*'
|
|
```
|
|
|
|
Without this filter, counts will be ~2x inflated (includes deceased, deregistered, out-of-area patients).
|
|
|
|
---
|
|
|
|
## Query Development Patterns
|
|
|
|
### Clinical Condition Detection (GP SNOMED Clusters)
|
|
```sql
|
|
-- Get all SNOMED codes for a clinical cluster
|
|
SELECT "SnomedCode"
|
|
FROM DATA_HUB.PHM."ClinicalCodingClusterSnomedCodes"
|
|
WHERE "ClusterId" = 'RARTH_COD' -- Rheumatoid arthritis
|
|
|
|
-- Check if patient has condition
|
|
SELECT DISTINCT pcc."PatientPseudonym"
|
|
FROM DATA_HUB.PHM."PrimaryCareClinicalCoding" pcc
|
|
WHERE pcc."SNOMEDCode" IN (SELECT "SnomedCode" FROM cluster_codes)
|
|
AND pcc."PatientPseudonym" IS NOT NULL
|
|
```
|
|
|
|
### Available SNOMED Clusters for HCD Indications
|
|
- `RARTH_COD` (155 codes) - Rheumatoid arthritis
|
|
- `PSORIASIS_COD` (116 codes) - Psoriasis
|
|
- `CROHNS_COD` (93 codes) - Crohn's disease
|
|
- `ULCCOLITIS_COD` (62 codes) - Ulcerative colitis
|
|
- `MS_COD` (44 codes) - Multiple sclerosis
|
|
- `DM_COD` / `DMTYPE1_COD` / `DMTYPE2AUDIT_COD` - Diabetes
|
|
|
|
### Sample HCD Activity Query
|
|
```sql
|
|
SELECT
|
|
hcd."PseudoNHSNoLinked" AS PatientPseudonym,
|
|
hcd."DrugName",
|
|
hcd."InterventionDate",
|
|
hcd."Provider Code",
|
|
hcd."OrganisationName"
|
|
FROM DATA_HUB.CDM."Acute__Conmon__PatientLevelDrugs" hcd
|
|
WHERE hcd."InterventionDate" >= '2024-01-01'
|
|
LIMIT 20
|
|
```
|
|
|
|
---
|
|
|
|
## Snowflake SQL Syntax
|
|
|
|
- Double-quote identifiers: `"PatientPseudonym"`
|
|
- Date literals: `'2025-04-01'::DATE`
|
|
- Date functions: `DATEADD('MONTH', -3, date)`, `DATEDIFF('YEAR', d1, d2)`, `LAST_DAY(date)`
|
|
- Boolean: `TRUE`/`FALSE`
|
|
- No `TOP N` - use `LIMIT N`
|
|
- `COALESCE()`, `NULLIF()`, `GREATEST()` work as expected
|
|
|
|
---
|
|
|
|
## Troubleshooting
|
|
|
|
### Column not found errors
|
|
1. Use `describe_table(table_name, database)` to get actual column names
|
|
2. Remember: Snowflake identifiers are case-sensitive when quoted
|
|
3. Common mistakes: `ProductName` (wrong) vs `ProductDescription` (correct)
|
|
|
|
### Empty results
|
|
1. Check patient identifier filtering (`IS NOT NULL`)
|
|
2. Check date ranges
|
|
3. Test with `LIMIT 20` first to see sample data
|
|
|
|
### Slow queries
|
|
1. Add `LIMIT` during development
|
|
2. Use `describe_query` to validate structure before execution
|
|
3. Consider async execution for large result sets
|