Data Validation Report
Every query result on this site is validated against published statistics from the CDC and NCHS. This report shows our automated test suite results.
Methodology
Each test case compares a result from our data against a published value from an official CDC or NCHS source. We run two independent layers of validation for every test:
Layer 1 (Gold SQL): A hand-written SQL query is executed directly against the DuckDB database on Railway. This tests whether the data itself reproduces published statistics, independent of the AI layer. If Layer 1 fails, the data or our understanding of the codebook is wrong.
Layer 2 (NL Query): A natural language question is sent through the full production pipeline: the question goes to our API, Claude generates SQL, Railway executes it, and the result is checked against the same published value. This tests the end-to-end system that users interact with. If Layer 2 fails but Layer 1 passes, the AI is misinterpreting the question or generating incorrect SQL.
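The two layers can be sketched as a small harness. This is a minimal illustration, not the site's actual test suite: `ValidationCase`, `run_case`, and the two runner callables are hypothetical names, and the stub runners return values copied from the results table instead of calling DuckDB or the production API.

```python
from dataclasses import dataclass


@dataclass
class ValidationCase:
    statistic: str
    published: float  # value from the CDC/NCHS source, in percent
    gold_sql: str     # hand-written query (Layer 1)
    question: str     # natural language question (Layer 2)


def run_case(case, run_sql, ask_pipeline):
    """Run both layers and return each result's deviation from the published value."""
    layer1 = run_sql(case.gold_sql)        # direct query against the database
    layer2 = ask_pipeline(case.question)   # full NL-to-SQL pipeline
    return {
        "gold_sql_dev": round(layer1 - case.published, 1),
        "nl_query_dev": round(layer2 - case.published, 1),
    }


case = ValidationCase(
    statistic="Diagnosed diabetes, BRFSS 2018",
    published=10.9,
    gold_sql="SELECT ...",  # placeholder, not the real query
    question="What percent of adults had diagnosed diabetes in 2018?",
)

# Stub runners (assumptions) standing in for the database and the API.
result = run_case(case, run_sql=lambda sql: 11.4, ask_pipeline=lambda q: 11.8)
print(result)  # {'gold_sql_dev': 0.5, 'nl_query_dev': 0.9}
```

Passing the runners in as callables keeps the comparison logic identical for both layers, so any difference between the two deviations comes from the query path rather than the checking code.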
BRFSS Results
11 tests. Behavioral Risk Factor Surveillance System: self-reported survey data, 400K+ respondents/year. Values are weighted prevalence percentages using CDC's _LLCPWT survey weights.
| Statistic | Year | Published | Gold SQL | Gold SQL Dev (pp) | NL Query | NL Query Dev (pp) | Source |
|---|---|---|---|---|---|---|---|
| Adult obesity (national) | 2017 | 30.1% | 30.1% | 0.0 | 30.1% | 0.0 | CDC Obesity Maps |
| Adult obesity (national) | 2018 | 30.9% | 30.9% | 0.0 | 30.9% | 0.0 | CDC Obesity Maps |
| Adult obesity (West Virginia) | 2018 | 39.5% | 39.5% | 0.0 | 39.5% | 0.0 | CDC State Data |
| Adult obesity (Colorado) | 2018 | 22.9% | 22.9% | 0.0 | 22.9% | 0.0 | CDC State Data |
| Current smoking | 2018 | 15.5% | 15.5% | 0.0 | 15.5% | 0.0 | CDC Tobacco Data |
| Adult obesity (national) | 2020 | 31.9% | 31.9% | 0.0 | 31.9% | 0.0 | CDC Obesity Maps |
| Diagnosed diabetes | 2018 | 10.9% | 11.4% | +0.5 | 11.8% | +0.9 | CDC Diabetes |
| Current asthma | 2018 | 9.2% | 9.2% | 0.0 | 9.2% | 0.0 | CDC Asthma |
| Physical inactivity | 2018 | 24.5% | 24.5% | 0.0 | 24.5% | 0.0 | CDC PCD |
| Adult obesity (national) | 2023 | 34.3% | 32.8% | -1.5 | 32.8% | -1.5 | CDC Newsroom |
| Depressive disorder | 2019 | 19.9% | 18.8% | -1.1 | 18.8% | -1.1 | PLOS ONE |
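
A weighted prevalence is the sum of the weights of respondents who have the condition divided by the sum of all weights. The sketch below shows that calculation on toy data; `weighted_prevalence`, the `has_condition` flag, and the four rows are made up for illustration and are not real BRFSS records. A real query would also need to handle missing or refused responses according to the codebook, which this sketch ignores.

```python
def weighted_prevalence(rows, weight_key="_LLCPWT", flag_key="has_condition"):
    """Return the weighted percentage of respondents with the condition."""
    total_weight = sum(r[weight_key] for r in rows)
    condition_weight = sum(r[weight_key] for r in rows if r[flag_key])
    return 100.0 * condition_weight / total_weight


# Toy respondents with invented weights (illustration only).
rows = [
    {"_LLCPWT": 1200.0, "has_condition": True},
    {"_LLCPWT": 800.0, "has_condition": False},
    {"_LLCPWT": 500.0, "has_condition": True},
    {"_LLCPWT": 1500.0, "has_condition": False},
]

print(round(weighted_prevalence(rows), 1))  # 42.5
```

Two of the four toy respondents have the condition, so the unweighted share is 50%, but the weighted share is 42.5%. Omitting the weights is one way a Layer 1 test can miss a published figure.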
NHANES Results
8 tests. National Health and Nutrition Examination Survey (2021–2023 cycle): clinical exams + lab measurements. Values are weighted prevalence percentages using WTMEC2YR exam weights.
| Statistic | Year | Published | Gold SQL | Gold SQL Dev (pp) | NL Query | NL Query Dev (pp) | Source |
|---|---|---|---|---|---|---|---|
| Obesity overall (BMI≥30) | 2021–23 | 40.3% | 40.3% | 0.0 | 39.8% | -0.5 | NCHS Brief #508 |
| Obesity, men (BMI≥30) | 2021–23 | 39.2% | 39.2% | 0.0 | 38.7% | -0.5 | NCHS Brief #508 |
| Obesity, women (BMI≥30) | 2021–23 | 41.3% | 41.3% | 0.0 | 40.8% | -0.5 | NCHS Brief #508 |
| Total diabetes (incl. undiagnosed) | 2021–23 | 15.8% | 13.8% | -2.0 | 13.8% | -2.0 | NCHS Brief #516 |
| High cholesterol (≥240 mg/dL) | 2021–23 | 11.3% | 11.4% | +0.1 | 11.1% | -0.2 | NCHS Brief #515 |
| Hypertension (measured + Dx) | 2021–23 | 47.7% | 50.0% | +2.3 | 50.0% | +2.3 | NCHS Brief #511 |
| Severe obesity (BMI≥40) | 2021–23 | 9.4% | 9.4% | 0.0 | 9.3% | -0.1 | NCHS Brief #508 |
| Depression (PHQ-9≥10) | 2021–23 | 13.1% | 12.6% | -0.5 | 12.6% | -0.5 | NCHS Brief #527 |
Notes
Tolerance thresholds
Each test has a pre-defined tolerance (typically 1–2 percentage points for BRFSS, 1.5–5 percentage points for NHANES). These account for differences in survey weight versions, age cutoffs, and rounding. A test passes when the absolute deviation from the published value is within its tolerance.
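As a rough sketch of the pass/fail rule, the check compares the absolute deviation with the tolerance. The function name `within_tolerance` and the tolerance values below are assumptions for illustration; the published and Gold SQL figures are taken from the results tables, but the suite's actual per-test tolerances are not listed in this report.

```python
def within_tolerance(published, observed, tolerance_pp):
    """Pass when the absolute deviation (rounded to 0.1 pp) is within tolerance."""
    deviation = round(observed - published, 1)
    return abs(deviation) <= tolerance_pp


# BRFSS diagnosed diabetes 2018: deviation +0.5 pp, hypothetical 1.0 pp tolerance.
print(within_tolerance(10.9, 11.4, tolerance_pp=1.0))  # True

# NHANES hypertension: deviation +2.3 pp, shown at two hypothetical tolerances.
print(within_tolerance(47.7, 50.0, tolerance_pp=1.5))  # False
print(within_tolerance(47.7, 50.0, tolerance_pp=5.0))  # True
```

The same deviation can pass or fail depending on the tolerance chosen, which is why each tolerance is fixed before the test is run.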
BRFSS vs NHANES obesity gap
BRFSS reports ~31–33% obesity; NHANES reports ~40%. This is not an error. BRFSS uses self-reported height/weight (people underreport weight), while NHANES uses clinical measurements. The gap is well-documented in epidemiological literature.
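A small worked example shows how underreported weight moves a person across the BMI ≥ 30 obesity threshold. The person and both weights are hypothetical, chosen only to illustrate the arithmetic.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Hypothetical person: measured at 92 kg, self-reports 89 kg, height 1.75 m.
measured = bmi(92.0, 1.75)
self_reported = bmi(89.0, 1.75)

print(round(measured, 1), measured >= 30)            # 30.0 True
print(round(self_reported, 1), self_reported >= 30)  # 29.1 False
```

A 3 kg difference in reported weight is enough here to change the classification, so a survey built on self-reports will tend to count fewer people as obese than one built on clinical measurements.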
What each layer catches
Layer 1 failures indicate data issues: wrong codebook interpretation, missing survey weights, incorrect variable coding. Layer 2 failures (with Layer 1 passing) indicate AI issues: the NL-to-SQL model is generating incorrect queries. Both layers passing means the data is correct and the AI can reproduce results from plain English questions.
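The interpretation above amounts to a small decision rule. The sketch below encodes it; `diagnose` and its return strings are hypothetical names used for illustration, not part of the site's test suite.

```python
def diagnose(layer1_passed, layer2_passed):
    """Map the two layer outcomes to the likely source of a problem."""
    if not layer1_passed:
        # The gold SQL could not reproduce the published value.
        return "data issue: check codebook interpretation, weights, variable coding"
    if not layer2_passed:
        # The data is fine, but the generated SQL gave a different answer.
        return "AI issue: NL-to-SQL model generated an incorrect query"
    return "ok: data is correct and the AI reproduces the result"


print(diagnose(True, True))
print(diagnose(True, False))
print(diagnose(False, False))
```

Layer 1 is checked first because a Layer 2 result cannot be interpreted until the underlying data is known to reproduce the published value.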