Frequently Asked Questions

Common questions about the data, methodology, and how to interpret the numbers

What is ClinVar?

ClinVar is a free, public database maintained by the U.S. National Institutes of Health (NIH). Clinical laboratories, research groups, and expert panels submit reports about genetic variants they've observed in patients, along with their assessment of whether each variant causes disease.

Think of it as a shared library where labs around the world contribute their findings about genetic variants. When a lab identifies a variant in a patient and determines it's disease-causing, they can submit that finding to ClinVar so others can benefit from the knowledge.


What does "Estimated Patients" mean? Is it the actual number of patients?

No. It is an approximate lower bound, not a precise count. Each number on the dashboard is the number of times a pathogenic or likely pathogenic variant has been reported in ClinVar. It is almost certainly an undercount of the true number of diagnosed patients, for the following reasons:

Not every diagnosis is submitted to ClinVar. When a lab finds a well-known variant in a patient, they may not bother submitting it to ClinVar because it's already been reported many times. This is especially true for common variants in well-studied diseases.

ClinVar launched in 2012–2013. Patients diagnosed before then aren't captured unless their data was retroactively imported. Many historical diagnoses were never added.

One submission doesn't always equal one patient. A single submission might represent multiple patients from the same family or cohort. Conversely, the same patient could appear in multiple submissions if tested by different labs.


Why does all the data start around 2013?

ClinVar was established by the NIH in 2012 and began accepting submissions in earnest in 2013. When the database launched, variant records from existing data sources such as OMIM (Online Mendelian Inheritance in Man) were imported in bulk. That's why you see a cluster of entries dated April 4, 2013: that was the import date, not the date those variants were originally discovered.

The upward trend after 2013 reflects the growing adoption of ClinVar as a standard repository for clinical genetic findings, as well as the broader expansion of genetic testing (including the rise of whole-exome and whole-genome sequencing).


How is the estimated patient count calculated?

The displayed count is the cumulative number of pathogenic and likely pathogenic submissions to ClinVar, counted at face value: each submission counts once, regardless of the disease's inheritance mode. We treat this as an approximate lower bound on the number of patients genetically identified for each gene.

An earlier version of this dashboard divided the count by two for autosomal recessive diseases (on the logic that two pathogenic alleles are required per patient). We abandoned that adjustment because ClinVar entries are variant-level observations, not patient-level records: there is no way to confirm that two submissions belong to the same patient, and labs routinely do not resubmit well-known recessive variants. CFTR F508del, for example, accounts for roughly 70% of cystic fibrosis alleles but only about 1% of CFTR submissions in ClinVar, so halving the count would compound an existing under-report. See the Methods page for the full reasoning and caveats.
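The counting rule in this answer can be sketched in a few lines of Python. This is only an illustration, not the dashboard's actual code: the function name, the (date, classification) record shape, and the lowercase label strings are assumptions made for the example.

```python
from collections import Counter
from collections.abc import Iterable
from datetime import date
from itertools import accumulate

# Assumed label strings; real ClinVar exports may spell these differently.
COUNTED = {"pathogenic", "likely pathogenic"}


def cumulative_counts_by_year(
    submissions: Iterable[tuple[date, str]],
) -> dict[int, int]:
    """Return {year: running total} of counted submissions.

    Each submission counts once, with no halving for recessive genes.
    Years with no counted submissions are omitted from the result.
    """
    per_year = Counter(
        submitted.year
        for submitted, classification in submissions
        if classification.strip().lower() in COUNTED
    )
    years = sorted(per_year)
    running_totals = accumulate(per_year[year] for year in years)
    return dict(zip(years, running_totals))


# Hypothetical records for a single gene.
example = [
    (date(2013, 4, 4), "Pathogenic"),
    (date(2013, 4, 4), "Likely pathogenic"),
    (date(2015, 6, 1), "Uncertain significance"),  # excluded from the count
    (date(2016, 1, 2), "Pathogenic"),
]
print(cumulative_counts_by_year(example))  # {2013: 2, 2016: 3}
```

The last value in the result corresponds to the single number shown for a gene; the intermediate values correspond to the trend over time.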


What do "Pathogenic" and "Likely Pathogenic" mean?

These are standardized terms used by clinical genetics labs to describe how confident they are that a variant causes disease:

Pathogenic means there is strong evidence that this variant causes the disease in question. This is the highest confidence level.

Likely Pathogenic means there is good evidence that this variant causes disease, but the evidence isn't quite as strong as for "Pathogenic." In clinical practice, both categories are typically treated the same way for patient care decisions.

We exclude variants classified as "Uncertain Significance" (VUS), "Likely Benign," or "Benign" because there isn't sufficient evidence to say they cause disease.


What are the filters applied to the data?

We apply two main filters to ClinVar data to ensure we're counting variants that are truly specific to each gene, rather than large chromosomal events that happen to overlap the gene:

Variant type filter: We include nearly all variant types: point mutations (SNVs), insertions, deletions, duplications, copy number gains and losses, microsatellites (repeat expansions), and indels. We only exclude "Complex" and "Haplotype" types. Copy number variants are included because many diseases — such as Angelman syndrome — are primarily caused by large chromosomal deletions.

Size filter: We exclude any variant whose span exceeds 50 times the length of the gene or 10 Mb, whichever threshold is larger. This removes whole-chromosome or chromosome-arm structural variants that overlap the gene but aren't gene-specific, while still capturing known pathogenic regional deletions (e.g., the ~5 Mb 15q11-q13 deletion in Angelman syndrome).

For full technical details, see the Methods page.
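As a rough illustration of the two filters in this answer, here is a minimal Python sketch. It assumes the size threshold is the larger of 50 times the gene length and 10 Mb, and that a variant's span is measured from its start to its stop coordinate. The class, function, and field names are hypothetical; the Methods page remains the authoritative description.

```python
from dataclasses import dataclass

EXCLUDED_TYPES = {"Complex", "Haplotype"}  # the two types named above
SIZE_MULTIPLIER = 50                       # 50 times the gene length
MIN_THRESHOLD_BP = 10_000_000              # 10 Mb floor (assumed reading)


@dataclass(frozen=True)
class Variant:
    variant_type: str
    start: int  # 1-based inclusive start coordinate
    stop: int   # 1-based inclusive stop coordinate


def size_threshold_bp(gene_length_bp: int) -> int:
    """Largest span, in base pairs, still treated as gene-specific."""
    return max(SIZE_MULTIPLIER * gene_length_bp, MIN_THRESHOLD_BP)


def passes_filters(variant: Variant, gene_length_bp: int) -> bool:
    """Apply the variant type filter, then the size filter."""
    if variant.variant_type in EXCLUDED_TYPES:
        return False
    span_bp = variant.stop - variant.start + 1
    return span_bp <= size_threshold_bp(gene_length_bp)


# Illustrative numbers only: a 100 kb gene and a 5 Mb regional deletion.
regional_deletion = Variant("Deletion", start=1, stop=5_000_000)
print(passes_filters(regional_deletion, gene_length_bp=100_000))  # True
```

Under this reading, the 10 Mb floor is what keeps a ~5 Mb regional deletion in the count for a small gene, where 50 times the gene length alone would be too strict.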


How often is the data updated?

The dashboard fetches fresh data from ClinVar every 1,440 hours (about every 60 days). ClinVar itself is updated on a rolling basis as labs submit new findings.


Why don't you include DECIPHER data?

DECIPHER (Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources) is a valuable resource for clinical genomics, but access to individual-level variant data requires a formal data access agreement with the DECIPHER consortium.

While DECIPHER provides publicly available aggregate summary statistics, these do not include the per-variant detail needed for variant-level analysis. Adding DECIPHER aggregate counts without deduplication would risk double-counting patients who are already represented in ClinVar.


Why do you only use ClinVar?

ClinVar is the most comprehensive, well-structured, and regularly updated public archive of clinically relevant genetic variants. Each submission (SCV record) includes structured metadata — variant coordinates, classification, submitting laboratory, and submission date — enabling robust temporal and submitter-level analysis.

Other databases such as LOVD and Geno2MP were considered but excluded because they lack the temporal granularity needed for trend analysis, have inconsistent update schedules, or require additional assumptions that reduce confidence in the resulting metrics.


Can I use these numbers in a presentation or publication?

Yes, but with appropriate caveats. We recommend noting that ClinVar submission counts represent an approximate lower bound of identified patients, not a comprehensive epidemiological measure. See the Methods page for language suitable for scientific publications.