Methods

Methodology for quantifying pathogenic and likely pathogenic variant submissions using ClinVar across Mendelian disease genes

Data Sources

ClinVar. All variant data were obtained from ClinVar, a freely accessible public archive of reports describing the relationships between human genetic variants and observed health conditions [1]. ClinVar is maintained by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine and aggregates submissions from clinical testing laboratories, research groups, expert panels, and other organisations worldwide.

Data Retrieval

Variant records were retrieved programmatically from ClinVar using the NCBI Entrez Programming Utilities (E-utilities) API [2]. For each gene of interest, an initial search was performed against the ClinVar database using the query [GENE][gene] AND clinsig_pathogenic, where [GENE] stands for the gene symbol. This query returns all variant records in the specified gene that carry a pathogenic or likely pathogenic clinical significance assertion. Full variant records were then retrieved in VCV (Variation Archive) XML format using the efetch endpoint. To comply with NCBI usage guidelines, requests were rate-limited to no more than 3 per second. All results were cached locally for 168 hours.
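As an illustration of the search step and the rate limit, here is a minimal sketch using only the Python standard library (the dashboard itself uses the requests library). The helper names build_esearch_params, RateLimiter, and search_clinvar_ids are hypothetical, as are the retmax default and the timeout; the efetch retrieval of VCV XML is not shown.

```python
import json
import time
import urllib.parse
import urllib.request

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_params(gene, retmax=500, retstart=0):
    """Query parameters for one page of a ClinVar search for a gene symbol."""
    return {
        "db": "clinvar",
        "term": f"{gene}[gene] AND clinsig_pathogenic",
        "retmode": "json",
        "retmax": retmax,      # page size (assumed value)
        "retstart": retstart,  # offset of the first record on this page
    }


class RateLimiter:
    """Spaces calls so that at most max_per_second are issued per second."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


def search_clinvar_ids(gene, limiter):
    """Return the ClinVar UIDs matching the gene query (performs a network call)."""
    limiter.wait()
    url = ESEARCH_URL + "?" + urllib.parse.urlencode(build_esearch_params(gene))
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]
```

A single RateLimiter instance would be shared by every request in a run, for example search_clinvar_ids("CFTR", limiter). The class as written is not thread-safe, so concurrent callers would need a lock around wait().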

Gene Coordinate Lookup

Gene coordinates on the GRCh38 reference assembly are retrieved dynamically from the NCBI Datasets API when a gene is first queried. These coordinates are cached locally to avoid repeated lookups. The genomic span filter threshold for each gene is set at 50 times its gene size, with a minimum of 10 Mb (see Variant Filtering), to exclude large copy-number variants that overlap the gene locus but affect numerous other genes. The inheritance mode for each gene is determined from a curated catalog of 4,853 Mendelian disease genes.

Variant Filtering

Two filters were applied to ClinVar data to exclude variants that, while overlapping a gene of interest, are unlikely to represent gene-specific pathogenic events:

Variant type filter. The following variant types were retained: single nucleotide variant (SNV), deletion, duplication, insertion, indel, microsatellite, copy number gain, copy number loss, and the generic "variation" category used by ClinVar for certain records. Copy number variants are included because many diseases — particularly contiguous gene deletion syndromes such as Angelman syndrome (15q11-q13 deletion) — are primarily caused by large chromosomal deletions or duplications. Excluding these would undercount patients for such conditions. Only the "Complex" and "Haplotype" types were excluded.

Genomic span filter. For variants with annotated GRCh38 coordinates, any variant whose genomic span exceeded the gene-specific threshold (50× gene size, minimum 10 Mb) was excluded. This threshold is intentionally generous to retain known pathogenic regional deletions (e.g., the ~5 Mb Angelman critical region deletion) while excluding whole-chromosome or chromosome-arm copy number variants that are unlikely to represent gene-specific pathogenic events.
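The two filters above can be sketched as follows. This is a minimal illustration, not the dashboard's code: the function names are hypothetical, variant type strings are assumed to match the ClinVar labels after lower-casing, coordinates are assumed to be 1-based and inclusive, and variants without GRCh38 coordinates are assumed to skip the span check.

```python
# Variant types retained by the type filter (compared in lower case).
RETAINED_TYPES = {
    "single nucleotide variant", "deletion", "duplication", "insertion",
    "indel", "microsatellite", "copy number gain", "copy number loss",
    "variation",
}
SPAN_MULTIPLIER = 50                # threshold is 50x the gene size...
MIN_SPAN_THRESHOLD_BP = 10_000_000  # ...but never less than 10 Mb


def span_threshold(gene_start, gene_end):
    """Gene-specific maximum variant span in base pairs."""
    gene_size = gene_end - gene_start + 1
    return max(SPAN_MULTIPLIER * gene_size, MIN_SPAN_THRESHOLD_BP)


def passes_filters(variant_type, variant_start, variant_end, gene_start, gene_end):
    """Apply the variant type filter, then the genomic span filter."""
    if variant_type.strip().lower() not in RETAINED_TYPES:
        return False
    if variant_start is None or variant_end is None:
        # No GRCh38 coordinates: the span filter does not apply (assumption).
        return True
    variant_span = variant_end - variant_start + 1
    return variant_span <= span_threshold(gene_start, gene_end)
```

For a 100 kb gene, 50 times the gene size is 5 Mb, so the 10 Mb minimum applies: a 5 Mb deletion is retained and a 12 Mb copy number loss is excluded.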

Classification Criteria

A variant was included in the analysis only if its overall (aggregate) germline classification in ClinVar was one of: Pathogenic, Likely pathogenic, or Pathogenic/Likely pathogenic. ClinVar derives this aggregate classification from the individual submissions (SCVs) according to its review status process.

Importantly, the classification filter is applied only at the variant level, not at the individual submission level. Once a variant is determined to be pathogenic or likely pathogenic by aggregate consensus, all submissions for that variant are counted — including those in which the submitting laboratory classified the variant as uncertain significance (VUS) or used another designation.

Counting Method

The raw metric collected is the cumulative number of individual ClinVar submissions (SCV records). Each SCV record typically represents a distinct clinical testing laboratory or research group reporting a variant.
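The variant-level classification rule and the counting method can be illustrated with a short sketch. The record layout (dictionaries with aggregate_classification and submissions keys) is a hypothetical simplification of the parsed VCV data, not the actual structure used by the dashboard.

```python
QUALIFYING_CLASSIFICATIONS = {
    "pathogenic",
    "likely pathogenic",
    "pathogenic/likely pathogenic",
}


def count_observations(variants):
    """Count every SCV of each variant whose aggregate classification qualifies."""
    total = 0
    for variant in variants:
        aggregate = variant["aggregate_classification"].strip().lower()
        if aggregate in QUALIFYING_CLASSIFICATIONS:
            # All submissions count, whatever each submitter's own classification.
            total += len(variant["submissions"])
    return total


example_variants = [
    {
        "aggregate_classification": "Pathogenic",
        "submissions": ["Pathogenic", "Likely pathogenic", "Uncertain significance"],
    },
    {
        "aggregate_classification": "Uncertain significance",
        "submissions": ["Uncertain significance", "Likely benign"],
    },
]
```

Here count_observations(example_variants) yields 3: the first variant contributes all three of its submissions, including the one classified as uncertain significance, and the second variant contributes none.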

Interpretation of Observation Counts

The primary metric displayed is the cumulative number of pathogenic variant observations — that is, the total number of independent ClinVar submissions reporting a pathogenic or likely pathogenic variant in a given gene.

Why we do not adjust for inheritance mode

An earlier version of this dashboard divided the observation count by 2 for autosomal recessive diseases, on the logic that two pathogenic alleles are required per patient. This approach was abandoned because ClinVar submissions are variant-level observations, not patient-level records.

Each observation is therefore counted at face value regardless of inheritance mode. The metric is best understood as a lower bound on the number of patients who have been genetically identified, with important caveats (see Limitations and Caveats).

Discovery Acceleration

The “Discovery Acceleration” view computes a rolling year-over-year growth rate for each gene. At each calendar quarter Q, the metric is calculated as:

Growth (%) = (Observations in trailing 12 months − Observations in prior 12 months) ÷ Observations in prior 12 months × 100

This normalizes for the absolute size of each disease’s submission base, enabling direct comparison of discovery trends across diseases with very different prevalences. A gene showing +200% growth is accelerating faster than one showing +20%, regardless of which has more total observations. The first two years of data for each gene are excluded, as the metric requires two full years of history.
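A minimal sketch of the growth calculation, assuming the input is a list of new (not cumulative) observation counts per calendar quarter in chronological order. The function name yoy_growth is hypothetical, and returning None when the prior-year count is zero is an assumption, since the formula is undefined in that case.

```python
def yoy_growth(quarterly_counts):
    """Rolling year-over-year growth (%) at each quarter, or None if undefined."""
    growth = []
    for i in range(len(quarterly_counts)):
        if i < 7:
            # Fewer than eight quarters (two full years) of history.
            growth.append(None)
            continue
        trailing = sum(quarterly_counts[i - 3:i + 1])  # most recent 4 quarters
        prior = sum(quarterly_counts[i - 7:i - 3])     # the 4 quarters before those
        if prior == 0:
            growth.append(None)
        else:
            growth.append((trailing - prior) / prior * 100)
    return growth
```

For example, a gene with 10 observations per quarter in its first year and 30 per quarter in its second has 120 trailing and 40 prior observations at the eighth quarter, giving (120 − 40) ÷ 40 × 100 = +200%.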

Temporal Analysis

The time series shown on the dashboard uses the DateCreated attribute of each ClinVar submission. The cumulative curve therefore represents the growth of the ClinVar knowledge base over time.
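The cumulative curve can be built along these lines. This sketch assumes DateCreated values are ISO-formatted strings beginning with YYYY-MM-DD and bins them by calendar month; the function name and the monthly granularity are illustrative choices, and months with no submissions are simply omitted from the output.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def cumulative_by_month(date_created_values):
    """Return (month, cumulative submission count) pairs in chronological order."""
    per_month = Counter(
        date.fromisoformat(value[:10]).strftime("%Y-%m")
        for value in date_created_values
    )
    months = sorted(per_month)
    totals = accumulate(per_month[month] for month in months)
    return list(zip(months, totals))
```

For the inputs "2020-01-15", "2020-01-20", and "2020-03-02", the result is [("2020-01", 2), ("2020-03", 3)].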

Summary of Inclusion and Exclusion Criteria

| Criterion | Included | Excluded |
|---|---|---|
| Genes | Any gene from the catalog of 4,853 Mendelian disease genes (user-selected) | Genes not in the catalog |
| Data source | ClinVar | LOVD, Geno2MP, DECIPHER |
| Overall germline classification | Pathogenic, Likely pathogenic, Pathogenic/Likely pathogenic | Uncertain significance, Benign, Likely benign, Conflicting, etc. |
| Per-submission classification (ClinVar) | All submissions for variants meeting the overall classification criterion above | None (all submissions counted once the variant qualifies) |
| Variant types (ClinVar) | SNV, deletion, duplication, insertion, indel, microsatellite, variation, copy number gain, copy number loss | Complex, Haplotype |
| Genomic span (ClinVar, GRCh38) | ≤ gene-specific threshold (50× gene size, min 10 Mb) | > gene-specific threshold |

Software and Reproducibility

Data retrieval and processing were implemented in Python using the requests library for API access, the standard library xml.etree.ElementTree module for XML parsing, and concurrent.futures.ThreadPoolExecutor for parallel data fetching. The web dashboard was built with Flask and visualised using Plotly.js. Gene coordinates are retrieved dynamically from the NCBI Datasets API. All source code is available for inspection and the analysis can be reproduced by running the application, which will re-query all data sources with the parameters described above.
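As a rough sketch of the parallel fetching pattern, the following shows how ThreadPoolExecutor can run one retrieval per gene. The fetch_all name, the injected fetch_gene callable, and the worker count are all hypothetical; the actual retrieval function and concurrency settings are not described here.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(genes, fetch_gene, max_workers=3):
    """Run fetch_gene(gene) for each gene in parallel; return {gene: result}."""
    genes = list(genes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input genes.
        results = pool.map(fetch_gene, genes)
        return dict(zip(genes, results))
```

Any exception raised inside fetch_gene surfaces when its result is consumed, so a failed gene aborts the whole call in this sketch. Worker threads would also need to share a thread-safe rate limiter to stay within the limit of 3 requests per second.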

Patient Finding Leaderboard

The Patient Finding Leaderboard ranks diseases with FDA-approved treatments by their Discovery Rate, a prevalence-normalized measure of ClinVar submission activity.

The Discovery Rate is calculated as:

Discovery Rate = (ClinVar submissions in trailing N months) ÷ (disease prevalence per 100,000)

Disease prevalence estimates are sourced from Orphanet (Orphadata en_product9_prev.xml, version 2025-12-09), using point prevalence where available and birth prevalence otherwise. Three window sizes are provided: 3-month, 6-month, and 12-month rolling windows. A 3-month trailing average is applied to the resulting rates to smooth month-to-month noise caused by laboratories uploading variant data in periodic batches rather than continuously.
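The calculation can be sketched as below, assuming monthly submission counts in chronological order. The function name is hypothetical, and the handling of the first months of the smoothed series (averaging over however many rates are available) is an assumption rather than a documented behaviour.

```python
def discovery_rates(monthly_submissions, prevalence_per_100k,
                    window_months=12, smooth_months=3):
    """Rolling-window Discovery Rate, smoothed with a trailing average."""
    if prevalence_per_100k <= 0:
        raise ValueError("prevalence must be positive")

    # Raw rate: submissions in the trailing window divided by prevalence.
    raw = []
    for i in range(window_months - 1, len(monthly_submissions)):
        window_total = sum(monthly_submissions[i - window_months + 1:i + 1])
        raw.append(window_total / prevalence_per_100k)

    # Trailing average over the most recent smooth_months raw rates.
    smoothed = []
    for i in range(len(raw)):
        recent = raw[max(0, i - smooth_months + 1):i + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed
```

With monthly counts [1, 2, 3, 4], a prevalence of 2 per 100,000, and a 3-month window, the raw rates are 3.0 and 4.5 and the smoothed rates are 3.0 and 3.75.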

Limitations and Caveats

ClinVar submissions are not equivalent to unique patients identified. Multiple laboratories may independently submit the same variant for the same patient, and a single patient with a compound heterozygous genotype may generate two variant submissions. The Discovery Rate should therefore be interpreted as a measure of ClinVar submission activity rather than a direct count of patients found.

The leaderboard reflects ClinVar submission culture, not solely patient-finding effectiveness. Diseases diagnosed primarily through repeat-expansion assays (e.g., Huntington disease, Friedreich’s ataxia) or deletion/duplication testing (e.g., spinal muscular atrophy) appear lower on the leaderboard because the clinical laboratories performing these tests do not routinely submit per-patient results to ClinVar. These assays (PCR fragment analysis, MLPA, Southern blot) are typically performed by specialized laboratories with different data-sharing practices than the next-generation sequencing laboratories (e.g., GeneDx, Labcorp/Invitae, Ambry Genetics) that generate the majority of ClinVar submissions. A low Discovery Rate for these diseases does not necessarily indicate poor patient identification—it indicates that the testing workflow bypasses ClinVar.

Batch uploads create transient spikes. Individual laboratories periodically upload large numbers of variant classifications in a single batch, which can temporarily inflate the Discovery Rate for a given disease. The 3-month trailing average mitigates but does not eliminate this effect. Users should consider the Lab Contributions chart on the Dashboard page to identify whether a spike is driven by a single laboratory’s batch upload or by broad-based growth in submissions.

Prevalence estimates vary by geography and methodology. Orphanet prevalence figures represent best available estimates but may differ substantially from true prevalence in specific populations. For example, sickle cell disease prevalence is approximately 10 per 100,000 in Europe but approximately 30 per 100,000 in the United States. Fabry disease has a clinical point prevalence of 0.15 per 100,000 but newborn screening studies suggest a birth prevalence of 6.66 per 100,000, reflecting a large population of asymptomatic or late-onset individuals.

References

  1. Landrum MJ, Lee JM, Benson M, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Research. 2018;46(D1):D1062–D1067. doi:10.1093/nar/gkx1153
  2. Sayers E. E-utilities Quick Start. In: Entrez Programming Utilities Help. Bethesda (MD): National Center for Biotechnology Information (US); 2008–. https://www.ncbi.nlm.nih.gov/books/NBK25500/
  3. NCBI ClinVar. National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/clinvar/
  4. Köhler S, Gargano M, Matentzoglu N, et al. The Human Phenotype Ontology in 2021. Nucleic Acids Research. 2021;49(D1):D1207–D1217. doi:10.1093/nar/gkaa1043