Methodology for quantifying pathogenic and likely pathogenic variant submissions using ClinVar across Mendelian disease genes
ClinVar. All variant data were obtained from ClinVar, a freely accessible public archive of reports describing the relationships between human genetic variants and observed health conditions.1 ClinVar is maintained by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine and aggregates submissions from clinical testing laboratories, research groups, expert panels, and other organisations worldwide.
Variant records were retrieved programmatically from ClinVar using the
NCBI Entrez Programming Utilities (E-utilities) API.2 For each
gene of interest, an initial search was performed against the ClinVar database using
the query [GENE][gene] AND clinsig_pathogenic, which returns all variant
records in the specified gene that carry a pathogenic or likely pathogenic clinical
significance assertion. Full variant records were then retrieved in VCV (Variation
Archive) XML format using the efetch endpoint. To comply with NCBI usage
guidelines, requests were rate-limited to no more than 3 per second. All results
were cached locally for 168 hours.
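The retrieval settings described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the names `build_esearch_params`, `RateLimiter`, and `is_cache_fresh` are hypothetical, the `[GENE]` placeholder in the query is assumed to stand for the gene symbol, and `retmax`/`retmode` values are illustrative choices.

```python
import time

CACHE_TTL_SECONDS = 168 * 3600  # 168 hours, as stated in the text


def build_esearch_params(gene_symbol, retmax=500):
    """Build query parameters for an E-utilities esearch call against ClinVar.

    Assumption: the [GENE] placeholder in the text is the gene symbol.
    """
    return {
        "db": "clinvar",
        "term": f"{gene_symbol}[gene] AND clinsig_pathogenic",
        "retmax": retmax,       # illustrative page size
        "retmode": "json",
    }


class RateLimiter:
    """Space successive calls so that at most max_per_second start per second."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock      # injectable for testing
        self.sleep = sleep      # injectable for testing
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


def is_cache_fresh(fetched_at, now):
    """Return True while a cached result is younger than the 168-hour TTL."""
    return (now - fetched_at) < CACHE_TTL_SECONDS
```

In use, `wait()` would be called immediately before each HTTP request, and a cached response would be reused while `is_cache_fresh` returns True.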
Gene coordinates on the GRCh38 reference assembly are retrieved dynamically from the NCBI Datasets API when a gene is first queried. These coordinates are cached locally to avoid repeated lookups. The genomic span filter threshold for each gene is set at 50 times its gene size, with a minimum of 10 Mb, to exclude large copy-number variants that overlap the gene locus but affect numerous other genes. The inheritance mode for each gene is determined from a curated catalog of 4,853 Mendelian disease genes.
Two filters were applied to ClinVar data to exclude variants that, while overlapping a gene of interest, are unlikely to represent gene-specific pathogenic events:
Variant type filter. The following variant types were retained: single nucleotide variant (SNV), deletion, duplication, insertion, indel, microsatellite, copy number gain, copy number loss, and the generic "variation" category used by ClinVar for certain records. Copy number variants are included because many diseases — particularly contiguous gene deletion syndromes such as Angelman syndrome (15q11-q13 deletion) — are primarily caused by large chromosomal deletions or duplications. Excluding these would undercount patients for such conditions. Only the "Complex" and "Haplotype" types were excluded.
Genomic span filter. For variants with annotated GRCh38 coordinates, any variant whose genomic span exceeded the gene-specific threshold (50× gene size, minimum 10 Mb) was excluded. This threshold is intentionally generous to retain known pathogenic regional deletions (e.g., the ~5 Mb Angelman critical region deletion) while excluding whole-chromosome or chromosome-arm copy number variants that are unlikely to represent gene-specific pathogenic events.
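The two filters above can be expressed as a pair of small predicates. This is a sketch under stated assumptions: the function names are hypothetical, variant type strings are assumed to match the labels listed in the text after lower-casing, and variants without annotated coordinates are assumed to pass the span filter, since the text applies it only to variants with GRCh38 coordinates.

```python
RETAINED_TYPES = {
    "single nucleotide variant", "deletion", "duplication", "insertion",
    "indel", "microsatellite", "copy number gain", "copy number loss",
    "variation",
}

GENE_SIZE_MULTIPLIER = 50            # 50x gene size, as stated in the text
MIN_SPAN_THRESHOLD_BP = 10_000_000   # 10 Mb floor


def passes_type_filter(variant_type):
    """Keep only the retained variant types (Complex and Haplotype drop out)."""
    return variant_type.strip().lower() in RETAINED_TYPES


def span_threshold(gene_start, gene_end):
    """Gene-specific threshold: max(50 x gene size, 10 Mb), in base pairs."""
    gene_size = gene_end - gene_start + 1
    return max(GENE_SIZE_MULTIPLIER * gene_size, MIN_SPAN_THRESHOLD_BP)


def passes_span_filter(var_start, var_end, gene_start, gene_end):
    """Exclude variants whose genomic span exceeds the gene-specific threshold."""
    if var_start is None or var_end is None:
        # Assumption: no coordinates means the span filter does not apply.
        return True
    span = var_end - var_start + 1
    return span <= span_threshold(gene_start, gene_end)
```

For a gene of about 100 kb, 50 times the gene size is 5 Mb, so the 10 Mb floor applies and a deletion of roughly 5 Mb is retained.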
A variant was included in the analysis only if its overall (aggregate) germline classification in ClinVar was one of: Pathogenic, Likely pathogenic, or Pathogenic/Likely pathogenic. ClinVar derives this aggregate classification from the individual submissions (SCVs) according to its review status process.
Importantly, the classification filter is applied only at the variant level, not at the individual submission level. Once a variant is determined to be pathogenic or likely pathogenic by aggregate consensus, all submissions for that variant are counted — including those in which the submitting laboratory classified the variant as uncertain significance (VUS) or used another designation.
The raw metric collected is the cumulative number of individual ClinVar submissions (SCV records). Each SCV record typically represents one report of a variant by a distinct clinical testing laboratory or research group.
The primary metric displayed is the cumulative number of pathogenic variant observations — that is, the total number of independent ClinVar submissions reporting a pathogenic or likely pathogenic variant in a given gene.
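The counting rule can be illustrated with a short sketch. The dictionary keys `aggregate_classification` and `submissions` are hypothetical field names chosen for this example, not ClinVar's XML schema.

```python
QUALIFYING_CLASSIFICATIONS = {
    "pathogenic",
    "likely pathogenic",
    "pathogenic/likely pathogenic",
}


def count_observations(variants):
    """Count every SCV of each variant whose aggregate classification qualifies.

    The filter is applied at the variant level only: once a variant qualifies,
    all of its submissions are counted, whatever each submitter asserted.
    """
    total = 0
    for variant in variants:
        label = variant["aggregate_classification"].strip().lower()
        if label in QUALIFYING_CLASSIFICATIONS:
            total += len(variant["submissions"])
    return total


example = [
    {   # qualifies: all three SCVs count, including the VUS submission
        "aggregate_classification": "Pathogenic/Likely pathogenic",
        "submissions": ["Pathogenic", "Likely pathogenic", "Uncertain significance"],
    },
    {   # does not qualify: none of its SCVs count
        "aggregate_classification": "Uncertain significance",
        "submissions": ["Uncertain significance", "Likely pathogenic"],
    },
]
print(count_observations(example))  # 3
```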
An earlier version of this dashboard divided the observation count by 2 for autosomal recessive diseases, on the logic that two pathogenic alleles are required per patient. This approach was abandoned because ClinVar submissions are variant-level observations, not patient-level records:
Each observation is therefore counted at face value regardless of inheritance mode. The metric is best understood as a lower bound on the number of patients who have been genetically identified, with important caveats:
The “Discovery Acceleration” view computes a rolling year-over-year growth rate for each gene. At each calendar quarter Q, the metric is calculated as:
Growth (%) = (Observations in trailing 12 months − Observations in prior 12 months) ÷ Observations in prior 12 months × 100
This normalizes for the absolute size of each disease’s submission base, enabling direct comparison of discovery trends across diseases with very different prevalences. A gene showing +200% growth is accelerating faster than one showing +20%, regardless of which has more total observations. The first two years of data for each gene are excluded, as the metric requires two full years of history.
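A minimal sketch of the growth calculation, assuming submissions have already been binned into a list of monthly counts (oldest first). The function name is hypothetical, and returning `None` when the prior window is empty is an assumption; the text does not say how a zero denominator is handled.

```python
def yoy_growth(monthly_counts, end_index):
    """Year-over-year growth (%) for the 12 months ending at end_index.

    Returns None when fewer than 24 months of history are available,
    or when the prior 12-month window contains no observations.
    """
    if end_index < 23:
        return None  # the metric needs two full years of history
    trailing = sum(monthly_counts[end_index - 11 : end_index + 1])
    prior = sum(monthly_counts[end_index - 23 : end_index - 11])
    if prior == 0:
        return None  # assumption: growth is undefined with an empty base
    return (trailing - prior) / prior * 100


# 12 observations in the first year, 36 in the second: (36 - 12) / 12 * 100
counts = [1] * 12 + [3] * 12
print(yoy_growth(counts, 23))  # 200.0
```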
The time series shown on the dashboard uses the DateCreated attribute of
each ClinVar submission. The cumulative curve therefore represents the growth of the
ClinVar knowledge base over time.
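As an illustration, a cumulative curve can be built from submission dates along these lines. The sketch assumes each DateCreated value is available as an ISO-style `YYYY-MM-DD` string and bins by calendar month; the function name and the monthly granularity are choices made for this example.

```python
from collections import Counter
from itertools import accumulate


def cumulative_by_month(date_created_values):
    """Return (YYYY-MM, cumulative submission count) pairs, oldest first.

    Only months with at least one submission appear in the output.
    """
    per_month = Counter(d[:7] for d in date_created_values)
    months = sorted(per_month)
    totals = accumulate(per_month[m] for m in months)
    return list(zip(months, totals))


dates = ["2020-01-15", "2020-01-20", "2020-03-01"]
print(cumulative_by_month(dates))  # [('2020-01', 2), ('2020-03', 3)]
```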
| Criterion | Included | Excluded |
|---|---|---|
| Genes | Any gene from the catalog of 4,853 Mendelian disease genes (user-selected) | Genes not in the catalog |
| Data source | ClinVar | LOVD, Geno2MP, DECIPHER |
| Overall germline classification | Pathogenic, Likely pathogenic, Pathogenic/Likely pathogenic | Uncertain significance, Benign, Likely benign, Conflicting, etc. |
| Per-submission classification (ClinVar) | All submissions for variants meeting the overall classification criterion above | None (all submissions counted once the variant qualifies) |
| Variant types (ClinVar) | SNV, deletion, duplication, insertion, indel, microsatellite, variation, copy number gain, copy number loss | Complex, Haplotype |
| Genomic span (ClinVar, GRCh38) | ≤ gene-specific threshold (50× gene size, min 10 Mb) | > gene-specific threshold |
Data retrieval and processing were implemented in Python using the
requests library for API access, the standard library
xml.etree.ElementTree module for XML parsing, and
concurrent.futures.ThreadPoolExecutor for parallel data fetching.
The web dashboard was built with Flask and visualised using Plotly.js. Gene
coordinates are retrieved dynamically from the NCBI Datasets API. All source
code is available for inspection and the analysis can be reproduced by running the
application, which will re-query all data sources with the parameters described above.
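The parallel fetching step might look roughly like the sketch below. It is not the application's code: `fetch_all` and the `fetch_one` callable are hypothetical, and the worker count is an illustrative default. In practice `fetch_one` would wrap the rate-limited API calls.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(genes, fetch_one, max_workers=3):
    """Run fetch_one(gene) for each gene in parallel; return {gene: result}.

    Executor.map yields results in input order, so zip pairs them correctly.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(genes, pool.map(fetch_one, genes)))


# Stand-in fetcher for demonstration; a real one would query the API.
results = fetch_all(["UBE3A", "GLA"], lambda gene: len(gene))
print(results)  # {'UBE3A': 5, 'GLA': 3}
```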
The Patient Finding Leaderboard ranks diseases with FDA-approved treatments by their Discovery Rate, a prevalence-normalized measure of ClinVar submission activity.
The Discovery Rate is calculated as:
Discovery Rate = (ClinVar submissions in trailing N months) ÷ (disease prevalence per 100,000)
Disease prevalence estimates are sourced from Orphanet (Orphadata en_product9_prev.xml, version 2025-12-09), using point prevalence where available and birth prevalence otherwise. Three window sizes are provided: 3-month, 6-month, and 12-month rolling windows. A 3-month trailing average is applied to the resulting rates to smooth month-to-month noise caused by laboratories uploading variant data in periodic batches rather than continuously.
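The Discovery Rate and its smoothing step can be sketched as follows, assuming submissions are binned into monthly counts (oldest first). The function names are hypothetical, and rejecting a non-positive prevalence is a safeguard added for this example.

```python
def discovery_rates(monthly_submissions, prevalence_per_100k, window_months):
    """Rolling Discovery Rate: submissions in the trailing window / prevalence."""
    if prevalence_per_100k <= 0:
        raise ValueError("prevalence must be positive")
    rates = []
    for i in range(window_months - 1, len(monthly_submissions)):
        window_total = sum(monthly_submissions[i - window_months + 1 : i + 1])
        rates.append(window_total / prevalence_per_100k)
    return rates


def trailing_average(values, span=3):
    """Smooth a series with a trailing mean over `span` consecutive points."""
    return [
        sum(values[i - span + 1 : i + 1]) / span
        for i in range(span - 1, len(values))
    ]


monthly = [2, 4, 6, 8, 10]
rates = discovery_rates(monthly, prevalence_per_100k=2.0, window_months=3)
print(rates)                    # [6.0, 9.0, 12.0]
print(trailing_average(rates))  # [9.0]
```

Because each rate divides by prevalence, a rare disease with few submissions can rank above a common disease with many.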
ClinVar submissions are not equivalent to unique patients identified. Multiple laboratories may independently submit the same variant for the same patient, and a single patient with a compound heterozygous genotype may generate two variant submissions. The Discovery Rate should therefore be interpreted as a measure of ClinVar submission activity rather than a direct count of patients found.
The leaderboard reflects ClinVar submission culture, not solely patient-finding effectiveness. Diseases diagnosed primarily through repeat-expansion assays (e.g., Huntington disease, Friedreich’s ataxia) or deletion/duplication testing (e.g., spinal muscular atrophy) appear lower on the leaderboard because the clinical laboratories performing these tests do not routinely submit per-patient results to ClinVar. These assays (PCR fragment analysis, MLPA, Southern blot) are typically performed by specialized laboratories with different data-sharing practices than the next-generation sequencing laboratories (e.g., GeneDx, Labcorp/Invitae, Ambry Genetics) that generate the majority of ClinVar submissions. A low Discovery Rate for these diseases does not necessarily indicate poor patient identification—it indicates that the testing workflow bypasses ClinVar.
Batch uploads create transient spikes. Individual laboratories periodically upload large numbers of variant classifications in a single batch, which can temporarily inflate the Discovery Rate for a given disease. The 3-month trailing average mitigates but does not eliminate this effect. Users should consider the Lab Contributions chart on the Dashboard page to identify whether a spike is driven by a single laboratory’s batch upload or by broad-based growth in submissions.
Prevalence estimates vary by geography and methodology. Orphanet prevalence figures represent the best available estimates but may differ substantially from true prevalence in specific populations. For example, sickle cell disease prevalence is approximately 10 per 100,000 in Europe but approximately 30 per 100,000 in the United States. Fabry disease has a clinical point prevalence of 0.15 per 100,000, but newborn screening studies suggest a birth prevalence of 6.66 per 100,000, reflecting a large population of asymptomatic or late-onset individuals.