Methods

Methodology for quantifying pathogenic and likely pathogenic variant submissions using ClinVar across Mendelian disease genes

Data Sources

ClinVar. All variant data were obtained from ClinVar, a freely accessible public archive of reports describing the relationships between human genetic variants and observed health conditions.¹ ClinVar is maintained by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine and aggregates submissions from clinical testing laboratories, research groups, expert panels, and other organisations worldwide.

Data Retrieval

Variant records were retrieved programmatically from ClinVar using the NCBI Entrez Programming Utilities (E-utilities) API.² For each gene of interest, an initial search was performed against the ClinVar database using the query [GENE][gene] AND clinsig_pathogenic, which returns all variant records in the specified gene that carry a pathogenic or likely pathogenic clinical significance assertion. Full variant records were then retrieved in VCV (Variation Archive) XML format using the efetch endpoint. To comply with NCBI usage guidelines, requests were rate-limited to no more than three per second. All results were cached locally for 1,440 hours (60 days).
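As a rough illustration of the rate limiting and caching described above, the following Python sketch spaces requests at least one third of a second apart and checks whether a cached entry is still within the 1440-hour window. The names RateLimiter, build_search_term, and is_cache_fresh are hypothetical and are not taken from the dashboard's source code; the search term simply substitutes a gene symbol into the query quoted above.

```python
import threading
import time

CACHE_TTL_HOURS = 1440  # cache lifetime stated in the text (60 days)


def build_search_term(gene_symbol):
    """Fill a gene symbol into the ClinVar query quoted in the text."""
    return f"{gene_symbol}[gene] AND clinsig_pathogenic"


class RateLimiter:
    """Space calls at least 1/max_per_second seconds apart (hypothetical helper)."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None
        self._lock = threading.Lock()  # lets worker threads share one limiter

    def wait(self):
        """Block until the next request is allowed."""
        with self._lock:
            now = self._clock()
            if self._last_call is not None:
                remaining = self.min_interval - (now - self._last_call)
                if remaining > 0:
                    self._sleep(remaining)
                    now = self._clock()
            self._last_call = now


def is_cache_fresh(cached_at, now, ttl_hours=CACHE_TTL_HOURS):
    """Return True if a cached entry (timestamps in seconds) has not expired."""
    return (now - cached_at) < ttl_hours * 3600
```

Calling wait() on a shared limiter before every E-utilities request keeps a single process at or below three requests per second, even when requests are issued from several threads.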

Gene Coordinate Lookup

Gene coordinates on the GRCh38 reference assembly are retrieved dynamically from the NCBI Datasets API when a gene is first queried. These coordinates are cached locally to avoid repeated lookups and are used by the genomic span filter described under Variant Filtering below. The inheritance mode for each gene is determined from a curated catalog of 4,853 Mendelian disease genes.

Variant Filtering

Two filters were applied to ClinVar data to exclude variants that, while overlapping a gene of interest, are unlikely to represent gene-specific pathogenic events:

Variant type filter. The following variant types were retained: single nucleotide variant (SNV), deletion, duplication, insertion, indel, microsatellite, copy number gain, copy number loss, and the generic "variation" category used by ClinVar for certain records. Copy number variants are included because many diseases — particularly contiguous gene deletion syndromes such as Angelman syndrome (15q11-q13 deletion) — are primarily caused by large chromosomal deletions or duplications. Excluding these would undercount patients for such conditions. Only the "Complex" and "Haplotype" types were excluded.

Genomic span filter. For variants with annotated GRCh38 coordinates, any variant whose genomic span exceeded the gene-specific threshold (50× gene size, minimum 10 Mb) was excluded. This threshold is intentionally generous to retain known pathogenic regional deletions (e.g., the ~5 Mb Angelman critical region deletion) while excluding whole-chromosome or chromosome-arm copy number variants that are unlikely to represent gene-specific pathogenic events.
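The two filters above can be sketched in Python as follows. This is a minimal illustration rather than the dashboard's actual code: the function and constant names are hypothetical, type labels are compared case-insensitively as an assumption about how ClinVar type strings are normalised, and variants without GRCh38 coordinates are assumed to bypass the span filter, since the text applies it only to variants with annotated coordinates.

```python
# Variant types retained by the type filter (lower-cased for comparison).
RETAINED_TYPES = {
    "single nucleotide variant", "deletion", "duplication", "insertion",
    "indel", "microsatellite", "copy number gain", "copy number loss",
    "variation",
}
SPAN_MULTIPLIER = 50           # 50x gene size
MIN_THRESHOLD_BP = 10_000_000  # minimum threshold of 10 Mb


def span_threshold(gene_start, gene_end):
    """Gene-specific span threshold in base pairs (inclusive coordinates)."""
    gene_size = gene_end - gene_start + 1
    return max(SPAN_MULTIPLIER * gene_size, MIN_THRESHOLD_BP)


def passes_filters(variant_type, gene_start, gene_end, var_start=None, var_end=None):
    """Apply the variant type filter, then the genomic span filter."""
    if variant_type.strip().lower() not in RETAINED_TYPES:
        return False  # e.g. "Complex" or "Haplotype"
    if var_start is None or var_end is None:
        return True  # assumption: no GRCh38 coordinates, span filter skipped
    span = var_end - var_start + 1
    return span <= span_threshold(gene_start, gene_end)
```

For a 100 kb gene, 50× the gene size is only 5 Mb, so the 10 Mb minimum applies: a 5 Mb regional deletion is retained, while a 40 Mb copy number loss is excluded.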

Classification Criteria

A variant was included in the analysis only if its overall (aggregate) germline classification in ClinVar was one of: Pathogenic, Likely pathogenic, or Pathogenic/Likely pathogenic. ClinVar derives this aggregate classification from the individual submissions (SCVs) according to its review status process.

Importantly, the classification filter is applied only at the variant level, not at the individual submission level. Once a variant is determined to be pathogenic or likely pathogenic by aggregate consensus, all submissions for that variant are counted — including those in which the submitting laboratory classified the variant as uncertain significance (VUS) or used another designation.

Counting Method

The raw metric collected is the cumulative number of individual ClinVar submissions (SCV records). Each SCV record typically represents a distinct clinical testing laboratory or research group reporting a variant.
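The interaction between the variant-level classification filter and the submission-level count can be illustrated with a short Python sketch. The dictionary layout (aggregate_classification, submissions) is a hypothetical in-memory representation chosen for the example, not the structure of the VCV XML or of the dashboard's code.

```python
# Aggregate germline classifications that qualify a variant for inclusion.
QUALIFYING = {"Pathogenic", "Likely pathogenic", "Pathogenic/Likely pathogenic"}


def count_observations(variants):
    """Count every SCV of each variant whose aggregate classification qualifies."""
    total = 0
    for variant in variants:
        if variant["aggregate_classification"] in QUALIFYING:
            # All submissions count, whatever the individual lab asserted.
            total += len(variant["submissions"])
    return total


example = [
    {
        "aggregate_classification": "Pathogenic",
        "submissions": ["Pathogenic", "Uncertain significance", "Likely pathogenic"],
    },
    {
        "aggregate_classification": "Uncertain significance",
        "submissions": ["Uncertain significance", "Likely pathogenic"],
    },
]
print(count_observations(example))
```

In this example the first variant contributes all three of its submissions, including the one classified as uncertain significance, while the second variant contributes none.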

Interpretation of Observation Counts

The primary metric displayed is the cumulative number of pathogenic variant observations: the total number of individual ClinVar submissions (SCV records) for variants in a given gene whose aggregate classification is pathogenic or likely pathogenic.

Why we do not adjust for inheritance mode

An earlier version of this dashboard divided the observation count by 2 for autosomal recessive diseases, on the logic that two pathogenic alleles are required per patient. This approach was abandoned because ClinVar submissions are variant-level observations, not patient-level records:

A submission says “Lab X observed variant V and classified it as pathogenic for disease D.” Each submission represents at least one patient in whom the lab identified that variant.
For a recessive disease patient who is compound heterozygous (variants V1 + V2), the lab may submit both variants, only one, or neither to ClinVar. There is no mechanism to link the two submissions to the same patient.
Dividing by 2 assumes both alleles are always independently submitted — an assumption that is particularly violated for well-characterized variants. For example, CFTR F508del is carried by roughly 70% of cystic fibrosis patients but represents only about 1% of CFTR submissions in ClinVar (1.3% of pathogenic / likely pathogenic submissions), because labs rarely resubmit well-known variants.
The result is a double penalty for recessive diseases: common variants are underreported (deflating the numerator), and the total is then halved (deflating the estimate further).

Each observation is therefore counted at face value regardless of inheritance mode. The metric is best understood as an approximate lower bound on the number of patients who have been genetically identified, with important caveats:

The true patient count is almost certainly higher, as many patients are never submitted to any public variant database.
For recessive diseases, some double-counting may occur when both alleles of the same patient are independently submitted. However, this is likely outweighed by the much larger number of patients whose variants are not submitted at all.
A single submission may encompass multiple patients tested by the same lab.
Different laboratories may submit the same variant for the same patient independently.

Discovery Acceleration

The “Discovery Acceleration” view computes a rolling year-over-year growth rate for each gene. At each calendar quarter Q, the metric is calculated as:

Growth (%) = (Observations in trailing 12 months − Observations in prior 12 months) ÷ Observations in prior 12 months × 100

This normalizes for the absolute size of each disease’s submission base, enabling direct comparison of discovery trends across diseases with very different prevalences. A gene showing +200% growth is accelerating faster than one showing +20%, regardless of which has more total observations. The first two years of data for each gene are excluded, as the metric requires two full years of history.
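A minimal Python sketch of the growth calculation follows, assuming the submissions have already been binned into a list of monthly counts. The function name, the monthly binning, and the choice to return None when the prior-year count is zero are assumptions made for illustration; the text does not say how a zero denominator is handled.

```python
def yoy_growth(monthly_counts, end_index):
    """Growth (%) of the 12 months ending at end_index over the 12 months before.

    monthly_counts[i] is the number of observations in month i, counted from
    the first month of data for the gene.
    """
    if end_index < 23:
        return None  # fewer than two full years of history
    trailing = sum(monthly_counts[end_index - 11:end_index + 1])
    prior = sum(monthly_counts[end_index - 23:end_index - 11])
    if prior == 0:
        return None  # assumption: growth is undefined on a zero base
    return (trailing - prior) / prior * 100


# One observation per month in year 1, three per month in year 2.
counts = [1] * 12 + [3] * 12
print(yoy_growth(counts, 23))
```

Here the trailing 12 months contain 36 observations and the prior 12 months contain 12, giving (36 − 12) ÷ 12 × 100 = 200%. Evaluating the function at the last month of each calendar quarter would give one value per quarter Q.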

Temporal Analysis

The time series shown on the dashboard uses the DateCreated attribute of each ClinVar submission. The cumulative curve therefore represents the growth of the ClinVar knowledge base over time.
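The cumulative curve can be sketched in Python as below, assuming the DateCreated values have already been extracted as ISO-formatted strings (YYYY-MM-DD) and are grouped by month. The function name and the monthly granularity are illustrative assumptions.

```python
from collections import Counter
from itertools import accumulate


def cumulative_by_month(date_created_values):
    """Return (month, cumulative submission count) pairs in chronological order."""
    # The first seven characters of an ISO date are "YYYY-MM".
    per_month = Counter(value[:7] for value in date_created_values)
    months = sorted(per_month)
    totals = accumulate(per_month[month] for month in months)
    return list(zip(months, totals))


dates = ["2020-01-15", "2020-03-02", "2020-01-20"]
print(cumulative_by_month(dates))
```

Months with no submissions do not appear in the output of this sketch; a plotting layer would need to carry the previous total forward if a continuous monthly axis is required.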

Summary of Inclusion and Exclusion Criteria

Criterion	Included	Excluded
Genes	Any gene from the catalog of 4,853 Mendelian disease genes (user-selected)	Genes not in the catalog
Data source	ClinVar	LOVD, Geno2MP, DECIPHER
Overall germline classification	Pathogenic, Likely pathogenic, Pathogenic/Likely pathogenic	Uncertain significance, Benign, Likely benign, Conflicting, etc.
Per-submission classification (ClinVar)	All submissions for variants meeting the overall classification criterion above	None (all submissions counted once the variant qualifies)
Variant types (ClinVar)	SNV, deletion, duplication, insertion, indel, microsatellite, variation, copy number gain, copy number loss	Complex, Haplotype
Genomic span (ClinVar, GRCh38)	≤ gene-specific threshold (50× gene size, min 10 Mb)	> gene-specific threshold

Software and Reproducibility

Data retrieval and processing were implemented in Python using the requests library for API access, the standard library xml.etree.ElementTree module for XML parsing, and concurrent.futures.ThreadPoolExecutor for parallel data fetching. The web dashboard was built with Flask and visualised using Lightweight-Charts. Gene coordinates are retrieved dynamically from the NCBI Datasets API. All source code is available for inspection and the analysis can be reproduced by running the application, which will re-query all data sources with the parameters described above.
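The parallel fetching pattern mentioned above can be sketched with the standard library as follows. The fetch_all function and its fetch_gene argument are hypothetical stand-ins for whatever per-gene retrieval routine the application uses; no real network call is made in this example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(genes, fetch_gene, max_workers=4):
    """Run fetch_gene for each gene in a thread pool and map gene -> result."""
    genes = list(genes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        results = pool.map(fetch_gene, genes)
        return dict(zip(genes, results))


# Stand-in fetcher for demonstration; a real one would query ClinVar.
print(fetch_all(["CFTR", "UBE3A"], len))
```

Because worker threads issue requests concurrently, any rate limit would need to be enforced across all threads rather than per thread.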

Patient Finding Leaderboard

The Patient Finding Leaderboard ranks diseases with FDA-approved treatments by their Discovery Rate, a prevalence-normalized measure of ClinVar submission activity.

The Discovery Rate is calculated as:

Discovery Rate = (ClinVar submissions in trailing N months) ÷ (disease prevalence per 100,000)

Disease prevalence estimates are sourced from Orphanet (Orphadata en_product9_prev.xml, version 2025-12-09), using point prevalence where available and birth prevalence otherwise. Three window sizes are provided: 3-month, 6-month, and 12-month rolling windows. A 3-month trailing average is applied to the resulting rates to smooth month-to-month noise caused by laboratories uploading variant data in periodic batches rather than continuously.
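The Discovery Rate and its smoothing step can be sketched in Python as follows, assuming submissions have been binned into monthly counts. The function names are hypothetical, and the sketch only emits values once a full window is available, which is an assumption about edge handling that the text does not specify.

```python
def discovery_rates(monthly_submissions, prevalence_per_100k, window_months):
    """Trailing-window submission count divided by prevalence per 100,000."""
    rates = []
    for i in range(window_months - 1, len(monthly_submissions)):
        window_total = sum(monthly_submissions[i - window_months + 1:i + 1])
        rates.append(window_total / prevalence_per_100k)
    return rates


def trailing_average(values, span=3):
    """Mean of each value and the span - 1 values before it."""
    return [
        sum(values[i - span + 1:i + 1]) / span
        for i in range(span - 1, len(values))
    ]


monthly = [3, 3, 3, 6, 6, 6]
raw = discovery_rates(monthly, prevalence_per_100k=3.0, window_months=3)
smoothed = trailing_average(raw, span=3)
print(raw, smoothed)
```

With a prevalence of 3 per 100,000 and a 3-month window, the raw rates are 3.0, 4.0, 5.0, and 6.0, and the 3-month trailing average of those rates is 4.0 and then 5.0.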

Limitations and Caveats

ClinVar submissions are not equivalent to unique patients identified. Multiple laboratories may independently submit the same variant for the same patient, and a single patient with a compound heterozygous genotype may generate two variant submissions. The Discovery Rate should therefore be interpreted as a measure of ClinVar submission activity rather than a direct count of patients found.

The leaderboard reflects ClinVar submission culture, not solely patient-finding effectiveness. Diseases diagnosed primarily through repeat-expansion assays (e.g., Huntington disease, Friedreich’s ataxia) or deletion/duplication testing (e.g., spinal muscular atrophy) appear lower on the leaderboard because the clinical laboratories performing these tests do not routinely submit per-patient results to ClinVar. These assays (PCR fragment analysis, MLPA, Southern blot) are typically performed by specialized laboratories with different data-sharing practices than the next-generation sequencing laboratories (e.g., GeneDx, Labcorp/Invitae, Ambry Genetics) that generate the majority of ClinVar submissions. A low Discovery Rate for these diseases does not necessarily indicate poor patient identification—it indicates that the testing workflow bypasses ClinVar.

Batch uploads create transient spikes. Individual laboratories periodically upload large numbers of variant classifications in a single batch, which can temporarily inflate the Discovery Rate for a given disease. The 3-month trailing average mitigates but does not eliminate this effect. Users should consider the Lab Contributions chart on each gene’s analysis page to identify whether a spike is driven by a single laboratory’s batch upload or by broad-based growth in submissions.

Prevalence estimates vary by geography and methodology. Orphanet prevalence figures represent best available estimates but may differ substantially from true prevalence in specific populations. For example, sickle cell disease prevalence is approximately 10 per 100,000 in Europe but approximately 30 per 100,000 in the United States. Fabry disease has a clinical point prevalence of 0.15 per 100,000 but newborn screening studies suggest a birth prevalence of 6.66 per 100,000, reflecting a large population of asymptomatic or late-onset individuals.

References

1. Landrum MJ, Lee JM, Benson M, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Research. 2018;46(D1):D1062–D1067. doi:10.1093/nar/gkx1153
2. Sayers E. E-utilities Quick Start. In: Entrez Programming Utilities Help. Bethesda (MD): National Center for Biotechnology Information (US); 2008–. https://www.ncbi.nlm.nih.gov/books/NBK25500/
3. NCBI ClinVar. National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/clinvar/
4. Köhler S, Gargano M, Matentzoglu N, et al. The Human Phenotype Ontology in 2021. Nucleic Acids Research. 2021;49(D1):D1207–D1217. doi:10.1093/nar/gkaa1043

Cache duration: 1440 hours
Data source: ClinVar
Gene catalog: 4,853 Mendelian disease genes with curated inheritance modes