What is Dr. Zubair Khalid's research focus?

Dr. Zubair Khalid specializes in molecular virology, mRNA vaccine development, and computational biology, with a focus on avian pathogens like IBDV and Avian Reovirus.

Where is Dr. Zubair Khalid currently working?

Dr. Zubair Khalid is a Postdoctoral Research Associate at the University of Maryland (UMD), specifically within the Department of Animal and Avian Sciences.

Variant Calling in Whole Exome Sequencing (WES): Principles, Algorithms, and Veterinary Applications

Whole exome sequencing (WES) has become a cornerstone of genomic investigation in both human and veterinary medicine. By targeting the protein-coding regions of the genome, WES reduces sequencing costs and data complexity while retaining the most interpretable fraction of the genome for variant discovery. The process of identifying single nucleotide variants (SNVs), small insertions and deletions (indels), and copy number variants from WES data is termed variant calling. This article reviews the variant calling pipeline for WES, with a focus on algorithmic principles, comparative performance of calling tools, and considerations for veterinary diagnostics.

The WES Variant Calling Pipeline

The variant calling pipeline for WES consists of a series of computational steps that transform raw sequencing reads into a list of high-confidence genetic variants. Each step introduces potential sources of error and bias, and the choice of tools and parameters at each stage directly affects the sensitivity and specificity of the final variant set.

Step 1: Raw Data Processing and Quality Control

Raw sequencing data from high-throughput sequencers are generated as base call files (BCL) and demultiplexed into FASTQ files. Quality control at this stage involves assessment of per-base quality scores, GC content, adapter contamination, and duplication rates. Tools such as FastQC and MultiQC are commonly used for this purpose. Reads failing quality thresholds (e.g., Phred score below 20) are filtered or trimmed.
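The quality threshold mentioned above can be made concrete with a short sketch. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means roughly one error in 100 base calls. The function names below are illustrative, the Phred+33 encoding is an assumption (it is the common modern FASTQ encoding), and the simple 3' trimming rule is a simplification of what dedicated trimming tools do.

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence: str, quality_string: str, min_q: int = 20) -> tuple[str, str]:
    """Remove trailing bases whose quality is below min_q (simplified rule)."""
    quals = decode_qualities(quality_string)
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_string[:end]


if __name__ == "__main__":
    print(phred_to_error_prob(20))
    print(trim_3prime("ACGT", "II5#"))
```

Real trimmers usually apply sliding-window or running-sum rules rather than this single pass from the 3' end, so the sketch is best read as an illustration of the scoring scale.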

Step 2: Read Alignment to a Reference Genome

Filtered reads are aligned to a reference genome using a short-read DNA aligner; splice-aware aligners are needed only for RNA-derived reads, not for exome capture of genomic DNA. For WES data, the Burrows-Wheeler Aligner (BWA-MEM) is the most widely used tool. The alignment process generates Sequence Alignment/Map (SAM) files, which are subsequently compressed to Binary Alignment/Map (BAM) files. Post-alignment processing includes sorting, marking of duplicate reads (arising from PCR amplification during library preparation), and base quality score recalibration (BQSR). These steps are critical for reducing systematic errors that can mimic true variants.
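To illustrate the idea behind duplicate marking, the following minimal sketch groups reads by chromosome, start position, and strand, keeps the read with the highest summed base quality in each group, and flags the rest. The `Read` class and its fields are hypothetical stand-ins for BAM records, and the grouping key is a simplification: production tools work with unclipped 5' coordinates and mate information for paired-end data.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    chrom: str
    pos: int                # alignment start (simplified coordinate)
    reverse: bool           # True if aligned to the reverse strand
    base_quality_sum: int   # sum of base qualities, used to pick a representative
    is_duplicate: bool = False


def mark_duplicates(reads: list[Read]) -> list[Read]:
    """Flag all but the highest-quality read sharing (chrom, pos, strand)."""
    best: dict[tuple[str, int, bool], Read] = {}
    for read in reads:
        key = (read.chrom, read.pos, read.reverse)
        current = best.get(key)
        if current is None or read.base_quality_sum > current.base_quality_sum:
            best[key] = read
    for read in reads:
        key = (read.chrom, read.pos, read.reverse)
        read.is_duplicate = best[key] is not read
    return reads


if __name__ == "__main__":
    demo = [
        Read("r1", "chr1", 100, False, 300),
        Read("r2", "chr1", 100, False, 350),
        Read("r3", "chr1", 100, True, 200),
    ]
    for r in mark_duplicates(demo):
        print(r.name, r.is_duplicate)
```

Duplicates are flagged rather than deleted so that downstream callers can decide whether to ignore them.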

Step 3: Variant Discovery

Variant calling algorithms operate on the aligned BAM files to identify positions where the sequenced sample differs from the reference genome. Two primary paradigms exist: germline variant calling, which assumes the sample is a diploid organism with two alleles at each locus, and somatic variant calling, which accounts for subclonal populations of cells (e.g., in tumor samples). The choice of caller depends on the biological question and the expected variant allele frequency (VAF).
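The difference between the two paradigms shows up in the expected VAF. A germline heterozygous variant in a diploid sample is expected near 0.5, whereas a somatic variant is diluted by normal cells and subclonality. The sketch below uses a commonly cited relationship between tumor purity, cancer cell fraction, and copy number; the function and parameter names are illustrative, and the default of a copy-neutral diploid locus is an assumption.

```python
def observed_vaf(ref_reads: int, alt_reads: int) -> float:
    """Observed variant allele frequency from read counts."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0


def expected_somatic_vaf(
    purity: float,
    ccf: float = 1.0,        # fraction of tumor cells carrying the variant
    multiplicity: int = 1,   # mutated copies per carrying cell
    tumor_cn: int = 2,       # total copy number at the locus in tumor cells
    normal_cn: int = 2,      # total copy number in normal cells
) -> float:
    """Expected VAF = mutated copies / all copies in the mixed sample."""
    mutated = purity * ccf * multiplicity
    total = purity * tumor_cn + (1 - purity) * normal_cn
    return mutated / total


if __name__ == "__main__":
    # Pure sample, clonal heterozygous variant: behaves like a germline het.
    print(expected_somatic_vaf(purity=1.0))
    # 40% tumor purity, variant present in half of the tumor cells.
    print(expected_somatic_vaf(purity=0.4, ccf=0.5))
```

Under these assumptions, low purity and subclonality push the expected VAF well below the values a diploid germline model anticipates, which is why somatic callers use different statistical models.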

Step 4: Variant Filtering and Annotation

Raw variant calls are subjected to hard filtering or machine learning-based filtering to remove false positives. Common filtering criteria include read depth (DP), genotype quality (GQ), mapping quality (MQ), and strand bias (SB). After filtering, variants are annotated with functional information (e.g., gene name, amino acid change, predicted pathogenicity) using annotation tools such as Ensembl Variant Effect Predictor (VEP) or SnpEff.
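Hard filtering can be sketched as a set of threshold checks applied to each variant record. In the example below, the variant is a plain dictionary rather than a parsed VCF record, the threshold values are purely illustrative (not recommended settings), and `SB` is assumed to be a Phred-scaled strand-bias score in which higher values indicate stronger bias.

```python
# Illustrative thresholds only; appropriate values depend on the assay and caller.
THRESHOLDS = {"DP": 10, "GQ": 20, "MQ": 40.0, "SB": 60.0}


def failed_filters(variant: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return the names of the hard filters a variant fails (empty list = pass)."""
    failed = []
    if variant["DP"] < thresholds["DP"]:
        failed.append("LowDP")
    if variant["GQ"] < thresholds["GQ"]:
        failed.append("LowGQ")
    if variant["MQ"] < thresholds["MQ"]:
        failed.append("LowMQ")
    if variant["SB"] > thresholds["SB"]:
        failed.append("StrandBias")
    return failed


if __name__ == "__main__":
    calls = [
        {"id": "v1", "DP": 45, "GQ": 99, "MQ": 60.0, "SB": 1.2},
        {"id": "v2", "DP": 8, "GQ": 15, "MQ": 60.0, "SB": 75.0},
    ]
    for call in calls:
        print(call["id"], failed_filters(call) or "PASS")
```

Recording which filters failed, rather than silently dropping the record, mirrors the way the VCF FILTER column is normally used and makes later review easier.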

Algorithmic Approaches to Variant Calling

Variant callers employ different statistical and heuristic models to distinguish true genetic variation from sequencing and alignment errors. The major algorithmic categories are described below.

Haplotype-Based Callers

Haplotype-based callers reassemble the local region around a candidate variant to generate a list of possible haplotypes. Reads are then realigned to these haplotypes, and the most likely genotype is determined using a Bayesian model. The Genome Analysis Toolkit (GATK) HaplotypeCaller is the most prominent example of this approach. This method is particularly effective for detecting indels, as it avoids the misalignment issues that plague simple pileup-based callers.
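The Bayesian genotyping step can be illustrated with a deliberately simplified diploid model. This is not the HaplotypeCaller implementation, which scores reads against assembled haplotypes; the sketch instead treats each read as a single ref-or-alt observation with a known error probability. The prior values are hypothetical.

```python
import math


def genotype_posteriors(
    observations: list[tuple[bool, float]],
    priors: tuple[float, float, float] = (0.998, 0.001, 0.001),  # hypothetical priors
) -> list[float]:
    """Posterior probabilities for genotypes with 0, 1, or 2 alt alleles.

    Each observation is (is_alt, error_prob) for one read at the site.
    """
    log_posteriors = []
    for alt_copies, prior in zip((0, 1, 2), priors):
        alt_fraction = alt_copies / 2
        log_likelihood = 0.0
        for is_alt, err in observations:
            # Probability of the observed base if the read came from each allele.
            p_from_alt = (1 - err) if is_alt else err / 3
            p_from_ref = err / 3 if is_alt else (1 - err)
            log_likelihood += math.log(
                alt_fraction * p_from_alt + (1 - alt_fraction) * p_from_ref
            )
        log_posteriors.append(math.log(prior) + log_likelihood)
    # Normalize in log space to avoid underflow at high depth.
    max_log = max(log_posteriors)
    weights = [math.exp(v - max_log) for v in log_posteriors]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    reads = [(False, 0.01)] * 10 + [(True, 0.01)] * 10
    print(genotype_posteriors(reads))
```

With balanced ref and alt support, the heterozygous genotype dominates even under a prior that strongly favors homozygous reference, because the homozygous models must explain half the reads as errors.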

Pileup-Based Callers

Pileup-based callers examine the aligned reads at each genomic position and apply a statistical test to determine whether the observed allele counts deviate from the expected error rate. SAMtools mpileup and BCFtools are classic examples. These tools are computationally efficient but are more sensitive to alignment artifacts, especially near indels.
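As a generic illustration of the statistical test described above (not the specific model used by SAMtools or BCFtools), the sketch below asks how likely it is to see at least `alt_count` non-reference bases at a position if every one of them were a sequencing error. The per-base error rate is an assumed input.

```python
from math import comb


def binomial_tail(depth: int, alt_count: int, error_rate: float) -> float:
    """P(X >= alt_count) for X ~ Binomial(depth, error_rate)."""
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(alt_count, depth + 1)
    )


if __name__ == "__main__":
    # One alt read out of 30 is easily explained by a 1% error rate.
    print(binomial_tail(30, 1, 0.01))
    # Ten alt reads out of 30 is not.
    print(binomial_tail(30, 10, 0.01))
```

A small tail probability suggests the alternate allele is unlikely to be error alone. The test assumes reads are placed correctly, which is why misalignment near indels can produce convincing but false signals.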

Machine Learning-Based Callers

Recent advances have introduced machine learning models to improve variant calling accuracy. DeepVariant, for example, uses a convolutional neural network to classify pileup images of candidate sites by genotype (homozygous reference, heterozygous, or homozygous alternate). This approach has demonstrated high accuracy, particularly for challenging regions such as homopolymers and GC-rich areas. Yan et al. [1] demonstrated that machine learning models can reduce the burden of orthogonal confirmation by identifying high-confidence germline variants with high precision.

Comparative Performance of Variant Calling Tools

Several benchmarking studies have evaluated the performance of variant calling tools for WES data. These studies typically use gold standard datasets (e.g., Genome in a Bottle) or synthetic spike-in data to establish ground truth.

Somatic Variant Calling

Lopez-Cade et al. [2] conducted a comparative evaluation of Mutect2, Strelka2, and FreeBayes for somatic SNV detection in synthetic and clinical WES data. Their results indicated that Mutect2 achieved the highest sensitivity for low-VAF variants, while Strelka2 demonstrated superior precision. FreeBayes showed intermediate performance but was more computationally intensive. The choice of caller should therefore be guided by the expected VAF and the tolerance for false positives versus false negatives.

Germline Variant Calling

Song et al. [3] compared the performance of germline variant calling tools in sporadic disease cohorts. They found that GATK HaplotypeCaller and DeepVariant produced the most concordant results, with DeepVariant showing slightly higher recall in difficult-to-map regions. Wong et al. [4] benchmarked multiple variant calling software packages using gold standard datasets and reported that no single tool outperformed others across all metrics. They recommended using a consensus approach, where variants called by two or more tools are retained for downstream analysis.
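The consensus approach described above reduces to counting how many tools support each variant. In this minimal sketch, variants are keyed by (chromosome, position, ref, alt) and the tool names are placeholders. It assumes the call sets have already been normalized to the same representation (for example, left-aligned indels), which is needed for keys from different callers to match.

```python
from collections import Counter

Variant = tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def consensus_variants(
    calls_by_tool: dict[str, list[Variant]], min_tools: int = 2
) -> set[Variant]:
    """Keep variants reported by at least min_tools callers."""
    support: Counter = Counter()
    for variants in calls_by_tool.values():
        support.update(set(variants))  # count each tool at most once per variant
    return {variant for variant, n in support.items() if n >= min_tools}


if __name__ == "__main__":
    calls = {
        "caller_a": [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")],
        "caller_b": [("chr1", 100, "A", "G"), ("chr3", 300, "G", "GA")],
        "caller_c": [("chr2", 200, "C", "T"), ("chr1", 100, "A", "G")],
    }
    print(sorted(consensus_variants(calls)))
```

Raising `min_tools` trades sensitivity for precision, so the threshold is a design choice that depends on the downstream use.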

Joint Calling and Cohort Analysis

Nguyen et al. [5] developed GermVarX, a robust workflow for joint germline variant exploration in WES cohorts. Joint calling simultaneously analyzes multiple samples to improve variant detection at low-coverage sites and to reduce batch effects. This approach is particularly valuable in veterinary population studies where sample sizes are often limited and sequencing depth may be variable.

Workflow Diagram

The following Mermaid diagram illustrates the typical WES variant calling workflow, from raw data to annotated variants.

flowchart TD
    A[Raw FASTQ Files] --> B[Quality Control & Trimming]
    B --> C[Read Alignment to Reference Genome]
    C --> D[Post-Alignment Processing]
    D --> D1[Sort BAM]
    D --> D2[Mark Duplicates]
    D --> D3[Base Quality Score Recalibration]
    D3 --> E[Variant Calling]
    E --> E1[Germline Caller]
    E --> E2[Somatic Caller]
    E1 --> F[Raw Variant VCF]
    E2 --> F
    F --> G[Variant Filtering]
    G --> H[Variant Annotation]
    H --> I[High-Confidence Variant List]

Challenges Specific to Veterinary WES

Veterinary WES presents unique challenges that must be addressed in the variant calling pipeline.

Reference Genome Quality

Many domestic animal species have reference genomes that are less complete and less well-annotated than the human reference genome. Gaps in the reference, misassembled regions, and incomplete gene models can lead to alignment errors and false variant calls. For example, the canine reference genome CanFam3.1 contains gaps in repetitive and GC-rich regions and, because it was assembled from a female dog, does not include a Y chromosome.

Population Diversity

Veterinary populations often exhibit high genetic diversity due to breed structure and recent admixture, and this diversity is poorly catalogued because large-scale sequencing projects are lacking for many species. As a result, a single reference genome may represent a given animal poorly, causing reference bias, where reads carrying non-reference alleles are less likely to align correctly. The use of pangenome references or graph-based alignment methods may mitigate this issue.

Sample Quality and Sequencing Depth

Clinical veterinary samples are frequently obtained from non-ideal sources (e.g., necropsy tissue, fecal samples, or blood smears) with variable DNA quality. Degraded DNA can lead to uneven coverage and increased error rates. Tsuji et al. [6] demonstrated that a robust target-enrichment strategy and careful variant calling are essential for achieving clinical-grade performance in such scenarios.

Integration with Other Diagnostic Modalities

Variant calling from WES data is often integrated with other diagnostic approaches to provide a comprehensive molecular picture. For example, in cases of suspected hereditary disease in dogs, WES findings may be correlated with clinical pathology data, such as those obtained from automated impedance analyzers or commercial ELISA kits. In infectious disease contexts, WES can be used to identify host genetic factors that influence susceptibility to pathogens like Canine Parvovirus variants or Feline Leukemia Virus progressive infection.

Quality Metrics and Validation

The accuracy of variant calls is assessed using several key metrics.

Metric	Definition	Acceptable Threshold
Sensitivity (Recall)	Proportion of true variants detected	> 95% for high-confidence regions
Precision (PPV)	Proportion of called variants that are true	> 99% for clinical reporting
F1 Score	Harmonic mean of sensitivity and precision	> 0.97
Concordance	Agreement between replicate calls or orthogonal methods	> 99.5%
Transition/Transversion Ratio (Ti/Tv)	Ratio of transition to transversion substitutions	Roughly 2.8 - 3.3 for human exomes (about 2.0 - 2.1 genome-wide); expected values vary by species

Orthogonal validation using Sanger sequencing or targeted genotyping assays is recommended for variants of clinical significance. Yan et al. [1] proposed a machine learning approach to reduce the need for such validation by identifying high-confidence calls with high predictive probability.
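The metrics discussed in this section follow directly from comparing a call set with a truth set. The sketch below assumes both are available as Python sets of comparable variant keys and that SNVs are given as (ref, alt) pairs; the function names are illustrative.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def benchmark(truth: set, called: set) -> dict[str, float]:
    """Sensitivity, precision, and F1 score of a call set against a truth set."""
    tp = len(truth & called)
    fp = len(called - truth)
    fn = len(truth - called)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"sensitivity": recall, "precision": precision, "f1": f1}


def ti_tv_ratio(snvs: list[tuple[str, str]]) -> float:
    """Ratio of transitions (A<->G, C<->T) to transversions among SNVs."""
    ti = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


if __name__ == "__main__":
    truth = {"v1", "v2", "v3", "v4"}
    called = {"v2", "v3", "v4", "v5"}
    print(benchmark(truth, called))
    print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```

In practice, dedicated comparison tools handle differences in variant representation before counting matches, so a plain set comparison like this one is only appropriate for normalized call sets.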

Conclusion

Variant calling in WES is a multi-step computational process that requires careful selection of algorithms, parameters, and quality control measures. The choice between germline and somatic callers [2, 3, 4] and the use of joint calling in cohorts [5] influence the final variant set. For veterinary applications, additional considerations include reference genome quality, population diversity, and sample integrity; clinically validated exome workflows [6] illustrate how such challenges can be addressed. Together, these benchmarking studies and workflows provide a foundation for building robust WES analysis pipelines in veterinary molecular diagnostics.

References

[1] Yan M, Zeng Q, Zhang Z et al. Determination of high-confidence germline genetic variants in next-generation sequencing through machine learning models: an approach to reduce the burden of orthogonal confirmation. BMC Genomics. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/40770305/

[2] Lopez-Cade I, Gomez-Sanz A, Sanvicente A et al. Comparative Evaluation of Mutect2, Strelka2, and FreeBayes for Somatic SNV Detection in Synthetic and Clinical Whole-Exome Sequencing Data. Biomolecules. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/41301450/

[3] Song Q, Zhai J, Chen C et al. Performance comparison of germline variant calling tools in sporadic disease cohorts. Mol Genet Genomics. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/40913732/

[4] Wong M, Liew B, Hum M et al. Benchmarking of variant calling software for whole-exome sequencing using gold standard datasets. Sci Rep. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/40258889/

[5] Nguyen TTP, Nguyen DD, Mai TV et al. GermVarX: A Robust Workflow for Joint Germline Variant Exploration in whole-exome sequencing cohorts. PLoS One. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41926483/

[6] Tsuji J, Rickles-Young M, Abreu J et al. Clinical validation of a high-performance somatic exome sequencing assay: from target-enrichment strategy to variant calling. NPJ Genom Med. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41963325/

Disclaimer: This article is for educational and informational purposes only. It is not intended to substitute for professional veterinary advice, diagnosis, treatment, or regulatory guidance. Always consult a licensed veterinarian or qualified specialist regarding animal health, disease diagnosis, and therapeutic decisions.