What is Dr. Zubair Khalid's research focus?

Dr. Zubair Khalid specializes in molecular virology, mRNA vaccine development, and computational biology, with a focus on avian pathogens like IBDV and Avian Reovirus.

Where is Dr. Zubair Khalid currently working?

Dr. Zubair Khalid is a Postdoctoral Research Associate at the University of Maryland (UMD), specifically within the Department of Animal and Avian Sciences.

Variant Calling GATK: Structural Analysis and Computational Methodologies in Bioinformatics

Introduction

The Genome Analysis Toolkit (GATK) has served as the de facto standard for short-read variant calling since its initial release [1]. Developed to address the computational challenges of next-generation sequencing (NGS) data, GATK provides a rigorously maintained collection of analysis tools and a core bioinformatics engine [1]. In veterinary genomics, variant calling is essential for identifying genetic markers associated with disease susceptibility, drug resistance, and production traits in livestock, companion animals, and wildlife [2, 3]. The GATK Best Practices pipeline, originally optimized for human data, has been widely adopted for non-human species, though its performance requires careful evaluation [2, 3].

This article provides a structural analysis of the GATK framework, detailing its computational methodologies, algorithmic components, and workflow optimization strategies relevant to veterinary diagnostics and research. Emphasis is placed on the HaplotypeCaller, filtering approaches, and scalability improvements that enable application to diverse animal genomes.

Overview of the GATK Architecture

GATK integrates over 430 analysis tools organized around a common engine [1]. The toolkit is built on a MapReduce-style paradigm that decomposes genomic intervals into independent units for parallel processing [4, 5]. The core variant calling algorithm, the HaplotypeCaller, uses a local de novo assembly strategy to reconstruct haplotypes in active regions before performing pairwise alignment using a Pair Hidden Markov Model (Pair-HMM) forward algorithm [6, 7]. This approach contrasts with pileup-based methods such as Bcftools mpileup, which sum allele counts at each position without local reassembly [2].
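The interval decomposition described above can be illustrated with a small scatter-gather sketch. It is not GATK's engine code: `split_into_intervals`, `process_interval`, and `scatter_gather` are hypothetical names, and the per-interval work is a placeholder that only counts bases.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_intervals(contig_lengths, chunk_size):
    """Cut each contig into 1-based, inclusive intervals of at most chunk_size bases."""
    intervals = []
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(start + chunk_size - 1, length)
            intervals.append((contig, start, end))
            start = end + 1
    return intervals


def process_interval(interval):
    """Placeholder for independent per-interval work (hypothetical)."""
    contig, start, end = interval
    return {"interval": interval, "bases": end - start + 1}


def scatter_gather(contig_lengths, chunk_size, workers=4):
    intervals = split_into_intervals(contig_lengths, chunk_size)
    # "Map": each interval is processed independently; pool.map preserves order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_interval, intervals))
    # "Reduce": merge the per-interval results into a single summary.
    return intervals, sum(r["bases"] for r in results)


intervals, total_bases = scatter_gather({"chr1": 2500, "chr2": 1000}, chunk_size=1000)
print(intervals)
print(total_bases)
```

Because intervals do not share state, the same pattern extends from threads to processes or cluster nodes without changing the per-interval logic.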

The Pair-HMM forward algorithm computes the probability of a read given a candidate haplotype, accounting for base quality scores and gap penalties [6]. Because it fills a dynamic-programming matrix over every combination of read position and haplotype position, its running time grows with the product of read length and haplotype length (quadratic when the two are comparable), which makes it computationally intensive [6]. Optimizations using graphics processing units (GPUs) have been developed to accelerate this step, achieving substantial speedups over CPU implementations [6, 8]. The gpuPairHMM scheme uses wavefront and warp-shuffle techniques to minimize memory accesses and instructions, attaining close-to-peak performance on modern CUDA-enabled architectures [6].
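A minimal sketch of a Pair-HMM forward recursion is given here to make the cost structure concrete. It is a simplified textbook-style formulation, not GATK's or gpuPairHMM's implementation: the constant gap-open and gap-extension probabilities, the uniform start over haplotype positions, and the function names are all assumptions made for illustration.

```python
def phred_to_error(q):
    """Convert a Phred-scaled base quality into an error probability."""
    return 10 ** (-q / 10)


def pair_hmm_forward(read, quals, hap, gap_open=1e-3, gap_ext=0.1):
    """Return P(read | haplotype) summed over all alignments (simplified model)."""
    m, n = len(read), len(hap)
    t_mm = 1 - 2 * gap_open      # match -> match
    t_gap_to_m = 1 - gap_ext     # insertion/deletion -> match
    M = [[0.0] * (n + 1) for _ in range(m + 1)]  # read base aligned to hap base
    I = [[0.0] * (n + 1) for _ in range(m + 1)]  # read base inserted
    D = [[0.0] * (n + 1) for _ in range(m + 1)]  # hap base deleted from read
    for j in range(n + 1):
        D[0][j] = 1.0 / n        # assumption: read may start anywhere on the haplotype
    # The nested loops visit m * n cells: this is the quadratic cost.
    for i in range(1, m + 1):
        err = phred_to_error(quals[i - 1])
        for j in range(1, n + 1):
            emit = 1 - err if read[i - 1] == hap[j - 1] else err / 3
            M[i][j] = emit * (t_mm * M[i - 1][j - 1]
                              + t_gap_to_m * (I[i - 1][j - 1] + D[i - 1][j - 1]))
            I[i][j] = gap_open * M[i - 1][j] + gap_ext * I[i - 1][j]
            D[i][j] = gap_open * M[i][j - 1] + gap_ext * D[i][j - 1]
    # The read must be fully consumed; it may end at any haplotype position.
    return sum(M[m][j] + I[m][j] for j in range(1, n + 1))


read, quals = "ACGT", [30, 30, 30, 30]
lik_match = pair_hmm_forward(read, quals, "TTACGTTT")
lik_mismatch = pair_hmm_forward(read, quals, "TTACCTTT")
print(lik_match > lik_mismatch)
```

Each cell depends only on its upper, left, and upper-left neighbours, so cells along an anti-diagonal are independent of one another; that independence is what wavefront-style GPU schemes exploit.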

The GATK Best Practices Pipeline

The standard germline short variant discovery pipeline consists of several sequential stages: read alignment, preprocessing, per-sample variant calling with HaplotypeCaller, joint genotyping across samples, and variant filtration [3, 1]. The following Mermaid diagram illustrates the core workflow.

flowchart TD
    A[Raw FASTQ Reads] --> B[Read Alignment BWA-MEM]
    B --> C[Sort & Mark Duplicates]
    C --> D[Base Quality Score Recalibration BQSR]
    D --> E[HaplotypeCaller per sample]
    E --> F[GVCF per sample]
    F --> G[Joint Genotyping GenotypeGVCFs]
    G --> H[Variant Filtration VQSR/Hard Filtering]
    H --> I[Final VCF]
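The per-sample part of this workflow can be sketched as a list of command lines assembled in Python. This is an illustrative outline only: every file name is hypothetical, the commands are built but not executed, and using samtools for coordinate sorting is an assumption, since the text does not name a sorting tool.

```python
def per_sample_commands(sample, ref, fq1, fq2, known_sites):
    """Return per-sample commands as argument lists, in execution order."""
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    recal_bam = f"{sample}.recal.bam"
    return [
        # bwa mem writes SAM to stdout; the caller is assumed to redirect it to `sam`.
        ["bwa", "mem", "-R", read_group, ref, fq1, fq2],
        ["samtools", "sort", "-o", sorted_bam, sam],
        ["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam,
         "-M", f"{sample}.dup_metrics.txt"],
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", dedup_bam,
         "--known-sites", known_sites, "-O", recal_table],
        ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup_bam,
         "--bqsr-recal-file", recal_table, "-O", recal_bam],
        # -ERC GVCF switches HaplotypeCaller to per-sample gVCF output.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam,
         "-O", f"{sample}.g.vcf.gz", "-ERC", "GVCF"],
    ]


commands = per_sample_commands("animal01", "ref.fasta", "r1.fq.gz", "r2.fq.gz", "known.vcf.gz")
for cmd in commands:
    print(" ".join(cmd))
```

Building argument lists rather than shell strings keeps the steps easy to inspect and lets a workflow manager or `subprocess.run` execute them without shell quoting issues.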

Read Alignment and Preprocessing

Raw sequencing reads are aligned to a reference genome using a short-read aligner such as BWA-MEM [9, 10]. The choice of aligner can impact downstream variant calling accuracy; BWA-MEM and Isaac have shown superior performance compared to Bowtie2 for medical variant calling [10]. After alignment, preprocessing steps include sorting reads by coordinate, marking duplicate reads originating from PCR amplification, and base quality score recalibration (BQSR) [3, 1]. BQSR applies a machine learning model to adjust base quality scores based on covariates such as read group, cycle, and dinucleotide context [11, 1].
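The idea behind recalibration can be sketched by comparing reported qualities with empirical mismatch rates per covariate bin. This is a deliberately simplified illustration and not GATK's actual BQSR model: the binning by (read group, reported quality, cycle), the add-one smoothing, and the function name are assumptions.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """observations: iterable of (read_group, reported_q, cycle, is_mismatch).

    Mismatches are assumed to be counted only at sites not known to be variant.
    Returns {(read_group, reported_q, cycle): empirical Phred quality}.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_observations, n_mismatches]
    for read_group, reported_q, cycle, is_mismatch in observations:
        entry = counts[(read_group, reported_q, cycle)]
        entry[0] += 1
        entry[1] += int(is_mismatch)
    table = {}
    for key, (n_obs, n_mismatch) in counts.items():
        p_error = (n_mismatch + 1) / (n_obs + 2)  # smoothed empirical error rate
        table[key] = -10 * math.log10(p_error)
    return table


# Early-cycle bases: 998 observations, no mismatches.
obs = [("rg1", 30, 1, False)] * 998
# Late-cycle bases: 98 observations, 9 mismatches, still reported as Q30.
obs += [("rg1", 30, 100, i < 9) for i in range(98)]
table = empirical_qualities(obs)
print(round(table[("rg1", 30, 1)], 2), round(table[("rg1", 30, 100)], 2))
```

In this toy input, bases reported as Q30 at cycle 100 behave like roughly Q10 bases, which is the kind of systematic bias recalibration is meant to correct.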

HaplotypeCaller Algorithm

The HaplotypeCaller identifies regions with significant evidence of variation (active regions) using a likelihood-based threshold [7]. Within each active region, it performs local de Bruijn graph assembly to construct candidate haplotypes [7, 12]. Each read is then aligned against each candidate haplotype using the Pair-HMM forward algorithm [6]. The resulting per-read likelihoods are used to calculate genotype probabilities for each variant site [6, 7]. In the Best Practices workflow, HaplotypeCaller is run per sample in GVCF mode (-ERC GVCF), producing a genomic VCF (gVCF) that records reference confidence blocks alongside variant calls [3, 1].
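How per-read haplotype likelihoods turn into genotype likelihoods can be shown with a small sketch. It uses the standard assumption that each read is drawn with equal probability from any of the genotype's haplotypes; the function name and the toy likelihood values are hypothetical, and the code is not taken from GATK.

```python
import math
from itertools import combinations_with_replacement


def genotype_log10_likelihoods(read_hap_liks, ploidy=2):
    """read_hap_liks: list of {haplotype_name: P(read | haplotype)}, one dict per read.

    Returns {genotype_tuple: log10 P(all reads | genotype)}.
    """
    haplotypes = sorted(read_hap_liks[0])
    result = {}
    for genotype in combinations_with_replacement(haplotypes, ploidy):
        log_lik = 0.0
        for liks in read_hap_liks:
            # A read comes from each haplotype copy with probability 1 / ploidy.
            log_lik += math.log10(sum(liks[h] for h in genotype) / ploidy)
        result[genotype] = log_lik
    return result


# Five reads that fit the reference haplotype and five that fit the alternate one.
reads = [{"ref": 0.9, "alt": 0.001}] * 5 + [{"ref": 0.001, "alt": 0.9}] * 5
gls = genotype_log10_likelihoods(reads)
best = max(gls, key=gls.get)
print(best)
```

With balanced support for both haplotypes, the heterozygous genotype has the highest likelihood, whereas either homozygous genotype is penalized by the five reads it cannot explain.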

Joint Genotyping

Joint genotyping combines gVCFs from multiple samples into a cohort-level VCF using the GenotypeGVCFs tool [13, 14]. This step employs a Bayesian model that incorporates prior information about population allele frequencies to improve genotype accuracy, especially at low coverage [14]. The scalability of joint genotyping has been demonstrated using serverless computing frameworks, achieving linear cost and runtime scaling with sample size [14].
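The benefit of a population-informed prior at low coverage can be sketched with a toy biallelic model. This is not the model implemented in GenotypeGVCFs: it assumes Hardy-Weinberg genotype priors, estimates the allele frequency with a simple expectation-maximization loop, and uses hypothetical function names and likelihood values.

```python
def hwe_prior(af):
    """Hardy-Weinberg prior for genotypes 0/0, 0/1, 1/1 at alt allele frequency af."""
    return [(1 - af) ** 2, 2 * af * (1 - af), af ** 2]


def posteriors(gls, af):
    """Combine one sample's genotype likelihoods with the cohort prior."""
    joint = [g * p for g, p in zip(gls, hwe_prior(af))]
    total = sum(joint)
    return [j / total for j in joint]


def joint_genotype(cohort_gls, af=0.1, iterations=20):
    """Alternate between genotype posteriors and the allele frequency estimate."""
    for _ in range(iterations):
        posts = [posteriors(gls, af) for gls in cohort_gls]
        expected_alt = sum(p[1] + 2 * p[2] for p in posts)
        af = expected_alt / (2 * len(posts))
        af = min(max(af, 1e-6), 1 - 1e-6)  # keep the prior away from 0 and 1
    return af, [posteriors(gls, af) for gls in cohort_gls]


ambiguous = [0.5, 0.5, 1e-3]            # low-coverage sample: 0/0 and 0/1 equally likely
hom_ref_cohort = [[1.0, 1e-6, 1e-12]] * 9 + [ambiguous]
hom_alt_cohort = [[1e-12, 1e-6, 1.0]] * 9 + [ambiguous]

_, posts_ref = joint_genotype(hom_ref_cohort)
_, posts_alt = joint_genotype(hom_alt_cohort)
print([round(p, 3) for p in posts_ref[-1]])
print([round(p, 3) for p in posts_alt[-1]])
```

The same ambiguous likelihoods resolve to homozygous reference in a cohort where the alternate allele is absent and to heterozygous in a cohort where it is common, which is the intuition behind calling samples jointly.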

Variant Filtration

Raw variant calls require quality filtration to reduce false positives [11, 15]. GATK provides two principal filtration approaches: variant quality score recalibration (VQSR) and hard filtering [11, 1]. VQSR uses a Gaussian mixture model trained on known truth sites (e.g., from high-confidence variant databases) to assign a probability that each call is true [1]. However, for non-human species, truth sets are often unavailable, making VQSR impractical [2]. Hard filtering applies fixed thresholds to variant annotations such as QualByDepth (QD), FisherStrand (FS), and ReadPosRankSumTest (RPRS) [11, 15]. The optimal filter thresholds depend on the specific genome and coverage characteristics; simulation-based calibration using classification trees can improve performance [11].
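Hard filtering amounts to comparing annotation values with fixed cut-offs, as in the sketch below. The thresholds are the generic SNP starting points commonly cited for GATK hard filtering and are used here as assumptions, not as values recommended by the studies above; they would need tuning per species and coverage. The annotation keys are the INFO-field names `QD`, `FS`, and `ReadPosRankSum`, and the function name is hypothetical.

```python
# filter name -> (annotation key, predicate that returns True when the site FAILS)
SNP_FILTERS = {
    "QD2": ("QD", lambda v: v < 2.0),
    "FS60": ("FS", lambda v: v > 60.0),
    "ReadPosRankSum-8": ("ReadPosRankSum", lambda v: v < -8.0),
}


def apply_hard_filters(annotations, filters=SNP_FILTERS):
    """Return the names of failed filters, or ["PASS"] if none fail.

    Missing annotations are skipped rather than failed, since rank-sum
    annotations are not computed for every site.
    """
    failed = []
    for name, (key, fails) in filters.items():
        value = annotations.get(key)
        if value is not None and fails(value):
            failed.append(name)
    return failed or ["PASS"]


good_site = {"QD": 18.4, "FS": 1.2, "ReadPosRankSum": 0.3}
bad_site = {"QD": 1.1, "FS": 75.0}  # ReadPosRankSum absent
print(apply_hard_filters(good_site))
print(apply_hard_filters(bad_site))
```

Keeping the thresholds in a single table makes it straightforward to swap in values calibrated on simulated or validated data for a given genome.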

A dedicated false positive filter (FPfilter) has been developed specifically for GATK whole-genome sequencing data. By modeling distinct patterns between false positives in heterozygous versus homozygous mutations, FPfilter achieves a higher false positive to true positive filtration ratio than standard GATK hard filtering [15].

Performance in Non-Human Species

Several studies have evaluated GATK performance in non-human organisms [2, 16, 9]. Benchmarking with simulated insect populations revealed that Bcftools mpileup outperformed GATK HaplotypeCaller in recovery rate and accuracy, regardless of mapping software [2]. The majority of false positives from GATK originated from repetitive genomic regions [2]. Moreover, variant scores produced by GATK did not reliably distinguish true positives from false positives in most cases, indicating that hard filtering may be challenging without additional validation data [2]. These findings are critical for veterinary applications where reference genomes may be less well annotated and repetitive elements are abundant.
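Benchmarks of this kind reduce to comparing a call set with a known truth set. The sketch below computes recall (recovery rate), precision, and F1 from sets of variant keys; it is a generic illustration rather than the evaluation code of the cited study, and the variant tuples are invented.

```python
def benchmark(called, truth):
    """Compare called variants with simulated truth; both are iterables of
    (chrom, pos, ref, alt) tuples."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives: called and real
    fp = len(called - truth)   # false positives: called but not real
    fn = len(truth - called)   # false negatives: real but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 40, "G", "A"), ("chr2", 900, "T", "C")]
called = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
          ("chr2", 40, "G", "A"), ("chr2", 555, "A", "C")]
metrics = benchmark(called, truth)
print(metrics)
```

Matching on exact (chrom, pos, ref, alt) tuples is the simplest possible comparison; indels in particular may need normalization before such a comparison is meaningful.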

In plant and animal genomes, optimized GATK4 pipelines have been deployed on high-performance computing (HPC) clusters for large-scale variant discovery [3, 16]. The OVarFlow workflow, which uses optimized Java garbage collection and heap size settings for GATK tools, reduced overall analysis time by half [3]. For maize, sorghum, rice, and soybean, HPC-based variant calling workflows called tens to hundreds of millions of single nucleotide polymorphisms (SNPs) relative to high-quality reference assemblies [16].
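Resource tuning of this kind is passed to GATK tools through the wrapper's `--java-options` argument. The sketch below only assembles such a command; the heap size and garbage-collection thread count are illustrative assumptions, not the settings reported for OVarFlow, and `gatk_command` is a hypothetical helper.

```python
def gatk_command(tool, tool_args, heap_gb=4, gc_threads=2):
    """Build a GATK command line with explicit JVM heap and GC thread settings."""
    java_opts = f"-Xmx{heap_gb}g -XX:ParallelGCThreads={gc_threads}"
    return ["gatk", "--java-options", java_opts, tool, *tool_args]


cmd = gatk_command(
    "HaplotypeCaller",
    ["-R", "ref.fasta", "-I", "sample.bam", "-O", "sample.g.vcf.gz", "-ERC", "GVCF"],
    heap_gb=8,
)
print(cmd)
```

Setting these values per tool, rather than relying on JVM defaults that scale with the host machine, makes it easier to run several GATK processes side by side on one node.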

RNA-Seq Variant Calling with GATK

GATK also supports variant calling from RNA sequencing data using the joint genotyping workflow [13, 17, 18]. This approach is particularly useful for identifying expressed variants in veterinary transcriptomics, such as those associated with immune response or drug metabolism [13]. However, RNA-seq data present additional challenges due to splicing, variable coverage across exons, and the absence of intronic regions [19, 18]. The GATK RNA-seq pipeline recommends using split-read mapping tools (e.g., STAR) and applying a separate set of hard filtering thresholds [13, 17]. Variant calling from single-cell RNA-seq data has been employed to assess cellular identity in patient-derived cell lines, demonstrating the utility of GATK for detecting sample swaps or contamination [18].

Comparative Performance with Other Tools

Numerous comparative studies have assessed GATK against other variant callers, including Bcftools, FreeBayes, DeepVariant, Strelka2, and VarScan2 [9, 20, 10, 12]. DeepVariant, which uses a deep convolutional neural network to classify variant candidates from pileup images, consistently achieves high precision and recall in human and microbial genomes [21, 20, 12]. In a systematic benchmark across 14 gold standard datasets, DeepVariant outperformed GATK, Strelka2, and FreeBayes in coding sequence variant discovery [10]. However, GATK remains competitive, particularly in applications where its algorithmic transparency allows fine-tuning of parameters [11, 22].

The following table summarizes key performance characteristics from representative benchmarks.

Tool	Alignment Dependency	Primary Algorithm	Reported Performance	Filtering Method
GATK HaplotypeCaller	Moderate	Local assembly + Pair-HMM	Lower recovery than Bcftools in repeats [2]	VQSR or hard filtering [11, 15]
Bcftools mpileup	Low	Pileup + Bayesian	Higher recovery, fewer false positives in repeats [2]	Hard filtering [2]
DeepVariant	High	Deep CNN	Achieves highest precision in human benchmarks [20, 10]	ML-based [12]
FreeBayes	Low	Bayesian	High sensitivity but lower specificity than DeepVariant [20]	Hard filtering [20]

In somatic variant calling, GATK Mutect2 is widely used, and ensemble approaches that take the consensus of calls from multiple tools improve reproducibility [8, 4]. For wastewater-based epidemiology using mixed viral populations, Bcftools, FreeBayes, and VarScan2 showed higher precision and recall than GATK, although GATK identified more of the expected lineage-defining mutations [23]. These trade-offs must be considered when selecting a pipeline for veterinary surveillance of pathogens.
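A consensus across callers can be sketched as a simple vote. The example keeps variants reported by at least `min_support` tools; the voting rule, the tool labels, and the variant tuples are assumptions for illustration and do not reproduce the ensemble method of any cited study.

```python
from collections import Counter


def consensus_calls(calls_by_tool, min_support=2):
    """Keep variants reported by at least min_support tools.

    calls_by_tool: {tool_name: iterable of (chrom, pos, ref, alt)}
    """
    support = Counter()
    for calls in calls_by_tool.values():
        support.update(set(calls))  # each tool votes at most once per variant
    return sorted(variant for variant, n in support.items() if n >= min_support)


calls = {
    "caller_a": [("chr1", 100, "A", "G"), ("chr1", 300, "C", "T")],
    "caller_b": [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")],
    "caller_c": [("chr1", 100, "A", "G"), ("chr1", 300, "C", "T")],
}
print(consensus_calls(calls))
print(consensus_calls(calls, min_support=3))
```

Raising `min_support` trades sensitivity for precision, which mirrors the trade-off between callers described above.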

Computational Optimization and Scalability

The computational burden of GATK pipelines has motivated numerous optimization strategies [3, 24, 25, 26, 5]. The LUSH toolkit reimplements core GATK steps with optimized algorithms, achieving 17-fold speedup on 30x whole-genome data while maintaining over 99% precision and recall [25]. The Sentieon DNASeq toolkit provides a closed-source alternative that replicates GATK accuracy with greater computational efficiency [26]. For cluster environments, VC@Scale integrates Apache Spark with native data representations using Apache Arrow, enabling efficient preprocessing and variant calling at scale [5]. Halvade Somatic specifically addresses somatic variant calling by distributing WGS and WES analyses across multiple nodes, reducing runtime from 84.5 hours to 1.36 hours on 16 nodes [4].

Improvements in Pair-HMM computation via GPU acceleration have been particularly impactful. The gpuPairHMM implementation achieves speedups of at least 11.7x over prior GPU implementations and 14.2x over CPU versions [6]. Real-time variant calling frameworks, such as RVC, process reads incrementally during sequencing, enabling low-latency variant detection for time-sensitive applications in veterinary outbreak response [24].

Structural Considerations for Veterinary Applications

In veterinary genomics, the reference genomes of many species are still incomplete or contain errors, which exacerbates issues with repetitive regions and pseudogenes [27, 28]. Pseudogene-associated errors during germline variant calling are a known problem, as high sequence similarity leads to false positive identification of potentially clinically relevant variants [27]. GATK HaplotypeCaller is particularly susceptible to such errors in processed pseudogenes, whereas DeepVariant shows greater robustness [27]. For pangenome-based variant calling, the PanVariants framework extends GATK best practices to multiple reference genomes, improving detection of structural variants not captured by linear references [28].

The application of GATK to bisulfite-converted sequencing data for epigenomic studies requires specialized preprocessing to distinguish true polymorphisms from bisulfite-induced mutations [29]. The double-masking approach manipulates base quality scores to enable per-strand analysis, allowing conventional Bayesian variant callers like GATK to be applied to methylome data [29]. This technique is valuable for veterinary studies investigating host-pathogen interactions where both genetic and epigenetic variation are relevant.

Conclusion

GATK remains a foundational tool in bioinformatics for variant calling due to its robust algorithmic design, extensive documentation, and adaptability to diverse sequencing strategies [1]. Its application in veterinary medicine continues to expand, although careful benchmarking is required for each non-human species and sequencing modality [2, 3, 30]. Future developments, including tighter integration of deep learning models and more efficient cloud-based implementations, will further enhance its utility [24, 5]. Researchers are advised to evaluate multiple pipelines for their specific organism and data type, leveraging optimized filtering and acceleration strategies to maximize accuracy and computational efficiency.

References

[1] Blazyte A, Le L, Lee J, et al. Fifteen Years of the Genome Analysis Toolkit as the De Facto Standard in Short-Read Variant Calling. International Journal of Molecular Sciences. 2026. https://www.semanticscholar.org/paper/62734912c0577c63370ae909165d429990d6fca4

[2] Lefouili M, Nam K. The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species. Scientific Reports. 2022. https://www.semanticscholar.org/paper/49d825c022bc5eed050984bf57d027aaaca1ac7c

[3] Bathke J, Lühken G. OVarFlow: a resource optimized GATK 4 based Open source Variant calling workFlow. BMC Bioinformatics. 2021. https://www.semanticscholar.org/paper/0300fb46929ffd296cc37389bd8f2659942fde3e

[4] Decap D, de Schaetzen van Brienen L, Larmuseau M, et al. Halvade somatic: Somatic variant calling with Apache Spark. GigaScience. 2022. https://www.semanticscholar.org/paper/613e1955534a7fb5bab567b66665b96bfa2874c6

[5] Ahmad T, Al Ars Z, Hofstee HP. VC@Scale: Scalable and high-performance variant calling on cluster environments. GigaScience. 2021. https://www.semanticscholar.org/paper/cbd8f2d34ceeb6ba470ea90373eb807752340366

[6] Schmidt B, Kallenborn F, Wichmann A, et al. gpuPairHMM: High-Speed Pair-HMM Forward Algorithm for DNA Variant Calling on GPUs. IEEE Transactions on Computational Biology and Bioinformatics. 2024. https://www.semanticscholar.org/paper/29648f52db7a9146a4ca5b4b1826fc2f85dc97b4

[7] Freed D, Pan R, Chen H, et al. DNAscope: High accuracy small variant calling using machine learning. bioRxiv. 2022. https://www.semanticscholar.org/paper/f45ef0a90482166c700a0294297ec31cd7b56f2

[8] Chen PY, Tao MH, Ko TM. Abstract 6860: Ensemble somatic variant calling and transcript reconstruction for high-fidelity neoantigen discovery in mRNA cancer vaccine design. Cancer Research. 2026. https://www.semanticscholar.org/paper/f58f04b08d4e8777ff9c0f5134443a2cdbc06050

[9] Bhadhadhara K, Balamurugan M, Bharti N, et al. Performance Evaluation of Variant Calling Tools for Human and Microbial Genomes. 2023 International Conference on Emerging Trends in Networks and Computer Communications (ETNCC). 2023. https://www.semanticscholar.org/paper/3034423c898c649d2a8a2714ee81f23aa1eee3c9

[10] Abasov R, Tvorogova VE, Glotov A, et al. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics. 2021. https://www.semanticscholar.org/paper/96e97755492ac6df7ca8cabc3d443da673538520

[11] Summa S, Malerba G, Pinto R, et al. GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data. BMC Bioinformatics. 2017. https://www.semanticscholar.org/paper/7845c19856075a6fb361170b8d7901b5fa196f7d

[12] Zhao S, Agafonov O, Azab A, et al. Accuracy and efficiency of germline variant calling pipelines for human genome data. Scientific Reports. 2020. https://www.semanticscholar.org/paper/73ec11e398f52d29f6866c58e5ba0bd935264e95

[13] Brouard JS, Bissonnette N. Variant Calling from RNA-seq Data Using the GATK Joint Genotyping Workflow. Methods in Molecular Biology. 2022. https://www.semanticscholar.org/paper/e875d7fd4fb3787643e8d4bb39901323ccd571ee

[14] John A, Muenzen K, Ausmees K. Evaluation of serverless computing for scalable execution of a joint variant calling workflow. PLoS ONE. 2021. https://www.semanticscholar.org/paper/b2f43a856d670dbb75b5d094175d64894bf2d97f

[15] Tan Y, Zhang Y, Yang H, et al. FPfilter: A false-positive-specific filter for whole-genome sequencing variant calling from GATK. bioRxiv. 2020. https://www.semanticscholar.org/paper/be373aa0b88e0b7efa002c3534dcae285832411a

[16] Zhou Y, Kathiresan N, Yu Z, et al. HPC-based genome variant calling workflow (HPC-GVCW). bioRxiv. 2023. https://www.semanticscholar.org/paper/73ae282dd6ddee3513281b0ba97e5b4b610a5fae

[17] Wang S. Scaling up the GATK RNA-seq Variant Calling Pipeline with Apache Spark. 2018. https://www.semanticscholar.org/paper/1848b0f9032d8efc582b1d52233e581f558f8355

[18] Ramazzotti D, Angaroni F, Maspero D, et al. Variant calling from scRNA-seq data allows the assessment of cellular identity in patient-derived cell lines. Nature Communications. 2021. https://www.semanticscholar.org/paper/82045b56ddd7fe1265422d5c4964fec89a5386ef

[19] Maurya S, Jain C. A comprehensive review of variant calling tools for RNA-seq: Challenges and advances. Mutation Research. Reviews in Mutation Research. 2026. https://www.semanticscholar.org/paper/3f2320e15a92d8a35c8c1998f10ec61349254d97

[20] Pinto V, Sousa L, Silva C. Variant calling in genomics: A comparative performance analysis and decision guide. PLoS ONE. 2026. https://www.semanticscholar.org/paper/90f8bb492f7e2ede7776afffe12a71c55c9bc526

[21] Song Q, Zhai J, Chen C, et al. Performance comparison of germline variant calling tools in sporadic disease cohorts. Zeitschrift für Induktive Abstammungs- und Vererbungslehre. 2025. https://www.semanticscholar.org/paper/5ad40ff8b19fa761286f49a4788762dddab15a99

[22] Zanti M, Michailidou K, Loizidou M, et al. Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels. BMC Bioinformatics. 2021. https://www.semanticscholar.org/paper/c44d3791c50953615402b115617b3c0b0f3f74c9

[23] Bassano I, Ramachandran VK, Khalifa M, et al. Evaluation of variant calling algorithms for wastewater-based epidemiology using mixed populations of SARS-CoV-2 variants in synthetic and wastewater samples. Microbial Genomics. 2023. https://www.semanticscholar.org/paper/ebdb64e3cad0d3f7276e9d1609b7a11430e0fa00

[24] Cui M, Yu X, Jiang T, et al. RVC: A Real-Time Variant Calling Framework for Short-Read Sequencing Data. IEEE International Conference on Bioinformatics and Biomedicine. 2025. https://www.semanticscholar.org/paper/3120bc08c74fce1f65fc20777a82cc7c63ddaade

[25] Wang T, Zhang Y, Wang H, et al. Fast and accurate DNASeq variant calling workflow composed of LUSH toolkit. bioRxiv. 2023. https://www.semanticscholar.org/paper/1fedadadac107838073ee92a58ebf8c18a8e172c

[26] Kendig KI, Baheti S, Bockol M, et al. Sentieon DNASeq Variant Calling Workflow Demonstrates Strong Computational Performance and Accuracy. Frontiers in Genetics. 2019. https://www.semanticscholar.org/paper/dcf78a1b162d78d69004ccb4f569327e1c964c6a

[27] Podvalnyi A, Kopernik A, Sayganova M, et al. Quantitative Analysis of Pseudogene-Associated Errors During Germline Variant Calling. International Journal of Molecular Sciences. 2025. https://www.semanticscholar.org/paper/1dbcef1abd35252b28627cdf942fe6a6b40cb02f

[28] Yi H, Wang L, Chen X, et al. PanVariants: Best Practice for Pangenome-based Variant Calling Pipeline and Framework. bioRxiv. 2026. https://www.semanticscholar.org/paper/17a6bcfef1fe9026d6eedc1d5fd3437e9eb1c714

[29] Nunn A, Otto C, Fasold M, et al. Manipulating base quality scores enables variant calling from bisulfite sequencing alignments using conventional bayesian approaches. BMC Genomics. 2021. https://www.semanticscholar.org/paper/dfe2e9a90ebb1bed48091cc986e28136fc166565

[30] Niaré K, Greenhouse B, Bailey J. An optimized GATK4 pipeline for Plasmodium falciparum whole genome sequencing variant calling and analysis. Malaria Journal. 2023. https://www.semanticscholar.org/paper/07d7117c53240856a7311cf4461ad754c84d1808

[31] Alfayyadh MM, Maksemous N, Sutherland H, et al. PathVar: A Customisable NGS Variant Calling Algorithm Implicates Novel Candidate Genes and Pathways in Hemiplegic Migraine. Clinical Genetics. 2024. https://www.semanticscholar.org/paper/dfb53953cbe6f54b0eaf8d92a84e4ad0019e5f17

[32] Variant Calling (BWA-GATK) pipeline benchmark with Dell EMC Ready Bundle for HPC Life Sciences 13 G / 14 G server performance comparisons with Dell EMC Isilon. 2018. https://www.semanticscholar.org/paper/42e7d06d0cb385ea6e9efce707e1c07d23ec73cd

[33] Wang J, Robinson MD. Systematic benchmarking of small variant calling pipelines for long-read RNA sequencing data. bioRxiv. 2026. https://www.semanticscholar.org/paper/5fc8298af86e75864743e09ec7b21e48fa9026bd

[34] Adapting Google DeepVariant to Ultima Genomics Reads for Improved Variant Calling. 2022. https://www.semanticscholar.org/paper/6424eac65cabf9f5be68613c8780aacafe28202f

[35] Khazeeva G, Šablauskas K, van der Sanden B, et al. DeNovoCNN: a deep learning approach to de novo variant calling in next generation sequencing data. bioRxiv. 2021. https://www.semanticscholar.org/paper/7873ae1e3769ab2e17ca44942bb887630bc0d798