From Raw Reads to Variants: A Diagnostic Blueprint for Next-Generation Sequencing (NGS) Workflows
Introduction
The adoption of next-generation sequencing (NGS) in veterinary medicine has transformed the detection and characterization of pathogens, host genetic markers, and antimicrobial resistance determinants [1, 2]. The transition from raw sequencing data to clinically actionable variant calls requires a carefully orchestrated bioinformatics workflow. This article provides a comprehensive, publication-grade blueprint covering each step of that workflow: from quality control of raw reads through read alignment, post-processing, variant discovery, filtering, and annotation. The principles described are applicable to both whole-genome sequencing (WGS) and whole-exome sequencing (WES) in veterinary contexts, including pathogen surveillance, host genetics, and molecular epidemiology [3, 4].
Raw Read Acquisition and Quality Control
FASTQ Format and Phred Scores
Raw sequencing data from high-throughput sequencers are delivered in FASTQ format, which encodes nucleotide calls along with per-base quality scores. The quality score, known as the Phred quality score Q, is defined as Q = -10 log_{10}(P), where P is the probability of an incorrect base call [5]. Higher Q values indicate greater confidence. Veterinary diagnostic samples often yield lower quality scores due to suboptimal sample quality, low pathogen load, or host DNA contamination [1, 4].
Quality control begins with assessment of read length distributions, GC content, overrepresented sequences, and adapter contamination. Tools commonly used for this purpose include FastQC and MultiQC, which aggregate statistics across multiple samples [5]. For a detailed treatment of the FASTQ format and QC metrics, refer to the dedicated article FASTQ File Format: Decoding Phred Quality Scores and Quality Control Workflows.
Read Trimming and Filtering
Raw reads often contain adapter sequences, low-quality tails, and technical artifacts that must be removed before alignment. Trimming tools such as Trimmomatic, cutadapt, and fastp perform sliding-window quality trimming, adapter removal, and length filtering [5, 6]. Typical parameters include: trimming bases with Q < 20 over a four-base window, removing reads shorter than 36 bases, and discarding reads that fail to meet a minimum mean quality threshold. For metagenomic pathogen detection, aggressive trimming is recommended to reduce false-positive variant calls caused by sequencing errors [1, 4].
Alignment to a Reference Genome
Choice of Reference and Indexing
The reference genome may be the host species genome (e.g., Canis lupus familiaris for canine samples) or a pathogen reference (e.g., a specific viral or bacterial genome). Genome indexing is required to allow rapid alignment. Commonly used aligners include BWA-MEM (for short reads up to ~1 kb), Bowtie2, and minimap2 (for long reads) [5, 7]. The reference must be downloaded from a public repository such as NCBI or Ensembl and indexed with the appropriate algorithm.
Mapping Algorithms and Output SAM Files
During alignment, each read is placed onto the reference, and a mapping quality (MAPQ) score is assigned. The output is a Sequence Alignment/Map (SAM) file containing the alignment information. The SAM format is described in detail in the article SAM & BAM Formats. Key fields include the RNAME (reference sequence name), POS (leftmost mapping position), MAPQ, CIGAR string (gapped alignment description), and paired-end flags [7]. For paired-end data, proper pair orientation and insert size distributions are critical for confident alignment.
Post-Alignment Processing
SAM files are converted to the binary BAM format for efficiency, sorted by genomic coordinates, and indexed. Duplicate reads arising from PCR amplification during library preparation must be marked or removed to avoid inflating variant allele frequencies. Tools such as Picard MarkDuplicates or Sambamba are used for this purpose [6]. Duplicate marking is especially important in veterinary WGS where input DNA may be limited, leading to high duplication rates [3].
Local realignment around indels and base quality score recalibration (BQSR) are steps historically performed in the GATK Best Practices workflow [6]. BQSR models systematic errors in base quality scores using known variant sites. However, for non-human genomes, known variant databases may be absent, so care must be taken. An alternative is to skip BQSR for non-model organisms and rely on variant callers that do not require recalibrated scores [2, 3].
Variant Calling
Core Principles
Variant calling identifies positions where the sequenced sample differs from the reference genome. The two major classes of callers are: (a) heuristic/statistical callers (e.g., GATK HaplotypeCaller, FreeBayes) and (b) deep learning based callers (e.g., DeepVariant) [6, 8]. All callers require a sorted, indexed BAM file and optionally a set of known variant sites for calibration.
The article Variant Calling Pipelines provides a focused comparison of these tools.
GATK HaplotypeCaller
The Genome Analysis Toolkit (GATK) HaplotypeCaller uses a local de novo assembly of haplotypes around regions with evidence of variation. For each active region, it constructs a De Bruijn graph, realigns reads to candidate haplotypes, and calculates the likelihood of each allele. A Bayesian model then assigns genotype likelihoods. The output is a genomic VCF (gVCF) file containing raw variant calls and reference confidence blocks [6]. This approach is robust to complex variation such as indels but is computationally intensive.
FreeBayes
FreeBayes is a haplotype-based variant caller that uses a Bayesian framework. It detects variants by modeling the probability of the observed read data given a set of candidate haplotypes. Unlike GATK, FreeBayes does not perform local assembly; instead it uses read pair information and base quality. It can detect multi-allelic variants and is often used for bacterial and viral genomes where ploidy may vary [2, 7].
DeepVariant
DeepVariant transforms the variant calling problem into an image classification task. It converts pileup images (alignments around each candidate position) into a tensor and uses a convolutional neural network (CNN) to predict genotype probabilities. DeepVariant achieves high accuracy for both SNPs and indels and can be retrained for specific species or sequencing chemistries [8]. Its use in veterinary contexts is expanding as computational resources become more accessible [4].
Variant Calling Formats and Multi-Sample Calling
Single-sample calling is typical for clinical diagnostics, but population studies benefit from joint calling across cohorts. GATK CombineGVCFs and GenotypeGVCFs merge gVCFs and re-genotype all samples simultaneously, improving sensitivity for low-frequency alleles [6]. The output is a multi-sample VCF file. The Variant Call Format (VCF) is described in depth in the article Variant Call Format (VCF).
Hard Filtering and Quality Control of Variants
Raw variant calls contain many false positives. Hard filtering establishes thresholds based on annotation fields such as QUAL, QD (Variant Confidence/Quality by Depth), FS (Fisher Strand), SOR (Strand Odds Ratio), MQ (Mapping Quality), and ReadPosRankSum. Typical SNP filters for mammalian WGS include QD < 2.0, FS > 60.0, MQ < 40.0, and ReadPosRankSum < -8.0 [6]. Indel filters are generally less stringent: QD < 2.0, FS > 200.0, and ReadPosRankSum < -20.0.
For non-model organisms, these thresholds must be empirically adjusted. A common practice is to filter using only QUAL and depth (DP) to avoid overfitting to known variant distributions [2]. Filtering can be performed using GATK VariantFiltration or BCFtools.
Visual Representation of the Workflow
The following Mermaid diagram illustrates the end-to-end variant calling pipeline for veterinary diagnostic NGS.
flowchart TD
A[Raw FASTQ Reads], > B[Quality Control: FastQC]
B, > C{Pass QC?}
C, Yes, > D[Read Trimming: Trimmomatic]
C, No, > E[Re-sequencing or Discard]
D, > F[Alignment: BWA-MEM to Reference]
F, > G[Post-processing: Sort, Index, MarkDuplicates]
G, > H[Variant Calling]
H, > I{GATK HaplotypeCaller / FreeBayes / DeepVariant}
I, > J[Raw VCF / gVCF]
J, > K[Hard Filtering: VariantFiltration / BCFtools]
K, > L[Filtered VCF]
L, > M[Annotation: SnpEff / VEP]
M, > N[Clinically Interpretable Variants]
Annotation and Prioritization
After filtering, variants must be annotated to determine their functional impact. Annotation tools such as SnpEff, Ensembl VEP, and ANNOVAR predict the effect of each variant on known transcripts, including missense, nonsense, splice-site, and synonymous changes [5, 6]. For veterinary diagnostics, annotation databases for the target species must be available. For pathogens, databases such as NCBI's RefSeq or specific gene annotations (e.g., antimicrobial resistance genes from ResFinder) are used [2].
Prioritization focuses on nonsynonymous changes, frameshifts, stop gains/losses, and variants in highly conserved regions. For host samples, variants in disease-associated genes (e.g., those causing inherited disorders in dogs or cats) are highlighted. For pathogen samples, variants that potentially alter drug susceptibility or virulence are flagged [1, 3].
Integration with Long-Read and Single-Cell Approaches
The described workflow is primarily for short-read platforms. Long-read sequencing technologies, such as those discussed in Long-Read Sequencing Technologies: PacBio and Oxford Nanopore, produce raw signals that require specialized basecalling (see Basecalling Algorithms for Nanopore Sequencing) and alignment tools (minimap2). Variant calling for long reads often uses tools like Clair3 or PEPPER-Margin-DeepVariant, which incorporate the error profiles unique to these platforms [8].
Single-cell RNA sequencing (scRNA-seq) also generates FASTQ data, but its analysis involves additional steps for read count quantification and expression analysis rather than germline variant calling (see Master Guide: Single-Cell RNA Sequencing Bioinformatics Workflows). However, somatic variant detection in single-cell data is an active research area.
Workflow Management and Reproducibility
Complex NGS pipelines require robust workflow management systems to ensure reproducibility. Tools such as Snakemake and Nextflow enable modular, scalable pipeline construction. Their architectural differences are explored in Workflow Management: Snakemake vs. Nextflow: Architectural Comparisons and Workflow Design Rules. Containerization (Docker, Singularity) and environment management (Conda) further enhance reproducibility by locking software versions [5, 7].
Challenges in Veterinary Diagnostics
Several specific challenges arise when applying NGS variant calling in veterinary settings. First, reference genomes for many animal species and pathogens are incomplete or poorly annotated, leading to misalignment and false negatives [3]. Second, sample quality may be variable due to field collection conditions, resulting in higher rates of sequencing artifacts [1]. Third, validation of variant calls via orthogonal methods (e.g., Sanger sequencing) is often limited by cost or lack of validated assays. Fourth, the interpretation of variants in the context of pathogenicity requires curated databases that are not as mature as those for human medicine [2, 4].
Despite these hurdles, ongoing efforts to build species-specific variant databases and to adapt deep learning models to non-human genomes are improving the accuracy and clinical utility of NGS diagnostics [8]. The integration of NGS data with other diagnostic modalities such as serology, ELISA, and point-of-care testing (see Biosensors and Point-of-Care (POC) Veterinary Diagnostics) provides a comprehensive diagnostic picture.
Conclusion
This blueprint outlines the critical steps from raw reads to annotated variants in a veterinary NGS diagnostic pipeline. Each stage, from quality control to variant prioritization, demands careful technical decisions tailored to the species and clinical question. The use of established tools such as GATK, FreeBayes, and DeepVariant, combined with rigorous filtering and annotation, yields high-confidence variant calls that can inform clinical management, outbreak investigations, and genetic disease screening.
References
[1] OIE (World Organisation for Animal Health). Manual of Diagnostic Tests and Vaccines for Terrestrial Animals. OIE, 2023.
[2] Markey, B., Leonard, F., Archambault, M., Cullinan, A., & Maguire, D. Clinical Veterinary Microbiology. 2nd ed., Elsevier, 2013.
[3] Swayne, D. E., Glisson, J. R., McDougald, L. R., Nolan, L. K., Suarez, D. L., & Nair, V. (Eds.). Diseases of Poultry. 14th ed., Wiley-Blackwell, 2020.
[4] Hirsh, D. C., & Zee, Y. C. Veterinary Microbiology. 2nd ed., Blackwell Publishing, 2004.
[5] Mount, D. W. Bioinformatics: Sequence and Genome Analysis. 2nd ed., Cold Spring Harbor Laboratory Press, 2004.
[6] McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., & DePristo, M. A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–1303, 2010.
[7] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., & Durbin, R. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079, 2009.
[8] Poplin, R., Chang, P.-C., Alexander, D., Schwartz, S., Colthurst, T., Ku, A., Newburger, D., Dijamco, J., Nguyen, N., Afshar, P. T., Gross, S. S., Dorfman, L., McLean, C. Y., & DePristo, M. A. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology, 36(10), 983–987, 2018. *** Disclaimer This article is for educational and informational purposes only. It is not intended to substitute for professional veterinary advice, diagnosis, treatment, or regulatory guidance. Always consult a licensed veterinarian or qualified specialist regarding animal health, disease diagnosis, and therapeutic decisions.