Variant Call Format (VCF): Anatomy of Genomic Variant Representations and BCFtools Processing
The Variant Call Format (VCF) is the standard file format for storing genetic variation data, including single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and larger structural variants, relative to a reference genome [1, 2]. VCF was originally developed for the 1000 Genomes Project and has since been adopted broadly across genomics, including veterinary medicine and comparative biology [3, 2]. The format is designed to be compact, extensible, and suitable for high-throughput sequencing applications, from small-scale targeted assays to biobank-scale population studies [4, 3]. In veterinary diagnostics, VCF files underlie analyses of pathogen genome evolution, host genetic susceptibility, and the identification of clinically relevant mutations in both infectious agents and domesticated animals [5, 6]. This article examines the VCF specification, the core processing toolkit BCFtools, and a set of companion software for variant manipulation, quality control, and downstream analysis [7, 8, 9, 10].
Anatomy of the VCF Specification
A VCF file consists of a header section followed by a tab-delimited body. Each body line is one variant record; a multi-allelic site can be written as a single record with comma-separated ALT alleles or split into several records at the same position [1, 2]. The header begins with the line ##fileformat=VCFv4.x and contains meta-information lines prefixed with ## that describe the source, reference genome, INFO and FORMAT fields, filter definitions, and sample metadata [1, 2]. The column header line starts with a single # (it begins #CHROM) and names eight mandatory fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) followed by a FORMAT column (if genotype data are present) and one column per sample [2].
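The three-part layout described above (meta-information lines, one column header line, then records) can be separated with a few lines of standard-library Python. This is a minimal sketch on a made-up, in-memory file; the sample names `dog1` and `dog2` and the function name `split_vcf` are illustrative assumptions, and real files are usually bgzip-compressed and better read through an HTSlib-based library.

```python
import io

# A tiny, made-up VCF used only for illustration.
EXAMPLE_VCF = """##fileformat=VCFv4.2
##reference=example.fa
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tdog1\tdog2
chr1\t100\t.\tA\tG\t50\tPASS\tDP=20\tGT\t0/1\t1/1
chr1\t200\t.\tC\tT,G\t35\t.\tDP=18\tGT\t1/2\t0/0
"""


def split_vcf(handle):
    """Separate meta-information lines, the column header, and body records."""
    meta, columns, records = [], None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line)                    # meta-information line
        elif line.startswith("#"):
            columns = line[1:].split("\t")       # column header, leading '#' dropped
        else:
            records.append(line.split("\t"))     # one variant record
    return meta, columns, records


meta, columns, records = split_vcf(io.StringIO(EXAMPLE_VCF))
samples = columns[9:]  # everything after FORMAT is a sample name
print(len(meta), columns[:8], samples, len(records))
```

The second record shows a multi-allelic site kept on one line, with two ALT alleles separated by a comma.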
Mandatory Columns
| Column | Description |
|---|---|
| CHROM | Identifier of the reference sequence (e.g., chromosome, contig, or viral segment) [1, 2]. |
| POS | 1-based position of the first base of the variant on the reference [2]. |
| ID | A semicolon-separated list of identifiers (e.g., rs numbers) or "." if none [2]. |
| REF | Reference allele: at least one base of the reference sequence, starting at POS [2]. |
| ALT | Comma-separated list of alternate (non-reference) alleles, e.g., a single base for a SNP or a sequence for an indel [2]. |
| QUAL | Phred-scaled quality score for the assertion that the variant exists (e.g., from a variant caller) [2, 11]. |
| FILTER | Filter status: PASS if all filters passed, or a semicolon-separated list of failed filters, or "." if not filtered [2]. |
| INFO | Semicolon-separated key=value pairs providing additional variant-level information [1, 2]. |
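
As a rough illustration of the table above, the eight fixed columns of one record can be split into typed values as follows. The record content, the filter names `q10` and `s50`, and the function names are hypothetical; INFO values are kept as strings because their types are only known from the header definitions.

```python
def parse_info(info_field):
    """Turn 'DP=18;AF=0.25,0.5;DB' into a dict; keys without '=' are flags."""
    if info_field == ".":
        return {}
    info = {}
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_record(line):
    """Split the eight mandatory columns of a VCF body line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return {
        "CHROM": chrom,
        "POS": int(pos),                                   # 1-based
        "ID": [] if vid == "." else vid.split(";"),
        "REF": ref,
        "ALT": [] if alt == "." else alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": None if flt == "." else flt.split(";"),  # None: filters not applied
        "INFO": parse_info(info),
    }


record = parse_record("chr1\t200\trs1;rs2\tC\tT,G\t35.5\tq10;s50\tDP=18;AF=0.25,0.5;DB")
print(record)
```
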
Each sample column contains genotype data encoded according to the FORMAT field definitions. The standard FORMAT fields include GT (genotype), DP (read depth), AD (allelic depths), GQ (genotype quality), and PL (phred-scaled genotype likelihoods) [2, 11]. The GT field uses a slash (/) to denote unphased alleles or a pipe (|) for phased alleles, with numbers representing indices into the REF and ALT lists (0 for reference, 1 for first ALT, etc.) [1, 2].
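The GT indexing rule can be made concrete with a small decoder. This is a sketch with a hypothetical helper name, `decode_gt`; it treats a genotype as phased if any separator is a pipe and maps a missing allele (`.`) to `None`.

```python
import re


def decode_gt(gt, ref, alts):
    """Return (alleles, phased) for a GT string such as '0/1' or '1|2'."""
    alleles_by_index = [ref] + list(alts)   # index 0 is REF, 1.. are ALT alleles
    phased = "|" in gt
    calls = []
    for token in re.split(r"[/|]", gt):
        calls.append(None if token == "." else alleles_by_index[int(token)])
    return calls, phased


print(decode_gt("0/1", "A", ["G"]))        # heterozygous, unphased
print(decode_gt("1|2", "C", ["T", "G"]))   # two different ALT alleles, phased
print(decode_gt("./.", "A", ["G"]))        # missing call
```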
INFO and FORMAT Field Semantics
INFO fields convey variant-level annotations, such as AC (allele count in genotypes), AF (allele frequency), AN (total number of alleles), and MQ (root mean square mapping quality) [1, 2]. FORMAT fields provide per-sample metrics; the most commonly encoded are:
- GT: Genotype call, e.g., 0/1 for heterozygous.
- DP: Total read depth at the site for that sample.
- AD: Comma-separated read depth for each allele (REF, ALT1, ALT2, ...).
- GQ: Phred-scaled probability that the GT assignment is incorrect.
- PL: Normalized phred-scaled likelihoods for all possible genotypes.
These quantitative fields are essential for downstream filtering and quality assessment, and their documentation in the header ensures interpretability [2]. In veterinary applications, INFO fields may contain annotations specific to pathogen drug resistance markers or breed-specific polymorphisms [6, 12].
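The relationship between these fields can be sketched in a few lines. A Phred-scaled value Q corresponds to a probability p = 10^(-Q/10), so GQ = 30 means roughly a 1-in-1000 chance that the genotype is wrong. The sample string below is invented, and the helper names are assumptions made for this example.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's colon-separated values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Trailing fields may be omitted, so pad with the missing-value marker.
    values += ["."] * (len(keys) - len(values))
    return dict(zip(keys, values))


def phred_to_prob(q):
    """Convert a Phred-scaled value Q to a probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def allele_counts(gts, n_alt):
    """Recompute AC, AN, and AF from GT strings, skipping missing alleles."""
    ac = [0] * n_alt
    an = 0
    for gt in gts:
        for token in gt.replace("|", "/").split("/"):
            if token == ".":
                continue
            an += 1
            if int(token) > 0:
                ac[int(token) - 1] += 1
    af = [c / an for c in ac] if an else [None] * n_alt
    return ac, an, af


call = parse_sample("GT:DP:AD:GQ:PL", "0/1:20:11,9:30:45,0,60")
gq_error = phred_to_prob(int(call["GQ"]))            # chance the GT is wrong
ref_depth, alt_depth = (int(x) for x in call["AD"].split(","))
alt_fraction = alt_depth / (ref_depth + alt_depth)   # share of reads supporting ALT
ac, an, af = allele_counts(["0/1", "1|1", "./."], n_alt=1)
print(gq_error, alt_fraction, ac, an, af)
```

Recomputing AC, AN, and AF from the genotypes is a useful consistency check after subsetting samples, since INFO values written for the original cohort no longer apply.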
VCF Flavors and Simulation Frameworks
Beyond the standard VCF, several specialized representations exist. The genomic VCF (gVCF) records all sites in the genome, including non-variant positions, with a block representation for homozygous reference stretches [13, 14]. All-sites VCFs are particularly useful for population genetic analyses that require the full site frequency spectrum; the tool vcfsim provides flexible simulation of such files with missing data patterns [13]. The genotype likelihood (GL) simulator vcfgl generates VCF/BCF files containing GL values, enabling benchmarking of genotype callers under realistic sequencing error models [14]. Binary VCF (BCF) is the binary counterpart of VCF; it is compressed and can be indexed for rapid random access. The BCFtools suite reads and writes both VCF and BCF [7, 8].
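The block representation can be illustrated by expanding records back into per-site coverage. The sketch below assumes that a homozygous-reference block stores its last position in an INFO key named END, which is how commonly produced gVCFs encode blocks; for simplicity it treats every record without END as covering a single site, which ignores the span of deletions.

```python
def block_positions(pos, info):
    """Return the reference positions covered by one record.

    Assumes the block end is stored in INFO under END; records without
    END are treated as single-site records in this simplified sketch.
    """
    end = int(info.get("END", pos))
    return range(pos, end + 1)


# Made-up records: (POS, parsed INFO)
records = [
    (100, {"END": "104"}),   # hom-ref block spanning five sites
    (105, {}),               # ordinary variant record
]
covered = [p for pos, info in records for p in block_positions(pos, info)]
print(covered)
```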
Analysis at biobank scale (exceeding one million genomes) has motivated alternatives to row-oriented VCF. The scalable variant call representation (SVCR) was proposed to keep multi-sample call sets tractable beyond one million genomes [3], and a Zarr-based encoding stores VCF data as compressed, column-oriented arrays [4]. The Zarr-based approach allows parallel access and integration with cloud computing environments, but standard VCF remains the lingua franca for data exchange [4, 3].
BCFtools and the VCF Processing Ecosystem
BCFtools is a set of command-line utilities for manipulating VCF and BCF files, built on the HTSlib C library [7, 8]. Core operations include view (conversion and filtering), query (extraction of specific fields), norm (left-alignment and normalization of indels), merge (combining multiple files), call (genotype calling from GLs), and filter (applying user-defined expressions) [7, 8, 9]. This breadth makes BCFtools the de facto standard for VCF processing in both human and veterinary genomics.
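A typical chain of these subcommands (normalize, soft-filter, keep passing records, index) might be assembled from Python as sketched below. The file names, the intermediate names `norm.vcf.gz` and `flagged.vcf.gz`, the `LowQual` label, and the `QUAL<30` threshold are all placeholders chosen for this example, not recommendations. The commands are only executed if `bcftools` is installed and the input file exists.

```python
import os
import shlex
import shutil
import subprocess


def build_pipeline(raw_vcf, reference_fasta, out_vcf):
    """Return bcftools commands: normalize, soft-filter, keep PASS, index."""
    return [
        # Left-align indels against the reference and split multi-allelic records.
        ["bcftools", "norm", "-f", reference_fasta, "-m", "-any",
         "-Oz", "-o", "norm.vcf.gz", raw_vcf],
        # Tag records matching the exclusion expression with a soft filter label.
        ["bcftools", "filter", "-e", "QUAL<30", "-s", "LowQual",
         "-Oz", "-o", "flagged.vcf.gz", "norm.vcf.gz"],
        # Keep only records whose FILTER column is PASS.
        ["bcftools", "view", "-f", "PASS", "-Oz", "-o", out_vcf,
         "flagged.vcf.gz"],
        # Build a tabix index for random access.
        ["bcftools", "index", "-t", out_vcf],
    ]


commands = build_pipeline("raw.vcf.gz", "ref.fa", "filtered.vcf.gz")
for cmd in commands:
    print(shlex.join(cmd))

# Run only when the tool and the (hypothetical) input are actually present.
if shutil.which("bcftools") and os.path.exists("raw.vcf.gz"):
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Passing argument lists rather than a shell string avoids quoting problems with filter expressions that contain characters such as `<` or `&`.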
Complementary C++ and Python APIs provide programmatic access. The vcfpp library offers a C++ API for rapid VCF/BCF parsing and manipulation, optimized for high-performance applications [8]. Vcflib, bio-vcf, cyvcf2, hts-nim, and slivar represent a spectrum of free software tools covering filtering, annotation, and custom analysis; slivar in particular implements expression-based filtering for trio and population analyses [10]. For rapid user-defined filtering and formatting, Vcfexpress evaluates user-supplied expressions against each record [7]. The Vembrane tool uses a Python expression syntax for filtering and transforming VCF/BCF records, bridging the gap between command-line pipelines and interactive analysis [9].
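The idea behind expression-based filtering can be sketched in plain Python: a user-supplied expression is compiled once and then evaluated against the fields of each record. This is an illustration of the general technique only; it is not how Vembrane, Vcfexpress, or slivar are implemented, and the record dictionaries and the `make_filter` helper are invented for the example.

```python
def make_filter(expression):
    """Compile a Python expression into a predicate over record dicts."""
    code = compile(expression, "<filter>", "eval")

    def predicate(record):
        names = {
            "CHROM": record["CHROM"],
            "POS": record["POS"],
            "QUAL": record["QUAL"],
            "INFO": record["INFO"],
        }
        # Empty builtins limit what the expression can reach. This is not a
        # security sandbox and must not be used on untrusted input.
        return bool(eval(code, {"__builtins__": {}}, names))

    return predicate


records = [
    {"CHROM": "chr1", "POS": 100, "QUAL": 50.0, "INFO": {"DP": 20}},
    {"CHROM": "chr1", "POS": 200, "QUAL": 12.0, "INFO": {"DP": 35}},
    {"CHROM": "chr2", "POS": 300, "QUAL": 40.0, "INFO": {"DP": 4}},
]
keep = make_filter('QUAL >= 30 and INFO["DP"] >= 10')
passing = [r["POS"] for r in records if keep(r)]
print(passing)
```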
VCF Quality Control and Error Simulation
Accurate genotyping depends on understanding error models. The vcferr framework simulates SNP genotyping errors from a realistic model, allowing researchers to evaluate the impact of error rates on downstream statistics such as heterozygosity and population differentiation [11]. VCF observer offers a user-friendly GUI for preliminary file comparison and quality control, summarizing variant counts, transition/transversion ratios, and depth distributions across samples [15].
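One of the summary statistics mentioned above, the transition/transversion (Ts/Tv) ratio, is simple to compute from biallelic SNP records. The sketch below is a generic calculation, not the implementation used by any of the cited tools; it assumes single-base REF and ALT alleles that differ from each other, and the SNP list is made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition stays within purines (A<->G) or pyrimidines (C<->T)."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """snps: iterable of (REF, ALT) single-base pairs with REF != ALT."""
    snps = list(snps)
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("T", "C"), ("G", "A")]
ratio = ts_tv_ratio(snps)
print(ratio)
```

A ratio far from the value expected for the species and sequencing target is a common warning sign of excess false-positive calls.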
Population Genetics and Phylogenetic Inference
VCF files are a natural input for population genetics and phylogenetic calculations. VCF2Dis enables ultra-fast pairwise genetic distance computation and construction of neighbor-joining phylogenies directly from VCF files [16]. FSTest computes the fixation index (FST) between pairs of populations from VCF data, supporting genome-wide scans for selection in livestock or wildlife populations [6]. For marker-assisted breeding and diagnostic panel design, V-primer facilitates genome-wide design of InDel and SNP markers from multi-sample VCF genotyping data [12].
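To show what these calculations involve, the sketch below computes a simple pairwise distance from genotypes and a per-site fixation index for two populations. Both are textbook simplifications chosen for clarity, not the algorithms used by VCF2Dis or FSTest: the distance is the mean absolute difference in non-reference allele dosage divided by two, and the fixation index is (H_T - H_S) / H_T at a biallelic site with equally weighted populations. The sample names and genotypes are invented.

```python
from itertools import combinations


def alt_dosage(gt):
    """Count non-reference alleles in a GT; None if any allele is missing."""
    tokens = gt.replace("|", "/").split("/")
    if "." in tokens:
        return None
    return sum(1 for t in tokens if t != "0")


def pairwise_distance(genotypes):
    """genotypes: {sample: [GT per site]} -> {(a, b): mean |dosage diff| / 2}."""
    dist = {}
    for a, b in combinations(sorted(genotypes), 2):
        total, n = 0.0, 0
        for ga, gb in zip(genotypes[a], genotypes[b]):
            da, db = alt_dosage(ga), alt_dosage(gb)
            if da is None or db is None:
                continue            # skip sites missing in either sample
            total += abs(da - db) / 2
            n += 1
        dist[(a, b)] = total / n if n else None
    return dist


def fst_biallelic(p1, p2):
    """(H_T - H_S) / H_T from ALT allele frequencies in two populations."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                        # pooled heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2    # mean within-population
    return (h_t - h_s) / h_t if h_t else 0.0


genotypes = {
    "s1": ["0/0", "0/1", "1/1"],
    "s2": ["0/0", "1/1", "./."],
    "s3": ["1/1", "0/1", "0/0"],
}
distances = pairwise_distance(genotypes)
print(distances)
print(fst_biallelic(0.2, 0.8))
```

The resulting distance matrix is the kind of input a neighbor-joining algorithm would consume to build a tree.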
Visualization, Exploration, and Interoperability
Several visualization tools have been developed specifically for VCF data. xVCF provides exquisite interactive plots of variant features, including per-sample depth, allele frequency spectra, and genotype concordance [1]. VCFshiny is an R/Shiny application for interactive analysis and visualization, enabling researchers without programming expertise to explore variant matrices [17]. DivBrowse offers a web-based interface for exploratory data analysis of large VCF matrices, supporting filtering and heatmap generation [18]. SCI-VCF is a cross-platform GUI that summarizes, compares, inspects, and visualizes VCF files, including features for Mendelian error detection and sample relatedness [19].
Interoperability across genome assemblies is handled by BCFtools/liftover, which converts variant coordinates between different reference builds with high accuracy [20]. For FAIR (Findable, Accessible, Interoperable, Reusable) data management, standard recommendations for VCF metadata and file structure have been formalized for plant genotyping, and these principles extend to veterinary contexts [2]. The Empirical Genotype Generalizer (EGGS) provides a method to generalize genotype calls across related samples, useful when reference panels are incomplete [5].
Mermaid Workflow Diagram
The following diagram illustrates a typical veterinary genomic variant analysis pipeline from raw sequencing reads to variant annotation and interpretation, highlighting the central role of VCF and BCFtools.
flowchart TD
A["Raw sequencing reads (FASTQ)"] --> B["Alignment to reference genome (BAM/CRAM)"]
B --> C["Variant calling (GATK, FreeBayes, etc.)"]
C --> D["Raw VCF"]
D --> E{"BCFtools filter & norm"}
E --> F["Filtered, normalised VCF"]
F --> G["BCFtools merge (multi-sample)"]
G --> H["Annotated VCF (SnpEff, VEP)"]
H --> I["Downstream analysis"]
I --> J["Population genetics (FSTest, VCF2Dis)"]
I --> K["Phylogeny construction"]
I --> L["Marker design (V-primer)"]
I --> M["Diagnostic panel design"]
D --> N["Quality control (VCF observer, vcferr)"]
N --> E
Applications in Veterinary Genomics and Infectious Disease
VCF-based analyses are fundamental to veterinary molecular diagnostics. For example, in canine parvovirus surveillance, VCF files derived from whole genome sequencing or targeted amplicon sequencing are used to track the emergence of variants such as CPV-2a, CPV-2b, and CPV-2c, which affect host range and vaccine efficacy [5, 6]. Similarly, in porcine reproductive and respiratory syndrome (PRRS) genomic surveillance, VCF files enable the identification of recombination breakpoints and positive selection in envelope glycoprotein genes, informing vaccine strain updates [5, 6]. In avian coronavirus infectious bronchitis virus (IBV), VCF data facilitate the classification of genotypes and serotypes and the detection of spike protein mutations that alter tissue tropism [12]. The format also supports analysis of structural variants in viral genomes when coupled with deep learning annotation methods [5, 3].
Conclusion
The Variant Call Format is a mature, extensible standard for encoding genomic variation. Its structure, from header to sample columns, supports a wide range of downstream applications, from basic quality control and population genetics to clinical diagnostics and vaccine design. BCFtools and its ecosystem of companion tools (vcfpp, slivar, Vcfexpress, Vembrane, xVCF, VCFshiny, DivBrowse, and others) provide a comprehensive, efficient, and interoperable processing environment. As veterinary genomics continues to expand, with whole-genome sequencing of livestock, companion animals, and microbial pathogens becoming routine, mastery of VCF and BCFtools is essential for accurate variant discovery and interpretation.
References
[1] Almuneef G, Aljouie A, Bokhari Y et al. XVCF: Exquisite Visualization of VCF Data from Genomic Experiments. SLAS Technol 2026. https://pubmed.ncbi.nlm.nih.gov/42323052/
[2] Beier S, Fiebig A, Pommier C et al. Recommendations for the formatting of Variant Call Format (VCF) files to make plant genotyping data FAIR. F1000Res 2022. https://pubmed.ncbi.nlm.nih.gov/35811804/
[3] Poterba T, Vittal C, King D et al. The scalable variant call representation: enabling genetic analysis beyond one million genomes. Bioinformatics 2024. https://pubmed.ncbi.nlm.nih.gov/39718771/
[4] Czech E, Tyler W, White T et al. Analysis-ready VCF at Biobank scale using Zarr. Gigascience 2025. https://pubmed.ncbi.nlm.nih.gov/40451243/
[5] Smith TQ, Rahman A, Szpiech ZA. EGGS: Empirical Genotype Generalizer for Samples. Bioinform Adv 2026. https://pubmed.ncbi.nlm.nih.gov/42164080/
[6] Vahedi SM, Ardestani SS. FSTest: an efficient tool for cross-population fixation index estimation on variant call format files. J Genet 2024. https://pubmed.ncbi.nlm.nih.gov/38258299/
[7] Pedersen BS, Quinlan AR. Vcfexpress: flexible, rapid user-expressions to filter and format VCFs. Bioinformatics 2025. https://pubmed.ncbi.nlm.nih.gov/40037622/
[8] Li Z. vcfpp: a C++ API for rapid processing of the variant call format. Bioinformatics 2024. https://pubmed.ncbi.nlm.nih.gov/38273677/
[9] Hartmann T, Schröder C, Kuthe E et al. Insane in the vembrane: filtering and transforming VCF/BCF files. Bioinformatics 2023. https://pubmed.ncbi.nlm.nih.gov/36519840/
[10] Garrison E, Kronenberg ZN, Dawson ET et al. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput Biol 2022. https://pubmed.ncbi.nlm.nih.gov/35639788/
[11] Nagraj VP, Scholz M, Jessa S et al. vcferr: Development, validation, and application of a single nucleotide polymorphism genotyping error simulation framework. F1000Res 2022. https://pubmed.ncbi.nlm.nih.gov/38779458/
[12] Natsume S, Oikawa K, Nomura C et al. V-primer: software for the efficient design of genome-wide InDel and SNP markers from multi-sample variant call format (VCF) genotyping data. Breed Sci 2023. https://pubmed.ncbi.nlm.nih.gov/38106505/
[13] Goulart P, Samuk K. vcfsim: flexible simulation of all-sites VCFs with missing data. BMC Bioinformatics 2026. https://pubmed.ncbi.nlm.nih.gov/42050398/
[14] Altinkaya I, Nielsen R, Korneliussen TS. vcfgl: a flexible genotype likelihood simulator for VCF/BCF files. Bioinformatics 2025. https://pubmed.ncbi.nlm.nih.gov/40045175/
[15] Emül AA, Ergün MA, Ertürk RA et al. VCF observer: a user-friendly software tool for preliminary VCF file analysis and comparison. BMC Bioinformatics 2024. https://pubmed.ncbi.nlm.nih.gov/39227760/
[16] Xu L, He W, Tai S et al. VCF2Dis: an ultra-fast and efficient tool to calculate pairwise genetic distance and construct population phylogeny from VCF files. Gigascience 2025. https://pubmed.ncbi.nlm.nih.gov/40184433/
[17] Chen T, Tang C, Zheng W et al. VCFshiny: an R/Shiny application for interactively analyzing and visualizing genetic variants. Bioinform Adv 2023. https://pubmed.ncbi.nlm.nih.gov/37701675/
[18] König P, Beier S, Mascher M et al. DivBrowse-interactive visualization and exploratory data analysis of variant call matrices. Gigascience 2022. https://pubmed.ncbi.nlm.nih.gov/37083938/
[19] Kamaraj V, Sinha H. SCI-VCF: a cross-platform GUI solution to summarize, compare, inspect and visualize the variant call format. NAR Genom Bioinform 2024. https://pubmed.ncbi.nlm.nih.gov/38984067/
[20] Genovese G, Rockweiler NB, Gorman BR et al. BCFtools/liftover: an accurate and comprehensive tool to convert genetic variants across genome assemblies. Bioinformatics 2024. https://pubmed.ncbi.nlm.nih.gov/38261650/
[21] Luo X, Chen Y, Liu L et al. GSC: efficient lossless compression of VCF files with fast query. Gigascience 2024. https://pubmed.ncbi.nlm.nih.gov/39028587/