What is Dr. Zubair Khalid's research focus?

Dr. Zubair Khalid specializes in molecular virology, mRNA vaccine development, and computational biology, with a focus on avian pathogens like IBDV and Avian Reovirus.

Where is Dr. Zubair Khalid currently working?

Dr. Zubair Khalid is a Postdoctoral Research Associate at the University of Maryland (UMD), specifically within the Department of Animal and Avian Sciences.

SAM and BAM Formats: Mapping, Alignment Representation, and Indexing with Samtools

Introduction

The Sequence Alignment/Map (SAM) format and its binary counterpart (BAM) have become the de facto standards for storing nucleotide sequence alignments against reference genomes [1, 2]. These formats were developed to accommodate the massive data volumes generated by high-throughput sequencing platforms, enabling efficient storage, retrieval, and manipulation of alignment records [3, 4]. In veterinary virology and diagnostics, SAM and BAM files underpin a wide range of analyses, including pathogen detection, variant calling, and transcript quantification [5, 6]. The Samtools suite provides a comprehensive set of tools for processing these files, including sorting, indexing, filtering, and format conversion [7, 8].

The SAM format is a tab-delimited text format that stores alignment information in a human-readable manner [1, 2]. Each alignment line consists of 11 mandatory fields followed by optional tags [3, 4]. The BAM format is a compressed, binary representation of the same data, optimized for computational performance and reduced storage footprint [9, 10]. The CRAM format offers further compression [11, 12], but SAM and BAM remain the most widely used. Accurate representation of read mapping against reference sequences is critical for downstream analyses such as Variant Calling in Whole Exome Sequencing (WES) and Single-Cell RNA-Seq Analysis Pipelines for Veterinary Immunology [13, 14].

This article provides an exhaustive technical review of the SAM and BAM formats, covering their structural specifications, the principles of alignment representation, indexing mechanisms, and the functionality of the Samtools software suite. Emphasis is placed on applications in veterinary medicine and diagnostics, drawing on examples from pathogen genomics and host transcriptomics.

SAM Format Structure

The SAM format is defined by the SAM/BAM Format Specification Working Group [3, 4]. A SAM file consists of a header section and an alignment section. Each header line begins with the character @ and carries metadata about the reference sequences, read groups, and processing programs [1, 2]. The alignment section contains one line per alignment record, so each mate of a read pair has its own line, and every line has 11 mandatory tab-separated fields [3, 4].

Mandatory Fields

Field	Description
QNAME	Query template name (read identifier)
FLAG	Bitwise flag indicating mapping properties (e.g., paired, mapped, mate unmapped)
RNAME	Reference sequence name (or `*` if unmapped)
POS	1-based leftmost mapping position
MAPQ	Mapping quality (Phred-scaled probability of misalignment)
CIGAR	Compact Idiosyncratic Gapped Alignment Report string describing alignment operations
RNEXT	Reference name of the mate or next read (`=` if same as RNAME, `*` if unavailable)
PNEXT	1-based position of the mate or next read (0 if unavailable)
TLEN	Observed template length (insert size)
SEQ	Read sequence (or `*` if stored elsewhere)
QUAL	ASCII-encoded Phred quality scores (or `*` if not stored)
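The field layout above can be illustrated with a minimal parsing sketch. It assumes a single well-formed alignment line; the names `SamRecord` and `parse_sam_line` and the example read are hypothetical and are not part of any library.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM fields plus any raw optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Tuple[str, ...]


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-delimited alignment line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=tuple(fields[11:]),  # optional TAG:TYPE:VALUE fields, left unparsed
    )


# Made-up example record, for illustration only.
example_line = "\t".join(
    ["read1", "99", "chr1", "1000", "60", "8M", "=", "1200", "208",
     "ACGTACGT", "IIIIIIII", "NM:i:0"]
)
record = parse_sam_line(example_line)
```

Header lines (those starting with @) are not handled here and would need to be skipped before calling the parser.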

The FLAG field is a decimal integer that represents a binary bitmask [1, 2]. For example, a value of 99 (binary 1100011, i.e. 0x1 + 0x2 + 0x20 + 0x40) indicates a read that is paired, mapped in a proper pair, has its mate on the reverse strand, and is the first read of the pair. The CIGAR string is essential for alignment representation, encoding operations such as matches (M), insertions (I), deletions (D), skipped regions (N), and soft clipping (S) [3, 4]. The CIGAR P operator allows padded alignments, such as those in multiple sequence alignments, to be represented, and version 1.5 of the specification introduced extensions for padded reference sequences and annotation tags [15].
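As a small sketch of how the bitmask is read, the following decodes a FLAG value into its set bits. The bit values follow the SAM specification; the label strings and the function name `decode_flag` are informal choices made for this example.

```python
# Bit values as defined in the SAM specification; label strings are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
```

The mate of such a read would typically carry FLAG 147 (0x1 + 0x2 + 0x10 + 0x80): paired, properly paired, itself on the reverse strand, and second in the pair.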

Optional Tags

SAM files support a wide range of optional tags in the format TAG:TYPE:VALUE [3, 4]. Common tags include the edit distance to the reference (NM), the string describing mismatching positions and deleted reference bases (MD), and the aligner-assigned alignment score (AS). Tags are essential for storing downstream analysis results and metadata, such as the lowest common ancestor in metagenomic analyses [16] or splice junction information [13]. The SAM specification defines standard tags, but custom tags can be added [4].
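A minimal sketch of reading the TAG:TYPE:VALUE layout is shown below. It converts only the integer (`i`) and float (`f`) types and leaves every other type as a string, which is a simplification; the function name `parse_tag` is hypothetical.

```python
def parse_tag(field: str):
    """Split one optional field into (tag, value), converting numeric types."""
    # maxsplit=2 keeps any colons inside the value (e.g. in Z-type strings).
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    # Other types (such as A, Z, H, B) are returned as raw strings in this sketch.
    return tag, value


tags = dict(parse_tag(f) for f in ["NM:i:2", "MD:Z:10A5^AC6", "AS:i:45"])
```

With the made-up fields above, `tags["NM"]` is the integer 2 and `tags["MD"]` is the string `10A5^AC6`.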

BAM Format

BAM is the binary compressed version of SAM [9, 10]. It uses the BGZF (Blocked GNU Zip Format) compression scheme, which enables random access to blocks of data [7, 17]. Each BAM record is encoded in a compact binary representation that mirrors the SAM fields, using fixed-width little-endian integers, a 4-bit-per-base encoding of the read sequence, and packed integers for CIGAR operations [10]. A BAM file is substantially smaller than the equivalent SAM file, often by 60-80% [9, 18].

The BAM file structure consists of a magic number (BAM\1), a binary header, a list of reference sequence names and lengths, and a series of alignment blocks. Each block contains a block size, a record of variable length, and optional auxiliary data [10]. The BGZF compression allows efficient indexing for rapid access to specific genomic regions [7, 19]. Several libraries provide API support for reading and writing BAM files, including HTSlib [7], BamTools [10], SeqLib [20], and bíogo/hts [19]. High-performance frameworks such as Sam2bam [8] and RabbitBAM [21] optimize BAM processing on multi-core platforms, achieving significant speedups over standard implementations.
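To make the header layout concrete, the sketch below builds and parses the uncompressed BAM header bytes (magic, header text, reference list) with Python's `struct` module. It is an illustration under stated assumptions rather than a BAM reader: ordinary `gzip` stands in for BGZF (a real BGZF writer adds an extra gzip subfield and caps block size), no alignment records are handled, and the function names are hypothetical.

```python
import gzip
import struct


def build_bam_header(text: str, references: list) -> bytes:
    """Serialize magic, header text and (name, length) pairs, little-endian."""
    text_bytes = text.encode("ascii")
    payload = b"BAM\x01" + struct.pack("<I", len(text_bytes)) + text_bytes
    payload += struct.pack("<I", len(references))
    for name, length in references:
        name_bytes = name.encode("ascii") + b"\x00"  # names are NUL-terminated
        payload += struct.pack("<I", len(name_bytes)) + name_bytes
        payload += struct.pack("<I", length)
    return payload


def parse_bam_header(payload: bytes):
    """Inverse of build_bam_header: return (header_text, [(name, length), ...])."""
    if payload[:4] != b"BAM\x01":
        raise ValueError("missing BAM magic number")
    offset = 4
    (l_text,) = struct.unpack_from("<I", payload, offset)
    offset += 4
    text = payload[offset:offset + l_text].decode("ascii")
    offset += l_text
    (n_ref,) = struct.unpack_from("<I", payload, offset)
    offset += 4
    references = []
    for _ in range(n_ref):
        (l_name,) = struct.unpack_from("<I", payload, offset)
        offset += 4
        name = payload[offset:offset + l_name - 1].decode("ascii")  # drop the NUL
        offset += l_name
        (l_ref,) = struct.unpack_from("<I", payload, offset)
        offset += 4
        references.append((name, l_ref))
    return text, references


# Round trip through gzip, standing in for BGZF compression (assumption).
header_text = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n"
compressed = gzip.compress(build_bam_header(header_text, [("chr1", 1000)]))
parsed_text, parsed_refs = parse_bam_header(gzip.decompress(compressed))
```

Because BGZF blocks are valid gzip members, general-purpose gzip tools can decompress a BAM file, although only BGZF-aware code can seek within it.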

Compression and Efficiency

Compression of alignment data remains an active area of research [9, 18, 22]. The CSAM format compresses SAM files using advanced modeling of alignment fields [9, 22]. Genozip achieves compression ratios up to 2.7 times better than CRAM version 3.1 for BAM files [12]. AliCo provides lossless and lossy compression modes, leveraging reference sequence information to reduce file sizes [18]. For veterinary genomics, where large numbers of samples may be stored, efficient compression is critical for data management and transfer.

Mapping and Alignment Representation

Alignment representation in SAM/BAM is based on the concept of mapping reads to a linear reference genome [1, 2]. The alignment algorithm determines the optimal placement of each read, producing a CIGAR string that describes the alignment operations [3, 4]. The CIGAR string uses base-level operations:

M: Alignment match (can be sequence match or mismatch)
I: Insertion relative to the reference
D: Deletion relative to the reference
N: Skipped region (e.g., splice junction in RNA-seq)
S: Soft clipping (bases present in the read but not aligned)
H: Hard clipping (bases removed from the read and not stored)
P: Padding (silent deletion from a padded reference)
=: Sequence match
X: Sequence mismatch
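The operators listed above differ in whether they consume bases of the read, of the reference, or of both. The following sketch parses a CIGAR string and totals both lengths; the helper names and the example string are made up for illustration.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar: str) -> list:
    """Return [(length, op), ...]; '*' (CIGAR unavailable) yields an empty list."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Reject strings containing characters the pattern silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> tuple:
    """Return (query_length, reference_span) implied by a CIGAR string."""
    query = reference = 0
    for length, op in parse_cigar(cigar):
        if op in CONSUMES_QUERY:
            query += length
        if op in CONSUMES_REFERENCE:
            reference += length
    return query, reference


# A spliced read with soft clipping, an insertion, an intron and a deletion.
print(cigar_lengths("5S30M2I10M1000N20M3D8M"))  # (75, 1071)
```

The query length should equal the length of SEQ when SEQ is stored, and POS plus the reference span minus one gives the rightmost aligned reference position.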

The N operation is particularly important for transcriptome alignments where reads span splice junctions [13, 14]. For RNA-seq data, aligners often output CIGAR strings with N operators to indicate introns. The SAM format also supports paired-end reads, with mapping positions for both ends stored in the mandatory fields [3].

Mapping quality (MAPQ) is a Phred-scaled estimate of the probability that the alignment is incorrect [2]. A MAPQ of 30 corresponds to a 1 in 1000 chance of misalignment. MAPQ values are used in downstream filtering and variant calling. The SAM specification also defines flags for secondary and supplementary alignments, allowing representation of multi-mapping reads and chimeric alignments [3, 4]. For metagenomic analyses, the SAM flags can be used to retain or filter alignments that map to multiple species [16].
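The Phred relationship, MAPQ = -10 log10(P(mapping is wrong)), and a simple flag-based filter can be sketched as follows. The threshold of 30 and the function names are assumptions chosen for this example, not recommended defaults.

```python
import math

UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapq_to_error_probability(mapq: int) -> float:
    """Convert a Phred-scaled MAPQ into the probability the mapping is wrong."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(probability: float) -> int:
    """Convert an error probability back to a rounded Phred-scaled MAPQ."""
    return round(-10 * math.log10(probability))


def is_confident_primary(flag: int, mapq: int, min_mapq: int = 30) -> bool:
    """Keep mapped, primary (non-secondary, non-supplementary) alignments
    whose MAPQ reaches the chosen threshold."""
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    return mapq >= min_mapq
```

For instance, `mapq_to_error_probability(30)` is about 0.001, and a record with FLAG 99 and MAPQ 60 passes the filter while the same record marked secondary (FLAG 355) does not.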

Alignment Workflow

graph TD
    A["Raw Sequencing Reads (FASTQ)"] --> B[Read Alignment Tool]
    B --> C[Unsorted SAM]
    C --> D[Sort by Coordinate]
    D --> E[Sorted BAM]
    E --> F["Indexing (BAI/CSI)"]
    F --> G["Downstream Analysis: Variant Calling, Coverage, Visualization"]
    G --> H[Filtering with Samtools]
    H --> I["Report (VCF, BED, Counts)"]

The workflow begins with raw reads in FASTQ format (see FASTQ File Format: Decoding Phred Quality Scores and Quality Control Workflows). A short-read aligner maps the reads to the reference and writes SAM output. The SAM file is then sorted by genomic coordinate, stored as BAM, indexed, and used for downstream analyses [8, 23]. Sambamba can mark duplicates [23], while SAMBLASTER marks duplicates and extracts reads supporting structural variants directly from the aligner's output stream, before sorting [24].

Indexing

Indexing is essential for rapid retrieval of alignments overlapping a specific genomic region [7, 19]. The most common index formats are BAI (BAM Index) and CSI (Coordinate-Sorted Index) [15]. BAI is a simple uncompressed index that stores chunk boundaries for each reference sequence. CSI is a more flexible index with a configurable minimum bin size and number of binning levels, and it is required when a reference sequence is longer than the 2^29 bases (roughly 537 Mbp) that BAI can address [7, 15].

The indexing algorithm builds a hierarchical binning structure over each reference sequence. Because the BAM file is stored as a series of BGZF blocks, the index can record, for each bin, virtual file offsets that combine the position of a compressed block in the file with a position inside the decompressed block, which enables fast seeking [7, 19]. The bíogo/hts library implements a Go native version of the BAI and CSI indexing schemes [19]. HTSlib provides the reference implementation for index creation and reading [7].
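As an illustration of the binning idea, the sketch below ports the bin calculation given in the SAM specification for the BAI scheme (six levels, smallest bins spanning 16 kbp) and shows how a BGZF virtual offset packs two numbers into one 64-bit value. The Python function names other than `reg2bin` are chosen for this example.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest BAI bin fully containing the 0-based half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return 4681 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return 585 + (beg >> 17)   # 128 kbp bins
    if beg >> 20 == end >> 20:
        return 73 + (beg >> 20)    # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return 9 + (beg >> 23)     # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return 1 + (beg >> 26)     # 64 Mbp bins
    return 0                       # single 512 Mbp bin covering everything


def make_virtual_offset(block_offset: int, within_block: int) -> int:
    """Pack a compressed-block file offset and an offset inside the
    decompressed block (below 65536) into one virtual file offset."""
    return (block_offset << 16) | within_block


def split_virtual_offset(virtual_offset: int) -> tuple:
    """Inverse of make_virtual_offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

A short read usually lands in one of the 16 kbp leaf bins, whereas an alignment that straddles a bin boundary is assigned to the smallest enclosing bin at a higher level. A region query therefore has to inspect the overlapping bins at every level.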

The samtools index command generates BAI or CSI indices. Indexing a coordinate-sorted BAM file allows samtools view to quickly retrieve reads from a specified region (e.g., samtools view -b aligned.bam chr1:1000-2000). This capability is crucial for veterinary diagnostics when interrogating specific pathogen genomic regions in host metagenomic samples [16].

In addition to BAI and CSI, tabix indexing is used for tab-delimited format files like VCF [7, 15]. The SAM specification also includes a padded reference sequence option, facilitating indexing of de novo assemblies [15].

Samtools Toolkit

Samtools is a widely used software suite for manipulating SAM, BAM, and CRAM files [7, 11]. The core library, HTSlib, provides C API functions for reading and writing these formats [7]. Multiple programming language bindings exist, including R (rbamtools) [13], Ruby (Bio-samtools) [25], and Python via pysam. The Samtools command-line tools include:

samtools view: Convert between SAM and BAM, filter by region or flags, extract specific records.
samtools sort: Sort alignments by leftmost coordinate, name, or tag.
samtools index: Generate BAI or CSI index for sorted BAM files.
samtools merge: Combine multiple BAM files.
samtools flagstat: Report alignment statistics.
samtools depth: Compute per-base read depth.
samtools mpileup: Generate pileup summaries for variant calling.
samtools fastq: Extract read sequences from BAM.
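Several of the commands above are usually chained. The sketch below assembles a sort, index, and region-extraction sequence as argument lists and runs them only if `samtools` is found on the PATH. All file names and the region are placeholders, and the helper functions are hypothetical wrappers rather than part of Samtools.

```python
import shutil
import subprocess


def build_pipeline(sam_path: str, sorted_bam: str, region: str, region_bam: str) -> list:
    """Return the samtools invocations for sort, index and region extraction."""
    return [
        # Coordinate-sort the alignments and write BAM.
        ["samtools", "sort", "-o", sorted_bam, sam_path],
        # Build an index next to the sorted BAM.
        ["samtools", "index", sorted_bam],
        # Use the index to extract one region as BAM.
        ["samtools", "view", "-b", "-o", region_bam, sorted_bam, region],
    ]


def run_pipeline(commands: list) -> None:
    """Run each command in order, stopping at the first failure."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on PATH")
    for command in commands:
        subprocess.run(command, check=True)


# Placeholder file names and region, for illustration only.
commands = build_pipeline(
    "aligned.sam", "aligned.sorted.bam", "chr1:1000-2000", "region.bam"
)
```

Passing argument lists instead of a shell string avoids quoting problems with unusual file names; calling `run_pipeline(commands)` would then execute the three steps.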

Additional tools for quality control include Qualimap, which evaluates alignment metrics from BAM files [26]. For visualization, viewers such as Alview [27], GenoViewer [28], and Bambino [29] provide graphical interfaces for browsing BAM alignments. The SAMMate GUI tool facilitates processing of short-read alignments [30].

Sambamba is an alternative implementation written in the D language that offers faster sorting and indexing on multi-core architectures [23]. SAMBLASTER performs duplicate marking as a piped post-pass on aligner output, reducing overall pipeline runtime [24]. The Scramble conversion tool handles SAM, BAM, and CRAM interconversion [11].

For veterinary applications, Samtools is frequently used in pipelines for detecting African swine fever virus (ASFV; see African Swine Fever: Computational Models for Early Detection and Spread Prediction in Wild Boar Populations). Sequencing reads from clinical or environmental samples are aligned to a reference genome (e.g., the ASFV genome), and the alignments are then sorted and indexed to assess coverage and call variants. The samtools depth command provides coverage information critical for evaluating the sensitivity of pathogen detection [31]. Tools like covtobed extract coverage tracks for downstream visualization [31].
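The idea behind per-base depth can be sketched in a few lines. This is a simplified illustration, not a reimplementation of samtools depth: it takes (POS, CIGAR) pairs for a single reference, counts only aligned bases (M, = and X), skips over deletions and introns, and applies no flag, mapping-quality, or base-quality filters. The function name and the input data are made up.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def per_base_depth(alignments) -> Counter:
    """Count aligned bases per 1-based reference position.

    `alignments` is an iterable of (pos, cigar) pairs on one reference.
    """
    depth = Counter()
    for pos, cigar in alignments:
        ref_pos = pos
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in "M=X":
                # Aligned bases cover the reference and add to depth.
                for position in range(ref_pos, ref_pos + length):
                    depth[position] += 1
                ref_pos += length
            elif op in "DN":
                # Deletions and skipped regions advance the reference
                # without adding coverage in this sketch.
                ref_pos += length
            # I, S, H and P do not consume reference positions.
    return depth


depth = per_base_depth([(1, "5M"), (3, "2M1D2M")])
covered = sum(1 for position in range(1, 8) if depth[position] > 0)
```

In this toy input, positions 3 and 4 have depth 2, position 5 has depth 1 because the second read carries a deletion there, and all seven positions from 1 to 7 are covered.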

Computational modeling of viral evolution, as described in Evolutionary Dynamics and Computational Modeling of Viral Mutation Rates, relies on accurate alignment data to identify single nucleotide variants. The samtools mpileup command and downstream variant callers use the alignment information in BAM files to produce variant calls. Integration with cloud computing frameworks such as Hadoop-BAM enables scalable processing of large BAM datasets [32].

Veterinary Applications

In veterinary genomics, SAM/BAM files are central to studies on host-pathogen interactions, genetic diversity, and diagnostic test development. For example, in Structural Comparison of Avian Versus Mammalian Influenza Receptor Binding, alignments of influenza virus reads from avian and mammalian hosts are used to examine receptor binding preferences. The SAM flags and mapping quality values help differentiate true viral reads from host contamination.

Metagenomic analysis of fecal samples from livestock, as discussed in Swine Gut Microbiota and Bacterial Pathogens, produces BAM files after aligning reads to a combined reference database of host and pathogen genomes. Tools like sam2lca utilize SAM alignments to assign taxonomic labels by computing the lowest common ancestor of matching sequences [16]. Proper indexing and filtering are necessary to handle the large number of sequences in such databases.

For RNA-seq experiments in veterinary immunology, splice-aware aligners produce SAM records with N operators in the CIGAR string. Downstream quantification tools parse these alignments to estimate gene expression levels. The SAM format’s optional tags can store junction information, enabling alternative splicing analysis [13, 14].

Diagnostic pipelines for notifiable diseases, such as avian influenza, require rapid and accurate alignment of sequencing data. Samtools’ ability to quickly subset BAM files by genomic region allows targeted analysis of hemagglutinin and neuraminidase segments. The comprehensive workflow from raw FASTQ to indexed BAM is described in the alignment workflow diagram above.

Conclusion

SAM and BAM formats remain foundational to computational genomics and veterinary diagnostics. Their structured representation of sequence alignments, combined with powerful indexing and processing tools like Samtools, enables efficient analysis of high-throughput sequencing data. Ongoing improvements in compression and parallel processing continue to enhance the utility of these formats for large-scale studies. As sequencing technologies advance and become more accessible in veterinary medicine, mastery of SAM/BAM handling and indexing will be critical for accurate pathogen detection, genomic surveillance, and host response analysis.

References

[1] Zhang H. Overview of Sequence Data Formats. Methods Mol Biol. 2016. https://pubmed.ncbi.nlm.nih.gov/27008007/

[2] Robinson P, Hansen P. SAM/BAM Format. 2017. https://www.semanticscholar.org/paper/46aa3b9b2c364cc522fad8e0528749eeb00ea7f6

[3] The SAM/BAM Format Specification Working Group. Sequence Alignment / Map Optional Fields Specification. 2017. https://www.semanticscholar.org/paper/ae70304022f3e97f6bb5fcd6feb171f0cbd099f1

[4] The SAM/BAM Format Specification Working Group. Sequence Alignment/Map Optional Fields Specification. 2018. https://www.semanticscholar.org/paper/bddbb9d850b95360a0e66a969b0f81769607e9ca

[5] Lee CT, Maragkakis M. SamQL: a structured query language and filtering tool for the SAM/BAM file format. BMC Bioinformatics. 2021. https://pubmed.ncbi.nlm.nih.gov/34600480/

[6] Menschaert G, Wang X, Jones AR, et al. The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data. Genome Biol. 2018. https://pubmed.ncbi.nlm.nih.gov/29386051/

[7] Bonfield JK, Marshall J, Danecek P, et al. HTSlib: C library for reading/writing high-throughput sequencing data. Gigascience. 2021. https://pubmed.ncbi.nlm.nih.gov/33594436/

[8] Ogasawara T, Cheng Y, Tzeng TK. Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools. PLoS One. 2016. https://pubmed.ncbi.nlm.nih.gov/27861637/

[9] Cánovas R, Moffat A, Turpin A. CSAM: Compressed SAM format. Bioinformatics. 2016. https://pubmed.ncbi.nlm.nih.gov/27540265/

[10] Barnett DW, Garrison EK, Quinlan A, et al. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 2011. https://www.semanticscholar.org/paper/d971173f9debd66e2b5e099739f72fd9cdc7dae1

[11] Bonfield JK. The Scramble conversion tool. Bioinformatics. 2014. https://pubmed.ncbi.nlm.nih.gov/24930138/

[12] Lan D, Llamas B. Genozip 14 - advances in compression of BAM and CRAM files. bioRxiv. 2022. https://www.semanticscholar.org/paper/cc0819a8fae1e51062cd9f27627b4b6f8996d6af

[13] Kaisers W, Schaal H, Schwender H. rbamtools: an R interface to samtools enabling fast accumulative tabulation of splicing events over multiple RNA-seq samples. Bioinformatics. 2015. https://pubmed.ncbi.nlm.nih.gov/25563331/

[14] Li P, Ji G, Dong M, et al. CBrowse: a SAM/BAM-based contig browser for transcriptome assembly visualization and analysis. Bioinformatics. 2012. https://pubmed.ncbi.nlm.nih.gov/22789590/

[15] Cock P, Bonfield J, Chevreux B, et al. SAM/BAM format v1.5 extensions for de novo assemblies. bioRxiv. 2015. https://www.semanticscholar.org/paper/39ca604a3a3da0ab5d6ae5ba3f7f2bddb8b8670d

[16] Borry M, Hübner A, Warinner C. sam2lca: Lowest Common Ancestor for SAM/BAM/CRAM alignment files. Journal of Open Source Software. 2022. https://www.semanticscholar.org/paper/b1f2a23a1e23f3a1d2984597f808eb07fbad418c

[17] Bonfield JK, Mahoney MV. Compression of FASTQ and SAM format sequencing data. PLoS One. 2013. https://pubmed.ncbi.nlm.nih.gov/23533605/

[18] Ochoa I, Li H, Baumgarte F, et al. AliCo: A New Efficient Representation for SAM Files. Data Compression Conference. 2019. https://www.semanticscholar.org/paper/79bfa4b0581b58e12207f8455816e3479fbfe583

[19] Kortschak R, Pedersen BS, Adelson D. bíogo/hts: high throughput sequence handling for the Go language. Journal of Open Source Software. 2017. https://www.semanticscholar.org/paper/b6766266d3e3f03152b2500b77388fe8411cea41

[20] Wala J, Beroukhim R. SeqLib: a C++ API for rapid BAM manipulation, sequence alignment and sequence assembly. Bioinformatics. 2016. https://www.semanticscholar.org/paper/ad2a1b0600f6bb793bb77092a85195ba28e3890b8

[21] Yan L, Zhao Z, Yin Z, et al. RabbitBAM: Accelerating BAM File Manipulation on Multi-Core Platforms. IEEE Transactions on Computational Biology and Bioinformatics. 2025. https://www.semanticscholar.org/paper/8d75e1c674c099107c111bc5a563ab4e3e16e045

[22] Cánovas R, Moffat A, Turpin A. CSAM: Compressed SAM Format. 2016. https://www.semanticscholar.org/paper/d39c6bb8dbabd4aed46f1e572478912ff149eea5

[23] Tarasov A, Vilella AJ, Cuppen E, et al. Sambamba: fast processing of NGS alignment formats. Bioinformatics. 2015. https://pubmed.ncbi.nlm.nih.gov/25697820/

[24] Faust G, Hall IM. SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics. 2014. https://www.semanticscholar.org/paper/5565e44b8a7d3feefd1d40e6b028a1f5a17ca121

[25] Ramirez-Gonzalez RH, Bonnal R, Caccamo M, et al. Bio-samtools: Ruby bindings for SAMtools, a library for accessing BAM files containing high-throughput sequence alignments. Source Code Biol Med. 2012. https://pubmed.ncbi.nlm.nih.gov/22640879/

[26] García-Alcalde F, Okonechnikov K, Carbonell J, et al. Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics. 2012. https://pubmed.ncbi.nlm.nih.gov/22914218/

[27] Finney RP, Chen QR, Nguyen CV, et al. Alview: Portable Software for Viewing Sequence Reads in BAM Formatted Files. Cancer Inform. 2015. https://pubmed.ncbi.nlm.nih.gov/26417198/

[28] Laczik M, Tukacs E, Uzonyi B, et al. Geno viewer, a SAM/BAM viewer tool. Bioinformation. 2012. https://www.semanticscholar.org/paper/d67c3076c4924c21e7172f32a7fcd223a6da019b

[29] Edmonson MN, Zhang J, Yan C, et al. Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format. Bioinformatics. 2011. https://pubmed.ncbi.nlm.nih.gov/21278191/

[30] Xu G, Deng N, Zhao Z, et al. SAMMate: a GUI tool for processing short read alignments in SAM/BAM format. Source Code Biol Med. 2011. https://pubmed.ncbi.nlm.nih.gov/21232146/

[31] Birolo G, Telatin A. covtobed: a simple and fast tool to extract coverage tracks from BAM files. Journal of Open Source Software. 2020. https://www.semanticscholar.org/paper/ea36554ff46797e23c3d5c93ff3137ee59b6ec8c

[32] Niemenmaa M, Kallio A, Schumacher A, et al. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics. 2012. https://www.semanticscholar.org/paper/4168105671e8c63a956602ea65e659d1f9999081

[33] Takeuchi T, Yamada A, Aoki T, et al. cljam: a library for handling DNA sequence alignment/map (SAM) with parallel processing. Source Code Biol Med. 2016. https://pubmed.ncbi.nlm.nih.gov/27536334/

[34] Francesco F, Aless P, Ra et al. SAM-Profiler: A Graphical Tool for Qualitative Profiling of Next Generation Sequencing Alignment Data. 2013. https://www.semanticscholar.org/paper/52d2ac657c0ac20b591d99dbd818bfb0d9ca079e