What is Dr. Zubair Khalid's research focus?

Dr. Zubair Khalid specializes in molecular virology, mRNA vaccine development, and computational biology, with a focus on avian pathogens like IBDV and Avian Reovirus.

Where is Dr. Zubair Khalid currently working?

Dr. Zubair Khalid is a Postdoctoral Research Associate at the University of Maryland (UMD), specifically within the Department of Animal and Avian Sciences.

Metagenomics Taxonomic Classification: Kraken2 and Functional Annotation Pipelines

Introduction

Metagenomic sequencing enables the unbiased characterization of microbial communities directly from environmental, clinical, or veterinary specimens without prior culture [1, 2, 3]. In veterinary diagnostics, metagenomics has become essential for detecting novel pathogens, monitoring antimicrobial resistance genes, and profiling the microbiota of production animals [4, 5, 35]. The core computational challenge is twofold: accurately classifying sequencing reads taxonomically, and then assigning functional annotations to the genes of the detected microbes [6, 7, 33]. Among the many classifiers developed, Kraken2 has emerged as a widely used tool owing to its speed, precision, and low memory footprint [8, 9, 10]. This article reviews Kraken2, its algorithmic foundations, comparative performance benchmarks, and its role within broader functional annotation pipelines relevant to veterinary science.

Kraken2: Algorithmic Architecture and Database Design

Kraken2 performs taxonomic classification by exact matching of k-mers from sequencing reads against a reference database of genomes [8, 11]. The database is constructed by decomposing all reference genomes into overlapping k-mers (default k=35) and mapping each k-mer, stored compactly via its minimizer, to the lowest common ancestor (LCA) of all genomes containing it [9, 12]. During classification, each read is split into its constituent k-mers, each matching k-mer contributes a hit to its LCA taxon, and the read is assigned to the leaf of the root-to-leaf path in the taxonomy that accumulates the most hits [2, 13]. Kraken2 also offers a confidence threshold that reduces false positives by requiring a minimum fraction of a read's k-mers to support the assigned taxon, a feature critical for high-precision analyses [8, 14].
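The k-mer-to-LCA idea can be illustrated with a toy sketch. It is not Kraken2's implementation: it uses a plain Python dictionary instead of a minimizer-based compact hash table, a made-up five-node taxonomy, k=4 instead of 35, and it breaks ties arbitrarily. All taxon IDs, sequences, and function names are hypothetical.

```python
from collections import Counter

# Toy taxonomy (child -> parent); taxon 1 is the root. IDs are made up.
PARENT = {1: 1, 10: 1, 11: 10, 12: 10, 20: 1}

def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = [taxon]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path

def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    return next(t for t in lineage(b) if t in ancestors)

def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_db(genomes, k):
    """Map every k-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db

def classify(read, db, k, confidence=0.0):
    """Assign a read to the taxon whose root-to-leaf path has the most hits,
    then climb the tree until the clade holds >= `confidence` of all k-mers.
    Returns 0 for an unclassified read."""
    read_kmers = kmers(read, k)
    hits = Counter(db[km] for km in read_kmers if km in db)
    if not hits:
        return 0

    def path_score(taxon):
        return sum(hits[a] for a in lineage(taxon))

    def clade_fraction(taxon):
        in_clade = sum(n for hit, n in hits.items() if taxon in lineage(hit))
        return in_clade / len(read_kmers)

    best = max(hits, key=path_score)  # ties are broken arbitrarily here
    while clade_fraction(best) < confidence and best != 1:
        best = PARENT[best]
    return best if clade_fraction(best) >= confidence else 0

GENOMES = {11: "ACGTACGGTCA", 12: "ACGTTTGACCA", 20: "GGGCCCATATA"}
DB = build_db(GENOMES, k=4)

print(classify("ACGTACGG", DB, k=4))                  # species-level call
print(classify("ACGTACGG", DB, k=4, confidence=0.9))  # backs off to the parent
```

In this toy data the k-mer ACGT occurs in both genomes 11 and 12, so it maps to their parent (taxon 10). That shared k-mer is why a strict confidence threshold pushes the read's label up one rank.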

The primary advantage of Kraken2 is its computational efficiency. The k-mer database is stored in a compact hash table, allowing classification at rates exceeding millions of reads per minute [4, 10]. However, the fixed k-mer length can limit sensitivity for evolutionarily divergent species [1, 15]. A separate limitation is that reads matching several closely related genomes are assigned at genus level or above, which distorts species-level abundance estimates. To address this, post-processing tools such as Bracken (Bayesian Reestimation of Abundance with KrakEN) recompute abundance estimates by redistributing reads assigned at higher taxonomic levels down to the species (or another chosen) level [10, 32]. Several studies have refined filtering criteria for Kraken family tools, particularly in challenging contexts such as ancient metagenomes or samples with high host DNA content [8, 34].
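The redistribution step can be pictured with a deliberately simplified sketch. Bracken itself estimates, from the reference database, the probability that a read from each genome is classified at each taxonomic level. The version below instead splits genus-level reads in proportion to the reads each species already received directly. That is an assumption made for illustration only, and the function name and counts are hypothetical.

```python
def redistribute(species_reads, genus_reads, species_to_genus):
    """Push genus-level reads down to species, proportionally to the reads
    each species received directly (a simplification, not Bracken's model)."""
    estimates = {species: float(n) for species, n in species_reads.items()}
    for genus, n_genus in genus_reads.items():
        members = [s for s, g in species_to_genus.items()
                   if g == genus and species_reads.get(s, 0) > 0]
        total = sum(species_reads[s] for s in members)
        if total == 0:
            continue  # no species evidence to anchor on; leave reads at genus
        for species in members:
            estimates[species] += n_genus * species_reads[species] / total
    return estimates

species_reads = {"Escherichia coli": 600,
                 "Escherichia fergusonii": 200,
                 "Salmonella enterica": 100}
genus_reads = {"Escherichia": 400}
species_to_genus = {"Escherichia coli": "Escherichia",
                    "Escherichia fergusonii": "Escherichia",
                    "Salmonella enterica": "Salmonella"}

estimates = redistribute(species_reads, genus_reads, species_to_genus)
print(estimates)
```

Here the 400 genus-level Escherichia reads are split 3:1 between the two Escherichia species, while Salmonella enterica is unchanged.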

Comparative Performance of Taxonomic Classifiers

Numerous benchmarking studies have evaluated Kraken2 against other classifiers using mock communities and simulated datasets [16, 17, 18, 4]. The general consensus is that no single classifier outperforms all others across every metric; choice depends on the application scope (e.g., short‑read vs. long‑read, bacterial vs. viral targets) [16, 2, 18].
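Benchmarks of this kind typically compare the taxa a classifier reports against the known composition of a mock community. A minimal sketch of detection-level precision, recall, and F1 follows. The species lists are invented for illustration and the function name is hypothetical.

```python
def detection_metrics(expected, observed):
    """Precision, recall and F1 for presence/absence of taxa."""
    expected, observed = set(expected), set(observed)
    true_positives = len(expected & observed)
    precision = true_positives / len(observed) if observed else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical mock community and classifier output
mock = ["Escherichia coli", "Salmonella enterica",
        "Pasteurella multocida", "Mycoplasma gallisepticum"]
reported = ["Escherichia coli", "Salmonella enterica", "Pasteurella multocida",
            "Shigella flexneri", "Bacillus subtilis"]

precision, recall, f1 = detection_metrics(mock, reported)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

These metrics ignore abundance. Benchmarks that score abundance estimates need an additional distance measure between expected and observed profiles.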

Table 1: Selected taxonomic classifiers and their key features

Classifier	Algorithm Type	Input	Key Strength	Reference
Kraken2	k-mer exact matching	Short reads	Speed, low memory	[8, 9]
Kaiju	Protein-level BWT	Short reads	Sensitivity for divergent taxa	[1, 19, 15]
Centrifuge	FM-index (genomic)	Short reads	Memory efficiency	[9, 2]
MetaMaps	Long-read mapping	Long reads	Species-level precision	[18, 20]
CAT/BAT	LCA on contigs/bins	Assembled contigs	Robustness to novel lineages	[21]
EURYALE	Nextflow pipeline	Short reads	Modular functional annotation	[22]
KAPTAIN	KMA-based (long-read)	Long reads	High precision filtering	[14]

For short-read Illumina data, Kraken2 combined with Bracken provides high recall and competitive precision [17, 10]. In a benchmark using 19 mock communities, bioBakery4 (which includes MetaPhlAn) achieved the best overall accuracy, while pipelines like JAMS and WGSA2 offered higher sensitivity [17]. For long-read Oxford Nanopore data, classifiers designed specifically for longer fragments (e.g., KAPTAIN, MetaMaps, BugSeq) often outperform short-read classifiers, especially when reads exceed 2 kb [16, 14, 18]. Kraken2 can still be applied to long reads but may exhibit lower precision without aggressive abundance filtering [16, 14].

Protein-level classifiers such as Kaiju translate reads into amino acid sequences and search them against a protein reference database indexed with the Burrows‑Wheeler transform, looking for maximum exact matches; because protein sequences are more conserved than nucleotide sequences, this enables detection of distant homologs [1, 15]. The Core‑Kaiju extension further improves abundance estimation by focusing on core protein domain families [19]. Deep learning approaches, including DeepMicrobes and Phylo‑HS, are gaining traction for their potential to incorporate phylogenetic structure and reduce memory footprints through hierarchical softmax outputs [6, 23, 7]. However, these methods require extensive training and may not yet match reference‑based tools for species‑level resolution [13, 24].
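Protein-level classification starts by turning each nucleotide read into candidate peptides. The sketch below shows a six-frame translation with the standard genetic code, split at stop codons. It illustrates only this preprocessing idea, not Kaiju's code, and the `min_len` cutoff is an assumed parameter rather than a tool default.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate a nucleotide string codon by codon; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_fragments(read, min_len=3):
    """Translate both strands in three frames each and split at stop codons."""
    read = read.upper()
    reverse_complement = read.translate(COMPLEMENT)[::-1]
    fragments = []
    for strand in (read, reverse_complement):
        for offset in range(3):
            protein = translate(strand[offset:])
            fragments += [f for f in protein.split("*") if len(f) >= min_len]
    return fragments

fragments = six_frame_fragments("ATGGCCAAATAA")
print(fragments)
```

Each fragment would then be searched against the protein index, and the longest or best-scoring match determines the taxon assigned to the read.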

Functional Annotation Pipelines

Beyond taxonomic classification, metagenomic analysis requires linking microbial identities to biological functions. Functional annotation pipelines integrate gene prediction, orthology assignment, and pathway mapping [22, 21, 25]. Several comprehensive workflows have been developed.

EURYALE is a Nextflow pipeline that inherits sensitive taxonomic classification from MEDUSA and couples it with flexible functional annotation using tools like DIAMOND and HMMER [22]. It supports Docker and Singularity containers, making it deployable on high‑performance computing clusters or cloud platforms [22]. The modular architecture allows users to customize reference databases and parameters for veterinary applications, such as detecting antimicrobial resistance genes or virulence factors.

CAT and BAT (Contig Annotation Tool / Bin Annotation Tool) provide robust classification of assembled contigs and metagenome‑assembled genomes by integrating the taxonomic signal of multiple open reading frames (ORFs) predicted on each sequence, each searched against a protein reference database [21]. CAT/BAT automatically assign taxonomy at the appropriate rank, preventing over‑classification of novel lineages [21]. This is invaluable for veterinary metagenomics, where many unknown environmental organisms may be encountered.
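Rank-aware assignment can be sketched as a weighted vote: each ORF on a contig supports a lineage with its alignment bit-score, and the contig receives the deepest taxon whose summed support exceeds a chosen fraction of the total. The code below is a simplified illustration of that idea under assumed inputs, not the CAT/BAT implementation. The lineages are abbreviated, and the scores and `min_support` value are hypothetical.

```python
from collections import defaultdict

def assign_contig(orf_hits, min_support=0.5):
    """orf_hits: list of (lineage, bit_score), lineage ordered root -> leaf.
    Returns the deepest taxon supported by more than `min_support` of the
    total bit-score, or None if nothing passes."""
    total = sum(score for _, score in orf_hits)
    if total == 0:
        return None
    support = defaultdict(float)
    for lineage, score in orf_hits:
        for depth, taxon in enumerate(lineage):
            support[(depth, taxon)] += score  # an ORF supports all its ancestors
    passing = [key for key, s in support.items() if s / total > min_support]
    # With min_support >= 0.5 at most one taxon can pass at each depth
    return max(passing)[1] if passing else None

family = ["Bacteria", "Pseudomonadota", "Pasteurellaceae"]
orf_hits = [(family + ["Pasteurella"], 300.0),
            (family + ["Avibacterium"], 250.0),
            (family, 150.0)]

print(assign_contig(orf_hits))                   # stops at family level
print(assign_contig(orf_hits, min_support=0.4))  # a looser cutoff goes deeper
```

With conflicting genus-level signals the contig is conservatively placed at family level, which is the behavior that guards against over-classifying novel lineages.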

VIRify specializes in virus detection and taxonomic classification using profile hidden Markov models (HMMs) tailored to viral protein families [25]. VIRify identifies viral contigs and prophages, then annotates them with functional HMMs, achieving >95% accuracy at genus‑to‑family levels in mock communities [25]. For livestock viruses, this pipeline can complement broader metagenomic efforts.

KAPTAIN is a KMA‑based pipeline optimized for Nanopore data that applies stringent post‑classification filtering to achieve median precision of 95% while maintaining recall above 90% [14]. The pipeline explicitly models the effect of sequencing yield on limit of detection, providing guidance for minimum sequencing depth [14]. Such pipelines are directly applicable to veterinary diagnostics where rapid, portable sequencing is increasingly used.
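Post-classification filtering of this kind can be as simple as discarding taxa supported by too few reads or too small a share of the sample. The sketch below is a generic illustration. The thresholds, counts, and function name are assumptions and do not reproduce KAPTAIN's actual criteria.

```python
def filter_by_abundance(read_counts, min_fraction=0.01, min_reads=10):
    """Keep taxa with at least `min_reads` reads AND at least `min_fraction`
    of all classified reads. Both cutoffs are illustrative values."""
    total = sum(read_counts.values())
    if total == 0:
        return {}
    return {taxon: n for taxon, n in read_counts.items()
            if n >= min_reads and n / total >= min_fraction}

# Hypothetical per-taxon read counts from a respiratory swab
counts = {"Mycoplasma gallisepticum": 4200,
          "Escherichia coli": 750,
          "Bacillus subtilis": 45,
          "Vibrio cholerae": 5}

kept = filter_by_abundance(counts)
print(sorted(kept))
```

Stricter cutoffs raise precision at the cost of recall for low-abundance organisms, so thresholds are best calibrated on mock communities sequenced at a comparable yield.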

Table 2: Functional annotation pipelines and their target applications

Pipeline	Base Classifier	Annotation Modules	Veterinary Relevance	Reference
EURYALE	Kraken2 or Kaiju	DIAMOND, HMMER	Modular, customizable	[22]
CAT/BAT	DIAMOND/BLAST	LCA + binning	Robust to novel taxa	[21]
VIRify	Custom HMMs	Viral protein HMMs	Virus‑specific detection	[25]
KAPTAIN	KMA	Post‑filtering	Long‑read, high precision	[14]
RIEMS	Cascading BLAST	Multi‑tool	Comprehensive read assignment	[5]
META²	Deep learning	Multiple instance learning (MIL) for abundance	Memory‑efficient	[7]

A key component of functional annotation is the assignment of reads or contigs to metabolic pathways using databases such as KEGG, MetaCyc, or Carbohydrate‑Active Enzymes (CAZy) [22, 33, 35]. The integration of taxonomic and functional profiles allows researchers to infer community dynamics and pathogenic potential in livestock populations [4, 10]. For instance, detecting Pasteurella multocida alongside specific virulence genes can aid in diagnosing fowl cholera [26, 12].

Workflow Integration: From Raw Reads to Annotations

The following Mermaid diagram outlines a typical metagenomics analysis pipeline combining taxonomic classification and functional annotation.

flowchart TD
    A[Raw Sequencing Reads] --> B[Quality Control & Trimming]
    B --> C{Host Read Removal?}
    C -->|Yes| D[Map to Host Genome]
    D --> E[Filtered Non-Host Reads]
    C -->|No| E
    E --> F[Taxonomic Classification]
    F --> G[Kraken2 / Kaiju / Others]
    G --> H["Abundance Estimation<br>(Bracken / custom)"]
    E --> I["Assembly (metaSPAdes, MEGAHIT)"]
    I --> J[Contig Binning]
    J --> K["Taxonomic Annotation of Bins<br>(CAT / BAT)"]
    E --> L["Gene Prediction (Prodigal)"]
    L --> M["Functional Annotation<br>(DIAMOND / HMMER)"]
    H --> N[Integrated Report]
    K --> N
    M --> N
    N --> O[Veterinary Interpretation]

The workflow begins with quality control (e.g., FastQC, Trimmomatic) to remove adapter sequences and low‑quality bases [5, 35]. Host reads may be subtracted using alignment to the reference genome of the animal species under study [17, 4]. The remaining non‑host reads are then processed through two parallel arms: taxonomic classification (often with Kraken2) and metagenomic assembly. Assembly algorithms such as metaSPAdes or MEGAHIT generate contigs that can be binned into metagenome‑assembled genomes (MAGs) [21, 27]. Each MAG can be taxonomically classified using CAT/BAT or similar tools [21, 28]. Simultaneously, gene prediction on contigs yields protein sequences that are searched against functional databases for enzyme commission numbers, KEGG orthologs, or Pfam domains [22, 25]. All results are aggregated into a unified report for clinical interpretation.
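For the taxonomic arm, the two central steps can be scripted as command lines. The sketch below only assembles the commands and does not run them. The database path, file names, thread count, and confidence value are placeholders, and the Bracken step assumes the database was prepared for the given read length.

```python
import shlex

def kraken2_command(db, reads_1, reads_2, prefix, threads=8, confidence=0.1):
    """Build a paired-end Kraken2 call that writes a report and per-read output."""
    return ["kraken2", "--db", db, "--threads", str(threads),
            "--confidence", str(confidence), "--paired",
            "--report", f"{prefix}.kreport",
            "--output", f"{prefix}.kraken",
            reads_1, reads_2]

def bracken_command(db, prefix, read_len=150, level="S"):
    """Build a Bracken call that re-estimates abundance at the given level."""
    return ["bracken", "-d", db, "-i", f"{prefix}.kreport",
            "-o", f"{prefix}.bracken", "-r", str(read_len), "-l", level]

# Placeholder paths: substitute real locations before use
kraken_cmd = kraken2_command("/path/to/kraken2_db", "sample_R1.fastq",
                             "sample_R2.fastq", "sample")
bracken_cmd = bracken_command("/path/to/kraken2_db", "sample")

print(shlex.join(kraken_cmd))
print(shlex.join(bracken_cmd))
```

The argument lists can be passed to `subprocess.run(cmd, check=True)` once the tools are installed. Keeping them as lists rather than shell strings avoids quoting problems with unusual file names.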

Taxonomic Classification in the Context of Veterinary Metagenomics

Veterinary metagenomics poses unique challenges: the presence of high host DNA (e.g., from blood or tissue), variable sample quality, and the need to differentiate commensals from pathogens [16, 10, 5]. Kraken2’s speed enables rapid screening, but careful database curation is essential. Using a database that includes both pathogenic and non‑pathogenic strains of livestock‑associated species (e.g., Mycoplasma gallisepticum, Avibacterium paragallinarum) improves specificity [3, 35]. Additionally, machine learning and semi‑supervised methods, such as incremental VSEARCH and Taxometer, can refine classification by incorporating abundance profiles and tetra‑nucleotide frequencies, especially for contig‑level assignments [9, 11, 29].
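Tetra-nucleotide frequencies are a compositional fingerprint of a contig: the relative frequency of each of the 256 possible 4-mers. A minimal sketch of how such a profile could be computed follows. Production tools often also merge reverse-complement tetramers, which is omitted here, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetranucleotide_frequencies(contig):
    """Return a 256-element frequency vector, in TETRAMERS order.
    Windows containing ambiguous bases (e.g. N) are ignored."""
    seq = contig.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]

profile = tetranucleotide_frequencies("ACGTACGT")
print(len(profile), profile[TETRAMERS.index("ACGT")])
```

Profiles like this, together with per-sample abundance, are the kind of features a contig-level refinement model can use alongside the initial taxonomic labels.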

Long‑read sequencing technologies (e.g., Oxford Nanopore, PacBio) are increasingly applied to veterinary samples due to their portability and ability to resolve complex genomic regions [16, 18, 20]. However, higher error rates demand dedicated classifiers or post‑processing filters. KAPTAIN, for example, demonstrated that filtering based on relative abundance and read coverage can push precision to >99% for Nanopore data while maintaining acceptable recall [14]. Similarly, taxMaps offers a sensitive yet computationally efficient alternative for short reads by using spaced‑seed hashing [32].

Functional annotation pipelines must also account for viral RNA genomes and single‑stranded DNA viruses, which are abundant in veterinary samples [25, 31]. The sequence‑based classification framework proposed for marine picornaviruses [31] can be adapted for animal viruses, providing a model for taxonomic assignment of uncultured viral sequences based on RNA‑dependent RNA polymerase phylogeny.

Conclusion

The selection of a taxonomic classifier and functional annotation pipeline depends on the specific veterinary question, sequencing technology, and available computational resources. Kraken2 remains a robust short‑read classifier, especially when paired with abundance estimation tools and confidence filtering. For long‑read applications, dedicated pipelines like KAPTAIN offer superior precision, while CAT/BAT is suited to assembled contigs and bins. Comprehensive pipelines such as EURYALE and VIRify integrate both classification and annotation in modular, reproducible frameworks. Future advances in deep learning and incremental learning promise to further improve accuracy and scalability [6, 23, 7, 29]. Adherence to benchmarking standards and the use of mock communities will remain essential for validating these tools in veterinary contexts [16, 17, 4, 35].

Table 3: Recommendations for veterinary metagenomics studies

Sample Type	Sequencing Platform	Recommended Classifier	Recommended Pipeline	Key Considerations
Fecal microbiome	Short-read (Illumina)	Kraken2 + Bracken	EURYALE	Database must include livestock‑associated taxa
Respiratory swab	Long-read (Nanopore)	KAPTAIN or MetaMaps	KAPTAIN	Minimum yield 500M bases for reliable LOD
Viral enrichment	Short-read	Kaiju (protein‑level)	VIRify	Use viral‑specific HMMs
Soil/Environmental	Short- or long-read	CAT/BAT + Taxometer	Custom Nextflow	Novel taxa expected; avoid over‑classification

In summary, the integration of Kraken2 with well‑designed functional annotation pipelines provides a powerful framework for veterinary metagenomic analysis, enabling both the identification of microbial taxa and the characterization of functional gene repertoires relevant to animal health.

References

[1] Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications. 2016. URL: https://www.semanticscholar.org/paper/7cd164015a1f98d243ce3ab25c7db28b85cb64bd

[2] Ye S, Siddle K, Park DJ, et al. Benchmarking Metagenomics Tools for Taxonomic Classification. Cell. 2019. URL: https://www.semanticscholar.org/paper/873d2ae67ee0db53e5a671de22a7e8cf28ef2ce3

[3] Khachatryan L, de Leeuw RH, Kraakman M, et al. Taxonomic classification and abundance estimation using 16S and WGS-A comparison using controlled reference samples. Forensic Science International: Genetics. 2020. URL: https://www.semanticscholar.org/paper/773ea5409b8277bd53cd0628ef007b6168fcae56

[4] Martins I, Silva JM, Almeida J. A systematic review and benchmarking of modern metagenomic tools for taxonomic classification. Comput. Biol. Medicine. 2026. URL: https://www.semanticscholar.org/paper/0225e440fea6df0b064e0892098f3076efb22807

[5] Scheuch M, Höper D, Beer M. RIEMS: a software pipeline for sensitive and comprehensive taxonomic classification of reads from metagenomics datasets. BMC Bioinformatics. 2015. URL: https://www.semanticscholar.org/paper/035dfb6e2398c95bbf6b65db97b443908a43e093

[6] Liang Q, Bible P, Liu Y, et al. DeepMicrobes: taxonomic classification for metagenomics with deep learning. bioRxiv. 2019. URL: https://www.semanticscholar.org/paper/81eb1671e74ba01d0c14576395d20a1e6b7163d8

[7] Georgiou A, Fortuin V, Mustafa H, et al. META²: Memory-efficient taxonomic classification and abundance estimation for metagenomics with deep learning. Journal. 2019. URL: https://www.semanticscholar.org/paper/93611c852865e1a0584270f35e6a13c7b3894f56

[8] Oskolkov N. Refining filtering criteria of Kraken family of tools for accurate taxonomic profiling of ancient metagenomic data. Front Microbiol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42232910/

[9] Kutuzova S, Nielsen M, Piera P, et al. Taxometer: Improving taxonomic classification of metagenomics contigs. bioRxiv. 2023. URL: https://www.semanticscholar.org/paper/000d16521d0192ad2bb1c7311672ccc4e838cf23

[10] Buffet-Bataillon S, Rizk G, Cattoir V, et al. Efficient and Quality-Optimized Metagenomic Pipeline Designed for Taxonomic Classification in Routine Microbiological Clinical Tests. Microorganisms. 2022. URL: https://www.semanticscholar.org/paper/e21a37d1e231d95e339f04329a3785b5eb569929

[11] Fasino A, Ozdogan E, Sokhansanj B, et al. Semi-Supervised and Incremental Sequence Analysis for Taxonomic Classification. IEEE Symposium Series on Computational Intelligence. 2023. URL: https://www.semanticscholar.org/paper/6842b83036fc7d1220eafacacd05e3f6e131d2cc

[12] Hou T, Liu F, Liu Y, et al. Classification of Metagenomics Data at Lower Taxonomic Level Using a Robust Supervised Classifier. Evolutionary bioinformatics online. 2015. URL: https://www.semanticscholar.org/paper/aa9e8ddfbdf017c171ab0afa69155fa58cbe6218

[13] Verma B, Parkinson J. HiTaxon: a hierarchical ensemble framework for taxonomic classification of short reads. bioRxiv. 2023. URL: https://www.semanticscholar.org/paper/5ff8dfa6c3259dd38f8d5b460b14f0d0fece980b

[14] Van Uffelen A, Gobbo A, Fraiture M-A, et al. Filtering for truth: high-precision taxonomic classification in nanopore shotgun metagenomics data through a KMA-based bioinformatic pipeline (KAPTAIN). BMC Genomics. 2026. URL: https://www.semanticscholar.org/paper/b23dd4d1b32a3a5ff7e96d084990fa85e9e6fe97

[15] Menzel P, Ng KL, Krogh A. Kaiju: Fast and sensitive taxonomic classification for metagenomics. bioRxiv. 2015. URL: https://www.semanticscholar.org/paper/ca426a006dc2af4ae5214b4d95c49df3ce139300

[16] Van Uffelen A, Posadas A, Roosens N, et al. Benchmarking bacterial taxonomic classification using nanopore metagenomics data of several mock communities. Scientific Data. 2024. URL: https://www.semanticscholar.org/paper/4f69b9ff419e535a961903bbd064e51c25abe8be

[17] Valencia E, Maki K, Dootz JN, et al. Mock community taxonomic classification performance of publicly available shotgun metagenomics pipelines. Scientific Data. 2024. URL: https://www.semanticscholar.org/paper/b9b2f90201e95eb4e46bfd2aabc6448164a0a925

[18] Portik DM, Brown CT, et al. Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets. bioRxiv. 2022. URL: https://www.semanticscholar.org/paper/7303282bef732e56bb700874b51e2ba30f50d550

[19] Tovo A, Menzel P, Krogh A, et al. Taxonomic classification method for metagenomics based on core protein families with Core-Kaiju. bioRxiv. 2020. URL: https://www.semanticscholar.org/paper/e43838c5fa38ead2f4088ff3d788ea5673c3b88a

[20] Friganović K, Stanojević D, Chen PB, et al. Metaxa: A Transformer-Based Deep Learning Model for Taxonomic Classification of Long Nanopore Reads. bioRxiv. 2026. URL: https://www.semanticscholar.org/paper/8ccd80de28ef533e88a6eb3ac8624b30d095351e

[21] van Meijenfeldt FV, Arkhipova K, Cambuy DD, et al. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biology. 2019. URL: https://www.semanticscholar.org/paper/8f975cffe4f396c7905d095cddb0472e3ac331b7

[22] Cavalcante JV, Souza IDD, Morais DAA, et al. EURYALE: A versatile Nextflow pipeline for taxonomic classification and functional annotation of metagenomics data. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology. 2024. URL: https://www.semanticscholar.org/paper/ed1ae3b1d9f0f38e17a0259c62f2fe5a0583ed2e

[23] Menegaux R. Phylo-HS: A phylogenetic hierarchical softmax for taxonomic classification. bioRxiv. 2025. URL: https://www.semanticscholar.org/paper/d93cff70862312b89eb43da2d6949ab5188f49a3

[24] Rakshitha A, Priyanka P S, Navya Shree M U, et al. Toward Interpretable Metagenomic Analysis: A Compositionally-Aware Explainable AI Pipeline for Taxonomic Classification and Functional Prediction. International Journal of Creative and Open Research in Engineering and Management. 2026. URL: https://www.semanticscholar.org/paper/f816f710514b669bc3d30631c31f662dde994d17

[25] Rangel-Pineros G, Almeida A, Beracochea M, et al. VIRify: An integrated detection, annotation and taxonomic classification pipeline using virus-specific protein profile hidden Markov models. bioRxiv. 2022. URL: https://www.semanticscholar.org/paper/8e105f5432de2df5d759a81a7f6c65aab1fe08c9

[26] Altin Karagöz M, Nalbantoglu ÖU. Taxonomic classification of metagenomic sequences from Relative Abundance Index profiles using deep learning. Biomedical Signal Processing and Control. 2021. URL: https://www.semanticscholar.org/paper/0b6ef0db81a6f9198a1ae4c4f70816a4373e5ed7

[27] Satti M, Cai Z. MetaMAG Explorer: A Database-Augmenting Pipeline for Genome-Resolved Metagenomics and Enhanced Microbial Classification. bioRxiv. 2026. URL: https://www.semanticscholar.org/paper/a622f1dc0d04600fda040503bd5af12858e91692

[28] Kahlke T, Ralph P. BASTA – Taxonomic classification of sequences and sequence bins using last common ancestor estimations. Methods in Ecology and Evolution. 2018. URL: https://www.semanticscholar.org/paper/a066cd

[29] Ozdogan E, Sabin NC, Gracie T, et al. Incremental and Semi-Supervised Learning of 16S-rRNA Genes For Taxonomic Classification. IEEE Symposium Series on Computational Intelligence. 2021. URL: https://www.semanticscholar.org/paper/5c180ac0c786ab9349014b73bc5d32432bce6fbb

[30] He Y, Du Y, Nguyen L, et al. Mitag4taxa: Extracting SSU rRNA Illumina reads from metagenomes for taxonomic classification. bioRxiv. 2026. URL: https://www.semanticscholar.org/paper/fe41ac90df6d7bec80ff896343906846a1ebb86c