What is Dr. Zubair Khalid's research focus?

Dr. Zubair Khalid specializes in molecular virology, mRNA vaccine development, and computational biology, with a focus on avian pathogens like IBDV and Avian Reovirus.

Where is Dr. Zubair Khalid currently working?

Dr. Zubair Khalid is a Postdoctoral Research Associate at the University of Maryland (UMD), specifically within the Department of Animal and Avian Sciences.

De Bruijn Graphs in Genome Assembly

1. Introduction

De Bruijn graphs are a foundational data structure in modern de novo genome assembly, transforming the computational challenge of reconstructing a genome from billions of short sequencing reads into a tractable graph traversal problem [1]. Originally developed in combinatorial mathematics, de Bruijn graphs were adapted for sequence assembly in the 1990s and have since become the dominant paradigm for short-read assemblers [2]. The central innovation is to represent reads as overlapping fixed-length substrings (k-mers) and to model the genome as a walk through a directed graph whose edges correspond to those k-mers.

In veterinary medicine, de Bruijn graph-based assemblers are routinely employed to reconstruct the genomes of emerging pathogens, including avian influenza viruses, canine coronaviruses, and feline leukemia virus, enabling rapid characterization of virulence factors and transmission patterns. The ability to perform de novo assembly without a reference genome is particularly valuable for veterinary diagnostics targeting novel or highly divergent strains. This article provides an exhaustive technical review of de Bruijn graphs in genome assembly, covering formal definitions, construction algorithms, variant graph types, error correction strategies, and applications in veterinary genomics.

2. Formal Definition and Construction

A de Bruijn graph of order k (dBG_k) constructed from a set of reads R is a directed graph G = (V, E) defined as follows:

Each vertex v in V represents a unique (k-1)-mer (a substring of length k-1) that appears in at least one read.
Each directed edge e = (u, v) represents a k-mer whose prefix of length k-1 is u and whose suffix of length k-1 is v.
Multiple edges between the same pair of vertices may exist, representing distinct occurrences of the same k-mer (or different k-mers sharing the same prefix and suffix).

Construction proceeds by enumerating all k-mers from the reads and recording each k-mer as a directed edge linking its (k-1)-length prefix and suffix vertices. The resulting graph is a compact representation that collapses redundant overlaps: each distinct k-mer is stored once, typically together with an abundance count, even if it is present in many reads [2]. The parameter k directly controls the trade-off between graph connectivity and repeat resolution: small k values produce highly connected graphs prone to collapsing true repeats, while large k values yield fragmented graphs that may fail to link low-coverage regions [3].
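The construction described above can be illustrated with a minimal Python sketch. The function name build_de_bruijn and the toy reads are illustrative assumptions, not taken from any cited assembler; edges are stored as a counter so that repeated k-mers are collapsed into one edge with a multiplicity.

```python
from collections import Counter

def build_de_bruijn(reads, k):
    """Map each (prefix, suffix) edge to the number of times its k-mer was seen."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

# Two overlapping toy reads (hypothetical data).
reads = ["ACGTC", "CGTCA"]
edges = build_de_bruijn(reads, k=3)
vertices = {v for edge in edges for v in edge}

print(len(edges))            # 4 distinct k-mers
print(edges[("CG", "GT")])   # 2, because CGT occurs in both reads
print(sorted(vertices))      # ['AC', 'CA', 'CG', 'GT', 'TC']
```

Storing a count per edge rather than one edge per occurrence is one common design choice; a multigraph representation would keep the occurrences separate.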

In practice, construction must handle sequencing errors, which generate spurious k-mers (erroneous edges) that create complex branching structures. A common filtering step retains only k-mers whose observed abundance exceeds a coverage-dependent threshold, under the assumption that true k-mers derive from the genome at higher multiplicity than error-derived k-mers [2]. The filtered graph is then compacted by merging unambiguous linear paths into unitigs (maximal non-branching sequences), substantially reducing graph size while preserving topological information [2].
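As a rough illustration of abundance filtering, the sketch below counts k-mers and keeps those at or above a fixed threshold. The names count_kmers and solid_kmers, the threshold value, and the toy reads are assumptions chosen for the example; real assemblers derive the threshold from the coverage distribution.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(counts, min_abundance):
    """Keep k-mers whose abundance is at least min_abundance."""
    return {kmer for kmer, n in counts.items() if n >= min_abundance}

# Three error-free copies of a read plus one copy with a substitution (T -> A).
reads = ["ACGTAC"] * 3 + ["ACGAAC"]
counts = count_kmers(reads, k=4)
print(sorted(solid_kmers(counts, min_abundance=2)))  # ['ACGT', 'CGTA', 'GTAC']
```

The three k-mers created by the substitution each occur once and are discarded, while the true k-mers survive.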

2.1. Eulerian Path Formulation

The genome reconstruction problem reduces to finding an Eulerian path (a walk traversing each edge exactly once) in the de Bruijn graph. Such a path exists when the graph is connected and every vertex is balanced (in-degree equals out-degree), except possibly the start and end vertices, whose in-degree and out-degree may differ by one. Pevzner et al. formalized this reduction, showing that for error-free reads the genome corresponds to a shortest superstring consistent with the read set [4]. In real datasets, the graph is neither balanced nor error-free, so assemblers employ heuristic traversal and repeat resolution strategies rather than solving the Eulerian path problem exactly [68].
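For an idealized, error-free graph, the Eulerian path can be found with Hierholzer's algorithm. The sketch below is a minimal illustration that assumes an Eulerian path exists; the function names eulerian_path and spell and the toy sequence are hypothetical.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Return the vertices of an Eulerian path; assumes one exists.

    edges is a list of (u, v) pairs; parallel edges are allowed.
    """
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one surplus outgoing edge, or anywhere for a cycle.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: emit vertex
    return path[::-1]

def spell(path):
    """Spell the sequence of a vertex path: first label plus last base of each next one."""
    return path[0] + "".join(v[-1] for v in path[1:])

genome = "ATGGCGTGCA"  # toy sequence
k = 3
edges = [(genome[i:i + k - 1], genome[i + 1:i + k]) for i in range(len(genome) - k + 1)]
path = eulerian_path(edges)
print(spell(path))
```

The spelled sequence has the same length and k-mer content as the input, but it need not be identical to it: when repeats of length k-1 are present, more than one Eulerian path can exist, which is the source of the repeat ambiguity discussed in this article.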

3. Variants of De Bruijn Graphs

3.1. Bidirected and Doubled De Bruijn Graphs

Standard de Bruijn graphs do not inherently account for the double-stranded nature of DNA; reads may originate from either strand. To handle this, assemblers construct a bidirected de Bruijn graph in which each vertex represents a (k-1)-mer and its reverse complement, or a doubled graph that symmetrizes forward and reverse strands [73]. The doubled de Bruijn graph ensures that palindromic sequences are not artificially split, though recent work has shown that unitig algorithms on bidirected graphs can produce unsafe contigs (substrings not guaranteed to appear in the true genome) under low coverage [73].
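One widely used way to treat a sequence and its reverse complement as the same object is to store only the lexicographically smaller of the two (the canonical form). The sketch below illustrates that idea only; it is not the bidirected or doubled construction of any particular assembler, and the helper names are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def canonical_kmers(seq, k):
    """Set of canonical k-mers of a sequence."""
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}

read = "ACGTT"
print(reverse_complement(read))          # AACGT
print(canonical("TTG"))                  # CAA
# A read and its reverse complement yield the same canonical k-mer set.
print(canonical_kmers(read, 3) == canonical_kmers(reverse_complement(read), 3))  # True
```

Choosing an odd length for the stored substrings avoids sequences that equal their own reverse complement, since the middle base would have to be its own complement.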

3.2. Colored De Bruijn Graphs

Colored de Bruijn graphs (cDBGs) extend the standard graph by assigning a color (or set of colors) to each k-mer, representing membership in different samples or strains [5, 6]. This variant enables simultaneous representation of multiple genomes, facilitating variant detection, pangenomics, and metagenomic analysis. For example, Cortex [5] uses cDBGs to call genetic variants across human populations, and Kleuren [44] reconstructs phylogenetic trees from whole-genome colored graphs. In veterinary diagnostics, cDBGs can represent multiple isolates of a pathogen (e.g., H5N1 avian influenza) to rapidly identify conserved and divergent genomic regions.
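The coloring idea can be sketched as a map from each k-mer to the set of samples that contain it. This is a simplified illustration, not the data structure used by Cortex or any other cited tool; the isolate names and sequences are hypothetical.

```python
from collections import defaultdict

def build_colored_kmers(samples, k):
    """Map each k-mer to the set of sample names (colors) in which it occurs."""
    colors = defaultdict(set)
    for name, sequences in samples.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                colors[seq[i:i + k]].add(name)
    return colors

# Two toy isolates that differ by a single substitution.
samples = {"isolate_A": ["ACGTAC"], "isolate_B": ["ACGAAC"]}
colors = build_colored_kmers(samples, k=3)

shared = {kmer for kmer, c in colors.items() if len(c) == len(samples)}
private_a = {kmer for kmer, c in colors.items() if c == {"isolate_A"}}
print(sorted(shared))      # ['ACG']
print(sorted(private_a))   # ['CGT', 'GTA', 'TAC']
```

k-mers carrying every color mark conserved sequence, while k-mers carrying a single color mark isolate-specific variation.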

3.3. Variable-Order de Bruijn Graphs

Fixed-order graphs require a single k value, which may be suboptimal for genomes with heterogeneous repeat structure or non-uniform coverage. Variable-order de Bruijn graphs (voDBGs) combine multiple orders into a single structure, allowing the assembler to dynamically adapt k to local sequence context [3]. Nodes in a voDBG connect k-mer labels of different lengths through contextual relationships, enabling the identification of (k, h)-tigs that are probabilistically guaranteed to be correct within a specified frequency range [3]. Diaz-Dominguez et al. [3] proved that for frequency ranges where ℓ > h/2, such tigs spell genome subsequences with high probability.

3.4. Multiplex De Bruijn Graphs

Bankevich et al. introduced multiplex de Bruijn graphs in the LJA assembler to handle long, accurate HiFi reads [7, 8]. Instead of a single k, the graph embeds multiple k-mer lengths simultaneously, with edges representing read-supported transitions between nodes. This design reduces graph tangling (branching due to repeats) while preserving contiguity. Verkko2 [9, 10] extends this concept by integrating proximity ligation data (Hi-C) with multiplex graphs to achieve telomere-to-telomere assembly of diploid genomes.

3.5. Minimizer-Space De Bruijn Graphs

Minimizer-space de Bruijn graphs (mdBG) replace nucleotide k-mers with minimizer tokens derived from a sliding window, projecting sequences into an ordered list of minimizers [11]. k-min-mers (k-mers over the minimizer alphabet) are used as graph nodes, drastically reducing memory and computational requirements. The rust-mdbg assembler [11] assembles a human genome in under 10 minutes using 8 cores and 10 GB RAM, demonstrating orders-of-magnitude improvement over conventional approaches. In metagenomic contexts, metaMDBG [12] leverages minimizer-space graphs with multi-k strategies for efficient reconstruction of bacterial, viral, and plasmid sequences from complex communities.
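The projection into minimizer space can be sketched as follows. This example uses classic sliding-window minimizers with lexicographic ordering, which is an assumption made for readability; the selection scheme and hashing used by rust-mdbg may differ. The function names and parameters l (minimizer length) and w (window size, in l-mers) are illustrative.

```python
def minimizers(seq, l, w):
    """Return the ordered list of (position, l-mer) window minimizers of seq."""
    lmers = [seq[i:i + l] for i in range(len(seq) - l + 1)]
    picked = []
    for start in range(len(lmers) - w + 1):
        window = lmers[start:start + w]
        # Leftmost smallest l-mer in the window (lexicographic order).
        offset = min(range(w), key=lambda j: window[j])
        pos = start + offset
        if not picked or picked[-1][0] != pos:
            picked.append((pos, lmers[pos]))
    return picked

def k_min_mers(tokens, k):
    """Consecutive runs of k minimizer tokens; these act as graph nodes."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

seq = "ACGTTGCA"
mins = minimizers(seq, l=2, w=3)
print(mins)   # [(0, 'AC'), (1, 'CG'), (2, 'GT'), (5, 'GC'), (6, 'CA')]
tokens = [m for _, m in mins]
print(k_min_mers(tokens, k=3))
```

On realistic reads with larger l and w, the minimizer list is much shorter than the nucleotide sequence, which is where the memory savings come from.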

4. Error Correction and Graph Simplification

Sequencing errors introduce false k-mers that create erroneous branching (tips) and bubbles in the de Bruijn graph. Error correction proceeds through multiple stages:

k-mer abundance filtering: Remove k-mers with abundance below a threshold (typically based on a Poisson model of coverage). This eliminates most singletons arising from substitution errors [2].
Tip clipping: Remove short dead-end paths (tips) that extend from the main graph, as they likely represent low-frequency errors.
Bubble merging: Identify pairs of divergent paths that rejoin (bubbles) and collapse them into a single consensus path, handling heterozygous variants or sequencing errors.
Unitig extraction: Compact linear paths into unitigs, which are sequences that are unambiguous (no branching). Unitigs form the backbone of contig assembly [80].
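Of the stages listed above, tip clipping is the simplest to sketch. The function below removes dead-end branches of at most max_tip_len edges that hang off a forking vertex. It is a deliberately simplified illustration: it handles only forward tips (dead ends with no outgoing edge), makes a single pass, and ignores coverage, whereas practical assemblers also compare the length and coverage of the competing branches. The name clip_tips and the toy graph are assumptions.

```python
from collections import defaultdict

def clip_tips(edges, max_tip_len):
    """Remove short dead-end branches (forward tips) from a set of (u, v) edges."""
    edges = set(edges)
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    sinks = [v for v in list(pred) if not succ[v]]
    for sink in sinks:
        tip, node = [], sink
        # Walk backwards along the branch until a fork or the length limit.
        while len(pred[node]) == 1 and len(tip) < max_tip_len:
            (parent,) = pred[node]
            tip.append((parent, node))
            if len(succ[parent]) > 1:   # parent is a fork: the branch is a tip
                edges.difference_update(tip)
                break
            node = parent
    return edges

# Main path A -> B -> C -> D -> E -> F with a one-edge tip C -> X.
graph = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("C", "X")]
print(sorted(clip_tips(graph, max_tip_len=2)))
```

With max_tip_len=2 only the edge C -> X is removed, because the branch C -> D -> E -> F is longer than the limit. A limit of 3 would clip both branches, which is why real implementations weigh the alternatives against each other.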

Advanced methods incorporate machine learning for error detection. GnnDebugger [13] uses graph neural networks to classify edges as correct or erroneous directly on the de Bruijn graph, outperforming heuristic thresholds on diploid genomes with coverage below 35x. Conditional random fields have also been applied to estimate node and arc multiplicities, improving repeat resolution [14].

5. Comparison with Overlap Graphs

The two principal assembly graph models are de Bruijn graphs and overlap graphs. Table 1 summarizes key differences.

| Feature | De Bruijn Graph | Overlap Graph |
| --- | --- | --- |
| Vertex representation | (k-1)-mer | Read (or read prefix) |
| Edge representation | k-mer | Overlap between read ends |
| Memory scaling | Proportional to distinct k-mers (collapsed) | Proportional to read pairs (quadratic in worst case) |
| Repeat handling | Collapses short repeats; long repeats cause tangling | Retains full read information; repeats cause ambiguity |
| Suitability | Short reads (< 150 bp) | Long reads (> 1 kbp) |
| Loss of information | Loses read-order information (fixed k) | Preserves all overlap relationships |

Composite assemblers attempt to combine both paradigms. Huang and Liao [14] integrated string and de Bruijn graphs to improve contiguity, while the hybrid approach in [27] uses de Bruijn graphs for initial quasicontig formation and overlap graphs for subsequent assembly.

6. Assembly Pipeline Using De Bruijn Graphs

A typical de novo assembly workflow is depicted in Figure 1.

```mermaid
graph TD
    A[Raw Sequencing Reads] --> B[k-mer Counting & Filtering]
    B --> C[Construct dBG]
    C --> D["Error Correction: Tip Clipping, Bubble Removal"]
    D --> E["Graph Compaction: Unitig Generation"]
    E --> F[Scaffolding using Paired-End / Long-Range Info]
    F --> G[Contig & Scaffold Output]

    B --> H[Abundance Histogram]
    H --> B

    D --> I["Iterative k-mer Selection (Multi-k / Variable Order)"]
    I --> C
```

Step 1: k-mer enumeration. All k-mers present in the read set are counted. The k-mer size is chosen based on read length and genome complexity, typically between 21 and 127 for short reads, or larger (e.g., 500-1000) for long accurate reads.

Step 2: Graph construction. The dBG is built by inserting each solid k-mer (abundance ≥ threshold) as an edge.
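One simple heuristic for choosing the solidity threshold is to locate the first valley of the k-mer abundance histogram, which separates the low-abundance error peak from the coverage peak. The sketch below assumes such a valley exists and treats its position as the threshold; the function names and the histogram values are hypothetical, and assemblers may instead fit an explicit statistical model.

```python
from collections import Counter

def abundance_histogram(counts):
    """Map each abundance value to the number of distinct k-mers having it."""
    return Counter(counts.values())

def valley_threshold(hist):
    """Return the first abundance at which the histogram stops decreasing.

    Falls back to 1 (keep everything) when no valley is found.
    """
    for abundance in range(1, max(hist)):
        if hist.get(abundance + 1, 0) > hist.get(abundance, 0):
            return abundance
    return 1

# Hypothetical histogram: error peak at 1, coverage peak near 6.
hist = Counter({1: 100, 2: 20, 3: 5, 4: 8, 5: 30, 6: 50, 7: 30})
print(valley_threshold(hist))  # 3
```

k-mers with abundance at or above the returned value would then be inserted as solid edges.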

Step 3: Graph simplification. Low-coverage tips are clipped and bubbles are merged, after which unitigs are extracted.
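Unitig extraction can be sketched as enumerating maximal non-branching paths: start at every vertex that is not one-in, one-out, and extend along each outgoing edge until another such vertex is reached. The function name unitigs and the toy edge list are assumptions; this minimal version expects distinct edges and does not report isolated cycles.

```python
from collections import defaultdict

def unitigs(edges):
    """Spell the maximal non-branching paths of a de Bruijn graph.

    edges is an iterable of distinct (u, v) pairs of (k-1)-mers.
    """
    succ, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    nodes = set(succ) | set(indeg)

    def is_simple(v):
        # Exactly one way in and one way out: interior of a unitig.
        return indeg[v] == 1 and len(succ[v]) == 1

    result = []
    for v in sorted(nodes):
        if is_simple(v):
            continue
        for w in succ[v]:
            path = [v, w]
            while is_simple(w):
                w = succ[w][0]
                path.append(w)
            result.append(path[0] + "".join(n[-1] for n in path[1:]))
    return result

# Edges of the 3-mers of the toy sequence ATGGCGTGCA.
edges = [("AT", "TG"), ("TG", "GG"), ("GG", "GC"), ("GC", "CG"),
         ("CG", "GT"), ("GT", "TG"), ("TG", "GC"), ("GC", "CA")]
print(sorted(unitigs(edges)))  # ['ATG', 'GCA', 'GCGTG', 'TGC', 'TGGC']
```

The branching vertices TG and GC split the graph into five unitigs; every edge belongs to exactly one of them.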

Step 4: Scaffolding. Paired-end or mate-pair reads provide long-range connectivity information, allowing contigs to be ordered and oriented. Linked de Bruijn graphs [15] embed this connectivity directly into the graph structure.

Step 5: Polishing and validation. Reads are mapped back to contigs to correct residual errors, and misassemblies are detected by analyzing read pair support.

7. Applications in Veterinary Genomics

De Bruijn graph assemblers are widely used to reconstruct viral and bacterial genomes from clinical and environmental samples. Key applications include:

Avian influenza virus: Assembly of H5N1/H9N2 genomes from mixed infections, enabling identification of reassortment events and phylogenetic tracking.
Canine coronavirus: Variant detection (pantropic vs. enteric) through comparative assembly of multiple isolates using colored de Bruijn graphs [5].
Feline leukemia virus: Recovery of proviral and exogenous FeLV sequences to differentiate progressive from regressive infections.
Metagenomic pathogen discovery: Mining of complex microbiomes for novel viruses and plasmids using minimizer-space dBGs [11, 12].
Antimicrobial resistance gene surveillance: Direct detection of resistance genes in de Bruijn graphs without prior assembly [13].

7.1. Example: Viral Assembly from High-Throughput Sequencing

Veterinary diagnostic laboratories increasingly employ untargeted sequencing (metagenomics) to detect unknown pathogens. De novo assembly of viral genomes from such data requires handling of low viral titers and host contamination. Variable-order and multiplex dBGs improve recovery of fragmented genomes, while error correction via de Bruijn graphs (e.g., LoRMA [16]) yields high-quality consensus sequences even from error-prone long reads.

8. Computational Considerations

Construction and traversal of de Bruijn graphs for large genomes require substantial memory. Compact representation techniques include:

Bloom filters: Probabilistic membership testing reduces memory but introduces false positives [17].
FM-index-based k-mer stores: The kFM-index uses 5 bits per vertex plus overhead, enabling human-scale graphs to fit in 1.5 GB [18].
External memory algorithms: Cuttlefish 3 [19] achieves 3-4x speedup over previous colored compressed dBG construction using parallel disk-based methods.
Streaming construction: Buffered updates allow dynamic addition and deletion of k-mers without full recomputation [29].
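To make the Bloom filter option concrete, the sketch below stores k-mers in a bit array and recovers graph edges implicitly by querying the four possible one-base extensions of a k-mer. It is a minimal illustration under stated assumptions: the class and function names, the filter size, and the use of SHA-256 as the hash family are choices made for the example, not the design of any cited tool.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def successors(bloom, kmer):
    """k-mers reachable from kmer by a one-base extension, according to the filter."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]

bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
for kmer in ["ACG", "CGT", "GTA"]:
    bloom.add(kmer)
print("ACG" in bloom)             # True
print(successors(bloom, "ACG"))   # contains 'CGT'; false positives are possible
```

Because a false positive appears as a spurious edge, implementations built on this idea usually add a step that detects or removes such edges.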

9. Open Problems and Future Directions

Despite decades of progress, several challenges remain:

Repeat resolution: Complex segmental duplications and low-complexity repeats still cause misassemblies. Supregraphs [94] offer a theoretically optimal representation by iteratively multiplexing dBG nodes to retain all read information.
Diploid/polyploid assembly: Untangling haplotype-specific paths in de Bruijn graphs requires novel approaches; Verkko2 [9] addresses this through phased scaffolding.
Graph neural networks: Deep learning methods for error correction and pathfinding are emerging but require large training datasets [13].
Pangenome graphs: Colored and compacted dBGs are being scaled to thousands of bacterial genomes for antimicrobial resistance surveillance and phylogenetic analysis [6, 11].

10. Conclusion

De Bruijn graphs remain the central computational abstraction for de novo genome assembly, offering a favorable balance of simplicity, efficiency, and scalability. Their mathematical foundation in Eulerian graph theory, combined with continuous algorithmic refinements (variable-order, multiplex, and minimizer-space variants), ensures their ongoing relevance in veterinary genomics and diagnostics. As sequencing technologies produce ever longer and more accurate reads, de Bruijn graph assemblers continue to evolve, bringing complete, telomere-to-telomere genome reconstruction within reach for veterinary applications.

References

[1] P.E.C. Compeau, P. Pevzner, G. Tesler. "Why are de Bruijn graphs useful for genome assembly?" Nature Biotechnology, 2011. URL: https://www.semanticscholar.org/paper/c9406b3b7c81e3281fa230f5faeb2e77dbc634a5

[2] D. Zerbino and E. Birney. "Velvet: algorithms for de novo short read assembly using de Bruijn graphs." Genome Research, 2008. URL: https://www.semanticscholar.org/paper/cccdadaba3b4b598d0d963b9c9d431cec24c4bd4

[3] D. Díaz-Domínguez, P. Martinello, T. Onodera et al. "Genome assembly with variable-order de Bruijn graphs." bioRxiv, 2026. URL: https://www.semanticscholar.org/paper/35d1e3b068117e8fccfc250ad2d1165fcf42d619

[4] B. Zhu, S. Liu, S. Liu. "Reconstruction of DNA Sequences Through Eulerian Traversal of De Bruijn Graphs." Mathematics, 2026. URL: https://www.semanticscholar.org/paper/23af6c06f98c39f04f5bf0219d021e6f9ed04b67

[5] Z. Iqbal, M. Cáccamo, I. Turner et al. "De novo assembly and genotyping of variants using colored de Bruijn graphs." Nature Genetics, 2012. URL: https://www.semanticscholar.org/paper/91c964330beeed2c5c875d18633793a548a987cf

[6] A. Rahman, Y. Dufresne, P. Medvedev. "Compression algorithm for colored de Bruijn graphs." Algorithms for Molecular Biology, 2024. URL: https://www.semanticscholar.org/paper/bdf1ecdd6732ba5ed08afcc742e29c1e8df32451

[7] A. Bankevich, A.V. Bzikadze, M. Kolmogorov et al. "Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads." Nature Biotechnology, 2022. URL: https://www.semanticscholar.org/paper/6011afd02e4634b5b57fc76fa770623467132b4f

[8] A. Bankevich, A.V. Bzikadze, M. Kolmogorov et al. "LJA: Assembling Long and Accurate Reads Using Multiplex de Bruijn Graphs." bioRxiv, 2020. URL: https://www.semanticscholar.org/paper/4b2d67718cc8807cbcde1c767f635932b1f8249e

[9] D. Antipov, M. Rautiainen, S. Nurk et al. "Verkko2 integrates proximity-ligation data with long-read De Bruijn graphs for efficient telomere-to-telomere genome assembly, phasing, and scaffolding." Genome Research, 2025. URL: https://www.semanticscholar.org/paper/793db66111b2edea42dbe3afff7edeeb206ac0c1

[10] D. Antipov, M. Rautiainen, S. Nurk et al. "Verkko2: Integrating proximity ligation data with long-read De Bruijn graphs for efficient telomere-to-telomere genome assembly, phasing, and scaffolding." bioRxiv, 2024. URL: https://www.semanticscholar.org/paper/f85c7f25df466a918f05fdb70a52b91d5c900d56

[11] B. Ekim, B. Berger, R. Chikhi. "Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer." bioRxiv, 2021. URL: https://www.semanticscholar.org/paper/753b19dbf8c8ef148351ca38170fccf59b331f21

[12] G. Benoit, S. Raguideau, R. James et al. "Efficient High-Quality Metagenome Assembly from Long Accurate Reads using Minimizer-space de Bruijn Graphs." bioRxiv, 2023. URL: https://www.semanticscholar.org/paper/4b0b6e6b0dedaceb0b70c67a668c9189f493c4b9

[13] M. Šimunović, M. Šikić, A. Bankevich. "GnnDebugger: GNN based error correction in De Bruijn Graphs." bioRxiv, 2025. URL: https://www.semanticscholar.org/paper/fba6fc0477548138c20a39f9c307135e36c7a50e

[14] A. Steyaert, P. Audenaert, J. Fostier. "Improved Node and Arc Multiplicity Estimation in De Bruijn Graphs Using Approximate Inference in Conditional Random Fields." IEEE/ACM Transactions on Computational Biology & Bioinformatics, 2022. URL: https://www.semanticscholar.org/paper/87f10de03c74960d6fa664aea6ec9f65ae9f488e

[15] I. Turner, K.V. Garimella, Z. Iqbal et al. "Integrating long-range connectivity information into de Bruijn graphs." bioRxiv, 2017. URL: https://www.semanticscholar.org/paper/95798e752225b5f4afa591f00eb19b5f27571109

[16] L. Salmela, R. Walve, E. Rivals et al. "Accurate self-correction of errors in long reads using de Bruijn graphs." Bioinformatics, 2016. URL: https://www.semanticscholar.org/paper/47ea131f8eec393a5a6ae2dc21727dcd4d15fe45

[17] R. Rizzi, S. Beretta, M. Patterson et al. "Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era." Quantitative Biology, 2019. URL: https://www.semanticscholar.org/paper/ccd8743520dc7f73020e8767c6a1548f52fcbfa0

[18] E. Rødland. "Compact representation of

[19] J. Khan, L. Dhulipala, P. Pandey et al. "Fast and Scalable Parallel External-Memory Construction of Colored Compacted de Bruijn Graphs with Cuttlefish 3." bioRxiv, 2025. URL: https://www.semanticscholar.org/paper/0dd4f4b5eebed6dd94e7ee25a68ba5b7c689dcdd

[20] D. Díaz-Domínguez, T. Onodera, S. Puglisi et al. "Genome assembly with variable order de Bruijn graphs." Journal, 2022. URL: https://www.semanticscholar.org/paper/723da5f474030a7ad8a3b8cf1f4bb9000d455476

[21] G. Narzisi, A. Corvelo, K. Arora et al. "Genome-wide somatic variant calling using localized colored de Bruijn graphs." Communications Biology, 2018. URL: https://www.semanticscholar.org/paper/b78430a1d64aefa548f374f373bac9a16886c69c

[22] A. Limasset, J.-F. Flot, P. Peterlongo. "Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs." Bioinformatics, 2017. URL: https://www.semanticscholar.org/paper/487f655a78b6c29824c5e357aec37cbba5c857d3

Disclaimer: This article is for educational and informational purposes only. It is not intended to substitute for professional veterinary advice, diagnosis, treatment, or regulatory guidance. Always consult a licensed veterinarian or qualified specialist regarding animal health, disease diagnosis, and therapeutic decisions.