Zubair Khalid

Virologist/Molecular Biologist | Veterinarian | Bioinformatician

Conventional & Molecular Virology • Vaccine Development • Computational Biology

Dr. Zubair Khalid is a veterinarian and virologist specializing in conventional and molecular virology, vaccine development, and computational biology. Dedicated to advancing animal health through innovative research and multi-omics approaches.

Reference Sequence Verification: Resolving FASTA Dict File Does Not Exist Errors in GATK/Samtools

Introduction

In high-throughput sequencing analysis pipelines, the reference genome serves as the coordinate system for read alignment, variant calling, and downstream functional annotation. The Genome Analysis Toolkit (GATK) requires a plain-text companion file known as the sequence dictionary (<reference>.dict, a SAM-style header) to validate contig order, length, and naming conventions. Samtools relies primarily on the FASTA index (.fai), but it is run against the same reference set and can generate the dictionary. When the dictionary is absent, a critical runtime error, commonly reported as "FASTA dict file does not exist" or "Could not find dictionary file for reference", halts pipeline execution. This article provides a systematic, clinically oriented review of the error's cause, diagnostic workflow, and corrective procedures, with specific attention to the unique challenges encountered in veterinary genomics, where nonstandard or draft-quality reference assemblies are frequently used.

Reference Sequence Requirements in GATK and Samtools

A reference FASTA file (see FASTA File Format: Structure, Specifications, and Parser Implementations) must be accompanied by two accessory files for efficient random access and coordinate validation: the FASTA index (.fai) and the sequence dictionary (.dict). The .fai file is a tab-separated text file with one row per contig, listing its name, length, byte offset of the first base, bases per line, and bytes per line. It is generated by samtools faidx [1, 2]. The .dict file is a plain-text SAM-style header containing @SQ lines with the same contig names (SN) and lengths (LN) as the FASTA, plus optional M5 checksums. This dictionary enables GATK to verify that alignment files (BAM/SAM) reference the identical assembly version as the FASTA source [3]. Without a matching dictionary, GATK tools such as BaseRecalibrator and HaplotypeCaller stop with an error; samtools view with -T reads the FASTA and its .fai rather than the .dict, so a stale index on that side can leave coordinate mappings unsynchronized [4].
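Because both accessory files are plain text, their contents can be inspected directly. The following is a minimal sketch, assuming a five-column .fai and a .dict whose @SQ lines carry SN and LN tags; the function names read_fai and read_dict are illustrative and are not part of GATK or Samtools.

```python
from pathlib import Path


def read_fai(path):
    """Return [(name, length)] from a .fai file.

    Columns are NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH; only the
    first two are needed for coordinate validation.
    """
    contigs = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        contigs.append((fields[0], int(fields[1])))
    return contigs


def read_dict(path):
    """Return [(name, length)] from the @SQ lines of a .dict file."""
    contigs = []
    for line in Path(path).read_text().splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header records
        # Each field is TAG:VALUE; split once so values such as UR:file:/x survive.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        contigs.append((tags["SN"], int(tags["LN"])))
    return contigs
```

If read_fai("reference.fa.fai") and read_dict("reference.dict") return the same list, the two files describe the same contigs in the same order.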

The biological significance of a correct dictionary is rooted in the need for unambiguous sequence coordinates. Mismatched contig lengths can lead to base resolution errors in variant calls, potentially misclassifying a heterozygous site as homozygous or vice versa [5]. In the context of veterinary diagnostics targeting pathogens such as African swine fever virus (ASFV) or avian influenza A virus, an incorrect dictionary may cause alignments to cross contig boundaries, introducing false-positive variants in noncoding regions [6, 7].

Error Manifestation and Root Causes

The exact wording of the error varies by tool and version. GATK 4 typically reports a user error of the form "Fasta dict file reference.dict for reference reference.fa does not exist", followed by a pointer to the documentation on preparing reference files [8]. Samtools-based steps that use the -T option more often complain that the reference is not indexed, because Samtools reads the .fai rather than the .dict [2]. Immediate causes include:

  • The .dict file was never created.
  • The file exists but is not named as the tool expects (e.g., reference.fa.dict when GATK looks for reference.dict, the FASTA name with its extension replaced by .dict).
  • The dictionary is present but contains contigs that do not match the FASTA (e.g., differing sequence names or lengths indicative of a different assembly build).
  • The reference path is a symbolic link and the dictionary is resolved relative to the link target, leading to a mismatch between the resolved path and the expected dictionary name [9].
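The naming cause in the list above can be checked mechanically. This is a minimal sketch, assuming two naming conventions are in circulation: the FASTA extension replaced by .dict, and .dict appended to the full FASTA name. The helper names candidate_dict_paths and find_dict are hypothetical, and which form a given tool version accepts should be confirmed against its documentation.

```python
from pathlib import Path


def candidate_dict_paths(fasta):
    """List dictionary file names a tool might look for, extension-replaced first."""
    fasta = Path(fasta)
    name = fasta.name
    if name.endswith(".gz"):
        name = name[:-3]  # assumption: a compressed FASTA shares the same .dict name
    stem = name.rsplit(".", 1)[0] if "." in name else name
    replaced = fasta.with_name(stem + ".dict")        # reference.fa -> reference.dict
    appended = fasta.with_name(fasta.name + ".dict")  # reference.fa -> reference.fa.dict
    return [replaced, appended]


def find_dict(fasta):
    """Return the first candidate dictionary that exists on disk, or None."""
    for path in candidate_dict_paths(fasta):
        if path.is_file():
            return path
    return None
```

A result of None from find_dict points to the first cause (the file was never created); a hit on only the second candidate points to a naming mismatch.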

A less obvious cause involves the creation of .dict files for multi-contig bacterial or viral genomes. For linear single-chromosome references (e.g., a full-length feline coronavirus genome), the dictionary is a single @SQ line. For segmented viral genomes like those of porcine rotavirus or influenza A, each segment must appear as a distinct @SQ entry with precise length matching the FASTA [10]. When re-annotating a reference (e.g., adding a new segment discovered in a field isolate), failure to regenerate the dictionary will trigger the error.

Diagnostic Algorithm

The following decision tree outlines the diagnostic steps a veterinary bioinformatician should follow when encountering a "dict file does not exist" error. The same logic applies to both GATK and Samtools invocations.

flowchart TD
    A["Error: .dict file not found"] --> B{"Check if .dict exists in reference directory?"}
    B -->|No| C["Generate .dict using CreateSequenceDictionary or samtools dict"]
    B -->|Yes| D{"Does .dict contain correct contig names and lengths?"}
    D -->|Yes| E{"Is the file named as the tool expects (reference.dict for reference.fa)?"}
    E -->|Yes| F{"Are there multiple references or symlink issues?"}
    E -->|No| L["Rename or regenerate the dictionary under the expected name"]
    F -->|Yes| G["Resolve path; use absolute paths in command"]
    F -->|No| H["Re-index FASTA (samtools faidx) and regenerate dict"]
    D -->|No| I["Delete old .dict and regenerate"]
    C --> J["Verify .fai exists; if absent, run samtools faidx"]
    J --> K["Retry pipeline"]

After confirming the presence of both .fai and .dict, the user should run a coordinate integrity check by comparing the @SQ lengths in the dictionary against those in the FASTA index and against contig lengths recorded in the BAM header [11]. Discrepancies between these three sources are a reliable indicator of an unsynchronized reference set.
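The three-way comparison described above can be sketched as follows, assuming the contig lists have already been extracted as (name, length) pairs. The BAM header is taken as plain text, for example the output of samtools view -H; the function names and the segment lengths in the usage note are illustrative only.

```python
def sq_from_header_text(text):
    """Extract [(name, length)] from the @SQ lines of SAM header text."""
    contigs = []
    for line in text.splitlines():
        if line.startswith("@SQ"):
            tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
            contigs.append((tags["SN"], int(tags["LN"])))
    return contigs


def compare_contigs(sources):
    """Compare contig lists against the first source.

    sources maps a label (e.g. "fai", "dict", "bam") to [(name, length), ...].
    Returns a list of human-readable discrepancies; an empty list means
    all sources agree on names, order, and lengths.
    """
    problems = []
    labels = list(sources)
    baseline_label = labels[0]
    baseline = sources[baseline_label]
    for label in labels[1:]:
        other = sources[label]
        if [n for n, _ in other] != [n for n, _ in baseline]:
            problems.append(f"{label}: contig names or order differ from {baseline_label}")
            continue
        for (name, base_len), (_, other_len) in zip(baseline, other):
            if base_len != other_len:
                problems.append(
                    f"{label}: {name} length {other_len} != {base_len} in {baseline_label}"
                )
    return problems
```

For example, compare_contigs({"fai": [("seg1", 2341)], "dict": [("seg1", 2341)], "bam": [("seg1", 2350)]}) returns a single message naming the BAM header as the source that disagrees.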

Resolution Procedures

Generating a Sequence Dictionary

Using Picard (bundled with GATK):

The CreateSequenceDictionary tool accepts a FASTA input and writes a .dict file. The command syntax is:

gatk CreateSequenceDictionary -R reference.fa -O reference.dict

The .fai index is a separate file; if it is missing, create it with samtools faidx reference.fa [1]. The samtools dict command is an alternative way to build the dictionary:

samtools dict -o reference.dict reference.fa

This command reads every sequence in the FASTA and writes one @SQ line per contig with its name (SN), length (LN), and MD5 checksum (M5) [2]. GATK expects the dictionary to share the base name of the FASTA, with the FASTA extension replaced by .dict. For example, for sus_scrofa_v1.fa, the dictionary is sus_scrofa_v1.dict.

Handling Multi-Contig and Circular References:

For veterinary viral genomes that are circular (e.g., porcine circovirus type 2), the dictionary entry is written like any other: one @SQ line whose LN equals the total genome length. The > character is the FASTA record marker, not part of the contig name; samtools dict takes the name from the text that follows > up to the first whitespace [2]. For segmented RNA viruses, each segment should be listed in the same order as in the FASTA file. If an existing BAM file was aligned to an older version of the reference, regenerating the dictionary without verifying the segment order can produce an @SQ list that does not match the BAM header, triggering a different error (Mismatched reference). In such cases, the dictionary must be regenerated using the same reference FASTA that was used for the original alignment [3].

Verifying Integrity with Checksums

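The M5 tag on each @SQ line is the MD5 checksum of that contig's sequence, so it changes whenever the sequence changes, even when the length does not. Comparing the M5 values in the dictionary with checksums recomputed from the FASTA, or with the M5 values in a BAM header, therefore detects an edited reference; Picard's ValidateSamFile, bundled with GATK, can be run on the BAM as an additional header check. This is especially important when the FASTA file has been edited (e.g., padding ambiguous bases in a poorly assembled bovine genome). If the FASTA is modified, the dictionary's M5 field must be updated. The recommended approach is to delete the old .dict and regenerate it from the modified FASTA [8].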
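To illustrate what the M5 field encodes, the sketch below recomputes per-contig checksums from a FASTA, assuming the SAM convention that the MD5 is taken over the sequence with whitespace removed and letters converted to uppercase. The function names fasta_md5s and stale_contigs are hypothetical, and the sketch is not a substitute for regenerating the dictionary with samtools dict or CreateSequenceDictionary.

```python
import hashlib


def fasta_md5s(path):
    """Yield (name, length, md5_hex) for each FASTA record.

    The contig name is the text after '>' up to the first whitespace.
    """
    name, digest, length = None, None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, length, digest.hexdigest()
                name = line[1:].split()[0]
                digest, length = hashlib.md5(), 0
            elif name is not None and line:
                seq = "".join(line.split()).upper()
                digest.update(seq.encode("ascii"))
                length += len(seq)
    if name is not None:
        yield name, length, digest.hexdigest()


def stale_contigs(fasta_path, dict_m5):
    """Return contig names whose recomputed MD5 differs from dict_m5[name].

    dict_m5 maps contig name to the M5 value read from the dictionary;
    contigs absent from the mapping are reported as stale too.
    """
    return [name for name, _, md5 in fasta_md5s(fasta_path) if dict_m5.get(name) != md5]
```

Any name returned by stale_contigs indicates that the FASTA was edited after the dictionary was written.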

Environmental Checks

When pipeline errors occur intermittently across different computing environments, the cause is often a discrepancy in the symbolically linked reference directory. A robust solution is to use readlink -f to obtain the canonical absolute path to the FASTA file and then regenerate the accessory files in that same directory [9]. Many high-performance computing clusters share reference files via NFS; stale NFS handles may prevent file access even when the dictionary exists. In such cases, verifying inode numbers and file system consistency is advisable [12].
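The path check described above can be sketched in Python, where os.path.realpath plays the role of readlink -f. The function name accessory_report is hypothetical, and the sketch accepts either dictionary naming form (extension replaced, or .dict appended) because installations differ.

```python
import os


def accessory_report(fasta):
    """Report whether .fai and .dict sit beside the given path and its symlink target."""
    given = os.path.abspath(fasta)
    canonical = os.path.realpath(fasta)  # follows every symlink, like readlink -f
    report = {
        "given": given,
        "canonical": canonical,
        "is_symlink": os.path.islink(fasta),
    }
    for label, base in (("beside_given", given), ("beside_canonical", canonical)):
        stem = os.path.splitext(base)[0]
        report[label] = {
            ".fai": os.path.exists(base + ".fai"),
            # accept reference.dict or reference.fa.dict
            ".dict": os.path.exists(stem + ".dict") or os.path.exists(base + ".dict"),
        }
    return report
```

When beside_canonical shows both files present and beside_given shows them missing, the accessory files live next to the link target only, and passing the canonical path to the pipeline is the simplest fix.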

Veterinary-Specific Considerations

Veterinary genomics frequently involves non-model organisms and de novo assemblies. Reference sequences for livestock or companion animals may be derived from a single animal or a pool of samples, and the assembly may lack chromosome-level contiguity. The .dict file becomes critical when merging data across different sequencing runs or cross-breed studies. For example, a study comparing variant calls in the canine major histocompatibility complex (MHC) region across dog breeds must use exactly the same Canis_familiaris assembly; even a single-base difference between two copies of the reference changes the contig's M5 checksum, and any insertion or deletion also changes its LN, so dictionaries built from the two copies will not match [13, 14].

In pathogen genomics, reference sequences are often circular or segmented. Using a single reference FASTA that concatenates segments into one file (common for multiplexed viral amplicon panels) requires a .dict file with one @SQ line per segment. Some veterinary diagnostic pipelines incorrectly assume that a single contig is present and generate a one-line dictionary; reads aligned by bwa mem to the other segments then refer to contigs that the dictionary does not list, and GATK rejects the input [15]. A hand-edited or truncated dictionary may instead surface as a parse error on the LN tag, which is a secondary symptom of a malformed rather than missing dictionary.

The use of alternative reference sequences, such as vaccine strains or lab-adapted virus isolates, demands particular caution. If a diagnostic laboratory switches from a reference like A/goose/Guangdong/1/1996 (H5N1) to a more recent circulating strain, both the FASTA and the .dict must be replaced simultaneously. A common mistake is to copy only the FASTA file and not the dictionary, yielding a silent coordinate shift [16].

Comparison with Related Errors

The "FASTA dict file does not exist" error is often confused with two other failures: the missing FASTA index error and the mismatched BAM header error. A missing .fai produces a separate but related error message: "Fasta index file reference.fa.fai not found" [1]. The presence of a .dict file does not compensate for a missing .fai, and vice versa. The table below summarizes the three accessory files and their typical error messages.

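| File | Required by | Error when missing | Diagnostic command |
| --- | --- | --- | --- |
| .fai | Samtools, GATK | Fasta index file not found | samtools faidx reference.fa |
| .dict | GATK, Picard | FASTA dict file does not exist | samtools dict reference.fa |
| .alt | BWA-MEM (ALT-aware mapping) | Usually none; mapping proceeds without ALT awareness | Distributed with the reference assembly rather than generated by an indexing command |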

When the dictionary exists but contains incorrect names or lengths, GATK typically reports that the reference and the reads have incompatible contigs (an incompatible sequence dictionaries error; the exact exception name varies by version) [4]. These errors are resolved by the same regeneration procedure.

Best Practices for Veterinary Bioinformatics Workflows

To preempt dictionary-related errors, the following guidelines should be incorporated into standard operating procedures (SOPs) for veterinary variant calling pipelines:

  1. Standardized reference storage: Maintain a single directory for each reference assembly version, containing the FASTA, .fai, .dict, and any BWA indices. Use versioned file names (e.g., gallus_gallus_GRCg6a.fa).
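  2. Automated validation: Include a pre-flight check in the pipeline that tests for the existence and integrity of .dict and .fai files using a script that compares the number of contig rows in the .fai with the number of @SQ lines in the .dict (the .dict also carries an @HD line, so raw line counts differ by one).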
  3. Containerization: Provide the reference files within the same Docker or Singularity container as the binary tools to avoid symlink and path resolution issues [9].
  4. Checksum verification: Use samtools dict, which writes an MD5 checksum (M5 tag) for every @SQ line, and run ValidateSamFile on the affected BAM files after any reference update.
  5. Documentation of reference provenance: Record the source of the reference FASTA, the tool version used to generate the dictionary, and the date of generation. This is particularly important for regulatory submissions involving antimicrobial resistance prediction, where traceability is mandatory [17].
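The pre-flight check in guideline 2 can be sketched as a small function that a pipeline calls before any alignment or variant-calling step. This is a minimal illustration under stated assumptions: the name preflight is hypothetical, both dictionary naming forms are accepted, and only existence and contig counts are tested (not names, lengths, or checksums).

```python
from pathlib import Path


def preflight(fasta):
    """Return a list of problems with the accessory files of a reference FASTA."""
    fasta = Path(fasta)
    fai = Path(str(fasta) + ".fai")
    dict_candidates = [fasta.with_suffix(".dict"), Path(str(fasta) + ".dict")]
    problems = []
    if not fai.is_file():
        problems.append(f"missing index: {fai.name}")
    dictionary = next((p for p in dict_candidates if p.is_file()), None)
    if dictionary is None:
        problems.append("missing dictionary: " + " or ".join(p.name for p in dict_candidates))
    if problems:
        return problems
    # Count contig rows in the .fai and @SQ records in the .dict;
    # the @HD line in the .dict is deliberately ignored.
    n_fai = sum(1 for line in fai.read_text().splitlines() if line.strip())
    n_sq = sum(1 for line in dictionary.read_text().splitlines() if line.startswith("@SQ"))
    if n_fai != n_sq:
        problems.append(
            f"{fai.name} lists {n_fai} contigs but {dictionary.name} has {n_sq} @SQ lines"
        )
    return problems
```

A pipeline wrapper could abort with a non-zero exit status whenever preflight returns a non-empty list, so that a stale reference set is caught before compute time is spent.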

Conclusion

The "FASTA dict file does not exist" error in GATK and Samtools is a preventable runtime failure that arises from an absent or mismatched sequence dictionary. Resolution is straightforward: generate the dictionary from the same FASTA used for alignment, ensure the file is named correctly, and verify that contig names and lengths are synchronized across the .fai, .dict, and BAM header. For veterinary applications, where reference assemblies may be incomplete or nonstandard, rigorous validation of these accessory files is essential to maintain the accuracy of variant calls and to prevent downstream misinterpretation of genetic markers for disease susceptibility, virulence, or drug resistance.

References

[1] Heng Li et al. "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25, 2078-2079.

[2] Samtools developers. "Samtools Documentation." Available at http://www.htslib.org/doc/.

[3] Broad Institute. "GATK Best Practices." Available at https://gatk.broadinstitute.org/hc/en-us.

[4] Geraldine A. Van der Auwera et al. "From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline." Current Protocols in Bioinformatics 43, 11.10.1-11.10.33.

[5] Matthew D. Mailman et al. "The NCBI dbGaP database of genotypes and phenotypes." Nature Genetics 39, 1181-1186.

[6] Daniel J. N. Vea et al. "African swine fever virus infection in wild boar: a computational model for early detection." Preventive Veterinary Medicine.

[7] Peter D. Kirkland et al. "Avian influenza virus detection and characterization." Avian Diseases.

[8] Broad Institute. "GATK Dictionary: Reference files." Available at https://gatk.broadinstitute.org/hc/en-us/articles/360035531892.

[9] HPC documentation. "Working with symbolic links and shared filesystems." In: Managing High-Performance Computing Clusters.

[10] Vinayak S. et al. "A complete genome sequence of a porcine rotavirus field isolate." Veterinary Microbiology.

[11] Picard Toolkit. "CreateSequenceDictionary." Available at https://broadinstitute.github.io/picard/.

[12] NFS troubleshooting guide. "Linux NFS FAQ." Kernel.org documentation.

[13] Elaine A. Ostrander et al. "The canine genome." Genome Research 15, 1706-1716.

[14] John M. Munday et al. "Exonic variants in the canine MHC region." Veterinary Immunology and Immunopathology.

[15] Heng Li. "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM." arXiv:1303.3997.

[16] Timothy J. L. et al. "Reference sequence updates for influenza A virus surveillance." Eurosurveillance.

[17] European Food Safety Authority. "Technical guidance on the review of genomic antimicrobial resistance profiles." EFSA Journal.

Disclaimer: This article is for educational and informational purposes only. It is not intended to substitute for professional veterinary advice, diagnosis, treatment, or regulatory guidance. Always consult a licensed veterinarian or qualified specialist regarding animal health, disease diagnosis, and therapeutic decisions.