What is Dr. Zubair Khalid's research focus?

Dr. Zubair Khalid specializes in molecular virology, mRNA vaccine development, and computational biology, with a focus on avian pathogens like IBDV and Avian Reovirus.

Where is Dr. Zubair Khalid currently working?

Dr. Zubair Khalid is a Postdoctoral Research Associate at the University of Maryland (UMD), specifically within the Department of Animal and Avian Sciences.

FASTQ File Format: Decoding Phred Quality Scores and Quality Control Workflows

1. Introduction

The FASTQ file format has become the de facto standard for storing high-throughput nucleotide sequencing reads along with per-base quality scores [1, 2]. Originally developed at the Wellcome Trust Sanger Institute to accommodate quality information from early capillary-based sequencers, the format was later adapted by Solexa and Illumina platforms [1]. In the layout used by modern tools, each FASTQ record comprises four lines: a sequence identifier prefixed by @, the raw nucleotide sequence (A, C, G, T, or N), a plus sign (optionally followed by the same identifier), and a string of ASCII-encoded quality scores of equal length to the sequence [1, 2]. Paired-end reads are typically stored in two separate files with synchronized record ordering, although both mates can also be interleaved within a single file [3].
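
The four-line record layout described above can be read with a few lines of Python. The following is a minimal sketch rather than a full parser: it assumes unwrapped four-line records, and the names FastqRecord and read_fastq are illustrative, not taken from any cited tool.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # header text without the leading '@'
    sequence: str    # raw nucleotide string
    quality: str     # ASCII quality string, not yet decoded


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"header does not start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError("third line of the record does not start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        yield FastqRecord(header[1:].rstrip("\r\n"), sequence, quality)


# Two made-up records used only to exercise the parser.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n@read2\nGGCA\n+read2\n!5?I\n")
records = list(read_fastq(example))
for record in records:
    print(record.identifier, record.sequence, record.quality)
```

Checking that the sequence and quality strings have equal length is a cheap way to catch truncated or corrupted files early.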

Veterinary genomics and pathogen surveillance increasingly rely on FASTQ data for applications such as whole-genome sequencing of bacterial isolates (e.g., Mycoplasma gallisepticum in poultry), viral quasispecies analysis (e.g., avian influenza virus), and metagenomic characterization of the avian gut microbiome. Understanding the precise encoding of quality scores and the implementation of robust quality control (QC) workflows is essential for downstream analyses, including read mapping, variant calling, and assembly [4, 5].

2. Phred Quality Scores: Definition and Encoding

The quality score assigned to each base in a FASTQ record is a logarithmic transformation of the estimated probability that the base call is incorrect [1]. For Sanger and modern Illumina platforms, the score (Q) is defined as:

Q = -10 × log₁₀(P)

where P is the probability of an erroneous base call [1]. A higher Q value thus indicates a more confident call. For example, a Q score of 30 corresponds to an error probability of 0.001 (1 in 1,000), while a Q score of 20 corresponds to 0.01 (1 in 100) [1].
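
The relationship between Q and P can be checked numerically. The sketch below implements the formula above and its inverse; the function names are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to recover the error probability P."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(P) for a probability in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4g}")
```

Every increase of 10 in Q corresponds to a tenfold reduction in the estimated error probability.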

The Phred quality score system was originally developed for the Phred base-calling software used in capillary electrophoresis [1]. The Sanger FASTQ variant encodes scores using an ASCII offset of 33, meaning the ASCII character with decimal value (Q + 33) represents the quality score. Solexa and early Illumina pipelines (before version 1.3) used an offset of 64 together with Solexa scores, which are computed from the odds P / (1 - P) rather than from P itself; Illumina pipeline versions 1.3 to 1.7 switched to Phred scores but kept the offset of 64 [1]. Modern Illumina data (CASAVA 1.8+) has converged on the Sanger offset of 33, simplifying cross-platform compatibility [1, 6]. The standard ASCII range for Sanger FASTQ spans from ! (Q = 0) to ~ (Q = 93), although Q values in raw Illumina reads rarely exceed 41 [1].
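
A short sketch makes the offset arithmetic concrete. It decodes and encodes quality strings for a given offset and converts a Solexa score to the Phred scale using the odds-based definition; the function names and the min_score parameter are illustrative assumptions.

```python
import math
from typing import List, Sequence

SANGER_OFFSET = 33       # Sanger and Illumina 1.8+
ILLUMINA_64_OFFSET = 64  # Solexa and Illumina before 1.8


def decode_quality(quality: str, offset: int = SANGER_OFFSET,
                   min_score: int = 0) -> List[int]:
    """Convert an ASCII quality string to integer scores (character code minus offset)."""
    scores = [ord(ch) - offset for ch in quality]
    if any(score < min_score for score in scores):
        raise ValueError("score below the expected minimum; the offset may be wrong")
    return scores


def encode_quality(scores: Sequence[int], offset: int = SANGER_OFFSET) -> str:
    """Convert integer scores back to an ASCII quality string."""
    return "".join(chr(score + offset) for score in scores)


def solexa_to_phred(q_solexa: float) -> float:
    """Map a Solexa score, defined on the odds P / (1 - P), to the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


print(decode_quality("!I~"))                   # Sanger offset 33
print(decode_quality("h", ILLUMINA_64_OFFSET)) # offset 64
print(round(solexa_to_phred(0), 2), round(solexa_to_phred(40), 2))
```

The two scales differ noticeably only at low quality values and become nearly identical for high scores. Solexa scores can be negative, so decoding them requires a negative min_score.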

Base-calling algorithms on high-throughput sequencers assign quality scores based on signal-to-noise ratios, cluster intensity profiles, and phasing/prephasing models [1]. The probability P is derived from empirical calibration curves generated during sequencing instrument validation. The scores are therefore calibrated estimates rather than exact error probabilities, but they are generally reliable for read filtering and variant quality stratification.

3. Quality Score Compression

The storage and transmission of FASTQ files present significant challenges due to their large size, a major fraction of which is occupied by quality scores [7, 6]. A comprehensive review of quality score compression methods distinguishes between lossless and lossy approaches [7]. Lossless compressors preserve the exact original quality values and are required when downstream tools depend on accurate error probabilities. Examples include Genozip [8], which employs a universal extensible framework with specialized codecs for genomic data, and GeneSqueeze [9], a reference-free lossless compressor for FASTA and FASTQ files. The Nucleotide Archival Format (NAF) achieves efficient lossless compression of DNA sequences but does not directly store quality scores [10].

Lossy compression reduces file size at the cost of introducing controlled distortions into quality scores, often leveraging rate-distortion theory; QualComp [11] is an example of this approach. The CMIC compressor [12], by contrast, is lossless and supports random access to the compressed quality scores. A benchmark study of short-read sequence compression software evaluated the trade-offs between compression ratio, speed, and data integrity [6]. The BEETL-fastq archive [13] enables searchable compression of DNA reads, and KungFQ [14] provides a simple but effective approach for FASTQ compression.
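
The idea behind lossy quality compression can be illustrated with simple score binning. This sketch is not the algorithm used by QualComp or any other cited tool: the bin boundaries and representative values are assumptions chosen for illustration, and distortion is measured as mean squared error.

```python
from bisect import bisect_right
from typing import List, Sequence

# Hypothetical bins as (lowest score in bin, representative score); illustrative only.
BINS = [(0, 1), (2, 6), (10, 15), (20, 22), (25, 27), (30, 33), (35, 37), (40, 40)]
_LOWER_BOUNDS = [low for low, _ in BINS]
_REPRESENTATIVES = [rep for _, rep in BINS]


def quantize(scores: Sequence[int]) -> List[int]:
    """Replace each score with the representative value of its bin."""
    if any(score < 0 for score in scores):
        raise ValueError("Phred scores must be non-negative")
    return [_REPRESENTATIVES[bisect_right(_LOWER_BOUNDS, score) - 1] for score in scores]


def mean_squared_error(original: Sequence[int], distorted: Sequence[int]) -> float:
    """Average squared difference between original and distorted scores."""
    return sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)


original = [0, 5, 12, 23, 29, 31, 38, 41]
binned = quantize(original)
print(binned, mean_squared_error(original, binned))
```

Fewer distinct symbols make the quality string easier for a general-purpose compressor to encode, and the distortion measure quantifies what was given up in return.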

Specialized lossless quality score compressors include FCLQC [15], which emphasizes fast, concurrent operation, and LCQS [16], which supports random access. The Cryfa tool integrates encryption with compression for secure genomic data handling [17]. In addition to compression, efficient parsing and indexing of FASTQ files have been addressed by tools such as mim [18], which introduces a lightweight auxiliary index for parallel gzipped FASTQ parsing, and RabbitFX [19], a framework for fast FASTA/Q parsing on multi-core platforms. The FASTdoop library facilitates FASTQ input in Hadoop MapReduce environments [20].

4. Quality Control Workflows

Quality control (QC) is an indispensable step in any sequencing-based analysis. The primary goals of QC are to identify low-quality bases, adapter contamination, sequencing artifacts, and batch effects [4, 21]. A typical QC workflow processes raw FASTQ files through several stages:

Per-base quality score distribution: Aggregated box plots or line graphs of Q scores across read positions reveal whether quality deteriorates toward the 3' end, a common pattern in Illumina sequencing [4, 21].
Per-sequence quality scores: The average Q score per read is calculated; reads with an average below a user-defined threshold (e.g., Q < 20) are discarded.
GC content distribution: Deviations from the expected GC content (e.g., roughly 41% for the Gallus gallus genome) may indicate contamination or amplification bias [4].
Adapter contamination: Overrepresented sequences are screened against known adapter libraries; trimming is performed to remove adapter remnants [22].
Duplication levels: High levels of PCR duplicates inflate coverage and can skew variant allele frequencies [21].
Overrepresented sequences: A list of the most frequent sequences is generated; sequences matching known contaminants (e.g., ribosomal RNA, mitochondrial DNA) are flagged.
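
Several of the metrics listed above reduce to simple arithmetic over decoded scores. The sketch below computes per-position mean quality, per-read mean quality, GC fraction, and the fraction of exact duplicate sequences. It assumes Phred+33 encoding and reads held in memory as (sequence, quality) pairs; the function names are illustrative and do not reflect how FQStat or SeqKit are implemented.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Read = Tuple[str, str]  # (sequence, quality string), Phred+33 assumed


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    return [ord(ch) - offset for ch in quality]


def per_position_mean_quality(reads: Sequence[Read]) -> List[float]:
    """Mean Q at each read position; reads may differ in length."""
    totals: List[int] = []
    counts: List[int] = []
    for _, quality in reads:
        for position, score in enumerate(phred_scores(quality)):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


def mean_read_quality(quality: str) -> float:
    """Arithmetic mean of the Q scores of one read."""
    scores = phred_scores(quality)
    return sum(scores) / len(scores)


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among called bases (N is ignored)."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def duplication_fraction(reads: Sequence[Read]) -> float:
    """Fraction of reads that are exact copies of an earlier sequence."""
    distinct = Counter(sequence for sequence, _ in reads)
    return 1 - len(distinct) / len(reads)


# Made-up reads used only to exercise the functions.
reads = [("ACGT", "IIII"), ("ACGT", "II5!"), ("GGCC", "5555")]
passing = [read for read in reads if mean_read_quality(read[1]) >= 20]
print(per_position_mean_quality(reads))
print(len(passing), gc_fraction("GGCC"), duplication_fraction(reads))
```

Averaging Q values directly, as the per-sequence step describes, is the simplest convention. Because Q is logarithmic, averaging the underlying error probabilities instead gives a stricter result for reads that contain a few very poor bases.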

Tools such as FQStat [21] provide high-speed assessment of these metrics using parallel architectures. SeqKit [22] is a cross-platform toolkit for FASTA/Q manipulation that includes fast subcommands for filtering, trimming, and statistical summarization. For long-read sequencing data (e.g., Oxford Nanopore), specialized QC tools like LongReadSum [5] compute signal-level summaries and read quality metrics specific to nanopore chemistry. ModPhred [23] further extends QC analysis to include detection of DNA and RNA base modifications from nanopore data.

A quality control portal developed for the European Genome-phenome Archive [4] standardizes QC reporting across depositions, ensuring that all submitted FASTQ files meet predefined quality thresholds. The fqtools suite [3] provides a command-line interface for common FASTQ operations including validation, filtering, and conversion between encoding variants.

Below is a representative QC workflow (Figure 1) that integrates the steps described above.

flowchart TD
    A[Raw FASTQ files] --> B[Per-base Q score distribution]
    A --> C[Per-sequence average Q score]
    A --> D[GC content analysis]
    A --> E[Adapter contamination scan]
    A --> F[Duplication level estimation]
    B --> G{All metrics pass thresholds?}
    C --> G
    D --> G
    E --> G
    F --> G
    G -- Yes --> H[Clean FASTQ for downstream analysis]
    G -- No --> I[Filter / trim / discard poor reads]
    I --> J[Generate QC report]
    J -.-> A

Figure 1. Quality Control Workflow for FASTQ Sequencing Data. Dashed feedback loop indicates iterative filtering until all metrics meet user-defined thresholds.
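
The filter and trim step in Figure 1 can be sketched as follows. This is a deliberately simple illustration that removes trailing low-quality bases from the 3' end and discards reads that become too short. It assumes Phred+33 encoding; the function names and the default thresholds are illustrative choices, not recommendations from the cited tools.

```python
from typing import List, Sequence, Tuple

Read = Tuple[str, str]  # (sequence, quality string), Phred+33 assumed


def trim_three_prime(sequence: str, quality: str,
                     min_q: int = 20, offset: int = 33) -> Read:
    """Drop bases from the 3' end until one meets the quality threshold."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def clean_reads(reads: Sequence[Read], min_q: int = 20,
                min_length: int = 30) -> List[Read]:
    """Trim every read, then keep only those that are still long enough."""
    kept: List[Read] = []
    for sequence, quality in reads:
        trimmed_sequence, trimmed_quality = trim_three_prime(sequence, quality, min_q)
        if len(trimmed_sequence) >= min_length:
            kept.append((trimmed_sequence, trimmed_quality))
    return kept


# Made-up reads; short lengths and thresholds keep the example readable.
raw = [("ACGTAC", "IIII5!"), ("AC", "!!")]
print(clean_reads(raw, min_q=30, min_length=3))
```

After trimming, the QC metrics are recomputed on the cleaned reads, which corresponds to the feedback loop in the figure.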

5. Veterinary Applications and Considerations

In veterinary diagnostics, FASTQ data quality directly influences the reliability of pathogen detection and antimicrobial resistance gene identification. For example, sequencing of Mycoplasma synoviae isolates from chicken flocks requires high-confidence base calls to distinguish mutations associated with vaccine escape [1]. Similarly, metagenomic sequencing of poultry gut content to monitor Salmonella prevalence depends on thorough adapter trimming and low-level contamination removal to avoid false positives [4].

The quality score encoding must be verified before analysis; misidentifying the ASCII offset (e.g., reading offset-64 Illumina 1.3 scores as offset-33 Sanger scores) inflates every Q value by 31 and therefore badly underestimates error probabilities [1]. Many tools can infer the encoding by scanning the range of quality characters, although short or uniformly high-quality inputs can remain ambiguous [3, 22].
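
Range-based detection of the encoding can be sketched as a simple heuristic. The cut-offs below are assumptions based on the ranges discussed earlier: characters below ASCII 59 cannot occur in offset-64 data (the lowest Solexa score, -5, maps to 59), and characters above ASCII 74 would imply Q > 41 under offset 33, which is unusual for raw Illumina reads. The function name guess_offset is illustrative, and this is not the detection logic of any specific tool.

```python
from typing import Iterable, Optional


def guess_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess the ASCII offset (33 or 64) from the observed character range.

    Returns None when the observed range is compatible with both encodings.
    """
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        return 33  # impossible under offset 64
    if highest > 74:
        return 64  # would mean Q > 41 under offset 33 (heuristic)
    return None    # ambiguous: inspect more reads or check run metadata


print(guess_offset(["IIII#", "5?I!"]))
print(guess_offset(["hhhh", "ffeB"]))
print(guess_offset(["IIII"]))
```

Scanning more reads makes the guess more reliable. Data from platforms that legitimately report Q values above 41 under offset 33 would defeat the second rule, so an ambiguous or surprising result is best resolved from the sequencing run metadata.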

Given the large volumes of sequencing data generated in veterinary surveillance programs, compression techniques are critical for efficient storage and sharing. Lossless compressors such as Genozip [8] and GeneSqueeze [9] are recommended when quality scores must be preserved for regulatory submissions. Lossy compressors like QualComp [11] may be acceptable for exploratory analyses where minor score perturbations do not affect final biological conclusions.

6. Conclusion

The FASTQ format remains the cornerstone of sequencing data representation in veterinary and comparative genomics. Phred quality scores, encoded as ASCII characters with a defined offset, provide a standardized measure of base-calling confidence. Quality control workflows that assess per-base quality, contamination, and duplication levels are essential for ensuring the integrity of downstream analyses. Advances in compression and parsing algorithms continue to address the challenges of data volume and computational speed, enabling scalable and reproducible research in animal health.

References

[1] Cock PJ, Fields CJ, Goto N et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010. URL: https://pubmed.ncbi.nlm.nih.gov/20015970/

[2] Zhang H. Overview of Sequence Data Formats. Methods Mol Biol. 2016. URL: https://pubmed.ncbi.nlm.nih.gov/27008007/

[3] Droop AP. fqtools: an efficient software suite for modern FASTQ file manipulation. Bioinformatics. 2016. URL: https://pubmed.ncbi.nlm.nih.gov/27153699/

[4] Fernández-Orth D, Rueda M, Singh B et al. A quality control portal for sequencing data deposited at the European genome-phenome archive. Brief Bioinform. 2022. URL: https://pubmed.ncbi.nlm.nih.gov/35438138/

[5] Perdomo JE, Ahsan MU, Liu Q et al. LongReadSum: A fast and flexible quality control and signal summarization tool for long-read sequencing data. Comput Struct Biotechnol J. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/39981293/

[6] Betschart RO, Sandberg F, Blankenberg S et al. A benchmark study of compression software for human short-read sequence data. Sci Rep. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/40316539/

[7] Liu Y, Tang T, Zhu Z et al. Quality Scores Compression of Genomic Sequencing Data: A Comprehensive Review and Performance Evaluation. IEEE Trans Comput Biol Bioinform. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/40811383/

[8] Lan D, Tobler R, Souilmi Y et al. Genozip: a universal extensible genomic data compressor. Bioinformatics. 2021. URL: https://pubmed.ncbi.nlm.nih.gov/33585897/

[9] Nazari F, Patel S, LaRocca M et al. Lossless and reference-free compression of FASTQ/A files using GeneSqueeze. Sci Rep. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/39747361/

[10] Kryukov K, Ueda MT, Nakagawa S et al. Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences. Bioinformatics. 2019. URL: https://pubmed.ncbi.nlm.nih.gov/30799504/

[11] Ochoa I, Asnani H, Bharadia D et al. QualComp: a new lossy compressor for quality scores based on rate distortion theory. BMC Bioinformatics. 2013. URL: https://pubmed.ncbi.nlm.nih.gov/23758828/

[12] Chen H, Chen J, Lu Z et al. CMIC: an efficient quality score compressor with random access functionality. BMC Bioinformatics. 2022. URL: https://pubmed.ncbi.nlm.nih.gov/35870880/

[13] Janin L, Schulz-Trieglaff O, Cox AJ. BEETL-fastq: a searchable compressed archive for DNA reads. Bioinformatics. 2014. URL: https://pubmed.ncbi.nlm.nih.gov/24950811/

[14] Grassi E, Di Gregorio F, Molineris I. KungFQ: a simple and powerful approach to compress fastq files. IEEE/ACM Trans Comput Biol Bioinform. 2012. URL: https://pubmed.ncbi.nlm.nih.gov/23221092/

[15] Cho M, No A. FCLQC: fast and concurrent lossless quality scores compressor. BMC Bioinformatics. 2021. URL: https://pubmed.ncbi.nlm.nih.gov/34930110/

[16] Fu J, Ke B, Dong S. LCQS: an efficient lossless compression tool of quality scores with random access functionality. BMC Bioinformatics. 2020. URL: https://pubmed.ncbi.nlm.nih.gov/32183707/

[17] Hosseini M, Pratas D, Pinho AJ. Cryfa: a secure encryption tool for genomic data. Bioinformatics. 2019. URL: https://pubmed.ncbi.nlm.nih.gov/30020420/

[18] Patro R, Bharti S, Singhania P et al. mim: a lightweight auxiliary index to enable fast, parallel, gzipped FASTQ parsing. bioRxiv. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/41394640/

[19] Zhang H, Song H, Xu X et al. RabbitFX: Efficient Framework for FASTA/Q File Parsing on Modern Multi-Core Platforms. IEEE/ACM Trans Comput Biol Bioinform. 2023. URL: https://pubmed.ncbi.nlm.nih.gov/36327193/

[20] Ferraro Petrillo U, Roscigno G, Cattaneo G et al. FASTdoop: a versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications. Bioinformatics. 2017. URL: https://pubmed.ncbi.nlm.nih.gov/28093410/

[21] Chanumolu SK, Albahrani M, Otu HH. FQStat: a parallel architecture for very high-speed assessment of sequencing quality metrics. BMC Bioinformatics. 2019. URL: https://pubmed.ncbi.nlm.nih.gov/31416440/

[22] Shen W, Le S, Li Y et al. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016. URL: https://pubmed.ncbi.nlm.nih.gov/27706213/

[23] Pryszcz LP, Novoa EM. ModPhred: an integrative toolkit for the analysis and storage of nanopore sequencing DNA and RNA modification data. Bioinformatics. 2021. URL: https://pubmed.ncbi.nlm.nih.gov/34293115/