Human Reference Genomes: Navigating and Downloading hg38 and hg19 FASTA Assemblies

Introduction

Reference genomes serve as the foundational coordinate framework for all genomic analyses, from read alignment to variant calling and comparative genomics [1, 2]. In human genomics, two assemblies have dominated the field for over a decade: hg19 (also known as GRCh37) and hg38 (GRCh38) [3, 4]. These assemblies are not only essential for human biomedical research but also provide critical outgroup references for veterinary comparative genomics, host-pathogen interaction studies, and cross-species alignment of conserved regulatory elements [1, 5]. The Human Genome Project established the initial draft, which has since undergone continuous refinement [2, 6]. Understanding the structural differences between hg19 and hg38, their coordinate systems, and the methods for obtaining their FASTA sequences is a prerequisite for any computational biologist working with mammalian genomes [7, 8].

Historical Context and Assembly Improvements

The hg19 assembly (GRCh37) was released in 2009 and represented a major improvement over earlier builds, incorporating sequence from multiple individuals and closing many gaps [4, 9]. However, hg19 still contained hundreds of unresolved gaps and carried minor alleles at 2.18 million positions that could introduce false positives in variant calling [6, 10]. The transition to hg38 (GRCh38) in 2013 addressed several of these limitations: it greatly expanded the set of alternate locus scaffolds (ALT contigs) used to represent highly variable regions, corrected misassemblies, and closed a number of gaps [6, 8]. Variant data from the 1000 Genomes Project informed these improvements [4, 8]. Despite these advances, hg38 remains a mosaic genome derived from a small number of individuals and does not capture the full spectrum of human genetic diversity [2, 5, 11].

Structural Differences Between hg19 and hg38

The two assemblies differ in several key metrics. hg38 has a total sequence length of approximately 3.1 Gb, compared to hg19's 3.0 Gb, with the increase largely due to the inclusion of ALT contigs and the filling of previously unsequenced regions [6, 12]. Neither assembly is gap-free. Unresolved gaps have been estimated at about 5% of the reference sequence length, and a comparison of long-read de novo assemblies against the reference identified gap-closing sequences for 132 of 783 gaps (16.9%), adding up to 2.2 Mb of novel sequence [6]. The centromeric regions, which are composed of highly repetitive alpha-satellite arrays, remain challenging in both assemblies, though hg38 includes improved representations [12, 13]. The telomere-to-telomere (T2T) CHM13 assembly, released subsequently, has demonstrated that even hg38 is incomplete, particularly in the short arms of acrocentric chromosomes and in complex structural variant regions [2, 13, 14]. For veterinary researchers, these differences matter when aligning non-human primate or mammalian genomes to the human reference, as gaps and misassemblies can produce spurious alignment signals [5, 15].
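
Gap content of this kind can be inspected directly in a downloaded sequence. The following is a minimal sketch, assuming gaps are encoded as runs of N (the usual convention in assembly FASTA files) and that the chromosome sequence has already been read into a string. The function names `find_gaps` and `gap_fraction` are illustrative, not part of any library.

```python
import re

# Assumption: unresolved gaps appear as runs of N (or soft-masked n).
GAP_PATTERN = re.compile(r"[Nn]+")


def find_gaps(sequence, min_length=1):
    """Return (start, end) pairs for each N-run, 0-based and half-open."""
    gaps = []
    for match in GAP_PATTERN.finditer(sequence):
        if match.end() - match.start() >= min_length:
            gaps.append((match.start(), match.end()))
    return gaps


def gap_fraction(sequence):
    """Return the fraction of positions in the sequence that are N."""
    if not sequence:
        return 0.0
    gap_bases = sum(end - start for start, end in find_gaps(sequence))
    return gap_bases / len(sequence)


# Toy sequence standing in for a chromosome; real input would be far longer.
toy = "ACGTNNNNACGNNT"
print(find_gaps(toy))
print(round(gap_fraction(toy), 3))
```

Running the same functions over the equivalent chromosome from each assembly gives a rough view of how many gaps a given build contains and how much sequence they span.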

Coordinate Systems and Nomenclature

Both hg19 and hg38 use a chromosome-based coordinate system, but the underlying sequences differ, requiring coordinate liftover when switching between assemblies [7, 8]. Whether positions are written 1-based or 0-based depends on the file format rather than the assembly: VCF, GFF, and SAM use 1-based positions, while BED files use 0-based, half-open intervals. UCSC names chromosomes chr1, chr2, ..., chrX, chrY, chrM in both assemblies, whereas Ensembl uses unprefixed names (1, 2, ..., X, Y, MT), so sequence names must match between the reference and any annotation used with it. hg38 also added many ALT contig names (e.g., chr1_KI270766v1_alt) to represent haplotypic diversity [7, 8]. FASTA is the standard format for storing these sequences, with each chromosome represented as a separate record. The Genome Reference Consortium (GRC) issues patch releases for both assemblies (e.g., GRCh38.p14), which add minor sequence corrections without altering the primary coordinate system [8]. Researchers must be aware of the patch version when downloading FASTA files to ensure reproducibility [8, 12].
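
Because file formats disagree on indexing (VCF, GFF, and SAM positions are 1-based, while BED intervals are 0-based and half-open), off-by-one errors are a common hazard when moving coordinates between tools. The sketch below shows the two conversions and a small region parser. The function names are hypothetical, and the parser assumes region strings of the form name:start-end with 1-based inclusive coordinates.

```python
def one_based_to_bed(start, end):
    """Convert a 1-based inclusive interval to 0-based half-open (BED)."""
    return start - 1, end


def bed_to_one_based(start, end):
    """Convert a 0-based half-open (BED) interval to 1-based inclusive."""
    return start + 1, end


def parse_region(region):
    """Parse 'name:start-end' (1-based inclusive) into (name, start, end).

    Commas in the numbers are tolerated, e.g. 'chr1:1,001-2,000'.
    """
    name, _, span = region.rpartition(":")
    start_text, _, end_text = span.replace(",", "").partition("-")
    start, end = int(start_text), int(end_text)
    if not name or start < 1 or end < start:
        raise ValueError(f"invalid region: {region!r}")
    return name, start, end


name, start, end = parse_region("chr1_KI270766v1_alt:1,001-2,000")
bed_start, bed_end = one_based_to_bed(start, end)
# Interval length is preserved: end - start + 1 equals bed_end - bed_start.
print(name, bed_start, bed_end)
```

These conversions only change how a position is written within one assembly. Moving a position from hg19 to hg38 still requires a liftover step.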

Navigating Public Repositories for FASTA Downloads

The primary sources for human reference genome FASTA files are the UCSC Genome Browser, the NCBI Assembly database, and Ensembl (hosted at the European Bioinformatics Institute, EMBL-EBI). Each repository provides slightly different file formats and directory structures. UCSC offers "chromFa.tar.gz" archives containing one FASTA file per chromosome, while NCBI provides the assembly in a single compressed FASTA file with a standardized naming convention [8, 12]. Ensembl distributes both primary assemblies (excluding ALT contigs) and toplevel assemblies (including ALT contigs) [7]. For most alignment and variant calling pipelines, the primary assembly is recommended to avoid complications with multi-mapping reads [8, 15]. The following Mermaid diagram outlines a decision workflow for selecting and downloading the appropriate assembly.

flowchart TD
    A[Start: Determine required assembly] --> B{Which assembly?}
    B -->|"hg19 (GRCh37)"| C[Choose source: UCSC, NCBI, Ensembl]
    B -->|"hg38 (GRCh38)"| D[Choose source: UCSC, NCBI, Ensembl]
    C --> E{Include ALT contigs?}
    D --> E
    E -->|Yes| F["Download toplevel FASTA (with ALT)"]
    E -->|No| G["Download primary FASTA (no ALT)"]
    F --> H[Verify checksum and patch version]
    G --> H
    H --> I[Index with samtools faidx or similar]
    I --> J[Ready for alignment or analysis]
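
The selection and verification steps in this workflow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive downloader: the hg19 URL is assumed to mirror the hg38 pattern, ALT contigs are assumed to be recognizable by a UCSC-style `_alt` name suffix, and the expected MD5 value is assumed to come from a checksum file published alongside the download. All function names are hypothetical.

```python
import hashlib

UCSC_GOLDENPATH = "http://hgdownload.soe.ucsc.edu/goldenPath"


def ucsc_fasta_url(assembly):
    """Build the UCSC bigZips URL for an assembly (hg19 pattern is assumed)."""
    if assembly not in ("hg19", "hg38"):
        raise ValueError(f"unsupported assembly: {assembly!r}")
    return f"{UCSC_GOLDENPATH}/{assembly}/bigZips/{assembly}.fa.gz"


def keep_record(name, include_alt):
    """Decide whether to keep a sequence, dropping *_alt names if requested."""
    return include_alt or not name.endswith("_alt")


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a file's MD5 against the published value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()


print(ucsc_fasta_url("hg38"))
print(keep_record("chr1_KI270766v1_alt", include_alt=False))
```

Reading the file in chunks keeps memory use flat, which matters for archives approaching a gigabyte in size.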

The download process typically involves using wget or curl to retrieve the compressed archive, followed by decompression with gzip or tar [8]. For example, the UCSC download URL for hg38 is http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz. After downloading, the FASTA file should be indexed using samtools faidx to enable random access to specific regions [8, 12]. Note that samtools faidx works on uncompressed or BGZF-compressed FASTA, so a plain gzip download must first be decompressed (or recompressed with bgzip). The article "SAM and BAM Formats: Mapping, Alignment Representation, and Indexing with Samtools" provides further details on indexing and alignment workflows.
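
To make the purpose of the index concrete, the sketch below builds an in-memory table with the same five fields that a .fai file stores for each sequence (name, length, byte offset of the first base, bases per line, bytes per line) and uses it to fetch a region without reading the whole file. It is a pure-Python illustration, not a replacement for samtools, and it assumes an uncompressed FASTA in which every line of a record except the last has the same length. The names `build_fai` and `fetch` are hypothetical.

```python
def build_fai(fasta_path):
    """Return {name: (length, offset, linebases, linewidth)} for a FASTA file."""
    index = {}
    name = None
    with open(fasta_path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Sequence name is the header text up to the first whitespace.
                name = line[1:].split()[0].decode()
                # Offset is the byte position right after the header line.
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    # First sequence line defines bases and bytes per line.
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return {key: tuple(value) for key, value in index.items()}


def fetch(fasta_path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) of one sequence."""
    length, offset, linebases, linewidth = index[name]
    if not 1 <= start <= end <= length:
        raise ValueError("region outside sequence bounds")

    def byte_position(zero_based):
        full_lines, within_line = divmod(zero_based, linebases)
        return offset + full_lines * linewidth + within_line

    first = byte_position(start - 1)
    last = byte_position(end - 1)
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first + 1)
    # Strip the line breaks that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()
```

A call such as `fetch(path, build_fai(path), "chr1", 1001, 2000)` would then read only the bytes covering that interval, which is what makes indexed access practical on a multi-gigabyte reference.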

Implications for Veterinary Comparative Genomics

Human reference genomes are frequently used as outgroups in comparative genomic studies of domestic animals, including cattle, swine, dogs, and poultry [1, 5]. Reference bias, in which reads carrying the reference allele map more successfully than reads carrying an alternate allele, can skew population genetic inferences in cross-species alignments [15]. This bias is particularly pronounced in ancient DNA studies and low-coverage sequencing, where short fragment lengths exacerbate mapping artifacts [15]. The incomplete representation of human genetic diversity in hg19 and hg38 also affects the detection of structural variants that may be conserved across mammals [13, 16, 17]. For example, viromics workflows often rely on host genome subtraction to identify pathogen sequences; incomplete host references can lead to false-positive viral detection or the inadvertent release of host-identifying information [1]. The use of pangenome references, such as those being developed by the Human Pangenome Reference Consortium, promises to reduce these biases by representing multiple haplotypes [2, 13, 18].
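
One simple way to look for reference bias in a mapped dataset is to pool read counts at sites believed to be heterozygous and check how far the reference-allele share departs from the 0.5 expected without bias. The sketch below assumes such counts are already available as (reference reads, alternate reads) pairs. The function name and the toy numbers are illustrative only and do not come from any cited study.

```python
def reference_allele_fraction(site_counts):
    """Pooled share of reads supporting the reference allele.

    site_counts: iterable of (ref_reads, alt_reads) at heterozygous sites.
    A value near 0.5 suggests little bias; values above 0.5 suggest that
    reference-carrying reads are mapping preferentially.
    """
    ref_total = 0
    alt_total = 0
    for ref_reads, alt_reads in site_counts:
        ref_total += ref_reads
        alt_total += alt_reads
    total = ref_total + alt_total
    if total == 0:
        raise ValueError("no reads at the supplied sites")
    return ref_total / total


# Toy counts for three hypothetical heterozygous sites.
toy_counts = [(6, 4), (5, 5), (7, 3)]
print(reference_allele_fraction(toy_counts))
```

Pooling across sites smooths out sampling noise at individual positions, which is useful at low coverage, though it hides site-to-site variation in the strength of the bias.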

Future Directions: Pangenome and T2T References

The T2T-CHM13 assembly, completed in 2022, is the first gapless human genome assembly, covering all 22 autosomes and chromosome X end to end [2, 13, 18]. This assembly has revealed thousands of previously missing sequences, including complete centromeric arrays and ribosomal DNA clusters [12, 13]. However, CHM13 is derived from a hydatidiform mole cell line and is nearly homozygous, limiting its representation of human diversity [2]. The Human Pangenome Reference Consortium is now producing high-quality diploid assemblies from diverse individuals, with the goal of creating a graph-based pangenome reference that captures global genetic variation [2, 13, 16]. For veterinary applications, these resources will enable more accurate cross-species comparisons and improve the identification of conserved functional elements [14, 18]. Population-specific references, such as the Ashkenazi (Ash1) and Japanese (JG1) assemblies, further demonstrate the value of tailored references for reducing mapping bias [11, 19].

Conclusion

The human reference genomes hg19 and hg38 remain essential tools in both human and veterinary genomics. Understanding their structural differences, coordinate systems, and proper download procedures is critical for reproducible bioinformatics analyses. As the field moves toward pangenome and T2T references, researchers must stay informed about the evolving landscape of reference assemblies to minimize bias and maximize the accuracy of their genomic inferences [2, 18]. The resources provided by UCSC, NCBI, and Ensembl, combined with careful attention to assembly version and patch level, ensure that FASTA sequences can be reliably obtained and used in alignment, variant calling, and comparative genomics pipelines [8, 12].

References

[1] Guccione C, Patel L, Tomofuji Y, et al. Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data. Nature Communications. 2025. URL: https://www.semanticscholar.org/paper/2082a7fe7a48636c675173caa3c297b0e3125f5a

[2] Jarvis E, Formenti G, Rhie A, et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature. 2022. URL: https://www.semanticscholar.org/paper/0c58363bd73ea3646eb2436cdd3fec68bc77ddf2

[3] Dwarshuis N, Kalra D, McDaniel JH, et al. The GIAB genomic stratifications resource for human reference genomes. Nature Communications. 2024. URL: https://www.semanticscholar.org/paper/f9da99b98fe83de74de346e2b8b9c48d047a7ef1

[4] Auton A, Abecasis GR, Altshuler DM, et al. A global reference for human genetic variation. Nature. 2015. URL: https://www.semanticscholar.org/paper/53a7a77005e93fb884ebad4fb958bc774b97bf9f

[5] Rosenfeld J, Mason C, Smith TM. Limitations of the Human Reference Genome for Personalized Genomics. PLoS ONE. 2012. URL: https://www.semanticscholar.org/paper/0869aab38d4f9cc0a3602ca5af22535a667ada67

[6] Zhao T, Duan Z, Genchev G, et al. Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences. G3. 2020. URL: https://www.semanticscholar.org/paper/987df828931d27336743bde992da3fc057cb5ee1

[7] Rand K, Grytten I, Nederbragt A, et al. Coordinates and intervals in graph-based reference genomes. BMC Bioinformatics. 2016. URL: https://www.semanticscholar.org/paper/ca30cafa637bbd952e1db171984be4bef5cb8f8c

[8] Zheng-Bradley X, Streeter I, Fairley S, et al. Alignment of 1000 Genomes Project reads to reference assembly GRCh38. GigaScience. 2017. URL: https://www.semanticscholar.org/paper/753495c9d8689f141675c7feea01b1225180c657

[9] Genovese G, Handsaker R, Li H, et al. Mapping the human reference genome's missing sequence by three-way admixture in Latino genomes. American Journal of Human Genetics. 2013. URL: https://www.semanticscholar.org/paper/1d8545fde56dec53d581e0ec402eb95c7ef80aec

[10] Shukla H, Bawa P, Srinivasan S. hg19KIndel: ethnicity normalized human reference genome. BMC Genomics. 2019. URL: https://www.semanticscholar.org/paper/7eaf41be1b07675bb8a8c308c3095c658a9cc652

[11] Shumate A, Zimin A, Sherman RM, et al. Assembly and annotation of an Ashkenazi human reference genome. bioRxiv. 2020. URL: https://www.semanticscholar.org/paper/f1d8aefff8ca0a764896b3be058733dbd7c6af3b

[12] Tao Y, He C, Lin D, et al. Comprehensive Identification of Mitochondrial Pseudogenes (NUMTs) in the Human Telomere-to-Telomere Reference Genome. Genes. 2023. URL: https://www.semanticscholar.org/paper/cd9d69119a43aa9bac98be017ead8b589263604c

[13] Logsdon GA, Ebert P, Audano P, et al. Complex genetic variation in nearly complete human genomes. bioRxiv. 2024. URL: https://www.semanticscholar.org/paper/d6654cdcb0238112d367eba4f362c39559b46653

[14] Bilgrav Saether K, Eisfeldt J, Bengtsson J, et al. Leveraging the T2T assembly to resolve rare and pathogenic inversions in reference genome gaps. Genome Research. 2024. URL: https://www.semanticscholar.org/paper/43691c884ea796d103539d35e9071d519b6ff152

[15] Günther T, Nettelblad C. The presence and impact of reference bias on population genomic studies of prehistoric human populations. bioRxiv. 2018. URL: https://www.semanticscholar.org/paper/f74b680ec63c45d069ce36fb2e689c848a153afb

[16] Ebert P, Audano P, Zhu Q, et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021. URL: https://www.semanticscholar.org/paper/f467146811515d5a32460275406cb27dee243934

[17] Collins RL, Brand H, Karczewski K, et al. A structural variation reference for medical and population genetics. Nature. 2020. URL: https://www.semanticscholar.org/paper/a85b86354a118ee2ab05b9b1b66a6b62bdfce121

[18] Miga K, Eichler EE. Envisioning a new era: Complete genetic information from routine, telomere-to-telomere genomes. American Journal of Human Genetics. 2023. URL: https://www.semanticscholar.org/paper/8b3b6e727c24e1871c22b0f138fc7410aec7a122

[19] Takayama J, Tadaka S, Yano K, et al. Construction and integration of three de novo Japanese human genome assemblies toward a population-specific reference. Nature Communications. 2019. URL: https://www.semanticscholar.org/paper/db3263a027b9893abdf327f436b24ddf55c7f234

[20] Volpe E, Colantoni A, Corda L, et al. The reference genome of the human diploid cell line RPE-1. Nature Communications. 2025. URL: https://www.semanticscholar.org/paper/1cb559c0519b9dfa48eac6893a79e9c8d2446468

[21] Ganapathiraju M, Subramanian S, Chaparala S, et al. A reference catalog of DNA palindromes in the human genome and their variations in 1000 Genomes. Human Genome Variation. 2020. URL: https://www.semanticscholar.org/paper/2c14045edfeef2afbe43d3a3788c99fed2ff6298

[22] Zook J, Catoe D, McDaniel J, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific Data. 2015. URL: https://www.semanticscholar.org/paper/da1f670f622d866b2456cc4f72192dd6ed0332fe

[23] Hehir-Kwa J, Marschall T, Kloosterman W, et al. A high-quality human reference panel reveals the complexity and distribution of genomic structural variants. Nature Communications. 2016. URL: https://www.semanticscholar.org/paper/096a1877c23c4bdc5e1ba655d36fd3f50b560a1c

[24] Li R, Zhu H, Ruan J, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research. 2010. URL: https://www.semanticscholar.org/paper/c6e319a023f932d6d2ff8897c2c91e56e872db8e

[25] Sosa MX, Sivakumar I, Maragh S, et al. Next-Generation Sequencing of Human Mitochondrial Reference Genomes Uncovers High Heteroplasmy Frequency. PLoS Computational Biology. 2012. URL: https://www.semanticscholar.org/paper/b921bf783c76fe924511597838d04aaad514e825

[26] Duan Z, Qiao Y, Lu J, et al. HUPAN: a pan-genome analysis pipeline for human genomes. Genome Biology. 2019. URL: https://www.semanticscholar.org/paper/d6cb417ba6aa140fc38acb5695e437308d4cb7ee

[27] Bowden R, Davies R, Heger A, et al. Sequencing of human genomes with nanopore technology. Nature Communications. 2019. URL: https://www.semanticscholar.org/paper/6dbf45d9b448aa7dce0a5aa60d0849de6131b0cb

[28] Maretty L, Jensen JM, Petersen B, et al. Sequencing and de novo assembly of 150 genomes from Denmark as a population reference. Nature. 2017. URL: https://www.semanticscholar.org/paper/68b93854e731c1e19055f591dbcad4442ad9324b

[29] Choi Y, Chan A, Kirkness E, et al. Comparison of phasing strategies for whole human genomes. PLoS Genetics. 2018. URL: https://www.semanticscholar.org/paper/d86723055dab79ef36ad5774a10bda76bb800b85

[30] Liu Y, Koyutürk M, Maxwell S, et al. Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing. BMC Genomics. 2014. URL: https://www.semanticscholar.org/paper/1013dd9a595fa4fbe811ec1bc9d4cfe969dc59d2

[31] Mallick S, Li H, Lipson M, et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016. URL: https://www.semanticscholar.org/paper/b20616b0b707175d638cfdac4ac57640227efffb

[32] Chen G, Wang C, Shi L, et al. Comprehensively identifying and characterizing the missing gene sequences in human reference genome with integrated analytic approaches. Human Genetics. 2013. URL: https://www.semanticscholar.org/paper/e24ef34abd8a0c27584cd352813a0daea0dbf926

[33] Sanders A, Hills M, Porubsky D, et al. Characterizing polymorphic inversions in human genomes by single-cell sequencing. Genome Research. 2016. URL: https://www.semanticscholar.org/paper/8a6e1ac1baefdeeac35b1a10f58e0430e916c830

[34] Browning B, Browning S. Genotype Imputation with Millions of Reference Samples. American Journal of Human Genetics. 2016. URL: https://www.semanticscholar.org/paper/2ee27eff8ac0cfcf48b28dcbbfa769ec5af7f316

[35] Eberle M, Fritzilas E, Krusche P, et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. bioRxiv. 2016. URL: https://www.semanticscholar.org/paper/29f68a9512a16c6db526fb166a6433be72ad005c