AI-Driven Protein Language Models for Predicting Viral Host Tropism and Zoonotic Potential
Introduction
Predicting the host range of emerging viruses is a central challenge in veterinary virology and pandemic preparedness [1]. Zoonotic pathogens, such as influenza A viruses, coronaviruses, and hantaviruses, repeatedly cross species barriers, causing disease in humans, domestic animals, and wildlife [2, 3]. Traditional approaches to assessing host tropism rely on phylogenetic analysis, receptor-binding assays, and experimental infections [4, 5]. However, these methods are time-consuming and may not capture the subtle molecular determinants that govern cross-species transmission [6]. In recent years, transformer-based protein language models (PLMs) have emerged as a computational framework for predicting viral host range directly from glycoprotein sequences [7]. These models learn biophysical and evolutionary constraints from large protein sequence corpora and generate dense embeddings that encode structural and functional information [8, 33]. This article reviews the application of PLMs to predicting zoonotic potential, with a focus on spike and other surface glycoproteins, and compares their performance with conventional machine learning and phylogenetic methods.
Background on Viral Host Tropism and Zoonotic Potential
Viral host tropism is primarily determined by the interaction between viral surface proteins and specific host cell receptors [9]. For influenza A viruses, hemagglutinin (HA) binds sialic acid receptors, and the binding specificity (α2,3 vs. α2,6 linkages) is a major determinant of avian versus mammalian tropism [2]. For coronaviruses, the spike glycoprotein engages host receptors such as angiotensin-converting enzyme 2 (ACE2) or aminopeptidase N [4]. Hantaviruses rely on integrin receptors for entry, and variation in the glycoprotein sequence influences host specificity [3, 5]. Zoonotic spillover occurs when a virus acquires mutations that enable efficient binding to a new host receptor [1, 10]. Surveillance of viral diversity in animal reservoirs, including rodents, bats, birds, and livestock, is therefore critical for risk assessment [11, 12, 13, 14, 6]. Recent metavirome analyses have cataloged numerous viral sequences in dairy cattle [6], urban rats [5], and other species, highlighting the vast potential for cross-species transmission.
Protein Language Models: Architecture and Embedding Generation
Protein language models are deep neural networks based on the transformer architecture, originally developed for natural language processing [15]. They are trained on millions of protein sequences using a masked language modeling objective, in which random amino acids are masked and the model learns to predict them from context [7]. This training captures coevolutionary patterns, structural propensities, and functional constraints without requiring explicit labels [16]. For viral glycoproteins, the sequence is tokenized into individual residues and passed through multiple self-attention layers, which produce one embedding vector per residue [17]. These embeddings can be aggregated (e.g., via mean pooling) to obtain a fixed-length sequence representation, or they can be passed unaggregated to downstream classifiers that exploit the full per-residue, position-specific information [18]. Models such as ESM-1b and ProtBERT have been widely applied to predict variant effects, protein stability, and interactions [8, 19]. In the context of host tropism, these embeddings encode information about receptor-binding motifs, glycosylation sites, and conserved domains that are critical for cross-species recognition [20, 21].
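The mean-pooling step described above can be sketched in a few lines. This is a minimal illustration, not the pooling code of any particular PLM library: the function name `mean_pool`, the list-of-lists layout, and the optional padding mask are assumptions made for the example.

```python
def mean_pool(residue_embeddings, mask=None):
    """Average per-residue vectors into one fixed-length sequence vector.

    residue_embeddings: list of equal-length lists, one vector per residue.
    mask: optional list of 0/1 flags; positions with 0 (e.g. padding or
          special tokens) are excluded from the average.
    """
    if mask is None:
        mask = [1] * len(residue_embeddings)
    if len(mask) != len(residue_embeddings):
        raise ValueError("mask and embeddings must have the same length")

    kept = [vec for vec, keep in zip(residue_embeddings, mask) if keep]
    if not kept:
        raise ValueError("no residues left after masking")

    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three positions with 2-dimensional embeddings,
# where the last position is padding and is masked out.
toy = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
pooled = mean_pool(toy, mask=[1, 1, 0])
```

The output has the same dimensionality whatever the sequence length, so glycoproteins of different sizes can be fed to the same downstream classifier. The cost is that position-specific information is averaged away.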
Training on Curated Host-Pathogen Interaction Datasets
To predict viral host range, PLM-derived embeddings are used as features for supervised classifiers. The training data must consist of viral sequences with known host associations, such as influenza HA subtypes isolated from avian, swine, or human hosts [2, 22]. Similarly, coronavirus spike sequences from bat, camel, and other mammalian hosts provide examples of different tropism profiles [4, 6]. The classifier can be a simple logistic regression, a random forest, or a deep neural network that takes the embedding as input and outputs a probability distribution over host categories [23]. Attention weights can indicate which residues are most influential for the prediction, and these often map to known receptor-binding domains [24, 25]. Evaluation on independent held-out datasets, such as newly discovered viruses from metavirome surveys [6] or experimental host range data [35], is essential for detecting overfitting. The resulting models can assign a zoonotic risk score to a novel viral sequence, flagging those with high potential for spillover [26, 31]. The workflow is summarized in the following diagram.
graph TD
A[Viral Glycoprotein Sequence] --> B[Tokenization & Embedding via PLM]
B --> C[Sequence Embedding Vector]
C --> D[Supervised Classifier]
D --> E[Predicted Host Tropism]
C --> F[Attention Weights for Key Residues]
F --> G[Mapping to 3D Structure]
G --> H[Visualization in 3D Protein Viewer]
D --> I[Zoonotic Risk Score]
I --> J[Validation with MD/Docking]
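The classifier and risk-score steps of this workflow can be illustrated with a multinomial logistic regression head. The sketch below covers inference only and rests on several assumptions: the host labels, the function names, and the definition of the risk score as the predicted probability of the human class are hypothetical choices for illustration, and the weights would in practice be fitted to labeled sequences.

```python
import math

# Hypothetical host categories; a real label set depends on the dataset.
HOSTS = ["avian", "swine", "human"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_host_probs(embedding, weights, biases):
    """Linear layer followed by softmax: one weight row and bias per host."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return dict(zip(HOSTS, softmax(logits)))


def zoonotic_risk_score(probs, target_host="human"):
    """Assumed definition: probability mass assigned to the target host."""
    return probs[target_host]


# Toy parameters for a 2-dimensional embedding (illustrative values only).
toy_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_biases = [0.0, 0.0, 0.0]
probs = predict_host_probs([0.2, 0.9], toy_weights, toy_biases)
risk = zoonotic_risk_score(probs)
```

A linear head like this is a common baseline because any signal it finds must already be present in the embedding. A random forest or neural network can replace it without changing the rest of the pipeline.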
Validation with Receptor-Binding Dynamics Simulations
A prediction of host tropism is strengthened by computational validation using molecular dynamics (MD) simulations and molecular docking [4, 27]. For a candidate viral variant, the binding free energy between the surface protein and the candidate host receptor can be approximated with docking scores from tools such as AutoDock Vina or estimated more rigorously with free energy perturbation methods [28, 24]. PLM predictions can prioritize which mutations to test in silico, reducing the search space [29]. For example, if a PLM predicts that a bat coronavirus spike protein can bind human ACE2, MD simulations can quantify the stability of the complex and identify key interface residues [4, 35]. This integrative approach combines the speed of sequence-based prediction with the physical realism of atomistic simulations [2, 27]. The results can be further validated by surface plasmon resonance or pseudovirus entry assays, but these experimental methods are outside the scope of a purely computational workflow.
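To put binding free energies on a more familiar scale, the standard thermodynamic relation ΔG° = RT ln(Kd / c°), with c° = 1 M, converts a free energy into a dissociation constant. The sketch below applies that relation; the function names are illustrative. Docking scores are only rough approximations of ΔG, so the resulting values are better used for ranking variants than as measured affinities.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dissociation_constant(delta_g_kcal, temperature_k=298.15):
    """Kd in mol/L from a binding free energy in kcal/mol (1 M standard state)."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


def affinity_fold_change(ddg_kcal, temperature_k=298.15):
    """Kd(variant) / Kd(reference) for ddG = dG(variant) - dG(reference).

    Values above 1 mean the variant binds more weakly than the reference.
    """
    return math.exp(ddg_kcal / (R_KCAL * temperature_k))


# A binding free energy of -10 kcal/mol corresponds to a Kd of a few
# tens of nanomolar at room temperature.
kd = dissociation_constant(-10.0)
# A variant that loses 1 kcal/mol of binding energy relative to the reference.
fold = affinity_fold_change(1.0)
```

Because the relation is exponential, a difference of about 1.4 kcal/mol at room temperature corresponds to roughly a tenfold change in Kd, so small errors in estimated free energies become large uncertainties in predicted affinity.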
Comparison with Traditional Phylogenetic and Machine Learning Approaches
Traditional phylogenetic analysis reconstructs viral evolutionary relationships and may infer host ancestry through tree topology [3, 5]. However, recombination and convergent evolution can obscure these signals, and phylogenetic methods do not directly capture functional constraints at the molecular level [1]. Classical machine learning approaches that use hand-crafted features (e.g., amino acid composition, codon usage bias, or epitope sequence motifs) have been applied to predict host tropism [13, 14]. These models often achieve moderate accuracy but struggle to generalize to novel viruses because their features are fixed in advance rather than learned from the data [22]. In contrast, PLMs automatically learn relevant features from the sequence itself, capturing long-range interactions and structural information without manual feature engineering [7, 30]. Benchmark studies have reported that PLM-based classifiers outperform both phylogenetic and feature-based models on tasks such as distinguishing avian from human influenza strains and predicting coronavirus host origin [20, 6]. A comparison is presented in Table 1.
| Method | Input | Feature Engineering | Generalizability | Computational Cost |
|---|---|---|---|---|
| Phylogenetic (e.g., ML tree) | Aligned sequence | Manual alignment | Moderate | High |
| Classical ML (e.g., random forest) | Hand-crafted features | Manual | Low | Low |
| Protein Language Model (PLM) | Raw sequence | Automatic (embedding) | High | Moderate |
Table 1. Comparison of traditional and PLM-based approaches for predicting viral host tropism.
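For contrast with learned embeddings, the sketch below computes amino acid composition, one of the hand-crafted features used by the classical machine learning row of Table 1. The function name and the choice to skip non-standard residue codes are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies.

    Characters outside the 20 standard amino acids (e.g. X, gaps) are
    skipped, so the frequencies sum to 1 over the recognized residues.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    recognized = 0
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
            recognized += 1
    if recognized == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / recognized for aa in AMINO_ACIDS]


# Toy sequence: two alanines, one cysteine, and one unknown residue (X).
features = amino_acid_composition("AAXC")
```

This feature vector ignores residue order entirely, so two sequences with the same residues in a different order are indistinguishable. Order and context are what self-attention layers are designed to capture.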
Integration with 3D Protein Viewer
The residue-level attention weights produced by PLMs can be mapped onto three-dimensional structures of viral glycoproteins, whether predicted by tools such as AlphaFold2 or determined experimentally [8, 33]. Interactive visualization allows researchers to inspect which regions of the spike protein are most predictive of host tropism. For example, if the model attends strongly to residues in the receptor-binding domain (RBD) of a coronavirus spike, those residues can be highlighted in a 3D viewer, facilitating hypothesis generation about changes at the binding interface [4, 35]. This integration links abstract sequence embeddings to concrete structural biology, enabling a more intuitive understanding of zoonotic risk [9, 15].
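One common way to display per-residue scores in a structure viewer is to write them into the B-factor column of a PDB file, which most viewers can use for coloring. The sketch below does this for fixed-width ATOM and HETATM records. The function names are hypothetical, and it assumes that the residue numbers in the structure match the sequence positions used by the model and that the scores fit the six-character B-factor field.

```python
def make_atom_line(serial, res_num, x, y, z, b_factor=0.0):
    """Build a minimal fixed-width PDB ATOM record (CA of a glycine, chain A).

    Only used here to create toy input without a structure file.
    """
    return (
        "ATOM  " + f"{serial:5d}" + " " + " CA " + " " + "GLY" + " " + "A"
        + f"{res_num:4d}" + "    "
        + f"{x:8.3f}{y:8.3f}{z:8.3f}"
        + f"{1.0:6.2f}{b_factor:6.2f}"
    )


def paint_scores(pdb_text, scores, default=0.0):
    """Overwrite the B-factor field (columns 61-66) with per-residue scores.

    scores: dict mapping residue number -> score (e.g. attention weight).
    Residues without a score receive `default`.
    """
    painted = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            padded = line.ljust(66)
            res_num = int(padded[22:26])  # residue number, columns 23-26
            value = scores.get(res_num, default)
            line = padded[:60] + f"{value:6.2f}" + padded[66:]
        painted.append(line)
    return "\n".join(painted)


# Toy structure with two residues; residue 11 gets a high attention weight.
toy_pdb = "\n".join([
    make_atom_line(1, 10, 0.0, 0.0, 0.0),
    make_atom_line(2, 11, 1.5, 0.0, 0.0),
])
painted_pdb = paint_scores(toy_pdb, {11: 0.87})
```

The painted file can then be opened in a viewer and colored by B-factor. Attention weights from different layers and heads are not directly comparable, so the way they are averaged or normalized before painting should be reported alongside the figure.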
Limitations and Future Directions
Despite their promise, PLMs have limitations. Training data may be biased toward well-studied viruses and hosts, reducing performance for less-characterized pathogens [7, 10]. The embeddings are not inherently interpretable, although attention weights provide some insight [16]. The computational cost of running large transformers can be significant [17]. Moreover, sequence-based predictions may not capture post-translational modifications or glycan shielding that affect receptor binding [19, 21]. Future developments will likely integrate multiple data modalities, including structure, metadata, and host gene expression, into unified foundation models [23, 14]. Hybrid approaches that combine PLMs with physics-based simulations are also promising for improving prediction accuracy [24, 6]. Finally, rigorous prospective validation using experimental data from emerging viruses is needed to confirm the reliability of these tools for veterinary surveillance [22, 35].
Conclusion
Protein language models represent a significant advance in computational virology for predicting viral host tropism and zoonotic potential. By leveraging deep contextual embeddings from viral glycoprotein sequences, these models can rapidly assess the risk of cross-species transmission. Integration with receptor-binding dynamics simulations and 3D structural visualization provides a comprehensive framework for risk assessment. As the availability of viral sequence data grows and PLM architectures continue to evolve, these tools will become increasingly valuable for veterinary diagnostics and early warning of zoonotic spillover [1–9, 11–14, 16–35].
References
[1] Asokan S, Ts CS, Choudekar A et al. Avian influenza H5N1: A warning signal for the next influenza pandemic. Diagn Microbiol Infect Dis. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42365786/
[2] Zhou X, Sun L, Yang J et al. Genetic and biological characterization of a reassortant H3N2 swine influenza virus isolated in China with internal genes from the 2009 pandemic H1N1. BMC Microbiol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42365259/
[3] Brennan RN, Palma RE, Paulson SL et al. Contrasting geographic patterns of parasite and hantavirus diversity in the rodent Oligoryzomys longicaudatus (Rodentia, Cricetidae). PLoS Negl Trop Dis. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42361152/
[4] Sootichote R, Chamkasem A, Toniti W et al. Screening candidate intermediate hosts for porcine respiratory coronavirus using molecular docking. Comp Immunol Microbiol Infect Dis. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42361779/
[5] Mundt B, Kant R, Grzybek M. Viral pathogens in urban rats: A one health systematic review of global surveillance evidence. One Health. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42359168/
[6] Liu Y, Zhang G, Gao H et al. Metavirome Analysis of Viruses Car
[7] Rikhi N, Sei CJ, Fraser KA et al. Immunogenicity of a Multi-Epitope Influenza Composite Peptide Vaccine Targeting Human, Swine, and Avian Viruses: Advancing Pandemic Preparedness. Influenza Other Respir Viruses. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42365971/
[8] Kant N, Bharti AK, Verma SK. Proteome-Scale Mining of Metal-Associated Proteins of Monkeypox Virus. Proteomics. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42359551/
[9] Ji S, Zhang N, Hu X. Reverse cardio-oncology: neuroendocrine axis activation and cardiovascular-disease-derived factors synergistically remodel the tumor microenvironment. Cell Oncol (Dordr). 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42366299/
[10] Sebastian SL, Maria Susana CP, Huber Said PZ et al. Lessons in the Wake of a Lassa fever case in the Midwest U.S: epidemiology, management and preparedness gaps. BMC Infect Dis. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42365235/
[11] Lindhorst ZTL, Unterköfler MS, Solarczyk P et al. Detection of zoonotic protozoa in raccoons (Procyon lotor) from aquaculture zones in Saxony (Germany): One health perspective. One Health. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42359169/
[12] Tucciarone CM, Franzo G, Pasotto D et al. Epidemiological Survey of DNA Viruses in Non-Native Pond Sliders (Trachemys scripta) in Northeastern Italy. Viruses. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42357685/
[13] Takahashi M, Kawakami M, Sato Y et al. Nationwide Detection and Molecular Characterization of Hepatitis E Virus RNA in Retail Pork Meat in Japan. Viruses. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42357631/
[14] Youn SY, Lee HS, Yoo MS et al. Tick Microbiome and Its Role in Emerging Zoonotic Diseases and Transmissibility. Microorganisms. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42354905/
[15] Wu S, Xie Y, Zhang Y et al. Tract-based χ-separation imaging differentiates multiple sclerosis from neuromyelitis optica spectrum disorder by mapping iron and myelin signatures. BMC Med Imaging. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42365292/
[16] Li Y, Zhu H, Li G et al. Computed tomography findings and severity scores in Chlamydia psittaci pneumonia: a retrospective study of 69 cases with clinical correlation. BMC Infect Dis. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42365231/
[17] Li J, Huang K, Wang H et al. Targeting CCR7-KMT2D enhances CAR-T cell efficacy by suppressing therapy-induced senescence in B-cell non-Hodgkin lymphoma. BMC Med. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42363110/
[18] El-Husseini DM, Arafa FM, Elmasry DMA et al. Antiparasitic activity of peppermint and lavender essential oil nano-emulsions against Toxoplasma gondii RH strain in vitro and in vivo. Vet Res Commun. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42360597/
[19] Huang Y, Xu Q, Zheng L et al. The implications of FASN in viral infection and related diseases: a promising target in antiviral therapies. Front Cell Infect Microbiol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42359008/
[20] Rodriguez-Muñoz A, Martínez-Rojas R, Gárate I et al. Experimental pathogenicity of Skrjabinisakis physeteris in Wistar rats and its occurrence in Sarda chiliensis from Peru: Implications for food safety and zoonotic risk. Food Waterborne Parasitol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42358649/
[21] Tian T, Wang X, Zhu Y et al. MALAT1-miR-20b-5p-P2RX7 Axis Regulates Mycobacterium bovis-Induced THP-1 Pyroptosis. Vet Sci. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42357743/
[22] Yamik DY, Vongkamjan K, Guyonnet V et al. Bacteriophages as Potential Sustainable Alternatives to Antibiotics for Controlling Salmonella in the Poultry Value Chain. Antibiotics (Basel). 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42353751/
[23] Morovati Moez N, Arabestani MR, Taheri M et al. Nanostructured Lipid Carriers Co-Loaded with Doxycycline, Gentamicin, and Thymol for Enhanced Intracellular Antibacterial Activity Against Brucella melitensis. Int J Nanomedicine. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42358462/
[24] Al-Kahtani SN, Rawwash AA, Semmar A et al. Potential Effects of Bee Products Against Hantavirus Infection: Potential Mechanisms of Action and Future Directions. Life (Basel). 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42355521/
[25] Mostafavi N, Tian A, Gao Y et al. A Novel LAS1L Gene Mutation Associated with Impaired Growth and Developmental Delay and a Review with Previously Reported Cases. Genes (Basel). 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42353867/
[26] Klaas C, Hoogstra S, Mahoney D et al. Tracking Extended-Spectrum β-Lactamase-Producing Escherichia coli Across Human Communities and Dairy Ecosystems: A One Health Investigation. Antibiotics (Basel). 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42353712/
[27] Díaz EA, Sáenz C, Guzmán D et al. Leptospira in Working Horses From Rural Ecuador: A Neglected Occupational Risk. J Trop Med. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42358230/
[28] Gupta MD, Shaha M, Islam M et al. Draft genomes of two multidrug-resistant Enterobacter bugandensis sp. from African grey parrots in Bangladesh. Microbiol Resour Announc. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42363844/
[29] Al-Nazawi AM, Khan M, Qadir A et al. Diagnostic challenges in COVID 19 and dengue co-infection: a case series report from tertiary care centers of Saudi Arabia. Front Med (Lausanne). 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42359066/
[30] Alvites R, Sá S, Lei MC et al. Biosafety implications of cadaver preservation methods in veterinary anatomy education. Vet Res Commun. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42360367/