Predicting Viral Host Range and Zoonotic Potential Using Machine Learning on Spike Protein Structures
Introduction
The ability of a virus to cross species barriers and infect new hosts is governed by a complex interplay of molecular, ecological, and immunological factors [1, 2]. Among these, the viral spike protein, the primary determinant of host cell entry, plays a central role in defining host range and zoonotic potential [3, 4]. Spike proteins mediate attachment to host cell surface receptors and drive membrane fusion, making their structural features critical for predicting cross-species transmission [5, 6]. Advances in machine learning (ML) have enabled researchers to leverage three-dimensional (3D) structural data from the Protein Data Bank (PDB) to predict host range and zoonotic risk with increasing accuracy [7, 8]. This article reviews the application of ML models trained on viral spike protein structures, focusing on structural features, receptor-binding dynamics, and computational scoring of binding affinity.
Understanding the biophysical basis of spike–receptor interactions is essential for assessing pandemic potential [9, 10]. For example, the receptor-binding domain (RBD) of coronavirus spike proteins determines compatibility with angiotensin‑converting enzyme 2 (ACE2) orthologs across species [11, 12]. Similarly, the hemagglutinin (HA) of influenza A viruses recognizes sialic acid receptors with distinct linkages, dictating host tropism [3, 4]. ML models trained on 3D structural data can capture subtle conformational features that sequence‑based methods may miss, offering a powerful tool for veterinary surveillance and one‑health preparedness [13, 14].
Structural Features of Spike Proteins Relevant to Host Range
Spike proteins are class I fusion glycoproteins characterized by a metastable prefusion conformation that undergoes large conformational rearrangements upon receptor binding [2, 12]. Key structural features influencing host range include:
- Receptor-binding site (RBS) geometry: The shape and electrostatic complementarity between the RBS and host receptor [3, 9].
- Glycan shielding: N‑linked glycosylation patterns can mask epitopes and modulate receptor accessibility [12, 15].
- Loop flexibility: Mobile loops in the RBD facilitate adaptation to diverse receptors [6, 11].
- Interface hydrogen bonding and hydrophobic contacts: These determine binding affinity and specificity [4, 5].
For coronaviruses, the RBD exists in either “standing‑up” or “lying‑down” conformations relative to the spike trimer; the standing conformation exposes the receptor-binding motif (RBM) for ACE2 engagement [9, 10]. In influenza hemagglutinin, the RBS is a shallow pocket that binds α2,3‑ or α2,6‑linked sialic acids, with avian‑adapted viruses preferring the former and human‑adapted viruses the latter [3, 4]. Computational structural analysis can identify species‑specific receptor preferences and predict host‑range promiscuity [3].
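Of the features listed in this section, interface contacts are among the simplest to compute once coordinates are available. The sketch below is a minimal illustration, not a method from the cited studies: it assumes atoms are already parsed into tuples of chain, residue number, and coordinates, and it uses a hypothetical 4.5 Å heavy-atom cutoff and toy coordinates that do not come from any PDB entry.

```python
import math

# Each atom is (chain_id, residue_number, x, y, z), coordinates in angstroms.
# This tuple layout is an assumption made for the example.

def interface_contacts(atoms, chain_a, chain_b, cutoff=4.5):
    """Return sorted residue pairs (res_a, res_b) that have at least one
    atom pair within `cutoff` angstroms across the two chains."""
    side_a = [a for a in atoms if a[0] == chain_a]
    side_b = [a for a in atoms if a[0] == chain_b]
    contacts = set()
    for _, res_a, xa, ya, za in side_a:
        for _, res_b, xb, yb, zb in side_b:
            if math.dist((xa, ya, za), (xb, yb, zb)) <= cutoff:
                contacts.add((res_a, res_b))
    return sorted(contacts)

# Toy coordinates: chain "S" stands in for the spike, chain "R" for the receptor.
toy_atoms = [
    ("S", 1, 0.0, 0.0, 0.0),
    ("S", 2, 10.0, 0.0, 0.0),
    ("R", 10, 3.0, 0.0, 0.0),
    ("R", 11, 12.0, 3.0, 0.0),
    ("R", 12, 50.0, 50.0, 50.0),
]

contacts = interface_contacts(toy_atoms, "S", "R")
print(contacts)  # [(1, 10), (2, 11)]
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for a single interface but would normally be replaced by a spatial index when screening many complexes.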
Computational Scoring of Binding Affinity
Quantifying the strength of spike–receptor interactions is a cornerstone of host‑range prediction. Several computational approaches are used to score binding affinity:
- Molecular docking: Rigid‑body or flexible docking algorithms estimate binding poses and scores based on shape complementarity and electrostatic terms [3, 11].
- Free‑energy perturbation (FEP): Alchemical methods calculate relative binding free energies between receptor variants. They are generally more accurate than docking scores for individual mutations, at a much higher computational cost [11, 12].
- Machine‑learned scoring functions: Regression models (e.g., random forest, support vector regression) trained on experimental binding affinities can predict ΔΔG values for unseen spike‑receptor pairs [8, 14].
- Protein language models: Embeddings from transformer‑based architectures capture evolutionary and structural information to predict host tropism directly from sequence or structure [6, 7].
These methods are often integrated into pipelines that score thousands of spike variants against panels of host receptors. For example, a study using convolutional neural networks on viral protein sequence patterns demonstrated the ability to predict species crossover events with high sensitivity [8]. Another framework combined structural features with ecological data to prioritize cross‑species transmission risks across an expansive host landscape [10].
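The ΔΔG values mentioned above relate directly to measured dissociation constants through ΔG = RT ln(Kd). The following sketch shows that arithmetic under stated assumptions: a 1 M standard state, a temperature of 298.15 K, and the sign convention that a positive ΔΔG means weaker binding than the reference. The function names and Kd values are illustrative only, and some studies use the opposite sign convention.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from a Kd given in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_molar)

def delta_delta_g(kd_variant, kd_reference, temperature_k=298.15):
    """ddG = dG(variant) - dG(reference); positive means weaker binding."""
    return (binding_free_energy(kd_variant, temperature_k)
            - binding_free_energy(kd_reference, temperature_k))

# Hypothetical values: the variant binds ten times more weakly than the reference.
kd_reference = 10e-9   # 10 nM
kd_variant = 100e-9    # 100 nM

ddg = delta_delta_g(kd_variant, kd_reference)
print(f"ddG = {ddg:+.2f} kcal/mol")  # about +1.36 kcal/mol
```

A tenfold change in Kd therefore corresponds to roughly 1.4 kcal/mol at room temperature, which gives a sense of the precision a scoring function needs in order to resolve host-specific differences.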
Machine Learning Models for Host Range Prediction
ML models applied to spike protein structures can be broadly categorized by input representation and learning paradigm:
Feature‑based models
- 3D structural descriptors: Solvent‑accessible surface area, electrostatic potential, hydrophobicity, and shape indices are computed from PDB structures and fed into classifiers such as random forest or gradient‑boosted trees [3, 12].
- Example: Random forest models using predicted secondary structure elements and N‑glycosylation features achieved approximately 98% accuracy in identifying spike proteins from respiratory virus sequences [12].
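As a rough illustration of how descriptors of this kind might be derived, the sketch below counts N‑linked glycosylation sequons (N-X-S/T, where X is not proline) and computes two simple physicochemical summaries from a sequence. It is not the feature set of the cited study: the feature names, the crude net-charge estimate, and the toy sequence are assumptions made for the example. The hydropathy values are the standard Kyte-Doolittle scale.

```python
import re

# Kyte-Doolittle hydropathy scale (higher values are more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# N-X-S/T with X != P; the lookahead allows overlapping sequons to be found.
SEQUON = re.compile(r"N(?=[^P][ST])")

def sequon_positions(seq):
    """1-based positions of the asparagine in each candidate sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(seq)]

def mean_hydropathy(seq):
    """Average Kyte-Doolittle value over recognized residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0

def net_charge(seq):
    """Crude estimate: (K + R) - (D + E); ignores histidine and termini."""
    return sum(seq.count(aa) for aa in "KR") - sum(seq.count(aa) for aa in "DE")

def describe(seq):
    """Collect the descriptors into a feature dictionary."""
    seq = seq.upper()
    positions = sequon_positions(seq)
    return {
        "sequon_positions": positions,
        "n_sequons": len(positions),
        "mean_hydropathy": mean_hydropathy(seq),
        "net_charge": net_charge(seq),
    }

toy_sequence = "MNGSANPTKNKTDE"  # invented peptide, not a real spike fragment
features = describe(toy_sequence)
print(features["sequon_positions"])  # [2, 10]; N-P-T at position 6 is excluded
```

A dictionary like this, computed per protein, is the kind of fixed-length input that tree-based classifiers expect. A sequon only marks a potential glycosylation site, so occupancy would still need experimental or predicted support.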
Deep learning models
- Convolutional neural networks (CNNs): Voxelized representations of 3D protein structures (e.g., atomic density grids) allow CNNs to learn spatial patterns associated with host binding [8].
- Graph neural networks (GNNs): Representing spike glycoproteins as graphs (nodes = residues, edges = spatial proximity) enables learning of residue‑level contributions to host range [6].
- Transformer‑based protein language models: Pretrained on large sequence databases, these models can be fine‑tuned on structural features to predict host taxa with high accuracy [7, 14].
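The graph representation described above (nodes = residues, edges = spatial proximity) can be sketched in a few lines. This is a minimal example under stated assumptions: one Cα coordinate per residue, a hypothetical 8 Å distance cutoff, and invented coordinates. A real GNN pipeline would add node and edge features and pass the graph to a deep learning library.

```python
import math

def residue_graph(ca_coords, cutoff=8.0):
    """Build an undirected residue graph as an adjacency dictionary.

    ca_coords: list of (x, y, z) C-alpha coordinates, one per residue.
    Two residues are connected when their C-alpha atoms lie within `cutoff`.
    """
    n = len(ca_coords)
    adjacency = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                adjacency[i].append(j)
                adjacency[j].append(i)
    return adjacency

# Toy chain: three residues spaced 3.8 angstroms apart plus one distant residue.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]

graph = residue_graph(toy_coords)
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
print(graph)    # {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}
print(degrees)  # node degree is one simple per-residue feature
```

The choice of cutoff controls graph density: a tighter cutoff keeps mostly direct contacts, while a looser one lets message passing reach second-shell residues in fewer layers.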
Transfer learning and multimodal approaches
- Combining structural, sequence, and ecological data (e.g., host phylogeny, geographic range) improves generalizability [1, 10]. A unified framework that integrates these modalities has been shown to outperform single‑data‑type models [10].
The table below summarizes representative ML approaches and their reported performance for host‑range prediction tasks.
| Model type | Input data | Task | Performance metric | Reference |
|---|---|---|---|---|
| Random forest | Secondary structure + glycosylation | Spike vs. non‑spike classification | ~98% accuracy | [12] |
| Convolutional neural network | Viral protein sequence patterns | Species crossover prediction | AUC > 0.90 | [8] |
| Gradient‑boosted trees | 3D structural descriptors | Host‑range promiscuity | F1‑score 0.85 | [3] |
| Transformer language model | Sequence + structure embeddings | Host taxon prediction | Top‑1 accuracy 78.3% | [7] |
| Graph neural network | Residue contact maps | Receptor binding affinity | Pearson r = 0.82 | [6] |
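For readers less familiar with the metrics in the table, the sketch below shows how two of them, the F1‑score and Pearson r, are conventionally computed. The labels and affinity values are invented for illustration, and the cited studies may use different variants (for example, macro-averaged F1 across several host classes).

```python
import math

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Toy classification labels (1 = infects the host, 0 = does not).
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
f1 = f1_score(y_true, y_pred)

# Toy regression values: predictions that are perfectly linear in the measurements.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(measured, predicted)

print(round(f1, 3), round(r, 3))  # 0.667 1.0
```

The regression example also shows a limitation of Pearson r: the predictions are systematically twice the measured values yet still correlate perfectly, so correlation alone does not establish calibrated affinities.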
Workflow for Spike‑Protein‑based Zoonotic Risk Prediction
The following Mermaid diagram illustrates a typical computational pipeline that integrates structural data, ML models, and risk scoring.
flowchart TD
A[PDB structural data<br>spike proteins + receptors] --> B[Feature extraction<br>geometric, electrostatic, glycan]
B --> C[Machine learning model<br>random forest / GNN / transformer]
C --> D[Binding affinity prediction<br>ΔΔG / docking scores]
D --> E[Host range scoring<br>species compatibility matrix]
E --> F[Zoonotic risk ranking<br>high / moderate / low]
C -.-> G[Model validation<br>cross‑validation, bootstrapping]
G -.-> C
B -.-> H[3D visualization<br>interactive protein viewer]
H -.-> F
The pipeline begins with retrieval of experimentally determined 3D structures from the PDB, supplemented where necessary by computationally predicted models. Structural features are extracted and used to train or fine‑tune ML models. Binding affinities are then predicted for multiple host receptor orthologs (e.g., ACE2 from bats, swine, and ferrets). The resulting scores are compiled into a host‑range compatibility matrix, which informs a zoonotic risk ranking [3, 10]. Interactive 3D visualization of spike–receptor complexes aids interpretation of key interface residues [11, 15].
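The last two stages of the pipeline, compiling a compatibility matrix and turning it into a risk tier, can be sketched as follows. Everything specific here is an assumption for illustration: the variant names, the predicted ΔΔG values, the 0.5 kcal/mol compatibility threshold, and the tiering rule are hypothetical and are not taken from the cited frameworks.

```python
def compatible_hosts(row, threshold=0.5):
    """Hosts whose predicted ddG (kcal/mol, relative to a reference complex)
    is at or below the threshold; lower values mean tighter predicted binding."""
    return sorted(host for host, ddg in row.items() if ddg <= threshold)

def risk_tier(row, threshold=0.5, breadth=3):
    """Hypothetical rule: 'high' if the human receptor is compatible and the
    host range is broad, 'moderate' if only one of the two holds, else 'low'."""
    hosts = compatible_hosts(row, threshold)
    human = "human" in hosts
    broad = len(hosts) >= breadth
    if human and broad:
        return "high"
    if human or broad:
        return "moderate"
    return "low"

# Compatibility matrix: variant -> host -> predicted ddG (invented numbers).
matrix = {
    "variant_A": {"human": -0.4, "bat": -1.2, "swine": 0.1, "ferret": 0.3},
    "variant_B": {"human": 1.8, "bat": -0.9, "swine": 0.2, "ferret": 0.4},
    "variant_C": {"human": 2.1, "bat": -0.5, "swine": 1.9, "ferret": 2.4},
}

ranking = {name: risk_tier(row) for name, row in matrix.items()}
for name, tier in ranking.items():
    print(name, compatible_hosts(matrix[name]), tier)
```

In this toy example variant_A is ranked high, variant_B moderate, and variant_C low. In practice the threshold and tiering rule would need calibration against experimentally confirmed host ranges rather than being fixed by hand.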
Implications for Pandemic Preparedness and Surveillance
Accurate prediction of host range and zoonotic potential has direct applications in veterinary medicine and public health:
- Surveillance of wildlife reservoirs: Structural screening of spike proteins from bat coronaviruses or avian influenza viruses can identify strains with human‑adapted receptor preferences [3, 9]. For example, computational structural analysis predicted host‑range promiscuity in North American H5N1 lineages, raising concerns about mammalian adaptation [3].
- Prioritizing livestock and companion animal monitoring: Species predicted to be susceptible (e.g., by ACE2 sequence similarity analysis) can be targeted for serological and molecular surveillance [9]. The same approach can help assess the risk of spillback from humans into domestic animals and of subsequent spillover back to humans.
- Vaccine and therapeutic design: Anticipating which spike variants may emerge in livestock or wildlife allows proactive development of veterinary vaccines and antiviral strategies [5, 4]. ML‑guided design has been used to predict combinatorial effects of mutations in the RBD [11].
- One‑health risk communication: Quantitative risk scores generated by these models can inform regulatory agencies and veterinary practitioners about high‑priority pathogens [1, 10].
The concept of a “spike protein structural signature” that correlates with cross‑species transmission has been explored for several virus families [2, 14]. Integrating such signatures into real‑time genomic surveillance systems could strengthen early warning capabilities.
Conclusion
Machine learning applied to viral spike protein structures offers a robust framework for predicting host range and zoonotic potential. By incorporating detailed biophysical features, receptor‑binding dynamics, and advanced computational scoring, these models can identify high‑risk viruses before they emerge in new hosts. Continued improvements in structural databases, protein language models, and multimodal data integration promise to refine these predictions further. For veterinary virology and diagnostic practice, such tools represent a critical advance in pandemic preparedness and animal health surveillance.
References
[1] Ni XB, Ye YT, Wang GP, et al. Ecological factors and genetic features are associated with ecological generalism in pathogenic tick-borne viruses. Nat Commun. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42115170/
[2] Leprince A, Somerville V, Addablah AA, et al. Phage host range: determinants, dynamics and applications. Nat Rev Microbiol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42026225/
[3] Guirales-Medrano S, Ocaña K, Obeid K, et al. Computational Structural Analysis Predicts Host-Range Promiscuity and Antiviral Resistance in North American H5N1 Lineages. Comput Struct Biotechnol J. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42110215/
[4] Poitras C, Coulombe B. AI-Powered Identification of Human Cell Surface Protein Interactors of the Hemagglutinin Glycoprotein of High-Pandemic-Risk H5N1 Influenza Virus. Viruses. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/41472307/
[5] Gerodez A, Dos Santos M, Attia M, et al. Toward Predicting Pandemic Potential: A Comparative Analysis of Virus-Host Interactions Between Diverse Influenza A Viruses and the Human Innate Immune System. Proteomics. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42319249/
[6] Beltrán JF, Belén LH, Parraguez-Contreras F, et al. Protein language models enable accurate viral host range prediction. Sci Rep. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41741511/
[7] Carbajo AL Jr, Vensko TA, Pellett PE. Sequence based virus host prediction: a curated dataset and generalizable framework for training artificial intelligence to identify viruses of humans. Virus Evol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41958479/
[8] Serage RA, Nyirenda CN, Omomule TG, et al. Zoon0PredV: Potential Virus Species Crossover Prediction Using Convolutional Neural Networks and Viral Protein Sequence Patterns. Bioinform Biol Insights. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41768139/
[9] Frank JA, Gan EX, Hooper WB, et al. Systematic multi-reference vertebrate ACE2 sequence similarity analysis predicts species susceptibility to SARS-related sarbecoviruses. Sci Rep. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41851226/
[10] Zhao D, Wang YF, Yin ZF, et al. A Unified Framework to Prioritize RNA Virus Cross-Species Transmission Risk Across an Expansive Host Landscape. Viruses. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41754554/
[11] Liu Y, He Z, Jia L, et al. Predicting Natural Evolution in the RBD Region of the Spike Glycoprotein of SARS-CoV-2 by Machine Learning. Viruses. 2024. URL: https://www.semanticscholar.org/paper/c6f9d3492f136405b7b6957235c1edbe23ca0bea
[12] Demidkin S, Shwarts M, Chakravarty A, et al. Machine Learning for the Identification of Viral Attachment Machinery from Respiratory Virus Sequences. bioRxiv. 2022. URL: https://www.semanticscholar.org/paper/86da9925516584622a70fae2bcb381a0d8b5fc17
[13] Pérez JG, Rickett NY, Günther S, et al. Identification of host gene transcripts by machine learning and their application to predict outcome in Ebola virus disease. J Infect Dis. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42308527/
[14] Du Z, Li M, Lin K, et al. High-resolution phage-host assignment through key proteins using large language models. Nat Commun. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41862452/
[15] Malnak JC, Montermoso S, Bushman FD, et al. Uncovering viral protein acquisition events and human-specific folds with pairwise comparisons of predicted protein structures. Mol Biol Evol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42041085/
***
Disclaimer: This article is for educational and informational purposes only. It is not intended to substitute for professional veterinary advice, diagnosis, treatment, or regulatory guidance. Always consult a licensed veterinarian or qualified specialist regarding animal health, disease diagnosis, and therapeutic decisions.