Biological Foundation Models for Predicting Host Tropism in Emerging Zoonotic Viruses

Abstract

Predicting host tropism in emerging zoonotic viruses is a critical frontier in veterinary virology and One Health surveillance. Biological foundation models, a class of deep learning architectures pre-trained on large-scale biomolecular sequence and structural data, offer a way to infer viral host range from genomic features alone. This article reviews the biological, biophysical, and computational principles underlying these models. It examines the molecular determinants of host tropism, the architecture of transformer-based and graph neural network foundation models, training data requirements, and validation strategies specific to veterinary contexts. A case study of influenza A virus host prediction using random forest methods is presented alongside a discussion of limitations, including data sparsity in non-model host species and the challenge of representing conformational epitopes. The article concludes with a workflow diagram for integrating foundation model outputs into routine veterinary diagnostic and surveillance pipelines.

1. Introduction

Zoonotic virus emergence is fundamentally a problem of host tropism: the set of cellular and organismal determinants that permit viral entry, replication, and transmission in a given species. For veterinary medicine, the ability to predict whether a novel virus isolated from a wildlife reservoir, a livestock sentinel, or an avian population can productively infect domestic animal species is paramount for outbreak preparedness. Traditional approaches to host tropism determination rely on experimental inoculation studies, receptor binding assays, and phylogenetic analyses of viral sequences. These methods are labor-intensive, require high-containment facilities, and often lag behind viral discovery.

Biological foundation models address this latency by learning distributed representations of biological sequences that capture evolutionary, structural, and functional constraints. These models, typically based on transformer architectures or graph neural networks, are pre-trained on massive corpora of viral and host genomes, then fine-tuned on smaller, labeled datasets of known host-virus associations. The resulting predictors can generalize to novel viruses, provided the input features are sufficiently informative. This review focuses on the veterinary applications of such models, with particular emphasis on influenza A viruses, coronaviruses, and paramyxoviruses.

2. Biophysical Determinants of Host Tropism

Host tropism is governed at multiple molecular scales: viral attachment proteins, host receptor availability, intracellular host factor compatibility, and innate immune evasion. The most tractable scale for computational modeling is the viral glycoprotein-host receptor interface.

2.1 Receptor Binding Specificity

Viral entry requires specific interactions between viral surface glycoproteins and host cell surface receptors. For influenza A viruses, the hemagglutinin (HA) protein binds to sialic acid residues linked to galactose via either alpha-2,3 or alpha-2,6 linkages. Avian influenza viruses typically bind preferentially to alpha-2,3-linked sialic acids, which predominate in the avian intestinal and respiratory tracts. Human- and swine-adapted influenza viruses preferentially bind alpha-2,6-linked sialic acids, which are found in the upper respiratory tract of these hosts [1].

The molecular basis for this specificity resides in the receptor binding site (RBS) of HA, a shallow pocket formed by the 190-helix, the 130-loop, and the 220-loop. Single amino acid substitutions such as Gln226Leu or Gly228Ser (H3 numbering) can switch receptor preference from avian-type to mammalian-type sialic acids. Foundation models trained on HA sequences must therefore capture these subtle structural correlates.
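
As a minimal illustration of the signal a model has to recover, the sketch below flags the two substitutions named above. It assumes the caller already has HA residues keyed by H3-numbered position; the helper name `receptor_markers` and the input format are hypothetical, and the alignment step needed to obtain H3 numbering from a raw sequence is not shown.

```python
# Hypothetical helper: flag the receptor-switch markers named in the text.
# Keys are H3-numbered HA positions; values are (avian residue, mammalian residue).
RECEPTOR_MARKERS = {226: ("Q", "L"), 228: ("G", "S")}  # Gln226Leu, Gly228Ser


def receptor_markers(residues_by_position):
    """Classify each marker position as avian-type, mammalian-type, or other.

    residues_by_position: dict mapping H3-numbered position -> one-letter residue.
    Positions missing from the input are reported as "other".
    """
    calls = {}
    for position, (avian, mammalian) in RECEPTOR_MARKERS.items():
        residue = residues_by_position.get(position)
        if residue == avian:
            calls[position] = "avian-type"
        elif residue == mammalian:
            calls[position] = "mammalian-type"
        else:
            calls[position] = "other"
    return calls
```

A rule of this kind covers only the listed positions; the point of a learned model is to pick up such markers, and their context, without a hand-written table.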

2.2 Host Protease Compatibility

Cleavage of viral fusion proteins by host proteases is a post-binding requirement for membrane fusion. Influenza HA requires cleavage by trypsin-like proteases such as transmembrane serine protease 2 (TMPRSS2) in mammals or matriptase in avian tissues. Similarly, coronavirus spike proteins require priming by TMPRSS2 or cathepsins. Species-specific variation in protease expression and substrate specificity constrains viral tropism. Computational models that incorporate host protease gene expression profiles alongside viral sequence features may therefore achieve higher predictive accuracy than sequence-only models.

2.3 Intracellular Host Factors

After entry, viral replication depends on interactions with host proteins including importins, translation factors, and restriction factors such as Mx proteins and tetherin. For example, the avian influenza virus polymerase subunit PB2 requires host importin-alpha isoforms for nuclear import. The residue 627 in PB2, when changed from glutamic acid (avian) to lysine (mammalian), enhances replication in mammalian cells. Foundation models that integrate protein-protein interaction networks can learn these dependencies.

3. Biological Foundation Model Architectures

Biological foundation models for host tropism prediction fall into three broad architectural categories: sequence-based transformers, graph neural networks operating on protein structures, and hybrid models combining multiple data modalities.

3.1 Transformer-Based Models

Transformers, originally developed for natural language processing, have been adapted to biological sequences by tokenizing amino acids or nucleotides. The self-attention mechanism allows the model to learn long-range dependencies, such as interactions between distal residues in a viral glycoprotein. Pre-training is performed using masked language modeling on millions of viral sequences from public repositories including GenBank and GISAID.
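
The masked language modeling objective can be sketched as follows. This is a simplified illustration, not the procedure of any specific model: it tokenizes a protein at the residue level and hides a random subset of positions that the model would then be trained to recover. The 15% mask rate, the `<mask>` token, and the function name `mask_sequence` are assumptions made for the example.

```python
import random

MASK = "<mask>"  # assumed mask token; real models define their own vocabulary


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Tokenize a protein into residues and mask a random subset.

    Returns (tokens, targets). targets[i] holds the original residue at
    masked positions and None everywhere else.
    """
    tokens = list(seq)
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Always mask at least one position so every sequence yields a training signal.
    n_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    targets = [None] * len(tokens)
    for i in rng.sample(range(len(tokens)), n_mask):
        targets[i] = tokens[i]
        tokens[i] = MASK
    return tokens, targets
```

During pre-training the loss is computed only at positions where `targets` is not None, so the model must use the surrounding residues to predict the hidden ones.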

For host tropism prediction, the pre-trained transformer is fine-tuned using a classification head that outputs a probability distribution over host taxa. Common taxon groupings include avian, swine, canine, feline, equine, bovine, and human. The fine-tuning dataset requires careful curation to avoid leakage between training and test sets, which arises when closely related viruses that share both sequence similarity and host labels fall on both sides of the split.
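
One common way to limit this kind of leakage is to split by sequence cluster rather than by individual sequence. The sketch below assumes each record already carries a cluster identifier produced by an external sequence-identity clustering step (not shown); the record layout and the function name `grouped_split` are hypothetical.

```python
import random
from collections import defaultdict


def grouped_split(records, test_fraction=0.2, seed=0):
    """Split records so that no cluster appears in both train and test.

    records: list of (sequence_id, cluster_id, host_label) tuples, where
    cluster_id groups near-identical sequences (assumed precomputed).
    """
    clusters = defaultdict(list)
    for record in records:
        clusters[record[1]].append(record)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_test = max(1, round(test_fraction * len(cluster_ids)))
    test_ids = set(cluster_ids[:n_test])

    train = [r for c in cluster_ids if c not in test_ids for r in clusters[c]]
    test = [r for c in cluster_ids if c in test_ids for r in clusters[c]]
    return train, test
```

Because whole clusters are assigned to one side, a test sequence never has a near-duplicate with the same host label in the training set.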

3.2 Graph Neural Networks

Proteins are naturally represented as graphs where nodes correspond to residues and edges represent spatial proximity in the three-dimensional structure. Graph neural networks (GNNs) operate on these graphs by iteratively updating node representations based on neighbor information. For host tropism prediction, GNNs require either experimentally determined structures or high-confidence predicted structures from tools such as AlphaFold2 or Rosetta.
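
The graph construction and a single neighbor-aggregation step can be sketched as below. This is an illustrative simplification: edges are defined by a C-alpha distance cutoff (8 angstroms here, an assumed value), and the update is a plain mean over a node and its neighbors, whereas real GNN layers apply learned weights. The function names are hypothetical.

```python
import math


def contact_graph(ca_coords, cutoff=8.0):
    """Build an undirected residue graph from C-alpha coordinates.

    ca_coords: list of (x, y, z) tuples in angstroms, one per residue.
    Returns a dict mapping residue index -> set of neighbor indices.
    """
    n = len(ca_coords)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def message_pass(features, adjacency):
    """One round of mean aggregation over each node and its neighbors.

    features: list of equal-length feature vectors, one per residue.
    """
    updated = []
    for i, own in enumerate(features):
        group = [own] + [features[j] for j in sorted(adjacency[i])]
        updated.append([sum(column) / len(group) for column in zip(*group)])
    return updated
```

Repeating `message_pass` lets information travel further across the structure, which is how residues lining a binding pocket can influence one another's representations.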

GNNs have the advantage of explicitly modeling conformational features such as the geometry of the receptor binding pocket. However, they are computationally expensive and may not generalize to viruses with divergent structures not represented in the training set.

3.3 Hybrid Multimodal Models

Hybrid models combine sequence embeddings from transformers with structural embeddings from GNNs, and optionally include additional feature layers representing host receptor expression data, codon usage bias, or phylogenetic distances. These models typically use cross-attention mechanisms to fuse information across modalities. Performance gains from multimodal integration are most pronounced when sequence similarity alone is insufficient, as is the case for highly divergent viral families.
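
The fusion step can be illustrated with a single-head scaled dot-product cross-attention in which sequence embeddings act as queries and structural embeddings supply keys and values. This is a sketch under simplifying assumptions: the learned projection matrices, multiple heads, and residual connections used in practice are omitted, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Fuse two modalities: each query attends over all key/value pairs.

    queries: sequence-derived vectors; keys and values: structure-derived
    vectors (keys share the query dimension). Returns one fused vector per query.
    """
    dim = len(keys[0])
    width = len(values[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)])
    return fused
```

Each output vector is a weighted average of the structural values, with weights determined by how well the sequence query matches each structural key.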

4. Training Data Requirements and Curation

The performance of biological foundation models is fundamentally limited by the quality and coverage of training data. For host tropism prediction, the key data sources include:

  • Viral genome sequences: Complete coding sequences for glycoprotein genes, polymerase genes, and nucleoprotein genes.
  • Host labels: Experimentally confirmed or literature-curated host species annotations. Sources include the WHO, WOAH, and specialized databases such as Virus-Host DB.
  • Negative associations: Viruses that have been tested and found not to infect a given host. These are more difficult to obtain but critical for avoiding biased predictors.

4.1 Challenges in Veterinary Data

Several challenges are specific to veterinary datasets. First, many livestock species including sheep, goats, and camelids are underrepresented in viral sequence databases relative to poultry and swine. Second, asymptomatic infections in wildlife reservoirs are underreported, leading to ascertainment bias. Third, host range is often defined at the species level, but within-species variation in receptor expression (for example, different dog breeds or chicken lines) can influence tropism.

4.2 Data Augmentation Strategies

To address data sparsity, augmentation strategies include:

  • Phylogenetic imputation: Using ancestral state reconstruction to infer host labels for unsampled lineages.
  • Synthetic sequence generation: Using generative adversarial networks or variational autoencoders to create plausible viral glycoprotein variants.
  • Cross-family transfer learning: Pre-training on abundant viral families (e.g., orthomyxoviruses) and fine-tuning on sparse families (e.g., henipaviruses).
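
As one concrete form of ancestral state reconstruction, the bottom-up pass of Fitch parsimony can be sketched as follows. This is only one of several possible methods and is not attributed to any specific study; it assumes a strictly binary tree encoded as nested 2-tuples with host labels at the leaves, and the function name `fitch_states` is hypothetical.

```python
def fitch_states(tree):
    """Bottom-up Fitch pass on a binary tree of host labels.

    tree: either a host label (str) for a leaf, or a 2-tuple of subtrees.
    Returns the set of most-parsimonious host states at the root of `tree`.
    """
    if isinstance(tree, str):
        return {tree}
    left, right = (fitch_states(child) for child in tree)
    shared = left & right
    # Keep states both children agree on; otherwise carry both options upward.
    return shared if shared else left | right
```

For example, `fitch_states((("avian", "swine"), "avian"))` yields `{"avian"}`, suggesting an avian host for the ancestor of that clade. A full reconstruction would add a top-down pass to resolve ambiguous internal nodes, and any imputed label should be treated as lower-confidence than an observed one.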

5. Predictive Features for Host Tropism

Foundation models learn feature representations end-to-end, but interpretability methods reveal that specific features are highly predictive. Table 1 summarizes the key feature categories.

Table 1. Key Predictive Features for Host Tropism Prediction

Feature Category | Specific Features | Biological Rationale
-----------------|-------------------|---------------------
Glycoprotein sequence | Receptor binding site residues, glycosylation motifs, fusion peptide length | Directly determines receptor binding and membrane fusion
Glycoprotein structure | Binding pocket volume, electrostatic potential, hydrogen bond donor density | Geometric and physicochemical complementarity to host receptor
Host factor interaction | Importin-alpha isoform binding motifs, polymerase subunit interfaces | Post-entry replication efficiency
Codon usage | Codon adaptation index relative to host tRNA pools | Translation efficiency in host cells
Phylogenetic context | Distance to known zoonotic lineages, clade-specific markers | Evolutionary potential for host switching

6. Case Study: Influenza A Virus Host Tropism Prediction

Eng and colleagues [1] developed a random forest model for predicting host tropism of influenza A virus proteins. While not a foundation model in the contemporary deep learning sense, this work established the predictive framework upon which modern models build. The authors trained separate classifiers for each of the 11 influenza proteins using features derived from amino acid composition, dipeptide composition, and predicted secondary structure.
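
Two of the feature families mentioned here have simple generic definitions, sketched below. This is a textbook-style formulation and not the authors' exact implementation; it assumes sequences of at least two residues over the 20 standard amino acids, and the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Fraction of each standard amino acid in the sequence (20 features)."""
    return [seq.count(residue) / len(seq) for residue in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered adjacent residue pair (400 features)."""
    counts = {}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    total = len(seq) - 1
    return [counts.get(a + b, 0) / total for a, b in product(AMINO_ACIDS, repeat=2)]
```

Concatenating the two vectors gives a fixed-length representation of any protein, which is what allows a random forest to be trained on sequences of differing length.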

The HA-based classifier achieved the highest accuracy, consistent with the known primary role of HA in host determination. The random forest algorithm identified positions 190, 225, and 226 in HA as the most informative, matching experimental mutagenesis data. The authors emphasized that models trained on avian versus human isolates performed well, but classification of swine isolates was less accurate due to the intermediate receptor binding phenotype of swine-adapted viruses.

This framework has been extended to foundation models by replacing the manual feature engineering with learned embeddings from transformer architectures. Contemporary models achieve area under the receiver operating characteristic curve (AUC-ROC) values exceeding 0.95 for avian versus mammalian classification, and greater than 0.85 for fine-grained host species classification including swine, equine, and canine.
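
For reference, AUC-ROC for a binary task such as avian versus mammalian can be computed directly from its pairwise definition: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below uses hypothetical names and a quadratic-time loop chosen for clarity rather than speed.

```python
def auc_roc(labels, scores):
    """AUC-ROC from its pairwise definition.

    labels: iterable of 0/1 class labels; scores: predicted scores for class 1.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

For multi-class host prediction, a one-versus-rest AUC can be computed per host with the same function and then averaged.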

7. Validation Strategies

Robust validation is essential for clinical and diagnostic deployment. Standard approaches include:

  • Temporal cross-validation: Training on sequences collected before a cutoff date and testing on sequences collected after. This simulates prospective prediction of emerging viruses.
  • Leave-one-host-out cross-validation: Holding out all viruses associated with a particular host species during training, then testing the model's ability to predict that host.
  • External validation on independent datasets: Testing on viral families not seen during training, such as predicting host range for a novel paramyxovirus using a model trained on orthomyxoviruses.
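
The first two splitting schemes above can be sketched in a few lines. The record layout `(sequence_id, collection_date, host)` and the function names are assumptions made for illustration; sequences collected on the cutoff date itself are placed in the test set here.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Train on sequences collected before `cutoff`, test on the rest.

    records: list of (sequence_id, collection_date, host) tuples.
    """
    train = [r for r in records if r[1] < cutoff]
    test = [r for r in records if r[1] >= cutoff]
    return train, test


def leave_one_host_out(records, held_out_host):
    """Hold out every record annotated with `held_out_host`."""
    train = [r for r in records if r[2] != held_out_host]
    test = [r for r in records if r[2] == held_out_host]
    return train, test


# Hypothetical example records
EXAMPLE = [
    ("v1", date(2015, 3, 1), "avian"),
    ("v2", date(2019, 7, 9), "swine"),
    ("v3", date(2021, 1, 5), "avian"),
]
```

With `temporal_split(EXAMPLE, date(2018, 1, 1))`, only `v1` is used for training, mimicking a model that must score viruses sampled after it was built.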

8. Integration with Veterinary Diagnostic Workflows

The incorporation of foundation model predictions into routine veterinary diagnostics requires clear protocols. Figure 1 presents a decision workflow.

flowchart TD
    A[Novel viral isolate detected] --> B{Sequence available?}
    B -->|No| C[Perform metagenomic sequencing]
    C --> D[Assemble viral genome]
    B -->|Yes| D
    D --> E[Extract glycoprotein and polymerase sequences]
    E --> F[Pre-trained foundation model inference]
    F --> G[Host tropism probability vector]
    G --> H{Highest probability host?}
    H -->|Avian| I[Alert poultry surveillance network]
    H -->|Swine| J[Alert swine health monitoring]
    H -->|Bovine| K[Alert cattle health authorities]
    H -->|Canine/Feline| L[Alert companion animal clinics]
    H -->|Multiple hosts| M[Indicate broad tropism risk]
    I --> N[Initiate targeted diagnostic testing]
    J --> N
    K --> N
    L --> N
    M --> N
    N --> O[Confirm via receptor binding assay]
    O --> P[Report to WOAH]

Figure 1. Workflow for integrating biological foundation model predictions into veterinary diagnostic surveillance. The process begins with sequence acquisition from a novel isolate, proceeds through model inference, and culminates in targeted confirmatory testing and reporting.
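
The routing decision in Figure 1 could be implemented along the following lines. The figure does not specify how "multiple hosts" is determined, so the rule used here (the top two probabilities lying within a fixed margin of each other) and the 0.1 margin are assumptions, as are the host keys and the function name `route_alert`.

```python
ALERTS = {
    "avian": "Alert poultry surveillance network",
    "swine": "Alert swine health monitoring",
    "bovine": "Alert cattle health authorities",
    "canine/feline": "Alert companion animal clinics",
}
BROAD_TROPISM = "Indicate broad tropism risk"


def route_alert(probabilities, broad_margin=0.1):
    """Map a host probability vector to one of the alerts in Figure 1.

    probabilities: dict mapping host group -> predicted probability.
    broad_margin: assumed threshold; if the top two hosts are closer than
    this, the result is treated as a multiple-host prediction.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    top_host, top_p = ranked[0]
    if len(ranked) > 1 and top_p - ranked[1][1] < broad_margin:
        return BROAD_TROPISM
    return ALERTS[top_host]
```

Whatever rule is chosen, every branch still leads to targeted diagnostic testing and confirmatory assays, so the model output prioritizes work rather than replacing it.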

9. Limitations and Future Directions

Biological foundation models for host tropism prediction face several limitations. First, they cannot predict host range expansions that require novel mutations not present in the training distribution. Second, they may fail for viruses that use alternative entry mechanisms such as direct cell-to-cell spread or receptor-independent endocytosis. Third, the black-box nature of deep learning models complicates regulatory acceptance by veterinary authorities.

Future directions include:

  • Multiscale integration: Combining molecular predictions with ecological niche modeling, host population density data, and vector distribution maps.
  • Active learning: Prioritizing viruses for experimental host range testing based on model uncertainty.
  • Few-shot learning: Adapting models to predict tropism for novel viral families using only a handful of labeled examples.
  • Mechanistic interpretability: Developing attention-based explanations that report which specific residues drive a prediction.
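
The active learning item can be made concrete with entropy-based uncertainty sampling, one of several possible acquisition strategies. The sketch below ranks viruses by the Shannon entropy of their predicted host distribution and returns the most uncertain ones as candidates for experimental testing; the input format and function names are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def prioritize(predictions, k=2):
    """Return the k virus identifiers with the most uncertain predictions.

    predictions: dict mapping virus_id -> list of host probabilities.
    """
    ranked = sorted(predictions, key=lambda v: entropy(predictions[v]), reverse=True)
    return ranked[:k]
```

A prediction spread almost evenly across hosts has high entropy and is selected first, whereas a confident single-host prediction is deprioritized.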

10. Conclusion

Biological foundation models represent a paradigm shift in the prediction of host tropism for emerging zoonotic viruses. By learning rich representations of viral sequences and structures, these models can infer host range rapidly and, for well-sampled viral families, with high accuracy, although experimental confirmation remains necessary. For veterinary medicine, the ability to rapidly assess whether a novel avian influenza virus, coronavirus, or paramyxovirus poses a risk to livestock, poultry, or companion animals provides a crucial early warning capability. Continued investment in training data curation, model architecture innovation, and diagnostic integration will be essential for realizing the full potential of these methods in One Health surveillance.

References

[1] Eng CL, Tong JC, Tan TW. Predicting host tropism of influenza A virus proteins using random forest. BMC Med Genomics. 2014. https://pubmed.ncbi.nlm.nih.gov/25521718/