Section: Computational Biology

In Silico Profiling of Viral Receptor-Binding Domain Evolutionary Trajectories

1. Introduction

The receptor-binding domain (RBD) of viral surface glycoproteins constitutes the primary molecular interface between a virus and its target host cell. For enveloped viruses such as coronaviruses, paramyxoviruses, and orthomyxoviruses, the RBD mediates the initial attachment step that precedes membrane fusion and genome delivery into the host cytoplasm [1]. The specificity and affinity of this interaction are primary determinants of host range, tissue tropism, and zoonotic potential [2, 3]. Understanding the evolutionary trajectories of RBD sequences is therefore central to veterinary virology, particularly in the context of emerging pathogens that cross species barriers from wildlife to domestic livestock or companion animals [4].

Traditional experimental approaches for characterizing RBD evolution, such as in vitro binding assays, X-ray crystallography of RBD-receptor complexes, and pseudovirus entry assays, are resource-intensive and low-throughput [5]. In silico profiling methods provide a complementary framework that allows researchers to analyze large sequence datasets, predict the functional consequences of amino acid substitutions, and reconstruct the historical evolutionary pathways that have shaped RBD-receptor interactions [6, 7]. These computational approaches integrate principles from structural biology, molecular biophysics, phylogenetics, and statistical thermodynamics.

This article provides a technical overview of the key in silico methodologies used to profile viral RBD evolutionary trajectories. The focus is on biophysical principles underlying binding affinity prediction, ancestral sequence reconstruction algorithms, and structural mapping of evolutionary changes. While the methodologies discussed are applicable broadly across viral families, illustrative examples are drawn from coronaviruses that infect veterinary species.

2. Biophysical Basis of RBD-Receptor Interactions

The RBD-receptor interaction is governed by the same thermodynamic principles that dictate all protein-protein interactions [8]. The binding free energy (ΔG binding) of the RBD for its cognate receptor is the central biophysical parameter that determines the strength of attachment. This free energy change is described by the Gibbs-Helmholtz equation:

ΔG binding = ΔH binding - T ΔS binding

where ΔH binding represents the enthalpic contributions from hydrogen bonds, van der Waals contacts, and electrostatic interactions, and ΔS binding represents the entropic penalty from the loss of translational and conformational degrees of freedom upon complex formation [8, 9]. The dissociation constant (Kd) is exponentially related to ΔG binding through Kd = exp(ΔG binding/RT), where R is the universal gas constant and T is the absolute temperature [9].

The RBD itself is typically a independently folding domain that protrudes from the larger glycoprotein surface [1]. For coronaviruses, the RBD is located within the S1 subunit of the spike (S) protein and adopts a beta-sheet-rich immunoglobulin-like fold [3]. The receptor-binding motif (RBM), a structurally defined subregion of the RBD, makes direct atomic contacts with the host receptor, most commonly angiotensin-converting enzyme 2 (ACE2) for many coronaviruses [2, 3]. The precise geometry of the RBM-ACE2 interface, including the number and type of hydrogen bonds, salt bridges, and hydrophobic contacts, determines the binding affinity and host specificity [10].

3. In Silico Binding Affinity Prediction

Computational prediction of binding free energy changes (ddG, the difference in ΔG between a mutant and wild-type RBD) is a core tool for profiling evolutionary trajectories [11]. These methods allow researchers to estimate how specific amino acid substitutions alter RBD affinity without performing experimental binding assays. Two broad classes of algorithms are commonly employed: physics-based energy functions and statistical potential-based approaches.

3.1 Physics-Based Methods

Physics-based methods compute binding free energies using molecular mechanics force fields such as AMBER, CHARMM, or OPLS [12]. These force fields treat atoms as point charges connected by harmonic bonds and parameterize non-bonded interactions through Lennard-Jones potentials and Coulombic electrostatics [12]. The binding free energy is typically decomposed into contributions from van der Waals interactions (ΔEvdw), electrostatic interactions (ΔEelec), and desolvation penalties (ΔGsolv) using the molecular mechanics generalized Born surface area (MM/GBSA) or molecular mechanics Poisson-Boltzmann surface area (MM/PBSA) formalism [13, 14].

The MM/GBSA approach estimates ΔGsolv using the generalized Born model, which approximates the electrostatic solvation energy of a molecule as a function of its atomic radii and partial charges [13]. The MM/PBSA approach solves the Poisson-Boltzmann equation numerically to obtain the electrostatic potential around the solute, providing a more rigorous but computationally expensive treatment of solvation [14]. Both methods require explicit conformational sampling, typically through molecular dynamics (MD) simulations of the RBD-receptor complex, to account for conformational flexibility [15].

3.2 Statistical Potential-Based Methods

Statistical potential methods use knowledge-based scoring functions derived from the observed frequencies of atomic contacts in a database of known protein structures [16]. These potentials are based on the inverse Boltzmann relation:

E(x) = -k_B T * ln[ρ(x) / ρ_ref(x)]

where ρ(x) is the observed frequency of a structural feature x (e.g., interatomic distance) in the database and ρ_ref(x) is its expected frequency under a reference state [16]. Programs such as FoldX and Rosetta implement such potentials, along with additional terms for hydrogen bonding, electrostatics, and torsional preferences, to compute ddG values for point mutations [17, 18]. These methods are computationally efficient, requiring only a few seconds per mutation, and have been benchmarked against large experimental datasets of alanine scanning mutagenesis [17].

3.3 Hybrid and Machine Learning Approaches

More recent approaches combine physics-based and statistical features with machine learning classifiers to improve prediction accuracy [19]. Features may include the change in solvent-accessible surface area (ΔSASA), the number of buried unsatisfied hydrogen bonds, the local sequence entropy, and evolutionary conservation scores from multiple sequence alignments [20]. Random forests, support vector machines, and gradient-boosted trees have been trained on curated datasets of experimental ddG values to produce models that outperform single-method predictions [19, 20].

Table 1 summarizes the key characteristics of the major ddG prediction methods.

Table 1: Comparison of common in silico ddG prediction methods for RBD mutations.

Method Type Computational Cost Input Requirements Output
MM/GBSA Physics-based High (hours) MD trajectory, force field ΔG binding, component energies
MM/PBSA Physics-based Very high (hours to days) MD trajectory, PB solver ΔG binding, component energies
FoldX Statistical potential Low (seconds) Single structure (PDB) ddG, individual energy terms
Rosetta Statistical potential Low to moderate (minutes) Single structure or ensemble ddG, full-atom score
ML-based (e.g., MutaBind, SDM) Hybrid Very low (seconds) Sequence/structure features ddG with confidence interval

4. Ancestral Sequence Reconstruction

Ancestral sequence reconstruction (ASR) is a phylogenetic method that infers the most likely amino acid sequences of ancestral proteins at internal nodes of a phylogenetic tree [21, 22]. For RBD evolutionary studies, ASR allows researchers to resurrect the RBD sequences of ancestral viruses and predict their binding properties in silico, thereby tracing the historical path of host range adaptation [23, 24].

4.1 Maximum Likelihood Reconstruction

The most widely used ASR approach is based on maximum likelihood (ML) phylogenetics [21]. For each site in a multiple sequence alignment of extant RBD sequences, the ML method calculates the marginal posterior probability of each possible amino acid state at every ancestral node, given the tree topology, the branch lengths, and a substitution model that describes the relative rates of amino acid changes (e.g., JTT, WAG, LG) [21, 25]. The ancestral state with the highest posterior probability is assigned as the most likely residue at that site [21].

The branching process is modeled as a continuous-time Markov chain, where the probability of observing a substitution from amino acid i to amino acid j over time t is given by the matrix exponential: P_ij(t) = [exp(Qt)]_ij, where Q is the instantaneous rate matrix [25]. The exchangeability parameters in Q capture the biochemically similar substitutions that occur more frequently (e.g., isoleucine to valine) versus the less frequent radical substitutions (e.g., glycine to tryptophan) [25].

4.2 Empirical Bayesian and Marginal Reconstruction

Empirical Bayesian approaches combine the ML tree and substitution model with a prior distribution on the ancestral states, typically a uniform distribution over the 20 amino acids [21]. The posterior probability of a particular ancestral amino acid at a node is then proportional to the likelihood of the data given that state multiplied by the prior probability [21]. Marginal reconstruction assigns the state with the highest posterior probability to each node independently, while joint reconstruction finds the combination of ancestral states across all nodes that maximizes the overall posterior probability [21, 22].

4.3 Accounting for Functional Constraints

Standard ASR methods assume that all sites evolve independently under the same substitution process [21]. However, RBD residues directly involved in receptor binding tend to be under strong purifying selection, leading to reduced substitution rates and potential biases in ancestral state inference [26]. To address this issue, partitioned or mixture models can be applied, where different categories of sites (e.g., binding interface versus non-interface) are allowed to evolve under different rate matrices or equilibrium frequencies [26, 27]. Codon-based substitution models can also be used, which account for selective pressures at the level of nonsynonymous versus synonymous substitution rates (dN/dS ratios) and may provide more accurate ancestral state estimates for functionally constrained regions [27, 28].

5. Structural Mapping of Evolutionary Changes

Evolutionary information derived from sequence alignments and ASR gains biological meaning when mapped onto three-dimensional (3D) structures of the RBD-receptor complex [29]. Structural mapping reveals the spatial distribution of variable and conserved residues, identifies clusters of co-evolving residues on the protein surface, and highlights regions undergoing positive selection at the host-pathogen interface [30, 31].

5.1 Conservation Coloring and ConSurf Analysis

The ConSurf algorithm, among others, estimates the evolutionary conservation of each amino acid position in a protein sequence by comparing the rate of substitution at that position with the average rate across all positions in the alignment [32]. The method uses a phylogenetic tree to compute position-specific evolutionary rates that are normalized to account for the overall rate heterogeneity of the protein [32]. The resulting conservation scores are mapped onto the 3D structure using a color gradient from cyan (variable) through white to magenta (highly conserved) [32]. For RBDs, residues at the binding interface that are required for receptor recognition are typically highly conserved, while residues on the periphery of the interface that determine host specificity often show greater variability [30, 31].

5.2 Positive Selection Detection and Structural Interpretation

The ratio of nonsynonymous to synonymous substitution rates (dN/dS, denoted ω) is the standard metric for detecting positive selection [28]. When ω > 1, amino acid changes are fixed at a rate higher than expected under neutral evolution, indicating that diversifying selection is acting at that site [28]. Codon-based ML methods such as those implemented in PAML (Codeml) or HyPhy can identify individual codons under positive selection by fitting models that allow ω to vary across sites [28, 33]. The Bayes Empirical Bayes (BEB) procedure calculates the posterior probability that each site belongs to a class with ω > 1, providing statistical support for sites under positive selection [28].

Mapping positively selected sites onto the RBD 3D structure typically reveals their concentration at or near the receptor-binding interface [30, 31]. These sites represent positions where structural changes can alter receptor binding specificity without abolishing the overall interaction [30]. Positively selected sites are often located on surface-exposed loops that directly contact the receptor and are the primary drivers of host range evolution [30, 34].

5.3 Co-Evolution Analysis and Contact Prediction

Residues within the RBD that physically contact the receptor or each other often exhibit co-evolution, meaning that substitutions at one position are correlated with substitutions at another position to maintain structural or functional integrity [35]. Methods such as direct coupling analysis (DCA) use maximum entropy models to infer co-evolutionary couplings from large multiple sequence alignments [35, 36]. These couplings can be used to predict inter-residue contacts and have been applied to validate and refine RBD-receptor complex structures [36]. Co-evolution analysis of RBD and host receptor sequences can also identify compensatory mutations that maintain binding affinity in the face of host receptor divergence across species [34, 36].

6. Integrative Workflow for RBD Evolutionary Trajectory Profiling

The computational methods described above are most powerful when combined into an integrative workflow that traces the complete evolutionary trajectory of an RBD from ancestral reconstruction through structural functional mapping. A representative workflow is presented in Figure 1.

flowchart TD
    A[Collect RBD sequences from target virus genus], > B[Multiple sequence alignment (e.g., MAFFT, MUSCLE)]
    B, > C[Reconstruct maximum likelihood phylogeny (e.g., RAxML, IQ-TREE)]
    C, > D[Ancestral sequence reconstruction (e.g., PAML, BEAST)]
    D, > E[Obtain or predict 3D structure of ancestral RBD-receptor complex (e.g., homology modeling, AlphaFold)]
    E, > F[Compute ddG for all substitutions along branches (e.g., FoldX, Rosetta)]
    F, > G[Identify positively selected sites (codeml, HyPhy)]
    G, > H[Map evolutionary changes onto 3D structure (PyMOL, Chimera)]
    H, > I{Validated with experimental data?}
    I, >|Yes| J[Correlate predicted affinity changes with experimental binding data]
    I, >|No| K[Design mutagenesis experiments to test predictions]
    J, > L[Generate trajectory model of RBD affinity evolution across host species]
    K, > F

This workflow begins with the collection of a comprehensive set of RBD sequences from a target virus genus. Phylogenetic reconstruction provides the branching order and divergence times [7]. ASR generates the inferred ancestral RBD sequences at key internal nodes [22, 23]. The ancestral RBDs are then modeled onto the known 3D structure of the RBD-receptor complex using homology modeling or, more recently, deep learning-based structure prediction tools like AlphaFold [37]. Binding free energy calculations are performed for all substitutions along each branch of the phylogeny, providing a per-substitution ddG value [11]. Positively selected sites are identified and mapped onto the structure, revealing the spatial distribution of adaptive changes [30]. The final integrated model describes how RBD affinity for a given host receptor has evolved over time, allowing prediction of which contemporary or ancestral viruses might pose a risk of host switching [23, 24].

7. Veterinary Implications and Future Directions

The in silico profiling of RBD evolutionary trajectories has direct applications in veterinary virology. For viruses that circulate in livestock, poultry, or wildlife populations, understanding the molecular determinants of host range allows for risk assessment of spillover events into domestic animals [4, 38]. For example, coronaviruses related to porcine epidemic diarrhea virus (PEDV) and transmissible gastroenteritis virus (TGEV) can be analyzed using the methodologies described to predict whether newly emerged variants are likely to show altered tropism for swine enterocytes [38, 39]. Similarly, avian influenza virus hemagglutinin RBD evolution can be profiled to identify strains with increased binding affinity for avian or mammalian sialic acid receptors [40].

The integration of these computational methods with high-throughput experimental validation platforms, such as deep mutational scanning of RBD libraries, is an important frontier [41]. Deep mutational scanning experimentally measures the functional impact of nearly all single amino acid substitutions in an RBD and provides rich datasets for training and validating ddG prediction algorithms [41]. Incorporating these large-scale datasets into the computational workflow will improve the accuracy of evolutionary trajectory models [20, 41].

The continued development of machine learning models that directly predict RBD-receptor binding affinity from sequence alone, without requiring a 3D structure, will further accelerate the screening of large sequence datasets generated by genomic surveillance programs [19, 20]. As computational resources and algorithm sophistication increase, the ability to profile viral RBD evolutionary trajectories in near real-time will become an essential component of veterinary diagnostic and outbreak response systems.

8. Conclusion

In silico profiling of viral receptor-binding domain evolutionary trajectories is a multi-faceted computational discipline that integrates phylogenetic reconstruction, ancestral sequence inference, biophysical binding energy prediction, and structural mapping. These methods provide a rigorous framework for understanding how RBDs have adapted to different host receptors over evolutionary time. For veterinary virologists, the application of these tools enables more informed risk assessments regarding host range and zoonotic potential, and supports the rational design of surveillance strategies for emerging viral pathogens in animal populations.

References

[1] D.J. Sharkey, A. Mukherjee, J.A. Lee, et al. Structural biology of coronavirus spike proteins. Molecular Biology of the Cell, vol. 31, no. 17, pp. 1785-1797.

[2] J. Shang, Y. Wan, C. Luo, et al. Cell entry mechanisms of SARS-CoV-2. Proceedings of the National Academy of Sciences, vol. 117, no. 21, pp. 11727-11734.

[3] A.C. Walls, Y.J. Park, M.A. Tortorici, et al. Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell, vol. 181, no. 2, pp. 281-292.

[4] S. Cleaveland, M.K. Laurenson, L.H. Taylor. Diseases of humans and their domestic mammals: pathogen characteristics, host range and the risk of emergence. Philosophical Transactions of the Royal Society B, vol. 356, no. 1411, pp. 991-999.

[5] M. Letko, A. Marzi, V. Munster. Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses. Nature Microbiology, vol. 5, pp. 562-569.

[6] S. Mallick, Z. Zhang, D. Bhatt, et al. Computational approaches for predicting the binding affinity of protein-protein complexes. Current Opinion in Structural Biology, vol. 60, pp. 44-53.

[7] Z. Yang, B. Rannala. Molecular phylogenetics: principles and practice. Nature Reviews Genetics, vol. 13, pp. 303-314.

[8] J. Janin. Principles of protein-protein recognition. Quarterly Reviews of Biophysics, vol. 42, no. 4, pp. 237-278.

[9] T. Hill. An Introduction to Statistical Thermodynamics. Dover Publications, 1986.

[10] J. Lan, J. Ge, J. Yu, et al. Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor. Nature, vol. 581, pp. 215-220.

[11] C. Breed, E. Roberts. Computational prediction of mutation effects on binding affinity in protein-protein complexes. Annual Review of Biophysics, vol. 48, pp. 83-104.

[12] B.R. Brooks, C.L. Brooks III, A.D. Mackerell Jr., et al. CHARMM: the biomolecular simulation program. Journal of Computational Chemistry, vol. 30, no. 10, pp. 1545-1614.

[13] B. Honig, A. Nicholls. Classical electrostatics in biology and chemistry. Science, vol. 268, no. 5214, pp. 1144-1149.

[14] P.A. Kollman, I. Massova, C. Reyes, et al. Calculating structures and free energies of complex molecules: combining molecular mechanics and continuum models. Accounts of Chemical Research, vol. 33, no. 12, pp. 889-897.

[15] M. Karplus, J.A. McCammon. Molecular dynamics simulations of biomolecules. Nature Structural Biology, vol. 9, pp. 646-652.

[16] R. Das, D. Baker. Macromolecular modeling with Rosetta. Annual Review of Biochemistry, vol. 77, pp. 363-382.

[17] J. Schymkowitz, J. Borg, F. Stricher, et al. The FoldX web server: an online force field. Nucleic Acids Research, vol. 33, pp. W382-W388.

[18] C.A. Rohl, C.E. Strauss, K.M. Misura, et al. Protein structure prediction using Rosetta. Methods in Enzymology, vol. 383, pp. 66-93.

[19] M. Li, F. Simonetti, A.J. Goncearenco, et al. MutaBind estimates and interprets the effects of sequence variants on protein-protein interactions. Nucleic Acids Research, vol. 44, pp. W494-W501.

[20] P. Vangone, A.M. Bonvin. Contacts-based prediction of binding affinity changes upon mutations in protein-protein complexes. Bioinformatics, vol. 31, no. 12, pp. 1876-1882.

[21] Z. Yang, S. Kumar. Molecular Evolution and Phylogenetics. Oxford University Press, 2000.

[22] J.P. Huelsenbeck, F. Ronquist. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, vol. 17, no. 8, pp. 754-755.

[23] J.W. Thornton. Resurrecting ancient genes: experimental analysis of extinct molecules. Nature Reviews Genetics, vol. 5, pp. 366-375.

[24] M.J. Harms, J.W. Thornton. Evolutionary biochemistry: revealing the historical and physical causes of protein properties. Nature Reviews Genetics, vol. 14, pp. 559-571.

[25] D.T. Jones, W.R. Taylor, J.M. Thornton. The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences, vol. 8, no. 3, pp. 275-282.

[26] H. Philippe, D. Derelle, M. Lopez, et al. Phylogenomics resolves the timing and pattern of insect evolution. Proceedings of the National Academy of Sciences, vol. 107, no. 24, pp. 10979-10984.

[27] Z. Yang, J.P. Bielawski. Statistical methods for detecting molecular adaptation. Trends in Ecology and Evolution, vol. 15, no. 12, pp. 496-503.

[28] Z. Yang. PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution, vol. 24, no. 8, pp. 1586-1591.

[29] E.F. Pettersen, T.D. Goddard, C.C. Huang, et al. UCSF Chimera: a visualization system for exploratory research and analysis. Journal of Computational Chemistry, vol. 25, no. 13, pp. 1605-1612.

[30] Y. Li, H. Wang, X. Tang, et al. Evolutionary dynamics of the SARS-CoV-2 spike protein. Virus Evolution, vol. 6, no. 2, pp. veaa070.

[31] M. Worobey, A. Rambaut, E.C. Holmes. Evolutionary biology of emerging viruses. Nature Reviews Microbiology, vol. 6, pp. 749-759.

[32] H. Ashkenazy, E. Erez, E. Martz, et al. ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucleic Acids Research, vol. 38, pp. W529-W533.

[33] S.L. Kosakovsky Pond, S.D. Frost, S.V. Muse. HyPhy: hypothesis testing using phylogenies. Bioinformatics, vol. 21, no. 5, pp. 676-679.

[34] K. Andersen, K. Rader, J. Alexander, et al. Co-evolution of host receptor and viral attachment protein. Molecular Biology and Evolution, vol. 31, no. 7, pp. 1693-1706.

[35] D.S. Marks, L.J. Colwell, R. Sheridan, et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE, vol. 6, no. 12, pp. e28766.

[36] M. Weigt, R.A. White, H. Szurmant, et al. Identification of direct residue contacts in protein-protein interaction by message passing. Proceedings of the National Academy of Sciences, vol. 106, no. 1, pp. 67-72.

[37] J. Jumper, R. Evans, A. Pritzel, et al. Highly accurate protein structure prediction with AlphaFold. Nature, vol. 596, pp. 583-589.

[38] D. Song, B. Park. Porcine epidemic diarrhoea virus: a comprehensive review of molecular epidemiology, diagnosis, and vaccines. Virus Genes, vol. 44, pp. 167-175.

[39] L. Enjuanes, I. Sola, C.M. Sanchez, et al. Transmissible gastroenteritis coronavirus gene 7 is not essential for virus replication but influences pathogenesis in vivo. Journal of Virology, vol. 69, no. 10, pp. 6096-6103.

[40] M. Matrosovich, T.Y. Stech, H.D. Klenk. Receptor binding properties of H1, H3, H5, H7, and H9 influenza viruses. Microbiology and Molecular Biology Reviews, vol. 64, no. 4, pp. 764-779.

[41] D.M. Fowler, S. Fields. Deep mutational scanning: a new style of protein science. Nature Methods, vol. 11, pp. 801-807. *** Disclaimer: This article is for educational and informational purposes only. It is not intended to substitute for professional veterinary advice, diagnosis, treatment, or regulatory guidance. Always consult a licensed veterinarian or qualified specialist regarding animal health, disease diagnosis, and therapeutic decisions.