What is Dr. Zubair Khalid's research focus?

Dr. Zubair Khalid specializes in molecular virology, mRNA vaccine development, and computational biology, with a focus on avian pathogens like IBDV and Avian Reovirus.

Where is Dr. Zubair Khalid currently working?

Dr. Zubair Khalid is a Postdoctoral Research Associate at the University of Maryland (UMD), specifically within the Department of Animal and Avian Sciences.

Machine Learning-Driven Prediction of Receptor-Binding Dynamics in Emerging Zoonotic Coronaviruses

Introduction

The emergence of zoonotic coronaviruses from animal reservoirs, particularly bats and intermediate livestock hosts, represents a persistent threat to animal and public health. The capacity of a coronavirus to cross species barriers is fundamentally determined by the molecular interaction between its spike glycoprotein and host cell receptors. For betacoronaviruses, angiotensin-converting enzyme 2 (ACE2) serves as the primary receptor for SARS-CoV-2 and related sarbecoviruses, while dipeptidyl peptidase 4 (DPP4) is the receptor for MERS-CoV and several bat coronaviruses (Merck Veterinary Manual). Understanding and predicting these receptor-binding dynamics is essential for assessing zoonotic spillover risk and guiding surveillance efforts in wildlife and livestock populations.

Machine learning (ML) and molecular dynamics (MD) simulations have emerged as powerful tools to predict binding affinities, structural stability, and mutational effects on spike-receptor interfaces. These computational methods enable rapid screening of viral variants from sequencing data, structural modeling of novel spike proteins, and estimation of cross-species transmission potential. This article reviews the current state of ML-driven prediction of receptor-binding dynamics in emerging zoonotic coronaviruses, with a focus on veterinary applications and pandemic preparedness.

Molecular Basis of Receptor Binding in Zoonotic Coronaviruses

Coronavirus spike proteins are class I viral fusion glycoproteins that mediate host cell attachment and entry. The receptor-binding domain (RBD) within the S1 subunit undergoes conformational changes to engage the host receptor. In sarbecoviruses, the RBD adopts two conformations: a closed (down) state that occludes the receptor-binding motif (RBM) and an open (up) state that exposes the RBM for ACE2 binding (Fields Virology). The binding interface involves a network of hydrogen bonds, van der Waals contacts, and salt bridges between RBM residues and the N-terminal helix of ACE2.

For DPP4-using coronaviruses such as MERS-CoV, the RBD binds to the propeller domain of DPP4. The binding mode is distinct from ACE2 engagement and involves a larger interface area. The RBM contacts DPP4 in part through a loop region that inserts into a cavity on the DPP4 surface (Diseases of Poultry). Understanding these structural determinants is critical for predicting host range.

Zoonotic coronaviruses circulating in bats, such as RaTG13, RmYN02, and various HKU-related viruses, exhibit variable binding affinities for ACE2 orthologs from different species. Some bat coronaviruses can bind human ACE2 with high affinity, while others require intermediate adaptation in an amplifying host (e.g., civets, raccoon dogs, or swine) before efficient human transmission (Merck Veterinary Manual). Computational prediction of these affinities can prioritize surveillance targets.

Computational Tools and Workflows

The prediction of receptor-binding dynamics integrates several computational tools. The following table summarizes key methods and their applications.

Tool / Method	Application	Output
AlphaFold2	Prediction of spike protein 3D structure from amino acid sequence	All-atom model of RBD and full spike
Rosetta	Protein-protein docking and binding energy calculation	Interface scores, binding affinity estimates
GROMACS	Molecular dynamics simulations of spike-receptor complexes	Trajectories, free energy landscapes, RMSD, binding free energy (MM-PBSA)
Deep learning models (e.g., convolutional neural networks, graph neural networks)	Prediction of binding affinity from sequence or structure features	Numeric affinity score (e.g., ΔΔG)
Protein language models (e.g., ESM-1b, ProtBERT)	Embedding of spike sequences for variant effect prediction	Latent representations for downstream classifiers

The typical workflow for predicting receptor-binding dynamics is illustrated in the Mermaid diagram below.

flowchart TD
    A[Viral sequence data from GISAID or field samples] --> B[Sequence alignment and phylogenetic analysis]
    B --> C[Structural modeling with AlphaFold2]
    C --> D[Docking of spike RBD to host receptor orthologs using Rosetta]
    D --> E[Molecular dynamics simulations with GROMACS]
    E --> F["Binding free energy calculation (MM-PBSA / MM-GBSA)"]
    F --> G[Machine learning model training on features from sequence, structure, and dynamics]
    G --> H[Prediction of binding affinity for novel variants]
    H --> I[Zoonotic spillover risk assessment]
    I --> J[Surveillance prioritization in animal reservoirs]

This workflow can be applied iteratively as new sequences become available. The integration of ML models trained on large datasets of experimentally measured binding affinities (e.g., from deep mutational scanning) allows rapid screening without exhaustive MD simulations.

Machine Learning Models for Binding Affinity Prediction

Several ML architectures have been adapted to predict spike-receptor binding affinity. Convolutional neural networks (CNNs) applied to 3D voxelized representations of protein interfaces can capture spatial patterns of physicochemical complementarity. Graph neural networks (GNNs) treat protein structures as graphs in which nodes represent residues and edges represent spatial proximity or interatomic contacts. GNNs have been reported to perform well at predicting changes in binding affinity upon mutation (Protein-Protein Interface Design and Binding Energy Prediction).
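
The graph representation described above can be sketched in a few lines. This is a minimal illustration, assuming one C-alpha coordinate per residue and a distance cutoff of 8 Å; the function name `contact_graph`, the cutoff, and the toy coordinates are all hypothetical choices, not values from a published model.

```python
import math

def contact_graph(ca_coords, cutoff=8.0):
    """Build an undirected residue contact graph from C-alpha coordinates.

    ca_coords: list of (x, y, z) tuples in angstroms, one per residue.
    cutoff: distance threshold in angstroms (assumed value; tune per task).
    Returns a dict mapping residue index -> list of neighbor indices.
    """
    n = len(ca_coords)
    graph = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            # Two residues share an edge when their C-alpha atoms are close.
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                graph[i].append(j)
                graph[j].append(i)
    return graph

# Toy coordinates (hypothetical, not taken from any real structure).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_graph = contact_graph(toy_coords)
print(toy_graph)
```

In a real pipeline the adjacency lists would be paired with per-residue node features (for example, amino acid identity or solvent accessibility) before being passed to a GNN library.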

Random forest and gradient boosting models trained on handcrafted features (e.g., electrostatic potential, hydrophobic patch area, hydrogen bond count) remain competitive when feature engineering is carefully performed. Deep learning models, however, can automatically learn relevant features from raw sequence or structure data. Protein language models, such as those based on the transformer architecture, generate embeddings that capture evolutionary and structural information. These embeddings can be fed into a regression head to predict the change in binding free energy (ΔΔG) caused by RBD mutations.

A critical challenge is the limited availability of experimentally measured binding affinities for diverse coronavirus-receptor pairs. Transfer learning from general protein-protein interaction datasets (e.g., SKEMPI) can partially address this. Additionally, data augmentation through MD simulations can generate synthetic training examples.

Feature Engineering from Sequence and Structure

Effective ML models require informative features. For sequence-based features, one-hot encoding of amino acids, position-specific scoring matrices (PSSMs) from multiple sequence alignments, and evolutionary conservation scores are commonly used. Structural features include solvent-accessible surface area (SASA) of interface residues, residue depth, local backbone angles, and inter-residue contact maps.
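
As a rough sketch of two of the sequence features mentioned above, the following code one-hot encodes a sequence and scores per-column conservation in an alignment. The conservation definition used here (one minus normalized Shannon entropy) is only one of several in use, and the function names and toy alignment are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Encode a protein sequence as a list of 20-element 0/1 vectors."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for aa in sequence:
        vec = [0] * len(AMINO_ACIDS)
        if aa in index:  # gaps and unknown residues stay all-zero
            vec[index[aa]] = 1
        encoded.append(vec)
    return encoded

def conservation(alignment):
    """Per-column conservation: 1 - normalized Shannon entropy (1.0 = invariant)."""
    max_entropy = math.log2(len(AMINO_ACIDS))
    scores = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa in AMINO_ACIDS]
        if not residues:  # column of gaps only
            scores.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores

# Toy aligned fragments (hypothetical, equal length).
toy_alignment = ["NGVE", "NGVK", "NAVQ", "NGVT"]
toy_scores = conservation(toy_alignment)
print([round(s, 3) for s in toy_scores])
```

PSSMs extend the same column-wise counting by converting residue frequencies into log-odds scores against background frequencies.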

For spike RBD-ACE2 complexes, specific features of interest include the number of hydrogen bonds across the interface, the change in SASA upon binding (ΔSASA), and the electrostatic complementarity score. The presence of glycosylation sites near the RBM can also affect binding and should be encoded. Features derived from MD simulations, such as root-mean-square fluctuation (RMSF) of interface residues and principal component analysis of conformational ensembles, provide dynamic information that static structures miss.
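
To make the RMSF feature concrete, here is a minimal sketch that computes per-atom fluctuation from a list of coordinate frames. It assumes the frames have already been superposed onto a common reference (which trajectory analysis tools normally handle); the function name `rmsf` and the two-frame toy trajectory are hypothetical.

```python
import math

def rmsf(trajectory):
    """Per-atom root-mean-square fluctuation over frames.

    trajectory: list of frames; each frame is a list of (x, y, z) tuples.
    Frames are assumed to be pre-aligned to a common reference.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for a in range(n_atoms):
        # Time-averaged position of atom a.
        mean = [sum(frame[a][k] for frame in trajectory) / n_frames for k in range(3)]
        # Mean squared deviation from that average position.
        msd = sum(
            sum((frame[a][k] - mean[k]) ** 2 for k in range(3))
            for frame in trajectory
        ) / n_frames
        result.append(math.sqrt(msd))
    return result

# Toy trajectory: atom 0 is rigid, atom 1 moves 2 angstroms between frames.
toy_traj = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (7.0, 0.0, 0.0)],
]
print(rmsf(toy_traj))
```

Averaging the resulting values over interface residues gives a single flexibility feature per complex.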

Dimensionality reduction techniques such as PCA are often applied before model training to reduce the risk of overfitting, while t-SNE is better suited to visualizing the feature space than to producing model inputs. Feature selection using mutual information or recursive feature elimination can identify the most predictive variables.
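
Mutual-information-based feature ranking can be illustrated with discretized features. The sketch below assumes features have already been binned into categories; the feature names, bins, and labels are invented for the example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

def rank_features(features, labels):
    """Return (names sorted by decreasing MI with labels, dict of MI scores)."""
    scores = {name: mutual_information(vals, labels) for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Toy data (hypothetical): four complexes with binned features and binding labels.
toy_labels = ["binder", "binder", "non-binder", "non-binder"]
toy_features = {
    "glycan_near_rbm": ["yes", "no", "yes", "no"],
    "hbond_count_bin": ["high", "high", "low", "low"],
}
toy_ranking, toy_mi = rank_features(toy_features, toy_labels)
print(toy_ranking, toy_mi)
```

With continuous features, an estimator that avoids explicit binning is usually preferable; the counting approach here is only meant to show the quantity being ranked.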

Deep Learning for Variant Impact Prediction

Deep learning models have been developed specifically to predict the impact of spike protein mutations on receptor binding. These models can be trained on deep mutational scanning (DMS) data, which systematically measures the effect of nearly every single amino acid substitution on binding affinity. For the SARS-CoV-2 RBD, comprehensive DMS libraries have been generated, providing a rich training resource.
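
When affinity measurements are reported as dissociation constants, they can be converted to a ΔΔG training label with the standard relation ΔΔG = RT ln(Kd,mutant / Kd,wild-type). The sketch below assumes a temperature of 298.15 K and the sign convention that positive ΔΔG means weaker binding; the function name and example Kd values are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def ddg_from_kd(kd_mutant, kd_wildtype, temperature=298.15):
    """Change in binding free energy (kcal/mol) from two dissociation constants.

    Both Kd values must be in the same units. Positive output = weaker binding
    (assumed sign convention; some datasets use the opposite sign).
    """
    return R_KCAL * temperature * math.log(kd_mutant / kd_wildtype)

# Hypothetical example: a mutation that weakens binding tenfold (1 nM -> 10 nM).
print(round(ddg_from_kd(1e-8, 1e-9), 2))
```

A tenfold change in Kd therefore corresponds to roughly 1.4 kcal/mol at room temperature, which gives a sense of scale when reading model error figures.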

Convolutional neural networks applied to 2D contact maps can predict how mutations alter the interface. Graph neural networks that incorporate both sequence and structure information can generalize to mutations not present in the training set. Attention-based models can highlight which residues are most critical for binding.

For zoonotic coronaviruses with limited experimental data, zero-shot prediction using protein language models is promising. These models, trained on millions of natural protein sequences, can assign likelihood scores to mutations based on evolutionary plausibility. Mutations that are both evolutionarily acceptable and predicted to increase binding to a new host receptor are flagged as high-risk.
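
The flagging logic described above can be sketched as a log-likelihood ratio combined with a predicted ΔΔG. In this illustration the per-position probability table stands in for the output of a protein language model, and the thresholds, probabilities, and ΔΔG values are invented; none of them come from a real model.

```python
import math

def llr_score(probs, wildtype, position, mutant):
    """Log-likelihood ratio: log p(mutant) - log p(wild-type residue) at a position.

    probs: list of dicts (one per position) mapping amino acid -> probability.
    This table is a stand-in for language model output (assumption).
    """
    p = probs[position]
    return math.log(p[mutant]) - math.log(p[wildtype[position]])

def flag_high_risk(candidates, probs, wildtype, predicted_ddg,
                   llr_min=-1.0, ddg_max=-0.5):
    """Keep mutations that are plausible (LLR >= llr_min) and predicted to
    strengthen binding (ddG <= ddg_max, negative = tighter). Thresholds are
    arbitrary illustrative values."""
    flagged = []
    for pos, mut in candidates:
        plausible = llr_score(probs, wildtype, pos, mut) >= llr_min
        stronger = predicted_ddg[(pos, mut)] <= ddg_max
        if plausible and stronger:
            flagged.append((pos, mut))
    return flagged

# Toy inputs (hypothetical two-residue "sequence").
toy_wildtype = "NQ"
toy_probs = [
    {"N": 0.6, "K": 0.3, "D": 0.1},
    {"Q": 0.6, "H": 0.39, "W": 0.01},
]
toy_candidates = [(0, "K"), (1, "H"), (1, "W")]
toy_ddg = {(0, "K"): -0.8, (1, "H"): 0.3, (1, "W"): -1.2}
toy_flagged = flag_high_risk(toy_candidates, toy_probs, toy_wildtype, toy_ddg)
print(toy_flagged)
```

Here the substitution at position 1 to W is predicted to bind more tightly but is rejected as evolutionarily implausible, which is the kind of filtering the combined criterion is meant to provide.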

Integration with Molecular Dynamics Simulations

Molecular dynamics simulations provide a dynamic view of spike-receptor interactions that complements static docking. All-atom MD simulations using force fields such as CHARMM36 or AMBER ff14SB can capture conformational changes, water-mediated interactions, and induced fit effects. The binding free energy can be estimated using the molecular mechanics Poisson-Boltzmann surface area (MM-PBSA) method or thermodynamic integration.
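
In the common single-trajectory form of MM-PBSA, the binding free energy is approximated as the frame average of G(complex) - G(receptor) - G(ligand), where each G is the sum of molecular mechanics and solvation terms. The sketch below assumes those per-frame totals have already been computed by an MD analysis tool and omits the entropy term, as many applications do; the function name and energy values are hypothetical.

```python
from statistics import mean, stdev

def mmpbsa_binding_energy(frames):
    """Average binding energy and standard error over trajectory frames.

    frames: list of dicts with keys 'complex', 'receptor', 'ligand', each a
    per-frame total (E_MM + G_solv) in kcal/mol. Entropy is not included.
    """
    per_frame = [f["complex"] - f["receptor"] - f["ligand"] for f in frames]
    sem = stdev(per_frame) / len(per_frame) ** 0.5 if len(per_frame) > 1 else 0.0
    return mean(per_frame), sem

# Toy per-frame energies (hypothetical values, kcal/mol).
toy_frames = [
    {"complex": -1000.0, "receptor": -600.0, "ligand": -360.0},
    {"complex": -1010.0, "receptor": -605.0, "ligand": -363.0},
    {"complex": -995.0, "receptor": -598.0, "ligand": -359.0},
]
toy_dg, toy_sem = mmpbsa_binding_energy(toy_frames)
print(toy_dg, round(toy_sem, 3))
```

Because absolute MM-PBSA values are often unreliable, the differences between variants run under the same protocol tend to be more informative than any single number.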

Coarse-grained MD simulations, using models such as Martini, allow longer timescales and larger systems, enabling simulation of full spike trimers interacting with membrane-bound receptors. Markov state models constructed from MD trajectories can identify metastable binding states and transition pathways.
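
The core of a Markov state model is a transition probability matrix estimated from a trajectory that has been discretized into states. The following sketch counts transitions at a fixed lag and row-normalizes them; it assumes the state assignment (for example, by clustering) has already been done, and the three-state labeling and toy trajectory are hypothetical.

```python
def transition_matrix(state_sequence, n_states, lag=1):
    """Row-normalized transition matrix from a discretized trajectory.

    state_sequence: list of integer state labels, one per frame.
    lag: number of frames between the start and end of each counted transition.
    Rows for states that are never left are returned as all zeros.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(state_sequence, state_sequence[lag:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Toy trajectory: 0 = unbound, 1 = encounter complex, 2 = bound (hypothetical).
toy_states = [0, 0, 1, 1, 2, 2, 2, 1, 2, 2]
toy_T = transition_matrix(toy_states, n_states=3)
for row in toy_T:
    print([round(p, 3) for p in row])
```

A production analysis would also validate the lag time and enforce detailed balance during estimation, steps that dedicated MSM packages provide and that this sketch leaves out.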

Machine learning can accelerate MD simulations by predicting free energy surfaces from short trajectories or by serving as surrogate models for binding affinity. For example, neural networks trained on MD-derived features can predict binding free energies for new variants without running full simulations.

Applications in Zoonotic Spillover Risk Assessment

The ultimate goal of these computational approaches is to assess the zoonotic potential of coronaviruses circulating in animal reservoirs. By predicting the binding affinity of spike proteins from bat, civet, or swine coronaviruses to human ACE2 or DPP4, researchers can prioritize viruses for experimental characterization and surveillance.

For example, the bat coronavirus RaTG13 shares approximately 96% genome-wide nucleotide identity with SARS-CoV-2 but shows reduced binding to human ACE2. ML models trained on DMS data can predict which mutations in RaTG13 would enhance human receptor binding, guiding monitoring of viral evolution in bat populations. Similarly, swine acute diarrhea syndrome coronavirus (SADS-CoV), a bat-origin alphacoronavirus of pigs, does not appear to depend on ACE2, DPP4, or aminopeptidase N for entry, and its receptor has not been identified; once a receptor is known, computational analysis could assess whether adaptive mutations might shift its tropism toward humans.

The workflow can be extended to livestock species. Coronaviruses such as bovine coronavirus (BCoV) and porcine epidemic diarrhea virus (PEDV) rely on different attachment factors and receptors: BCoV binds 9-O-acetylated sialic acids, and aminopeptidase N has been proposed as a receptor for PEDV, although its role remains debated. ML models can be trained to predict binding to these receptors and assess the risk of cross-species transmission to other livestock or wildlife.

Integration with genomic surveillance platforms, such as GISAID, allows real-time analysis of emerging variants. The Zoonotic Spillover Pathways and Receptor Binding Evolution in Bat Reservoirs article provides further context on the ecological drivers of spillover.

Conclusion

Machine learning-driven prediction of receptor-binding dynamics represents a transformative approach for veterinary virology and pandemic preparedness. By combining structural modeling, molecular dynamics, and deep learning, researchers can rapidly assess the zoonotic potential of emerging coronaviruses. These methods enable proactive surveillance in animal reservoirs and inform risk mitigation strategies. Continued development of transfer learning techniques and integration with experimental data will further improve predictive accuracy. The computational frameworks described here are equally applicable to other zoonotic viruses, as discussed in related articles on Computational Prediction of Host Tropism and Receptor Binding Dynamics in Emerging Zoonotic Coronaviruses and Structural Dynamics of Avian Influenza Hemagglutinin.

References

Fields Virology, 7th Edition. Wolters Kluwer.
Merck Veterinary Manual, 11th Edition. Merck & Co.
Diseases of Poultry, 14th Edition. Wiley-Blackwell.
Protein-Protein Interface Design and Binding Energy Prediction. In: Structural Bioinformatics and Computer-Aided Drug Design. Springer.

Disclaimer: This article is for educational and informational purposes only. It is not intended to substitute for professional veterinary advice, diagnosis, treatment, or regulatory guidance. Always consult a licensed veterinarian or qualified specialist regarding animal health, disease diagnosis, and therapeutic decisions.