Zubair Khalid

Virologist/Molecular Biologist | Veterinarian | Bioinformatician

Conventional & Molecular Virology • Vaccine Development • Computational Biology

Dr. Zubair Khalid is a veterinarian and virologist specializing in conventional and molecular virology, vaccine development, and computational biology. Dedicated to advancing animal health through innovative research and multi-omics approaches.


Section: Computational Biology

Deep Mutational Scanning and Machine Learning for Predicting SARS-CoV-2 Spike Protein Evolution and Antibody Escape

Introduction

The continuous emergence of SARS-CoV-2 variants with mutations in the spike glycoprotein, particularly the receptor-binding domain (RBD), has necessitated advanced computational and experimental approaches to map viral fitness landscapes and predict immune evasion [1, 2]. Deep mutational scanning (DMS) coupled with machine learning (ML) models now provides a powerful framework for systematically characterizing the effects of individual amino acid substitutions on ACE2 receptor binding, protein expression, and antibody recognition [3, 4]. These technologies enable real-time assessment of emerging variants and inform the design of cross-reactive vaccines and therapeutics relevant to both human and veterinary coronaviruses [5, 6].

DMS experiments generate comprehensive mutational data by measuring the functional impact of thousands of single-point mutations in parallel [7]. When combined with ML architectures such as protein language models (PLMs), variational autoencoders (e.g., EVE), and autoregressive models (e.g., Tranception), these data can be used to predict epistatic interactions, evolutionary trajectories, and antibody escape potential with high accuracy [8, 9, 10]. This review details the experimental and computational pipeline from DMS data generation to predictive modeling, with emphasis on the SARS-CoV-2 spike RBD as a model system with comparative relevance to veterinary coronaviruses.

Deep Mutational Scanning Platforms and Data Generation

Yeast Surface Display of RBD Libraries

DMS of the SARS-CoV-2 spike RBD is most commonly performed using yeast surface display, where a library of RBD variants is expressed on the surface of Saccharomyces cerevisiae and subjected to selection for ACE2 binding or antibody escape [11, 12]. The library is generated by oligonucleotide-directed mutagenesis encoding all 19 possible amino acid substitutions at each position, typically yielding roughly 3,800 to 6,200 unique single-mutant variants depending on the length of the RBD construct. After mutagenesis, the library is cloned into a yeast display vector and transformed into yeast cells [12]. Expression of RBD variants fused to an epitope tag (e.g., HA or c-myc) is induced, and cells are incubated with soluble ACE2-Fc fusion protein or fluorescently labeled antibodies for selection [12, 13].
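The library-size arithmetic above (19 substitutions per position) can be illustrated with a minimal sketch. The function name, the short fragment, and the 201-residue construct length are assumptions chosen for illustration, not details taken from the cited protocols.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def single_substitutions(wt_seq, first_site=1):
    """Return every single amino acid substitution as a string such as 'N501Y'."""
    variants = []
    for offset, wt_aa in enumerate(wt_seq):
        site = first_site + offset
        for aa in AMINO_ACIDS:
            if aa != wt_aa:  # skip the wild-type residue itself
                variants.append(f"{wt_aa}{site}{aa}")
    return variants

# Illustrative 5-residue fragment, numbered from site 499 (assumed numbering)
toy_library = single_substitutions("PTNGV", first_site=499)
print(len(toy_library))  # 5 positions x 19 substitutions = 95

# Assumed construct length of 201 residues, for illustration only
print(19 * 201)  # 3819
```

A longer construct scales the count linearly, which is why the reported library sizes vary with the RBD boundaries used.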

Binding and non-binding populations are separated by fluorescence-activated cell sorting (FACS), and the frequency of each variant in the input and selected populations is determined by deep sequencing of the RBD region [12]. The enrichment score for each mutation is calculated as the log2 ratio of its frequency in the selected population to its frequency in the input population, normalized to the corresponding wild-type ratio. These enrichment scores serve as proxies for the effects of mutations on ACE2 binding, protein stability, or antibody escape [12, 13].
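As a rough sketch of the enrichment calculation described above, the following computes wild-type-normalized log2 enrichment from raw counts. The pseudocount, the function name, and the toy counts are assumptions added for illustration; published pipelines apply more careful error modeling.

```python
import math

def enrichment_scores(input_counts, selected_counts, wt="WT", pseudocount=0.5):
    """Compute log2(selected/input frequency) per variant, minus the wild-type value."""
    n_in = sum(input_counts.values())
    n_sel = sum(selected_counts.values())

    def log_ratio(variant):
        # The pseudocount (an assumption here) avoids log(0) for unobserved variants
        f_in = (input_counts[variant] + pseudocount) / n_in
        f_sel = (selected_counts.get(variant, 0) + pseudocount) / n_sel
        return math.log2(f_sel / f_in)

    wt_ratio = log_ratio(wt)
    return {variant: log_ratio(variant) - wt_ratio for variant in input_counts}

# Toy counts, invented for illustration
input_counts = {"WT": 1000, "N501Y": 100, "G502D": 100}
selected_counts = {"WT": 1000, "N501Y": 200, "G502D": 10}
scores = enrichment_scores(input_counts, selected_counts)
# N501Y doubles in frequency relative to wild type, so its score is close to +1;
# G502D is depleted, so its score is negative.
```

Because each score is normalized to wild type, differences in total sequencing depth between the input and selected samples cancel out.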

Lentiviral Pseudotyping for Functional Validation

Complementary to yeast display, lentiviral pseudotyping allows measurement of infectivity and neutralization in a mammalian cell context [13, 14]. Spike variants are cloned into a lentiviral vector and used to produce pseudovirions bearing the mutated spike. Pseudovirus infectivity is measured by transduction of ACE2-expressing target cells, and antibody neutralization is assessed by pre-incubating pseudovirus with serial dilutions of monoclonal antibodies or polyclonal sera [14, 35]. This system recapitulates native spike processing, membrane fusion, and entry, providing a more physiologic readout of fitness [13, 14].
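The serial-dilution readout can be reduced to a half-maximal inhibitory concentration. The sketch below uses log-linear interpolation between the two dilutions that bracket 50% neutralization. This is a simplifying assumption, since laboratories commonly fit a sigmoidal dose-response curve instead, and the reporter signal values are invented.

```python
import math

def fraction_neutralized(signal, no_antibody_signal):
    """Fraction of infectivity lost relative to the no-antibody control."""
    return 1.0 - signal / no_antibody_signal

def ic50_by_interpolation(concentrations, fractions):
    """Interpolate on a log10 concentration axis; concentrations must be ascending."""
    points = list(zip(concentrations, fractions))
    for (c_lo, f_lo), (c_hi, f_hi) in zip(points, points[1:]):
        if f_lo < 0.5 <= f_hi:
            t = (0.5 - f_lo) / (f_hi - f_lo)
            log_c = math.log10(c_lo) + t * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    return None  # 50% neutralization was not bracketed by the tested range

# Toy reporter signals (arbitrary units) at four antibody concentrations
concentrations = [0.01, 0.1, 1.0, 10.0]
signals = [9000, 7000, 3000, 500]
fractions = [fraction_neutralized(s, 10000) for s in signals]
ic50 = ic50_by_interpolation(concentrations, fractions)  # about 0.316
```

A variant that escapes the antibody would shift this value upward, or leave it undefined if 50% neutralization is never reached.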

Computational Pipelines for DMS Data Processing

Raw sequencing reads must be processed through standardized pipelines to convert counts to robust fitness estimates. Two commonly used tools are Enrich2 and DiMSum. Enrich2 scores variants with log ratios or weighted least-squares regression across time points and combines replicates with a random-effects model to report standard errors, while DiMSum estimates fitness with an error model that accounts for both count-based sampling error and additional variability between replicates [1, 4]. Both pipelines require careful filtering to remove variants with low coverage or ambiguous alignments. The output is a position-specific mutational effect matrix that can be used for downstream ML modeling [4].
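The filtering and matrix-building step can be sketched in a few lines. This is not the Enrich2 or DiMSum interface. The function name, the coverage threshold, and the toy values are hypothetical, and the sketch only shows the shape of the resulting position-by-amino-acid structure.

```python
def build_effect_matrix(scores, input_counts, min_input_count=20):
    """Map site -> {mutant amino acid -> score}, dropping low-coverage variants."""
    matrix = {}
    for mutation, score in scores.items():
        if input_counts.get(mutation, 0) < min_input_count:
            continue  # too few input reads for a reliable estimate
        site = int(mutation[1:-1])   # 'N501Y' -> 501
        mutant_aa = mutation[-1]     # 'N501Y' -> 'Y'
        matrix.setdefault(site, {})[mutant_aa] = score
    return matrix

# Toy scores and counts, invented for illustration
toy_scores = {"N501Y": 1.0, "N501D": -2.0, "G502D": -3.2}
toy_counts = {"N501Y": 100, "N501D": 5, "G502D": 100}
effect_matrix = build_effect_matrix(toy_scores, toy_counts)
# N501D is excluded because it has only 5 input reads
```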

Key Mutations in the RBD and Their Functional Impact

Numerous DMS studies have characterized the effects of RBD mutations that appear in variants of concern. Table 1 summarizes several critical substitutions and their reported effects on ACE2 affinity and antibody escape.

| Mutation | ACE2 Binding Effect | Antibody Escape Effect | Variant Association | Key References |
|---|---|---|---|---|
| E484K | Neutral to slightly increased | Strong escape from class 1/2 antibodies | Beta, Gamma, Omicron | [15, 16, 31] |
| N501Y | Increased affinity | Reduced escape; key for ACE2 adaptation | Alpha, Beta, Gamma, Omicron | [11, 12, 17] |
| K417N/T | Decreased affinity | Escape from class 1 antibodies | Beta, Gamma, Omicron BA.1 | [18, 16, 19] |
| L452R | Slightly increased | Escape from class 2 antibodies | Delta, Omicron BA.5 | [20, 15, 32] |
| Q493R | Decreased affinity (in Omicron) | Escape from multiple mAbs | Omicron BA.1 | [12, 33] |
| F486S | Decreased affinity | Escape from class 3 antibodies | Omicron BA.4/BA.5, XBB | [21, 22, 23] |
| N460K | Unclear | Strong escape (BQ.1.1 key) | BQ.1.1 | [22] |

Several mutations exhibit epistasis, where the effect of a mutation depends on the genetic background [1, 24]. For example, the Q498R mutation is more favorable in the Omicron background than in the ancestral Wuhan-Hu-1 background due to epistatic interactions with N501Y [12]. Such context-dependent effects are critical for accurate prediction of future evolutionary trajectories.
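On a log scale, pairwise epistasis is commonly quantified as the deviation of a double mutant from the sum of its single-mutant effects. The sketch below uses that additive null model. The numeric scores are hypothetical placeholders, not measurements from the cited studies.

```python
def pairwise_epistasis(score_double, score_a, score_b):
    """Deviation of the double mutant from additivity (all scores relative to wild type)."""
    return score_double - (score_a + score_b)

# Hypothetical log-scale binding scores, for illustration only
q498r_alone = -0.5
n501y_alone = 0.8
q498r_n501y = 1.2

epsilon = pairwise_epistasis(q498r_n501y, q498r_alone, n501y_alone)
# A positive epsilon means the pair performs better together than additivity predicts
```

An epsilon near zero indicates that the two mutations act independently, while a negative value indicates antagonism.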

Machine Learning Models for Fitness Landscape Prediction

Protein Language Models and Zero-Shot Prediction

Protein language models (PLMs) such as ESM-1v and Tranception have been adapted to predict the fitness of spike protein variants directly from sequence, without requiring DMS data as training input (zero-shot prediction) [3, 10]. These models are trained on large corpora of natural protein sequences and learn evolutionary constraints. When applied to the RBD, PLM scores correlate strongly with experimentally measured enrichment scores, particularly for mutations that affect stability or compatibility with ACE2 binding [10]. The EVE model uses a variational autoencoder trained on homologous sequences to compute an evolutionary index that predicts variant pathogenicity and has been applied to SARS-CoV-2 to forecast high-fitness mutations [8].
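Zero-shot scoring with a masked language model is often expressed as a log-odds ratio between the mutant and wild-type residues at the mutated position. The sketch below assumes the per-site probabilities have already been obtained from some model. The probability values are invented, and no specific model API is called.

```python
import math

def log_odds_score(site_probs, wt_aa, mut_aa):
    """log p(mutant) - log p(wild type) at one position; higher means more tolerated."""
    return math.log(site_probs[mut_aa]) - math.log(site_probs[wt_aa])

# Invented model probabilities for one RBD position (illustration only)
site_probs = {"N": 0.40, "Y": 0.20, "D": 0.01}

score_n_to_y = log_odds_score(site_probs, "N", "Y")  # log(0.5), mildly disfavored
score_n_to_d = log_odds_score(site_probs, "N", "D")  # log(0.025), strongly disfavored
```

Scores of this kind can then be compared against DMS enrichment scores, for example with a rank correlation, to assess how well the model captures measured effects.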

Supervised Learning with DMS Data

When DMS data are available, supervised ML models can be trained to predict enrichment scores with high accuracy. Deep mutational learning (DML) as described by Taft et al. uses a convolutional neural network trained on combinatorial mutation data to predict ACE2 binding and antibody escape for billions of possible RBD variants [13]. Similarly, Durumeric et al. used Gaussian process regression and random forest models trained on DMS data to simulate the fitness landscape across sequence space and identify mutational trajectories with high escape potential [4].
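Supervised models of this kind need a numeric representation of each variant sequence. One-hot encoding is a common choice and is used here as an assumption, not as the exact featurization of the cited studies. The helper names and the six-residue fragment are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def apply_mutations(wt_seq, mutations, first_site=1):
    """Apply mutations such as 'N501Y' to a wild-type sequence."""
    seq = list(wt_seq)
    for mutation in mutations:
        wt_aa, site, mut_aa = mutation[0], int(mutation[1:-1]), mutation[-1]
        idx = site - first_site
        if wt_seq[idx] != wt_aa:
            raise ValueError(f"{mutation} does not match wild type at site {site}")
        seq[idx] = mut_aa
    return "".join(seq)

def one_hot(sequence):
    """Encode a sequence as a list of 20-element 0/1 rows, one row per residue."""
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows

# Illustrative fragment numbered from site 498 (assumed numbering)
variant_seq = apply_mutations("QPTNGV", ["Q498R", "N501Y"], first_site=498)
encoded = one_hot(variant_seq)  # 6 rows x 20 columns, ready for a regression or CNN
```

The resulting matrix, paired with measured enrichment scores as labels, is the kind of input a convolutional network, random forest, or Gaussian process model could be trained on.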

Another approach integrates DMS data with structure-based features. Du et al. combined AlphaFold2-predicted structures, electrostatic surface potentials, and residue contact maps to predict the effect of mutations on antibody binding [5, 17]. The inclusion of structural information improves generalization to mutations not present in the training set, particularly for epitope mapping [5].

Epistasis Modeling and Clonal Interference

Epistatic interactions between mutations shape the accessibility of evolutionary pathways. Haddox et al. demonstrated that clonal interference in the antibody-selected environment can drive the emergence of combinations of mutations that individually confer little escape but together provide strong resistance [9]. ML models that incorporate pairwise or higher-order epistatic terms, such as those using the quasi-chemical approximation or pairwise interaction tensors, can capture these effects [4, 9]. For example, Nasir et al. used a random forest model that included pairwise mutation interaction features to predict antigenic grouping of SARS-CoV-2 variants with high accuracy [5].
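Pairwise interaction features can be generated by enumerating every pair of mutations present in a variant. The sketch below shows one hypothetical way to build such a feature list. The naming scheme is an assumption and is not drawn from the cited work.

```python
from itertools import combinations

def interaction_features(mutations):
    """Return single-mutation features followed by all pairwise combinations."""
    singles = sorted(mutations, key=lambda m: int(m[1:-1]))  # order by site number
    pairs = [f"{a}+{b}" for a, b in combinations(singles, 2)]
    return singles + pairs

features = interaction_features(["N501Y", "Q498R", "E484K"])
print(features)
# ['E484K', 'Q498R', 'N501Y', 'E484K+Q498R', 'E484K+N501Y', 'Q498R+N501Y']
```

The number of pairwise terms grows quadratically with the number of mutations, which is one reason higher-order terms are usually restricted or regularized.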

Integration with Structural Data

Structural data from cryo-EM and AlphaFold2 predictions are increasingly used as inputs to ML models. Zhao et al. showed that Omicron mutations S371L, S373P, and S375F stabilize the one-RBD-up conformation, reducing exposure of cryptic epitopes [17]. Incorporating residue depth, solvent accessibility, and B-factor predictions into ML models improves prediction of antibody escape [7, 17]. Cryo-EM structures of spike-antibody complexes, such as those of BA.2.86, provide high-resolution maps of epitope remodeling [21].

Predicting Antibody Escape

A central application of DMS-ML integration is the prediction of mutations that enable escape from neutralizing antibodies. Deep mutational scans for specific monoclonal antibodies (e.g., LY-CoV1404, S309) have mapped the complete set of RBD substitutions that reduce binding [2, 12, 17]. Greaney et al. aggregated DMS data into an "escape estimator" that scores arbitrary combinations of mutations for their predicted polyclonal antibody escape relative to the ancestral strain [25]. This tool correlates well with neutralization assays for emerging variants [25, 24].
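The idea of aggregating per-antibody escape maps into a single score for a set of mutations can be sketched as follows. This is a simplified illustration loosely inspired by the escape-estimator concept. The functional form, the `strength` exponent, and the toy escape values are assumptions and should not be read as the published calculator.

```python
def retained_binding(site_escape, mutated_sites, strength=2.0):
    """Average, over antibodies, of the binding retained after mutating the given sites.

    site_escape maps antibody -> {site: escape fraction in [0, 1]}.
    """
    retained = []
    for escape_by_site in site_escape.values():
        max_escape = max(escape_by_site.values())
        if max_escape <= 0:
            retained.append(1.0)  # this antibody has no known escape sites
            continue
        binding = 1.0
        for site in mutated_sites:
            x = escape_by_site.get(site, 0.0)
            binding *= ((max_escape - x) / max_escape) ** strength
        retained.append(binding)
    return sum(retained) / len(retained)

# Toy escape maps for two hypothetical antibodies
toy_escape = {
    "antibody_1": {484: 1.0, 501: 0.1},
    "antibody_2": {417: 0.8, 484: 0.0},
}
escape_484 = 1.0 - retained_binding(toy_escape, [484])            # 0.5
escape_484_417 = 1.0 - retained_binding(toy_escape, [484, 417])   # 1.0
```

In this toy case a single mutation at site 484 removes one antibody, while adding site 417 removes both, which mirrors the polygenic nature of escape from polyclonal responses.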

DMS also reveals that escape is often polygenic: multiple mutations are required to escape broadly neutralizing antibodies [9, 24]. Witte et al. showed that prior acquisition of Omicron BA.1/BA.2 mutations lowers the genetic barrier for subsequent escape from class 1 antibodies, a phenomenon driven by epistasis [24]. Machine learning models that account for such epistatic permissivity can rank novel variants before they spread.

Workflow Integration: From DMS to Evolutionary Forecasting

The integrated workflow for predicting spike evolution and antibody escape is depicted in Figure 1.

flowchart TD
    A[Deep Mutational Scanning] --> B[Enrich2 / DiMSum Processing]
    B --> C{Data Type}
    C --> D[ACE2 Binding Scores]
    C --> E[Antibody Escape Scores]
    C --> F[Stability / Expression]
    D & E & F --> G[Feature Engineering]
    G --> H[Supervised ML Model]
    G --> I[Protein Language Model]
    H & I --> J[Fitness Landscape]
    J --> K[Epistatic Interaction Matrix]
    J --> L[Escape Prediction for Novel Variants]
    L --> M[Ranking by Escape Potential]
    M --> N[Select Candidate Variants for Experimental Validation]
    N --> O[Lentiviral Pseudotyping Assays]
    O --> P[Update ML Model]
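The ranking and selection steps near the end of the workflow can be sketched as a filter followed by a sort. The viability threshold on the ACE2 score, the field names, and the toy predictions are hypothetical choices made for illustration.

```python
def rank_candidates(predictions, min_ace2_score=-1.0, top_n=2):
    """Keep variants predicted to retain ACE2 binding, then rank them by predicted escape."""
    viable = [p for p in predictions if p["ace2_score"] >= min_ace2_score]
    ranked = sorted(viable, key=lambda p: p["escape"], reverse=True)
    return [p["variant"] for p in ranked[:top_n]]

# Toy model outputs for four hypothetical variants
toy_predictions = [
    {"variant": "variant_A", "escape": 0.9, "ace2_score": -2.0},
    {"variant": "variant_B", "escape": 0.6, "ace2_score": 0.2},
    {"variant": "variant_C", "escape": 0.7, "ace2_score": -0.5},
    {"variant": "variant_D", "escape": 0.1, "ace2_score": 0.5},
]
selected = rank_candidates(toy_predictions)
print(selected)  # ['variant_C', 'variant_B']
```

Here variant_A has the highest predicted escape but is excluded because its predicted ACE2 binding falls below the threshold, reflecting the tradeoff between receptor affinity and antibody escape.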

Implications for Vaccine Design and Zoonotic Surveillance

The predictive power of DMS-ML approaches has direct implications for designing vaccines that anticipate future antigenic drift. By forecasting which escape variants are most accessible under immune pressure, computational models can guide the selection of immunogens that target conserved epitopes [5, 23]. This is particularly relevant for veterinary coronaviruses such as those infecting cats, dogs, pigs, and poultry, where analogous spike mutations could shift host range [26, 27].

For instance, the role of N501Y in enhancing ACE2 binding across species, including human and bat ACE2, underscores the importance of monitoring this position in animal reservoirs [11, 26]. Machine learning models trained on SARS-CoV-2 data can be fine-tuned for animal coronaviruses using homology-based transfers [7, 10]. The article "Spike Protein Mutational Landscapes and ACE2 Binding Affinity Prediction Using Machine Learning" provides additional detail on cross-species binding predictions (/knowledge/bioinformatics/spike-protein-mutational-landscapes-ace2-binding-affinity-prediction). Similarly, the "Zoonotic Spillover Pathways and Receptor Binding Evolution in Bat Reservoirs" article discusses the evolutionary origins of RBD mutations (/knowledge/bioinformatics/zoonotic-spillover-pathways-and-receptor-binding-evolution-in-bat-reservoirs).

Machine learning frameworks also enable rapid antigenic characterization of new variants using only sequence data, which is crucial for veterinary diagnostic laboratories that lack high-containment facilities for live virus neutralization tests [6, 27]. The integration of DMS-informed PLMs with structural modeling (e.g., AlphaFold2) allows virtual screening of spike mutations for their impact on antibody binding, reducing the need for exhaustive experimental mapping [3, 10].

Nonetheless, current models have limitations. They are predominantly trained on human polyclonal sera or monoclonal antibodies, which may not capture the distinct immune repertoires of veterinary species [6]. Furthermore, many models ignore mutations outside the RBD (e.g., NTD deletions, furin cleavage site changes) that contribute to antibody escape [20, 28]. Future work should extend DMS libraries to full-length spike proteins and incorporate data from animal vaccination studies [9, 29].

Conclusion

Deep mutational scanning combined with machine learning provides a high-resolution view of the SARS-CoV-2 spike fitness landscape and enables proactive identification of antibody escape mutations. The experimental platform of yeast display and lentiviral pseudotyping, coupled with computational pipelines like Enrich2 and DiMSum, generates the large-scale mutational data needed to train predictive models. Protein language models and related generative sequence models (e.g., Tranception, EVE), together with supervised deep learning, can forecast the impact of new mutations, while structural data from cryo-EM and AlphaFold2 refine predictions by capturing conformational and energetic constraints. These methods have direct relevance to veterinary virology, where analogous coronaviruses pose ongoing risks to animal health. Continued integration of DMS-ML into surveillance pipelines will be essential for staying ahead of viral evolution in both human and animal populations.

References

[1] Taylor AL, Starr TN. Deep mutational scanning of recent SARS-CoV-2 variants highlights changing amino acid preferences within epistatic hotspot residues. PLoS Pathog. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42330076/

[2] Shao C, Yang L, Xiao C, et al. Deep mutational scanning reveals the antibody escape and infectivity landscape of SARS-CoV-2 Omicron JN.1 and XEC receptor-binding domains. Emerg Microbes Infect. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42324717/

[3] Yang S, Luo X, Luo J, et al. A deep mutational scanning-informed protein language model predicts SARS-CoV-2 evolution dynamics with spatiotemporal resolution. Nat Microbiol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42204343/

[4] Durumeric AEP, McCarty S, Smith J, et al. Machine Learning-Driven Simulations of the SARS-CoV-2 Fitness Landscape from Deep Mutational Scanning Experiments. J Chem Inf Model. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42089465/

[5] Nasir A, Lee D, Avena LE, et al. Predictive modeling of immune escape and antigenic grouping of SARS-CoV-2 variants. J Virol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42037411/

[6] Shlesinger D, Sadilek V, Minot M, et al. Dissecting serum polyclonal antibody escape to SARS-CoV-2 variants by deep mutational learning. Cell Rep Methods. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/42030951/

[7] Soliman OA, Shahine Y, Baecker D, et al. Beyond the Mutation Abyss: Revisiting SARS-CoV-2 Receptor-Binding Domain Evolution from ACE2 Binding Optimization to Immune Epitope Remodeling. Pathogens. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41901725/

[8] Ding Z, Yuan HY. The role of receptor binding and immunity in SARS-CoV-2 fitness landscape: A modeling study. iScience. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41809055/

[9] Haddox HK, Abdel Aziz O, Galloway JG, et al. Clonal interference and changing selective pressures shape the escape of SARS-CoV-2 from hundreds of antibodies. Virus Evol. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41767406/

[10] Lamb KD, Hughes J, Lytras S, et al. From single-sequences to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2. Nat Commun. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41714330/

[11] Chen J, Wang R, Wang M, et al. Mutations Strengthened SARS-CoV-2 Infectivity. J Mol Biol. 2020. URL: https://pubmed.ncbi.nlm.nih.gov/32710986/

[12] Starr TN, Greaney AJ, Stewart C, et al. Deep mutational scans for ACE2 binding, RBD expression, and antibody escape in the SARS-CoV-2 Omicron BA.1 and BA.2 receptor-binding domains. bioRxiv. 2022. URL: https://www.semanticscholar.org/paper/a118b6995a4309538c03c84089ae8ae129f8ef01

[13] Taft JM, Weber CR, Gao B, et al. Deep mutational learning predicts ACE2 binding and antibody escape to combinatorial mutations in the SARS-CoV-2 receptor-binding domain. Cell. 2022. URL: https://www.semanticscholar.org/paper/e95d0d9fe7e759d15ceb484b1e3f472e39d5ca3b

[14] Schmidt F, Weisblum Y, Rutkowska M, et al. High genetic barrier to SARS-CoV-2 polyclonal neutralizing antibody escape. Nature. 2021. URL: https://www.semanticscholar.org/paper/756e6309b5a6cece1aa943ce692317bff7ab8a3c

[15] Chakraborty C, Sharma AR, Bhattacharya M, et al. A Detailed Overview of Immune Escape, Antibody Escape, Partial Vaccine Escape of SARS-CoV-2 and Their Emerging Variants With Escape Mutations. Front Immunol. 2022. URL: https://www.semanticscholar.org/paper/0c3c4df9dcfb731959288d2611a88b7893121b77

[16] Focosi D, Maggi F. Neutralising antibody escape of SARS‑CoV‑2 spike protein: Risk assessment for antibody‑based Covid‑19 therapeutics and vaccines. Rev Med Virol. 2021. URL: https://www.semanticscholar.org/paper/e8b05a2e9c1b7677172603e6f1ea1caa8527e32d

[17] Zhao Z, Zhou J, Tian M, et al. Omicron SARS-CoV-2 mutations stabilize spike up-RBD conformation and lead to a non-RBM-binding monoclonal antibody escape. Nat Commun. 2022. URL: https://www.semanticscholar.org/paper/b18d3063725004c0327f92c243b3c27ea217debc

[18] Tuekprakhon A, Huo J, Nutalai R, et al. Antibody escape of SARS-CoV-2 Omicron BA.4 and BA.5 from vaccine and BA.1 serum. Cell. 2022. URL: https://www.semanticscholar.org/paper/b11dc6c42e36b67b213c5bb662a27f54abe943d2

[19] Gan HH, Zinno J, Piano F, et al. Omicron Spike protein has a positive electrostatic surface that promotes ACE2 recognition and antibody escape. bioRxiv. 2022. URL: https://www.semanticscholar.org/paper/06b43b375d516a57c11fa27285a771c6ec6ca94d

[20] McCarthy KR, Rennick L, Nambulli S, et al. Recurrent deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape. Science. 2021. URL: https://www.semanticscholar.org/paper/65898f10da52a864f89da2c76877e764ae6044f4

[21] Liu C, Zhou D, Dijokaite-Guraliuc A, et al. A structure-function analysis shows SARS-CoV-2 BA.2.86 balances antibody escape and ACE2 affinity. Cell Rep Med. 2024. URL: https://www.semanticscholar.org/paper/c0a37474aed1d66cf0793e7fb26a756da0b9edb8

[22] Qu P, Evans JP, Faraone JN, et al. Distinct Neutralizing Antibody Escape of SARS-CoV-2 Omicron Subvariants BQ.1, BQ.1.1, BA.4.6, BF.7 and BA.2.75.2. bioRxiv. 2022. URL: https://www.semanticscholar.org/paper/41b427b0be15b304bff01d2996660d3d715c7ec1

[23] Makowski EK, Schardt J, Smith MD, et al. Mutational analysis of SARS-CoV-2 variants of concern reveals key tradeoffs between receptor affinity and antibody escape. PLoS Comput Biol. 2022. URL: https://www.semanticscholar.org/paper/63e29140a64b89c83df41d35e7c19bdca1ab5723

[24] Witte L, Baharani VA, Schmidt F, et al. Epistasis lowers the genetic barrier to SARS-CoV-2 neutralizing antibody escape. bioRxiv. 2022. URL: https://www.semanticscholar.org/paper/c8b3d503cac4e795bed594cb087404c333815f5b

[25] Greaney AJ, Starr TN, Bloom JD. An antibody-escape estimator for mutations to the SARS-CoV-2 receptor-binding domain. Virus Evol. 2022. URL: https://www.semanticscholar.org/paper/dd30f46b1a51e1aa289d02142b6378a306c83083

[26] Qiang XL, Xu P, Fang G, et al. Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus. Infect Dis Poverty. 2020. URL: https://pubmed.ncbi.nlm.nih.gov/32209118/

[27] Chakraborty C, Bhattacharya M, Pal S, et al. Prompt-engineering enabled LLM or MLLM and instigative bioinformatics pave the way to identify and characterize the significant SARS-CoV-2 antibody escape mutations. Int J Biol Macromol. 2024. URL: https://www.semanticscholar.org/paper/96c476fdc116226a348cc430bf39339765b5083c

[28] Gruell H, Vanshylla K, Korenkov M, et al. SARS-CoV-2 Omicron sublineages exhibit distinct antibody escape patterns. Cell Host Microbe. 2022. URL: https://www.semanticscholar.org/paper/4fd4acb0aa616e575abeeb593ee6be09696b4286

[29] Javanmardi K, Segall-Shapiro TH, Chou CW, et al. Antibody escape and cryptic

[30] Tuekprakhon A, Huo J, Nutalai R, et al. Further antibody escape by Omicron BA.4 and BA.5 from vaccine and BA.1 serum. bioRxiv. 2022. URL: https://www.semanticscholar.org/paper/96914baeb617b3f20eda29a0eb1a26ec1f974e39