What is Dr. Zubair Khalid's research focus?

Dr. Zubair Khalid specializes in molecular virology, mRNA vaccine development, and computational biology, with a focus on avian pathogens like IBDV and Avian Reovirus.

Where is Dr. Zubair Khalid currently working?

Dr. Zubair Khalid is a Postdoctoral Research Associate at the University of Maryland (UMD), specifically within the Department of Animal and Avian Sciences.

Machine Learning for Predicting T-Cell Epitope Immunogenicity: A Technical Review

Introduction

The adaptive immune response in vertebrates relies on T-cell receptors (TCRs) recognizing peptide fragments presented by major histocompatibility complex (MHC) molecules. In veterinary species, understanding this process is critical for vaccine development against pathogens such as highly pathogenic avian influenza virus (H5N1) in poultry and wild birds and porcine reproductive and respiratory syndrome virus in swine. The prediction of T-cell epitope immunogenicity, defined as the ability of a peptide-MHC complex to elicit a T-cell response, has evolved from simple binding affinity models to machine learning frameworks that incorporate structural, physicochemical, and sequence-based features [1, 2]. This review examines the computational approaches, datasets, and biological principles underlying modern immunogenicity prediction, with emphasis on the distinction between MHC binding and true immunogenicity.

Biological Foundations: HLA Binding versus Immunogenicity

MHC class I molecules, encoded by the highly polymorphic HLA (in humans), SLA (in swine), and BoLA (in cattle) genes, present peptides of 8-11 amino acids to CD8+ T cells [3, 4]. Epitope discovery has historically started with the prediction of peptide-MHC binding affinity, because binding is a necessary but not sufficient condition for immunogenicity [5, 6]. Binding affinity predictors such as NetMHCpan-4.2 achieve high accuracy through transfer learning and structural feature integration [4]. However, peptide-MHC binding does not guarantee T-cell activation. Immunogenicity also requires appropriate antigen processing, TCR recognition, and co-stimulatory signals [7, 8].

The distinction between binding and immunogenicity is biologically critical. A peptide may bind to an MHC molecule with high affinity yet fail to trigger a T-cell response due to several factors: the absence of a cognate TCR in the repertoire, competition with higher-affinity endogenous peptides, or suboptimal peptide-MHC conformational dynamics [9, 10]. Machine learning models that incorporate immunogenicity data, as opposed to binding data alone, demonstrate superior predictive performance for identifying true T-cell epitopes [11, 12].
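
To make the distinction concrete, the minimal sketch below treats binding as a first-pass filter only. It assumes the log-transform commonly applied to IC50 values, 1 - log(IC50)/log(50000), and the conventional 500 nM binder cutoff. The peptide labels and affinities are made up for illustration, and the function names are hypothetical rather than taken from any cited tool.

```python
import math

MAX_IC50_NM = 50000.0      # conventional upper bound used in the log-transform
BINDER_CUTOFF_NM = 500.0   # commonly used threshold for calling a binder


def binding_score(ic50_nm: float) -> float:
    """Map an IC50 in nM to a 0-1 score: 1 - log(IC50) / log(50000)."""
    clipped = min(max(ic50_nm, 1.0), MAX_IC50_NM)
    return 1.0 - math.log(clipped) / math.log(MAX_IC50_NM)


def candidate_binders(ic50_by_peptide: dict) -> list:
    """Return peptides that pass the binding filter, strongest binder first.

    Passing this filter only makes a peptide a candidate; it says nothing
    about whether a T-cell response will actually follow.
    """
    passed = [p for p, ic50 in ic50_by_peptide.items() if ic50 <= BINDER_CUTOFF_NM]
    return sorted(passed, key=lambda p: ic50_by_peptide[p])


# Illustrative (made-up) predicted affinities in nM.
example = {"PEPTIDE_A": 25.0, "PEPTIDE_B": 480.0, "PEPTIDE_C": 12000.0}
binders = candidate_binders(example)
```

In a pipeline of this shape, the peptides returned by `candidate_binders` would still need to be scored by a separate immunogenicity model or tested experimentally.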

Immunopeptidomics Datasets for Model Training

The availability of high-quality immunopeptidomics data has been instrumental in advancing machine learning models for immunogenicity prediction. Immunopeptidomics, the comprehensive profiling of peptides bound to MHC molecules using liquid chromatography-tandem mass spectrometry (LC-MS/MS), provides direct experimental evidence of peptide presentation [13, 14]. These datasets differ from binding assay data because they capture the natural antigen processing and presentation pathway, including proteasomal cleavage and transporter associated with antigen processing (TAP) translocation [15, 16].

Key immunopeptidomics resources include public repositories such as the Immune Epitope Database (IEDB) and the SysteMHC Atlas, which contain curated peptide-MHC complexes from multiple species [17, 18]. Training on immunopeptidomics data improves model generalizability because the peptides represent naturally processed and presented ligands, reducing the false positive rate associated with binding-only predictions [19, 20]. For example, the MixMHCpred2.2 algorithm, trained on large-scale immunopeptidomics datasets, demonstrates enhanced prediction of CD8+ T-cell epitopes for both human and murine systems [18].

Machine Learning Architectures for Epitope Prediction

Sequence-Based Methods

Sequence-based approaches represent the peptide and MHC molecule as amino acid sequences and apply various machine learning algorithms to predict binding or immunogenicity. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been applied to learn sequence motifs associated with MHC binding [21, 22]. NetMHCpan-4.2 employs an ensemble of artificial neural networks that incorporate both peptide sequence and MHC pseudo-sequence features, achieving state-of-the-art performance for binding prediction [4]. The TinyHLAnet architecture reduces computational complexity through a lightweight 3D structure-aware design while maintaining high prediction accuracy for CD8+ T-cell antigens [3].
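
As a rough illustration of how a peptide and an MHC pseudo-sequence can be turned into network input, the sketch below one-hot encodes both and concatenates them. This is a generic encoding under stated assumptions, not the input scheme of NetMHCpan-4.2 or TinyHLAnet; the padding length of 11 follows the 8-11 residue range mentioned earlier, and the pseudo-sequence in the example is a made-up placeholder.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 11  # longest class I peptide length considered here (8-11 residues)


def one_hot_residue(aa: str) -> list:
    """Return a 20-element indicator row for one standard amino acid."""
    row = [0] * len(AMINO_ACIDS)
    row[AA_INDEX[aa]] = 1  # raises KeyError for non-standard residues
    return row


def one_hot_peptide(peptide: str, max_len: int = MAX_LEN) -> list:
    """Encode a peptide as max_len rows, zero-padded on the right."""
    if not 8 <= len(peptide) <= max_len:
        raise ValueError(f"peptide length must be between 8 and {max_len}")
    rows = [one_hot_residue(aa) for aa in peptide]
    padding = [[0] * len(AMINO_ACIDS) for _ in range(max_len - len(peptide))]
    return rows + padding


def encode_pair(peptide: str, mhc_pseudo_seq: str) -> list:
    """Flatten peptide and MHC pseudo-sequence encodings into one vector."""
    rows = one_hot_peptide(peptide) + [one_hot_residue(aa) for aa in mhc_pseudo_seq]
    return [value for row in rows for value in row]


# Hypothetical 10-residue pseudo-sequence, for illustration only.
vector = encode_pair("SIINFEKL", "YFAMYQENMA")
```

One-hot rows are the simplest choice; substitution-matrix or learned embeddings can be swapped in at `one_hot_residue` without changing the rest of the sketch.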

Transformer-based models represent a significant advancement in sequence-based immunogenicity prediction. These models use self-attention mechanisms to capture long-range dependencies in peptide-MHC sequences and can integrate protein language model embeddings [1, 2]. The TULIP framework, a transformer-based unsupervised language model, learns representations of interacting peptides and TCRs that generalize to unseen epitopes, addressing the challenge of TCR repertoire diversity [9]. Similarly, EPIC-TRACE combines attention mechanisms with contextualized embeddings to predict TCR binding to epitopes not encountered during training [12].
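
The self-attention step that these models rely on can be sketched in a few lines. The version below is a deliberately simplified single-head variant that uses the token vectors directly as queries, keys, and values; real transformers learn separate projection matrices and stack many such layers. It is not the TULIP or EPIC-TRACE implementation, and the toy token vectors are made up.

```python
import math


def softmax(scores: list) -> list:
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: list) -> tuple:
    """Scaled dot-product self-attention with identity projections (Q = K = V).

    tokens: list of equal-length float vectors, one per residue position.
    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(row[j] * tokens[j][c] for j in range(len(tokens))) for c in range(dim)]
        )
    return outputs, weights


# Toy 2-dimensional embeddings for three positions (made-up values).
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_outputs, toy_weights = self_attention(toy_tokens)
```

Because every position attends to every other position, distant residues in the peptide or TCR sequence can influence each other in a single layer, which is the property the text refers to as capturing long-range dependencies.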

Structure-Aware Methods

The three-dimensional structure of the peptide-MHC-TCR complex provides critical information for immunogenicity prediction. Structural features such as peptide solvent accessibility, backbone conformation, and side-chain orientation influence TCR recognition [3, 23]. Deep learning architectures that incorporate structural information, such as graph neural networks and 3D CNNs, can model the spatial arrangement of atoms in the binding groove [24, 25].
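
One common way to hand a structure to a graph neural network is to connect residues whose representative atoms lie within a distance cutoff. The sketch below builds such an edge list from 3D coordinates. The 8 angstrom cutoff and the use of one coordinate per residue are assumptions chosen for illustration, and the coordinates in the example are made up rather than taken from a solved structure.

```python
import math
from itertools import combinations


def contact_edges(coords: list, cutoff: float = 8.0) -> list:
    """Return index pairs (i, j), i < j, of residues within cutoff angstroms.

    coords: one (x, y, z) tuple per residue, e.g. its alpha-carbon position.
    """
    edges = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= cutoff:
            edges.append((i, j))
    return edges


def contact_degree(n_residues: int, edges: list) -> list:
    """Count contacts per residue, a simple per-node structural feature."""
    degree = [0] * n_residues
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return degree


# Made-up coordinates: three residues in a row and one far away.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_edges = contact_edges(toy_coords)
toy_degree = contact_degree(len(toy_coords), toy_edges)
```

The resulting edge list and per-residue contact counts are the kind of input a graph neural network or a classical classifier could consume alongside sequence features.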

TinyHLAnet explicitly models the 3D structure of the peptide-MHC complex using a lightweight architecture that predicts antigen presentation probability [3]. The integration of structural features from solved crystal structures or computational models (e.g., AlphaFold-derived structures) improves prediction of cryptic epitopes that may not be identified by sequence-based methods alone [11]. The DeepNetBim model combines network analysis with deep learning to predict HLA-epitope interactions by simultaneously considering binding and immunogenicity information [23].

Hybrid and Ensemble Approaches

Hybrid models combine multiple feature types and algorithms to capture the multifaceted determinants of immunogenicity. Physicochemical properties, including hydrophobicity, charge, and molecular weight, can be encoded as feature vectors for input into machine learning classifiers [10, 26]. Bukhari and Ogudo developed hybrid models for respiratory syncytial virus epitope prediction that integrate amino acid composition, dipeptide composition, and position-specific scoring matrices [7].
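
The feature types named above are straightforward to compute. The sketch below derives amino acid composition, dipeptide composition, and mean hydropathy (Kyte-Doolittle scale) for a single peptide. It is an illustrative feature extractor under assumed choices, not the feature set of any cited model, and it omits position-specific scoring matrices because those require an external alignment.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

# Kyte-Doolittle hydropathy index per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(peptide: str) -> list:
    """Fraction of each of the 20 standard residues (20 values)."""
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def dipeptide_composition(peptide: str) -> list:
    """Fraction of each adjacent residue pair (400 values)."""
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    return [pairs.count(dp) / len(pairs) for dp in DIPEPTIDES]


def mean_hydropathy(peptide: str) -> float:
    """Average Kyte-Doolittle hydropathy over the peptide."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


def feature_vector(peptide: str) -> list:
    """Concatenate all features into one 421-element vector."""
    return (
        amino_acid_composition(peptide)
        + dipeptide_composition(peptide)
        + [mean_hydropathy(peptide)]
    )


features = feature_vector("SIINFEKL")
```

A vector like `features` can be passed to any standard classifier; in practice the 400 dipeptide columns are sparse for short peptides, so feature selection is often applied first.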

Ensemble methods, including random forests, gradient boosting, and voting classifiers, aggregate predictions from multiple base learners to improve robustness and generalizability [20, 21]. Decision tree-based ensemble models have been applied to predict T-cell epitopes for Zika virus, demonstrating superior performance compared to individual classifiers [20, 22]. The iTTCA-Hybrid method employs hybrid feature representation that combines sequence-derived features with evolutionary information for improved identification of tumor T-cell antigens [31].
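
Voting over base learners can be illustrated without any particular library. The sketch below implements soft voting, in which per-model probabilities are averaged (optionally with weights) before thresholding. The probabilities in the example are made up, and the function names are hypothetical.

```python
def soft_vote(model_probs: list, weights: list = None) -> list:
    """Weighted average of per-model probabilities for each sample.

    model_probs: one list of probabilities per model, all of equal length.
    weights: optional per-model weights; defaults to equal weighting.
    """
    if weights is None:
        weights = [1.0] * len(model_probs)
    total = sum(weights)
    n_samples = len(model_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total
        for i in range(n_samples)
    ]


def to_labels(probs: list, threshold: float = 0.5) -> list:
    """Convert averaged probabilities to 0/1 immunogenicity calls."""
    return [1 if p >= threshold else 0 for p in probs]


# Made-up outputs of three base learners for two peptides.
base_outputs = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]]
averaged = soft_vote(base_outputs)
calls = to_labels(averaged)
```

Averaging probabilities rather than hard labels lets a confident base learner outweigh two marginal ones, which is one reason soft voting is often preferred when the base models are reasonably calibrated.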

Modeling the TCR-Epitope Interface

The specificity of T-cell recognition is determined by the interaction between the TCR complementarity-determining regions (CDRs) and the peptide-MHC surface. Modeling this interaction is computationally challenging due to the extreme diversity of TCR sequences, with a theoretical repertoire estimated at 10^15 to 10^20 unique receptors, far more than any single individual carries [9, 27]. The TCR-peptide-MHC interface involves contacts between the CDR1, CDR2, and CDR3 loops of both TCR alpha and beta chains and the peptide and MHC alpha helices [12, 13].

Machine learning models for TCR-epitope binding prediction can be categorized into two approaches: pan-specific models that predict binding for any TCR-epitope pair, and repertoire-based models that predict the probability of a response given a population of TCRs [28, 29]. The TEINet deep learning framework uses a CNN architecture to model TCR-epitope binding specificity, achieving high accuracy on curated datasets of known TCR-peptide pairs [17]. TEPCAM provides interpretable predictions of TCR-epitope binding through an attention-based architecture that highlights critical residue contacts [13].

Structural features of the TCR-epitope interface, including buried surface area, hydrogen bonding patterns, and shape complementarity, can be computed from 3D models and used as input features for machine learning classifiers [3, 24]. The peptide-PRISM method integrates structural modeling with machine learning to identify both canonical and cryptic T-cell epitopes for cytomegalovirus, demonstrating the value of structure-aware approaches [11]. The Mermaid diagram below illustrates the workflow for integrating structural features into immunogenicity prediction.

flowchart TD
    A[Peptide Sequence] --> B[Structural Modeling]
    C[MHC Allele] --> B
    B --> D[Peptide-MHC Complex Structure]
    D --> E[Feature Extraction]
    E --> F[3D Contact Features]
    E --> G[Solvent Accessibility]
    E --> H[Binding Energy]
    F --> I[Machine Learning Classifier]
    G --> I
    H --> I
    J[TCR Sequence] --> K[TCR Modeling]
    K --> L[TCR CDR Loop Features]
    L --> I
    I --> M[Immunogenicity Prediction]
    M --> N[Epitope Validation]

Training on Quantitative Immunogenicity Data

Quantitative immunogenicity data, where T-cell responses are measured as a continuous variable (e.g., ELISPOT spot counts, cytokine production levels), provide richer training signals than binary classification labels [32, 33]. Ogishi and Yotsuyanagi demonstrated that quantitative prediction of immunogenicity landscapes in sequence space improves discrimination between immunogenic and non-immunogenic peptides [35]. DeepHLApan incorporates both binding and immunogenicity information through a multitask learning framework, jointly predicting peptide-MHC binding affinity and T-cell response magnitude [32].
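
A multitask objective of the kind described above can be sketched as a weighted sum of two regression losses. The version below is an assumption-laden illustration, not the DeepHLApan loss: it uses mean squared error for both tasks, a hypothetical `resp_weight` parameter to balance them, and `None` to mark peptides that have a binding measurement but no T-cell response measurement.

```python
def masked_mse(preds: list, targets: list) -> float:
    """Mean squared error over entries whose target is not None."""
    pairs = [(p, t) for p, t in zip(preds, targets) if t is not None]
    if not pairs:
        return 0.0  # no labels for this task in the batch
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)


def multitask_loss(bind_pred, bind_true, resp_pred, resp_true, resp_weight=0.5):
    """Combine binding and response losses into one training objective.

    resp_weight (assumed hyperparameter) sets the share given to the
    T-cell response task; the remainder goes to the binding task.
    """
    bind_term = masked_mse(bind_pred, bind_true)
    resp_term = masked_mse(resp_pred, resp_true)
    return (1.0 - resp_weight) * bind_term + resp_weight * resp_term


# Made-up batch of two peptides; the first has no response measurement.
loss = multitask_loss(
    bind_pred=[0.5, 0.8], bind_true=[0.5, 0.6],
    resp_pred=[1.0, 2.0], resp_true=[None, 1.0],
)
```

Masking missing labels lets the shared layers learn from the much larger pool of binding data while the response head is trained only on peptides with measured T-cell responses.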

Immunogenicity prediction models that incorporate peptide processing features, including proteasomal cleavage scores and TAP transport efficiency, achieve improved performance compared to models using binding affinity alone [14, 26]. The INeo-Epp method uses sequence-related amino acid features, including position-specific scoring matrices and physicochemical properties, to predict HLA class I immunogenicity [30]. Smith et al. applied machine learning to predict tumor antigen immunogenicity, demonstrating that features such as peptide-MHC stability and TCR contact residue composition are among the most informative predictors [33].

Challenges and Limitations

Several challenges persist in machine learning for T-cell epitope immunogenicity prediction. Data imbalance is a critical issue, as experimentally validated immunogenic epitopes are far fewer than non-immunogenic binders [2, 16]. Class imbalance leads to models that predict the majority class accurately but perform poorly on the minority immunogenic class [21, 34]. Techniques such as synthetic oversampling, cost-sensitive learning, and data augmentation are employed to mitigate this issue [5, 22].
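
Cost-sensitive learning, one of the mitigation techniques mentioned above, can be sketched as follows. The weights use the common inverse-frequency heuristic n_samples / (n_classes * class_count), and the loss is a weighted binary cross-entropy. The label counts and probabilities are made up, and the function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels: list) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(y_true: list, y_prob: list, weights: dict) -> float:
    """Binary cross-entropy in which each sample is scaled by its class weight."""
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Made-up dataset: 90 non-immunogenic binders (0) and 10 immunogenic epitopes (1).
labels = [0] * 90 + [1] * 10
class_weights = balanced_class_weights(labels)
```

With these weights, a misclassified immunogenic peptide contributes nine times as much to the loss as a misclassified non-immunogenic one, which counteracts the tendency to predict the majority class.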

Cross-validation strategies must account for sequence redundancy, as homologous peptides may share similar binding properties [6, 15]. Training and test sets should be partitioned with sequence identity thresholds to avoid inflated performance estimates. The generalization of models across different species and MHC alleles remains a challenge, particularly for non-human species with limited immunopeptidomics data [18, 27].
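
The partitioning idea can be sketched as a two-step procedure: cluster peptides by sequence identity, then assign whole clusters to either the training or the test set. The version below is a simplified illustration with assumed choices: identity is computed position by position and only between peptides of equal length, clustering is greedy, and the 0.8 threshold is a placeholder rather than a recommended value. The variant peptides in the example are made up.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions; different lengths score 0 (simplification)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_clusters(peptides: list, threshold: float = 0.8) -> list:
    """Put each peptide in the first cluster whose representative it matches."""
    clusters = []  # each cluster is a list; its first member is the representative
    for pep in peptides:
        for cluster in clusters:
            if identity(pep, cluster[0]) >= threshold:
                cluster.append(pep)
                break
        else:
            clusters.append([pep])
    return clusters


def split_by_cluster(clusters: list, test_fraction: float = 0.25) -> tuple:
    """Send whole clusters to the test set until it holds about test_fraction."""
    total = sum(len(c) for c in clusters)
    train, test = [], []
    for cluster in sorted(clusters, key=len):  # smallest clusters first
        if len(test) + len(cluster) <= test_fraction * total:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Illustrative peptides; the two SIINFEK* variants are made up.
peptides = ["SIINFEKL", "SIINFEKM", "SIINFEKV", "GILGFVFTL", "NLVPMVATV"]
clusters = greedy_clusters(peptides)
train_set, test_set = split_by_cluster(clusters)
```

Because clusters are never split, no test peptide has a near-identical relative in the training set at the chosen threshold. Production pipelines typically use alignment-based clustering tools instead of this position-wise comparison.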

Another limitation is the static nature of current models, which predict immunogenicity based solely on peptide sequence and MHC allele. Factors such as TCR repertoire composition, host immune status, and infection history are dynamic and difficult to incorporate into static prediction frameworks [24, 28]. Future models may integrate longitudinal immunogenicity data and systems immunology approaches to address this limitation.

Future Directions

The integration of single-cell technologies, including single-cell RNA sequencing and single-cell TCR sequencing, with machine learning offers opportunities to model T-cell responses at the clonal level [9, 35]. Graph neural networks that model the peptide-MHC-TCR complex as a graph of interacting residues can capture geometric features that are not accessible to sequence-based models [3, 17].

Transfer learning from large protein language models, such as those pre-trained on massive protein sequence databases, can improve prediction for epitopes with limited training data [1, 4]. Foundation protein language models for T-cell epitope prediction, as demonstrated for influenza A virus, show promise for rapid adaptation to emerging pathogens [1]. The application of these approaches to veterinary pathogens, including Escherichia coli in chickens and Mycoplasma bovis in feedlot cattle, could accelerate vaccine development for livestock and poultry.

References

[1] Bukhari SNH, Ogudo KA. Foundation Protein Language Models for Influenza A Virus T-Cell Epitope Prediction: A Transformer-Based Viroinformatics Framework. Viruses. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41902287/

[2] Cheng X, Wu H, Chen P et al. Progress of Deep Learning Prediction of CD8+ T-Cell Epitopes. Proteomics. 2026. URL: https://pubmed.ncbi.nlm.nih.gov/41452164/

[3] Sakthivel NC, Mukherjee S, Chandra N. TinyHLAnet: A Light-Weight 3D Structure-Aware Architecture for Rapid and Explainable Identification of CD8+ T-Cell Antigens. HLA. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/41044839/

[4] Nilsson JB, Greenbaum J, Peters B et al. NetMHCpan-4.2: improved prediction of CD8+ epitopes by use of transfer learning and structural features. Front Immunol. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/40852704/

[5] Wohlwend J, Nathan A, Shalon N et al. Deep learning enhances the prediction of HLA class I-presented CD8(+) T cell epitopes in foreign pathogens. Nat Mach Intell. 2025. URL: https://pubmed.ncbi.nlm.nih.gov/40008296/

[6] Bukhari SNH, Ogudo KA. Prediction of antigenic peptides of SARS-CoV-2 pathogen using machine learning. PeerJ Comput Sci. 2024. URL: https://pubmed.ncbi.nlm.nih.gov/39650382/

[7] Bukhari SNH, Ogudo KA. Hybrid Predictive Machine Learning Model for the Prediction of Immunodominant Peptides of Respiratory Syncytial Virus. Bioengineering (Basel). 2024. URL: https://pubmed.ncbi.nlm.nih.gov/39199749/

[8] Yang Q, Xu L, Dong W et al. HLAIImaster: a deep learning method with adaptive domain knowledge predicts HLA II neoepitope immunogenic responses. Brief Bioinform. 2024. URL: https://pubmed.ncbi.nlm.nih.gov/38920343/

[9] Meynard-Piganeau B, Feinauer C, Weigt M et al. TULIP: A transformer-based unsupervised language model for interacting peptides and T cell receptors that generalizes to unseen epitopes. Proc Natl Acad Sci U S A. 2024. URL: https://pubmed.ncbi.nlm.nih.gov/38838016/

[10] Bukhari SNH, Elshiekh E, Abbas M. Physicochemical properties-based hybrid machine learning technique for the prediction of SARS-CoV-2 T-cell epitopes as vaccine targets. PeerJ Comput Sci. 2024. URL: https://pubmed.ncbi.nlm.nih.gov/38686005/

[11] Rein AF, Lauruschkat CD, Muchsin I et al. Identification of novel canonical and cryptic HCMV-specific T-cell epitopes for HLA-A∗03 and HLA-B∗15 via peptide-PRISM. Blood Adv. 2024. URL: https://pubmed.ncbi.nlm.nih.gov/38127299/

[12] Korpela D, Jokinen E, Dumitrescu A et al. EPIC-TRACE: predicting TCR binding to unseen epitopes using attention and contextualized embeddings. Bioinformatics. 2023. URL: https://pubmed.ncbi.nlm.nih.gov/38070156/

[13] Chen J, Zhao B, Lin S et al. TEPCAM: Prediction of T-cell receptor-epitope binding specificity via interpretable deep learning. Protein Sci. 2024. URL: https://pubmed.ncbi.nlm.nih.gov/37983648/

[14] Lee CH, Huh J, Buckley PR et al. A robust deep learning workflow to predict CD8+ T-cell epitopes. Genome Med. 2023. URL: https://pubmed.ncbi.nlm.nih.gov/37705109/

[15]