Machine Learning for Predicting Host-Virus Protein-Protein Interaction Networks
Introduction
The molecular interface between a viral pathogen and its host cell is fundamentally governed by protein-protein interactions (PPIs) [1, 2]. These interactions mediate viral entry, replication, immune evasion, and pathogenesis [3, 4]. Experimental methods for identifying host-virus PPIs, such as yeast two-hybrid screens, co-immunoprecipitation, and affinity purification mass spectrometry, remain time-consuming, costly, and difficult to scale to the vast combinatorial space of potential interactions [5, 6]. Computational methods, particularly machine learning (ML) approaches, have emerged as powerful tools to predict host-virus PPIs with high throughput and reasonable accuracy [7, 8]. This review examines ML methodologies for predicting host-virus PPI networks, with a focus on graph neural networks, protein co-evolution, sequence patterns, and structural pocket mapping. The discussion is framed within the context of veterinary virology, drawing parallels to human systems where relevant but emphasizing applications to animal pathogens.
Biological Basis of Host-Virus Protein-Protein Interactions
Host-virus PPIs occur when viral proteins bind to specific host proteins to hijack cellular machinery [9, 10]. The binding interface typically involves complementary surface patches characterized by specific amino acid compositions, electrostatic potentials, and hydrophobic contacts [2, 11]. Viral proteins often mimic host interaction motifs to subvert normal signaling pathways [12, 13]. For example, influenza A virus NS1 protein binds to host CPSF30 to inhibit host mRNA processing [14, 15]. Similarly, coronavirus spike proteins engage host ACE2 receptors to initiate membrane fusion [16, 17]. Understanding these interactions at the molecular level is critical for predicting host range, zoonotic potential, and therapeutic targets [18, 19].
The structural determinants of binding can be captured through features such as amino acid composition, dipeptide composition, conjoint triad, pseudo amino acid composition, and autocorrelation descriptors [20, 21]. Additionally, evolutionary information encoded in position-specific scoring matrices (PSSMs) and co-evolutionary signals between interacting protein families provide rich predictive features [22, 23]. The three-dimensional geometry of binding pockets, including solvent accessibility, secondary structure propensities, and residue depth, further refines interaction predictions [2, 24].
Machine Learning Approaches for Host-Virus PPI Prediction
Feature Engineering and Representation
Transforming protein sequences into numerical feature vectors is a critical preprocessing step [5, 25]. Two primary paradigms exist: Independent Protein Feature (IPF) extraction, where host and viral proteins are encoded separately, and Merged Protein Feature (MPF) extraction, where sequences are concatenated before encoding [5]. An Extended Protein Feature (EPF) method that combines both approaches has shown improved performance with traditional ML classifiers such as Support Vector Machine (SVM), Logistic Regression, and Multilayer Perceptron [5].
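The difference between the three paradigms can be sketched in a few lines. This is a minimal illustration, not the implementation from [5]: amino acid composition stands in for whatever encoder is actually used, the function names `ipf`, `mpf`, and `epf` are hypothetical, and EPF is assumed here to be a simple concatenation of the IPF and MPF vectors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: fraction of each standard residue in seq."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def ipf(host_seq, virus_seq, encode=aac):
    """Independent features: encode each protein separately, then join."""
    return encode(host_seq) + encode(virus_seq)


def mpf(host_seq, virus_seq, encode=aac):
    """Merged features: concatenate the sequences first, then encode once."""
    return encode(host_seq + virus_seq)


def epf(host_seq, virus_seq, encode=aac):
    """Extended features (assumed here to be IPF followed by MPF)."""
    return ipf(host_seq, virus_seq, encode) + mpf(host_seq, virus_seq, encode)


# Toy sequences, invented for illustration only.
host, virus = "MKTAYIAKQR", "MDSNTVSSFQ"
print(len(ipf(host, virus)), len(mpf(host, virus)), len(epf(host, virus)))
# prints: 40 20 60
```

With a 20-dimensional encoder, IPF keeps host and viral composition in separate halves of the vector, while MPF blends them into one profile; the combined vector simply exposes both views to the classifier.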
Common sequence-based features include:
- Amino acid composition (AAC) and pseudo amino acid composition (PAAC) [7, 16]
- Dipeptide composition (DC) and conjoint triad (CT) [22, 26]
- Moran autocorrelation and normalized Moreau-Broto autocorrelation [22]
- Tripeptide features derived from reduced amino acid alphabets [12]
- Gene ontology (GO) terms and natural language processing (NLP) embeddings such as bag-of-words, TF-IDF, and doc2vec [7, 14, 21]
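Two of the descriptors listed above, dipeptide composition and conjoint triad, can be sketched as follows. The seven-class residue grouping is the one commonly used for conjoint triad encoding; the min-max scaling is one of several normalizations in use and is an assumption here, as is the choice to silently skip non-standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs

# Seven residue classes commonly used for conjoint triad encoding.
CT_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: i for i, group in enumerate(CT_GROUPS) for aa in group}
TRIAD_INDEX = {t: i for i, t in enumerate(product(range(7), repeat=3))}  # 343


def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered residue pairs."""
    seq = "".join(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)
    return [counts[d] / total for d in DIPEPTIDES]


def conjoint_triad(seq):
    """343-dimensional count of class triads, min-max scaled to [0, 1]."""
    # Note: dropping unknown residues makes their neighbours adjacent.
    classes = [RESIDUE_TO_CLASS[ch] for ch in seq.upper() if ch in RESIDUE_TO_CLASS]
    counts = [0] * len(TRIAD_INDEX)
    for i in range(len(classes) - 2):
        counts[TRIAD_INDEX[tuple(classes[i:i + 3])]] += 1
    lo, hi = min(counts), max(counts)
    if hi == lo:
        return [0.0] * len(counts)
    return [(c - lo) / (hi - lo) for c in counts]
```

Both functions return fixed-length vectors regardless of sequence length, which is what allows proteins of different sizes to be fed to the same classifier.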
Deep protein sequence embeddings, inspired by NLP, have substantially changed feature representation [27, 28]. Embedding methods such as word2vec and doc2vec, together with tokenization schemes such as Byte Pair Encoding (BPE), convert amino acid sequences into dense vector representations that capture contextual and semantic relationships [20, 21]. The Siamese Tailored deep sequence Embedding of Proteins (STEP) approach integrates such embeddings into a Siamese neural network architecture to predict virus-host PPIs [27].
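Before an NLP-style embedding can be trained, a protein sequence has to be split into "words". A common choice is overlapping k-mers; the sketch below shows that tokenization step and a simple integer vocabulary. The value k = 3, the stride, and the function names are illustrative assumptions rather than settings taken from any of the cited models, and the embedding training itself is left to whichever word2vec or doc2vec implementation is used.

```python
def kmer_tokens(seq, k=3, stride=1):
    """Split a sequence into k-mers; stride=1 gives overlapping tokens."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(token_lists):
    """Map each distinct token to an integer id; 0 is reserved for unknowns."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode(tokens, vocab):
    """Convert tokens to ids, sending unseen tokens to 0."""
    return [vocab.get(tok, 0) for tok in tokens]


# Toy sequence, invented for illustration only.
tokens = kmer_tokens("MKTAY")
print(tokens)  # prints: ['MKT', 'KTA', 'TAY']
```

The resulting token lists play the role of sentences for the embedding model, and the integer ids are what a downstream embedding layer would consume.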
Classical Machine Learning Classifiers
Several supervised learning algorithms have been benchmarked for host-virus PPI prediction [6, 9]. Random Forest (RF) and SVM are among the most frequently used, with RF often achieving high accuracy, an advantage commonly attributed to its ensemble nature [6, 26]. Deep Forest, an ensemble of cascade forest models, has demonstrated superior predictive accuracy compared to RF and SVM in some studies [6]. Other classifiers include Logistic Regression, Naive Bayes, K-Nearest Neighbors, and eXtreme Gradient Boosting (XGBoost) [6, 22, 29]. Feature selection methods, such as correlation coefficient-based filtering and Max-Min Parents and Children (MMPC), reduce dimensionality while maintaining predictive performance [7, 12].
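As a rough illustration of correlation-based filtering, the sketch below ranks features by the absolute Pearson correlation between each feature column and the binary interaction label, then keeps the top k. This is a generic filter written under that assumption; it is not the specific procedure published in [12], and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_top_k(X, y, k):
    """Return the (sorted) column indices of the k features most
    correlated, in absolute value, with the label vector y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy feature matrix (rows = protein pairs) and labels, for illustration only.
X = [[0, 5, 1], [1, 5, 0], [0, 5, 1], [1, 5, 1]]
y = [0, 1, 0, 1]
print(select_top_k(X, y, 2))  # prints: [0, 2]
```

A filter like this is cheap and model-agnostic, but it scores each feature in isolation, so redundant features can all survive; that limitation is one motivation for methods such as MMPC.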
Deep Learning Architectures
Deep neural networks (DNNs) are now widely treated as the state of the art for host-virus PPI prediction because they can learn hierarchical features from raw sequence data [4, 24]. Convolutional neural networks (CNNs) extract local, position-dependent motifs, while long short-term memory (LSTM) networks capture long-range dependencies [4, 21, 25]. Hybrid architectures combining CNN and LSTM layers have been proposed to leverage both local and sequential information [4, 25]. The LSTM-PHV model, using word2vec embeddings, achieved an AUC of 0.976 and accuracy of 98.4% on human-virus datasets [21].
Multi-scale CNNs and Siamese CNN architectures have been employed to compare pairs of protein sequences directly [17, 30]. The Siamese network processes host and viral proteins through identical subnetworks, learning a similarity metric that indicates interaction likelihood [17, 27, 30]. Transfer learning strategies, including frozen and fine-tuning approaches, enable knowledge transfer from well-characterized virus-host systems to novel or understudied viruses [11, 17, 31].
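The defining property of a Siamese architecture is weight sharing: both proteins pass through the same encoder before their representations are compared. The toy sketch below makes that explicit with a single linear layer and untrained, hand-picked weights. The layer sizes, the use of absolute difference and element-wise product as the comparison, and all names are assumptions for illustration; the cited models use deep convolutional or embedding-based subnetworks instead.

```python
import math


def encode(features, weights):
    """Shared encoder: one linear layer followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]


def siamese_score(host_feats, virus_feats, weights, out_weights, bias=0.0):
    """Interaction probability from two inputs run through the SAME encoder."""
    h = encode(host_feats, weights)
    v = encode(virus_feats, weights)
    # Compare the two encodings with symmetric operations.
    diff = [abs(a - b) for a, b in zip(h, v)]
    prod = [a * b for a, b in zip(h, v)]
    z = bias + sum(w * f for w, f in zip(out_weights, diff + prod))
    return 1.0 / (1.0 + math.exp(-z))


# Untrained, hand-picked parameters and toy inputs, for illustration only.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]   # 3 input features -> 2 hidden units
out_w = [0.7, -0.4, 1.2, 0.9]              # weights over [diff, prod]
score = siamese_score([0.1, 0.4, 0.5], [0.3, 0.3, 0.4], W, out_w)
```

Because the encoder is shared and the comparison operations are symmetric, swapping the two inputs gives the same score in this sketch; in training, the shared weights would be updated from both branches at once.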
Graph Neural Networks and Network Topology
Graph neural networks (GNNs) incorporate topological information from known protein-protein interaction networks to generate enriched protein embeddings [20, 32]. The GraphSAGE model, a variant of graph convolutional networks, aggregates features from neighboring nodes in the host PPI network to produce hybrid embeddings that combine sequence-derived and network-derived features [20]. In the cited study, this approach improved AUC by 3-23% over sequence-only methods [20]. Incorporating viral molecular mimicry and network centrality measures (e.g., degree, betweenness, closeness) further enhances prediction accuracy [26, 32].
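The neighborhood aggregation idea can be illustrated with a single mean-aggregation step over a toy host PPI graph, alongside degree centrality as the simplest topology feature. This sketch deliberately omits the learned weight matrix, nonlinearity, and neighbor sampling that a real GraphSAGE layer applies, so it shows only how sequence-derived and network-derived information end up in one vector; node names and feature values are invented.

```python
def mean_aggregate(adj, feats):
    """For each node, concatenate its own features with the mean of its
    neighbours' features (zeros if it has no neighbours)."""
    out = {}
    for node, h in feats.items():
        nbrs = [feats[n] for n in adj.get(node, ()) if n in feats]
        if nbrs:
            mean = [sum(col) / len(nbrs) for col in zip(*nbrs)]
        else:
            mean = [0.0] * len(h)
        out[node] = list(h) + mean
    return out


def degree_centrality(adj):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n <= 1:
        return {node: 0.0 for node in adj}
    return {node: len(set(nbrs)) / (n - 1) for node, nbrs in adj.items()}


# Toy undirected host PPI graph and 2-dimensional sequence features.
adj = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
feats = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 3.0]}
hybrid = mean_aggregate(adj, feats)
print(hybrid["A"])  # prints: [1.0, 0.0, 0.0, 2.0]
```

Stacking several such steps lets information from multi-hop neighbors reach each node, which is how a protein's embedding comes to reflect its position in the network as well as its own sequence.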
Transfer Learning and Multi-Task Learning
Given the scarcity of experimentally validated PPIs for many veterinary viruses, transfer learning is particularly valuable [11, 31]. Models pre-trained on large human-virus interaction datasets can be fine-tuned on smaller animal virus datasets [11, 17]. The DeepVHPPI framework uses a self-attention-based transformer with transfer learning to predict interactions for novel virus sequences, including mutated variants [11]. Probability-weighted ensemble transfer learning has been applied to HIV-human interactions, demonstrating robustness to data unavailability [31].
Evaluation Metrics and Benchmarking
Standard evaluation metrics for binary classification include accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), and area under the receiver operating characteristic curve (AUC-ROC) [6, 9]. Cross-validation (e.g., 5-fold or 10-fold) is used to assess generalization [21, 29]. Independent test sets from different virus families provide rigorous validation of model transferability [4, 11]. The following table summarizes representative performance metrics from key studies:
| Model | Dataset | Accuracy | AUC | F1-score | Reference |
|---|---|---|---|---|---|
| LSTM-PHV | Human-virus | 98.4% | 0.976 | - | [21] |
| XGBoost (IAV-human) | Influenza A-human | 96.89% | - | 96.78% | [22] |
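| Deep Forest | Human-parasite/bacteria | Higher than RF and SVM (value not stated) | - | - | [6] |
| Hybrid CNN-LSTM | Virus-host benchmark | Best among compared models (value not stated) | - | - | [4, 25] |
| GraphSAGE hybrid | Human-virus | - | 3-23% higher than sequence-only methods | - | [20] |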
| MMPC-DNN (GO-NLP) | SARS-CoV-2 | - | 0.878 | 0.793 | [7] |
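For readers who want to reproduce the metrics named before the table, the sketch below computes them from binary labels and predictions using their standard definitions. The AUC is computed with the rank-based (pairwise comparison) formulation, which is equivalent to the area under the ROC curve; the function names are illustrative, and undefined ratios are reported as 0.0 by convention in this sketch.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0  # 0.0 when the ratio is undefined

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn,
                     math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


m = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
print(m["accuracy"], m["precision"], m["recall"])  # prints: 0.75 1.0 0.5
```

On imbalanced PPI datasets, accuracy alone can look high while recall is poor, as in the toy example above; MCC and AUC are less sensitive to the class ratio, which is why they are usually reported alongside it.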
Workflow for Host-Virus PPI Prediction
The typical computational pipeline integrates data collection, feature extraction, model training, and validation. The following Mermaid diagram illustrates a generalized workflow:
flowchart TD
A[Experimental PPI Databases<br/>e.g., HVIDB, STRING] --> B[Data Preprocessing<br/>Positive & Negative Sampling]
B --> C[Feature Extraction]
C --> D[Sequence-based Features<br/>AAC, CT, DC, PSSM, Embeddings]
C --> E[Structural Features<br/>Solvent Accessibility, Pocket Geometry]
C --> F[Network Topology Features<br/>Centrality, Graph Embeddings]
D --> G[Feature Selection<br/>Correlation, MMPC, PCA]
E --> G
F --> G
G --> H[Model Training]
H --> I[Classical ML<br/>RF, SVM, XGBoost]
H --> J[Deep Learning<br/>CNN, LSTM, Siamese, GNN]
H --> K[Transfer Learning<br/>Frozen/Fine-tuning]
I --> L[Evaluation<br/>Cross-validation, Independent Test]
J --> L
K --> L
L --> M[Prediction of Novel PPIs]
M --> N[Validation via GO/KEGG Enrichment]
N --> O[3D Structural Mapping<br/>Binding Pocket Visualization]
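The negative sampling step in the preprocessing stage of this workflow is often implemented by drawing host-virus pairs that are absent from the positive set. The sketch below shows the simplest random variant under that assumption; the protein identifiers are invented, and the function name and `ratio` parameter are hypothetical. Unobserved pairs are not proven non-interactors, so any such sample can contain false negatives.

```python
import random


def sample_negatives(positives, host_proteins, viral_proteins, ratio=1.0, seed=0):
    """Randomly draw host-virus pairs that are not in the positive set.

    ratio is the number of negatives per positive; the result is capped at
    the number of available unobserved pairs.
    """
    pos = set(positives)
    # Enumerating every pair is fine for a sketch but scales as
    # len(host_proteins) * len(viral_proteins).
    candidates = [(h, v) for h in host_proteins for v in viral_proteins
                  if (h, v) not in pos]
    n = min(len(candidates), round(len(pos) * ratio))
    return random.Random(seed).sample(candidates, n)


# Invented identifiers, for illustration only.
positives = [("H1", "V1"), ("H2", "V2")]
negatives = sample_negatives(positives, ["H1", "H2", "H3"], ["V1", "V2"])
```

Fixing the seed keeps the negative set reproducible across runs, which matters when comparing models, since reported performance can shift noticeably with the choice of negatives.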
Structural Pocket Mapping and 3D Visualization
Predicting which residues at the binding interface contribute most to the interaction is a key downstream task [19, 27]. Explainable AI (XAI) methods, such as attention weights and gradient-based saliency maps, identify sequence regions critical for interaction [27]. These regions can be mapped onto three-dimensional protein structures using tools like PyMOL or ChimeraX [2, 33]. For veterinary applications, structural models of viral proteins (e.g., feline coronavirus spike, avian influenza hemagglutinin) can be obtained from the Protein Data Bank or predicted using AlphaFold2 [24, 33]. Binding pocket mapping highlights conserved hydrophobic patches, charged residues, and hydrogen-bonding networks that mediate host recognition [2, 19]. The HVIDB database provides pre-computed 3D complex structures for many human-virus PPIs, serving as a template for analogous animal systems [33].
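One way to carry per-residue importance scores over to a structure viewer is to pick the highest-scoring positions and turn them into a residue selection expression. The sketch below does this for PyMOL-style selection syntax (`resi 2+4`). The saliency values are invented, the function names are hypothetical, and the sketch assumes that sequence positions map one-to-one onto the structure's residue numbering through a fixed offset, which often does not hold for real PDB entries and should be checked.

```python
def top_residues(scores, k, offset=1):
    """Return the residue numbers of the k highest-scoring positions.

    offset converts 0-based sequence indices to structure numbering
    (1 means the first residue is numbered 1).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(i + offset for i in ranked)


def pymol_selection(residues, chain=None):
    """Build a PyMOL-style selection expression for the given residues."""
    expr = "resi " + "+".join(str(r) for r in residues)
    if chain:
        expr = f"chain {chain} and {expr}"
    return expr


# Invented per-residue saliency scores, for illustration only.
saliency = [0.10, 0.90, 0.20, 0.80, 0.05]
hot = top_residues(saliency, k=2)
print(pymol_selection(hot, chain="A"))  # prints: chain A and resi 2+4
```

The resulting string can be pasted into a selection command in the viewer to highlight the candidate interface residues on the experimental or predicted structure.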
Applications in Veterinary Virology
While most ML-based PPI predictors have been developed for human viruses, the underlying methodologies are directly transferable to veterinary pathogens [1, 3]. Key applications include:
- Predicting host range and zoonotic potential: ML models trained on known interactions can assess whether a viral protein from an animal reservoir (e.g., bat coronavirus, avian influenza) can bind to human or livestock receptors [18, 32]. This complements structural approaches such as those described in the article "Predicting Viral Host Range and Zoonotic Potential Using Machine Learning on Spike Protein Structures".
- Identifying antiviral targets: Predicted PPIs between viral proteins and host factors reveal dependencies that can be targeted by therapeutics [9, 13]. For example, interactions between hepatitis E virus and human proteins linked to hepatocellular carcinoma have been predicted using ML [13].
- Understanding pathogenesis: Network analysis of predicted PPIs can identify host pathways hijacked by viruses, such as immune signaling or apoptosis [16, 26]. This is relevant for diseases like African swine fever, where viral proteins modulate host defenses [18].
- Vaccine design: Knowledge of PPI interfaces guides the design of subunit vaccines that block viral attachment [19, 22]. The article "Machine Learning in Predicting Protein-Protein Interactions" provides additional context.
- Monitoring viral evolution: ML models can rapidly assess how mutations in viral proteins (e.g., influenza hemagglutinin, coronavirus spike) alter interaction profiles, aiding surveillance of emerging strains [11, 19]. This aligns with the article "Deep Mutational Scanning and Machine Learning for Predicting SARS-CoV-2 Spike Protein Escape Mutations from Antibody Neutralization".
Databases such as HVIDB [33] and Viruses.STRING [29] provide extensive collections of experimentally verified and predicted host-virus PPIs that can be leveraged for veterinary species through orthology mapping.
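Orthology mapping, as mentioned above, amounts to replacing the human partner of each known interaction with its ortholog(s) in the target species. The sketch below shows that lookup with a plain dictionary. All identifiers are invented, the function name is hypothetical, and a real ortholog table would come from a dedicated orthology resource; transferred pairs are hypotheses to be tested, not validated interactions.

```python
def map_by_orthology(ppis, orthologs):
    """Transfer (human_protein, viral_protein) pairs to another species.

    orthologs maps a human protein id to a list of ortholog ids in the
    target species; pairs with no known ortholog are dropped.
    """
    mapped = []
    for host, viral in ppis:
        for ortholog in orthologs.get(host, ()):
            mapped.append((ortholog, viral))
    return mapped


# Invented identifiers, for illustration only.
human_ppis = [("HUMAN_P1", "VIRUS_X"), ("HUMAN_P2", "VIRUS_Y")]
pig_orthologs = {"HUMAN_P1": ["PIG_P1a", "PIG_P1b"]}
print(map_by_orthology(human_ppis, pig_orthologs))
# prints: [('PIG_P1a', 'VIRUS_X'), ('PIG_P1b', 'VIRUS_X')]
```

One-to-many orthology produces several candidate pairs per source interaction, and proteins without a mapped ortholog silently drop out, so coverage should be reported alongside any transferred network.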
Challenges and Future Directions
Despite significant progress, several challenges remain:
- Data imbalance and negative sampling: Experimentally validated non-interacting pairs are rarely reported, necessitating computational negative sampling strategies that may introduce bias [10, 31]. Co-localization-based negative sampling is more reliable than random sampling [31].
- Generalization across virus families: Models trained on one virus family often perform poorly on distantly related viruses due to differences in interaction mechanisms [11, 17]. Transfer learning partially addresses this but requires careful domain adaptation [30].
- Incorporation of structural information: Most sequence-based methods ignore 3D conformation, which is critical for binding specificity [2, 24]. The advent of AlphaFold2 and other structure prediction tools enables integration of predicted structures into PPI models [24, 27].
- Interpretability: Deep learning models are often black boxes; XAI methods are needed to identify biologically meaningful interaction determinants [27].
- Veterinary-specific resources: Public databases are heavily biased toward human pathogens. Curating high-quality PPI datasets for livestock, poultry, and companion animals is essential for advancing veterinary applications [1, 3].
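The generalization problem across virus families listed above is usually measured by holding out an entire family at test time rather than splitting pairs at random. A minimal sketch of such a leave-one-family-out split follows; the record layout (a dict with a `family` key) and the identifiers are assumptions for illustration.

```python
def leave_one_family_out(samples):
    """Yield (held_out_family, train, test) with each virus family held
    out in turn, so no test family is ever seen during training."""
    families = sorted({s["family"] for s in samples})
    for held_out in families:
        train = [s for s in samples if s["family"] != held_out]
        test = [s for s in samples if s["family"] == held_out]
        yield held_out, train, test


# Invented records, for illustration only.
samples = [
    {"host": "H1", "virus": "V1", "family": "Coronaviridae", "label": 1},
    {"host": "H2", "virus": "V1", "family": "Coronaviridae", "label": 0},
    {"host": "H1", "virus": "V2", "family": "Orthomyxoviridae", "label": 1},
]
for family, train, test in leave_one_family_out(samples):
    print(family, len(train), len(test))
# prints:
# Coronaviridae 1 2
# Orthomyxoviridae 2 1
```

Scores from this kind of split are typically lower than random cross-validation scores, but they are a closer proxy for performance on a newly emerging or poorly studied virus.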
Future directions include the development of foundation models pre-trained on large protein sequence corpora and fine-tuned for host-virus interaction prediction [24, 27]. Graph neural networks that jointly model host and viral protein interaction networks will likely become standard [20, 32]. Integration with molecular dynamics simulations can validate predicted binding interfaces and estimate binding affinities [2, 16].
Conclusion
Machine learning has become an indispensable tool for predicting host-virus protein-protein interaction networks. From classical classifiers to deep learning architectures and graph neural networks, these methods enable high-throughput, cost-effective identification of molecular interactions that underpin viral infection. Feature engineering leveraging sequence, structural, and network topology information continues to improve predictive accuracy. Transfer learning and explainable AI enhance model applicability and interpretability. While most current work focuses on human viruses, the methodologies are directly applicable to veterinary virology, offering opportunities to predict host range, identify therapeutic targets, and monitor viral evolution. Continued development of veterinary-specific databases and integration with structural biology will further advance this field.
References
[1] Omid Mahmoudi, Somayye Taghvaei, Shirin Salehi et al. Machine Learning Approaches for Predicting Virus-Human Protein-Protein Interactions: An Evaluation of Retroviral Interaction Networks. bioRxiv, 2024. URL: https://www.semanticscholar.org/paper/ee23a048f298322650fc5c1059c86a21a4bfc088
[2] Yuri Matsuzaki, J. Simm, N. Uchikoga et al. A Docking Based Approach to Analyze Interaction Surfaces of Virus-Host Protein-Protein Interactions. Journal, 2017. URL: https://www.semanticscholar.org/paper/38d81f7912517d25862849cd97014e51c3526146
[3] Betül Asiye Karpuzcu, Erdem Türk, A. Ibrahim et al. Machine Learning Methods for Virus-Host Protein-Protein Interaction Prediction. Methods in Molecular Biology, 2023. URL: https://www.semanticscholar.org/paper/8aa0f2a53b456fdd72af3bc9b83b40708541e94c
[4] L. Deng, Wenjuan Nie, Jiaojiao Zhao et al. A hybrid deep learning framework for predicting the protein-protein interaction between virus and host. Journal, 2021. URL: https://www.semanticscholar.org/paper/298299f89737a255755792ab7dacebe74c1bf6d0
[5] Jerry Emmanuel, Itunuoluwa Isewon, Grace Olasehinde et al. An Extended Feature Representation Technique for Predicting Sequenced-based Host-pathogen Protein-protein Interaction. Current Bioinformatics, 2024. URL: https://www.semanticscholar.org/paper/89530b8d900c5a96d7cb994a35b2cbd71be890b9
[6] Jerry Emmanuel, Itunuoluwa Isewon, Grace Olasehinde et al. Current Trend and Performance Evaluation of Machine Learning Methods for Predicting Host-Pathogen Protein-Protein Interactions. 2024 International Conference on Science, Engineering and Business for Driving Sustainable Development Goals (SEB4SDG), 2024. URL: https://www.semanticscholar.org/paper/f04259d1938b253c4645d79075e8bdad6e1bccdf
[7] Pınar Cihan, Zeynep Banu Ozger, Zeynep Cakabay. Computational analysis of virus-host protein-protein interactions using gene ontology and natural language processing. Applied Intelligence (Boston), 2025. URL: https://www.semanticscholar.org/paper/6517bfc6a7744951820ae279c6479c47fadf169a
[8] Rasool Sahragard, Masoud Arabfard, Ali Ahmadi et al. VHI-Pred: A Multi-Feature-Based Tool for Predicting Human-Virus Protein-Protein Interactions. Molecular Biotechnology, 2025. URL: https://www.semanticscholar.org/paper/c0d711c4e6b9fe1a424592ff6673aad24d2ba563
[9] Ankush Tatyaba Kargal. Machine Learning-Driven Prediction Of Protein-Protein Interactions In Emerging Viral Pathogens. International Journal of Drug Delivery Technology, 2026. URL: https://www.semanticscholar.org/paper/72c62130e66295a1af9723b80fdb5cae157ff9e7
[10] Lopamudra Dey, Sanjay Chakraborty. Supervised learning approaches for predicting Ebola-Human Protein-Protein interactions. Gene, 2025. URL: https://www.semanticscholar.org/paper/1da81b18da5a7c21743adf9866cdd3eaacf8427b
[11] Jack Lanchantin, Tom Weingarten, Arshdeep Sekhon et al. Transfer Learning for Predicting Virus-Host Protein Interactions for Novel Virus Sequences. bioRxiv, 2020. URL: https://www.semanticscholar.org/paper/11de4622bb72bb754762c2add2c12679ab2d7c22
[12] A. H. Ibrahim, Onur Can Karabulut, Betül Asiye Karpuzcu et al. A correlation coefficient-based feature selection approach for virus-host protein-protein interaction prediction. PLoS ONE, 2023. URL: https://www.semanticscholar.org/paper/43c7112e368799a347d873efc2df4990a4f65b6c
[13] Anahid Hematpour, Parnian Habibi, S. Alavimanesh et al. Machine learning approach to predict protein-protein interactions between human and hepatitis E virus: revealing links to hepatocellular carcinoma. bioRxiv, 2025. URL: https://www.semanticscholar.org/paper/bc3f4d1af787522fb1e85f47f542a37141324311
[14] Pengfei Xie, J. Zhuang, Geng Tian et al. Emvirus: An embedding-based neural framework for human-virus protein-protein interactions prediction. Biosafety and Health, 2023. URL: https://www.semanticscholar.org/paper/1a02ed363e983b0c9eb32217e992fed1bfa832f1
[15] Fatma-Elzahraa Eid, M. Elhefnawi, Lenwood S. Heath. DeNovo: virus-host sequence-based protein-protein interaction prediction. Bioinformatics, 2016. URL: https://www.semanticscholar.org/paper/35735907c7526643fe8b06f92a01d1cf9371f65e
[16] Arijit Chakraborty, S. Mitra, M. Bhattacharjee et al. Determining human-coronavirus protein-protein interaction using machine intelligence. Medicine in Novel Technology and Devices, 2023. URL: https://www.semanticscholar.org/paper/0de6d8e3a61ef7d0e2831064321e540fc661f1e9
[17] Xiaodi Yang, Shiping Yang, Xianyi Lian et al. Transfer learning via multi-scale convolutional neural layers for human–virus protein–protein interaction prediction. bioRxiv, 2021. URL: https://www.semanticscholar.org/paper/99a501b476adec3dd9b28bf8260d9dceb9a3b898
[18] Zhiyuan Zhang, Yang Feng, Xingyi Ge et al. Virus-human protein-protein interactions predict viral phenotypes. bioRxiv, 2026. URL: https://www.semanticscholar.org/paper/d56953e8f52c71174b1785bc6fb56df5c84bd0a7
[19] Vidhi Sajnani, Omer Ali, Satarupa Das et al. Identification of Protein–Protein Interaction (PPI) Sites on the Influenza A (H1N1) Viral Genome Using Gradient Boosting and Artificial Neural Network (ANN) Models. ACS Omega, 2025. URL: https://www.semanticscholar.org/paper/62d7f42173e5a73bbc8ce1d6049514cec1c8565e
[20] Mehmet Burak Koca, E. Nourani, Ferda Abbasoglu et al. Graph convolutional network based virus-human protein-protein interaction prediction for novel viruses. Computational Biology and Chemistry, 2022. URL: https://www.semanticscholar.org/paper/28e3a3e2654c57656fd01b537ffef34dcdfdcf50
[21] Sho Tsukiyama, M. Hasan, Satoshi Fujii et al. LSTM-PHV: prediction of human-virus protein–protein interactions by LSTM with word2vec. bioRxiv, 2021. URL: https://www.semanticscholar.org/paper/564b557e09e172430ca6d50936a2cef9783f3fbd
[22] Binghua Li, Xin Li, Xiaoyu Li et al. Prediction of influenza A virus-human protein-protein interactions using XGBoost with continuous and discontinuous amino acids information. PeerJ, 2025. URL: https://www.semanticscholar.org/paper/599d93d1aa4a5e422aaf5ccb13eac55ed4fecf6c
[23] Sikender Mohsienuddin, D. Varma. Machine Learning Techniques for Sequence-Based Prediction of Viral-Host Interactions. Journal, 2021. URL: https://www.semanticscholar.org/paper/123f26a8c087e4243a38e68a02c43a465e44cad4
[24] Xiaodi Yang, Shiping Yang, Panyu Ren et al. Deep Learning-Powered Prediction of Human-Virus Protein-Protein Interactions. Frontiers in Microbiology, 2022. URL: https://www.semanticscholar.org/paper/e767485d8bf2d18b70676df1e3e4496551cf8ff0
[25] L. Deng, Jiaojiao Zhao, Jingpu Zhang. Predict the Protein-protein Interaction between Virus and Host through Hybrid Deep Neural Network. IEEE International Conference on Bioinformatics and Biomedicine, 2020. URL: https://www.semanticscholar.org/paper/2d0904883fdc1693b1000202dd5c716f537ec430
[26] Babak Khorsand, Abdorreza Savadi, J. Zahiri et al. Alpha influenza virus infiltration prediction using virus-human protein-protein interaction network. Mathematical Biosciences and Engineering, 2020. URL: https://www.semanticscholar.org/paper/f0ba73422901ce442f022653684824d5e024c7d8
[27] S. Madan, V. Demina, Marcus Stapf et al. Accurate prediction of virus-host protein-protein interactions via a Siamese neural network using deep protein sequence embeddings. bioRxiv, 2022. URL: https://www.semanticscholar.org/paper/3367d35fb820005ab941677c33b45a30bfdc171e
[28] Nikhil Mathews, Tuan Tran, Banafsheh Rekabdar et al. Predicting human–pathogen protein–protein interactions using Natural Language Processing methods. Informatics in Medicine Unlocked, 2021. URL: https://www.semanticscholar.org/paper/dc77db23abb5f839a0ff9efe10eb9c178c89
[29] Ho-Joon Lee. An interactome landscape of SARS-CoV-2 virus-human protein-protein interactions by protein sequence-based multi-label classifiers. bioRxiv, 2021. URL: https://www.semanticscholar.org/paper/89665aa3e70ed8e40c1210caac5c14a8283d15b3
[30] Xiaodi Yang, Ziding Zhang, S. Wuchty. Multi-scale Convolutional Neural Networks for the Prediction of Human-virus Protein Interactions. International Conference on Agents and Artificial Intelligence, 2021. URL: https://www.semanticscholar.org/paper/973b64845349eed8106d82b20dfa2c65b9a26
[31] Suyu Mei. Probability Weighted Ensemble Transfer Learning for Predicting Interactions between HIV-1 and Human Proteins. PLoS ONE, 2013. URL: https://www.semanticscholar.org/paper/202e23faa737f1c8545dcdcefa8dd22435bd2427
[32] Zhiyuan Zhang, Yang Feng, Xiangxian Meng et al. Improved prediction of virus-human protein-protein interactions by incorporating network topology and viral molecular mimicry. bioRxiv, 2026. URL: https://www.semanticscholar.org/paper/1e8a2c94648d846719744425e2a09eedabb81aa
[33] Xiaodi Yang, Xianyi Lian, Chen Fu et al. HVIDB: a comprehensive database for human-virus protein-protein interactions. Briefings in Bioinformatics, 2021. URL: https://www.semanticscholar.org/paper/563d2ec73d7bd852cd61175576f905e9c29c06bb
[34] A. Emamjomeh, B. Goliaei, J. Zahiri et al. Predicting protein-protein interactions between human and hepatitis C virus via an ensemble learning method. Molecular BioSystems, 2014. URL: https://www.semanticscholar.org/paper/e013b8af6baa3114723a8ee4ad7d1e51ddcb24dd
[35] Rakesh Kaundal, Cristian D. Loaiza, N. Duhan et al. deepHPI: a comprehensive deep learning platform for accurate prediction and visualization of host-pathogen protein-protein interactions. Briefings in Bioinformatics, 2022. URL: https://www.semanticscholar.org/paper/4d36877c139115f88cf0ca51edad065b25888828