AI-first structural identification of pathogenic protein target interfaces
Mihkel Saluri, Michael Landreh, Patrick Bryant
Abstract
The risk of pandemics is increasing as global population growth and interconnectedness accelerate. Understanding the structural basis of protein-protein interactions between pathogens and hosts is critical for elucidating pathogenic mechanisms and guiding treatment or vaccine development.
Introduction
During the recent SARS-CoV-2 pandemic, the importance of rapidly obtaining insights into an emerging pathogen and its relationship with the host became clear. Knowledge of the interaction between the Spike protein and the human ACE2 receptor provided structural information that was essential for vaccine design and development. Had this information been obtained earlier, vaccines and treatments might have been developed and deployed faster, possibly reducing the impact of the pandemic on society.
Materials and methods
Nonredundant host-pathogen complexes from the PDB
All heteromeric protein structures with a resolution better than 5 Å, determined by X-ray crystallography or electron microscopy, were downloaded from the PDB on 2021-12-20. For these structures, Pfam domains and species were mapped to UniProtKB [25], and all structures with UniProtKB annotations were kept. Structures that contain interacting sequences from at least two different superkingdoms and have different Pfam domains were then selected. For each unique pair of interacting Pfam domains, the structure with the most contacts and the oldest release date was chosen (n = 111; 52 of these include human proteins). A contact is defined as two residues from different chains having beta carbons (alpha carbons for glycine) within 8 Å of each other. Fig 5 provides a visual guide to the data selection workflow.
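The contact definition and the final selection step can be illustrated with a minimal Python sketch. This is not the code from the study's repository. The function names, the dictionary keys and the input layout are hypothetical. Two details are assumptions, since the text does not state them: structures are ranked by contact count first with the older release date breaking ties, and the 8 Å cutoff is treated as inclusive.

```python
from math import dist

CONTACT_CUTOFF = 8.0  # Å, taken from the contact definition in the text


def representative_atom(residue_name, atoms):
    """Return the CB coordinate of a residue, or CA for glycine.

    `atoms` is a hypothetical mapping from atom name to (x, y, z).
    """
    return atoms["CA"] if residue_name == "GLY" else atoms["CB"]


def count_contacts(coords_a, coords_b, cutoff=CONTACT_CUTOFF):
    """Count residue pairs from two different chains within `cutoff` Å.

    Each argument is a list of representative-atom coordinates for one chain.
    Assumption: a pair at exactly `cutoff` Å counts as a contact.
    """
    return sum(1 for a in coords_a for b in coords_b if dist(a, b) <= cutoff)


def pick_representatives(entries):
    """Keep one structure per unordered pair of interacting Pfam domains.

    `entries` is a list of dicts with the hypothetical keys 'pdb_id',
    'pfam_pair', 'contacts' and 'release_date' (ISO date string).
    Assumption: most contacts wins; the older release date breaks ties.
    """
    best = {}
    for entry in entries:
        key = frozenset(entry["pfam_pair"])  # pair order does not matter
        rank = (-entry["contacts"], entry["release_date"])
        current = best.get(key)
        if current is None or rank < (-current["contacts"], current["release_date"]):
            best[key] = entry
    return list(best.values())
```

ISO-formatted date strings sort chronologically, so they can be compared directly without parsing.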
Results
Structure prediction of known host-pathogen interactions
The FoldDock protocol, based on AlphaFold (AF) and AlphaFold-multimer (AFM) [10], was used to predict the structure of 111 host-pathogen protein-protein interactions (HP-PPIs). In addition, templates were added to FoldDock because there are indications that this can improve accuracy in some cases [11]. The median TM-score from MMalign [12] is 0.64 for FoldDock, 0.67 for AFM and 0.68 for FoldDock+templates (Fig 1a). However, AFM was trained on all proteins with a release date earlier than 2018-04-30. This leaves only 24 of the 111 structures (22%) outside the AFM training set to test this method. On this set, the median TM-score decreases to 0.63 for AFM and to 0.67 for FoldDock+templates, and increases to 0.65 for FoldDock (Fig 1b).
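The held-out evaluation described above amounts to filtering structures by release date and recomputing the median TM-score. The following Python sketch shows that step under stated assumptions: the function name and the record layout are hypothetical, and the TM-scores are assumed to have been computed beforehand (for example with MMalign), since the sketch does not run any structure alignment.

```python
from datetime import date
from statistics import median

# AFM was trained on proteins released earlier than this date (from the text).
AFM_TRAINING_CUTOFF = date(2018, 4, 30)


def median_tm_outside_training(records, cutoff=AFM_TRAINING_CUTOFF):
    """Return (n, median TM-score) for structures not released before `cutoff`.

    `records` is a hypothetical list of (release_date, tm_score) tuples,
    one per predicted complex. Returns (0, None) if nothing is held out.
    """
    held_out = [tm for released, tm in records if released >= cutoff]
    if not held_out:
        return 0, None
    return len(held_out), median(held_out)
```

Reporting the number of held-out structures next to the median makes it clear how small the unbiased test set is.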
Discussion
With the advent of highly accurate structure prediction, exemplified by AlphaFold2, it has become possible to systematically expand structural knowledge across a wide range of organisms [23]. This technological leap opens entirely new prospects for rational vaccine and drug development by enabling rapid identification of potential therapeutic targets. In this study, we present an AI-guided framework for host-pathogen structure prediction, aimed at uncovering novel interactions of functional and clinical relevance.
Acknowledgments
The computational resources were partly provided by Arne Elofsson (berzelius-2021-29).
Citation: Saluri M, Landreh M, Bryant P (2025) AI-first structural identification of pathogenic protein target interfaces. PLoS Comput Biol 21(6): e1013168. https://doi.org/10.1371/journal.pcbi.1013168
Editor: Jeffrey Skolnick, Georgia Institute of Technology, UNITED STATES OF AMERICA
Received: March 11, 2025; Accepted: May 26, 2025; Published: June 26, 2025
Copyright: © 2025 Saluri et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data and code used to produce the results here are freely available in this gitlab repository: https://gitlab.com/patrickbryant1/hpopt
Funding: This study was supported by the SciLifeLab & Wallenberg Data Driven Life Science Program (grant: KAW 2020.0239, P.B.). Computational resources were enabled by the supercomputing resource Berzelius provided by the National Supercomputer Centre at Linköping University and the Knut and Alice Wallenberg Foundation with project ids berzelius-2021-29, Berzelius-2023-267, Berzelius-2024-78 and Berzelius-2024-292 (P.B.). M.L. is supported by a Karolinska Institutet faculty-funded Career Position, a Cancerfonden Project grant, the Swedish Research Council (VR) Research Environment Grant, a Consolidator Grant from the Swedish Society for Medical Research (SSMF), and the Knut and Alice Wallenberg Foundation (2022.0032). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.