Delora Baptista, Pedro G. Ferreira, Miguel Rocha
One of the main obstacles to the successful treatment of cancer is the phenomenon of drug resistance. A common strategy to overcome resistance is the use of combination therapies. However, the space of possibilities is huge and efficient search strategies are required. Machine Learning (ML) can be a useful tool for the discovery of novel, clinically relevant anti-cancer drug combinations. In particular, deep learning (DL) has become a popular choice for modeling drug combination effects. Here, we set out to examine the impact of different methodological choices on the performance of multimodal DL-based drug synergy prediction methods, including the use of different input data types, preprocessing steps and model architectures. Focusing on the NCI ALMANAC dataset, we found that feature selection based on prior biological knowledge has a positive impact: limiting gene expression data to cancer or drug response-specific genes improved performance. Drug features appeared to be more predictive of drug response, with a 41% increase in coefficient of determination (R2) and 26% increase in Spearman correlation relative to a baseline model that used only cell line and drug identifiers. Molecular fingerprint-based drug representations performed slightly better than learned representations: ECFP4 fingerprints increased R2 by 5.3% and Spearman correlation by 2.8% relative to the best learned representations. In general, fully connected feature-encoding subnetworks outperformed other architectures. DL outperformed other ML methods by more than 35% (R2) and 14% (Spearman). Additionally, an ensemble combining the top DL and ML models improved performance by about 6.5% (R2) and 4% (Spearman). Using a state-of-the-art interpretability method, we showed that DL models can learn to associate drug and cell line features with drug response in a biologically meaningful way.
The strategies explored in this study will help to improve the development of computational methods for the rational design of effective drug combinations for cancer therapy.
The phenomenon of drug resistance is one of the greatest challenges in the fight against cancer. Although many tumors initially respond well to a given treatment, the efficacy of single-drug anti-cancer therapies is often diminished due to the existence of tumor drug resistance mechanisms. Resistance-conferring characteristics may already be present in the tumor cells prior to therapy, or they may arise as an adaptive response of the tumor to the treatment itself. One of the main drivers of resistance is intratumoral heterogeneity. Genomic instability in cancer leads to the emergence of subpopulations of cells within a tumor with distinct characteristics and different sensitivity to drugs. Treatment may exert selective pressure on the cells and select subpopulations possessing characteristics that favor drug resistance, leading to future relapse [2].
Materials and methods
Datasets and data preprocessing
ALMANAC drug response data in the form of ComboScores for <cell line, drugA, drugB> triplets were downloaded from CellMiner Cross Database (CellMinerCDB) (version 1.2). The ComboScore for a given <cell line, drugA, drugB> triplet is the sum of the differences between the expected and observed cell line growth calculated for each dose combination tested in the screen, with expected effects being determined using a modified version of the Bliss independence reference model. These values were used as the output variable in our models. Since a standard synergy metric does not currently exist and considering that different synergy metrics may lead to different conclusions, we opted to use the synergy metric defined by the original ALMANAC study instead of another metric based on a different reference model.
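To make the definition above concrete, the following is a minimal sketch of a ComboScore-style calculation. It is not the ALMANAC implementation: it assumes the plain Bliss independence model (expected growth is the product of the single-agent growth fractions) rather than the modified version used in the original screen, and the function names, the dictionary layout and the example growth values are all hypothetical. Growth is expressed as a percentage of untreated control, and the score sums expected minus observed growth, so a positive value means the combination suppressed growth more than expected.

```python
from typing import Dict, Tuple

# (dose of drug A, dose of drug B); the layout is an assumption for illustration
Dose = Tuple[float, float]


def bliss_expected_growth(growth_a: float, growth_b: float) -> float:
    """Expected % growth under plain Bliss independence.

    Inputs are single-agent growth values as a percentage of the
    untreated control (100 = no effect). This is the unmodified Bliss
    model, NOT the modified version used by ALMANAC.
    """
    return growth_a * growth_b / 100.0


def combo_score(
    single_a: Dict[float, float],
    single_b: Dict[float, float],
    observed: Dict[Dose, float],
) -> float:
    """Sum of (expected - observed) growth over all tested dose pairs."""
    total = 0.0
    for (dose_a, dose_b), observed_growth in observed.items():
        expected = bliss_expected_growth(single_a[dose_a], single_b[dose_b])
        total += expected - observed_growth
    return total


# Made-up growth percentages for one <cell line, drugA, drugB> triplet.
single_a = {1.0: 80.0, 10.0: 50.0}
single_b = {1.0: 90.0, 10.0: 60.0}
observed = {
    (1.0, 1.0): 70.0,
    (1.0, 10.0): 40.0,
    (10.0, 1.0): 45.0,
    (10.0, 10.0): 20.0,
}

score = combo_score(single_a, single_b, observed)
```

With these illustrative numbers, the four dose pairs contribute 2, 8, 0 and 10 respectively, so the score is positive, which under this sign convention corresponds to a greater-than-expected combined effect.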
Testing the impact of different methodological variables
We developed several multimodal DL models (Fig 1) to predict drug combination effects summarized as ComboScores, using the ALMANAC dataset. In total, 24 different DL models and 6 ML models were developed. A detailed description of the models is provided in the Materials and Methods section, and S2 Fig provides a summary of the different methodological choices that were tested.
The results of this study suggest that drug features are more predictive of drug combination effects than cell line features, at least for the ALMANAC dataset, in line with previous results. Replacing the gene expression data with cell line identifiers (the cell line (one-hot) + drugs (ECFP4) model) produced performance scores similar to those of models trained on actual gene expression data. This may be an indication that the expression features are mainly being used by the models as a way to distinguish between cell lines, just as the cell line identifiers are. The ability to identify cell lines based on their gene expression profiles is already a positive result, but it does not seem that the decisions made by the DL models are being driven by the identification of specific synergy biomarkers. In addition, the inclusion of other cell line features besides gene expression data was not beneficial.
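The identifier-only cell line input discussed above can be illustrated with a small one-hot encoding sketch. This is an assumption about the general technique, not the encoding code used in the study, and the function name and the three example cell line names are chosen purely for illustration. The point is that a one-hot vector carries no biological information: it only tells the model which cell line a sample belongs to.

```python
from typing import Dict, List, Sequence


def one_hot_encode(identifiers: Sequence[str]) -> Dict[str, List[int]]:
    """Map each unique identifier to a one-hot vector.

    Positions are assigned in sorted order so the encoding is
    deterministic regardless of input order.
    """
    vocabulary = sorted(set(identifiers))
    size = len(vocabulary)
    return {
        name: [1 if i == position else 0 for i in range(size)]
        for position, name in enumerate(vocabulary)
    }


# Illustrative cell line names only.
cell_lines = ["MCF7", "A549", "HT29"]
encoding = one_hot_encode(cell_lines)

# Any two distinct cell lines have orthogonal vectors: the encoding says
# nothing about how biologically similar the cell lines are.
overlap = sum(a * b for a, b in zip(encoding["MCF7"], encoding["A549"]))
```

Because every pair of distinct cell lines is equally dissimilar under this encoding, a model that performs as well with it as with expression profiles is plausibly using the expression features mainly to tell cell lines apart, which is the interpretation offered in the paragraph above.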
In this study, we performed a systematic analysis of the impact of different methodological choices on the predictive performance of DL-based drug synergy prediction models, to determine which preprocessing and modeling approaches provide the best results. Different input data type combinations, drug encoding schemes, gene expression feature selection/reduction methods and DL architectures were tested, and an ensemble combining the top methods was also evaluated. These experiments enabled the identification of several strategies that may be interesting starting points for the development of new DL-based drug synergy prediction models in the future.
Citation: Baptista D, Ferreira PG, Rocha M (2023) A systematic evaluation of deep learning methods for the prediction of drug synergy in cancer. PLoS Comput Biol 19(3): e1010200. https://doi.org/10.1371/journal.pcbi.1010200
Editor: Shihua Zhang, Academy of Mathematics and Systems Science, Chinese Academy of Science, CHINA
Received: May 13, 2022; Accepted: February 8, 2023; Published: March 23, 2023
Copyright: © 2023 Baptista et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The original drug response and RNA-Seq datasets used in this study are available from CellMinerCDB (https://discover.nci.nih.gov/rsconnect/cellminercdb/), and the mutation and copy number variation data are available from CBioPortal (https://www.cbioportal.org/). The preprocessed response dataset, the filtered gene expression, mutation and copy number variation files (before merging with the response dataset), and the fully preprocessed drug and gene expression data required to run the expr (DGI) + drugs (ECFP4) model described in the study can be obtained from Zenodo (https://doi.org/10.5281/zenodo.6545638). All of the code used in this study is available online at https://github.com/BioSystemsUM/drug_response_pipeline.
Funding: This study was supported by the Portuguese Foundation for Science and Technology (FCT), through a PhD scholarship (SFRH/BD/130913/2017 awarded to DB) and under the scope of the strategic funding of UIDB/04469/2020 unit. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.