Integrated bioinformatics and statistical approaches to explore molecular biomarkers for breast cancer diagnosis, prognosis and therapies

Md. Shahin Alam , Adiba Sultana, Md. Selim Reza, Md Amanullah, Syed Rashel Kabir, Md. Nurul Haque Mollah

Abstract
Integrated bioinformatics and statistical approaches are now playing the vital role in identifying potential molecular biomarkers more accurately in presence of huge number of alternatives for disease diagnosis, prognosis and therapies by reducing time and cost compared to the wet-lab based experimental procedures. Breast cancer (BC) is one of the leading causes of cancer related deaths for women worldwide. Several dry-lab and wet-lab based studies have identified different sets of molecular biomarkers for BC. But they did not compare their results to each other so much either computationally or experimentally. In this study, an attempt was made to propose a set of molecular biomarkers that might be more effective for BC diagnosis, prognosis and therapies, by using the integrated bioinformatics and statistical approaches. At first, we identified 190 differentially expressed genes (DEGs) between BC and control samples by using the statistical LIMMA approach. Then we identified 13 DEGs (AKR1C1, IRF9, OAS1, OAS3, SLCO2A1, NT5E, NQO1, ANGPT1, FN1, ATF6B, HPGD, BCL11A, and TP53INP1) as the key genes (KGs) by protein-protein interaction (PPI) network analysis. Then we investigated the pathogenetic processes of DEGs highlighting KGs by GO terms and KEGG pathway enrichment analysis. Moreover, we disclosed the transcriptional and post-transcriptional regulatory factors of KGs by their interaction network analysis with the transcription factors (TFs) and micro-RNAs. Both supervised and unsupervised learning’s including multivariate survival analysis results confirmed the strong prognostic power of the proposed KGs. Finally, we suggested KGs-guided computationally more effective seven candidate drugs (NVP-BHG712, Nilotinib, GSK2126458, YM201636, TG-02, CX-5461, AP-24534) compared to other published drugs by cross-validation with the state-of-the-art alternatives top-ranked independent receptor proteins. Thus, our findings might be played a vital role in breast cancer diagnosis, prognosis and therapies.

Introduction
Breast cancer (BC) is one of the most common types of invasive cancers among women according to the World Health Organization (WHO), which affected around 2.3 million women in 2020. It is also the cause of large number of cancer-related deaths among women worldwide [1]. Symptoms of BC include a change in breast shape, dimpling of the skin, nipple discharge, or a red scaly patch of skin, and a lump in the breast [2]. Based on the existing treatment facilities, the average 5-year survival rate with BC is 86%, but BC with distant metastasis, the average 5-year survival rate drops down to 28% [3]. Thus, the performance of existing therapeutic treatments on BC is not yet reach to the satisfactory level. Therefore, in-depth molecular research is essential to explore BC causing more effective biomarkers and candidate drugs.

Materials and methods
To reach the goal of this study, we considered both raw-data (gene expression profiles) and meta-data associated with BC. Integrated bioinformatics and statistical approaches were used to analyze the datasets to explore KGs highlighting their functions, pathways, regulatory factors, prognosis power and repurposable drugs

Results
Identification of DEGs
We identified 190 DEGs, including 138 downregulated and 52 upregulated genes (S1 Table in S2 File) in BC tissue, using adj.P.Val < 0.01 and logFC > 1 as the threshold for upregulated DEGs, and adj.P.Val < 0.01 and logFC < -1 for downregulated DEGs. The upregulated and downregulated DEGs were displayed on the right and left sides respectively in the volcano plot by the green color in Fig 2A. A heatmap was constructed to show the clustering performance of case and control samples by the up and down regulated DEGs in Fig 2B. We observed that both DEGs and samples separated each other between their contrast groups accurately.

Discussion
In this study, we identified key genomic biomarkers highlighting their pathogenetic processes for breast cancer (BC) diagnosis, prognosis and therapies. At first, we identified 190 DEGs (138 downregulated and 52 upregulated) from the publicly available microarray gene-expression profiles. Then we detected 13 DEGs (AKR1C1, IRF9, OAS1, OAS3, SLCO2A1, NT5E, NQO1, ANGPT1, FN1, ATF6B, HPGD, BCL11A, and TP53INP1) as the KGs that drive the progression of BC. Some literatures also suggested that these KGs are BC causing genes [9, 58–78] (see Fig 8A). For example, the expression of two genes (AKR1C1 and AKR1C2) in carcinoma cells and stromal fibroblasts and their positive correlation are favorable tumor characteristics in primary BC patients [58]. Also these two genes appear to be an interesting target for new hormone-based therapy strategies in primary BC. The IRF9 gene with overexpression has the potential to be a surrogate marker of response and may be associated with drug resistance for BC [59]. The expression of the OAS1 gene that was inversely associated with multiple MSGs in the BC cell line [60]. The OAS3 gene plays a prognostic role in BC patients with potential mechanical value [61]. The SLCO2A1 is ubiquitously expressed and marked as a prostaglandin transporter due to its high affinity [62]. The Gene NT5E is regulated epigenetically in BC, the epigenetic status of this gene influencing metastasis and clinical outcome, and suggests that NT5E CpG island methylation is a promising BC epigenetic biomarker [63].

Conclusion
The main purpose of this study was to identify potential KGs highlighting their function, pathways, and regulatory factors for breast cancer (BC) diagnosis, prognosis and therapies by using the integrated bioinformatics and statistical approaches. We identified BC causing 13 DEGs (AKR1C1, IRF9, OAS1, OAS3, SLCO2A1, NT5E, NQO1, ANGPT1, FN1, ATF6B, HPGD, BCL11A, and TP53INP1) as the KGs by using the five topological measures in the PPI networking results. Their association with BC was also reported by several other studies directly or indirectly that we mentioned in the discussion section. We detected four TFs proteins (FOXC1, FOXL1, JUN, and GATA2) and four microRNAs (hsa-miR-27a-5p, hsa-miR-124-3p, hsa-miR-1-3p, and hsa-miR-210-3p) as the key transcriptional and post-transcriptional regulators of KGs. These regulatory factors play the vital role for the regulation of KGs. The GO terms (BPs, MFs and CCs) and KEGG pathway enrichment analysis revealed some vital GO terms from each of BPs, MFs and CCs that are significantly enriched by DEGs including KGs. The enriched GO terms and KEGG pathways were considered as the key pathogenetic processes of BC progression. These findings were also supported by the literature review directly or indirectly. We investigated the prognostic performance of KGs by using multivariate survival analysis including unsupervised hierarchical clustering and supervised classification. In each case, we observed the strong prognostic performance of the proposed KGs. Then we considered the proposed 13 key proteins and their regulatory 4 TFs-proteins as the drug target receptors to explore effective drugs for BC by molecular docking simulation with the 129 meta-drug agents. We detected 7 small molecules (NVP-BHG712, Nilotinib, GSK2126458, YM201636, TG-02, CX-5461, and AP-24534) as the top ranked candidate drugs for the treatment against BC. Then we investigated the resistance performance of both the proposed and already published candidate drugs against the state-of-the-art alternatives already published top-ranked 13 independent receptors for BC and observed that our proposed candidate drugs are computationally more effective against the independent receptors also. Therefore, the proposed candidate drugs might be played the vital role for the treatment against BC.

Acknowledgments
We are grateful to the editor and reviewers for their valuable comments that help us to improve the quality of the manuscript. We are also grateful and thankful to the authors whose articles help us to write this paper.

Citation: Alam MS, Sultana A, Reza MS, Amanullah M, Kabir SR, Mollah MNH (2022) Integrated bioinformatics and statistical approaches to explore molecular biomarkers for breast cancer diagnosis, prognosis and therapies. PLoS ONE 17(5): e0268967. https://doi.org/10.1371/journal.pone.0268967

Editor: Muhammad Tarek Abdel Ghafar, Tanta University Faculty of Medicine, EGYPT

Received: November 2, 2021; Accepted: May 11, 2022; Published: May 26, 2022

Copyright: © 2022 Alam et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: In this study, we analyzed publicly available third-party data. We downloaded three microarray gene expression profiling datasets from the National Center of Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database with accession numbers GSE53566 [20], GSE119552 [21] and GSE152322 [22] using the weblink https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE53566, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119552 and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152322, respectively. The necessary meta-drug agents were selected by using the online database GSCALite (http://bioinfo.life.hust.edu.cn/web/GSCALite/) and reviewing the published articles (see Table S2 and S3 in S1 File).

Funding: This work was supported by Rajshahi University Research Project (A-289/5/52/RU/Science-24/2021-2022). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.