Identification of potential biomarkers for lung cancer using integrated bioinformatics and machine learning approaches

Md Symun Rabby, Md Merajul Islam, Sujit Kumar, Md Maniruzzaman, Md Al Mehedi Hasan, Yoichi Tomioka, Jungpil Shin

Abstract

Lung cancer is one of the most common cancers and the leading cause of cancer-related death worldwide. Early detection of lung cancer can help reduce the death rate; therefore, the identification of potential biomarkers is crucial. Thus, this study aimed to identify potential biomarkers for lung cancer by integrating bioinformatics analysis and machine learning (ML)-based approaches. Data were normalized using the robust multiarray average method, and batch effects were corrected using the ComBat method. Differentially expressed genes were identified by the LIMMA approach, and carcinoma-associated genes were selected using Enrichr, based on the DisGeNET database. Protein-protein interaction (PPI) network analysis was performed using STRING, and the PPI network was visualized using Cytoscape.

Introduction

Lung cancer is one of the most common cancers, and its prevalence and mortality rate have increased rapidly worldwide. It is the leading cause of cancer-related death in both sexes [1]. Around 2.2 million new cases of lung cancer are diagnosed each year, and approximately 1.8 million people die from the disease worldwide [2]. There are two main subtypes of lung cancer: small-cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for around 85% of cases and is also the most malignant carcinoma among men and women [3–5]. It has grown to be a major worldwide health concern that imposes a heavy financial burden on individuals and families.

Materials and method

Proposed methodology

The overall workflow adopted for this study is presented in Fig 1. We used Gene Expression Omnibus (GEO) datasets derived from USA and Taiwan cohorts. The training datasets were used to determine the core genes for each NSCLC cohort, and their performance was validated on the test set. First, we combined the training datasets for each cohort and normalized them using robust multi-array average (RMA), followed by batch-effect correction with the ComBat method. We then determined the differentially expressed genes (DEGs) using linear models for microarray data (LIMMA) and identified carcinoma-associated DEGs for each cohort using the Enrichr web tool.
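To make the batch-correction step concrete, the following is a minimal sketch of per-batch mean-centering for a single gene. It is a deliberately simplified stand-in and not the ComBat method, which additionally adjusts batch-specific variance and applies empirical Bayes shrinkage across genes. The function name `center_by_batch`, the batch labels, and the toy values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(values, batches):
    """Remove batch-specific mean shifts for ONE gene.

    values  -- log2 expression values, one per sample
    batches -- batch label for each sample (e.g. the source dataset)

    Simplified stand-in for ComBat: only the batch mean is adjusted.
    """
    if len(values) != len(batches):
        raise ValueError("values and batches must have the same length")
    grand_mean = mean(values)
    by_batch = defaultdict(list)
    for value, label in zip(values, batches):
        by_batch[label].append(value)
    batch_mean = {label: mean(vals) for label, vals in by_batch.items()}
    # Subtract each sample's batch mean, then restore the overall level.
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]


# Toy data: one gene measured in two hypothetical batches "A" and "B".
expr = [5.0, 6.0, 7.0, 9.0, 10.0, 11.0]
batch = ["A", "A", "A", "B", "B", "B"]
corrected = center_by_batch(expr, batch)
```

After centering, both batches share the same mean, so the offset between them no longer dominates the combined dataset. Note that this simple approach would also remove real biological signal if disease status were unevenly distributed across batches, which is one reason model-based methods such as ComBat are preferred in practice.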

Results

Identification of DEGs

The DEGs were identified from the combined dataset based on the criteria Adj. p-value < 0.01 and |FC| > 2. Using these criteria, we identified 394 DEGs (318 up-regulated and 76 down-regulated) for the USA cohort. The volcano plot of the DEGs between NSCLC patients and healthy controls for the USA cohort is displayed in Fig 2a. Similarly, we obtained a total of 277 DEGs (226 up-regulated and 51 down-regulated) for the Taiwan cohort, as shown in Fig 2b.
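The selection criteria can be illustrated with a short sketch that adjusts raw p-values with the Benjamini-Hochberg procedure (the default adjustment in LIMMA's `topTable`) and then splits genes into up- and down-regulated sets. Two points are assumptions rather than details from the text: the fold-change statistic is treated as a signed log2 value, and the cutoff is passed explicitly because the scale of the threshold is not stated here. The gene names and numbers are invented for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def split_degs(genes, log2fc, pvalues, adj_p_cutoff, effect_cutoff):
    """Split genes into (up, down) lists using both thresholds."""
    adj = benjamini_hochberg(pvalues)
    up, down = [], []
    for gene, lfc, q in zip(genes, log2fc, adj):
        if q < adj_p_cutoff and abs(lfc) > effect_cutoff:
            (up if lfc > 0 else down).append(gene)
    return up, down


# Hypothetical toy input; the effect cutoff of 2.0 assumes a log2 scale.
genes = ["G1", "G2", "G3", "G4", "G5"]
log2fc = [3.0, -2.5, 0.5, 2.2, -3.1]
pvals = [0.0001, 0.0005, 0.0002, 0.2, 0.004]
adj = benjamini_hochberg(pvals)
up, down = split_degs(genes, log2fc, pvals, adj_p_cutoff=0.01, effect_cutoff=2.0)
```

In this toy input, G3 is significant but fails the effect-size cutoff, while G4 passes the effect-size cutoff but is not significant, so neither is retained. This is the same two-sided filtering that a volcano plot visualizes.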

Discussion

This study attempted to propose a system to identify potential biomarkers for patients with NSCLC using the integration of bioinformatics and ML-based approaches. In high-dimensional genomic data analysis, biomarker selection is challenging, mainly due to the large number of characteristics relative to the limited sample size. To identify effective biomarkers in these settings, multiple approaches are available, including hypothesis-based tests, penalized methods like the least absolute shrinkage and selection operator (LASSO), and other ML-based approaches such as support vector machine recursive feature elimination (SVMRFE).
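As a small illustration of the hypothesis-based family of approaches mentioned above, the sketch below ranks genes by the absolute value of Welch's t-statistic between cases and controls. It does not implement LASSO or SVM-RFE, and it is not presented as the procedure used in this study. The function names, gene labels, and expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(case, control):
    """Welch's t-statistic for two independent samples.

    Assumes each group has at least two values and that the
    combined standard error is non-zero.
    """
    se = sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / se


def rank_genes(case_expr, control_expr):
    """Rank genes by |t|, largest first.

    Both arguments map a gene name to its list of expression values.
    """
    scores = {g: welch_t(case_expr[g], control_expr[g]) for g in case_expr}
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Hypothetical toy data: three cases and three controls per gene.
case_expr = {"gene_a": [8.0, 8.5, 9.0], "gene_b": [6.0, 7.0, 5.0]}
control_expr = {"gene_a": [5.0, 5.5, 6.0], "gene_b": [6.5, 5.5, 6.0]}
ranked = rank_genes(case_expr, control_expr)
t_a = welch_t(case_expr["gene_a"], control_expr["gene_a"])
```

With thousands of genes and only a few dozen samples, many genes will score highly by chance, which is why such univariate rankings are usually combined with multiple-testing correction or with multivariate selectors such as LASSO and SVM-RFE.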

Conclusion

This study aimed to identify potential biomarkers for lung cancer using integrated bioinformatics and ML-based approaches. After performing different bioinformatics and ML-based analyses, our findings indicated that EDNRB and MME are potential biomarkers for NSCLC shared between the USA and Taiwan cohorts. The regulatory network analysis of these potential biomarkers identified the key TFs (FOXC1 and FOXL1) and miRNAs (hsa-mir-106b-5p, hsa-mir-20a-5p, and hsa-mir-27a-3p) as the transcriptional and post-transcriptional regulators of NSCLC.

Citation: Rabby MS, Islam MM, Kumar S, Maniruzzaman M, Hasan MAM, Tomioka Y, et al. (2025) Identification of potential biomarkers for lung cancer using integrated bioinformatics and machine learning approaches. PLoS ONE 20(2): e0317296. https://doi.org/10.1371/journal.pone.0317296

Editor: Suyan Tian, The First Hospital of Jilin University, CHINA

Received: July 23, 2024; Accepted: December 24, 2024; Published: February 27, 2025

Copyright: © 2025 Rabby et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: In this study, we used five datasets (GSE54495, GSE49644, GSE102287, GSE40791, and GSE101929) from USA cohort and another three datasets (GSE33356, GSE19804, and GSE27262) from Taiwan cohorts. These datasets can be easily downloaded from the following link: www.ncbi.nlm.nih.gov/geo/. Moreover, TCGA-LIHC dataset can also be easily downloaded from the TCGA database (https://portal.gdc.cancer.gov/).

Funding: This work was supported by the Competitive Research Fund of The University of Aizu, Japan (Grant Number: P-13).

Competing interests: The authors have declared that no competing interests exist.
