CoVar: A Generalizable Machine Learning Approach to Identify the Coordinated Regulators Driving Variational Gene Expression

Satyaki Roy, Shehzad Z. Sheikh, Terrence S. Furey

Abstract

Network inference is used to model transcriptional, signaling, and metabolic interactions among genes, proteins, and metabolites, and thereby to identify biological pathways that influence disease pathogenesis. Recent machine learning (ML)-based inference models can capture latent patterns in genomic data, and they are emerging as an alternative to statistical models for identifying the causative factors that drive complex diseases. We present CoVar, an ML-based framework that builds upon the properties of existing inference models to find the central genes driving perturbed gene expression across biological states. Unlike differential expression analysis, which captures changes in individual gene expression across conditions, CoVar focuses on identifying variational genes, which undergo changes in their expression network interaction profiles, providing insight into changes in regulatory dynamics, such as those in disease pathogenesis. It then finds core genes among the nearest neighbors of these variational genes; these core genes are central to the variational activity and influence the coordinated regulatory processes underlying the observed changes in gene expression.

Introduction

The advent of high-throughput genomic data acquisition techniques has generated immense interest in the statistical and data-driven analysis of biological and biochemical interactions [1]. One such approach, network inference, attempts to identify network topologies that capture the “interactome”, defined as the set of direct or indirect molecular interactions within a biological system [2,3]. It employs a range of computational models, such as Bayesian, autoregressive, and differential equation models, to answer questions ranging from basic cell biology to disease pathogenesis. The efficacy of the resulting biological networks largely depends on the ability of the computational models to capture the complex dynamics among the entities, which are determined by nonlinear and stochastic interactions [4].

Methods

2.1 RNA-seq data preprocessing

CoVar takes as input the results of two sets of genome-wide expression experiments, where one set represents a perturbation relative to the second set of control experiments. Data from each set of experiments is represented as an S×N expression matrix X with S samples and N genes, where rows are samples and columns are genes, i.e., X_{i,u} is the expression value of gene u in sample i (see Fig 1).
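The input layout described above can be sketched in a few lines of Python. This is only an illustration of the S×N convention (rows are samples, columns are genes), not the preprocessing code from the CoVar repository: the class name `ExpressionMatrix`, its fields, and the toy values are all hypothetical. The sketch assumes that control and perturbed matrices share the same gene columns; it does not require equal sample counts, which the text above does not specify either way.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExpressionMatrix:
    """S x N expression matrix: one row per sample, one column per gene."""

    genes: List[str]           # column labels, length N
    values: List[List[float]]  # S rows, each holding N expression values

    def __post_init__(self) -> None:
        # Reject ragged input so that every X[i][u] lookup is well defined.
        for row in self.values:
            if len(row) != len(self.genes):
                raise ValueError("each sample row needs one value per gene")

    @property
    def shape(self) -> Tuple[int, int]:
        """Return (S, N)."""
        return (len(self.values), len(self.genes))

    def expression(self, sample: int, gene: str) -> float:
        """Return X[i, u], the expression of gene u in sample i."""
        return self.values[sample][self.genes.index(gene)]


# Toy data (made up for illustration): 2 control and 3 perturbed samples.
control = ExpressionMatrix(
    genes=["g1", "g2", "g3"],
    values=[[5.0, 0.0, 2.5], [4.0, 1.0, 3.0]],
)
perturbed = ExpressionMatrix(
    genes=["g1", "g2", "g3"],
    values=[[1.0, 6.0, 2.0], [0.5, 7.0, 2.5], [1.5, 5.5, 3.0]],
)

# Assumption of this sketch: both conditions measure the same genes in the same order.
assert control.genes == perturbed.genes
```

Here `control.expression(1, "g3")` looks up the entry X_{1,g3} of the control matrix using zero-based sample indices.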

Results

3.1 Overview of CoVar

CoVar utilizes machine learning (ML) and network science to identify variational and core (or central) genes from gene expression data that are altered across biological conditions. Expression data from perturbed and control conditions are each represented as networks, where nodes are genes and expression relationships are directed edges (u, v) with weight w_{u,v} ∈ [0,1] that represents the strength of the influence of gene u on the expression of gene v (Fig 6A; Methods 2.2). Genes that show the largest changes in the strength of influence relationships (weights) with neighbor genes are identified as variational genes (Fig 6B; Methods 2.3), providing an initial set of genes highly affected by the biological perturbation. Genes with strong connections to the variational genes, denoted as their nearest neighbors (Fig 6C; Methods 2.4), are determined, and together with the variational genes they form the nearest neighbor network. Modules, or network communities, are then identified to focus on well-connected groups of genes (Fig 6D; Methods 2.5).
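The first steps of this overview can be illustrated with a minimal sketch. It is not the scoring defined in Methods 2.3 and 2.4, which is not reproduced here. As a labeled assumption, each gene is scored by the summed absolute change in weight over every edge that touches it, the top k genes are treated as variational, and nearest neighbors are the genes linked to a variational gene by an edge whose weight meets a threshold in the perturbed network. The function names, the threshold, the choice of the perturbed network, and the toy weights are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # directed edge (u, v): influence of gene u on gene v


def variation_scores(
    w_control: Dict[Edge, float], w_perturbed: Dict[Edge, float]
) -> Dict[str, float]:
    """Sum |w_perturbed - w_control| over all edges touching each gene.

    Edges missing from one network are treated as weight 0 (an assumption).
    """
    scores: Dict[str, float] = {}
    for edge in set(w_control) | set(w_perturbed):
        delta = abs(w_perturbed.get(edge, 0.0) - w_control.get(edge, 0.0))
        for gene in set(edge):  # set() avoids double counting a self-loop
            scores[gene] = scores.get(gene, 0.0) + delta
    return scores


def top_variational(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring genes; ties are broken alphabetically."""
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def nearest_neighbors(
    weights: Dict[Edge, float], seeds: Iterable[str], threshold: float
) -> List[str]:
    """Genes joined to a seed gene by an edge with weight >= threshold."""
    seed_set = set(seeds)
    neighbors = set()
    for (u, v), w in weights.items():
        if w < threshold:
            continue
        if u in seed_set and v not in seed_set:
            neighbors.add(v)
        if v in seed_set and u not in seed_set:
            neighbors.add(u)
    return sorted(neighbors)


# Toy networks (made-up weights in [0, 1]).
w_control = {("a", "b"): 0.75, ("b", "c"): 0.25, ("c", "d"): 0.5}
w_perturbed = {("a", "b"): 0.25, ("b", "c"): 0.5, ("c", "d"): 0.5, ("d", "a"): 0.125}

scores = variation_scores(w_control, w_perturbed)
variational = top_variational(scores, k=2)
neighbors = nearest_neighbors(w_perturbed, variational, threshold=0.5)
```

On these toy weights, gene b has the largest summed change (0.75), followed by a (0.625), so `variational` is `["b", "a"]` and the only strongly connected neighbor is c. Module detection on the resulting nearest neighbor network is left out of the sketch.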

Discussion

We present a network analysis framework, called CoVar, that analyzes expression data across two conditions to identify central genes potentially involved in more fundamental changes to the cellular state caused by the perturbation. CoVar prioritizes genes whose expression appears to influence the largest number of modified coexpression relationships in a coordinated manner, not simply genes whose individual expression levels have changed. We believe the unique strengths of CoVar will be most beneficial in instances of significant cellular reprogramming due to an altered gene regulatory landscape, such as in disease or extreme changes in the environment, where reprogrammed cells show changes not only in their expression profiles but also in their regulatory interactions, leading to altered responses to external stimuli. Given control and perturbed expression datasets, CoVar employs feature ranking-based machine learning to create separate networks with directed edges whose weights are commensurate with the influence of one gene on the expression of another. Our analysis of simulated and real expression data highlights the distinctive characteristic of CoVar that enables it to capture the modularity and variationality in the expression datasets. CoVar deliberately selects a few top variational genes rather than an exhaustive list. This intentional restraint serves a dual purpose: it captures pivotal genes with significant relative differences across control and perturbed datasets while deliberately allowing space for the emergence of modularity within the network.
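To make the idea of a feature ranking-based directed network concrete, the sketch below builds one from a small expression matrix. It deliberately swaps in a very simple ranking score, the squared Pearson correlation between a candidate regulator and the target gene, normalized over all candidate regulators of that target. This is a stand-in chosen to keep the example dependency-free, not the learner used by CoVar. The function names and the toy data are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[str, str]  # directed edge (u, v): influence of gene u on gene v


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def infer_network(genes: List[str], rows: List[List[float]]) -> Dict[Edge, float]:
    """Rank every other gene as a predictor of each target gene.

    The weight of edge (u, v) is the squared correlation of u with target v,
    divided by the sum of squared correlations over all predictors of v.
    Each weight therefore lies in [0, 1], and because the normalization is
    done per target, the weight of (u, v) can differ from that of (v, u).
    """
    columns = {g: [row[j] for row in rows] for j, g in enumerate(genes)}
    weights: Dict[Edge, float] = {}
    for target in genes:
        raw = {
            u: pearson(columns[u], columns[target]) ** 2
            for u in genes
            if u != target
        }
        total = sum(raw.values())
        for u, score in raw.items():
            weights[(u, target)] = score / total if total > 0 else 0.0
    return weights


# Toy S x N matrix (made up): g2 is exactly twice g1, g3 is weakly related.
genes = ["g1", "g2", "g3"]
rows = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, 2.0],
]
network = infer_network(genes, rows)
```

Running `infer_network` once on the control matrix and once on the perturbed matrix would yield the two networks that are then compared. In this toy example g1 receives most of the weight as a predictor of g2, while g1 and g2 split the weight evenly as predictors of g3.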

Citation: Roy S, Sheikh SZ, Furey TS (2024) CoVar: A generalizable machine learning approach to identify the coordinated regulators driving variational gene expression. PLoS Comput Biol 20(4): e1012016. https://doi.org/10.1371/journal.pcbi.1012016

Editor: Piero Fariselli, Universita degli Studi di Torino, ITALY

Received: January 10, 2023; Accepted: March 22, 2024; Published: April 17, 2024

Copyright: © 2024 Roy et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data and code that support the findings of this study are openly available in a GitHub repository at https://github.com/satunr/CoVar.

Funding: This work was funded by a grant from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK; P01DK094779) to S.Z.S. and T.S.F. and by the Multi-Omic iNtegrated Analysis in Lupus Project (MONA Lupus) to S.Z.S. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. S.R. and S.Z.S. received salary funding from MONA Lupus, and T.S.F. and S.Z.S. received salary funding from the NIDDK grant.

Competing interests: The authors have declared that no competing interests exist.

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012016#abstract0