Structure Learning for Gene Regulatory Networks

Anthony Federico, Joseph Kern, Xaralabos Varelas, Stefano Monti

Abstract

Inference of biological network structures is often performed on high-dimensional data, yet is hindered by the limited sample size of high throughput “omics” data typically available. To overcome this challenge, often referred to as the “small n, large p problem,” we exploit known organizing principles of biological networks that are sparse, modular, and likely share a large portion of their underlying architecture. We present SHINE—Structure Learning for Hierarchical Networks—a framework for defining data-driven structural constraints and incorporating a shared learning paradigm for efficiently learning multiple Markov networks from high-dimensional data at large p/n ratios not previously feasible. We evaluated SHINE on Pan-Cancer data comprising 23 tumor types, and found that learned tumor-specific networks exhibit expected graph properties of real biological networks, recapture previously validated interactions, and recapitulate findings in literature. Application of SHINE to the analysis of subtype-specific breast cancer networks identified key genes and biological processes for tumor maintenance and survival as well as potential therapeutic targets for modulating known breast cancer disease genes.

Introduction

Biological networks can model functional relationships at different cellular levels–genes, proteins, metabolites–and can be integrated to depict system-wide connectivity. Gene regulatory network (GRN) reconstruction aimed at inferring putative mechanistic interactions associated with disease phenotypes can support the identification of drivers of disease severity and treatment response [1–3]. Importantly, changes in network connectivity across experimental conditions or phenotypes may help pinpoint important context-specific regulators or mediators, and inform functional experiments aimed at elucidating mechanisms of action (MOAs), targetable vulnerabilities, and resistance to treatment [4–7].

Materials and methods

Module detection and extension

Genes are clustered by their co-expression similarity $s_{ij}$, measured as the absolute value of the biweight midcorrelation coefficient, $s_{ij} = |\mathrm{bicor}(x_i, x_j)|$. A soft-thresholding power $\beta$ pushes spurious correlations toward zero, resulting in a symmetric $p \times p$ weighted adjacency matrix $a_{ij} = s_{ij}^{\beta}$. Co-expression modules are detected by hierarchical clustering of a topological overlap dissimilarity transformation $d_{ij}$ of $a_{ij}$, resulting in $Q$ modules. Genes are assigned a membership score across all modules, where the membership of gene $i$ in module $q$ is the correlation of $i$ with the module eigengene $E^{(q)}$, which for the $q$th module is the first principal component of the expression profiles of the genes within $q$; thus $MM = |\mathrm{bicor}(x_i, E^{(q)})|$. Within each module, each gene is assigned a probability of membership through quadratic discriminant analysis based on MM1/MM2. Module membership is extended to non-member genes whose membership probability exceeds $M_p$.
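The similarity and soft-thresholding step can be illustrated with a minimal Python sketch. It is not the shine package's implementation: the function names `bicor` and `soft_threshold_adjacency` are hypothetical, the biweight midcorrelation follows its standard median/MAD-based definition, and leaving the diagonal at zero is an assumption made here for simplicity. Module detection, eigengenes, and membership extension are not covered.

```python
from math import sqrt
from statistics import median


def bicor(x, y):
    """Biweight midcorrelation of two equal-length numeric sequences."""

    def normalized(v):
        med = median(v)
        mad = median(abs(a - med) for a in v)
        if mad == 0:
            raise ValueError("zero MAD: bicor is undefined for this vector")
        terms = []
        for a in v:
            u = (a - med) / (9 * mad)
            # Tukey biweight: observations with |u| >= 1 get zero weight.
            w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
            terms.append((a - med) * w)
        norm = sqrt(sum(t * t for t in terms))
        return [t / norm for t in terms]

    return sum(a * b for a, b in zip(normalized(x), normalized(y)))


def soft_threshold_adjacency(expr, beta):
    """Return the p x p matrix a_ij = |bicor(x_i, x_j)| ** beta.

    `expr` is a list of p gene expression profiles (one list per gene).
    The diagonal is left at 0.0 (an assumption of this sketch).
    """
    p = len(expr)
    adj = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            a = abs(bicor(expr[i], expr[j])) ** beta
            adj[i][j] = adj[j][i] = a  # symmetric by construction
    return adj
```

Because $s_{ij} \le 1$, raising it to a power $\beta > 1$ shrinks weak correlations much more than strong ones, which is the sense in which spurious correlations are pushed toward zero.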

Results

SHINE Algorithm overview

The SHINE algorithm takes a multi-pronged approach to learning biological Markov networks, whereby multiple related networks are learned in a hierarchical procedure. First, structure learning constraints are applied based on co-expression module detection, to identify genes unlikely to interact and thus reduce the complexity of the graphical search space. Second, a network hierarchy is defined based on the relationships between groups of samples representing the distinct phenotypes from which networks are to be learned. The network hierarchy, the detected modules, and the omics dataset are the inputs to the learning procedure, which is outlined in S1 Fig. Using a top-down approach, child networks take advantage of a priori structural information from previously learned parent networks in the hierarchy (S4 Fig). Structure constraints are detected and applied at the root level of the hierarchy and used in a divide-and-conquer (DAQ) fashion, whereby subgraphs (from the feature sets of extended modules) are learned independently and then merged to create a final global network structure. When learning a child network from a parent network in the DAQ context, the posterior distribution of edges of each parent subgraph is used as a prior in learning the corresponding child subgraph, and the child subgraphs are then merged into a final child network structure. Each network is estimated with a birth-death Markov chain Monte Carlo algorithm for the inference of undirected Gaussian graphical models (GGMs) using marginal pseudo-likelihood maximization [24].
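The bookkeeping around the divide-and-conquer and parent-to-child steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the shine package's code: the names `merge_subgraphs` and `child_prior` are hypothetical, an edge seen in several overlapping modules keeps its maximum posterior probability (the text does not specify the merge rule), edges absent from the parent receive a neutral default prior, and priors are clipped away from 0 and 1 so that no edge is ruled in or out with certainty. The structure learning step itself (birth-death MCMC) is not shown.

```python
def merge_subgraphs(subgraph_posteriors):
    """Merge per-module edge posteriors into one global edge map.

    Each element maps an undirected edge (u, v) to a posterior probability.
    Assumption: an edge present in several modules keeps its maximum value.
    """
    merged = {}
    for posterior in subgraph_posteriors:
        for (u, v), prob in posterior.items():
            key = tuple(sorted((u, v)))  # canonical order for undirected edges
            merged[key] = max(prob, merged.get(key, 0.0))
    return merged


def child_prior(parent_posterior, genes, default=0.5, floor=0.01, ceil=0.99):
    """Build an edge prior for a child subgraph over `genes`.

    The parent's posterior edge probabilities serve as the prior; `default`,
    `floor`, and `ceil` are illustrative assumptions, not values from the text.
    """
    genes = sorted(genes)
    prior = {}
    for i, u in enumerate(genes):
        for v in genes[i + 1:]:
            p = parent_posterior.get((u, v), default)
            prior[(u, v)] = min(max(p, floor), ceil)
    return prior
```

For example, merging two module-level posteriors that both contain the TP53/MDM2 edge yields a single entry for that edge, and `child_prior` restricted to one module's genes returns a prior for every gene pair in that module.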

Discussion

In this report we present SHINE, a novel multi-pronged approach for learning biological Markov networks from limited sample sizes. We exploit known organizing principles of biological networks to limit the model parameters of structure learning and encourage shared learning of multiple networks to boost the equivalent sample size. This approach reduces the complexity of the search space, allows related networks to share data, and takes advantage of a priori structural information, resulting in higher overall performance with fewer false positives. This comes at the cost of slightly more false negatives; however, the tradeoff is advantageous in the context of biological network inference where n≪p, since we are primarily interested in predicting high-confidence interactions for hypothesis generation. We apply SHINE to reconstruct tumor-specific networks from TCGA data, as well as to a focused analysis of breast cancer, and find that the inferred networks exhibit expected graph properties of real biological networks, recapture previously validated interactions, and recapitulate findings in the literature.

Acknowledgments

The authors would like to thank R. Mohammadi, author of the R package BDgraph, for his assistance with inquiries related to Bayesian structure learning through his software.

Citation: Federico A, Kern J, Varelas X, Monti S (2023) Structure learning for gene regulatory networks. PLoS Comput Biol 19(5): e1011118.
https://doi.org/10.1371/journal.pcbi.1011118

Editor: Manja Marz, bioinformatics, GERMANY

Received: February 14, 2022; Accepted: April 20, 2023; Published: May 18, 2023

Copyright: © 2023 Federico et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The presented methods are available as an open source R package named shine. The package supports high-dimensional constraint-based structure learning for network hierarchies. We have additionally made available the Nextflow-based framework for inferring multiple large-scale networks on high performance computing environments called shine-nf. The learned networks have been published to the Network Data Exchange (NDEx) and are also hosted through our publicly accessible interactive web portal. Many of the network analysis and visualization functionalities described in the manuscript are available as an open source R package named bieulergy.
Portal: bieulergy.shinyapps.io/shine
Repositories: github.com/montilab/{shine,shine-nf,bieulergy}
NDEx (Pan-Cancer Networks): 2122e735-bf01-11eb-8ba9-0ac135e8bacf
NDEx (Breast Cancer Networks): 45a1b9d9-bf14-11eb-8ba9-0ac135e8bacf
Documentation: montilab.github.io/shine
Operating system: Linux, OS X
Programming languages: R, Nextflow
License: GNU GPLv3

Funding: This work was supported by the Find the Cause Breast Cancer Foundation (findthecausebcf.org, to SM), the National Cancer Institute (NCI U01CA243004, to SM, 31CA232683, to JK), the National Institute on Aging (NIA cooperative agreement UH2AG064704, to SM), the Moorman-Simon Fellowship in Computational Biomedicine (to AF), as well as the National Institute of Dental & Craniofacial Research (NIDCR F31DE029701, to AF, and R01 DE 031831, to SM). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.