Fabio Morgante, Peter Carbonetto, Gao Wang, Yuxin Zou, Abhishek Sarkar, Matthew Stephens
Abstract
Predicting phenotypes from genotypes is a fundamental task in quantitative genetics. With technological advances, it is now possible to measure multiple phenotypes in large samples. Multiple phenotypes can share their genetic component; therefore, modeling these phenotypes jointly may improve prediction accuracy by leveraging effects that are shared across phenotypes. However, effects can be shared across phenotypes in a variety of ways, so computationally efficient statistical methods are needed that can accurately and flexibly capture patterns of effect sharing. Here, we describe new Bayesian multivariate, multiple regression methods that, by using flexible priors, are able to model and adapt to different patterns of effect sharing and specificity across phenotypes.
Introduction
Multiple regression has been an important tool in genetics for different tasks relating genotypes and phenotypes, including discovery, inference, and prediction. For discovery, multiple regression has been used to fine-map genetic variants discovered by genome-wide association studies (GWAS) [1, 2]. For inference, multiple regression has been used to estimate the proportion of phenotypic variance explained by genetic variants, i.e., “genomic heritability” or “SNP heritability” [3–5]. For prediction, multiple regression has been used extensively to predict yet-to-be-observed phenotypes from genotypes. This task is relevant to the prediction of breeding values for selection purposes in agriculture [6, 7], the prediction of “polygenic scores” for disease risk and medically relevant phenotypes in human genetics [8–10], and the prediction of gene expression as an intermediate step in transcriptome-wide association studies (TWAS) [11, 12]. Traditionally, frequentist multiple regression methods such as penalized regression and linear mixed models [13–16] have been used for these tasks.
Materials and methods
We consider the multivariate multiple regression model of outcomes Y on predictors X,
$$Y = XB + E, \qquad E \sim MN_{n \times r}(0, I_n, V),$$
where $Y$ is an $n \times r$ matrix of $r$ outcomes observed in $n$ samples (possibly containing missing values), $X$ is an $n \times p$ matrix of $p$ predictors observed in the same $n$ samples, $B$ is the $p \times r$ matrix of effects, $E$ is an $n \times r$ matrix of residuals, $I_n$ is the $n \times n$ identity matrix, and $MN_{n \times r}(M, U, V)$ is the matrix normal distribution with mean $M \in \mathbb{R}^{n \times r}$ and covariance matrices $U \in \mathbb{R}^{n \times n}$, $V \in \mathbb{R}^{r \times r}$ [46, 47].
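To make the model concrete, the following is a minimal simulation sketch, not the authors' code. It assumes the model takes the form Y = XB + E with the rows of E drawn independently from an r-variate normal with covariance V, which is the reading suggested by the identity row covariance. The function names (`cholesky`, `simulate`), the 0/1/2 predictor coding, and all numerical values of B and V are hypothetical choices made for illustration only.

```python
import random


def cholesky(V):
    """Return lower-triangular L with L L^T = V (V small, symmetric, positive definite)."""
    r = len(V)
    L = [[0.0] * r for _ in range(r)]
    for i in range(r):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = (V[i][i] - s) ** 0.5
            else:
                L[i][j] = (V[i][j] - s) / L[j][j]
    return L


def simulate(X, B, V, rng):
    """Draw Y = XB + E, where each row of E is an independent N_r(0, V) draw."""
    n, p, r = len(X), len(B), len(V)
    L = cholesky(V)
    Y = []
    for i in range(n):
        # Correlated residuals for sample i: e = L z with z standard normal.
        z = [rng.gauss(0.0, 1.0) for _ in range(r)]
        e = [sum(L[j][k] * z[k] for k in range(j + 1)) for j in range(r)]
        # Row i of XB.
        mean = [sum(X[i][m] * B[m][j] for m in range(p)) for j in range(r)]
        Y.append([mean[j] + e[j] for j in range(r)])
    return Y


rng = random.Random(1)
n, p, r = 2000, 3, 2

# Hypothetical predictors coded 0/1/2 (an illustrative assumption).
X = [[rng.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]

# Hypothetical effects: predictor 1 has an effect shared by both outcomes,
# predictor 2 affects only outcome 2, and predictor 3 has no effect.
B = [[0.5, 0.5],
     [0.0, 1.0],
     [0.0, 0.0]]

# Hypothetical residual covariance across the r outcomes.
V = [[1.0, 0.6],
     [0.6, 1.0]]

Y = simulate(X, B, V, rng)  # n x r list of lists
```

The rows of B illustrate the distinction between shared and outcome-specific effects that a joint model is meant to capture; fitting each column of Y separately would ignore both the shared effect and the residual correlation in V.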
Discussion
We have introduced mr.mash, a Bayesian multiple regression framework for modeling multiple (e.g., several dozen) responses jointly, with accurate prediction being the main goal. A key feature of our approach is that it can learn patterns of effect sharing across responses from the data, then use the learned patterns to improve prediction accuracy. This feature makes our method flexible and adaptive, which are advantages of particular importance for analyzing large, complex data sets. Our method is also fast and computationally scalable thanks to the use of variational inference (rather than MCMC) for model fitting.
Acknowledgments
We thank the University of Chicago Research Computing Center for providing high-performance computing resources used to run the numerical experiments. We thank Jeff Spence and Jonathan Pritchard for helpful discussions.
Citation: Morgante F, Carbonetto P, Wang G, Zou Y, Sarkar A, Stephens M (2023) A flexible empirical Bayes approach to multivariate multiple regression, and its improved accuracy in predicting multi-tissue gene expression from genotypes. PLoS Genet 19(7): e1010539. https://doi.org/10.1371/journal.pgen.1010539
Editor: Xiaofeng Zhu, Case Western Reserve University, UNITED STATES
Received: November 21, 2022; Accepted: June 2, 2023; Published: July 7, 2023
Copyright: © 2023 Morgante et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The genotype and expression data used in our analyses are available from dbGaP (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v8.p2). All code implementing the simulations, and the compiled results generated from our simulations have been deposited on Zenodo (https://doi.org/10.5281/zenodo.8014360). The methods are implemented in the R package mr.mash.alpha, available for download at https://github.com/stephenslab/mr.mash.alpha.
Funding: Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Numbers P20GM139769 and R35GM146868 to FM. MS acknowledges support from National Human Genome Research Institute grant R01HG002585. GW acknowledges support from National Institute on Aging grant R01AG076901. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.