Accurate Detection of Shared Genetic Architecture From GWAS Summary Statistics in the Small-sample Context

Thomas W. Willis, Chris Wallace

Abstract

Assessment of the genetic similarity between two phenotypes can provide insight into a common genetic aetiology and inform the use of pleiotropy-informed, cross-phenotype analytical methods to identify novel genetic associations. The genetic correlation is a well-known means of quantifying and testing for genetic similarity between traits, but its estimates are subject to comparatively large sampling error. This makes it unsuitable for use in a small-sample context. We discuss the use of a previously published nonparametric test of genetic similarity for application to GWAS summary statistics.

Introduction

Genetic pleiotropy is the association of a genetic variant with more than one trait and is a pervasive property of the human genome [1]. A consequence of this phenomenon is the sharing of causal variants between phenotypes. The relationship of multiple phenotypic characters to a common heritable factor was first inferred from observations of the coinheritance of certain traits and diseases. Of these, particularly memorable are Darwin’s remarks that hairless dogs have imperfect teeth [2] and that blue-eyed cats are ‘invariably deaf’ [3, 4].

Materials and methods

The GPS test evaluates a null hypothesis of bivariate independence for two random variables U and V. Each random variable models the data-generating process from which the p-values for tests of association of SNPs with one phenotype are drawn. We take a GPS test statistic that is improbable under this null hypothesis of independence as evidence against the absence of shared genetic architecture between the two phenotypes.
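To make the idea of testing U and V for bivariate independence concrete, the following Python sketch computes a GPS-style statistic from two vectors of p-values. It assumes the form of the statistic given in the original GPS publication: the maximum over SNPs of the scaled, standardised difference between the bivariate empirical CDF and the product of the two marginal empirical CDFs. That formula is not stated in this section and should be checked against the reference before use. The function names are hypothetical, the quadratic-time bivariate ecdf is chosen for clarity rather than speed, and no null distribution is fitted, so the sketch yields a statistic and not a p-value.

```python
import numpy as np


def ecdf_at_points(x):
    """Marginal empirical CDF of x evaluated at each of its own points."""
    n = len(x)
    # With side="right", searchsorted returns the count of values <= x_i.
    return np.searchsorted(np.sort(x), x, side="right") / n


def bivariate_ecdf_at_points(u, v):
    """Fraction of pairs (u_j, v_j) with u_j <= u_i and v_j <= v_i, for each i.

    Uses O(n^2) time and memory; fine for a sketch, not for genome-wide panels.
    """
    le_u = u[None, :] <= u[:, None]  # element [i, j] is u_j <= u_i
    le_v = v[None, :] <= v[:, None]  # element [i, j] is v_j <= v_i
    return (le_u & le_v).mean(axis=1)


def gps_like_statistic(u, v):
    """GPS-style statistic (assumed form, see lead-in) for paired p-values."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    n = len(u)
    fu = ecdf_at_points(u)
    fv = ecdf_at_points(v)
    fuv = bivariate_ecdf_at_points(u, v)
    prod = fu * fv
    denom = np.sqrt(prod - prod**2)
    # A point that is the maximum in both u and v gives 0/0; skip such points.
    valid = denom > 0
    ratio = np.abs(fuv[valid] - prod[valid]) / denom[valid]
    return np.sqrt(n / np.log(n)) * ratio.max()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: independent uniform p-values, as expected when
    # neither trait has any associated SNPs.
    u = rng.uniform(size=500)
    v = rng.uniform(size=500)
    print("independent p-values:", gps_like_statistic(u, v))
    # Perfectly dependent p-values as an extreme contrast.
    print("identical p-values:  ", gps_like_statistic(u, u))
```

The statistic is a maximum over SNPs, so it is driven by the single point at which the joint distribution departs most from independence. Deciding whether a given value is improbable still requires a null distribution for the statistic, which this sketch does not attempt to provide.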

Discussion

We conducted a comprehensive study of both GPS tests and found them to be a superior means of identifying genetic similarity between disease traits in the small-sample context. For simulated data, the power of the GPS tests was greater than that of their comparators when the number of cases or the genetic correlation was small. This greater power was also in evidence when the GPS tests were applied to immune disease pairs from the UK Biobank. In simulations we found the GPS-GEV test to offer an advantage over the GPS-Exp test in terms of power, at the cost of the computation required by its permutation procedure. We also found that use of a SNP panel pruned with a far higher value of r² improved the power of the GPS tests without loss of type 1 error control.

Acknowledgments

We would like to thank Dr Xavier Warin for timely assistance with the use of the Stochastic Optimisation library StOpt [54]. We would also like to thank our colleague Dr Guillermo Reales for his curation of some of the GWAS data sets used in this work and creation of the GWAS_tools pipeline. We wish to acknowledge all GWAS participants, in particular those of the UK Biobank and FinnGen, for their contribution to the data used herein. We also acknowledge the investigators who carried out these GWAS and made their summary statistics publicly available. We acknowledge in particular the Pan-UKBB team [55]. This research has been conducted using the UK Biobank Resource under Application Number 98032.

Citation: Willis TW, Wallace C (2023) Accurate detection of shared genetic architecture from GWAS summary statistics in the small-sample context. PLoS Genet 19(8): e1010852. https://doi.org/10.1371/journal.pgen.1010852

Editor: Michael P. Epstein, Emory University, UNITED STATES

Received: October 12, 2022; Accepted: June 30, 2023; Published: August 16, 2023

Copyright: © 2023 Willis, Wallace. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Statistics produced in the analysis of real and simulated data sets in this paper have been deposited at https://doi.org/10.5281/zenodo.7150454. UK Biobank GWAS summary statistics were downloaded from www.nealelab.is/uk-biobank. Reference data from Phase 3 of the 1000 Genomes Project were obtained from the Project’s FTP server at http://ftp.1000genomes.ebi.ac.uk. LD scores were downloaded from https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2. A snakemake-based pipeline is provided to automate the download and generation of the data used or produced in this work at https://github.com/twillis209/gps_paper_pipeline.

Funding: TW is funded by the Medical Research Council (MRC) https://mrc.ukri.org/ (MC_UU_00002/4). CW is funded by the Wellcome Trust https://wellcome.ac.uk/ (WT107881), the Medical Research Council (MRC) https://mrc.ukri.org/ (MC_UU_00002/4) and supported by the NIHR Cambridge BRC https://cambridgebrc.nihr.ac.uk/ (BRC-1215-20014). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.

Competing interests: The authors have read the journal’s policy and have the following competing interests: CW receives funding from GSK and MSD, and is a part-time employee of GSK. These funders had no involvement in this work.