CRISPR-Analytics (CRISPR-A): A platform for precise analytics and simulations for gene editing

Marta Sanvicente-García, Albert García-Valiente, Socayna Jouide, Jessica Jaraba-Wallace, Eric Bautista, Marc Escobosa, Avencia Sánchez-Mejías, Marc Güell

Abstract

Gene editing characterization with currently available tools does not always give precise relative proportions among the different types of gene edits present in an edited bulk of cells. We have developed CRISPR-Analytics (CRISPR-A), a comprehensive and versatile genome editing web application and Nextflow pipeline that supports gene editing experimental design and analysis. CRISPR-A provides a robust gene editing analysis pipeline composed of data analysis tools and simulation. It achieves higher accuracy than current tools and expands their functionality. The analysis includes mock-based noise correction, spike-in calibrated amplification bias reduction, and advanced interactive graphics. This added robustness makes the tool well suited to highly sensitive cases such as clinical samples or experiments with low editing efficiencies. It also provides an assessment of experimental design through the simulation of gene editing results. CRISPR-A can therefore support multiple kinds of experiments, such as double-stranded DNA break-based engineering, base editing (BE), prime editing (PE), and homology-directed repair (HDR), without the need to specify the experimental approach used.

Introduction

CRISPR-based gene editing has become a fundamental toolbox for a wide variety of research and applied needs. It facilitates the editing of endogenous genomic loci and the systematic interrogation of genetic elements and causal genetic variations [1–3]. It is now even on the verge of becoming a therapeutic reality in vivo [4]. Despite tremendous advances, DNA editing and writing still involve imperfect protocols that need to be optimized and evaluated. This makes it essential to have tools that enable accurate characterization of gene editing outcomes.

Gene editing outcomes often involve complex data sets with a diverse set of genotypes. This is especially pronounced for double-stranded DNA break-based gene editing, such as approaches relying on non-homologous end-joining (NHEJ) or homology-directed repair (HDR). These experiments often generate complex gene editing signatures involving insertions, substitutions, and deletions. Accurate quantification of this distribution of genotypes may have important implications, including for knockout integrity or splicing modulation.

Material and methods

Simulations algorithm development

SimGE is built taking into account the different layers of classes and their proportions. The proportion of edited and unedited sequences can be determined by two different sgRNA efficiency predictors, the Moreno-Mateos [41] and Doench 2016 [42] scores, which give the most reliable on-target activity prediction [43]. Both models assign a set of weights to descriptors of the gRNA sequence in order to define the efficiency as a value between 0 and 1, with 1 being the most efficient gRNA. Depending on the experimental design, one model or the other suits the data better: for guides expressed in cells from exogenous promoters such as U6, Doench 2016 scores are recommended, whereas for guides transcribed in vitro from the T7 RNA polymerase promoter, Moreno-Mateos scores are the better option.
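
As a rough illustration of this first layer (not SimGE's actual code, which is an R package), the following Python sketch picks a predictor according to how the guide is expressed and then splits a pool of simulated reads into edited and unedited ones. The function names, the "U6" and "T7" labels, and the choice to treat each read as an independent Bernoulli draw are assumptions made for illustration only. The efficiency score is taken as an input in [0, 1] rather than computed.

```python
import random


def choose_efficiency_model(expression_system: str) -> str:
    """Return the predictor recommended in the text for a guide expression system.

    The keys "U6" and "T7" are illustrative labels, not SimGE parameters.
    """
    key = expression_system.strip().upper()
    if key == "U6":
        # Guides expressed in cells from an exogenous promoter such as U6.
        return "Doench 2016"
    if key == "T7":
        # Guides transcribed in vitro from the T7 RNA polymerase promoter.
        return "Moreno-Mateos"
    raise ValueError(f"unknown expression system: {expression_system!r}")


def split_reads(n_reads: int, efficiency: float, seed: int = 0) -> dict:
    """Split n_reads into edited and unedited counts.

    Assumption: each read is edited independently with probability
    `efficiency`, a score between 0 and 1 supplied by the caller.
    """
    if n_reads < 0:
        raise ValueError("n_reads must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    rng = random.Random(seed)
    edited = sum(rng.random() < efficiency for _ in range(n_reads))
    return {"edited": edited, "unedited": n_reads - edited}


if __name__ == "__main__":
    model = choose_efficiency_model("U6")
    counts = split_reads(n_reads=10_000, efficiency=0.4, seed=42)
    print(model, counts)
```

Fixing the seed keeps a simulated data set reproducible, which matters when the same reads are used to compare several analysis tools.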

Results

Gene editing simulations provide design assessment

We developed CRISPR-A, a gene editing analyzer that can also provide simulations to assess experimental design and predict outcomes. These simulations are generated by SimGE, an R package that, for ease of use, is implemented within the CRISPR-A platform (Fig 1A). The algorithm is useful for generating simulated edited reads to evaluate CRISPR analysis tools, as well as for design purposes. SimGE is based on the characterization of repair outcomes in primary T cells [25, 26], a promising cell type for therapeutic ex vivo genome editing. It simulates repair outcomes of CRISPR-Cas9 knockout experiments and can simulate the most common variants (insertions, deletions, and substitutions) based on the edit distributions observed in experimental data (Fig 1B). The same parameters and probability distributions were fitted for three other cell lines, HEK293, K562, and HCT116 [27], to make SimGE more generalizable and increase its applicability.
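
The idea of drawing variant classes from observed proportions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SimGE's implementation: the function name is hypothetical, and the class weights below are arbitrary placeholders rather than values fitted to T cell or cell line data.

```python
import random
from collections import Counter

# Placeholder proportions for illustration only. They are NOT the
# distributions fitted by SimGE or reported in the paper.
EXAMPLE_CLASS_WEIGHTS = {"deletion": 0.6, "insertion": 0.3, "substitution": 0.1}


def simulate_edit_classes(n_edited: int, class_weights: dict, seed: int = 0) -> dict:
    """Assign each edited read a variant class by weighted random sampling.

    Weights need not sum to 1; random.choices normalizes them internally.
    """
    if n_edited < 0:
        raise ValueError("n_edited must be non-negative")
    if not class_weights or any(w < 0 for w in class_weights.values()):
        raise ValueError("class_weights must be non-empty and non-negative")
    if sum(class_weights.values()) <= 0:
        raise ValueError("at least one class weight must be positive")

    rng = random.Random(seed)
    classes = list(class_weights)
    weights = [class_weights[c] for c in classes]
    draws = rng.choices(classes, weights=weights, k=n_edited)
    counts = Counter(draws)
    # Report every class, including those that were never drawn.
    return {c: counts.get(c, 0) for c in classes}


if __name__ == "__main__":
    print(simulate_edit_classes(4_000, EXAMPLE_CLASS_WEIGHTS, seed=7))
```

A fuller simulation would add further layers on top of the class label, such as the size and position of each insertion or deletion, each drawn from its own fitted distribution.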

Discussion

NGS is the method that enables the identification of all the different outcomes produced by genome editing tools. Several online and command-line tools are available to quantify the percentage of edits achieved in genome editing experiments. Even so, most of these tools do not retrieve all possible kinds of editing events and are not flexible enough to cover the whole diversity of genome editing tools. Moreover, none of them includes simulation to help in design or in evaluating analysis performance. Furthermore, alignment, amplification, and sequencing errors have not previously been taken into account systematically to achieve a precise estimation of the results of CRISPR-based experiments. Neither spike-in controls nor unique molecular identifier (UMI) clustering had been applied in this field to correct these errors.

Acknowledgments

We would like to thank María, Alejandro, Aitor, Andrea, Javier, Joana, Othmane, Jon, María, Leandro, Guillermo and Yabel for their collaboration in the examination of reads to generate a ground truth data set.

Citation: Sanvicente-García M, García-Valiente A, Jouide S, Jaraba-Wallace J, Bautista E, Escobosa M, et al. (2023) CRISPR-Analytics (CRISPR-A): A platform for precise analytics and simulations for gene editing. PLoS Comput Biol 19(5): e1011137. https://doi.org/10.1371/journal.pcbi.1011137

Editor: Ilya Ioshikhes, CANADA

Received: February 24, 2023; Accepted: April 30, 2023; Published: May 30, 2023

Copyright: © 2023 Sanvicente-García et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Next-generation sequencing data are available in the European Nucleotide Archive under the Study accession number PRJEB53901. Previously published data used in this paper can be found under the following accession numbers: PRJNA326019, PRJNA486372, PRJNA208620 and PRJNA304717. The SimGE R package can be installed with devtools: devtools::install_bitbucket("synbiolab/SimGE"). Code for the CRISPR-A pipeline has been made available in Bitbucket at https://bitbucket.org/synbiolab/crispr-a_nextflow/ and through the web application at https://synbio.upf.edu/crispr-a/. This pipeline will also be added to the nf-core community. Custom scripts for data analysis and visualization are freely available at https://bitbucket.org/synbiolab/crispr-a_figures/.

Funding: This work was supported by the European Commission (European Union Horizon 2020 grant 825825 to MG), Ramón y Cajal program (grant RYC-2015-17734 to MG), Fundación Ramón Areces (grant “Advanced gene editing technologies to restore LAMA2 on merosin-deficient congenital muscular dystrophy type 1A” to MG) and Ministerio de Ciencia e Innovación de España (Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020 «Advanced methodologies for precise and efficient gene delivery» grant PID2020-118597RB-I00 to MG). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors declare that they have no conflict of interest.

