
A novel transformer-based platform for the prediction and design of biosynthetic gene clusters for (un)natural products

Tomoki Kawano, Taro Shiraishi, Tomohisa Kuzuyama, Maiko Umemura

Abstract

Biosynthetic gene clusters (BGCs), comprising sets of functionally related genes responsible for synthesizing complex natural products, are a rich source of bioactive compounds with pharmaceutical potential. Here, we present a transformer-based framework that models functional domains as linguistic units to capture and predict their positional relationships within genomes. 

Introduction

Natural products have provided many important drugs such as penicillin, cyclosporine, tacrolimus, and paclitaxel, making them invaluable pharmaceutical resources. These compounds, primarily produced by microorganisms and plants, possess complex chemical structures that are often challenging to synthesize using conventional methods. 

Methods

Data collection and preprocessing

Genomic Data. Microbial genome data was downloaded from the National Center for Biotechnology Information. For bacterial genomes, we selected 12,186 complete genomes annotated in RefSeq. For fungal genomes, 2,670 assemblies were collected using GenBank submitter annotations. For each genome, both gene location files (GFF format) and protein amino acid sequences (protein-FASTA format) were retrieved.
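As an illustration of how the two retrieved file types can be combined, the following minimal sketch pairs coding-sequence locations from a GFF file with sequences from a protein FASTA file, so that proteins can be listed in genomic order. It is not the authors' pipeline. It assumes that CDS rows carry a `protein_id` attribute matching the first word of the FASTA header, as is common in NCBI annotation files; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (seqid, start, end, strand, protein_id) for one CDS row
CdsRecord = Tuple[str, int, int, str, str]


def parse_gff_cds(gff_text: str) -> List[CdsRecord]:
    """Collect CDS rows from GFF3 text, sorted by sequence ID and start position."""
    records: List[CdsRecord] = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # GFF3 rows have exactly nine tab-separated columns
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        protein_id = attrs.get("protein_id")  # assumption: NCBI-style attribute
        if protein_id is None:
            continue
        records.append((cols[0], int(cols[3]), int(cols[4]), cols[6], protein_id))
    records.sort(key=lambda r: (r[0], r[1]))
    return records


def parse_fasta(fasta_text: str) -> Dict[str, str]:
    """Map the first word of each FASTA header to its concatenated sequence."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def ordered_proteins(gff_text: str, fasta_text: str) -> List[Tuple[str, str]]:
    """Return (protein_id, sequence) pairs in genomic order, one entry per protein."""
    sequences = parse_fasta(fasta_text)
    seen = set()
    ordered = []
    for _, _, _, _, protein_id in parse_gff_cds(gff_text):
        # A multi-exon CDS appears on several rows; keep only its first row.
        if protein_id in seen or protein_id not in sequences:
            continue
        seen.add(protein_id)
        ordered.append((protein_id, sequences[protein_id]))
    return ordered
```

Deduplicating by `protein_id` matters mainly for fungal annotations, where a single coding sequence is often split across several exon rows.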

Results

Pretraining on Genomes Using Functional Domain-Based Tokenization

We adopted functional domains within genes as token units for modeling genomic sequences, based on their established importance in BGC characterization as utilized in bioinformatics tools such as antiSMASH. 
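The idea of treating domains as tokens can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes each gene has already been annotated with an ordered list of domain identifiers (Pfam-style accessions are used here purely as examples) and that genes are given in genomic order. The special tokens, the function names, and the single-position masking step are hypothetical choices made for the example.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# Assumed special tokens; the source does not specify which ones are used.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[MASK]"]


def genome_to_tokens(genes: Sequence[Sequence[str]]) -> List[str]:
    """Flatten per-gene domain lists (genes in genomic order) into one token sequence.

    Genes without any annotated domain contribute no tokens in this sketch.
    """
    return [domain for gene in genes for domain in gene]


def build_vocab(sequences: Iterable[Sequence[str]]) -> Dict[str, int]:
    """Assign an integer ID to every domain token, after the special tokens."""
    vocab = {token: index for index, token in enumerate(SPECIAL_TOKENS)}
    for sequence in sequences:
        for token in sequence:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(tokens: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Convert domain tokens to IDs, mapping unseen domains to [UNK]."""
    unknown = vocab["[UNK]"]
    return [vocab.get(token, unknown) for token in tokens]


def mask_position(
    ids: Sequence[int], position: int, vocab: Dict[str, int]
) -> Tuple[List[int], int]:
    """Hide one token so a model can be asked to predict it from its neighbors."""
    masked = list(ids)
    label = masked[position]
    masked[position] = vocab["[MASK]"]
    return masked, label


# Toy genome: four genes, one of them without an annotated domain.
genes = [["PF00109", "PF02801"], ["PF00698"], [], ["PF00550"]]
tokens = genome_to_tokens(genes)
vocab = build_vocab([tokens])
masked_ids, label = mask_position(encode(tokens, vocab), 1, vocab)
```

Under this representation a genome becomes a sentence whose words are domains, so positional relationships between domains can be modeled with the same machinery used for word order in text.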

Discussion

This study demonstrates that our transformer-based model can effectively learn and predict BGC patterns by treating functional domains as linguistic units. This approach provides both conceptual insights into BGC organization and practical tools for natural product discovery. Notably, divergent prediction patterns between the BGC-trained and genome-trained models suggest that broader training contexts may reveal alternative or previously unexplored biosynthetic trajectories, as exemplified in the analysis of the cyclooctatin BGC.

Acknowledgments

We thank Dr Totai Mitsuyama at the National Institute of Advanced Industrial Science and Technology and Dr Yuki Kanai at the University of Tokyo for valuable discussions.

Citation: Kawano T, Shiraishi T, Kuzuyama T, Umemura M (2026) A novel transformer-based platform for the prediction and design of biosynthetic gene clusters for (un)natural products. PLoS Comput Biol 22(2): e1013181. https://doi.org/10.1371/journal.pcbi.1013181
Editor: Boyang Ji, BioInnovation Institute, DENMARK

Received: May 29, 2025; Accepted: February 9, 2026; Published: February 23, 2026

Copyright: © 2026 Kawano et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All processed genome datasets, model construction codes, statistical analysis scripts, trained models, and other materials are available on Zenodo (https://doi.org/10.5281/zenodo.17577731). The source codes and small accompanying data files are also available on GitHub (https://github.com/umemura-m/bgc-transformer/).

Funding: This work was supported by Grant-in-Aid for Transformative Research Areas (22H05119 to TK, 23H04566 and 25H01599 to MU) and Grant-in-Aid for Challenging Research (Pioneering) (23K18120 to MU) from the Ministry of Education, Culture, Sports, Science and Technology of Japan. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.