Controllable protein design via autoregressive direct coupling analysis conditioned on principal components

Francesco Caredda, Lisa Gennai, Paolo De Los Rios, Andrea Pagnani

Abstract

We present FeatureDCA, a statistical framework for protein sequence modeling and generation that extends Direct Coupling Analysis (DCA) with biologically meaningful conditioning. The method can leverage different kinds of information, such as phylogeny, optimal growth temperature, enzymatic activity or, as in the case presented here, principal components derived from multiple sequence alignments, both to improve the learning process and to condition the generative process efficiently.

Introduction

The ability to generate novel functional protein sequences is a central challenge in computational biology and protein design. During the past decade, statistical models trained on evolutionary data, particularly those derived from multiple sequence alignments (MSAs), have demonstrated remarkable success in capturing the statistical, structural, and functional constraints that shape natural protein families.

Methods

We present a novel extension of the standard autoregressive Direct Coupling Analysis (ArDCA) method [5], in which sequence-dependent vectors of biologically relevant features are embedded in the amino acid space of a protein family, thereby constraining the sampling of new sequences to specific, user-defined characteristics.
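
As a rough illustration of feature-conditioned autoregressive sampling, the sketch below draws a sequence site by site from a softmax conditional that depends on the previously sampled residues and on a fixed feature vector. It is a minimal toy under stated assumptions, not the FeatureDCA implementation: the linear feature term, the parameter names h, J and W, the random parameter values, and the example feature vector are all hypothetical and chosen only for illustration.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol
Q = len(ALPHABET)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def init_params(length, n_features, rng, scale=0.1):
    """Random toy parameters (hypothetical; a real model learns them from an MSA)."""
    def g():
        return rng.gauss(0.0, scale)
    # h[i][a]: local field for symbol a at site i
    h = [[g() for _ in range(Q)] for _ in range(length)]
    # J[i][j][a][b]: coupling of symbol a at site i with symbol b at an earlier site j < i
    J = [[[[g() for _ in range(Q)] for _ in range(Q)] for _ in range(i)]
         for i in range(length)]
    # W[i][a][k]: ASSUMED linear coupling of symbol a at site i with feature k
    W = [[[g() for _ in range(n_features)] for _ in range(Q)] for _ in range(length)]
    return h, J, W


def conditional(i, prefix, features, params):
    """P(a_i | a_1..a_{i-1}, features) as a softmax over the Q symbols."""
    h, J, W = params
    logits = []
    for a in range(Q):
        x = h[i][a]
        x += sum(J[i][j][a][prefix[j]] for j in range(i))
        x += sum(W[i][a][k] * features[k] for k in range(len(features)))
        logits.append(x)
    return softmax(logits)


def sample_sequence(features, params, rng):
    """Sample one sequence left to right, conditioning every site on the features."""
    length = len(params[0])
    seq = []  # integer symbol indices sampled so far
    for i in range(length):
        probs = conditional(i, seq, features, params)
        seq.append(rng.choices(range(Q), weights=probs)[0])
    return "".join(ALPHABET[a] for a in seq)


rng = random.Random(0)
params = init_params(length=8, n_features=2, rng=rng)
# The feature vector stands in for, e.g., two target principal-component coordinates.
sample = sample_sequence([1.5, -0.5], params, rng)
```

Because each site is drawn from an explicit normalized conditional, sampling needs a single left-to-right pass and no Markov chain; changing the feature vector shifts every conditional at once.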

Results

Generativity

The minimal requirement for the model is that it samples sequences that are statistically indistinguishable from the natural ones. This means reproducing both the pairwise frequency statistics and the PCA projection of the natural MSA.
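
One way to check the pairwise part of this requirement is to compare connected two-site correlations between the natural and the generated alignments. The sketch below is a simplified illustration, not the evaluation code used in the paper: it assumes unweighted sequences (no phylogenetic reweighting and no pseudocount), the function names are hypothetical, and the toy alignment is made up. The PCA-projection comparison is not covered here.

```python
import math
from collections import Counter
from itertools import combinations


def frequencies(msa):
    """Single-site f_i(a) and pairwise f_ij(a, b) frequencies of an aligned MSA."""
    n, length = len(msa), len(msa[0])
    f1 = [Counter() for _ in range(length)]
    f2 = Counter()
    for seq in msa:
        for i, a in enumerate(seq):
            f1[i][a] += 1.0 / n
        for i, j in combinations(range(length), 2):
            f2[(i, j, seq[i], seq[j])] += 1.0 / n
    return f1, f2


def connected_correlation(f1, f2, i, j, a, b):
    """C_ij(a, b) = f_ij(a, b) - f_i(a) * f_j(b); unseen symbols count as zero."""
    return f2[(i, j, a, b)] - f1[i][a] * f1[j][b]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of floats."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def compare_pairwise(natural, generated, alphabet):
    """Pearson correlation between the connected correlations of two MSAs."""
    f1n, f2n = frequencies(natural)
    f1g, f2g = frequencies(generated)
    cn, cg = [], []
    for i, j in combinations(range(len(natural[0])), 2):
        for a in alphabet:
            for b in alphabet:
                cn.append(connected_correlation(f1n, f2n, i, j, a, b))
                cg.append(connected_correlation(f1g, f2g, i, j, a, b))
    return pearson(cn, cg)


# Toy alignment (made up); in practice `generated` would hold model samples.
natural = ["ACA", "ACA", "GTA", "GTC"]
score = compare_pairwise(natural, natural, "ACGT")
```

A score close to 1 indicates that the generated alignment reproduces the two-site covariation of the natural one; the check is only meaningful when the correlations are not all zero, since the Pearson coefficient is undefined for constant inputs.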

Discussion

In this work, we presented FeatureDCA, a feature-conditioned extension of the Direct Coupling Analysis (DCA) framework, designed for controllable generation of protein sequences within a given protein family. By integrating biologically relevant, low-dimensional features, specifically the principal components (PCs) derived from multiple sequence alignments (MSAs), FeatureDCA enables users to direct sequence generation toward targeted regions of sequence space while preserving the statistical and structural characteristics of the natural proteins used for training.

Acknowledgments

We are deeply grateful to Martin Weigt and Leonardo Di Bari for many interesting discussions on addressable sequence generation.

Citation: Caredda F, Gennai L, De Los Rios P, Pagnani A (2026) Controllable protein design via autoregressive direct coupling analysis conditioned on principal components. PLoS Comput Biol 22(2): e1013996. https://doi.org/10.1371/journal.pcbi.1013996

Editor: Fei Guo, Central South University, CHINA

Received: September 18, 2025; Accepted: February 6, 2026; Published: February 19, 2026

Copyright: © 2026 Caredda et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All datasets used in this work are publicly available at: https://github.com/francescocaredda/FeatureDCAData. The full Julia implementation of FeatureDCA, including example Jupyter notebooks for reproducibility and application, can be found at: https://github.com/francescocaredda/FeatureDCA.jl. These repositories provide all the necessary resources to replicate the experiments and analyses described in this study.

Funding: FC and AP acknowledge financial support from the project “Explainable Models for Protein Design”, funded by the MIUR Progetti di Ricerca di Rilevante Interesse Nazionale (PRIN) Bando 2022 - grant 2022TE5B7X. FC and AP also acknowledge “Centro Nazionale di Ricerca in High-Performance Computing, Big Data and Quantum Computing” (ICSC). This study was carried out within the “FAIR - Future Artificial Intelligence Research” project, and received funding from the European Union NextGenerationEU (Piano Nazionale di Ripresa e Resilienza (PNRR)–Missione 4 Componente 2, Investimento Grants No. 1.3–D.D. 1555 11/10/2022, and No. PE00000013). FC and AP acknowledge support from the European REA, Marie Skłodowska-Curie Actions, grant agreement no. 101131463 (SIMBAD). This paper reflects only the authors’ views and opinions, neither the European Union nor the European Commission can be considered responsible for them. LG and PDLR thank the Swiss National Science Foundation for financial support under grant IC00I0-227688. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.