New AI Model

Direct Prediction of Gene Expression with Promoter-0

DNA prompt engineering to adapt the Borzoi framework to short promoter-payload sequences

Authors: Siqi Zhao, Valentin Zulkower, Jake Wintermute, Alyssa Morrow and Ankit Gupta

Promoter-0 is an AI-powered framework that allows direct prediction of promoter activity for synthetic gene expression cassettes. Our approach builds on the Borzoi model from Calico Labs, a deep neural network (convolutional and self-attention layers) trained to predict RNA-seq coverage from long (524 kb) DNA segments.

This white paper describes a novel prompt engineering strategy that adapts the Borzoi framework to the much shorter (1-3 kb) DNA segments typical of expression constructs, which contain a single promoter and gene of interest. Using large new datasets generated in the Ginkgo foundry, we benchmark Promoter-0's performance at predicting promoter activity in different cell lines and for different DNA design tasks.
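To make the length mismatch concrete, the sketch below places a short cassette inside a window of Borzoi's input length (524,288 bp). This is only an illustration of the problem: the N-padding shown here is a naive placeholder chosen for this example, not the prompt strategy used by Promoter-0, and the function name `embed_in_window` and the toy cassette are hypothetical.

```python
# Illustration only: N-padding is an assumed, naive way to fill the window.
# It is NOT the Promoter-0 prompt strategy described in the white paper.

BORZOI_INPUT_LEN = 524_288  # Borzoi's input window in bp (the "524 kb" above)


def embed_in_window(cassette, window_len=BORZOI_INPUT_LEN, pad="N"):
    """Center a short cassette in a fixed-length window.

    Returns the padded sequence and the 0-based start of the cassette in it.
    """
    if len(pad) != 1:
        raise ValueError("pad must be a single character")
    cassette = cassette.upper()
    if len(cassette) > window_len:
        raise ValueError("cassette is longer than the model window")
    left = (window_len - len(cassette)) // 2
    right = window_len - len(cassette) - left
    return pad * left + cassette + pad * right, left


# Hypothetical 2 kb promoter-payload cassette (placeholder sequence).
cassette = "ACGT" * 500
window, start = embed_in_window(cassette)

# The cassette fills well under 1% of the window, so what surrounds it
# makes up almost everything the model sees.
fraction = len(cassette) / len(window)
print(f"window: {len(window)} bp, cassette start: {start}, fraction: {fraction:.4f}")
```

Because the cassette occupies such a small share of the input, the choice of surrounding context is a real design decision, which is why a dedicated prompt strategy is needed.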

We anticipate that Promoter-0 will be useful for biotech R&D projects that make use of rational promoter design. In some cases, it may be possible to design a promoter with the desired activity or cell-type specificity in a single step. More often, we see Promoter-0 being used to generate well-balanced and diverse libraries of candidate promoter sequences for testing in the lab, speeding up the process of iterative sequence improvement.

Direct Prediction of Promoter Activity with Promoter-0

The emergence of large DNA foundation models presents an opportunity to revolutionize promoter design. Here, we describe Promoter-0, an AI model capable of generating tunable and tissue-specific promoters. Our approach builds on Borzoi, a sequence-based machine learning model that learns to predict RNA-seq coverage from DNA sequence [1]. Using Ginkgo’s high-throughput screening platform, we collected tens of thousands of data points to validate and expand this framework.

Promoter-0 can predict promoter activity across diverse cell and tissue types without requiring additional model fine-tuning. Remarkably, zero-shot predictions from Promoter-0 perform comparably to standard models trained with labeled data in some settings. To the best of our knowledge, this represents the first demonstration of direct prediction of context-specific expression of a synthetic expression cassette, an important practical milestone in promoter design.

Access, documentation & example usage

Access to Promoter-0 is available through the Ginkgo Model API. You can read additional documentation or follow this Google Colab notebook for a demonstration of usage.