Generating Open-Source AI-Ready Protein Expression Datasets with Align to Innovate

Today, we’re excited to announce our new partnership with Align to Innovate, a scientific non-profit organization.

Through this collaboration, we aim to develop an open-source, AI-ready dataset on recombinant protein expression in microbial hosts, specifically Escherichia coli and Pichia pastoris.


Ginkgo Datapoints will generate high-fidelity biological data for Align to Innovate. The work begins with a feasibility study that evaluates HiBiT tagging as a way to measure expression of a library of proteins in both E. coli and P. pastoris. If that approach proves viable, we will move on to larger libraries intended for building models.

This partnership addresses a critical need in the biotechnology and synthetic biology communities for accessible, high-quality datasets that can accelerate machine learning applications. The resulting dataset will be openly available, enabling researchers worldwide to develop and refine AI models that correlate DNA sequences with optimal protein expression hosts.

"Our mission at Align to Innovate is to accelerate scientific discovery by making biological datasets more reproducible, scalable, and shareable. Partnering with Ginkgo Bioworks is exciting because it allows us to leverage their extensive expertise in engineering microbial hosts, as well as their data generation capabilities at scale, to create impactful, publicly available protein engineering datasets. The ability to hit the ground running so quickly is a huge advantage, as is being able to leverage the economies of scale of Ginkgo’s platform. We can’t wait to bring this to the community because we believe this open-source dataset will be a valuable resource that catalyzes innovation in protein expression and beyond." - Pete Kelly, Director at Align to Innovate

"Ginkgo and Align to Innovate have a strong mission fit. We are excited to support Align to Innovate on their journey to make biological datasets more accessible, because that’s exactly what we set up Ginkgo Datapoints to do: build AI-ready datasets that bring value to researchers and developers. That’s what makes this project such a great showcase for our data generation capabilities and our day-in, day-out work of generating high-fidelity biological data for AI applications." - John Androsavich, General Manager of Ginkgo Datapoints

We were chosen as a partner for our synthetic biology expertise and our more than a decade of experience in engineering microbial hosts. Ginkgo’s automated platform generates high-quality, high-throughput datasets in support of AI and machine learning applications. This collaboration highlights Ginkgo’s data-driven focus and coincides with the launch of Ginkgo Datapoints, which offers fee-for-service data generation solutions. To learn more about bringing your innovative biological solutions to life in the age of machine learning, visit Ginkgo Datapoints.

Posted by John Androsavich