New AI Model

Antibody Discrete Diffusion for Full Generation of Antibody Sequences from Noise

Generating antibody sequences reflecting natural variability at high resolution

Authors: Joshua Moller, Uri Laserson, Porfi Quintero Cadena, Jake Wintermute, and Ankit Gupta

Diffusion models are advancing rapidly as a tool for protein engineering. Antibody binding is a particularly compelling use case, with the potential to accelerate the development of therapeutic antibodies for a wide range of targets.

This white paper describes Antibody Discrete Diffusion, a generative model for antibody sequences. Our approach builds on the work of EvoDiff and D3PM as discrete models for generating protein sequences. We extend the capabilities of protein modeling for antibody applications by training our model on the Observed Antibody Space (OAS) dataset, a large sequence collection reflecting the specific sequence variability characteristic of antibody somatic hypermutation.

We show that Antibody Discrete Diffusion is capable of generating antibody sequences that closely resemble natural distributions by a variety of metrics. We explore the effect of the sampling temperature parameter and offer guidance on the best practices for balancing the quality and diversity of generated sequences.
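As a rough illustration of how a sampling temperature can trade quality against diversity, the sketch below divides a set of per-residue logits by the temperature before applying a softmax. The logits, residue choices, and function name are hypothetical and chosen only for illustration; they are not the model's actual outputs or interface.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate residues at one position.
residues = ["A", "G", "S", "T"]
logits = [2.0, 1.0, 0.5, 0.1]

cold = softmax_with_temperature(logits, 0.5)  # sharper: favors the top residue
hot = softmax_with_temperature(logits, 2.0)   # flatter: more diverse draws

rng = random.Random(0)  # seeded so the draw is reproducible
sampled = rng.choices(residues, weights=cold, k=1)[0]
```

In this toy setting, a lower temperature concentrates probability on the highest-scoring residue, while a higher temperature spreads it across alternatives.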


Diffusion models are a popular class of generative AI models that produce high-fidelity samples resembling their training data. Notable examples include DALL·E 2, Stable Diffusion, and Sora. These models learn to generate images from noise by first adding noise to an image and then learning how to reverse this process (Fig. 1). This method has been extended to generate images of various styles, subjects, and textures using only a text prompt.
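For sequences, the analogous forward process corrupts tokens instead of pixels. The following is a minimal sketch, assuming an absorbing-state (masking) corruption of the kind described in D3PM. The mask symbol, the linear schedule, and the example sequence are illustrative assumptions, not details of Antibody Discrete Diffusion.

```python
import random

MASK = "#"  # hypothetical absorbing "noise" token


def corrupt(sequence, t, num_steps, rng):
    """Mask each residue independently with probability t / num_steps."""
    p = t / num_steps  # assumed linear noise schedule
    return "".join(MASK if rng.random() < p else aa for aa in sequence)


rng = random.Random(0)  # seeded for reproducibility
seq = "EVQLVESGGGLVQPGG"  # arbitrary amino-acid string used for illustration

# Step 0 leaves the sequence intact; the final step masks every position.
for t in (0, 5, 10):
    print(t, corrupt(seq, t, 10, rng))
```

A generative model of this kind would then be trained to predict the original residues at the masked positions, which is the reverse process that sampling runs starting from a fully masked sequence.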

Biotech R&D teams often seek to develop biological molecules for specific applications. Generative AI and diffusion models are promising tools for the controlled engineering of protein sequences from prompts. Tools for diffusion-driven structure generation of proteins include RFdiffusion [1] and Chroma [2].

Structure-focused generation offers a promising tool for many design applications, such as designing and predicting binding to target proteins. However, its reliance on well-defined structures complicates its use for applications where the critical structural intermediate is unknown. Biological sequence data may help to fill this gap.

Access, documentation & example usage

Access to Antibody Discrete Diffusion is available through the Ginkgo Model API. You can read additional documentation or follow this Google Colab notebook for a demonstration of usage.