Detection is one of the key tools in our country’s fight against COVID-19. We previously discussed how qPCR and software algorithms are being used to diagnose the disease. At Ginkgo, we are developing a SARS-CoV-2 test based on Next Generation Sequencing (NGS). In this post, we’ll explore how NGS and software can augment the country’s testing capacity to support reopening our economy.
At the time of this writing, the U.S. conducts about 500K COVID-19 tests per day. Researchers have proposed that we, as a nation, must scale this number to tens of millions or even hundreds of millions of tests per day so that we can test people going to work a few times a week, allowing workplaces to help contain the spread of the virus. How can the U.S. rapidly scale up testing capacity 10- or 100-fold? In our recent whitepaper, Ginkgo proposed continuing to leverage and scale up qPCR capabilities, since qPCR is a well-understood, tried-and-true technology. To further augment U.S. testing capacity, Ginkgo is pursuing an additional approach: using genomic sequencing technologies to detect the novel coronavirus. NGS was developed and refined for sequencing human genomes in the years following the Human Genome Project, but it is versatile and has found many uses beyond that original application, including detecting the presence of viral RNA in a sample.
NGS technology can read massive amounts of DNA. We could potentially use it to look for genome fragments of the novel coronavirus. In a single run, modern NGS instruments are able to read millions to billions of DNA fragments of up to 600 base pairs (individual A, C, T and G molecules that make up the genetic code of life). We could use this huge “read” capacity to simultaneously look for base pair fragments unique to the virus in individual samples taken from thousands of people, thereby enabling us to perform an enormous number of tests all at the same time, on the same instrument.
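To make the capacity argument concrete, here is a back-of-the-envelope sketch. The read counts below are illustrative assumptions chosen for round numbers, not Ginkgo's actual figures, and the function names are hypothetical.

```python
def samples_per_run(reads_per_run, reads_per_sample):
    """How many samples fit in one run if each sample needs a fixed read budget."""
    return reads_per_run // reads_per_sample


def runs_needed(tests_per_day, samples_in_one_run):
    """Number of instrument runs per day needed to hit a testing target (rounded up)."""
    return -(-tests_per_day // samples_in_one_run)  # ceiling division on integers


# Assumed numbers, for illustration only.
READS_PER_RUN = 1_000_000_000  # "billions" of reads is the upper end quoted above
READS_PER_SAMPLE = 10_000      # hypothetical read budget per person

capacity = samples_per_run(READS_PER_RUN, READS_PER_SAMPLE)
print(capacity)                            # samples per run under these assumptions
print(runs_needed(10_000_000, capacity))   # runs per day for ten million tests
```

Under these assumed numbers, a single run would cover 100,000 samples, so ten million tests per day would take on the order of 100 runs per day across all instruments. The real figures depend on the instrument, the assay design, and how many reads are lost to quality filtering.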
We do have to customize existing NGS pipelines to detect the novel coronavirus. First, we would collect samples from people, whether via saliva, nasopharyngeal swabs, or some other mechanism. We would then convert the virus’s RNA to DNA and amplify specific DNA fragments that are unique to the virus (we don’t sequence any identifying human DNA sequences). Next, we would attach unique DNA “barcodes” (sequences that aren’t biologically relevant but that allow us to track sequences in bioinformatics software) to each individual’s sample so that the DNA fragments can be traced back to that sample. The samples are then pooled together into a single NGS run, which the NGS instrument processes. Finally, custom software separates the detected DNA fragments by individual; if coronavirus fragments are detected for a sample, there is a high likelihood that the person is infected. This process sounds complicated (and I have omitted many important details), but the individual steps are all well understood and, more importantly, much of the infrastructure for performing these steps already exists, including the software.
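The software step of sorting pooled reads back to individual samples (often called demultiplexing) can be sketched in a few lines of Python. This is a toy illustration, not our production pipeline: the barcodes, the viral target sequence, the read-count threshold, and the function names are all made up, and real tools also handle sequencing errors, quality scores, and controls.

```python
from collections import defaultdict

# Hypothetical values for illustration only; real barcodes and viral
# target sequences are chosen by the assay designers.
BARCODE_LENGTH = 4
BARCODE_TO_SAMPLE = {"ACGT": "sample-1", "TGCA": "sample-2", "GATC": "sample-3"}
VIRAL_TARGET = "GGTTACCA"  # made-up stand-in for a virus-specific fragment
MIN_VIRAL_READS = 2        # assumed threshold for flagging a sample


def demultiplex(reads):
    """Group pooled reads by sample, using the barcode at the start of each read."""
    reads_by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LENGTH], read[BARCODE_LENGTH:]
        sample = BARCODE_TO_SAMPLE.get(barcode)
        if sample is not None:  # skip reads whose barcode is not recognized
            reads_by_sample[sample].append(insert)
    return reads_by_sample


def call_samples(reads):
    """Return {sample: bool}, True when enough reads contain the viral target."""
    reads_by_sample = demultiplex(reads)
    results = {}
    for sample in BARCODE_TO_SAMPLE.values():
        inserts = reads_by_sample.get(sample, [])
        viral_reads = sum(VIRAL_TARGET in insert for insert in inserts)
        results[sample] = viral_reads >= MIN_VIRAL_READS
    return results


pooled_reads = [
    "ACGT" + "GGTTACCA",      # sample-1, contains the target
    "ACGT" + "TTGGTTACCAAA",  # sample-1, contains the target
    "TGCA" + "CCCCCCCC",      # sample-2, no target
    "TGCA" + "GGTTACCA",      # sample-2, one target read (below threshold)
    "NNNN" + "GGTTACCA",      # unknown barcode, ignored
]
print(call_samples(pooled_reads))
```

In this toy pool, only sample-1 has enough target-containing reads to be flagged. Requiring more than one supporting read is one simple way to avoid calling a sample positive on the basis of a single stray read; the right threshold is an assay design decision.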
Software plays a major role in every step of NGS pipelines, from the initial laboratory work all the way to the end analysis. Laboratory information management software (LIMS) drives the workflows that keep track of samples and automate laboratory work such as adding the DNA barcodes discussed above. While this domain may be unfamiliar to you, modern LIMS systems are built with technologies that you are likely familiar with, including React, Docker, AWS, and Python. Analyzing large NGS datasets involves processing a huge amount of data and sometimes uses algorithms that can consume a vast amount of computer memory and storage. Fortunately, a lot of this analysis involves pleasingly parallel tasks (tasks that can run independently of one another), which lend themselves well to tools such as AWS Batch and Apache Airflow, both of which we use at Ginkgo. This means that the pipelines are easy to orchestrate and we pay only for the compute power we use.
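To show what "pleasingly parallel" means in practice, here is a minimal sketch using only the Python standard library. A local thread pool stands in for the separate jobs that a service like AWS Batch would run; the sample data, the target sequence, and the function names are hypothetical. The key property is that each sample is analyzed without reference to any other sample, so the work can be split across as many workers as are available.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up stand-in for a virus-specific fragment (illustration only).
VIRAL_TARGET = "GGTTACCA"


def analyze_sample(item):
    """Analyze one sample independently of all the others."""
    sample_id, reads = item
    viral_reads = sum(VIRAL_TARGET in read for read in reads)
    return sample_id, viral_reads


def analyze_all(reads_by_sample, max_workers=4):
    """Fan the per-sample work out to a pool of workers and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(analyze_sample, reads_by_sample.items()))


reads_by_sample = {
    "sample-1": ["GGTTACCA", "TTGGTTACCAAA"],
    "sample-2": ["CCCCCCCC"],
    "sample-3": [],
}
print(analyze_all(reads_by_sample))
```

Because no task waits on another, adding workers shortens the wall-clock time roughly in proportion, and workers can be shut down as soon as the queue is empty. In a real pipeline each task would typically be a containerized job operating on files in shared storage rather than an in-memory function call.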
In summary, NGS is an important technology whose existing capacity can be repurposed to detect the novel coronavirus in people. While biotechnology may not be a familiar domain to most software engineers, it does leverage modern software tools and technologies we are familiar with. We will discuss some of the details of our NGS pipeline in a future post.
(Feature photo by National Cancer Institute on Unsplash)
Posted by Jamie Cho