These circular DNA molecules are called plasmids and are usually hundreds or thousands of base pairs long. They are most commonly found in bacteria, but sometimes occur in more complex organisms like yeast or even plants. Addgene has some more information on them. Yeast and bacteria are some of the organisms that Ginkgo engineers modify to produce compounds that we care about, like ones for different fragrances, or ones that help fix nitrogen that helps plants to grow. So we are able to use plasmids in the organism development process, and often do so.
Plasmids are convenient to use, because it's easier and faster to get them (and their new genes) into an organism than by modifying the organism’s genome, so using them speeds up the development process. They can coexist with genomic DNA in the same organism, and the organism will (usually) express the proteins that the plasmids’ circular DNA codes for, so they are good to test the changes with. However, in most cases, we don’t use plasmids in the final production strain we deliver to customers. We usually encode the final changes in the organism’s genome, since they are easier to maintain in industrial settings. That said, a lot of the DNA molecules that we keep track of at Ginkgo are circular.
The solution that we implemented a long time ago, and that served us well for many years, was a special sql query to check if an input sequence was contained in some sequence in the database. The key idea is that if you have a stored sequence of circDNA of length n, you can concatenate it to itself to get a string of length 2*n that contains all the possible rotations of the original. Then you can check if the input value (also of length n) is contained in the concatenated string. If so, then it is a rotation of the stored sequence. For example, if the input data was ATATATAT, and the existing sequences were stored in a field called “data” in a table called “sequences”, the resulting query would be:
SELECT * FROM sequences WHERE LENGTH(sequences.data) = 8 AND CONCAT(sequences.data, sequences.data) LIKE ‘%ATATATAT%’
This query is correct, and if there are a small number of circDNA molecules in the database with the same length, it also runs quickly. However, if there are m matching sequences of length n in the database, then the running time of this query is O(m*n). Since implementing this approach several years ago, the number of molecules in our database has grown a lot, and for any given length, the odds that multiple circDNA sequences have the same length has grown substantially. Also, last fall we got reports of slowdowns after large new batches of circDNA molecules of the same length were created.We wanted to do the same for circular DNA, but then the question arose, which rotation of the string do we store a hash for? An obvious answer was the lexicographically least rotation. For our purposes, “lexicographic” just means alphabetical ordering. For example, although ACGT and GTAC are rotations of the same sequence, A is earlier in the alphabet than G (and every other character in the sequence), so the lexicographically least rotation is ACGT.
As another example, the lexicographically least rotation of GAATAA is AAGAAT. The one-letter prefixes of that sequence are A, G, and T. The two-letter prefixes, in lexically increasing order, are AA, AG, AT, GA, and TA, but there are two occurrences of AA, so we need to consider longer prefixes. The lexicographically least three-letter prefix is AAG, and there is only one occurrence of it, so that’s the start of the canonical rotation.
Fortunately, there is a linear-time algorithm for computing that rotation. In fact, there are a few. We decided to use Booth's algorithm. The discovery of this algorithm is interesting because it uses a preprocessing method originally used for the Knuth-Morris-Pratt string matching algorithm.
The Knuth-Morris-Pratt algorithm first builds an auxiliary array, in linear time, of the same length as the string to be searched, with certain information about prefixes contained within the string being matched on. The algorithm then iterates over the string to be searched, checking for a match with the input string, and sometimes using the auxiliary array to avoid brute-force string comparisons. Because of how the auxiliary array is built, and used in the search algorithm, the overall time for the search is O(n).
Booth's algorithm generates the same auxiliary array, but instead of using it to match a string, it uses it to find the lexicographically least prefix of all possible rotations of the string in O(n) time. Since this prefix is the start of the lexicographically least rotation of the sequence, we can treat that rotation as the canonical one.
Over the past couple months we made the necessary code changes to implement and use Booth's algorithm, and updated the existing circDNA rows in our sequence table with their sequences' canonical hashes, to be used with the new faster lookup code. Our test benchmarking has shown noticeable improvements in circDNA lookup. In some cases we have seen speedups of 4-6x. These changes are now in production, and we expect circDNA operations to run and scale as smoothly as the ones for linear DNA now. It was fun, and satisfying, to speed up circDNA lookup using the power of a better algorithm!
(Creative Commons Bacterial Plasmid by Spaully is licensed under CC BY-SA 2.5.)
Posted by David Hofer