
Digital Versatile DNA

Image of hard drive and DNA

Hard disc drive space is now measured in terabytes, not megabytes. And yet still we run out of space – or money. Perhaps the solution for long-term archiving is best supplied by Mother Nature herself. 

The Analytical Scientist caught up with Nick Goldman, from the European Bioinformatics Institute in Cambridge, UK, who has used DNA’s resilient, efficient, and compact coding abilities to archive our digital life rather than our genetic one. 

Binary to genetic – isn’t that a bit of a leap?

“A code is just a code, if you know the system. Binary is not magic – it’s just easier to have an on and off and nothing in between. All files have systems for encoding information into ones and zeros for storage, because that’s what hard discs are good at. We decided to invent a new coding system that used the letters A, C, G and T instead of binary 1s and 0s.”

Do you directly replace 1s and 0s with nucleotide bases?

“Actually, we read the files in bytes – chunks of 8 bits. In binary code, for example, one byte could be: 01001000. We then re-wrote each of the 256 possible byte values (combinations of eight 1s and 0s) as unique five-letter codes using A, C, G and T.”
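As a toy sketch of the byte-to-nucleotide mapping Goldman describes – assuming a simple enumerated code table, not the team's actual published code – the idea can be shown in a few lines of Python. With four letters and five positions there are 4^5 = 1,024 possible codes, more than enough for all 256 byte values:

```python
from itertools import product

# Hypothetical code table: assign each of the 256 byte values a unique
# five-letter code over the DNA alphabet. Purely illustrative – not the
# actual mapping used by Goldman's team.
ALPHABET = "ACGT"
codes = ["".join(p) for p in product(ALPHABET, repeat=5)]
ENCODE = {byte: codes[byte] for byte in range(256)}
DECODE = {code: byte for byte, code in ENCODE.items()}

def bytes_to_dna(data: bytes) -> str:
    """Each byte becomes five DNA letters."""
    return "".join(ENCODE[b] for b in data)

def dna_to_bytes(dna: str) -> bytes:
    """Read the sequence back in five-letter chunks."""
    return bytes(DECODE[dna[i:i + 5]] for i in range(0, len(dna), 5))

msg = b"DNA"
dna = bytes_to_dna(msg)          # 3 bytes -> 15 letters
assert dna_to_bytes(dna) == msg  # lossless round trip
```

The round trip is lossless by construction, which is the essential point: the genetic alphabet is just another way of writing down the same bits.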

What did you encode into DNA?

“The files in question were: Martin Luther King’s ‘I Have a Dream’ speech in mp3 format, a PDF of Watson and Crick’s publication describing the structure of DNA, a text file of Shakespeare’s sonnets, and a photo of EBI taken by me.”

A recent paper in Nature gave DNA a half-life of 521 years in unfavorable conditions. How long could your storage DNA last? 

“Bonds do break at a certain rate and the chemicals do degrade, of course, but it’s really pretty slow. If you’ve got multiple copies and not all of them break in the same way, you can recover the data. DNA archives lasting tens of thousands of years is easily arguable and longer is not ridiculous.”

You partnered with Agilent Technologies on the synthesis side – could you briefly describe how that works?

“It’s called oligo library synthesis and it’s like an inkjet printing process. But instead of firing ink onto paper, they fire chemicals containing DNA nucleotides at a glass slide where they are linked. They can very, very accurately address different spots on the slide and grow a chain according to our designed sequence. It’s all automated such that the DNA is removed from the slide and supplied to us dried in a vial.”

You sent your code to Agilent, they sent back the novel DNA, and then?

“The DNA was purified, amplified by PCR, and sequenced using an Illumina HiSeq – a world standard and well-understood piece of kit.”

Courtesy of EBI


So, now you have raw data – the next part must be pretty complex…

“Biological experiments are messy and they don’t produce beautiful clean data. Certainly, we know that errors can occur with DNA sequencing, and we assumed that the same was true of synthesis; we discussed it with Agilent and that is indeed the case. We attempted, therefore, to devise a coding system that was somewhat resistant to the kind of errors that were most likely to occur, for example, base repeats were avoided. Several layers of redundancy were built into the system for this purpose. To decode, the system takes the fragments of DNA, separates them out into indexing components, which contain information about the contents and location, and, using a majority voting system, rebuilds the file byte by byte. And when we compared the new decoded files with the originals in a formal bit by bit comparison, they were exactly the same.”
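The per-position “majority voting” Goldman describes can be sketched in a few lines of Python. This is a toy illustration that assumes the reads are already aligned and of equal length – not the team’s actual decoding pipeline:

```python
from collections import Counter

def majority_vote(reads):
    """Rebuild a consensus sequence from multiple noisy copies:
    at each position, keep the most common letter across all reads."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*reads)
    )

reads = [
    "ACGTACGT",
    "ACGTACGA",  # sequencing error in the last base
    "ACCTACGT",  # sequencing error in the third base
]
assert majority_vote(reads) == "ACGTACGT"
```

Because independent copies rarely fail at the same position, errors present in a minority of reads are simply outvoted – which is why redundancy, rather than perfect chemistry, is what delivers the exact bit-for-bit reconstruction.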

Any surprising conclusions from the research?

“One of the things that we were really pleased about was people’s realization of the fact that genomes are just digital information – like on your computer, and, in fact, they’re interchangeable and we can go between the two and lose nothing!”

We are indeed living in a digital world and perhaps DNA is the newest, oldest code around.

Online Exclusive:

Digging Deeper and Probing the Realms of Science Fiction

I remember clearly the advent of the DVD. What a marvel! 4.7 GB of information in such a compact format! I have witnessed the march of solid-state memory towards smaller, cheaper, higher capacity. Hard disc drive space is now measured in terabytes, not megabytes. And yet still we run out of space – or money. Perhaps the solution for long-term archiving is best supplied by Mother Nature herself. 

But what does DNA have to do with digital information?

110011001011001011001011010100101011111101
CGATCGTAGTCGTAGTCGTAGTCGTGTGTAGTGTGGGGGCTG

Anyone finding a pattern? 

The Analytical Scientist caught up with Nick Goldman, from the European Bioinformatics Institute in Cambridge, who has used DNA’s resilient, efficient, and compact coding abilities to archive our digital life rather than our genetic one. Goldman is the leader of a research group that looks at methods available for analysing genomic data to study evolution. By finding patterns and differences between the genome sequences in circulation using data extraction techniques, Goldman’s team can produce a “footprint of evolution”. But I digress. 

It was while drinking steins of ale in Hamburg with colleagues that Goldman worried about how on earth he could store and manage the massive amounts of data produced by global genomic analysis, given that EBI is also one of the world’s major data repositories. The irony that DNA could act as the solution to its own problem was not lost on Goldman and Ewan Birney, and they came up with the concept of creating novel DNA to their own design.

Goldman decided to invent a new coding system that used the letters A, C, G and T instead of binary 1s and 0s and chose to encode several file types: Martin Luther King's ‘I Have a Dream’ speech in mp3 format, a PDF of Watson and Crick's publication describing the structure of DNA, a text file of Shakespeare's sonnets, and a photo of EBI taken by Goldman himself.

A recent paper in Nature gave DNA a half-life of 521 years in unfavorable conditions with moisture, microbes... and mud, but Goldman estimates that DNA used for archive purposes could last significantly longer. "It's almost routine now to extract 10,000-year-old DNA in analyzable condition from historical finds, such as mammoths. In fact, the Neanderthal genome has already been sequenced – and they weren't kept in controlled or even dry conditions.”

For this project, samples were lyophilized (freeze dried), further enhancing stability. In fact, the more likely challenge is ensuring that people remember the data is there and how it can be accessed. "There are bigger practical problems for serious long-term storage. You'd need some sort of Rosetta Stone equivalent, for example. There's the whole library-craft aspect, but those aren't things we spent a lot of time thinking about - we simply wanted to prove that it was possible." 

With the concept in place, EBI needed a collaborator with expertise in DNA synthesis and turned to Agilent Technologies in the US, who liked what they heard. The market for custom DNA fragments is growing, driven by the need for custom DNA reagents within genomics research, and the team at Agilent had been working on a "new and improved" process that they were keen to trial.  

Having received the custom DNA from Agilent in the US, Goldman sent the now precious data format over to colleagues working in a central DNA sequencing unit in Heidelberg at the EBI's parent organisation, the European Molecular Biology Laboratory. Excellent foresight allowed Goldman's novel DNA to be plugged straight into standard sequencing procedures, since features to match current analytical protocols were built right into the fragments' design.

But how long does this new data storage technique take exactly? Well, don't throw away your portable hard drives... yet. “DNA synthesis in the quantities we needed took a couple of days, and the sequencing took a couple of weeks, from sample receipt to computer data. The equivalent sequencing now could be done in a couple of days. Sequencing technology has improved incredibly over the last five years with next- or second-generation machines, and we're just waiting for the third; the technology completely changes every couple of years and it's really an astonishing area in terms of innovation – both speed and cost.”

New technologies, such as Nanopore, tempt me into the realms of science fiction... Could DNA storage have application beyond archives and into portable and personal data storage? "If there's an application, like testing if there's horsemeat in your burger, or passing someone the entire contents of Wikipedia in DNA format, then maybe." 

But it seems likely that DNA writing is destined to be a service rather than a piece of standard equipment, much as we didn't press our own vinyl records and now rely on Google, Apple and their ilk to provide us with cloud storage. Perhaps these cloud providers will need to consider a storage alternative in the near future, consolidating and playing host to our media collections, and storing our own genomes for the Government, on vast banks of DNA. Or we might see DNA cartridges instead of Blu Ray discs for our ultra high definition, interactive 4D movies... Just maybe. 

Back to reality and the present: economies of scale are something the team has considered seriously. And while, at a cost of several thousand dollars to convert 739 KB of data, it's not currently a storage option for most, that's not to say Goldman isn't interested in potential commercialization, given the technique's attributes. 

As interesting as the DNA synthesis and sequencing aspects are, it is perhaps the actual encoding, decoding and error correction of the data that sets the project apart. Recompiling the data with an astounding 100% accuracy is no mean feat given the small yet inherent errors of synthesis and sequencing. "Biological experiments are messy and they don't produce beautiful clean data. It's actually my day job to spend time working out how to interpret results that aren't straightforward.” Goldman attempted, therefore, to devise a coding system that was somewhat resistant to the kind of errors that were most likely to occur. 

One well-known error is that, if you try to read back a sequence of DNA where the same base is repeated, longer repeats increase the probability of error. To counter this, Goldman's system avoided base repeats altogether, which, while placing significant constraints on the code, meant that if base repeats were found in sequencing, the information could be discarded. Several levels of in-built redundancy made this possible but also protected the system against other errors, such as the complete inability to read a particular sequence. With many copies of each fragment, all tagged with information about where the sequence belongs in the file, the "majority voting system" could essentially build the correct code sequence – imagine a 1000-piece jigsaw puzzle that self-assembles from 1,000,000,000 similar pieces. 
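The repeat-avoidance trick has an elegant core: if each symbol is chosen only from the three bases that differ from the previous one, a run of identical bases simply cannot occur. The team's published scheme is built on a base-3 (ternary) code; the toy sketch below shows only the rotating-choice idea, with hypothetical names, not their actual encoding:

```python
BASES = "ACGT"

def trits_to_dna(trits, prev="A"):
    """Map ternary digits (0, 1, 2) to DNA such that no base ever
    repeats: each trit selects one of the three bases != previous base.
    Illustrative sketch only - not Goldman's published code tables."""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # always 3 options
        prev = choices[t]
        out.append(prev)
    return "".join(out)

dna = trits_to_dna([0, 1, 2, 0, 1])  # -> "CGTAG"
# By construction, no two adjacent bases are ever equal:
assert all(a != b for a, b in zip(dna, dna[1:]))
```

The cost is capacity – three choices per position instead of four – but the payoff is that any repeat seen in sequencing is provably a synthesis or read error and can be thrown away.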

Goldman jokes about one journalist (not me) who insisted on receiving the decoded image of EBI rather than the original encoded image. "But it's just the same! Likewise, if I were to email the file to you, the copy you've got isn't the same file. It's gone over the Internet. Electrons have moved and new areas on a hard disc drive have been written to a certain pattern – but that pattern is exactly the same, not just similar. We're very good at handling digital information because we can analyze and count it. It's much easier to tell if two files are the same than to spot a well-forged painting, for example."
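Goldman's point about counting is easy to make concrete. Verifying that two files carry identical bits takes a few lines – hash both and compare, or compare byte by byte. A minimal sketch (the function name is mine, not from the project):

```python
import hashlib

def same_contents(path_a: str, path_b: str) -> bool:
    """True if the two files are bit-for-bit identical, judged by
    comparing SHA-256 digests computed over each file in chunks."""
    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.digest()
    return digest(path_a) == digest(path_b)
```

No art expert required: the decoded file either matches the original exactly, or it doesn't.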



About the Author
Rich Whitworth

Rich Whitworth completed his studies in medical biochemistry at the University of Leicester, UK, in 1998. To cut a long story short, he escaped to Tokyo to spend five years working for the largest English language publisher in Japan. "Carving out a career in the megalopolis that is Tokyo changed my outlook forever. When seeing life through such a kaleidoscopic lens, it's hard not to get truly caught up in the moment." On returning to the UK, after a few false starts with grey, corporate publishers, Rich was snapped up by Texere Publishing, where he spearheaded the editorial development of The Analytical Scientist. "I feel honored to be part of the close-knit team that forged The Analytical Scientist – we've created a very fresh and forward-thinking publication." Rich is now also Content Director of Texere Publishing, the company behind The Analytical Scientist.
