Establishing reliable chemical criteria for identifying ancient life remains an analytical challenge in geochemistry and astrobiology. Many established methods rely on the preservation of specific molecular markers, which can be difficult to apply to highly altered geological materials.
In a recent study, a research team led by Robert Hazen took a different approach, asking whether fragmented organic mixtures can still retain diagnostic information when considered collectively. Using pyrolysis-GC-MS combined with supervised machine learning, the team analyzed hundreds of modern, fossil, meteoritic, and synthetic samples to assess whether statistical patterns in molecular fragmentation can distinguish biogenic from abiogenic origins, as well as photosynthetic from non-photosynthetic signatures.
In this Q&A, Hazen discusses the analytical rationale behind the approach, the challenges of working with limited and imbalanced datasets, and how machine learning may extend the reach of molecular biosignatures deeper into Earth’s early history.
What initially motivated your team to investigate life’s earliest chemical traces in Earth’s oldest rocks?
Studies of ancient biomolecules preserved in rocks have long been limited because these molecules degrade over time. No individual, diagnostic biomolecule has been identified in rocks older than about 1.6 billion years. However, much older rocks are often rich in fragmented organic molecules.
With this in mind, I hypothesized that the statistical patterns and distribution of these fragments could themselves serve as a reliable biosignature, even when every original biomolecule has been fragmented. If that were possible, it would open a path to detecting life in far older rocks – and potentially to detecting signs of life on other worlds too.
Could you explain how your method works?
We take samples that contain complex mixtures of organic molecules and analyze many thousands of their molecular fragments, even when no specific biomolecules remain. Each sample is represented by a spreadsheet with about 500,000 values. We then train on sets of known samples grouped into pairs of classes (e.g. biotic vs. abiotic), using machine learning to identify the values that best discriminate between the two classes. With the trained model, we can then test unknown samples to see whether they fall reliably into one of the two categories.
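For readers who want a concrete picture of the workflow described in this answer, the sketch below walks through the same three steps on synthetic data: represent each sample as a vector of values, find the values that best separate two known classes, then classify held-out samples. It is only an illustration under stated assumptions. The interview does not name the model the team used, so a simple nearest-centroid classifier stands in for it, and the data, the feature count (200 rather than roughly 500,000), the function names, and the size of the class shift are all hypothetical.

```python
import random
import statistics

def make_sample(label, n_features, informative, rng):
    """Synthetic stand-in for one sample's vector of fragment intensities."""
    values = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    if label == "biotic":
        for j in informative:
            values[j] += 4.0  # hypothetical shift in a few fragment intensities
    return values

def discrimination_scores(X, y, class_a, class_b):
    """Score each feature by the gap between class means relative to spread."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == class_a]
        b = [row[j] for row, lab in zip(X, y) if lab == class_b]
        spread = statistics.pstdev(a) + statistics.pstdev(b) + 1e-9
        scores.append(abs(statistics.fmean(a) - statistics.fmean(b)) / spread)
    return scores

def fit(X, y, class_a, class_b, top_k):
    """Keep the top_k most discriminating features and store class centroids."""
    scores = discrimination_scores(X, y, class_a, class_b)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    selected = order[:top_k]
    centroids = {}
    for lab in (class_a, class_b):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroids[lab] = [statistics.fmean([r[j] for r in rows]) for j in selected]
    return selected, centroids

def predict(model, row):
    """Assign a sample to the class whose centroid is nearest."""
    selected, centroids = model
    def squared_distance(lab):
        return sum((row[j] - c) ** 2 for j, c in zip(selected, centroids[lab]))
    return min(centroids, key=squared_distance)

rng = random.Random(0)
n_features = 200            # toy size; the interview cites about 500,000 values
informative = [3, 57, 120]  # hypothetical indices of discriminating values

# Training set of known samples in two classes.
train_y = ["biotic"] * 20 + ["abiotic"] * 20
train_X = [make_sample(lab, n_features, informative, rng) for lab in train_y]

# Held-out samples treated as unknowns.
test_y = ["biotic"] * 10 + ["abiotic"] * 10
test_X = [make_sample(lab, n_features, informative, rng) for lab in test_y]

model = fit(train_X, train_y, "biotic", "abiotic", top_k=3)
predictions = [predict(model, row) for row in test_X]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print("selected features:", sorted(model[0]))
print("held-out accuracy:", accuracy)
```

The same pattern would apply to any other pair of classes, such as photosynthetic vs. non-photosynthetic, by swapping the labels passed to `fit`.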
Did any of your findings particularly surprise you?
We were surprised and delighted to find that we could reliably assign samples as old as 2.5 billion years as photosynthetic, and samples as old as 3.33 billion years as biotic. That greatly extends the record of life based solely on molecular remains.
We were also very pleased to see that the method corrected our own errors: we mistakenly labeled two samples as "nonphotosynthetic" – a wasp's nest and the external shell of a sea squirt – but the method correctly identified both samples as photosynthetic, reflecting their plant-derived and algae-coated origins respectively. But we’re also aware that this work is just a starting point, with many more detailed studies, larger datasets, and additional attributes still to come.
What was the biggest challenge your team faced – and how did you overcome it?
At this stage, our greatest challenge was the limited number of samples in certain classes. For example, we only had nine fossil animals. The machine learning method does not perform optimally with unbalanced training sets, in which one class has many more samples than the other. We intend to fix that problem in our next, much larger study.
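To illustrate why unbalanced training sets are a problem, the sketch below uses a toy label set with nine samples in one class (the count of fossil animals mentioned in this answer) and a hypothetical 91 in the other. It shows two things: a classifier that always predicts the majority class still scores high on plain accuracy, and inverse-frequency class weights are one common way to compensate. This is not the team's stated remedy, which is simply a larger study. The weighting formula is the same heuristic scikit-learn uses for `class_weight="balanced"`, and the function names here are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {lab: n_samples / (n_classes * c) for lab, c in counts.items()}

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for lab in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == lab]
        recalls.append(sum(y_pred[i] == lab for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Nine samples in the small class (from the interview); 91 is hypothetical.
labels = ["animal"] * 9 + ["other"] * 91
weights = balanced_class_weights(labels)

# A degenerate classifier that always predicts the majority class.
always_majority = ["other"] * len(labels)
plain_acc = sum(p == t for p, t in zip(always_majority, labels)) / len(labels)
bal_acc = balanced_accuracy(labels, always_majority)

print("class weights:", weights)
print("plain accuracy:", plain_acc)
print("balanced accuracy:", bal_acc)
```

Plain accuracy rewards the degenerate classifier for ignoring the small class, whereas balanced accuracy exposes it, which is why per-class metrics are usually preferred when class sizes differ this much.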
Do your results change the scientific conversation about when life first emerged on Earth?
These results merely support what paleontologists have long known about the antiquity of life and photosynthesis based on other kinds of evidence, including 3D fossils of cells and stromatolites, isotopic evidence and genomic studies. But we now also have the possibility of detecting much older life than before, as well as determining biochemical details that remain unknown.
Looking ahead, what are the next steps for refining this approach?
It feels as though we’ve only dipped our toes into a vast ocean of possibilities. To make progress, we’ll need more samples, more attributes and a wider range of analytical methods to combine with pyrolysis-GC-MS, as well as diverse machine learning methods and more researchers working in this new, exciting field.
Robert M. Hazen is a Senior Staff Scientist at the Earth and Planets Laboratory of the Carnegie Institution for Science and Clarence Robinson Professor of Earth Sciences, Emeritus, at George Mason University, USA.
