The Analytical Scientist, August 2025 issue
Mass Spectrometry

Living the DreaMS

DreaMS, a transformer-based AI model trained on more than 24 million spectra, is reshaping how researchers explore chemical space – without a single label

Roman Bushuiev and Tomáš Pluskal

A new AI model is helping scientists explore the molecular makeup of complex biological and environmental samples – without relying on extensive reference databases. Developed by researchers at the Czech Academy of Sciences and collaborators, DreaMS (Deep Representations Empowering the Annotation of Mass Spectra) outperforms traditional tools in predicting molecular properties, assessing spectral similarity, and identifying challenging compound classes, including fluorinated molecules.

Inspired by the success of large language models, DreaMS adopts a similar learning strategy to analyze mass spectrometry data. Rather than requiring labeled examples, it learns by reconstructing missing parts of spectra and predicting the order in which molecular signals appear. Trained on more than 24 million raw tandem mass spectra, the model generates rich, structured “embeddings” that cluster similar molecules together – even across different instruments and experimental conditions.

The model and code are openly available, offering a foundation for large-scale, data-driven mass spectrometry annotation. With its flexible and scalable design, DreaMS could accelerate progress in diverse areas such as drug discovery, food chemistry, and environmental monitoring.

To find out more about the technology – as well as its development story – we spoke with Roman Bushuiev and Tomáš Pluskal, co-authors of the study.

Can you describe, in a nutshell, how the DreaMS model learns from unannotated mass spectra?

Much as large language models such as ChatGPT learn the structure of language without knowing the meaning of individual words, DreaMS learns to interpret mass spectra without prior knowledge of their underlying chemical structures. It’s trained on two tasks: reconstructing randomly masked signals in a spectrum, and predicting the chromatographic retention order of randomly ordered spectrum pairs that originate from the same LC-MS/MS experiment. This self-supervised setup requires no molecular annotations, and can therefore be applied to millions of unannotated spectra from MassIVE. Our study demonstrates that this training approach leads to the emergence of molecular representations within the model.
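
As a rough illustration of the two self-supervised tasks described above, the sketch below builds training examples from toy data. It is not the DreaMS code: the function names, the (m/z, intensity) tuple layout, the "rt" key, and the masking fraction are all assumptions made for this example.

```python
import random

def mask_peaks(peaks, mask_fraction=0.3, seed=0):
    """Hide a random subset of peaks; a model would be asked to reconstruct them.

    peaks: list of (mz, intensity) tuples. mask_fraction is an illustrative value.
    Returns (visible_peaks, hidden_peaks).
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(peaks) * mask_fraction))
    hidden_idx = set(rng.sample(range(len(peaks)), n_mask))
    visible = [p for i, p in enumerate(peaks) if i not in hidden_idx]
    hidden = [p for i, p in enumerate(peaks) if i in hidden_idx]
    return visible, hidden

def retention_order_example(spec_a, spec_b, seed=0):
    """Shuffle two spectra from the same run and derive the order label.

    Each spectrum is a dict with a hypothetical "rt" (retention time) key.
    The label is 1 if the spectrum shown first eluted earlier, else 0.
    No molecular annotation is needed: the label comes from the raw data.
    """
    rng = random.Random(seed)
    pair = [spec_a, spec_b]
    rng.shuffle(pair)
    label = int(pair[0]["rt"] < pair[1]["rt"])
    return pair, label
```

In both cases the training target is derived from the spectra themselves, which is what allows the approach to scale to unannotated repositories.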

What was the initial spark that led you to try a foundation model approach for mass spectrometry?

The vast majority of tandem mass spectra from untargeted metabolomics remain unannotated, mainly due to the limited coverage of current spectral libraries. We wanted to overcome this bottleneck by learning from the abundance of unannotated spectra directly – unlocking previously inaccessible regions of chemical space through self-supervised learning. We were also inspired by the success of protein foundation models such as ESM, which, once trained, can be adapted for diverse tasks such as structure or function prediction. We believed a similarly powerful and versatile model could be built for metabolomics, to address a range of computational challenges such as molecular structure prediction or molecular networking.

Were there any major blockers or challenges during dataset creation or model training? If so, how did you overcome them?

In our initial experiments, we trained the model on all spectra mined from the metabolomics portion of MassIVE GNPS (around 700 million spectra). Unfortunately, the resulting model did not develop any meaningful understanding of molecular structures. We hypothesized that the issue was due to noise and redundancy in the data, so we shifted our focus to building a high-quality, non-redundant subset of spectra from MassIVE – which we called GeMS – comprising around 24 million spectra. The main challenge was developing scalable yet accurate algorithms for spectral quality detection and identification of duplicate spectra across MassIVE datasets (terabytes of data).
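
To make the idea of quality filtering and duplicate removal concrete, here is a minimal sketch on toy data. It is a simplified stand-in, not the algorithm used to build GeMS: the peak-count threshold, the binning width, and the top-peak signature are assumptions chosen only to keep the example short.

```python
def signature(peaks, top_k=5, bin_width=0.1):
    """Coarse fingerprint of a spectrum: binned m/z of its most intense peaks.

    peaks: list of (mz, intensity) tuples. Near-identical spectra should map
    to the same signature, so one set lookup replaces pairwise comparison.
    """
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_k]
    return tuple(sorted(round(mz / bin_width) for mz, _ in top))

def filter_spectra(spectra, min_peaks=3):
    """Drop low-quality spectra (too few peaks) and near-duplicates."""
    seen = set()
    kept = []
    for peaks in spectra:
        if len(peaks) < min_peaks:
            continue  # crude quality check, illustrative only
        sig = signature(peaks)
        if sig in seen:
            continue  # same signature already kept: treat as a duplicate
        seen.add(sig)
        kept.append(peaks)
    return kept
```

Hash-style signatures like this trade some accuracy for speed, which hints at why balancing scalability and accuracy is hard at the terabyte scale.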

Could you speak about the broader potential of using DreaMS-style models in metabolomics and small molecule discovery?

DreaMS is designed as a general-purpose model that can be fine-tuned for a wide range of spectrum interpretation tasks. Avoiding rigid, rule-based systems, it instead learns directly from data, which makes it adaptable and powerful. A good example is the detection of fluorinated molecules directly from mass spectra. While around 30 percent of small-molecule drugs and agrochemicals contain fluorine, only a few fluorinated natural products have been discovered to date.

Detecting fluorine via mass spectrometry has long been considered extremely difficult, as it only has one stable isotope and lacks distinctive MS1 isotopic patterns. However, with a minimal extension, we enabled DreaMS to predict the presence of fluorine with high precision. We’re currently applying this to the discovery of fluorinated natural products from plants, and already we’ve been able to confirm new structures using NMR.
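
The isotope argument above can be checked with simple arithmetic. The sketch below compares fluorine with chlorine and bromine using approximate, rounded natural abundances; the table and function name are introduced here purely for illustration.

```python
# Approximate natural isotopic abundances (rounded values), keyed by mass number.
ISOTOPES = {
    "F": {19: 1.0},
    "Cl": {35: 0.7576, 37: 0.2424},
    "Br": {79: 0.5069, 81: 0.4931},
}

def heavy_to_light_ratio(element):
    """Abundance of heavier isotopes relative to the lightest one.

    A ratio well above zero produces a visible satellite peak in MS1;
    a ratio of zero means the element leaves no isotopic signature.
    """
    abundances = ISOTOPES[element]
    lightest = min(abundances)
    heavy = sum(a for mass, a in abundances.items() if mass != lightest)
    return heavy / abundances[lightest]
```

With these values, chlorine gives a ratio of roughly 0.32 and bromine close to 1, whereas fluorine gives exactly zero, so its presence has to be inferred from other evidence such as fragmentation.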

We’ve recently published a series of articles on in-source fragmentation and the “dark metabolome.” How does your work fit into this broader conversation?

As DreaMS doesn’t incorporate MS1 spectra, we cannot directly comment on in-source fragmentation. However, our work is highly relevant to the dark metabolome. The DreaMS Atlas – a molecular network built from 200 million spectra in MassIVE GNPS – reveals that most public spectra don’t match anything in current spectral libraries and often form large, unannotated clusters. In our view, that’s the essence of the dark metabolome: real, abundant molecules that remain unidentified. DreaMS and the DreaMS Atlas provide new tools for navigating and investigating this hidden chemical space.
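
One common way to turn embeddings into a molecular network is a nearest-neighbor graph over cosine similarity. The sketch below shows that idea on toy two-dimensional vectors; the values of k, the similarity cutoff, and the function names are assumptions, and the actual DreaMS Atlas construction may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_edges(embeddings, k=1, min_similarity=0.8):
    """Link each spectrum to its k most similar neighbors above a cutoff.

    embeddings: list of vectors, one per spectrum (toy stand-ins here).
    Returns a sorted list of undirected edges as (i, j) index pairs.
    """
    edges = set()
    for i, u in enumerate(embeddings):
        scored = sorted(
            ((cosine(u, v), j) for j, v in enumerate(embeddings) if j != i),
            reverse=True,
        )
        for sim, j in scored[:k]:
            if sim >= min_similarity:
                edges.add((min(i, j), max(i, j)))
    return sorted(edges)
```

In such a graph, clusters containing no spectrum with a library match would correspond to the large unannotated regions described above.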

How might this approach influence future research in areas like drug discovery or environmental monitoring?

We believe DreaMS opens a new chapter in computational metabolomics for two key reasons: versatility and scalability. It allows researchers to apply the same underlying model to a variety of metabolomics tasks – eliminating the need to design new tools for each specific challenge. And thanks to its efficiency, DreaMS can be scaled to hundreds of millions of spectra, enabling large-scale analyses in fields such as drug discovery and environmental monitoring.

What are the next steps? How do you see researchers using the DreaMS Atlas in their own work?

On the methodological side, we’re actively extending the DreaMS project in several directions, including the detection of structurally novel molecules using the DreaMS Atlas and support for multi-stage MSn data. Additionally, we’ve already developed an alpha version of DreaMS-Mol: a fine-tuned version of the DreaMS model to predict complete molecular structures – not just embeddings – which we plan to release next year.

On the application side, we’re integrating DreaMS into biological applications in our lab. Our focus is on discovering plant natural products and their biosynthetic pathways, as well as exploring the chemodiversity of plants using the DreaMS Atlas.

Roman Bushuiev is a PhD student and Tomáš Pluskal is a Group Leader in the Department of Metabolomics at the Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences.

Copyright © 2025 Texere Publishing Limited (trading as Conexiant), with registered number 08113419 whose registered office is at Booths No. 1, Booths Park, Chelford Road, Knutsford, England, WA16 8GS.