Gurus of Chemometrics
How can analytical scientists handle the data tsunami? We grill four champions of chemometrics on the progress – and pitfalls – of this rapidly evolving field
Lutgarde Buydens, Jonathan James | Interview
Introduced by Lutgarde Buydens, Professor of Analytical Chemistry and Dean of the Faculty of Science, Radboud University, Nijmegen, The Netherlands
In 2013 I penned a feature for The Analytical Scientist: “Towards Tsunami Resistant Chemometrics,” drawing attention to a paradox at the heart of data science: traditional methods, regarded as the cornerstones of chemometrics, were not designed to handle the large volumes of data available today. New methods and strategies were urgently required to extract relevant chemical information from the data tsunami.
More than half a decade on, the data analysis landscape has changed tremendously. In computer science, deep learning and artificial intelligence methods have emerged, whilst in mathematics there is a growing interest in investigating the fundamentals of data science. Together, this progress can only further benefit the chemometric field. This made me wonder: what is driving chemometricians, and how do they experience the field in relation to these developments? Asking some experienced chemometricians in key application areas such as food analysis, environmental science, metabolomics and industrial process analysis for their views on the field – and how they create value from analytical data – seemed a proper way to address this.
David Wishart: Professor, Department of Computer Science and Biological Sciences, University of Alberta, Alberta, Canada
Roma Tauler: Professor, Department of Environmental Chemistry, Institute of Environmental Assessment and Water Research, Spanish Council of Scientific Research, Barcelona, Spain
Jeroen Jansen: Assistant Professor, Department of Analytical Chemistry and Chemometrics, Radboud University Nijmegen, Nijmegen, the Netherlands
Harald Martens: Adjunct Professor, Department of Engineering Cybernetics, Norwegian University of Science and Technology, Trondheim, Norway
What spurred your interest in chemometrics?
David Wishart: I began my career as a structural biologist with a particular interest in protein structure. Over time, I drifted towards the study of drug design, quantitative structure–activity relationship models, and small-molecule therapeutics. I’ve always been interested in combining theory with practice: pairing mathematics and computer modeling with wet-bench biology and chemistry. In 2002, I joined the Department of Computer Science at the University of Alberta, where I began to explore machine learning and its applications. Since then, my interest in chemometrics has only continued to grow – driven by the enormous volumes of data produced in omics.
I now run a large experimental facility called The Metabolomics Innovation Centre, which serves as Canada’s national metabolomics laboratory. The data we generate, coupled with that generated by many other metabolomics laboratories, is increasing year-on-year. Terabytes of data can easily be produced from a single metabolomics study. In addition to the sheer quantity of data collected, the size of individual metabolomic studies is also growing rapidly – it is not unusual to see 10,000 or even 100,000 subjects or samples being analyzed in some laboratories. This explosive growth is creating numerous data challenges.
Roma Tauler: We investigate environmental issues such as pollution, and the effects of environmental stressors on biological organisms at the omic level. My primary interests were solution chemistry and chemical speciation using electroanalytical and spectroscopic methods, which produced reams of multivariate data. This drove me to employ chemometrics and multivariate data analysis as part of my everyday workflow. Because chemometrics has always offered me the tools I need to solve particular problems, it seems only natural to want to contribute to the field’s ongoing development.
Jeroen Jansen: I started my university studies in chemistry at a time when many believed that biochemistry was the answer to the world’s woes. However, it is now clear that understanding life in its full complexity is a far greater challenge than anyone could have imagined. Developing molecular therapeutics or solving food poverty is no trivial task. While studying chemometrics, I recognized the important role that data could play in tackling these challenges; thus, I began pursuing projects that employ a variety of techniques, such as multivariate curve resolution (MCR) and ANOVA simultaneous component analysis (ASCA), to unravel the vague and seemingly random patterns emerging from data tsunamis. Now, I focus on two areas in which chemometrics shows considerable promise: multivariate process monitoring to make industry more sustainable, and citizen science, to bring measurement to the people.
Harald Martens: I still remember how exceedingly dull and meaningless I found the mathematics and statistics courses during my university studies. It wasn’t until I started my first job in 1972 that I learned to love the modeling of multivariate chemical data – what later developed into “chemometrics”: we generated too much data and extracted far too little information from it, so multivariate data modeling was a vital tool. I discovered first-hand how important it is to ensure that modeling is statistically valid, and that data is interpretable to those with different backgrounds and experiences. I’ve had to learn basic mathematics and statistics the hard way, and have come to love their power and practicality. Sitting through a mathematical proof for proof’s sake is a sure-fire way to send me to sleep! For me, the purpose of mathematics is to facilitate data modeling in order to better understand the world.
What impact is the data tsunami having on your field?
Tauler: There’s no doubt that the data tsunami is one of the major driving forces behind the digital boom. In the chemical and analytical sciences, this is a consequence of more and better measurements and the digital information revolution. Deeper insights are now possible in many areas: for example, extraction of spatial in-vivo chemical information from hyperspectral imaging and video, the investigation and biochemical interpretation of the chemical changes in biological systems instigated by stress factors, and the extraction of chemical, environmental, and climate information from huge data repositories. The amount of data acquired by these studies is enormous. This raises a number of interesting challenges, impacting everything from data preparation and pre-treatment, to interpretation and pattern recognition.
Wishart: The tsunami of metabolomic data has brought with it both opportunities and challenges. The good news is that more data means more “training” and “testing” datasets available for various analytical or machine-learning techniques. Some of the most useful datasets will be those compiled for reference metabolites, which can be used to develop prediction or classification software. These tools will be particularly useful in identifying the tens of thousands of still unknown (or unidentifiable) metabolites in the human metabolome – the so-called “dark matter” of the metabolome.
However, simply storing and retrieving such large amounts of data in a suitable manner is itself a significant challenge, let alone the difficulties in analysis, standardization, and interpretation of the data. Furthermore, most metabolomic data derived from biological studies are highly variable, both in quality and type. As a result, their interpretation can produce widely different results; relatively few studies, even if carried out under identical conditions, produce similar outcomes. This generates confusion and doubt in end-users, who question the utility of the whole approach. Of course, this isn’t a new problem: in the past, similar issues have beset other branches of omics science. Overcoming this challenge will necessitate the development of standardized protocols for data analysis, better handling of false-discovery rates, increased community training, and a greater focus on collecting or creating reference standards.
Jansen: My work covers a broad spectrum of areas, though the two with the most potential for impact and innovation are industrial process control and citizen science. Experimental design has proven itself essential for understanding systems biology: we now use these same ideas to systematically explore data from large industrial processes. This provides a wealth of new information about these large, man-made systems. We’ve quickly learned that the calibration of new, handheld sensors for citizen science requires fundamental knowledge about experimental design to ensure robust and dependable data. The challenge? Making it possible for end-users of these technologies, who are not trained analytical chemists, to operate them effectively. This is no simple task; developing methods that are complex enough to deliver the information we need but also intuitive enough to be used by non-experts is a careful juggling act.
Martens: Humanity is faced with a number of serious challenges: climate change, a loss of biodiversity, human poverty, the spread of new diseases, migration crises, and war. The huge data sets we’re generating – if handled correctly – give us the tools to better understand and tackle these problems. But our data modeling also needs to respect the laws of physics and the explicit and tacit knowledge that humanity already has. If instead we handle it wrongly, with lots of black-box machine learning, the combination of the data tsunami and hyped, unexplained artificial intelligence will lead to societal dementia.
How can we adapt to the incoming data tsunami?
Jansen: What is required is closer cooperation between quantitative modelers and end-users, improving industrial sustainability and empowering citizens to take control of their lives based on quantitative evidence. Chemometrics should take on the role of “translator,” bringing together the knowledge floating around on factory floors and inside tech-savvy living rooms with that gained through big-data approaches. It should not become a science of servitude, but rather empower us to produce novel, data-analysis-backed solutions with real-life value to society. Bringing the end-user on board early to co-create the solution makes adoption easier, no matter how abstract the chemometric methods employed.
To have any kind of impact, we chemometricians can never work alone. Luckily, collaboration is in our blood; talking to any of my colleagues reveals collaborations across a broad spectrum of professions, from the hospital to the industrial plant. Having built these relationships, it’s key that we now move from collaboration to co-creation. Using MCR, almost all chemical knowledge can be translated into mathematical models with appropriate constraints: we can code any kind of experimental design into ASCA to extract only the most relevant information from our data.
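The ASCA idea Jansen mentions – coding an experimental design into the analysis – amounts to partitioning a data matrix into effect matrices defined by the design factors, then applying component analysis (PCA) to each effect. The following is a minimal sketch in Python/NumPy on simulated data; the group sizes, effect magnitude, and number of variables are hypothetical, and a full ASCA implementation would handle multiple factors, interactions, and permutation testing.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical designed experiment: 2 treatment groups x 6 replicates,
# 10 measured variables. The treatment shifts the first 3 variables.
groups = np.repeat([0, 1], 6)
X = rng.standard_normal((12, 10))
X[groups == 1, :3] += 2.0

Xc = X - X.mean(axis=0)            # remove the overall mean

# Effect matrix for the treatment factor: each row replaced by its
# group mean; the residual E holds the within-group variation.
M = np.vstack([Xc[groups == g].mean(axis=0) for g in groups])
E = Xc - M

# PCA on the effect matrix via SVD: the first-component scores show
# how strongly each sample expresses the treatment effect.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
scores = U[:, 0] * s[0]
print(scores.round(2))
```

In a balanced two-group design the effect matrix has only two distinct rows, so the first component captures the treatment contrast exactly; samples from the two groups receive scores of opposite sign.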
Tauler: Since its humble beginnings in the 1980s, progress in chemometrics has been intimately linked to the increasing power of computers, analytical instrumentation, and measurement science. The goal of chemometrics is to extract chemical information from measurements via computational means; thus, it is a field well prepared to handle big data challenges. The tools developed in the last few years are already being implemented in new processes. Methods developed in chemometrics, together with the explosion of statistics and machine learning, have proven key to generating, extracting and interpreting useful information from the data tsunami.
The results can already be seen in real-world applications. The implementation of powerful, multivariate calibration and resolution methods to allow better monitoring, modeling and control of chemical processes has resulted in improved classification, quality, and pattern recognition of industrial and food products. New challenges and perspectives have arisen as society evolves, with a focus on health and the environment now at the forefront.
Wishart: To its credit, the metabolomics community has risen to the challenge. The creation of two databases – MetaboLights and the Metabolomics Workbench – has resulted in open-access archival resources for metabolomic data deposition, particularly for biological studies. Unfortunately, the development of resources capable of depositing and storing chemical data still lags behind. The creation of open-access tools for standardizing data analysis and data interpretation is well underway; indeed, developing open-access, web-enabled software tools, and hardware resources has been a major priority for my laboratory for over a decade.
Comprehensive, web-enabled databases have allowed our lab to obtain the chemical and metabolic data needed for appropriate training and testing of novel machine learning approaches. By ensuring our data are freely available on the web, we also help other scientists to develop new, or related, applications. Database construction is a win–win, both for our lab and for the broader metabolomics community. We have developed a large number of web-enabled databases or web-servers to aid in standardization. These include the Human Metabolome Database (HMDB), MetaboAnalyst, Bayesil, ClassyFire, and others. The HMDB, which has collected information on over 100,000 human metabolites, has helped the metabolomics community to conduct consistent compound identification, compound annotation, and spectral deconvolution.
Tauler: One of the main challenges associated with analytical measurement is limited sensitivity and selectivity; although progress in the development of analytical measurements has been enormous, the number of chemical species under study is near infinite. Thus, the goal of having one analytical measurement for every chemical species is unattainable. What we measure is always a mixture of species, and only part of what is measured relates to the species of interest. There are different ways to increase the selectivity of analytical measurements, and one of the most powerful is using chemometrics. As well as broadening the extent and power of analytical measurement, chemometrics also allows for the discovery and interpretation of the latent (hidden) multivariate or multifactor structure and response of chemical systems. These latent variables reveal the complex behavior of the analyzed systems – especially natural systems. This would not be possible using traditional means of analysis, even when the target has been very well characterized.
Martens: We need to deploy mathematical models to properly handle big data. While purely mechanistic models may be good for simple systems, this traditional approach to modeling is inadequate when faced with real-world complexity. We do understand most of the laws of nature, but not how they combine with each other in concrete situations. The real world must form an integral part of our modelling processes, by combining massive streams of multichannel measurements to update the relevance of our models based on feedback.
What have been your biggest successes to date?
Jansen: For us, the cornerstone of our success has been co-creation. When we began working with process engineers and operators at a large chemical factory, we dusted off Path-PLS – an almost forgotten analytical methodology (but still widely used in the social sciences). The chemical company had a clearly stated challenge in their process, which they themselves – employing their own knowledge – had already broken down into manageable questions. Together, we filled in the pathways between the different unit operations, which proved very valuable. The resulting model was much more informative – and far more predictive – once we had incorporated information from the end-user. I think this is reflective of a wider trend: with the emergence of the data tsunami, people are far more data-conscious. This is making conversations with non-chemometricians much easier for us.
Martens: The on-the-fly processing algorithm for everlasting chemometric model-building that we developed in the 1990s is now having a significant impact in industry and shipping – allowing users to handle very big data streams without being overwhelmed or alienated.
Analysis of everlasting streams of high-dimensional measurements can be likened to the full sound of a symphony orchestra in a concert hall. We prime our mathematical models with prior knowledge about “what music to expect” (the laws of physics, prior observations, and so on) before opening the metaphorical data flood gates. From the ensuing tsunami, we estimate parameters and variables, often via mathematical metamodels. Following this theory-driven approach, we start the data-driven model.
We listen critically to the unmodelled residuals, to discover unexpected but clear “rhythms” and “harmonies” that we quantify and display for interpretation and practical use. Finally, we check for possible “arrhythmia” and “disharmony” – data sticking out from the random background noise. Based on this, we can then correct our initial models, whilst extending or improving our understanding of the processes. This entire process forms a rational, interpretable base for deep learning in technical systems.
Tauler: We’ve conducted significant work using multivariate curve resolution-alternating least squares (MCR-ALS) to investigate equilibrium and kinetic reaction-based chemical systems using spectroscopic methods. The MCR-ALS soft-modeling approach competed favorably with other traditional, hard-modeling parameter estimation and data-fitting methods. Implementing hybrid hard-soft modeling has allowed us to extend the range of chemical systems we can analyze, as well as expanding the range of datasets we can cover. MCR-ALS has been extended to the analysis of very complex multiway and multiset data structures. Recent applications have also seen it applied in omics and hyperspectral imaging.
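At its core, MCR-ALS factorizes a mixture data matrix D into concentration profiles C and pure spectra S (D ≈ C·Sᵀ) by alternating least-squares updates of each factor under chemically meaningful constraints, such as non-negativity. Below is a minimal sketch in Python/NumPy on simulated two-component spectroscopic data; the Gaussian profiles, noise level, and peak-column initialization are hypothetical, and production implementations add many further constraints (closure, unimodality, selectivity) and convergence criteria.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a two-component mixture data set: D = C_true @ S_true.T + noise
# (hypothetical Gaussian concentration profiles and spectra)
t = np.linspace(0, 1, 50)   # reaction time axis
w = np.linspace(0, 1, 80)   # wavelength axis
C_true = np.column_stack([np.exp(-((t - 0.3) ** 2) / 0.01),
                          np.exp(-((t - 0.7) ** 2) / 0.02)])
S_true = np.column_stack([np.exp(-((w - 0.25) ** 2) / 0.005),
                          np.exp(-((w - 0.60) ** 2) / 0.010)])
D = C_true @ S_true.T + 1e-4 * rng.standard_normal((50, 80))

# Initialize C from data columns near each component's spectral peak
C = np.clip(D[:, [20, 47]], 0, None)

# Alternating least squares with a non-negativity constraint on both factors
for _ in range(200):
    S = np.linalg.lstsq(C, D, rcond=None)[0].T
    S = np.clip(S, 0, None)    # non-negative spectra
    C = np.linalg.lstsq(S, D.T, rcond=None)[0].T
    C = np.clip(C, 0, None)    # non-negative concentrations

# Lack of fit (%) measures how much of D the bilinear model leaves unexplained
lof = 100 * np.linalg.norm(D - C @ S.T) / np.linalg.norm(D)
print(f"lack of fit: {lof:.2f}%")
```

With well-separated components, the recovered profiles match the simulated ones up to scaling and permutation – the familiar ambiguities of bilinear soft modeling that the hybrid hard-soft constraints Tauler mentions help resolve.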
What are your goals for the future?
Jansen: We want to further implement and refine novel chemometric methods incorporating industrial knowledge and resources. While we’ve already gained good insights into how to create better data analysis solutions with lasting value to the end-user, the greater challenge will be engaging the measuring citizen to the same degree. Understanding the relations between human behavior and our measurement and prediction innovations is likely to prove critical.
Martens: I hope to continue developing professional software that can bridge the math gap in society. I also want to combine different pragmatic mathematical sciences, particularly chemometrics, control theory and cognitive science. Of course, I also want to use more advanced tools for mathematics, statistics, and computer science – but only when needed, and without losing the real-world connection.
Moreover, I want to continue the work that my colleagues and I have done to develop continuous chemometric- and cybernetics-based machine learning tools for ordinary people to use in their daily work. We need even better ways to summarize, compress, quantify and understand the essence of real-world data without being overwhelmed by numbers, caught up in mechanistic oversimplifications, fooled by good-looking p-values, or alienated by black-box descriptions.
Tauler: In the next few years, we want to develop better approaches to ascertain the reliability of MCR solutions, in order to consolidate their general use. New data fusion strategies and analyses are also needed, for instance in multimodal hyperspectral imaging data analysis or in multi-omic data analysis.
Wishart: Our focus over the next 2–3 years will be to utilize machine-learning technology to develop software to more accurately predict MS/MS and nuclear magnetic resonance spectra of small molecules. In tandem, we hope to develop algorithms capable of predicting biologically feasible compounds or chemical biotransformations. These tools will allow us to more easily identify new or unknown compounds using “in-silico metabolomics.” We are also keen to construct ontologies and pathway databases that can be used to better annotate metabolites and aid their biological interpretation.