Pioneering Proteomics
My love of biology, mass spectrometry and computer programming has led me down an uncommonly rewarding path of discovery and innovation. Here, I share some of my story and look towards the horizon of my exciting and ever-changing field.
1983... Somewhere in the United States of America...
“I really should combine my interests in biology, chemistry and mass spectrometry. What’s out there? OK. MS-based protein sequencing looks like the future. And that means there are really only two serious options: the laboratory of Don Hunt at the University of Virginia or Klaus Biemann’s lab at MIT...”
The wonder years
I grew up in a military family, which meant moving often, changing schools and making new friends. One advantage to the nomadic lifestyle was crisscrossing the US several times and visiting spectacular sights like the Grand Canyon, the Painted Desert, redwood forests, and meteor craters – you can’t look at the Grand Canyon and not wonder how and why! It was a really natural introduction to wondering about ‘life and nature’.
I was first excited by science and medicine as a freshman in high school. I wasn’t very far into the American football season – the first game, in fact – when I broke my leg in a tackle. I was hospital-bound in a military facility, and for the first time, I began to consider the science behind what was going on around me. Certainly, the experience made me take science – particularly biology and chemistry – more seriously at school. And when my chemistry teacher presented a demonstration in class making nylon, I was captivated; the fact that you could mix two chemicals together and create something new simply blew me away.
After high school I went to the University of Maine to study zoology, and during this time I took a course in organic chemistry. I loved it and was tempted to change my major, but I was already a third-year student, so switching would have delayed my graduation. Instead, I stayed in zoology, applied to medical school in my fourth year and didn’t get in, so I had to think about a Plan B. I figured I’d do chemistry for a year, and then re-apply to medical school at a later stage. However, it turned out that the chemistry department had just invested in a mass spectrometer. It was amazing – in part because it was attached to a computer, which was about as close as you could get to a PC at the time (1980). It was a Hewlett Packard GC-MS with an HP computer and it was so cool; you could do library searching of spectra! This serendipitous experience completely changed my plans, and I never once reconsidered medical school.
I was immediately hooked on chemistry and made plans to move on to a PhD. I decided I wanted to study MS-based protein sequencing, as it combined my interests in chemistry and biology. Don Hunt had just published a paper on fast atom bombardment (FAB – invented by Michael Barber in 1981) for soft ionization of peptides (1). It was at the cutting edge – and I wanted in on it. Don Hunt is a great innovator and his lab was an exciting place to be in 1983; joining his group turned out to be one of the best decisions of my life.
A source of change
Advances in mass spectrometry are very often driven by changes in ionization techniques. New methods come along that allow you to do new things – then instrument manufacturers modify their instruments to take full advantage, like boosting mass range in the case of FAB. And so throughout the 1980s, researchers were all racing ahead with FAB, and in Don’s group that meant understanding what it could do for peptide analysis. While our main competition focused on using tandem double-focusing magnetic sector instruments (behemoths that required a million bucks and a room of their own), Don was really pushing the idea of using triple quadrupoles for the sequencing of peptides. He believed in the power of triple quads, and by 1986 we had made it work for peptide sequencing (2).
Of course, when electrospray ionization (ESI) came along in the late 80s, it totally changed everything. ESI worked well with triple quads, and as a result, sector instruments all but disappeared since their slow scan speeds made it difficult to do HPLC. At about the same time, I completed my PhD and joined Lee Hood’s lab at Caltech for a post doc. Lee’s group had developed the gas-phase protein sequencer, which was 1,000 times more sensitive than Edman’s spinning-cup protein sequencer, and was pushing to use it with 2D gels. Lee’s lab (with Ruedi Aebersold – also a post doc in the Hood lab) was championing the use of two-dimensional gel electrophoresis to do entire proteomes – by pulling spots off the gels and running them through the gas-phase sequencer. Once again, I was immersed in a lab that was leading advances in the field, and surrounded by colleagues who went on to make significant contributions in science.
The human genome distraction
Hood’s new sequencing technology helped ignite the biotechnology boom in the US. That breakthrough meant that people were sequencing proteins that they hadn’t been able to before, and coupled with recombinant technologies, many of these new proteins could be used as therapeutics.
It was really fortuitous that I joined Lee’s lab, but it didn’t seem like it at first. Lee had about 70 people working with him and had just put together the crucial components that made up the microchemical facility – the ability to sequence and synthesize both DNA and proteins. Naturally, many in the lab were turning their attention to sequencing the genome and there were big discussions on how best to do it. I was in the protein sequencing section of his lab and remember thinking to myself, “what’s the future of protein chemistry?!” Despite concerns about the relevance of protein sequencing, I started working on how I could interface mass spectrometry with information that would come out of the genome projects. I also considered the possibility of using mass spectrometry to sequence DNA – after all, the search was on for faster and more accurate sequencing technology. Where did I fit in the grand scheme?
I didn’t get far on the latter idea, but I had a bunch of different software tools and a small database of protein sequences; it occurred to me that we might be able to use mass spectrometry information with the database to identify proteins in a faster timeframe, which led me to a peptide mass fingerprinting strategy in 1993 (3). A number of other groups came up with the same idea – there were about five papers on the subject that year. But ultimately, I noticed that it was pretty easy to get fooled with mass fingerprinting in terms of false positives, and that triggered my thoughts on how to use tandem mass spectrometry data.
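To make the fingerprinting idea concrete, here is a minimal sketch in Python of how a set of observed peptide masses can be matched against an in silico tryptic digest of each database entry. The residue masses are standard monoisotopic values, but the function names, tolerance and example data are purely illustrative and are not taken from the original 1993 software.

```python
# Minimal peptide mass fingerprinting sketch (illustrative only, not the 1993 code).
# Monoisotopic residue masses in daltons; one water is added per peptide.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_peptides(sequence):
    """Cleave after K or R, but not before P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR" and (i + 1 == len(sequence) or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    return sum(RESIDUE[aa] for aa in peptide) + WATER

def fingerprint_score(observed_masses, protein_sequence, tol=0.5):
    """Count observed peptide masses explained by the protein's theoretical digest."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(protein_sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed_masses)

# Rank a toy database by how many observed masses each protein explains.
database = {"protA": "MKWVTFISLLFLFSSAYSR", "protB": "MDEKRNLQAAGKVPR"}
observed = [659.4, 927.5, 1740.9]  # hypothetical measured peptide masses
for name in sorted(database, key=lambda n: fingerprint_score(observed, database[n]), reverse=True):
    print(name, fingerprint_score(observed, database[name]))
```

With only a handful of masses and a loose tolerance, random matches accumulate quickly – which is exactly the false-positive problem that pushed me towards tandem mass spectrometry data.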
Code master
While we were manually sequencing some MHC class II proteins, and waiting for some partial sequencing results to come back from the BLAST server at NIH, I started to wonder why we couldn’t just send the entire spectrum off for a database search so we didn’t waste time sequencing things that were already known. I took a day off to figure out how to match spectra to sequences and wrote some trial computer code. I convinced myself it would work and hired a programmer – Jimmy Eng – to work on the project full time. He was an electrical engineer, but his master’s project involved neural networks for language processing – and that intrigued me. Could we use the same strategy to interpret mass spectra? Jimmy didn’t know what a protein or a mass spectrometer was, but I figured I could teach him that. He had some programming experience, but not as much as I thought might be needed. But Jimmy was exceptionally bright and hardworking, and hiring him turned out to be a lucky turn of fate and the beginning of a great partnership. There were a couple of technical problems to solve at first, like getting information out of proprietary MS files, but somewhat surprisingly, the program worked pretty well from the get-go. And so, in 1993, SEQUEST was born – though it took several rejections before the paper was published in 1994 (4).
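Conceptually, the matching step looks something like the sketch below: generate theoretical b- and y-ion m/z values for each candidate peptide and score how many observed fragment peaks they explain. This is only a simplified stand-in – SEQUEST’s actual scoring is based on a cross-correlation between predicted and observed spectra – and the peptides and peak lists here are made up for illustration.

```python
# Illustrative MS/MS-to-sequence matching via b/y fragment ions.
# The shared-peak count below is a simplification; SEQUEST's real score is a
# cross-correlation between the predicted and observed spectra.
PROTON, WATER = 1.00728, 18.01056
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def fragment_mz(peptide):
    """Singly charged b- and y-ion m/z values for one peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b + y

def shared_peaks(observed_mz, peptide, tol=0.5):
    """Count observed fragment peaks that match a theoretical ion of the peptide."""
    theoretical = fragment_mz(peptide)
    return sum(any(abs(mz - t) <= tol for t in theoretical) for mz in observed_mz)

# Rank candidate peptides (in practice, database peptides whose mass matches the
# precursor) against one observed spectrum; all values here are hypothetical.
spectrum = [147.1, 276.2, 363.2, 480.3, 577.3]
candidates = ["SAMPLER", "PEPTIDEK", "ELVISLIVESK"]
best = max(candidates, key=lambda p: shared_peaks(spectrum, p))
print(best, shared_peaks(spectrum, best))
```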
Disruption often seems to come from people new to a field rather than those who have been around for a long time. And I was well prepared to take the leap of faith – I knew how to code, I knew how to manually sequence peptides from mass spectra, I’d gotten interested in databases through working with a program called PC/GENE, and I had smart, enthusiastic people in my group willing to take risks with me. Everything just came together at the right moment.
I guess what separated me from the crowd at the time was my love of coding. Going back a couple of years, a lot of my classmates at Maine used to moan about how hard computer programming was and I felt quite intimidated by it. Eventually, I forced myself to take a class to see for myself. I loved it, but it was not easy to do on a regular basis, as it required having time on a mainframe computer. But when I came to Virginia, PCs were becoming more widely available and I picked the programming back up. I even convinced my wife to let me buy a PC (it cost about $1,500 – quite an investment for a pair of grad students!) and I started writing programs in Turbo Pascal, back when programs were limited to 64 KB or something (hard to imagine for some of the younger readers, I’m sure...)
I have a philosophy that nothing you do is necessarily wasted effort – it may well be useful in the future. In that spirit I wrote a lot of programs at Virginia that were not necessarily practical or useful. One of them predicted all aspects of a protein sequencing workflow: I could take a protein sequence, perform a virtual tryptic digest, predict the HPLC retention times and plot the chromatograms for the resulting peptides, and predict and plot the fragmentation patterns for all the peptides. For a given protein sequence, then, I could generate quite a bit of theoretical information. Working in that way really got me in the right frame of mind for what was to come a few years down the road.
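The retention-time piece of that workflow can be approximated by summing per-residue hydrophobicity contributions for each tryptic peptide. The sketch below shows the idea; the coefficients and calibration constants are illustrative placeholders, not the published values an early prediction program would have used.

```python
# Rough retention-time estimate from summed residue hydrophobicities.
# The coefficients and the linear calibration are illustrative placeholders,
# not the published values an early prediction program would have used.
HYDROPHOBICITY = {
    "G": -0.4, "A": 0.5, "S": -0.8, "P": 0.0, "V": 3.3,
    "T": 0.4, "C": -0.5, "L": 8.1, "I": 7.4, "N": -1.6,
    "D": -0.8, "Q": -0.9, "K": -3.2, "E": 0.0, "M": 4.8,
    "H": -3.5, "F": 8.1, "R": -2.1, "Y": 4.0, "W": 8.8,
}

def predicted_retention_min(peptide, intercept=5.0, slope=0.9):
    """Crude elution-time estimate (minutes) for a reversed-phase gradient."""
    return intercept + slope * sum(HYDROPHOBICITY[aa] for aa in peptide)

# Predict an elution order for a few tryptic peptides (hypothetical sequences).
for pep in ["DAHK", "GVFR", "WVTFISLLFLFSSAYSR"]:
    print(pep, round(predicted_retention_min(pep), 1))
```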
Must scan faster
Tandem MS is a fantastic mixture analysis tool, and armed with a way to quickly interpret the data, I started thinking about how to circumvent the use of gel electrophoresis. What became clear from database searching strategies was that the tandem mass spectrum of a peptide was more or less a zip code for a protein. By extension, you could digest a mixture of proteins, run them through the mass spectrometer, and match the individual peptide MS/MS spectra back to their proteins. While I was at Virginia, Don Hunt used to say that you could sequence two proteins simultaneously from a mixture with tandem MS methodology. While true, the data interpretation was slow and data collection needed to be comprehensive. The creation of sequence databases would make this concept both possible and practical.
LC was getting to the point where we could run increasingly complex mixtures and we really started pushing the technology to identify proteins in mixtures. We moved from complexes to organelles, to cells, and finally to tissues. Of course, each step required more and more sophisticated separation technology, more advanced mass spectrometers and better software tools.
Ian Jardine has already told his story in The Analytical Scientist (tas.txp.to/0515/Jardine) and he was indeed heavily involved in a lot of the work I was doing. I took SEQUEST to Ian, and he immediately understood its potential for mixture analysis without gel electrophoresis. Basically, he recognized that it was a game changer and wanted to license it exclusively from the University of Washington, where I was working at the time. Part of Ian’s genius was being able to cut through all the academic noise to figure out what direction the field was heading. Ian wanted to push mass spectrometry into the biochemistry field, and considered SEQUEST to be one of the main stepping stones towards that goal.
I remember visiting Thermo (Finnigan back then) to see the beta version of the new LCQ Ion Trap – Ian’s first big project – and because mixture analysis was at the forefront of my mind, my mantra was “must scan faster, must scan faster.” The guys were justifiably proud of the instrument and I was so impressed and excited when they showed it to me, but to their dismay I kept asking, “Can’t you make it scan any faster?” I think they were dumbstruck, but I explained why – and sure enough, every model that followed was able to scan faster. Indeed, faster scanning speed is still at the heart of discussions today.
As part of the licensing agreement for SEQUEST, I got an LCQ Ion Trap and that set us on a long road of collaboration on direct protein mixture analysis using LC-MS. Essentially, I wanted to use technology to understand diseases. As the mass spectrometers got better, we were able to collect more data, so we needed to develop more advanced software tools. It was a very positive cycle of invention and re-invention.
Even back when I was at Caltech, Ian put a mass spectrometer in my lab – a move that was not only absolutely key at the start of my career, but also an important catalyst for moving mass spectrometry into the field of biochemistry, as Lee’s laboratory was one of the epicenters of protein biochemistry. I remember giving talks on mass spectrometry and proteins in the early days and there would always be someone from conventional protein sequencing grilling me. It was with great satisfaction that, about five years later, they were all moving to mass spectrometry. Needless to say, traditional sequencing was being rendered obsolete pretty quickly by the rapid changes in the mass spectrometry world.
Most of the equipment in my lab today is from Thermo Fisher Scientific – partly because of my long-term collaboration with them, but also because it just works. I bought a QTOF once – and while the instrument looked spectacular on paper in terms of mass accuracy, it just didn’t seem up to the task in terms of robustness. On the other hand, my ion traps were working 24/7 and they never broke down – and even though I wasn’t getting the same mass accuracy or resolution, it has to be said that it’s better to have data than no data. I eventually got to the point where I would only buy Thermo equipment. Later, when the Orbitrap was introduced, it was a complete game changer for the field because it produced robust high resolution, high mass accuracy data – now everyone has to have one.
The Yates Lab today
Unsurprisingly, I’m still very interested in proteins from a very disease-centric perspective. We’re trying to increase our understanding of how protein networks, protein-protein interactions and protein modifications change as a function of disease. Today, the tools and developments we’ve been working on for the last 20 years have led us to the point where we are asking very specific questions about disease.
Unfortunately, in terms of funding, proteomics frequently seems to take a backseat to genomics. Funding for genomics is quite good and funding agencies don’t hesitate to spend a billion dollars on certain strategies even if the results aren’t that great. Proteomics simply doesn’t get that kind of attention. And I think that’s because genomics has a very significant track record for finding disease genes, with the hope that gene discovery will lead to cures.
A good example is cystic fibrosis. The gene was found in 1989 with genomic technologies, but it’s taken 26 years of traditional biochemical and proteomics studies to better understand the biochemistry of the problem. Clearly, discovering disease genes is a good thing (and can lead to spectacular careers in science – the current head of the NIH was on the team that discovered the CF gene), but equal resources need to be devoted to the technologies that allow genomic discoveries both to be understood and to be turned into cures.
To that end, we’ve been working on cystic fibrosis for over a decade and we’ve developed some interesting approaches to help understand the mechanisms of the disease. Now, we are looking at six very clear drug targets. One candidate can rescue the disease to the same extent (in cell cultures) as a drug that is about to come on the market. There are several other disease projects in my group at various stages, but the CF project is the most advanced.
I guess I have come full circle; I started out wanting to go into medicine, and in a sense I have. As a physician you can have a positive impact on the lives of thousands of people, but as a scientist, you can make a difference to millions by making discoveries that change the way disease is diagnosed or treated. I remain grateful that what seemed like a setback early in my life turned out to be the auspicious opportunity that led me to a fulfilling career in science.
Protecting our proteomic future
The big difference between genomics and proteomics (apart from the complexity) is funding. There was a very deliberate and well-crafted push by US funding agencies to develop next-gen sequencing technology, with the aim of getting down to the $1000 genome. One of my former post docs now works at NIH and was involved in developing a focused program to create disruptive, next-gen tools for protein analysis. Unfortunately, the NIH decided not to pursue it. This is disappointing not only because of the loss of funding to the field, but also because a focused and deliberate strategy could have yielded big results.
Mass spectrometry and proteomic methods are having a very broad impact on biological science. I feel strongly that the breadth of this impact is not yet well understood or appreciated. Almost every study published that involves proteins will have used mass spectrometry in some form or other. But very often that aspect of the work gets buried in the supplemental methods – if it’s reported at all. It’s almost become commodity science (Rich Whitworth focused on this problem in his editorial last month based on our discussions: tas.txp.to/0515/commodity). When science is treated as a commodity, people stop citing papers. Consider electrospray ionization – no one cites the work of Yamashita and Fenn (5) when they use it anymore, but they do cite BLAST if they use it. I really don’t understand why it’s acceptable for some areas of science and not others. I worry that with increasing commoditization of proteomics, funding agencies don’t appreciate its impact either.
Without recognition, it’s difficult to develop new technology that allows you to ask new questions. To quote the theoretical physicist Freeman Dyson, “[Technology] is the mother of civilizations, of arts and of sciences.” Indeed, new technology allows us to do things that we couldn’t do before. And while I think people appreciate this fact if they stop and think about it, they quickly forget. Almost all technology began life in a laboratory somewhere – likely funded by a fundamental science grant. It can be frustrating to see my own field of science being commoditized before its time.
Regardless, mass spectrometry instruments are rapidly evolving and continually pushing the frontiers of bioanalysis, particularly in proteomics. But has the technology reached incremental status or is there a potential disruptive innovation that could emerge and completely alter the landscape?
We have to ask: how can proteomics gain access to the kind of massively parallel next-generation sequencing strategies that genomics has adopted? Mass spectrometry is inherently serial; we need to spend time and effort on speeding up proteomics platforms. At this point, it is hard to imagine a new and disruptive technology emerging to replace mass spectrometry. But then again – I guess that is the nature of a disruptive technology: you don’t see it coming!
I am keenly aware of the need for disruptive innovation in proteomics, but once you have a large established group, moving in a new direction can feel like turning the Titanic. I’d love to have a ‘Skunk Works’ section in my group, working on exotic, high-risk projects, but NIH would probably need to be better funded for that to happen…
Maintaining our “momentome”
From a technical standpoint, we’re on the verge of being able to do whole proteomes. Admittedly, the term ‘whole proteome’ is still somewhat up for debate, as expression and modifications change over time, unlike the genome. My definition is our ability to identify all of the proteins present in a particular mixture in a routine and robust way – and I think that’s coming in the next few years.
Once we’ve achieved that goal, we can increase the amount of sequence coverage (from a bottom-up perspective) so that we can start asking questions about modification states. While we’re doing that, we need to focus on improving top-down approaches, where there are still a lot of technical challenges. The latest generation of mass spectrometers will somewhat democratize the world of top-down proteomics because we won’t all need 15-Tesla magnets – and that should move the field along much faster. Over the next 10 years, I expect to see increasingly improved ways of fragmenting larger and larger proteins, which will also have a huge impact on top-down approaches.
As I indicated, finding ways to parallelize analyses must also be high on the agenda to drive technological advances. We’ve been working on ways to look at entire networks of proteins in a single experiment so that we can investigate the dynamics of pathways. And we are also developing new software approaches to meet the needs of new methods of analysis. These advances don’t parallelize mass spectrometry per se – we’re still acquiring data in a serial fashion – rather they parallelize the questions being asked.
Software is still key, but I think the standard proteomics tools are now pretty robust – you know, we probably don’t need a 40th version of SEQUEST. Instead, the community must focus on new tools that allow us to ask new and different questions.
Seeking new technologies that can help solve new problems always stimulates me. However, it can also be frustrating when we gain access to those technologies; how do you choose between 10 different great applications of that technology when resources are limited? The answer is “with difficulty,” but I’m immensely proud of the work that my group has done, is doing and will continue to do.
Fundamentally, what motivates and excites me the most is solving problems and finding clarity. And that is what I will continue to do.
1. D. F. Hunt et al., “Sequence Analysis of Oligopeptides by Secondary Ion/Collision Activated Dissociation Mass Spectrometry”, Anal. Chem. 53, 1704–1708 (1981).
2. D. F. Hunt et al., “Protein Sequencing by Tandem Mass Spectrometry”, Proc. Natl. Acad. Sci. USA 83 (17), 6233–6237 (1986).
3. J. R. Yates III et al., “Peptide Mass Maps: a Highly Informative Approach to Protein Identification”, Anal. Biochem. 214, 397–408 (1993).
4. J. K. Eng, A. L. McCormack and J. R. Yates III, “An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database”, J. Am. Soc. Mass Spectrom. 5 (11), 976–989 (1994). DOI: 10.1016/1044-0305(94)80016-2
5. M. Yamashita and J. B. Fenn, “Electrospray Ion Source. Another Variation on the Free-Jet Theme”, J. Phys. Chem. 88 (20), 4451–4459 (1984). DOI: 10.1021/j150664a002
John Yates is Ernest W. Hahn Professor of Chemical Physiology and Molecular and Cellular Neurobiology at Scripps Research, La Jolla, California, USA. He was recently named Editor of the Journal of Proteome Research.