Quantitative Survey

12 Mar 2019

This survey, promoted on Twitter, started asking about quantitative proteomics software and (maybe) devolved into a search engine survey. I am not too surprised as computational proteomics seems to begin and end with search engines. Do people need validation for their favorite search engine? (How, exactly, do you win pissing contests?) Search engines, like the mass spectrometers that produce their input, are pretty good these days (and more similar than different).

There is interest in writing up the survey results to share with the proteomics community. @nesvilab (Alexey Nesvizhskii) quickly countered with this thought: “…However, I would suggest changing the focus from counting who uses what tool to something that will help understand current issues and move field forward.” The survey itself is not really something you can have as the centerpiece of a paper. Survey design and analysis is a mature field in psychology and social science, and it would take much more thought and effort to do a proper survey design with publishable results. However, the current survey can be a starting point for a review, tutorial, or perspective paper.

If we ignore the search engine popularity contest, what about all of the other steps in proteomics data analysis workflows? I would argue that the search engine step has become a smaller and smaller part of the overall workflow as proteomics has matured. There is no single focal point in one workflow that we are all stuck at. Like proteomics experiments themselves, there is a lot of diversity in the questions/problems people are tackling, so there are many workflows. One workflow direction is PTMs, the dark proteome, etc., where the major challenges are in consolidating/summarizing results and in validation. Cross-linking and structural questions are also an emerging area; the challenges there are pretty much limited to all steps. Single cell and other sensitivity-driven challenges are also getting a lot of attention these days. It seems that some cells are smaller than chicken eggs…

Another area that requires extensive workflows above and beyond search engines is protein expression studies (a.k.a. quantitation, quantification, etc.). Quantitative proteomics has been going on for so long that it must be a completely solved problem, right? ICAT was Steve Gygi’s thesis work (1998 or 1999). The TMT and iTRAQ publications are from 2003-2004. The SILAC MCP paper is from 2002. A year in proteomics must be like dog years. Work from the early 2000s seems like a lifetime ago.

Quantitative methods and reagents have been around a long time, but building the capability to do actual experiments has taken years and we are still actively working on it. The early instruments were sufficient to demonstrate proof of principle for quantitative proteomics but not really capable of supporting biological studies. It was not until we had Orbitraps and high resolution that MaxQuant could be developed to actually analyze SILAC studies. It took many years to increase the multiplexing capacity and understand interference in isobaric labeling. High resolution allows the N- and C-form reagents to be distinguished, and the new tribrid instruments can run complicated acquisition strategies to reduce interference. Isobaric labeling is finally ready for biological experiments, at least on the front-end side. We do not even have easiTAG reagents available yet for the next chapter in isobaric labeling. ICAT was so ahead of its time that it missed its chance to find a modern instrument solution.

Label-free quantitation also seems like it has been around a long time. Dick Smith’s group published accurate mass tag papers in 2002. There has been work on MS1 feature quantification for an awfully long time, but we are still trying to get it right. BoxCar acquisition and MaxQuant Live are still in development to try and make the acquired data consistent enough to handle studies with larger numbers of samples. Chromatography stability is also being actively pursued to facilitate all of this. I could argue that we are still in the proof of principle stage, or not much past it. The other “label-free” method, spectral counting, had a surprisingly long run. It is a pretty poor measurement method, but all of the other “clearly superior” methods were just not ready for prime time. I am sure many people who ended up publishing spectral count studies tried other methods first. At the end of the day, you have to get something done the best way that you can.
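
For readers who have not used it, spectral counting really is as simple as it sounds: the number of identified spectra per protein is the quantitative value. Here is a minimal sketch (toy PSM table with made-up columns and numbers) of raw spectral counts plus the common length-normalized NSAF variant:

```python
import pandas as pd

# toy PSM table: one row per identified spectrum (hypothetical proteins and lengths)
psms = pd.DataFrame({
    "protein": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "length":  [450, 450, 450, 120, 120, 800],
})

# spectral counting: the quantitative value is just the number of PSMs per protein
spc = psms.groupby("protein").size()

# NSAF refinement: length-normalize each count, then scale so the values sum to 1
saf = spc / psms.groupby("protein")["length"].first()
nsaf = saf / saf.sum()

print(pd.DataFrame({"SpC": spc, "NSAF": nsaf.round(3)}))
```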

Data-independent acquisition is also pretty new and rapidly evolving. DIA is more complicated on many fronts, so figuring out where it sits on the usability spectrum is hard. I would guess somewhere past proof of principle, but still a ways out from routine biological studies. E. coli or yeast are not really proper stand-ins for human studies.

The survey had many interesting questions and responses. In the analysis notebook, Skyline and MaxQuant dominate (Skyline has more limited applications). Commercial software is harder to unpack. I think the popularity of Proteome Discoverer and Mascot might be lower if they were compared head-to-head with freely available options. The software features summary pretty clearly shows that there is only one price the community wants to pay for software, and that is zero. The downstream analysis tool responses are also biased. I do not think there are many MSstats users who produced their data outside of Skyline. I also do not think many people ever use Perseus on non-MaxQuant data.
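
As an aside, tallying a multi-select survey question like the tool question takes only a few lines of notebook code. A hypothetical sketch (the response format and the tool mix are made up, not the actual survey data) of how such counts could be produced with pandas:

```python
import pandas as pd

# hypothetical multi-select column: each response lists tools separated by semicolons
responses = pd.Series([
    "MaxQuant;Perseus", "Skyline;MSstats", "MaxQuant",
    "Proteome Discoverer;Mascot", "Skyline;MSstats", "MaxQuant;Perseus",
])

# split each response into its tools, flatten, and count mentions per tool
tool_counts = responses.str.split(";").explode().value_counts()
print(tool_counts)
```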

What do these survey results tell me? Software captivity or “users held hostage” is a big issue. The ability to mix and match analysis tools is beyond the capability of most users (or maybe most tools). Black box end-to-end solutions do not exist. I actually find the last point encouraging. I think black boxes are sometimes initially useful, but end up holding back innovation.

The inability to mix and match tools has a lot to do with how users are educated on these topics. Clearly, the design of the tools plays a part, too. Many tools are “designed” to work with other tools but never get out of the starting gate; most of the rest stumble not far from it (for many reasons). Methods sections in traditional publications are woefully inadequate as templates to follow for data analysis. Newer ways of communicating, like notebooks, would be a huge improvement, but there are not many examples in supplemental files yet. Summer schools and Nature Protocols papers tend to be silo-based: you will learn how to mix Skyline and MSstats, or how to mix MaxQuant and Perseus. Commercial tools do not like to acknowledge the existence of any other sources of tools; they want to hold you captive.

What do I propose that will “…help understand current issues and move (the) field forward”? I think quantitative proteomics is the right topic. The survey results can be used to raise questions a paper could answer. Would that be a quantitative software review? Or some sort of quantitative tutorial? I would say no to both. I argue that most software tools were designed before we knew enough to design such tools. This is not the fault of the software creators. Who would have figured it would take 15 years to work the bugs out of these methods? I am not sure there is all that much to review that actually works. Along the same lines, you need something that works to use as an example in a tutorial.

I think we should outline what functionality quantitative software should have. There are a few common types of quantitative experiments that are done frequently, but there are always many experiments that do not fit into pigeonholes. We cannot design for new experiments that we have not thought of yet, so we need flexible designs that let us tackle new things without starting over. Requirements might be somewhat different for the major types of experiments: stable isotope labels, label-free studies, DIA, or targeted. There may be some core functionalities that they share, and some analysis of what is similar and what is different could be useful. Many people do not work across these boundaries; some do and can provide perspective.
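
To make the shared-core idea a little more concrete, here is a minimal, purely hypothetical sketch (all class and method names are invented, not taken from any existing tool) of one way to factor a quantitative pipeline into acquisition-specific front ends over a common normalization and summarization core:

```python
from abc import ABC, abstractmethod
import pandas as pd

class QuantExperiment(ABC):
    """Shared core: normalization and protein summarization are common;
    only the step that extracts quantitative values differs by method."""

    @abstractmethod
    def extract_quant_values(self, raw_files) -> pd.DataFrame:
        """Return a feature-level table: a 'protein' column plus sample intensity columns."""

    def normalize(self, table: pd.DataFrame) -> pd.DataFrame:
        # placeholder strategy: median-scale each numeric sample column
        num = table.select_dtypes("number")
        table[num.columns] = num / num.median()
        return table

    def summarize_to_proteins(self, table: pd.DataFrame, protein_col: str = "protein"):
        # one common choice: sum feature intensities within each protein
        return table.groupby(protein_col).sum(numeric_only=True)

class TMTExperiment(QuantExperiment):
    def extract_quant_values(self, raw_files):
        ...  # read reporter ion intensities

class LabelFreeExperiment(QuantExperiment):
    def extract_quant_values(self, raw_files):
        ...  # integrate MS1 feature areas

class DIAExperiment(QuantExperiment):
    def extract_quant_values(self, raw_files):
        ...  # extract fragment chromatogram areas
```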

A serious problem so far has been too much focus on upstream workflow steps. It is really the last steps that are critical. I think a lot of quantitative proteomics experiments are falling down within sight of the finish line. Obviously, the most important step is the “last” step. Ha-ha. We need study designs that match biological complexity. We need larger numbers of samples and the statistical frameworks to handle more than 3x3 datasets. We need to do multi-factor designs and time courses. We need to integrate multi-omic datasets. We need to design quantitative tools that are able to do these things. Taking ratios of heavy labels (ICAT or SILAC) to light labels did not really cut it back in the day, and it is something proteomics needs to put in the rearview mirror.
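
As a toy illustration of what moving beyond simple ratios can mean, here is a hedged sketch of fitting a two-factor model with an interaction term to log2 intensities for a single protein (all numbers and factor names are made up); real tools fit related linear models across thousands of proteins at once rather than one at a time:

```python
import pandas as pd
import statsmodels.formula.api as smf

# toy long-format data for one protein: 2 treatments x 2 time points x 3 replicates
df = pd.DataFrame({
    "treatment": ["ctrl"] * 6 + ["drug"] * 6,
    "time_h":    [0, 0, 0, 24, 24, 24] * 2,
    "log2_intensity": [20.1, 20.3, 19.9, 20.2, 20.4, 20.0,
                       20.2, 20.1, 20.3, 22.0, 21.8, 22.1],
})

# model intensity as treatment, time, and their interaction
fit = smf.ols("log2_intensity ~ C(treatment) * C(time_h)", data=df).fit()

# the interaction term tests whether the time effect depends on treatment
print(fit.summary())
```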

The results of quantitative studies are going to be high-dimensional and will need new tools to explore and visualize them. The goal should be to create frameworks that facilitate actual biological queries. The data need rich annotations (functions, pathways, interactions, etc.). The concepts of gene sets need to be adopted and expanded. How are pathways and processes changing between conditions, or as a function of time? This is the real frontier for proteomics bioinformatics, and this intellectual landscape is almost completely unexplored.
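
For a flavor of what adopting gene sets could look like at the simplest level, here is a hypothetical sketch (the protein IDs, pathway memberships, and fold-change cutoff are all invented) of a basic over-representation test on protein-level fold changes using a hypergeometric distribution:

```python
import pandas as pd
from scipy.stats import hypergeom

# made-up inputs: per-protein log2 fold changes and toy pathway annotations
fold_changes = pd.Series({"P1": 1.8, "P2": 0.2, "P3": -1.5,
                          "P4": 2.1, "P5": 0.1, "P6": -0.3})
pathways = {"glycolysis": {"P1", "P4", "P5"}, "proteasome": {"P2", "P3", "P6"}}

changed = set(fold_changes[fold_changes.abs() > 1].index)  # crude cutoff
background = set(fold_changes.index)

for name, members in pathways.items():
    k = len(changed & members)  # changed proteins that are in this set
    # P(X >= k) when drawing len(changed) proteins from the background
    p = hypergeom.sf(k - 1, len(background), len(members), len(changed))
    print(f"{name}: {k}/{len(members)} changed, p = {p:.3f}")
```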

It would be really useful to have a paper describing what we need the next generation of quantitative tools to do to advance proteomics.