The Story Behind the PAW pipeline

06 Jun 2021

Sitting down with Dr. Ben Neely to discuss my PAW pipeline (video available here) got me thinking about what makes my pipeline different and how that came to pass.

Pipeline History

When I started doing proteomics in 2003, instruments like the Thermo LCQ Classic could acquire something like 750 MS2 scans in a typical 90-min LC gradient. There were also the early Q-TOF platforms (ABI Q-Star and Waters QTOF2) that could do maybe 150 MS2 scans per LC run. Large-scale separation experiments (MudPIT), with 30 or so fractions, could make huge datasets with 25K MS2 scans.

Coding of the PAW pipeline started in late 2006. We had gone from a Thermo LCQ Classic 3-D ion trap to an LTQ linear ion trap. The scan rate on the LTQ was dramatically faster (5 to 10 times) than the LCQ, and that translated directly into much larger data files with many more MS2 scans. The two software packages we were using at the time, BioWorks (Thermo) and Scaffold (Proteome Software), were having performance issues on our dual-core XP workstations. Those desktop computers had mighty Pentium 4 processors, 1 or 2 GB of RAM, and 250 GB spinning hard drives.

The low-resolution ion traps could only distinguish 1+ peptide charge states from higher charge states. With tryptic digests, most peptides are either 2+ or 3+. The solution was to make two identical MS2 scans (one assigned the 2+ charge state and one the 3+ charge state) for every peptide that was not 1+. This doubled the number of starting MS2 scans in already large datasets. The ID rates in the searched datasets were low (and cut in half by the duplicated scans), typically in the 10% to 20% range. That meant that 80% to 90% of the search results were not useful and just taxed the computing resources.

There was not much we could do about the large number of redundant MS2 scans on the low-resolution instruments, or about having to run a search engine to sort out which IDs were 2+ and which were 3+. The assumption was that only the scan with the correct charge state could return a peptide-spectrum match (PSM) scoring well enough to pass filtering. We were stuck with large starting dataset sizes and lengthy database searches no matter what. However, if we could find the 10% to 20% of scans that were confidently identified in the search results and keep only those, the subsequent pipeline steps would be much faster because they would work with much smaller data files.

The linear discriminant classifier function used in PeptideProphet (2001) for SEQUEST (1994) was established in the Trans-Proteomic Pipeline and used in Scaffold. I adopted it in my pipeline with some coefficient tweaks to better position the two score distributions across charge states. Around the time this was being developed, the target/decoy method was gaining traction. There were some initial efforts to do distribution fitting, but that was abandoned in favor of the empirical FDR estimates of the target/decoy method. Score distribution visualizations similar to those used for the distribution fitting were retained in the interactive GUI (graphical user interface) for setting score thresholds.

The discriminant function used a different calculation for each charge state. It was also common back in the dark ages of proteomics to search tryptic digests with non-specific cleavage (no enzyme) and then require fully- or semi-tryptic matches. Peptides without a tryptic cleavage motif on either end are unlikely to be correct matches when trypsin is used, so this strategy functioned as a coarse noise filter. These factors led to the FDR analysis being done in separate peptide subclasses (different charge states and different numbers of tryptic termini).

The pipeline modules were written in Python: SEQUEST results were processed, discriminant scores computed, and score histograms written out as tab-delimited text files. The histogram files were loaded into Excel template files for visualization and for setting score thresholds. The score cutoffs were used to keep only the confident matches, greatly reducing dataset sizes and making protein inference much more manageable. This approach was cumbersome but workable for low-resolution data.

The performance of Orbitrap instruments was game changing, and support for accurate masses needed to be incorporated. The overlaid target and decoy score histogram visualizations were very useful for cursory quality control assessment. This led to making histograms of delta masses (the difference between the measured mass and the theoretical mass of the assigned peptide sequence) in addition to the score distribution histograms. SEQUEST reported masses as MH+ ions (the 1+ charge state of the peptide) in Daltons, and the pipeline computes delta masses in Daltons. This has interpretation advantages because common delta masses like deamidation and the M1 isotopic peak have defined values (0.984 Da and 1.003 Da, respectively), in contrast to PPM scales.
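To make that concrete, here is a minimal Python sketch (illustrative only, not the actual pipeline code; the function and constant names are mine) of how a delta mass in Daltons can be computed from MH+ values:

```python
# Minimal sketch (not the PAW code) of a delta mass in Daltons on the MH+ scale.
PROTON = 1.007276  # proton mass in Da

def delta_mass_da(measured_mz, charge, theoretical_mh):
    """Convert a measured precursor m/z and charge to an MH+ mass (1+ ion),
    then return the difference from the theoretical peptide MH+ mass in Da."""
    measured_mh = measured_mz * charge - (charge - 1) * PROTON
    return measured_mh - theoretical_mh

# A precursor whose monoisotopic peak was picked one isotope too high
# shows up as a delta mass near +1.003 Da:
print(delta_mass_da(501.7764, 2, 1001.5420))
```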

SEQUEST searches of low-resolution data always needed wide precursor ion mass tolerances. A 2.5 Da window with average masses worked best for the LCQ and LTQ. When we started getting datasets from Orbitraps, the most natural idea was to keep doing the wide tolerance searches and then use the accurate masses as a filter for correct matches. This creates a design choice, as there are multiple ways one can use the accurate mass information.

In probability modeling frameworks, (somewhat) independent probabilities of correctness can be computed from the discriminant score distributions and from the delta mass distributions. These probabilities can be combined into a joint probability that the match is correct. Alternatively, accurate masses can be used in a post-validation step where the primary cut is made from the score distributions. This raises the question of what score cutoff to use before the accurate mass post-validation. Sensitivity will be improved if a more relaxed score cutoff passes more putative correct matches to the accurate mass test.
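For illustration, one naive way to combine two independent probabilities (assuming equal priors; a generic textbook combination, not necessarily how PeptideProphet does it) looks like this:

```python
def joint_probability(p_score, p_mass):
    """Naive combination of two (assumed independent) probabilities that a PSM
    is correct: one from the score model and one from the delta mass model.
    Assumes equal priors; a generic illustration, not the PeptideProphet formula."""
    both = p_score * p_mass
    return both / (both + (1.0 - p_score) * (1.0 - p_mass))

# A strong score paired with a poor delta mass gets pulled down:
print(round(joint_probability(0.95, 0.20), 2))  # about 0.83
```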

Examination of the delta mass histograms made it clear that correct matches cluster at very specific delta masses in narrow peaks. Most correct matches have a delta mass of 0.0 Da. There are also two peaks around 1.0 Da that come from two common sources. Asparagine residues followed by glycine residues can readily deamidate (+0.984 Da mass shift) during sample processing, much like methionine oxidation. As peptides get larger, the monoisotopic peak shrinks and the M1 first isotopic peak grows, so it is not uncommon for the M1 peak to be incorrectly assigned as the monoisotopic mass of the ion. This results in a +1.003 Da shift in delta mass. These two peaks are typically baseline resolved on Orbitraps (with common 120K resolution settings).

Analogous to splitting score distributions by charge and number of tryptic termini, it was logical to add these narrow delta mass regions to the peptide subclasses. By setting regions of interest in the delta mass histograms (by peptide charge) before score cutoffs are set, sensitivity is much improved over testing accurate masses afterwards. The order of pipeline steps can be important.
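A rough Python sketch of the bookkeeping (the window edges and PSM dictionary keys below are hypothetical placeholders; in the pipeline the regions are set by eye from the histograms before any score filtering):

```python
# Illustrative sketch only: assign each PSM to a delta mass window and a subclass.
# Window edges are made-up placeholders; in practice they are set from the histograms.
WINDOWS = {"0 Da": (-0.05, 0.05), "deamidation": (0.97, 0.99), "M1": (1.00, 1.01)}

def delta_mass_window(delta_mass):
    for name, (low, high) in WINDOWS.items():
        if low <= delta_mass <= high:
            return name
    return "outside"  # mostly incorrect matches land outside the narrow peaks

def subclass(psm):
    """Subclass key: (charge, number of tryptic termini, delta mass window)."""
    return (psm["charge"], psm["ntt"], delta_mass_window(psm["delta_mass"]))
```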

This approach of conditional score histograms (conditioned on accurate masses, charge states, tryptic status, and modification state) is conceptually simple but it does lead to a lot of score histograms to examine. This was no longer feasible with Excel. Luckily, we had a gifted undergraduate student, William Rathje, looking for a good summer 2014 project. Billy was able to create the interactive Python GUI application for managing the score histograms and setting the FDR cutoffs.

A final switch from Thermo’s SEQUEST (which became impossible to run by itself once Proteome Discoverer replaced BioWorks) to the open-source Comet completed the current form of the pipeline. The MS2 and SQT file formats were carried over from the SEQUEST days. The workflow consists of separate scripts for each step. In most cases the steps are simply run in order, and they could have been combined to make the pipeline easier to run. There are times, however, when you need to control the workflow by doing some file management between the steps, and these flexible ways to process data become more important in larger studies. A project only requires that the same FASTA protein file is used in all the searches, so data can be analyzed in subsets and then combined, data from different instruments (maybe needing different Comet settings) can be combined, and so on. Having to execute each step individually was retained (by choice) to allow this flexibility for larger projects.
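To give a feel for what "separate scripts run in order" means, here is a toy driver that chains hypothetical step scripts (the script names are placeholders, not the actual PAW module names); in practice each script is launched by hand so the user can intervene between steps:

```python
# Toy driver sketch: each pipeline step is its own script so the user can stop,
# inspect, or rearrange files between steps. Script names are hypothetical.
import subprocess
import sys

STEPS = ["convert_raw.py", "run_comet.py", "make_histograms.py", "set_thresholds.py"]

for step in STEPS:
    result = subprocess.run([sys.executable, step])
    if result.returncode != 0:
        sys.exit(f"step {step} failed with code {result.returncode}")
```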

Pipeline Steps

I detailed the pipeline steps in this blog post, and the pipeline has lots of documentation, so I will not repeat all of that here. I will make one observation about pipelines in general. Good pipelines do not perform well because they have some amazingly good trick up their sleeve; they perform well by not having any poorly performing steps. Most steps in a pipeline (and there are many steps) have a wide variety of algorithm choices, and only a few of those choices will actually work well. One or two weak steps is all it takes to ruin the results. Every step needs to be transparent, understandable, and evaluated for performance.

Accurate Masses and Wide Tolerance Searches

The accurate mass conditional discriminant score approach goes hand in hand with wide precursor mass tolerance searches. I have made the case for wide tolerance searches in this blog post. I usually use a 1.25 Da precursor mass tolerance (with monoisotopic masses) with Comet. I do not like the PPM scale because it is a unit-less ratio and does not have direct physical meaning.

Because of the historical use of wide tolerance searches and SEQUEST/Comet reporting masses on the MH+ scale, two common types of mass measurement discrepancies are handled automatically. We do not have to consider isotopic peak errors in the search engine settings: M1 peak errors will sit at +1.003 Da, well within our precursor tolerance (M2 errors are less likely for tryptic peptides, and the tolerance can be widened if the M2 peak needs to be captured). Deamidation of asparagine residues, like oxidized methionine, is common enough in most sample processing that it should be considered in the search. With a wide search tolerance, deamidation is also captured without having to specify it as a PTM. We only need to add variable oxidation of methionine to the Comet searches to cover the most common situations.
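The arithmetic behind that statement is trivial, but worth spelling out (the tolerance is the 1.25 Da value from above; the shifts are the defined masses already mentioned):

```python
# Both common delta mass shifts fall inside a 1.25 Da precursor tolerance,
# so neither needs to be declared in the search settings.
PRECURSOR_TOL = 1.25  # Da, wide-tolerance search on the MH+ scale
COMMON_SHIFTS = {"deamidation (N)": 0.984, "M1 isotope error": 1.003}

for name, shift in COMMON_SHIFTS.items():
    assert abs(shift) <= PRECURSOR_TOL, name
    print(f"{name}: +{shift} Da is inside the ±{PRECURSOR_TOL} Da window")
```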

Systematic mass calibration errors are easily accommodated. Most mass errors are extremely small on the 1.25 Da scale; they just move the location of the zero Da delta mass peak a little. Data acquired at different times may have different mass calibration errors, which can cause delta mass peak broadening or multiple peaks. Lock mass can also produce multiple peaks if the mass errors are at the limits of the lock mass correction range. The windows can be shifted or widened to accommodate that (or the data processed in batches and combined downstream). In this sense, the pipeline adapts to the data being analyzed through “you learning” instead of “machine learning”. You do not really gain much by having your machine learn things instead of you. You retain memory of datasets (after a little time) and will be capable of making judgment calls about the data that a machine probably cannot.
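If a systematic calibration offset does need accommodating, one simple (purely illustrative) way to estimate it before shifting the windows, reusing the window dictionary idea from the sketch above, would be:

```python
import statistics

def recenter_windows(delta_masses, windows, core=0.05):
    """Shift all delta mass windows by the median offset of PSMs near 0 Da.
    An illustrative take on 'shifting the windows'; in the pipeline this is
    a judgment call made while looking at the histograms."""
    near_zero = [dm for dm in delta_masses if abs(dm) <= core]
    offset = statistics.median(near_zero) if near_zero else 0.0
    return {name: (low + offset, high + offset)
            for name, (low, high) in windows.items()}
```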

Another way to think about the effect of the narrow delta mass windows on the score distributions is as an empirical posterior probability adjustment. Because the noise level is reduced to the small baseline that the narrow peaks sit on, the relative magnitudes of the correct and incorrect score distributions change when conditioned on the delta mass peaks, and that changes the score cutoff needed to achieve the desired FDR. These conditional score histograms also adapt to the dataset being analyzed in a natural, transparent way.

There is much built-in quality control that can be checked in the delta mass histograms: mass calibration accuracy, mass resolution, extent of deamidation, and extent of M1 triggering. M1 triggering can depend on instrument, acquisition settings, size of peptides, and precursor charge states.

Discriminant Score Transformation

The PeptideProphet discriminant function approach used in the pipeline is a static classifier. The classifier was trained on ancient LCQ data with SEQUEST search engine scores, but the basic functional form has stood the test of time. The PAW pipeline uses a slightly modified deltaCN term where the top score is compared to the average score of matches 4 through 12. This gap score is more tolerant of things like positional isomers in PTM searches. Some minor tweaking of the term coefficients was all that was needed to center the incorrect score distributions at zero and make the range of correct scores similar across charge states. There is no self-training on the data being analyzed (and all of those complications). Classifier performance could likely be improved by revisiting the functional form; Comet now computes an expectation value for each PSM that might be a useful feature to add, for example.
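The gap score idea fits in a few lines of Python (a sketch of the idea, not the pipeline's code; I normalize by the top score the way the usual deltaCN does):

```python
def gap_delta_cn(scores):
    """DeltaCN-like gap score: compare the top search score for a spectrum to the
    average of the hits ranked 4 through 12 (1-based), normalized by the top score.
    'scores' is the list of candidate scores for one spectrum, best first."""
    top = scores[0]
    trailing = scores[3:12]  # ranks 4-12; near-identical matches in ranks 2-3 are ignored
    avg = sum(trailing) / len(trailing) if trailing else 0.0
    return (top - avg) / top if top else 0.0
```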

Categorical terms (charge state, number of tryptic termini, etc.) are not explicitly included in the discriminant function. They end up being separate sets of score histograms in the pipeline (i.e., different peptide subclasses). In a machine learning approach, the categorical terms would likely be reward or penalty terms so that a single, unified classifier score could be computed for each PSM.

No model fitting of any distributions is needed with the empirical target/decoy FDR method. The visual approach also allows checking that the decoy score distributions are a good proxy for the incorrect target score distributions. Simple, sequence-reversed decoy databases (concatenated with the target sequences) have always produced sufficiently accurate estimates of the random incorrect target matches in the 20 years I have been doing this.
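The empirical estimate itself is about as simple as it gets (a sketch):

```python
def empirical_fdr(target_scores, decoy_scores, cutoff):
    """Empirical FDR at a score cutoff from a concatenated target/decoy search:
    decoy matches passing the cutoff stand in for the incorrect target matches."""
    targets = sum(s >= cutoff for s in target_scores)
    decoys = sum(s >= cutoff for s in decoy_scores)
    return decoys / targets if targets else 0.0
```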

Adaptive FDR Control

A static classifier differs from dynamic classifier methods (like Percolator), which have the freedom to change the weighting factors of their feature sets for each dataset being analyzed. I think that feature weights in Percolator are more similar across datasets than most people realize. The approach in my pipeline does not learn anything from the data, nor does it adapt to the data in any real sense. A better way to think about it would be that the delta mass and conditional score histograms reflect the dataset being analyzed.

The histograms track the scores of correct matches (if any) and incorrect matches across the peptide subclasses. The incorrect matches partition according to relative search space size (they are uniformly distributed across the full search space). Correct matches partition according to the samples being studied (in some sensible scientific pattern). The score locations of the correct and incorrect distributions are quite invariant across peptide subclasses and across datasets.

What does change are the relative magnitudes of the two score distributions, and that affects the location of the score cutoff that achieves the desired FDR (see this repo for more details). There is no common scoring scale (like Bayesian probabilities or other classifier outputs) used to determine a single, dataset-wide cutoff to control FDR. Instead, a different score cutoff is used in each peptide subclass to control the subclass FDR, so that the desired value is achieved for the entire dataset. This seems to work well in practice, although there is no guarantee that this approach maximizes PSM recovery.
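Here is a compact sketch of the per-subclass cutoff idea (again illustrative; the dictionary keys are hypothetical, and the real pipeline sets the cutoffs interactively from the histograms):

```python
from collections import defaultdict

def subclass_cutoffs(psms, desired_fdr=0.01):
    """For each peptide subclass, find the lowest score cutoff whose empirical FDR
    (decoys/targets at or above the cutoff) still meets the desired value.
    Assumes each PSM dict has 'subclass', 'score', and 'is_decoy' keys (hypothetical)."""
    by_class = defaultdict(list)
    for psm in psms:
        by_class[psm["subclass"]].append(psm)

    cutoffs = {}
    for key, group in by_class.items():
        group.sort(key=lambda p: p["score"], reverse=True)
        targets = decoys = 0
        best = None
        for psm in group:
            if psm["is_decoy"]:
                decoys += 1
            else:
                targets += 1
            if targets and decoys / targets <= desired_fdr:
                best = psm["score"]  # keep lowering the cutoff while the FDR holds
        cutoffs[key] = best          # None if no cutoff meets the desired FDR
    return cutoffs
```

Noisy subclasses (non-tryptic termini, delta masses outside the windows) end up needing much higher score cutoffs than clean ones, which is the whole point.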

Who knows what label to give the FDR control in my pipeline. Is it learning? Is it adaptive? Is it dynamic? Is it informed? Is it reflective? Call it what you want. It is understandable, visual, defensible, and tells you tons about the experiment and the dataset.

What’s Not in the Pipeline?

The pipeline does not use any data modeling or machine learning. My pipeline is proof that those methods are not needed (they may work okay in some contexts, though). Proper use of mass measurements and a sensible framework for FDR control are all that is necessary for excellent performance. You cannot model data you do not understand. Learning data characteristics from a model-free analysis framework will benefit any subsequent data modeling efforts. The first question to answer in data modeling is whether data modeling is needed at all.

Built-in Quality Control

There is a surprising amount of quality control available in the pipeline steps. The delta mass histograms check mass calibration, mass resolution, and some sample processing QC (deamidation, digestion performance [charge state distributions of peptides]). Overall PSM ID rates can tell you whether there was protein in the samples and can point to other issues if the ID rate is atypically low (poor LC or mass spec performance, use of the wrong FASTA file, etc.). My pipeline does not work well for more complicated database searching (many PTMs), but it is handy for checking whether a particular oddball modification is present in your samples. You will get separate score distributions for that modification and can see, at a glance, if there are high-scoring target matches. In addition to the QC information within a single dataset, trends across datasets and time can be monitored because all data are processed consistently. You learn what good samples and good measurements “look” like. It becomes easier to recognize poor datasets quickly, when there might still be time to fix the problems.

Quality control was never a conscious design goal of the pipeline, but it has been extremely useful over the years. Many of the analytical steps in modern proteomics are very demanding and, thus, a little fragile. Failures happen, and they can be hard to recognize in a timely fashion. We have caught many problems in our core facility with the pipeline in time to fix the issues and re-run the samples. The pipeline is also very effective when optimizing data acquisition settings on current instruments (so many choices…). The PSM count at a fixed FDR is an effective optimization metric.

My Historical View of Major Computational Advances

What concepts in computational proteomics have really contributed to advancing the field over its 27+ years? This is a space with a lot of noise: there are hundreds (thousands?) of papers on specific methods and algorithms to sift through. There are not very many major concepts at play, though, and some concepts seem to get too much attention. Everyone who has been around a while will have a different list. This is my list:

Percolator is an automated procedure for building a classifier. Percolator does not really need to be run on each dataset, and I do not like its black-box nature. I don’t care if my machine learns something or not; I want to learn something. Comparing and contrasting Percolator classifier functions across different instrument platforms, data types, etc., to learn what really distinguishes correct from incorrect matches is far more interesting to me than just getting a q-value results list.


Phil Wilmarth
June 6, 2021