What (and when) to do with zeros in TMT data?

One of the advantages of TMT labeling is that there is much less missing data than with other methods such as MS1 feature-based label-free quantitation. That does not imply that there is no missing data. The reporter ions are measured in single scans, and some of the reporter ions may be below the detection limits for lower abundance PSMs. If isotopic purity correction factors are used, some channels might have all of their intensity accounted for by adjacent channels. There are also cases where a PSM can have a sequence assignment but not be abundant enough to have any reporter ions. These scenarios are more like true missing-at-random cases.

If multiple TMT labelings are required in a larger experiment, then some proteins (with their associated peptides and PSMs) will be unique to individual TMT experiments. For these proteins, the data are missing for entire TMT experiments rather than for individual channels. There can even be cases where there is evidence for a protein in one TMT experiment but there were no reporter ions, and that protein was not seen in the other TMT experiments. Then you have a protein with no reporter ions at all.

Some of you will have noticed that I have been talking about reporter ions from PSMs and from proteins. It is advantageous to aggregate (sum) reporter ions up to the level that is appropriate for the question being asked. If the experiment is testing for differential protein expression, then protein level aggregation is called for. If the experiment is something like a phosphopeptide enrichment experiment, then a more peptide-centric aggregation is needed. Should something be done about replacing zeros before aggregation or after aggregation?

First, the what part. Zeros will cause problems in many mathematical steps later in the data analysis (log transformations and denominators in ratios), so we usually want to replace them with some non-zero value. There are many options here. Some methods try to determine a realistic value of what the data might have been. Other methods choose something that solves the math issues but is still recognizable as missing data. I prefer the latter. In Fusion (not Fusion Lumos) SPS MS3 datasets, the smallest non-zero reporter ion peak heights (intensities) are about 300. I usually replace zeros with 150 (I have used values from 50 to 150). At the end of the day, I want to be able to know when there was missing data. I want a value different enough from the regular data to be easily distinguished.

If we replace zeros at the PSM level and we are aggregating up to higher levels, then we can get multiples of the zero-replacement value (150, 300, 450, etc.) and that can make recognizing replaced zeros harder afterwards. I think it is safer to wait until after the aggregation to test for zeros and then do any replacements. I like to do the replacements before doing normalization steps. The single factor normalization adjustments tend to fuzz out the replacement values some, and that affects the statistical testing a bit less. We do not want several replaced zeros for a condition to seem like a low variance data point and trigger a false positive DE call. However, if we can recognize the replaced zeros, this is less of an issue during validation.
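
To make the order of operations concrete, here is a minimal Python sketch that sums PSM reporter ions up to the protein level and only then replaces zeros. It is an illustration under stated assumptions, not code taken from the pipeline described here: the protein labels, the intensities, and the function names are all made up, and 150 is simply the replacement value mentioned above.

```python
ZERO_REPLACEMENT = 150.0  # replacement value from the text; adjust per instrument


def aggregate_psms(psms):
    """Sum reporter ion intensities channel by channel for each protein."""
    totals = {}
    for protein, intensities in psms:
        current = totals.get(protein, [0.0] * len(intensities))
        totals[protein] = [t + x for t, x in zip(current, intensities)]
    return totals


def replace_zeros(values, replacement=ZERO_REPLACEMENT):
    """Swap exact zeros for the replacement value; leave other values alone."""
    return [replacement if v == 0 else v for v in values]


# Hypothetical PSMs: (protein label, reporter ion intensities for 4 channels).
psms = [
    ("P1", [1200.0, 0.0, 900.0, 0.0]),
    ("P1", [800.0, 0.0, 0.0, 0.0]),
    ("P2", [0.0, 400.0, 350.0, 500.0]),
]

# Aggregate first, then replace: every replaced zero shows up as exactly 150.
late = {p: replace_zeros(v) for p, v in aggregate_psms(psms).items()}

# Replace first, then aggregate: replaced zeros pile up (300) or get hidden
# inside a real sum (900 + 150 = 1050).
early = aggregate_psms([(p, replace_zeros(v)) for p, v in psms])

print(late["P1"])   # [2000.0, 150.0, 900.0, 150.0]
print(early["P1"])  # [2000.0, 300.0, 1050.0, 300.0]
```

In the second result, the 1050 in the third channel no longer looks like it contains missing data at all, which is the recognition problem described above.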

We have one more wrinkle in all of this. We might want to filter out data points where we have too many missing data values. How do we do that in a less heavy-handed way? The degree of missing data in any proteomics experiment is mostly isolated to the lowest abundance signals. Right near the detection limit, things are pretty variable. You might get a signal, or you might not. As you get even a little above the noise level, things get much more consistent. Missing data will drop off like a rock. Removing some of the lowest signal level PSMs can dramatically reduce missing data.

What I have adopted in my pipeline is to test a trimmed average intensity against a specified value (I like 500). The trimming is just removing the highest and lowest intensity. When a PSM’s trimmed average does not pass the test threshold, all of the channels are zeroed out, but the PSM is retained.
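
The test can be sketched in a few lines of Python. This is only an illustration of the idea, with hypothetical function names and made-up intensities; whether a trimmed average exactly equal to the threshold passes is an assumption here (it passes in this sketch).

```python
def trimmed_average(intensities):
    """Average after dropping the single highest and single lowest value."""
    if len(intensities) < 3:
        raise ValueError("need at least 3 channels to trim both ends")
    kept = sorted(intensities)[1:-1]
    return sum(kept) / len(kept)


def apply_intensity_test(intensities, threshold=500.0):
    """Zero out all channels of a PSM whose trimmed average is below threshold.

    The PSM itself is kept (same number of channels), only its values change.
    """
    if trimmed_average(intensities) < threshold:
        return [0.0] * len(intensities)
    return list(intensities)


# Hypothetical 10-channel PSMs.
weak_psm = [350.0, 0.0, 400.0, 0.0, 380.0, 0.0, 0.0, 360.0, 0.0, 370.0]
strong_psm = [1200.0, 1100.0, 1300.0, 1250.0, 1150.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(trimmed_average(weak_psm))         # 182.5, fails the test
print(apply_intensity_test(weak_psm))    # all ten channels become 0.0
print(trimmed_average(strong_psm))       # 587.5, passes the test
print(apply_intensity_test(strong_psm))  # unchanged
```

Because the failing PSM is kept with all-zero channels, it still counts as an identification but contributes nothing when reporter ions are summed to the peptide or protein level.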

I have compared median intensity to the trimmed average and I like the trimmed average better. I mostly work with 10- or 11-plex data. Consider a 5 by 5 experiment where the protein is not present in one condition. If the PSM is low abundance, then the 5 channels where the peptide is present might be near the lowest levels, 350-ish. The other 5 might be all zeros. The trimmed average would be about half of 350. That would also be the median in this case with an even number of data points. However, if one of the 350 values is missing, then the median jumps to zero. The median in these cases can be less stable than the average. The average is made more robust to an intense interference in one channel by trimming. The trimmed average test passes through PSMs in cases where the protein is not present in a condition when the intensities of the other channels are high enough to get the average over the test threshold.
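
The arithmetic in the 5 by 5 example can be checked with a short Python sketch. The intensities are the idealized values from the paragraph above (exactly 350 or 0), plus one invented interference value of 5000; the helper name is hypothetical.

```python
from statistics import mean, median


def trimmed_average(intensities):
    """Average after dropping the single highest and single lowest value."""
    kept = sorted(intensities)[1:-1]
    return sum(kept) / len(kept)


# 5 channels near the detection limit, 5 channels with no signal.
five_present = [350.0] * 5 + [0.0] * 5
# Same PSM, but one of the 350 values dropped out.
four_present = [350.0] * 4 + [0.0] * 6
# Same as the first case, but one channel has an intense interference.
with_spike = [350.0] * 4 + [5000.0] + [0.0] * 5

print(median(five_present), trimmed_average(five_present))  # 175.0 175.0
print(median(four_present), trimmed_average(four_present))  # 0.0 131.25
print(mean(with_spike), trimmed_average(with_spike))        # 640.0 175.0
```

Losing a single value takes the median from 175 to 0, while the trimmed average only drops to about 131. The last line shows the other half of the argument: a plain average is pulled up to 640 by one interfering channel, but trimming removes the spike and gives 175 again.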

Some final comments. I have a pipeline that can start with Thermo RAW files and provide final protein summary tables with TMT intensities suitable for processing in R. That uses ProteoWizard and the Comet search engine (up through 2016 versions). Those tools require a Windows OS.

My pipeline can also process Proteome Discoverer PSM export files (it will need the FASTA protein file used in the search) where Percolator has assigned q-values. The second half of the pipeline will run on other platforms (macOS). The exported file can be restricted to PSMs that meet a desired q-value cutoff (0.05 is okay for protein expression, 0.01 is better for phosphopeptides). PD exports contain the reporter ion values. My processing of PD exports does not use any of the protein inference from PD. It only uses the PSM measured and theoretical masses, q-values, sequence string, protein accession, and reporter ions. There is no PSM FDR control per se other than delta mass and q-value cutoff. PD 1.4 does not include any decoy matches so no target/decoy counting is even possible.