Open Search Considerations

What are some things to be aware of when doing open or unrestricted searches?

Disclaimer: this was done quickly and without an exhaustive literature review. This is a blog entry, not a review article.

You may be hearing about open searches, using tools such as MSFragger or MetaMorpheus, to study post-translational (and co-translational) modifications these days. The idea is not new. It was explored quite a bit in the early-to-mid 2000s. The instrumentation and computer performance of the day were a brick wall we could not break through. Many tools were created to try to do this: OpenSea (first paper, second paper), Mann and Wilm, GutenTag, DirecTag, TagRecon, MODa, Inspect, the error tolerant option in Mascot, the wildcard option in Byonic, Interrogator, Paragon, Chick et al., and, to some extent, the refinement step in X!Tandem. Many other concepts and techniques were explored that I do not have time to try to recall.

At a high level, these are all attempts to match MS2 spectra to tryptic or semi-tryptic peptide sequences derived from known protein sequences, where the difference between the theoretical peptide mass and the measured mass can have any value in some defined large range (up to several hundred daltons). Your first thoughts should be “what a can of worms” or “that is insane”. Indeed. There were no Orbitraps back then. Q-tofs had better mass accuracy but poorer sensitivity. Ion traps had better sensitivity but only nominal mass accuracy. The instruments did not have more than one way to fragment peptides (ion traps used resonance CID; Q-tofs used beam-type CID, similar to what is now called HCD). Can you even imagine trying to do this without high resolution, high sensitivity fragment ion data? No proteomics problems are easy, but this was really hard.
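To make the mass difference idea concrete, here is a minimal Python sketch. It is not taken from any of the tools listed above; the function names and the tolerance values (0.02 Da for a narrow search, 500 Da for an open one) are assumptions chosen only for illustration. It computes the deltamass between a measured precursor and an unmodified candidate peptide, and shows how the acceptance test differs between the two kinds of search.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added for the peptide termini
PROTON = 1.00728   # mass of a proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def neutral_mass(precursor_mz, charge):
    """Neutral mass implied by a measured precursor m/z and charge."""
    return (precursor_mz - PROTON) * charge


def delta_mass(sequence, precursor_mz, charge):
    """Measured mass minus theoretical mass (the 'deltamass')."""
    return neutral_mass(precursor_mz, charge) - peptide_mass(sequence)


def passes_narrow_search(delta, tolerance=0.02):
    """Traditional search: candidate must match within a small tolerance."""
    return abs(delta) <= tolerance


def passes_open_search(delta, window=500.0):
    """Open search: any deltamass inside a wide window is a candidate."""
    return abs(delta) <= window
```

Every peptide in the database that falls inside the wide window becomes a candidate for every spectrum, which is where the search space explosion comes from.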

There are so many challenges here. Search space explodes. Search times move into geological time scales. How can we make the algorithms efficient enough to be usable? How can we control error rates? Can there be more than one modification in a peptide? Do we want to know where the PTM is located? Does the adduct have any chemical specificity? How can we validate the PTMs?

Before comparing and contrasting algorithms (which we won’t try to do), let’s think about the peptides in the sample. Some have adducts regulated by biological processes and are present before we start processing samples. Those are things we care about. Biological processes use enzymes and can be transient. For these and many more reasons, the modifications may be substoichiometric (not very abundant). Those modifications may be labile, or too large, or too heterogeneous, or affect chromatography, or affect fragmentation, or be too highly charged, or too lowly charged, etc. There are other biological processes that are not adduct formation, such as proteolytic processing, single amino acid mutations, alternatively spliced peptides, addition/removal of trafficking signals, etc. Some modifications can disrupt protein structures but have no associated mass shifts (isomerization and epimerization). There can be oxidative damage from many mechanisms, exposure to toxins, cross-linking, and the list goes on and on. Those are just some of the types of in vivo modifications that might be biologically relevant and present in your samples.

Once you have collected your samples, the biological assault on your proteins may be ending, but the chemical assault on your proteins and peptides is just beginning. Do you use protease inhibitors? Why? Is there oxygen gas in the air in your lab? Do your bench steps use any chemicals? Do any of your bench steps take more than a few seconds? Are you doing your lab work at -80 degrees C? Literally everything under the sun (including the sun) is attacking your samples: time, temperature, air, surfaces, every single solution, every sample manipulation, etc. How do you distinguish these in vitro modifications from the biologically interesting in vivo modifications?

Asking questions about PTMs is like any other scientific question. You have to use the scientific method and design a proper experiment. You need controls (probably multiple controls). Some examples might help. Any sample we study is going to have a mixture of in vivo modifications and artifactual in vitro modifications. We do not know which is which. They may not be mutually exclusive. Some in vitro mods might be the same kinds as the in vivo mods of interest (like oxidative damage). We may need to look for differential abundance changes of modifications. We might have biological samples that will not have the modifications we are interested in, or we might be able to express proteins without modifications. If we process those control samples, which lack the modifications of interest, in parallel with the samples of interest (same bench steps, LC and mass spectrometry, and data analyses), then they should have just the uninteresting in vitro modifications present at the same relative abundance as the in vitro background in the samples of interest. We can find the modifications of interest by subtracting this uninteresting background.
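The background subtraction idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the function names are hypothetical, deltamasses are binned by simple rounding, PSM counts stand in for abundance, and a pseudocount keeps the ratio finite when a bin is empty in the control. A real analysis would want replicates and intensity-based abundances.

```python
from collections import Counter


def delta_mass_histogram(deltas, decimals=2):
    """Count PSMs per deltamass bin (bins made by rounding, an assumption)."""
    return Counter(round(d, decimals) for d in deltas)


def enrichment_over_control(sample_deltas, control_deltas, pseudocount=1.0):
    """Ratio of normalized deltamass frequencies, sample over control.

    Ratios near 1 suggest shared in vitro background; large ratios flag
    deltamasses that are more abundant in the sample of interest.
    """
    sample = delta_mass_histogram(sample_deltas)
    control = delta_mass_histogram(control_deltas)
    n_sample = len(sample_deltas) + pseudocount
    n_control = len(control_deltas) + pseudocount
    ratios = {}
    for delta_bin in set(sample) | set(control):
        f_sample = (sample[delta_bin] + pseudocount) / n_sample
        f_control = (control[delta_bin] + pseudocount) / n_control
        ratios[delta_bin] = f_sample / f_control
    return ratios
```

With made-up inputs where a +15.99 shift is equally common in both samples and a +0.98 shift appears only in the sample of interest, the first ratio stays near 1 and the second is large. The numbers only rank candidates; they do not replace the experimental design.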

Some of the early work involved eye lens proteins. The proteins in the eye lens do not turn over. They are stuck in the lens for your lifetime. We are all doing decades-long protein damage incubation experiments on our lens proteins as we speak. Various chemical insults and radiation exposure accumulate for decades. That causes lens coloration, resulting in altered color perception and poor night vision. Modifications might create changes that affect lens flexibility, and the resulting loss of accommodation as lenses become stiffer leads to the need for reading glasses after age 40. Finally, we all get cataracts and lose lens transparency at some point if we live long enough. Cataract is kind of a protein aggregation disease with an oxidative stress component. Long story short, old lenses have lots of accumulated PTMs not found in young lenses. We work mostly with donor lenses to research these topics and we greatly appreciate the willingness of donors and their families to facilitate this. Sadly, that means we do have newborn and infant lenses to use as controls for aged lenses. The three papers listed below studied aged lens changes where we used a young lens as a control.

Tsur, D., Tanner, S., Zandi, E. et al. Identification of post-translational modifications by blind search of mass spectra. Nat Biotechnol 23, 1562–1567 (2005). https://doi.org/10.1038/nbt1168

Wilmarth, P.A., Tanner, S., Dasari, S., Nagalla, S.R., Riviere, M.A., Bafna, V., Pevzner, P.A. and David, L.L., 2006. Age-related changes in human crystallins determined from comparative analysis of post-translational modifications in young and aged lens: does deamidation contribute to crystallin insolubility?. Journal of proteome research, 5(10), pp.2554-2566.

Tanner, S., Payne, S.H., Dasari, S., Shen, Z., Wilmarth, P.A., David, L.L., Loomis, W.F., Briggs, S.P. and Bafna, V., 2008. Accurate annotation of peptide modifications through unrestrictive database search. Journal of proteome research, 7(1), pp.170-181.

We also looked at some techniques for trying to validate the unrestricted search results in these papers. These techniques are critical because datasets are large and manual validation of every single MS2 spectrum is not possible (until we train the robots). We looked at traditional searching using a richer set of putative PTMs to try to confirm results. Abundance-based PTM frequency tables are also very useful. We end up with sharp peaks for residue-specific modifications and ridges for peptide/protein N-terminal modifications. We used spectral counts for abundances, but intensity-based estimates would be better. Prior knowledge also needs to be leveraged. Novel PTMs in very low abundance proteins (without specific PTM enrichments) are not likely. Some abundance-ranked list of the proteins in the sample should be used to focus on the major novel PTMs associated with the higher abundance proteins. There may also be prior studies supporting types of modifications or biological processes that are expected to be important. Any internal negative controls are also critical for assessing error rates. This is a relatively unexplored informatics topic where the return on investment would be high.
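A PTM frequency table of the kind described above can be sketched as follows. This is a hedged illustration, not the code used in the papers: the input format (peptide, zero-based modified site, deltamass), the rounding to nominal mass, and the function names are all assumptions, and spectral counts are used only because they are the simplest weight.

```python
from collections import Counter, defaultdict


def frequency_table(psms):
    """Count PSMs per (residue, nominal deltamass) cell.

    psms: iterable of (peptide, site_index, delta_mass) tuples, where
    site_index is the zero-based position the search assigned the shift to.
    """
    table = Counter()
    for peptide, site, delta in psms:
        table[(peptide[site], round(delta))] += 1
    return table


def residue_specificity(table):
    """For each deltamass, report the top residue and its share of counts.

    A share near 1 looks like a sharp residue-specific peak; a share
    spread over many residues looks like an N-terminal style ridge.
    """
    by_delta = defaultdict(Counter)
    for (residue, delta), count in table.items():
        by_delta[delta][residue] += count
    summary = {}
    for delta, counts in by_delta.items():
        residue, top = counts.most_common(1)[0]
        summary[delta] = (residue, top / sum(counts.values()))
    return summary
```

Replacing the `+= 1` with an intensity value would turn this into the intensity-weighted version suggested in the text.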

Parsimony thinking is also applicable. Is one relatively rare large deltamass PTM likely to be the result of two more common PTMs on one residue? An example is an oxidized and acetylated methionine at a protein N-terminus. M+58 is more likely to be M+16+42. You always need to remember that the first residue in a peptide is really two sites: the residue and the peptide N-terminus. In-source fragmentation can cause ragged peptide termini that may show up as large negative deltamasses that match sums of amino acids. Some PTMs may be amino acid substitutions. You get the idea. There are many alternative interpretations to pursue. It is not clear if this needs to be interactive using domain knowledge expertise or can be more fully automated.
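The M+58 example can be checked mechanically. The sketch below is an assumption-laden illustration: the short list of common modifications, the 0.01 Da tolerance, and the function name are mine, not from any published tool. It asks whether an observed deltamass matches a single common modification or the sum of two.

```python
from itertools import combinations_with_replacement

# Monoisotopic mass shifts (Da) for a few common modifications.
# This short list is illustrative, not exhaustive.
COMMON_MODS = {
    "oxidation": 15.994915,
    "acetyl": 42.010565,
    "methyl": 14.015650,
    "deamidation": 0.984016,
    "phospho": 79.966331,
    "carbamyl": 43.005814,
}


def explain_delta(delta, mods=COMMON_MODS, tol=0.01):
    """Return (singles, pairs) of common mods whose mass matches delta."""
    singles = [name for name, mass in sorted(mods.items())
               if abs(mass - delta) <= tol]
    pairs = [
        (name_a, name_b)
        for (name_a, mass_a), (name_b, mass_b)
        in combinations_with_replacement(sorted(mods.items()), 2)
        if abs(mass_a + mass_b - delta) <= tol
    ]
    return singles, pairs
```

Calling `explain_delta(58.0055)` with this list finds no single match and one pair, acetyl plus oxidation. A longer modification list will usually return several competing explanations for the same deltamass, which is exactly why the alternatives need to be weighed rather than accepted automatically.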

Now that we have more publicly available data, it may be possible to compile lists of common in vitro artifacts seen across many datasets. Such knowledge could be helpful in interpreting open search results. Maybe it needs to be said that mining data from repositories where proper PTM study designs were not used is not a great idea if you do not have a deep understanding of the limitations on interpreting any novel PTMs you might find.

One last thought: what are your goals for finding PTMs? For routine peptide identifications, we can tolerate a 1% FDR, or a 1 in 100 error rate. What do you need for PTMs? Are your PTMs at greater than 1 in 100 levels relative to the unmodified peptide? Is that above the error rate? Are you looking for a one in a million PTM? What kind of error control do you need to have confidence in that? Promiscuous chemical modifications can make a large number of distinct modified peptides. Counting PSMs might suggest that they are abundant. However, they may not be collectively very abundant on other scales. We have tremendous sensitivity these days. We may identify very low abundance peptides. We need to be thinking in quantitative terms for PTMs, too. We need to measure PTM relative abundances with intensity scales rather than with spectral counting.
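The difference between counting PSMs and using intensities is easy to show with a toy calculation. The sketch below uses hypothetical function names and a made-up input format (a modified flag, a PSM count, and a summed intensity per peptide form); it is not a substitute for a real quantitative workflow.

```python
def occupancy(modified_intensity, unmodified_intensity):
    """Fraction of a peptide's signal carried by the modified form."""
    total = modified_intensity + unmodified_intensity
    return modified_intensity / total if total > 0 else 0.0


def modified_share(peptide_forms):
    """Compare the modified share by PSM count and by intensity.

    peptide_forms: list of (is_modified, psm_count, intensity) tuples.
    Returns (share_by_psm_count, share_by_intensity).
    """
    total_psms = sum(count for _, count, _ in peptide_forms)
    total_intensity = sum(inten for _, _, inten in peptide_forms)
    mod_psms = sum(count for mod, count, _ in peptide_forms if mod)
    mod_intensity = sum(inten for mod, _, inten in peptide_forms if mod)
    by_count = mod_psms / total_psms if total_psms else 0.0
    by_intensity = mod_intensity / total_intensity if total_intensity else 0.0
    return by_count, by_intensity
```

With invented numbers, say one unmodified peptide with 10 PSMs at high intensity and ten distinct modified forms with 1 PSM each at a thousandth of that intensity, the modified forms account for half of the PSMs but about 1% of the signal, which is right at the level of a 1% FDR.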

I wanted to share some of my experience and draw attention to some earlier work that may be overlooked when learning and thinking about open searches. This is very hard work. Actually, it was a bit too hard for the previous generations of instruments. New instrument advances are true game changers for PTM studies. Computational performance has also dramatically improved in the 15-20 years since we started looking for modified peptides. Happy hunting!

Phil Wilmarth July 3, 2020