The data is from the ABRF iPRG 2015 study and is described in (Choi-2017). Six proteins were prepared in 4 different abundance mixes and spiked into a yeast cell lysate background. Each of the 4 spike-in experiments was analyzed in triplicate on a Q-Exactive instrument, with a single LC run per sample. It is not clear from the publication whether the triplicate runs are technical replicates or biological replicates.
The RAW files were downloaded along with a yeast FASTA file. RAW files were converted to MS2 files using MSConvert (Kessner-2008; Chambers-2012) and scripts that are part of the PAW pipeline (Wilmarth-2009). Comet (Eng-2013) searches used a wide parent ion mass tolerance (Hsieh-2009) to maximize the number of peptide identifications with the target/decoy (Elias-2007) validation method. Other parameters were trypsin cleavage, M+16 variable mod, C+57 static mod, and a 1.0005 Da fragment ion tolerance (a smaller tolerance better suited to the high-resolution Q-Exactive fragment spectra should have been used). Accurate peptide delta masses (differences between theoretical and measured masses) were used to create conditional score histograms and filter peptide matches to a 1% FDR.
The yeast FASTA file distributed with the RAW files has the sequences of the 6 spike-in proteins hidden amongst the yeast proteins: six random yeast proteins had their sequences replaced with the spike-in sequences, but the accessions and descriptions were not altered. I added the 6 spike-in sequences to the database along with common contaminants and created an appropriate target/decoy database using utilities from this repository.
Protein inference followed basic parsimony logic and was applied experiment-wide (using data from all 12 samples). To make the final list, a protein had to have at least two peptides in at least one of the samples.
The goal here is to demonstrate some R and edgeR (Robinson-2010) statistical testing that can be used for spectral counting (Liu-2004) semi-quantitative experiments. The PAW pipeline has some features for shotgun quantification that we will be using. One is a protein grouping algorithm where highly homologous proteins (the majority of peptides are shared, with few unique peptides) are grouped together and the shared/unique status of peptides is updated accordingly. This extended parsimony logic is described in more detail here. The other feature is the splitting of shared peptide spectral counts based on relative unique peptide evidence.
EdgeR has been used in some other spectral counting experiments I have been involved with. The Fei et al. 2011 study had smaller replicate numbers and batch effects. Smith et al. 2018 was a large (over 5 million MS2 spectra) paired study design. I may do a notebook for the latter study to demonstrate paired study designs.
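For a paired design like the Smith et al. 2018 study, the edgeR/limma linear modeling route blocks on subject. Here is a minimal base R sketch of such a design matrix (the subject and condition labels are hypothetical, not from that study):

```r
# 4 subjects, each measured under two conditions (a paired design)
subject <- factor(rep(1:4, each = 2))
condition <- factor(rep(c("ctrl", "trt"), times = 4))

# blocking on subject absorbs between-subject variation; the condition
# coefficient then tests the average within-pair difference
design_paired <- model.matrix(~ subject + condition)
colnames(design_paired)
```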
Chambers, M.C., Maclean, B., Burke, R., Amodei, D., Ruderman, D.L., Neumann, S., Gatto, L., Fischer, B., Pratt, B., Egertson, J. and Hoff, K., 2012. A cross-platform toolkit for mass spectrometry and proteomics. Nature biotechnology, 30(10), p.918.
Choi, M., Eren-Dogu, Z.F., Colangelo, C., Cottrell, J., Hoopmann, M.R., Kapp, E.A., Kim, S., Lam, H., Neubert, T.A., Palmblad, M. and Phinney, B.S., 2017. ABRF Proteome Informatics Research Group (iPRG) 2015 Study: Detection of Differentially Abundant Proteins in Label-Free Quantitative LC–MS/MS Experiments. Journal of proteome research, 16(2), pp.945-957.
Elias, J.E. and Gygi, S.P., 2007. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature methods, 4(3), p.207.
Eng, J.K., Jahan, T.A. and Hoopmann, M.R., 2013. Comet: an open‐source MS/MS sequence database search tool. Proteomics, 13(1), pp.22-24.
Fei, S.S., Wilmarth, P.A., Hitzemann, R.J., McWeeney, S.K., Belknap, J.K. and David, L.L., 2011. Protein database and quantitative analysis considerations when integrating genetics and proteomics to compare mouse strains. Journal of proteome research, 10(7), pp.2905-2912.
Hsieh, E.J., Hoopmann, M.R., MacLean, B. and MacCoss, M.J., 2009. Comparison of database search strategies for high precursor mass accuracy MS/MS data. Journal of proteome research, 9(2), pp.1138-1143.
Kessner, D., Chambers, M., Burke, R., Agus, D. and Mallick, P., 2008. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics, 24(21), pp.2534-2536.
Liu, H., Sadygov, R.G. and Yates, J.R., 2004. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical chemistry, 76(14), pp.4193-4201.
Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.
Smith, J.R., David, L.L., Appukuttan, B. and Wilmarth, P.A., 2018. Angiogenic and immunologic proteins identified by deep proteomic profiling of human retinal and choroidal vascular endothelial cells: potential targets for new biologic drugs. American journal of ophthalmology, 193, pp.197-229.
Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.
# load the R packages used in this notebook
library(edgeR)
library(limma)
library(tidyverse)
library(psych)
library(RColorBrewer)
There were 12 LC runs on the Q-Exactive that generated 383,054 MS2 scans. The Comet/PAW pipeline is sensitive: 238,365 scans passed the accurate mass and score thresholds. There were 2,489 REVERSED PSMs for an FDR of 1.04%, and the overall MS2 identification rate was 62%. The PSMs mapped to a final list of 2,879 proteins (50 REVERSED) using the requirement of at least two peptides per protein in at least one sample.
We will start with the text summary file produced by the protein grouping script. That file has some extra lines above the main table and some additional lines after it. We can take care of the top lines during file reading, and we will use some dplyr (tidyverse) techniques for filtering rows. We need to drop the trailing lines and double-check what has been automatically flagged as a common contaminant. In particular, we need to make sure that the extra BSA sequence has not been lumped in with the contaminants. We will ignore any proteins with entries in the Filter column (contaminants and reversed sequences).
# try and start with the actual results file and wrangle data
# Summary table has lines before and after that are not strictly in table
# skip the leading lines when loading the table and read in just the table rows
full <- read_tsv("grouped_protein_summary_8.txt", skip = 4,
                 n_max = 2848, guess_max = 2848)
# check what are contaminants (we do not want any EXTRA accessions excluded)
check <- full %>%
  select(Accession, Description, Filter) %>%
  filter(Filter == "contaminant")
check
We have read in the slightly messy summary file, dropped prefix and suffix lines, and made sure that bovine serum albumin is not being excluded as a contaminant. We will want to do some prep before getting count data for statistical testing. We want a minimum count cutoff to avoid the ragged lower end of the identifications.
We will add a new column that is the average corrected spectral counts and set that to zero for contaminants and decoys. We can also add a count of how many samples had missing data (counts of zero). Then we will sort that table by decreasing average corrected count. We also need a parsed accessions column (no repeat counts in parentheses).
# add the new columns to data frame and sort
full <- full %>%
  mutate(acc = sapply(strsplit(Accession, " "), `[`, 1)) %>%
  mutate(ave = ifelse(is.na(Filter), rowMeans(select(., starts_with("Corrected_"))), 0.0)) %>%
  mutate(missing = apply(select(., starts_with("Corrected_")) == 0, 1, sum)) %>%
  arrange(desc(ave))
head(full)
nrow(full)
In any shotgun experiment, more things can be identified than can be quantified. For quantification we need data points for all samples. The "bottom" of the dataset will be proteins seen in one or a few samples. We will need to exclude those. How do we identify them? An ad hoc approach is to consider a two-condition experiment. These are often some sort of biomarker experiment so there might be a protein in one condition that is not in the other. A very good rule for protein identification is two peptides per protein. That means that an SpC (spectral count) of 2 is at the detection limit. We want something above that lower limit to define a protein as confidently "present". A value of 5 is a good robust spectral count value; few decoy proteins will ever have an SpC of 5 or greater. If we have 5 counts in one condition and no counts in the other, then an average SpC of 2.5 could be a good cutoff value. Can we use the data itself to set the cutoff?
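The arithmetic behind that ad hoc cutoff is easy to verify (a minimal base R sketch with hypothetical counts):

```r
# hypothetical two-condition experiment, triplicate runs per condition:
# present at the detection limit in one condition (SpC = 5), absent in the other
spc <- c(5, 5, 5, 0, 0, 0)

# the experiment-wide average lands exactly at the proposed cutoff
mean(spc)  # 2.5
```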
ggplot(full, aes(x = ave, y = missing)) +
  geom_point() + geom_smooth(method = "loess") +
  ggtitle("Missing versus Average SpC") + labs(x = "Ave SpC", y = "Missing")
We need to expand the x-axis to see where the increase starts.
ggplot(full, aes(x = ave, y = missing)) +
  geom_point() + geom_smooth(method = "loess") +
  coord_cartesian(xlim = c(0, 10)) +
  ggtitle("Missing versus Average SpC") + labs(x = "Ave SpC", y = "Missing")
The data happens to confirm our intuition about what a good average SpC cutoff might be. We will use an average SpC of 2.5 or greater. That will eliminate most missing data. EdgeR can tolerate a few zero counts, so we will not have to worry about any data imputation.
We need to get the corrected spectral count columns for proteins with an average SpC of 2.5 or greater into a data frame. In any bottom-up proteomics study, what to do with shared peptides is an issue. The quantitative information from shared peptides is much less reliable than that from unique peptides. The PAW pipeline redefines the status of shared and unique peptides during protein inference. The initial context is the protein database. Once indistinguishable proteins are grouped and proteins without any independent evidence (subsets) are removed (parsimony), the context becomes the list of identified proteins. The PAW pipeline also has a protein grouping step to combine proteins with highly homologous peptide sets (actins, tubulins, etc.). These changing contexts shift many shared peptides to unique peptides.
There will still be shared peptides. They can be excluded, they can be lumped into the "best" protein that contains them (some definition of "best" is needed in this "all or nothing" strategy), or they can be split into fractional counts based on relative abundance estimates (from the unique counts). The PAW pipeline does the latter. These "corrected" counts will be floating point numbers for some proteins and will need to be rounded to integers.
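The splitting idea can be sketched in a few lines of base R (the protein names and counts are made up; this illustrates the proportional-splitting concept, not the actual PAW implementation):

```r
# two proteins that share some peptides: unique SpC plus a pool of shared SpC
unique_spc <- c(protA = 30, protB = 10)
shared_spc <- 8

# split the shared counts in proportion to the unique count evidence
fractions <- unique_spc / sum(unique_spc)
corrected <- unique_spc + shared_spc * fractions
corrected        # protA = 36, protB = 12 (floating point in general)
round(corrected) # edgeR wants integer counts
```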
# get proteins (rows) where average SpC is greater than 2.5
# edgeR will want integer counts so round the corrected counts
counts <- full %>%
  filter(ave >= 2.5) %>%
  select(starts_with("Corrected")) %>%
  round(., 0)

accessions <- full %>%
  filter(ave >= 2.5) %>%
  select(acc)
# create a frame to hold results from edgeR
results <- accessions
We will see below that the total counts are similar between runs and that the compositions are also very similar (it is mostly the same yeast background), so we do not need to worry about normalization before doing some data visualizations. We will check how similar the 3 replicates of each of the 4 samples are (with linear and log plots). The average SpC cutoff of 2.5 eliminates a lot of missing data, but there are still some zero counts. edgeR will not complain about a few zeros; the log plotting will, so we will add one to the counts for the log plots.
S1 <- 1:3
S2 <- 4:6
S3 <- 7:9
S4 <- 10:12
pairs.panels(counts[S1], main = "Sample 1")
pairs.panels(log10(counts[S1]+1))
pairs.panels(counts[S2], main = "Sample 2")
pairs.panels(log10(counts[S2]+1))
pairs.panels(counts[S3], main = "Sample 3")
pairs.panels(log10(counts[S3]+1))
pairs.panels(counts[S4], main = "Sample 4")
pairs.panels(log10(counts[S4]+1))
We will use the counts and accessions that we created in R above from the main summary file and load data into some edgeR data structures. We will run the built-in TMM normalization and see if the samples have any structure in a cluster plot. With only 6 spike-in proteins out of 1742 yeast proteins, there may not be much to drive any sample differences.
# load the data into edgeR data structures
# group labels need to be factors
group <- factor(c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)))
yglm <- DGEList(counts = counts, group = group, genes = accessions$acc)
# run the TMM normalization (and library size corrections)
yglm <- calcNormFactors(yglm)
# check clustering (6 different out of 1748 may not do much)
plotMDS(yglm)
Four samples with some protein differences might be a time course, or a control group and several treatments, etc. Something like an ANOVA might be appropriate. edgeR can handle more complicated study designs through its linear modeling options. Let's try an ANOVA-like test. To do that, we will have to create a model matrix with an intercept term.
There are other ways to create linear model matrices; the topic is beyond the scope of this notebook. The edgeR and limma user's guides are good places to start. Some R books and statistics books that use R have useful details, and R blogs and other online content can also help. I am still looking for a really good information source on this topic.
# create the experimental design matrix
design <- model.matrix(~group)
rownames(design) <- colnames(yglm)
design
# estimate the dispersion parameters and check everything
yglm <- estimateDisp(yglm, design)
# yglm
# fit statistical models (design matrix already stored in yglm$design)
fit <- glmQLFit(yglm)
# this tests whether any condition caused differences (like an ANOVA);
# a descriptive name avoids masking the base R any() function
anova_like <- glmQLFTest(fit, coef = 2:4)
topTags(anova_like)
summary(decideTests(anova_like))
Visualizing multi-condition results like an ANOVA is difficult. Another way to explore this data is to look at samples in pairwise comparisons.
If we wanted to look at pairwise comparisons between the different samples, we would want a design matrix without an intercept term; then we could test various contrasts (see the edgeR user's guide for many more details). Another option is to use the classic exact test in edgeR. We can still estimate dispersions experiment-wide to make the statistical testing more robust.
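For reference, here is what a no-intercept (cell-means) design matrix for this 4-group layout looks like, built with base R's model.matrix (a sketch only; the analysis below uses the exact test instead):

```r
group <- factor(c(rep("S1", 3), rep("S2", 3), rep("S3", 3), rep("S4", 3)))

# ~ 0 + group drops the intercept so each column indicates one sample group
design_pairs <- model.matrix(~ 0 + group)
colnames(design_pairs) <- levels(group)
design_pairs
# every row has a single 1 in its group's column; a contrast such as
# S2 - S1 would then test that pairwise difference in the GLM framework
```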
group <- c(rep("S1", 3), rep("S2", 3), rep("S3", 3), rep("S4", 3))
yc <- DGEList(counts = counts, group = group, genes = accessions$acc)
yc <- calcNormFactors(yc)
yc <- estimateDisp(yc)
plotBCV(yc)
# compute the normalized counts (start with counts)
# library size factors adjust each run to the same average total count
lib_facs <- mean(colSums(counts)) / colSums(counts)
# the TMM factors are additional library adjustment factors (so divide by them)
norm_facs <- lib_facs / yc$samples$norm.factors
# compute the normalized data as a new data frame
counts_norm <- sweep(counts, 2, norm_facs, FUN = "*")
colnames(counts_norm) <- paste(colnames(counts), "TMMNorm", sep = "_")
# add normalized counts to results
results <- cbind(results, counts_norm)
# look at count distributions across samples
boxplot(log10(counts_norm + 0.25),
        col = c(rep("purple", 3), rep("red", 3), rep("green", 3), rep("blue", 3)),
        xlab = "Samples", ylab = "Normalized Counts",
        main = "TMM Normalized data", notch = TRUE)
The normalization factors in the DGEList object are all pretty close to 1.0 so we may not see much of an effect on the CV distribution before and after normalization. Since there are only a few spike-in proteins, the CVs were computed across all 12 samples (it is mostly the same yeast background).
# input: data frame of counts; output: vector of CVs (%)
CV <- function(df) {
  ave <- rowMeans(df)
  sd <- apply(df, 1, sd)
  100 * sd / ave
}
# put CVs in a data frame to simplify plots and summaries
cv_frame <- data.frame(Raw = CV(counts), Norm = CV(counts_norm))
medians <- apply(cv_frame, 2, FUN = median)
round(medians, 2)
boxplot(cv_frame, notch = TRUE, main = "SpC CV distributions", ylim = c(0, 100), ylab = "CV (%)")
We will be using the exact test for each pair; however, the variance is estimated experiment-wide. We have the data loaded into "yc", have done TMM normalization, and estimated dispersions. We can specify which pairs we want to compare and compute a modified Fisher's exact test. We will make a table of the spike-in levels for each of the 6 proteins, so we know what to expect. We will print out the test results for the most significant candidates and visualize them with an MA plot. limma has a function to make MA plots easily.
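As an aside, the intuition behind an exact test on counts can be seen with base R's fisher.test on a single hypothetical protein (a 2x2 table of the protein's counts versus the rest of the library; edgeR's exact test additionally models biological overdispersion, so this is only a loose analogy):

```r
# hypothetical counts for one protein in two samples with equal library sizes
protein_spc <- c(S1 = 40, S2 = 10)
lib_sizes <- c(S1 = 50000, S2 = 50000)

# 2 x 2 contingency table: this protein versus all the other counts
tab <- rbind(protein = protein_spc, other = lib_sizes - protein_spc)
fisher.test(tab)$p.value  # well below 0.05: unlikely from sampling alone
```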
Here are the different levels for the spike-in proteins:
Protein | Sample 1 | Sample 2 | Direction |
---|---|---|---|
EXTRA_0001 | 65 | 55 | NC |
EXTRA_0002 | 55 | 15 | Down |
EXTRA_0003 | 15 | 2 | Down |
EXTRA_0004 | 2 | 65 | Up |
EXTRA_0005_family | 11 | 0.6 | Down |
EXTRA_0006 | 10 | 500 | Up |
We expect 3 Down and 2 up.
# make average vectors (for results)
S1.ave <- rowMeans(counts_norm[S1])
S2.ave <- rowMeans(counts_norm[S2])
# exact test
et1_2 <- exactTest(yc, pair = c("S1", "S2"))
# look at top DE table, see up and down, and a MA plot
tt <- topTags(et1_2)
tt$table
summary(decideTests(et1_2))
plotMD(et1_2, main = "S1 versus S2")
abline(h = c(-1, 1), col = "black")
Here are the different levels for the spike-in proteins:
Protein | Sample 1 | Sample 3 | Direction |
---|---|---|---|
EXTRA_0001 | 65 | 15 | Down |
EXTRA_0002 | 55 | 2 | Down |
EXTRA_0003 | 15 | 65 | Up |
EXTRA_0004 | 2 | 55 | Up |
EXTRA_0005_family | 11 | 10 | NC |
EXTRA_0006 | 10 | 11 | NC |
We expect 2 Down and 2 up.
# make average vectors (for results)
S3.ave <- rowMeans(counts_norm[S3])
# exact test
et1_3 <- exactTest(yc, pair = c("S1", "S3"))
# look at top DE table, see up and down, and a MA plot
tt <- topTags(et1_3)
tt$table
summary(decideTests(et1_3))
plotMD(et1_3, main = "S1 versus S3")
abline(h = c(-1, 1), col = "black")
Here are the different levels for the spike-in proteins:
Protein | Sample 1 | Sample 4 | Direction |
---|---|---|---|
EXTRA_0001 | 65 | 2 | Down |
EXTRA_0002 | 55 | 65 | NC |
EXTRA_0003 | 15 | 55 | Up |
EXTRA_0004 | 2 | 15 | Up |
EXTRA_0005_family | 11 | 500 | Up |
EXTRA_0006 | 10 | 0.6 | Down |
We expect 2 Down and 3 up.
# make average vectors (for results)
S4.ave <- rowMeans(counts_norm[S4])
# exact test
et1_4 <- exactTest(yc, pair = c("S1", "S4"))
# look at top DE table, see up and down, and a MA plot
tt <- topTags(et1_4)
tt$table
summary(decideTests(et1_4))
plotMD(et1_4, main = "S1 versus S4")
abline(h = c(-1, 1), col = "black")
Here are the different levels for the spike-in proteins:
Protein | Sample 2 | Sample 3 | Direction |
---|---|---|---|
EXTRA_0001 | 55 | 15 | Down |
EXTRA_0002 | 15 | 2 | Down |
EXTRA_0003 | 2 | 65 | Up |
EXTRA_0004 | 65 | 55 | NC |
EXTRA_0005_family | 0.6 | 10 | Up |
EXTRA_0006 | 500 | 11 | Down |
We expect 3 Down and 2 up.
# exact test
et2_3 <- exactTest(yc, pair = c("S2", "S3"))
# look at top DE table, see up and down, and a MA plot
tt <- topTags(et2_3)
tt$table
summary(decideTests(et2_3))
plotMD(et2_3, main = "S2 versus S3")
abline(h = c(-1, 1), col = "black")
Here are the different levels for the spike-in proteins:
Protein | Sample 2 | Sample 4 | Direction |
---|---|---|---|
EXTRA_0001 | 55 | 2 | Down |
EXTRA_0002 | 15 | 65 | Up |
EXTRA_0003 | 2 | 55 | Up |
EXTRA_0004 | 65 | 15 | Down |
EXTRA_0005_family | 0.6 | 500 | Up |
EXTRA_0006 | 500 | 0.6 | Down |
We expect 3 Down and 3 up.
# exact test
et2_4 <- exactTest(yc, pair = c("S2", "S4"))
# look at top DE table, see up and down, and a MA plot
tt <- topTags(et2_4)
tt$table
summary(decideTests(et2_4))
plotMD(et2_4, main = "S2 versus S4")
abline(h = c(-1, 1), col = "black")
Here are the different levels for the spike-in proteins:
Protein | Sample 3 | Sample 4 | Direction |
---|---|---|---|
EXTRA_0001 | 15 | 2 | Down |
EXTRA_0002 | 2 | 65 | Up |
EXTRA_0003 | 65 | 55 | NC |
EXTRA_0004 | 55 | 15 | Down |
EXTRA_0005_family | 10 | 500 | Up |
EXTRA_0006 | 11 | 0.6 | Down |
We expect 3 Down and 2 up.
# exact test
et3_4 <- exactTest(yc, pair = c("S3", "S4"))
# look at top DE table, see up and down, and a MA plot
tt <- topTags(et3_4)
tt$table
summary(decideTests(et3_4))
plotMD(et3_4, main = "S3 versus S4")
abline(h = c(-1, 1), col = "black")
We do not have any false-positive yeast proteins; only the spike-in proteins are significant DE candidates. We do not recover 100% of the spike-in differences, though; some of the lower abundance spike-in proteins are missed. Not all proteins have the same detectability in shotgun studies, even when they are at similar concentrations.
Spectral counts are not strictly linear with protein concentration: highly abundant proteins are undercounted and low abundance proteins are overcounted, so the response is more sigmoidal. Testing measured fold changes against known fold changes is less important than making good DE calls.
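A toy saturation model shows how measured SpC fold changes get compressed at the high end (the function and its parameters are hypothetical, purely for illustration):

```r
# hypothetical saturating response: SpC ~ max_spc * conc / (conc + k)
saturating_spc <- function(conc, max_spc = 1000, k = 200) {
  max_spc * conc / (conc + k)
}

# a true 4-fold concentration change measures as only 2-fold in counts
true_fold <- 400 / 100
measured_fold <- saturating_spc(400) / saturating_spc(100)
round(c(true = true_fold, measured = measured_fold), 2)
```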
I used the plotMD function (the generic is in limma; edgeR supplies the methods for its result objects) to visualize candidates. The ggplot2 package could also be used (I have done so in other notebooks), but this data has few DE candidates, so I will not do any additional DE candidate visualizations here (scatter plots, MA plots, or volcano plots). Another very useful QC plot is the distribution of uncorrected p-values from the statistical test(s). Since we have so few candidates, this will also be skipped.
We need a robust final table that has our protein identification details and the quantitative analysis results together in a nice way. For each test we will have the average count vectors and the p-values table.
# make a little frame for each pairwise test
df1_2 <- data.frame(S1.ave, S2.ave, topTags(et1_2, n = Inf, sort.by = "none"))
colnames(df1_2) <- paste(colnames(df1_2), "1vs2", sep = ".")
df1_3 <- data.frame(S1.ave, S3.ave, topTags(et1_3, n = Inf, sort.by = "none"))
colnames(df1_3) <- paste(colnames(df1_3), "1vs3", sep = ".")
df1_4 <- data.frame(S1.ave, S4.ave, topTags(et1_4, n = Inf, sort.by = "none"))
colnames(df1_4) <- paste(colnames(df1_4), "1vs4", sep = ".")
df2_3 <- data.frame(S2.ave, S3.ave, topTags(et2_3, n = Inf, sort.by = "none"))
colnames(df2_3) <- paste(colnames(df2_3), "2vs3", sep = ".")
df2_4 <- data.frame(S2.ave, S4.ave, topTags(et2_4, n = Inf, sort.by = "none"))
colnames(df2_4) <- paste(colnames(df2_4), "2vs4", sep = ".")
df3_4 <- data.frame(S3.ave, S4.ave, topTags(et3_4, n = Inf, sort.by = "none"))
colnames(df3_4) <- paste(colnames(df3_4), "3vs4", sep = ".")
# add to results
results <- cbind(results, df1_2, df1_3, df1_4, df2_3, df2_4, df3_4)
# check the column names before merging ("acc" is the shared key;
# duplicated column names would be an issue when merging)
colnames(results)
# table "full" has more rows than "results", so we want a left join merge
full_with_stats <- left_join(full, results, by = "acc")
colnames(full_with_stats)
Save the results as a new table, and log the session information.
write.table(full_with_stats, "PAW_grouped_proteins_with_stats.txt", sep = "\t", row.names = FALSE, na = " ")
sessionInfo()