The data from the publication below was downloaded from PRIDE. One set of samples was an E. coli background with a few (13) spiked in human proteins. The amounts of the spike ins varied by channel in the 10-plex TMT labeling. The same sample was run with a more traditional MS2 reporter ion method, and the newer synchronous precursor scan MS3 method. The dataset does not lend itself to an analysis of accuracy and dynamic range, but we can see if the duty cycle or signal intensities are very different between the two ways of acquiring reporter ions signals. The E. coli background is also the same in all channels and can be used to check technical replicate reproducibility.
D’Angelo, G., Chaerkady, R., Yu, W., Hizal, D.B., Hess, S., Zhao, W., Lekstrom, K., Guo, X., White, W.I., Roskos, L. and Bowen, M.A., 2017. Statistical models for the analysis of isobaric tags multiplexed quantitative proteomics. Journal of proteome research, 16(9), pp.3124-3136.
The data were generated on a Thermo Fusion using either a basic HCD MS2 method or the newer SPS MS3 method. Resolution was 60K for the scans containing the reporter ion signals in either method.
Peptides and proteins were identified using Comet and the PAW pipeline. A wider 1.25 Da parent ion mass tolerance was used, TMT labels and alkylated cysteine were specified as static modifications, oxidation of methionine was specified as a variable modification, trypsin enzyme specificity was used, and a UniProt reference E. coli protein database was used (with addition of the 13 human protein sequences). Fragment ion tolerance was set to 0.02 Da with offset of 0.0 for MS2 mode data, and 1.0005 Da with offset of 0.40 for the MS3 mode data. Confident peptide identifications were obtained using accurate mass conditional score histograms, the target/decoy method, and the protein inference used basic parsimony principles.
The PAW pipeline was used to infer proteins, perform homologous protein grouping, establish peptide uniqueness to the inferred proteins, and sum unique PSM reporter ions into protein intensity totals.
Results tables were filtered to remove common contaminants and the protein intensity tables were saved as tab-delimited text files for reading by the scripts below. Normalizations and statistical testing were performed using the Bioconductor package edgeR as detailed in the steps below. A Jupyter notebook with an R kernel was used to execute R commands and visualize the results.
Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.
Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.
Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.
Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.
A previous analysis was done with MaxQuant (v 1.5.7.4) and looked at how the data at different levels (PSM, peptide and protein) behaved. We also had PSM identification numbers (1% FDR) that we can compare to the new PAW analysis.
Mode | MS2 Scans | MQ PSMs | PAW PSMs | Gain |
---|---|---|---|---|
MS2 | 192649 | 46591 | 59348 | 27% |
MS3 | 144634 | 33475 | 44440 | 33% |
The numbers of MS2 scans acquired and identified for the SPS MS3 method was lower than for the MS2 method. There is a duty factor hit for the more complicated method. Comet/PAW again outperforms MaxQuant/Andromeda by a large margin.
# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
# load the data and check the table
data_import <- read_tsv("JPR-2017_grouped-proteins_TMT.txt", skip = 4, guess_max = 2500)
# the "Filter" column flags contams and decoys
# the prepped table from pandas is sorted so these are the upper rows
data_all <- filter(data_import, is.na(Filter))
data_ms2 <- data_all %>%
select(-Accession) %>%
select(ends_with("_MS2")) %>%
select(starts_with("TotInt_")) %>%
select(-contains("_131C_"))
data_ms3 <- data_all %>%
select(-Accession) %>%
select(ends_with("_MS3")) %>%
select(starts_with("TotInt_")) %>%
select(-contains("_131C_"))
# save gene names for edgeR so we can double check that results line up
accessions <- data_all$Accession
# see how many rows in the table
nrow(data_ms2)
head(data_ms2)
We need to load the data into an edgeR data structure, run TMM normalization, and then generate the cluster plot. A multi-dimensional scaling plot (similar to hierarchical clustering or PCA plots) is a good way to check the data.
# get the biological sample data into a DGEList object
group = c(rep('MS2', 10), rep('MS3', 10))
y <- DGEList(counts = cbind(data_ms2, data_ms3), group = group, genes = accessions)
# run TMM normalization (also includes a library size factor)
y <- calcNormFactors(y)
# set some colors by condition
colors = c(rep('red', 10), rep('blue', 10))
# check the clustering
plotMDS(y, col = colors, main = "All MS2 and MS3 samples (E. coli only)")
We have pretty clear separation by acquisition mode (reporter ions in MS2 scans or reporter ions in MS3 scans). There may be more variability in the MS3 data (more vertical spread) compared to the MS2 data. We could have some dynamic range compression in the MS2 data, or we could have some increased uncertainties in the MS3 data due to lower signals after the extra round of isolation.
# see what the normalization factors look like
y$samples
The MS2 data have factors around 1.2 and the MS3 data have factors of 0.85.
EdgeR uses the normalization factors in its statistical modeling but does not output the normalized intensities. We have to compute the normalized intensities by hand below.
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
# computes the tmm normalized data from the DGEList object
# y - DGEList object
# returns a dataframe with normalized intensities
# compute grand total (library size) scalings
lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
# the TMM factors are library adjustment factors (so divide by them)
norm_facs <- lib_facs / y$samples$norm.factors
cat("Overall Factors (lib.size+TMM):\n", sprintf("%-5s -> %f\n",
colnames(y$counts), norm_facs))
# compute the normalized data as a new data frame
tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
# visualize results and return data frame
if(plot == TRUE) {
boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
}
tmt_tmm
}
# compute the normalized data as a new data frame
data_tmm <- apply_tmm_factors(y, colors)
Box plots for well normalized data should be similar in size and the medians should align with each other. There can be problems with the data that may not be apparent with boxplots (some data distortions can average out and not appear different with boxplots). Regardless, we should have boxplots with good median alignment. Good boxplot behavior is a necessary but not sufficient requirement of proper normalization.
We will use ggplot2 for the density plots. We need tidy (long form) data for ggplot:
long_results <- gather(data_tmm, key = "sample", value = "intensity") %>%
mutate(log_int = log10(intensity)) %>%
extract(sample, into = 'group', ".*_(.*?)_", remove = FALSE)
head(long_results)
# plot density distributions
ggplot(long_results, aes(x = log_int, color = sample)) +
geom_density() +
guides(color = FALSE) +
ggtitle("edgeR TMM normalized data")
# separate plots to see which is which
ggplot(long_results, aes(x = log_int, color = sample)) +
geom_density() +
facet_wrap(~ group) +
guides(color = FALSE) +
ggtitle("edgeR TMM normalized data")
Within acquisition mode, the distributions are very well aligned. The shapes of the distribution are a little different between modes.
The distributions of Coefficients of Variation (CVs) are another way to get an idea of how individual proteins are behaving. This seems to be an effective way to assess proper normalization in these experiments. We will compute CV distributions for each of the two acquisition modes.
# save column indexes for different conditions (indexes to data_raw frame)
# these make things easier (and reduce the chance for errors)
ms2 <- 1:10
ms3 <- 11:20
# create a CV computing function
CV <- function(df) {
ave <- rowMeans(df)
sd <- apply(df, 1, sd)
cv <- 100 * sd / ave
}
# put CVs in data frames to simplify plots and summaries
cv_frame <- data.frame(ms2_raw = CV(data_ms2), ms2_tmm = CV(data_tmm[ms2]),
ms3_raw = CV(data_ms3), ms3_tmm = CV(data_tmm[ms3]))
# see what the median CV values are
medians <- apply(cv_frame, 2, FUN = median)
print("Median CVs by mode, before/after (%)")
round(medians, 1)
# see what the CV distibutions look like
# need long form for ggplot
long_cv <- gather(cv_frame, key = "condition", value = "cv") %>%
extract(condition, into = 'group', "(.*?)_+", remove = FALSE)
# traditional boxplots
ggplot(long_cv, aes(x = condition, y = cv, fill = group)) +
geom_boxplot(notch = TRUE) +
coord_cartesian(ylim = c(0, 50)) +
ggtitle("CV distributions")
# density plots
ggplot(long_cv, aes(x = cv, color = condition)) +
geom_density() +
coord_cartesian(xlim = c(0, 50)) +
ggtitle("CV distributions")
# multi-panel scatter plot grids
idx2 <- sample(1:10, 4)
idx3 <- sample(11:20, 4)
pairs.panels(log2(data_tmm[idx2]), lm = TRUE, main = "MS2 (random 4)")
pairs.panels(log2(data_tmm[idx3]), lm = TRUE, main = "MS3 (random 4)")
# multi-panel scatter plot grids
left <- sample(1:10, 2)
right <- sample(11:20, 2)
pairs.panels(log2(data_tmm[c(left, right)]), lm = TRUE, main = "random 2x2 between modes")
We see that the data between MS2 and MS3 methods are not well correlated compared to within method data. The precision in isobaric labeling is because the reporter ion measurements come from the same instrument scan. Data aggregation of PSMs to proteins preserves the precision within a given TMT experiment. When we are comparing data from different sets of scans (MS2 versus MS3), we lose the inherent precision. Keep in mind that the scatter plots in the 4 lower left panels are still of the exact same E. coli digests. We could probably fix the differences between the MS2 and MS3 data with some sort of a batch correction method (mock IRS based on averages). The data in the above plot has already been adjusted by very good conventional normalization methods. Taking logarithms and median scaling will do nothing to improve the situation. Only a solid understanding of the nature of the data measurements can lead to correct analyses.
The scatter plots above look similar to how pooled channels behave without IRS normalization; namely, within TMT data are "tight", but data between TMT plexes are not very similar. We can use experiment-wide average intensities as mock reference channels and see if we can get the MS2 and MS3 data on a common intensity scale.
The steps:
#================= sample loading normalization ================================
SL_Norm <- function(df, color = NULL, plot = TRUE) {
# This makes each channel sum to the average grand total
# df - data frame of TMT intensities
# returns a new data frame with normalized values
# compute scaling factors to make colsums match the average sum
norm_facs <- mean(c(colSums(df))) / colSums(df)
cat("SL Factors:\n", sprintf("%-5s -> %f\n", colnames(df), norm_facs))
df_sl <- sweep(df, 2, norm_facs, FUN = "*")
# visualize results and return data frame
if(plot == TRUE) {
boxplot(log10(df_sl), col = color, notch = TRUE, main = "SL Normalized data")
}
df_sl
}
# make new data frame with row sums from each frame
sl_ms2 <- SL_Norm(data_ms2, plot = FALSE)
sl_ms3 <- SL_Norm(data_ms3, plot = FALSE)
irs <- tibble(rowSums(sl_ms2), rowSums(sl_ms3))
colnames(irs) <- c("sum_ms2", "sum_ms3")
# get the geometric average intensity for each protein
irs$average <- apply(irs, 1, function(x) exp(mean(log(x))))
# compute the scaling factor vectors
irs$fac_ms2 <- irs$average / irs$sum_ms2
irs$fac_ms3 <- irs$average / irs$sum_ms3
# make new data frame with IRS normalized data
data_irs <- sl_ms2 * irs$fac_ms2
data_irs <- cbind(data_irs, sl_ms3 * irs$fac_ms3)
# see what the IRS data look like
boxplot(log2(data_irs), col = rep(c("red", "blue"), each = 10),
main = "Internal Reference Scaling (IRS) normalized data: \nMS2 (red), MS3 (blue)",
xlab = 'TMT Sample', ylab = 'log2 of Intensity', notch = TRUE)
# can also look at density plots (like a distribution histogram)
plotDensities(log2(data_irs), col = rep(c("red", "blue"), 10), main = "IRS data")
The data above look pretty darn good. We can still run the TMM normalization. If the data do not need any further adjustments, the TMM factors will be very minor (very close to 1.0).
We can also see how the samples cluster after we have done the mock-IRS and TMM steps.
# get the biological sample data into a DGEList object
group = c(rep('MS2', 10), rep('MS3', 10))
y_irs <- DGEList(counts = data_irs, group = group, genes = accessions)
# run TMM normalization (also includes a library size factor)
y_irs <- calcNormFactors(y_irs)
data_irs_tmm <- apply_tmm_factors(y_irs, colors)
# set some colors by condition
colors = c(rep('red', 10), rep('blue', 10))
# check the clustering
plotMDS(y_irs, col = colors, main = "All MS2 and MS3 samples after IRS")
CV distributions and median CVs are also useful to check. We will compute those for all 20 channels at the various stages of normalizations.
# =========== Boxplot with median label ========================================
labeled_boxplot <- function(df, ylim, title) {
# Makes a box plot with the median value labeled
# df - data frame with data to compute CVs of
# ylim - upper limit for y-axis
# title - plot title
cv = CV(df)
boxplot(cv, ylim = c(0, ylim), notch = TRUE, main = title)
text(x = 0.65, y = boxplot.stats(cv)$stats[3],
labels = round(boxplot.stats(cv)$stats[3], 2))
}
par(mfrow = c(2,2))
labeled_boxplot(cbind(data_ms2, data_ms3), 100, "Raw data")
labeled_boxplot(data_tmm, 100, "Just TMM data")
labeled_boxplot(data_irs, 60, "Just IRS data")
labeled_boxplot(data_irs_tmm, 60, "Both IRS/TMM data")
In the MS2 data (above), we had a median CV of 5.3%. The MS3 data was 6.5%. After the mock-IRS re-scaling, we get 6% median CV across all 20 channels. We can also check the scatter plots.
# multi-panel scatter plot grids
left <- sample(1:10, 2)
right <- sample(11:20, 2)
pairs.panels(log2(data_irs_tmm[c(left, right)]), lm = TRUE, main = "random 2x2 between modes after IRS")
The mock-IRS was a fun data exercise. We would need to look at more data to see if it really makes any sense to combine reporter ion data acquired with different methods. The data here can be deceptive because we are only looking at proteins that are not at different intensities. It is helpful to think of IRS as working on hidden sentinel channels. IRS generates multiplicative factors to make different things that are not supposed to be different be essentially the same. This only applies to the sentinel channels per se. The adjusting all of the channels in a plex by those factors only works because of the reporter ions being measured in the same scan. The sentinel channels effectively eavesdrop on all of the channels because of the multiplexed aspect of isobaric tagging.
write.table(cbind(accessions, data_tmm, irs, data_irs_tmm), "TMM_results.txt", sep = "\t",
row.names = FALSE, na = " ")
sessionInfo()