Most ways that isobaric labeling data are analyzed hinder statistical testing and data visualization. This notebook will analyze data from (Ref-1) to demonstrate a two-condition (3 controls and 4 treatments) comparison done with 10-plex TMT reagents using the SPS MS3 method (Ref-2) on a Thermo Fusion instrument. The notebook will show how to:
The KUR1502 data were mouse bone marrow cell cultures where the "media" samples are controls (one of the 4 did not label) and there are leukemia exosome-dosed cells (4 "exosome" samples). 10-plex TMT was used on a Thermo Fusion using the SPS MS3 method. The PAW pipeline (Ref-3) is our in-house processing using the Comet search engine (Ref-4). The pipeline converts the RAW files into text files using MSConvert from the Proteowizard toolkit (Ref-5).
Data from the MS2 scans are extracted for the Comet searches and the reporter peak heights are extracted from the MS3 scans. The pipeline uses the target/decoy method to make score histograms and determine score filtering thresholds. Accurate mass is used to create conditional score histograms where target/decoy delta mass histograms are used to set the accurate mass windows. Basic parsimony principles are used for protein inference and 2 peptides per protein were required. An additional protein grouping step was used to combine nearly identical peptide sets (often these are housekeeping genes).
1. Huan, J., Hornick, N.I., Goloviznina, N.A., Kamimae-Lanning, A.N., David, L.L., Wilmarth, P.A., Mori, T., Chevillet, J.R., Narla, A., Roberts Jr, C.T. and Loriaux, M.M., 2015. Coordinate regulation of residual bone marrow function by paracrine trafficking of AML exosomes. Leukemia, 29(12), p.2285.
2. McAlister, G.C., Nusinow, D.P., Jedrychowski, M.P., Wühr, M., Huttlin, E.L., Erickson, B.K., Rad, R., Haas, W. and Gygi, S.P., 2014. MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes. Analytical chemistry, 86(14), pp.7150-7158.
3. Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.
4. Eng, J.K., Jahan, T.A. and Hoopmann, M.R., 2013. Comet: an open‐source MS/MS sequence database search tool. Proteomics, 13(1), pp.22-24.
5. Chambers, M.C., Maclean, B., Burke, R., Amodei, D., Ruderman, D.L., Neumann, S., Gatto, L., Fischer, B., Pratt, B., Egertson, J. and Hoff, K., 2012. A cross-platform toolkit for mass spectrometry and proteomics. Nature biotechnology, 30(10), p.918
There are some hidden constraints in these experimental designs. The total amount of protein is usually fixed at the same value across all samples (an individual biological replicate). High abundance proteins are constrained by the total protein amount and do not have the same freedom for abundance differences that proteins that are smaller fractions of the total have. Higher abundance proteins must, by experimental design, have moderated differences in means and reduced variances. Lower abundance proteins have more freedom and will not be as constrained.
The constrained nature of these experimental designs (both genomics and proteomics) has implications in normalization algorithms. An increased expression in abundant proteins will "push" all other proteins down (their abundance total has to remain the same). Grand total normalization like we use above can be too simplistic. The differential expression testing package edgeR (Ref-6) includes a trimmed mean of M values (TMM) normalization step (Ref-7) designed for these types of samples.
6. Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.
7. Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.
The references below outline the reasons for using both edgeR and limma when analyzing Next Gen sequencing data. These publications are from the leaders in statistical analysis of microarray and next gen sequencing data. Their arguments are more informed and elegant than I can make. My summary is that the TMM normalization algorithm in edgeR is a good one to use. limma can integrate a fold-change cutoff into the modeling p-values (when there are too many DE candidates) and do gene set enrichment testing - features that depend on the normal distribution.
Both edgeR and limma do variance estimated from multiple nearby genes/proteins. The important question is what is meant by nearby? Generally speaking, nearby means similar in relative abundance. Variance in count data depends on the magnitude of the value. Many other wide dynamic range measurements like gene expression or protein expression have more uncertainty with signals near the noise levels (limit of detection). There is also an inherent constraint on the total amount of stuff being measured. That squeezes the more abundant genes/proteins and reduces apparent variance. Lower relative abundance molecules have more freedom for expression changes.
The long story is that there is usually an abundance dependence to the variance. We need to know the abundance scale to get a reliable trended variance. Counts (or intensities) or log2 counts (intensities) has the needed abundance scale. If we were to take expression ratios, we would lose the abundance information and not be able to do trended variance estimates. This is a really strong and compelling reason to never take ratios in expression studies until after statistical testing. This is not a personal opinion; it is a mathematical reality.
In the Law et al. 2016 paper, the prescription for combining edgeR with limma is spelled out:
After we do the usual edgeR data workup, we will use limma following the details in Law et al. 2016 and compare the results. This is probably the best example of how to use limma for TMT data.
Law, C.W., Chen, Y., Shi, W. and Smyth, G.K., 2014. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome biology, 15(2), p.R29.
Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W. and Smyth, G.K., 2015. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic acids research, 43(7), pp.e47-e47.
Law, C.W., Alhamdoosh, M., Su, S., Smyth, G.K. and Ritchie, M.E., 2016. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Research, 5.
# load libraries
library("tidyverse")
library("psych")
library("gridExtra")
library("scales")
library("limma")
library("edgeR")
# read the grouped protein summary file
paw_raw <- read_tsv("grouped_protein_summary_TMT_8_sorted.txt", skip = 4,
n_max = 5427, guess_max = 5427)
# extract protein accession column and TMT data
# need to exclude any rows with an entry in "Filter" column
# there are also 4 unused channels (of 11-plex slots)
paw_tmt <- paw_raw %>%
filter(., is.na(Filter)) %>%
select(Accession, starts_with("TotInt_")) %>%
select(-contains("127N"), -contains("128N"), -contains("130N"), -contains("131C"))
# separate accessions from the data
accession <- paw_tmt$Accession
paw_tmt <- paw_tmt %>% select(-Accession)
# rename the columns and gather the conditions
colnames(paw_tmt) <- c("Media_2.1", "Exo_2.1", "Exo_2.2", "Media_2.2",
"Exo_3.1", "Exo_3.2", "Media_3.2")
paw_tmt <- paw_tmt %>% select(contains("Media"), contains("Exo"))
head(paw_tmt)
nrow(paw_tmt)
# define the experimental groups
group <- c(rep("media", 3), rep("exosome", 4))
# define a scale parameter for playing purposes
scale <- 1.0
# load the data in DGEList object
y <- DGEList(counts = paw_tmt/scale, group = group, genes = accession)
y$samples
EdgeR has several features (TMM normalization, shared variance modeling) that are nice. The downside is the count-based model. The bad aspects of counts (discrete values, large fluctutations for small values) go away as the magnitude of the values increase. EdgeR models variance with a component for the Poisson nature of counts and an over-dispersion term that picks up the biological variability. As the magnitude of the values increase, the variance asymptotically approaches the biological variability.
How big is big enough? In the above code cell, there is a scale
variable. That can be used to reduce the intensity values so that we can see when we bump into the Poisson nature of count values. A cool thing about notebooks is that you can make a change and re-run the notebook and look at all of the results. I stepped through scale
values of 0.1, 1, 10, 100, 1000, and 10000. The table shows the numbers of up, down, or not changed proteins.
Scale | Up | Not DE | Down |
---|---|---|---|
0.1 | 1236 | 2488 | 1252 |
1 | 1236 | 2487 | 1253 |
10 | 1250 | 2461 | 1265 |
100 | 1284 | 2413 | 1276 |
1000 | 939 | 3005 | 1032 |
10000 | 263 | 4390 | 323 |
It is not until we have reduced the intensities by a factor of 1000 that we start to see changes. The exact test in edgeR does have dependence on the magnitude of the numbers, like a Chi square test. Charnging the scale of the intensity data does change the edgeR p-values, with p-values getting less small as the intensities are reduced in scale. This explains why the edgeR p-values have a wider dynamic range than from limma; interestingly, the numbers of DE candidates at the set FDR cutoff values (0.01, 0.05, 0.10) are pretty independent of the intensity scale (see table below).
Scale | high | medium | low | non |
---|---|---|---|---|
0.1 | 1576 | 536 | 376 | 2488 |
1 | 1580 | 534 | 375 | 2487 |
10 | 1590 | 554 | 371 | 2461 |
100 | 1638 | 549 | 376 | 2413 |
1000 | 1265 | 424 | 282 | 3005 |
10000 | 388 | 137 | 61 | 4390 |
Practically speaking, intensities (reporter ion peak heights) are sufficiently large that using edgeR is not a problem. NOTE: this is not true if transformed values like the Proteome Discoverer 2.x default of signal-to-noise ratios are used. I do not recommend ever using this quantity. Signals should have noise subtracted (that is already done on Orbitraps).
EdgeR normalization is actually done in two steps. The first, called a library size adjustment, is like a sample loading normalization. This gets rid of the big differences between samples so that the TMM algorithm has better starting data. We need to compute the normalized intensities from the TMM factors.
# run the TMM normalization
y <- calcNormFactors(y)
# set colors for plotting
color <- c(rep("red", 3), rep("blue", 4))
# function to compute the normalized intensities
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
# computes the tmm normalized data from the DGEList object
# y - DGEList object
# returns a dataframe with normalized intensities
# compute grand total (library size) scalings
lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
# the TMM factors are library adjustment factors (so divide by them)
norm_facs <- lib_facs / y$samples$norm.factors
cat("Overall Factors (lib.size+TMM):\n", sprintf("%-5s -> %f\n",
colnames(y$counts), norm_facs))
# compute the normalized data as a new data frame
tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
# visualize results and return data frame
if(plot == TRUE) {
boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
}
tmt_tmm
}
# get the normalized data values
paw_tmt_tmm <- apply_tmm_factors(y, color)
EdgeR can transform next gen sequencing data onto a standardized scale (counts per million reads) to correct for library size differences and allow easier comparisons between samples. How do the counts-per-million values compare to the TMM normalized intensities?
# get the cpm matrix and the log2 cpm matrix
cpm_tmt <- cpm(y, log = FALSE)
lcpm <- cpm(y, log = TRUE)
# check lcpm
class(lcpm)
head(lcpm)
# see how media samples correlate
x <- data.frame(log2(rowMeans(paw_tmt_tmm[, 1:3])), log2(rowMeans(cpm_tmt[, 1:3])))
colnames(x) <- c('TMM', 'CPM')
plot(x, main = "Scatter plot of TMM normalized intensities and CPM values")
# we need to get the trended dispersion estimates
y <- estimateDisp(y)
plotBCV(y, main = "Dispersion trends")
There are abundance-dependent dispersion effects in many measurement methods. Higher abundance signals have better signal-to-noise (or are the result of more averaging), so they can appear more precise than lower abundance signals. We see that the trended variance blue line increases as the protein intensity decreases. The statistical testing will use trended variances (estimated from multiple proteins) as a function of protein intensity, which spans several orders of magnitude. EdgeR puts the data on a common "counts per million" scale. The library size and TMM scaling factors essentially normalize the total signal per channel to 1.0. Multiplying by one million is done to convert to the CPM scale.
There are two important observations about the data and the blue line. The blue line is greater than the per protein variances for most proteins (the region with log CPM between 4 and 8 is a good example). This prevents proteins with small mean difference that might also have atypically low variance from throwing really small p-values. This will make edgeR testing a bit on the conservative side.
There is also a flip side. Some of the proteins have variances that are greater than the blue line. If they have larger differences in means, the lower trended variance rather than their higher direct variances will result in small p-values. Real sample contaminant proteins (such as from surrounding tissue in dissections) can vary sample-to-sample, have high variance, and potentially be biased among conditions. These proteins might not have small p-values in traditional tests despite large differences in means if the variance was also very large.
We will use the exact test in edgeR for this simple two-state comparison. You may wonder why we are using statistical testing built on a negative binomial distribution for counting experiments. The shortest answer is because it seems to work okay. The variance is modeled by two terms: one for Poisson numbers and an over-dispersion term. As counts get larger (several hundred or so), the Poisson term becomes small and the variance is handled by the over-dispersion term. The values of reporter ion peak heights, particularly when summed into protein totals, are large numbers. After we get the results from edgeR, we will compare them to limma with the voom variance modeling, which uses a normal data distribution.
collect_results <- function(df, tt, x, xlab, y, ylab) {
# Computes new columns and extracts some columns to make results frame
# df - data in data.frame
# tt - top tags from edgeR test
# x - columns for first condition
# xlab - label for x
# y - columns for second condition
# ylab - label for y
# returns a new dataframe
# condition average vectors
ave_x <- rowMeans(df[x])
ave_y <- rowMeans(df[y])
# FC, direction, candidates
fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
direction <- ifelse(ave_y > ave_x, "up", "down")
candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
labels = c("high", "med", "low", "no"))
# make data frame
temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc,
PValue = tt$PValue, FDR = tt$FDR,
ave_x = ave_x, ave_y = ave_y,
direction = direction, candidate = candidate,
Acc = tt$genes))
# fix column headers for averages
names(temp)[names(temp) %in% c("ave_x", "ave_y")] <- str_c("ave_", c(xlab, ylab))
temp # return the data frame
}
# compute the exact test models, p-values, FC, etc.
et <- exactTest(y, pair = c("media", "exosome"))
# define variables for the columns in each condition
M <- 1:3
E <- 4:7
# make the results table
tt <- topTags(et, n = Inf, sort.by = "none")$table
med_exo <- collect_results(paw_tmt_tmm, tt, M, "med", E, "exo")
The exactTest
function does the modeling for the pairwise comparison. More than one pair can be present in DGEList objects to improve the dispersion estimates. The topTags
function does the multiple testing corrections and returns more concise summaries of the testing results. The collect_results
function combines some of the normalized data and test results into a concise med_exo
data frame.
It is important to see if the modeling looks reasonable. Our general assumptions are that we have a large fraction of the proteins that are not differentially expressed. Those will have a uniform (flat) p-value distribution from 0.0 to 1.0. We also expect (hopefully) some true differential expression candidates. Those should have very small p-values and have a sharper distribution at low p-values.
pvalue_plot <- function(results, title) {
# Makes p-value distribution plots
# results - results data frame
# title - plot title
ggplot(results, aes(PValue)) +
geom_histogram(bins = 100, fill = "white", color = "black") +
geom_hline(yintercept = mean(hist(results$PValue, breaks = 100,
plot = FALSE)$counts[26:100])) +
ggtitle(str_c(title, " p-value distribution"))
}
# check the p-value distrubution
pvalue_plot(med_exo, "Media vs Exosome-dosed")
We have the two distributions of p-values, so the testing seems reasonable. We can also see how the up-regulated protein number compares to the down-regulated number. We can use the topTags
function to see which proteins have the smallest p-values.
# see how many up and down candidates (10% FDR)
summary(decideTests(et, p.value = 0.10))
# see which proteins have the smallest p-values
topTags(et)$table
In many discovery experiments, a Benjamini-Hochberg adjusted p-value (FDR) cutoff of 0.05 (5%) might be pretty strict. We use a 10% cutoff to distinguish DE from non-DE candidates. We define three cuts on the FDR: 10% to 5% are "low" significance, 5% to 1% are medium significance, and less than 1% are more "highly" significant. Cut values can be adjusted depending on the experimental situation.
We can look at expression ratio distributions as a function of candidate category. If variance is not too variable protein-to-protein, then we would expect larger mean differences to be associated with lower FDR values. Faceted plotting in ggplot2 is a nice method for showing such things.
# see how many candidates are in each category
med_exo %>% count(candidate)
log2FC_plots <- function(results, range, title) {
# Makes faceted log2FC plots by candidate
# results - results data frame
# range - plus/minus log2 x-axis limits
# title - plot title
ggplot(results, aes(x = logFC, fill = candidate)) +
geom_histogram(binwidth=0.1, color = "black") +
facet_wrap(~candidate) +
ggtitle(title) +
coord_cartesian(xlim = c(-range, range))
}
# can look at log2FC distributions as a check
log2FC_plots(med_exo, 3, "(edgeR) LogFC by candidate for Media vs Exosome-dosed")
transform <- function(results, x, y) {
# Make data frame with some transformed columns
# results - results data frame
# x - columns for x condition
# y - columns for y condition
# return new data frame
df <- data.frame(log10((results[x] + results[y])/2),
log2(results[y] / results[x]),
results$candidate,
-log10(results$FDR))
colnames(df) <- c("A", "M", "candidate", "P")
df # return the frame
}
MA_plots <- function(results, x, y, title, make_facet = TRUE) {
# makes MA-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# make_facet - flag to plot facet views
# returns a list of plots
# uses transformed data
temp <- transform(results, x, y)
# 2-fold change lines
ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))
# make main MA plot
ma <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
scale_x_continuous("Ave_intensity") +
ggtitle(title) +
ma_lines
# make separate MA plots
if (make_facet == TRUE) {
ma_facet <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
scale_x_continuous("log10 Ave_intensity") +
ma_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
}
# make the plots visible
print(ma)
if (make_facet == TRUE) {
print(ma_facet)
}
}
# MA plots of DE candidates
MA_plots(med_exo, "ave_med", "ave_exo", "(edgeR) Media versus exosome-dosed")
The solid diagonal line is 1:1, the dotted lines are 2-fold changes. The axes are in log scale.
NOTE: We will skip these to shorten the notebook.
scatter_plots <- function(results, x, y, title, make_facet = TRUE) {
# makes scatter-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# make_facet - flag to plot facet views
# returns a list of plots
# 2-fold change lines
scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
scale_y_log10(),
scale_x_log10())
# make main scatter plot
scatter <- ggplot(results, aes_string(x, y)) +
geom_point(aes(color = candidate, shape = candidate)) +
ggtitle(title) +
scatter_lines
# make separate scatter plots
if (make_facet == TRUE) {
scatter_facet <- ggplot(results, aes_string(x, y)) +
geom_point(aes(color = candidate, shape = candidate)) +
scatter_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
}
# make the plots visible
print(scatter)
if (make_facet == TRUE) {
print(scatter_facet)
}
}
#scatter_plots(med_exo, "ave_med", "ave_exo", "(edgeR) Media versus exosome-dosed")
Volcano plots are another common way to visualize DE candidates.
volcano_plot <- function(results, x, y, title, ymax) {
# makes a volcano plot
# results - a data frame with edgeR results
# x - string for the x-axis column
# y - string for y-axis column
# ymax - upper limit for y-axis
# title - plot title string
# uses transformed data
temp <- transform(results, x, y)
# build the plot
ggplot(temp, aes(x = M, y = P)) +
geom_point(aes(color = candidate, shape = candidate)) +
xlab("log2 FC") +
ylab("-log10 FDR") +
coord_cartesian(xlim = c(-5, 5), ylim = c(0, ymax)) +
ggtitle(str_c(title, " Volcano Plot"))
}
# finally, a volcano plot
volcano_plot(med_exo, "ave_med", "ave_exo", "(edgeR) Media versus exosome-dosed", 50)
We can see how the intensities of the individual samples compare for the top 10 up- and down-regulated edgeR candidate proteins.
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
identifier <- str_split(accession, "\\|", simplify = TRUE)
identifier[,3]
}
set_plot_dimensions <- function(width_choice, height_choice) {
options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}
plot_top_tags <- function(results, nleft, nright, top_tags, prefix) {
# results should have data first, then test results (two condition summary table)
# nleft, nright are number of data points in each condition
# top_tags is number of up and number of down top DE candidates to plot
# get top up-regulated
up <- results %>%
filter(logFC >= 0) %>%
arrange(FDR)
up <- up[1:top_tags, ]
# get top down-regulated
down <- results %>%
filter(logFC < 0) %>%
arrange(FDR)
down <- down[1:top_tags, ]
# pack them
proteins <- rbind(up, down)
color = c(rep("red", nleft), rep("blue", nright))
for (row_num in 1:nrow(proteins)) {
row <- proteins[row_num, ]
vec <- as.vector(unlist(row[1:(nleft + nright)]))
names(vec) <- colnames(row[1:(nleft + nright)])
title <- str_c(prefix, get_identifier(row$Acc),
", int: ", scientific(mean(vec), 2),
", p-val: ", scientific(row$FDR, digits = 3),
", FC: ", round(row$FC, digits = 1))
barplot(vec, col = color, main = title)
}
}
# plot the top 10 up and 10 down proteins
set_plot_dimensions(7, 4)
plot_top_tags(med_exo, 3, 4, 10, "(edgeR) ")
set_plot_dimensions(7, 7)
EdgeR normalizations and testing seemed reasonable for this data.
limma is basically a moderated t-test modeling. We have already loaded the intensities into a DGEList edgeR object and done the TMM normalization. We also converted to counts-per-million scale. Following the Law et al. 2016 prescription, we will run the voom variance modeling to get the model weights for limma. We need to have a desgn matrix to run voom.
voom
¶# set up the design matrix and the contrast
group <- as.factor(c(rep("media", 3), rep("exosome", 4)))
group <- factor(group, levels(group)[c(2, 1)]) # set the factor order
design <- model.matrix(~ 0 + group)
colnames(design) <- c("media", "exosome")
# see what the desgn matrix looks like
design
# define the comparison contrast
contrast <- makeContrasts(exosome-media, levels = design)
contrast
# run voom to get model weights for each intensity value
v <- voom(y, design, plot = TRUE)
# do the linear model fitting
vfit <- lmFit(v, design)
# get the fit for the contrast of interest
vfit <- contrasts.fit(vfit, contrast)
# do the empirical Bayes moderation of the test f statistic
efit <- eBayes(vfit)
# grab the information in topTable so we can get the data to plot candidates
# the coef parameter has to do with the contrast of interest
# specify no sorting of results and a number that is longer than the data table
tt_limma <- topTable(efit, coef = 1, sort.by = "none", number = Inf)
# let's see how many up and down candidates, and the top tags
summary(decideTests(efit, p.value = 0.10))
topTable(efit)
limma_PAW
¶# get the CPM intensities
limma_PAW <- as.data.frame(cpm_tmt)
# add average ratio columns (non-logged ratios), fold-change column, and row names
limma_PAW$ave_med <- rowMeans(paw_tmt_tmm[M])
limma_PAW$ave_exo <- rowMeans(paw_tmt_tmm[E])
limma_PAW$logFC <- log2(limma_PAW$ave_exo / limma_PAW$ave_med)
limma_PAW$FC <- ifelse(limma_PAW$ave_exo > limma_PAW$ave_med,
(limma_PAW$ave_exo / limma_PAW$ave_med),
(-1 * limma_PAW$ave_med / limma_PAW$ave_exo))
limma_PAW$Acc <- accession
# statisticl test results
limma_PAW$PValue <- tt_limma$P.Value
limma_PAW$FDR <- tt_limma$adj.P.Val
# add a DE candidate status column
limma_PAW$candidate <- cut(limma_PAW$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
labels = c("high", "med", "low", "no"))
# count candidates
print("Candidate Counts:")
summary(limma_PAW$candidate)
# what does the test p-value distribution look like?
pvalue_plot(limma_PAW, "PAW data with limma")
The total up and down candidate numbers are around 1450, a little larger than we had with edgeR. The number of candidates in the different FDR classes is similar; we have the most candidates in the "high" category. The p-value distribution is qualitatively quite similar. These bird's eye views seem reasonable.
We can look at the more detailed plots (MA plots, scatter plots, and volcano plot) to see what a deeper look into the DE candidates can tell us. We will repeat the edgeR DE plots from above for comparison.
# can look at log2FC distributions as a check
log2FC_plots(med_exo, 3, "(edgeR) LogFC by candidate")
log2FC_plots(limma_PAW, 3, "(limma) LogFC by candidate")
MA_plots(med_exo, "ave_med", "ave_exo", "PAW edgeR results")
MA_plots(limma_PAW, "ave_med", "ave_exo", "PAW limma results")
limma results are more similar in appearance to edgeR than we had with the basic t-test. Proteins with very small fold-changes are excluded in both packages. We still have some large fold-change proteins that are not significant in limmma that ended up significant (at a 10% FDR cutoff) in edgeR.
In the comparison to limma used without the voom step (but using a trended variance), we have very similar results to what we get when using voom.
#scatter_plots(med_exo, "ave_med", "ave_exo", "PAW edgeR results")
#scatter_plots(ttest_PAW, "ave_med", "ave_exo", "PAW t-test results")
# compare volcano plots
volcano_plot(med_exo, "ave_med", "ave_exo", "PAW edgeR results", 40)
volcano_plot(med_exo, "ave_med", "ave_exo", "PAW edgeR results (expanded y-axis)", 6)
volcano_plot(limma_PAW, "ave_med", "ave_exo", "PAW limma results", 6)
The range of FDR values returned by the two tests are not the same (easy to see in the volcano plots where the edgeR y-axis goes to 50 (-log10 FDR) in contrast to limma's maximum of 5). The FDR values from limma are more compressed. The center of the "V" is also a little more filled in for limma compared to edgeR. However, the "V" is much better defined for limma than we saw for the t-test.
set_plot_dimensions(7, 4)
plot_top_tags(limma_PAW, 3, 4, 10, "(limma) ")
set_plot_dimensions(7, 7)
Comparing two sets of results in omics datasets is far more difficult than most people can imagine. It is a very fuzzy endeavor and most ways to compare things do not like fuzzy. We can step through the FDR thresholds and do some set comparisons (intersections, differences, etc.).
for (cutoff in c(0.10, 0.05, 0.01)) {
edgeR <- get_identifier(filter(med_exo, FDR < cutoff)$Acc)
limma <- get_identifier(filter(limma_PAW, FDR < cutoff)$Acc)
cat("cutoff:", cutoff, "\n")
cat("number of candidates (edgeR, limma):", length(edgeR), length(limma), "\n")
x_union <- union(edgeR, limma)
x_intersect <- intersect(edgeR, limma)
unique_edgeR <- setdiff(edgeR, limma)
unique_limma <- setdiff(limma, edgeR)
cat("union and intersection:", length(x_union), length(x_intersect), "\n")
cat("intersection out of union:", round(length(x_intersect)/length(x_union), 2), "\n")
cat("intersection out of edgeR:", round(length(x_intersect)/length(edgeR), 2), "\n")
cat("intersection out of limma:", round(length(x_intersect)/length(limma), 2), "\n")
cat("unique to each (edgeR, limma):", length(unique_edgeR),
length(unique_limma), "\n\n")
}
edgeR_candidates <- get_identifier(filter(med_exo, FDR < 0.10)$Acc)
limma_candidates <- get_identifier(filter(limma_PAW, FDR < 0.10)$Acc)
unique_edgeR_candidates <- setdiff(edgeR_candidates, limma_candidates)
unique_limma_candidates <- setdiff(limma_candidates, edgeR_candidates)
cat("Candidates in limma but not candidate (<0.1) in edgeR")
interesting_limma <- setdiff(unique_limma_candidates, edgeR_candidates)
length(interesting_limma)
cat("Candidates in edgeR but not candidate (<0.1) in limma")
interesting_edgeR <- setdiff(unique_edgeR_candidates, limma_candidates)
length(interesting_edgeR)
We had 460 more DE candidates with limma/voom compared to edgeR. There were only 49 proteins (out of almost 5000) that were significant in edgeR but not also significant in limma/voom. Given the magnitude of the intenstiy values, modeling expression with a negative binomial distribution or a normal distributions seems pretty similar in terms of test outcomes. Considering the compicated nature of the statistical testing, the agreement is pretty good.
# explore high in limma but not high in edgeR (104)
interesting <- setdiff(unique_limma, edgeR_candidates) # high in limma but not a candidate in edgeR
not_interesting <- setdiff(unique_limma, interesting) # high in limma and also a candidate in edgeR
cat(length(interesting), length(not_interesting))
plot_selected <- function(results, nleft, nright, selected, prefix) {
# results should have data first, then test results (two condition summary table)
# nleft, nright are number of data points in each condition
# selected is list of identifiers to plot
proteins <- results
proteins$ident <- get_identifier(results$Acc)
proteins <- filter(proteins, ident %in% selected)
color = c(rep("red", nleft), rep("blue", nright))
for (row_num in 1:nrow(proteins)) {
row <- proteins[row_num, ]
vec <- as.vector(unlist(row[1:(nleft + nright)]))
names(vec) <- colnames(row[1:(nleft + nright)])
title <- str_c(prefix, row$ident, ", int: ", scientific(mean(vec), 2),
", p-val: ", scientific(row$FDR, digits = 3),
", FC: ", round(row$FC, digits = 1))
barplot(vec, col = color, main = title)
}
}
#set_plot_dimensions(7, 4)
#plot_selected(limma_PAW, 3, 4, sample(not_interesting, 10), "(limma) ")
#set_plot_dimensions(7, 7)
# plot the t-test candidates that were not significant in edgeR
#set_plot_dimensions(7, 4)
#plot_selected(limma_PAW, 3, 4, interesting, "(limma) ")
#set_plot_dimensions(7, 7)
ggplot (and visualization in general) has several ways to explore dimensionality. Different point shapes and colors can be used in the same plots. Plots can also be separated by categories and aligned to facilitate comparisons. We can do both techniques. We can color code by the candidate status in one test and facet plot by the candidate status in the other test. We might gain some useful insight, or be hopelessly confused.
set_plot_dimensions(7, 7)
volcano_plot_facet <- function(results, x, y, other, title, ymax) {
# makes a volcano plot
# results - a data frame with edgeR results
# x - string for the x-axis column
# y - string for y-axis column
# ymax - upper limit for y-axis
# title - plot title string
# other - candidate flags from the other test
# uses transformed data
temp <- transform(results, x, y)
temp$other <- other
# build the plot
ggplot(temp, aes(x = M, y = P)) +
geom_point(aes(color = candidate, shape = candidate), alpha = 0.5) +
xlab("log2 FC") +
ylab("-log10 FDR") +
coord_cartesian(xlim = c(-5, 5), ylim = c(0, ymax)) +
facet_wrap(~ other) +
ggtitle(str_c(title, " Volcano Plot"))
}
# cross-candidate faceted volcano plots
volcano_plot_facet(med_exo, "ave_med", "ave_exo", str_c("limma_", limma_PAW$candidate),
"edgeR faceted by limma candidate", 10)
Most non-candidate proteins are the same from both tests. There are a lot of purple points in the lower right panel. EdgeR has few candidates that were not candidates from limma, and they tend to have larger fold-changes. That trend is also present in the upper right panel for the low candidates in limma. The agreement between more significant candidates from limma and the more significant candidates from edgeR look pretty good.
# with the tests switched
volcano_plot_facet(limma_PAW, "ave_med", "ave_exo", str_c("edgeR_", med_exo$candidate),
"limma faceted by edgeR candidate", 5)
We have effects similar to the t-test results where some smaller fold-change proteins achieve significance in limma that did not in edgeR. Those are the low and medium color points (from limma) in the lower right panel. There were 460 more DE candidates from limma compared to edgeR. The points in the edgeR_low and edgeR_med plots are mostly coming from some DE category shuffling. There are not many purple points in either panel.
The comments after the plot of the biological coefficient of variance are seen in the comparisons. The trended variance blue line lies above the individual protein variances in most cases. The means that there will have to be a bit more of a difference in means before edgeR will return small p-values. That effectively eliminates small fold change proteins without having to invoke an ad hoc cutoff. Conversely, large fold changes can overcome high variance when using moderated testing. Using counts-per-million and voom here was very similar to using log2 of the TMM normalized intensities and trended variance in the KUR1502_PAW_limma
notebook.
There are so many factors that can affect testing p-values: the data itself, normalizations and transformations, the test and test parameter choices, the underlying distribution assumption (if parametric), how trended variances are computed in moderated tests, choice of multiple testing correction method, and probably many more factors. We need better ways to explore, evaluate, and compare the choices.
The negative binomial modeling in edgeR is mostly benign when we work with reporter ion peak heights. The numbers are large. EdgeR would likely be less appropriate if the scale of the numbers were smaller. The non-sensical signal-to-noise ratios that are used by default in Proteome Discoverer 2.x are a case in point.
There are some features in limma that are not in edgeR. One was TREAT
, a way to incorporate a fold-change threshold cutoff into the statistical model (reference below). This is now also available in edgeR if you use the general linear model framework instead of the simpler exact test. This feature can be very useful when the numbers of DE candidates becomes large. It is a way to home in on the key driver genes. Follow up experiments are expensive and have to be prioritized.
Interpretation of DE candidates to form actionable hypotheses is a tough challenge. This is a real cart and horse situation. It is very hard to interpret results in a vacuum. Intimate and deep biological knowledge of the system under study is needed. The people with those brains are typically not well equipped to wrangle large data sets. The folks that love to wrangle large data sets and model things you never thought needed modeling can be rendered a lot less effective without deep domain knowledge. Enrichment analyses are one class of techniques that try to get knowledge from the data without actually knowing anything about the data (sounds kind of weird, right?). Nevertheless, new tools test sets of genes for enrichment rather than individual genes. This makes much more sense scientifically because most biological processes involve multiple genes/proteins. limma has options for gene set enrichment testing. I have not had time to learn about them yet. When I do, I will add a new repository and share what I learn.
McCarthy, D.J. and Smyth, G.K., 2009. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics, 25(6), pp.765-771.
One step towards more meaningful results summaries are getting the main proteomics results and the statistical results combined into a single table. If we are careful with how we bring the data into R and how we take it back out, we can keep tables in order so merging results are easier. We kept the accessions column as we worked with the data. We can be really robust and use merge functions to weave together starting tables with final results tables.
We should always end notebooks with information about what packages and versions were used in the analysis.
# save the testing results
#write.table(med_exo, file = "KUR1502_results_limma-voom.txt", sep = "\t",
# row.names = FALSE, na = " ")
# log the session
sessionInfo()