In Part 1, several normalization methods were used on a developing mouse lens TMT study that spanned three 6-plex TMT labeling experiments. It was clear that an IRS-like procedure is critical to combining data from multiple TMT experiments because the different TMT experiments act like different "batches". We also saw that increases in expression of several highly abundant lens proteins during the time course created a compositional bias in the samples that could be corrected by procedures like TMM.
In the second part, we explored how using or not using IRS affected downstream statistical testing. We used an RNA-Seq bioconductor package called edgeR to do a pairwise test where the two earliest time points were compared to the two latest time points. Crystallin expression would be greatly increased in the later times and this created a distinctive expression pattern.
Here we will do the same comparison, but using a ratio-based methodology rather than IRS normalization where the natural intensity scale of the data is maintained. We will use the experiment-wide average intensities as internal references and take the ratio of the 6 time points to the reference within each TMT experiment. If we do that in each individual TMT experiment, it should remove any TMT experiment dependency. The major difference will be that we will have ratios rather than scaled intensities as values.
We just want a basic gene/protein expression table to work with in R. We can always stitch data structures together and then write them out later. We want one row for each protein and columns for each biological sample.
# Analysis of IOVS mouse lens data (Supplemental Table S01):
# Khan, Shahid Y., et al. "Proteome Profiling of Developing Murine Lens Through Mass Spectrometry."
# Investigative Ophthalmology & Visual Science 59.1 (2018): 100-107.
# load libraries
library(tidyverse) # modern R packages for big data analysis
library(limma) # edgeR will load this if we do not
library(edgeR)
# read the Supplemental 01 file (saved as a CSV export from XLSX file)
data_start <- read_csv("iovs-58-13-55_s01.csv")
# filter out proteins not seen in all three runs
data_no_na <- na.omit(data_start)
# fix the column headers
col_headers <- colnames(data_no_na)
col_headers <- str_replace(col_headers, " {2,3}", " ")
col_headers <- str_replace(col_headers, "Reporter ion intensities ", "")
colnames(data_no_na) <- col_headers
# save the annotation columns (gene symbol and protein accession) for later and remove from data frame
annotate_df <- data_no_na[1:2]
data_raw <- as.data.frame(data_no_na[3:20])
row.names(data_raw) <- annotate_df$`Protein Accession No.` # tibbles do not like named rows
Sample loading inconsistencies seem common in TMT experiments and it is reasonable to correct for that before doing anything else. If we balance out the loading discrepancies first, we should end up with better TMT-wide averages. Those averages will be our mock reference standards. Here, we will take the ratios of the 6 time points in each TMT experiment to the "reference" for that experiment.
# separate the TMT data by experiment
exp1_raw <- data_raw[c(1:6)]
exp2_raw <- data_raw[c(7:12)]
exp3_raw <- data_raw[c(13:18)]
# figure out the global target scaling value
target <- mean(c(colSums(exp1_raw), colSums(exp2_raw), colSums(exp3_raw)))
# do the sample loading normalizations (scale to the target value)
norm_facs <- target / colSums(exp1_raw)
exp1_sl <- sweep(exp1_raw, 2, norm_facs, FUN = "*")
norm_facs <- target / colSums(exp2_raw)
exp2_sl <- sweep(exp2_raw, 2, norm_facs, FUN = "*")
norm_facs <- target / colSums(exp3_raw)
exp3_sl <- sweep(exp3_raw, 2, norm_facs, FUN = "*")
data_sl <- cbind(exp1_sl, exp2_sl, exp3_sl)
# make data frame with row means from each TMT experiment
refs <- tibble(rowMeans(exp1_sl), rowMeans(exp2_sl), rowMeans(exp3_sl))
colnames(refs) <- c("ave1", "ave2", "ave3")
# make new data frame with ratio values (time points divided by the "reference")
data_ratio <- exp1_sl / refs$ave1
data_ratio <- cbind(data_ratio, exp2_sl / refs$ave2)
data_ratio <- cbind(data_ratio, exp3_sl / refs$ave3)
row.names(data_ratio) <- annotate_df$`Protein Accession No.`
Remember, the developing lens over-expresses crystallins and kind of pushes everything else down. [I do not know if TMM normalization works with numbers that have the scale and range that ratios have.]
# let's see what box plots look like
boxplot(data_ratio, col = rep(c("red", "green", "blue"), each = 6),
main = "Ratio to 'average' standard", ylab = 'ratio')
boxplot(log2(data_ratio), col = rep(c("red", "green", "blue"), each = 6),
main = "Log2 ratio to 'average' standard", ylab = 'log2 of ratio')
plotDensities(log2(data_ratio), col = rep(c("red", "green", "blue"), 6), main = "Ratios")
# function computes CVs per time point
make_CVs <- function(df) {
# separate by time points
E15 <- df[c(1, 7, 13)]
E18 <- df[c(2, 8, 14)]
P0 <- df[c(3, 9, 15)]
P3 <- df[c(4, 10, 16)]
P6 <- df[c(5, 11, 17)]
P9 <- df[c(6, 12, 18)]
E15$ave <- rowMeans(E15)
E15$sd <- apply(E15[1:3], 1, sd)
E15$cv <- 100 * E15$sd / E15$ave
E18$ave <- rowMeans(E18)
E18$sd <- apply(E18[1:3], 1, sd)
E18$cv <- 100 * E18$sd / E18$ave
P0$ave <- rowMeans(P0)
P0$sd <- apply(P0[1:3], 1, sd)
P0$cv <- 100 * P0$sd / P0$ave
P3$ave <- rowMeans(P3)
P3$sd <- apply(P3[1:3], 1, sd)
P3$cv <- 100 * P3$sd / P3$ave
P6$ave <- rowMeans(P6)
P6$sd <- apply(P6[1:3], 1, sd)
P6$cv <- 100 * P6$sd / P6$ave
P9$ave <- rowMeans(P9)
P9$sd <- apply(P9[1:3], 1, sd)
P9$cv <- 100 * P9$sd / P9$ave
ave_df <- data.frame(E15$ave, E18$ave, P0$ave, P3$ave, P6$ave, P9$ave)
sd_df <- data.frame(E15$sd, E18$sd, P0$sd, P3$sd, P6$sd, P9$sd)
cv_df <- data.frame(E15$cv, E18$cv, P0$cv, P3$cv, P6$cv, P9$cv)
return(list(ave_df, sd_df, cv_df))
}
list_ratio <- make_CVs(data_ratio)
# compare CV distributions
boxplot(list_ratio[[3]], notch = TRUE, main = "Ratios CVs (14.3% ave)",
ylim = c(0, 150), ylab = 'CV (%)')
round(mean(apply(list_ratio[[3]], 2, median)), 2)
Ratios compared to total protein intensities look more pronounced in their dependence on developmental time points. The intensities after SL normalization (Part 1) looked much better in the density plots. From the plots above, there seems to be some differences between the three experiments even after taking the ratios.
The color coding in the first plot is three colors, one for each TMT experiment. The second plot color codes by development time.
plotMDS(log2(data_ratio), col = rep(c("red", "green", "blue"), each = 6),
main = "log2 of ratios to 'standard'")
plotMDS(log2(data_ratio), col = c("red", "orange", "black", "green", "blue", "violet"),
main = "log2 of ratios to 'standard'")
We do not seem to have any obvious "batch" effects. The CVs look pretty similar to what we got after IRS and have a low average median CV of 14.3%. We can try some statistical testing.
R has built in t-test and multiple-testing functions. We will use those first.
# collect the 6 early versus the 6 late columns
early <- data_ratio[c(1, 2, 7, 8, 13, 14)]
late <- data_ratio[c(5, 6, 11, 12, 17, 18)]
# get the data into a separate data frame so we can apply the t-test to each row
# to be safe, take the log2 of the ratios (maybe this helps for ratios less than one?)
pair <- data.frame(log2(early), log2(late))
# add average ratio columns (non-logged ratios), fold-change column, and row names
pair$ave_early <- rowMeans(early)
pair$ave_late <- rowMeans(late)
pair$logFC <- log2(pair$ave_late / pair$ave_early)
row.names(pair) <- annotate_df$`Protein Accession No.`
# apply the t-test (from:https://stackoverflow.com/questions/25496693/
# how-to-do-t-test-two-samples-to-rows-of-a-dataframe-using-apply-family-functio)
# data is set up to do the two-sample t.test (a built-in) on each row
t.result <- apply(pair, 1, function(x) t.test(x[1:6], x[7:12]))
# extract the p-value column from the t-test thingy
pair$p_value <- unlist(lapply(t.result, function(x) x$p.value))
# do a Benjamini-Hochberg multiple testing correction
pair$fdr <- p.adjust(pair$p_value, method = "BH")
head(pair)
# add a DE candidate status column
pair$candidate <- "no"
pair[which(pair$fdr <= 0.10 & pair$fdr > 0.05), dim(pair)[2]] <- "low"
pair[which(pair$fdr <= 0.05 & pair$fdr > 0.01), dim(pair)[2]] <- "med"
pair[which(pair$fdr <= 0.01), dim(pair)[2]] <- "high"
pair$candidate <- factor(pair$candidate, levels = c("high", "med", "low", "no"))
# count up, down and the rest (FDR less than 0.05)
all <- dim(pair)[1]
up <- dim(pair[(pair$fdr <= 0.05) & (pair$logFC > 0.0), ])[1]
down <- dim(pair[(pair$fdr <= 0.05) & (pair$logFC <= 0.0), ])[1]
up
all - up - down
down
# what does the test p-value distribution look like?
ggplot(pair, aes(p_value)) +
geom_histogram(bins = 100, fill = "white", color = "black") +
geom_hline(yintercept = mean(hist(pair$p_value, breaks = 100, plot = FALSE)$counts[26:100])) +
ggtitle("log2 ratios t-test: p-value distribution")
We will do the usual FDR cutoffs ("no" > 0.1 > "low" > 0.05 > "med" > 0.01 > "high") for color highlighting. The scatter plots will be of the non-logged ratio values, averaged across the 6 time points in each group. Ratios have a much more limited dynamic range compared to intensities.
# make the combined candidate corelation plot
de_ratio <- data.frame(pair$ave_early, pair$ave_late, pair$candidate)
colnames(de_ratio) <- c("early", "late", "candidate")
ggplot(de_ratio, aes(x = early, y = late)) +
geom_point(aes(color = candidate, shape = candidate)) +
ggtitle("t-test ratios: early vs late")
print('here')
# make separate corelation plots
ggplot(de_ratio, aes(x = early, y = late)) +
geom_point(aes(color = candidate, shape = candidate)) +
facet_wrap(~ candidate) +
ggtitle("t-test ratios: separated by candidate")
# make a volcano plot
vc <- data.frame(-1*pair$logFC, -1*log10(pair$fdr), pair$candidate)
colnames(vc) <- c("logFC", "fdr", "candidate")
ggplot(vc, aes(x = logFC, y = fdr)) +
geom_point(aes(color = candidate, shape = candidate)) +
xlab("Fold-Change (Log2)") +
ylab("-Log10 FDR") +
ggtitle("t-test ratios: Volcano Plot")
limma can do more complicated study designs and we may have cases where a t-test will not be adequate. We have to deal with the complexities of using linear models to set things up for limma function calls. We will have to specify an experimental design matrix for the modeling. The data should already be normalized (I think) and transformed correctly. I think that means log transformed because most expression factors are multiplicative. We want those factors to be additive so that they can be modeled linearly.
limma will work better if we work with all of the data in the linear model fitting and compare the early versus late time points as a contrast.
# create a basic design matrix
group = as.factor(rep(c("early", "early", "middle", "middle", "late", "late"), 3))
group <- factor(group, levels(group)[c(1, 3, 2)]) # set the factor order
design <- model.matrix(~ 0 + group)
colnames(design) <- c("early", "middle", "late")
design
# the contrast is where the early versus late sub-comparison is selected
contrast <- makeContrasts(late-early, levels = design)
contrast
We are doing a simple early versus late comparison here, so the design matrix is pretty simple. The user's guides for limma and edgeR have some examples of design matrices. Another part is that the p-values depend on what contrasts are of interest. This is related to general linear models and modeling in R. I have not found a clear (to me) overview of this topic online yet. Some Google searching and reading are recommended.
# do the linear model fitting
data_limma <- log2(data_ratio)
fit <- lmFit(data_limma, design)
# get the fit for the contrast of interest
fit2 <- contrasts.fit(fit, contrast)
# do the empirical Bayes moderation of the test statistic
fit2 <- eBayes(fit2, trend = TRUE)
# grab the information in topTable so we can get the data to plot candidates
# the coef parameter has to do with the contrast of interest
# specify no sorting of results and a number that is longer than the data table
tt_limma <- topTable(fit2, coef = 1, sort.by = "none", number = 10000)
# let's see what columns we have
head(tt_limma)
summary(decideTests(fit2))
This is repeated from above
# add average ratio columns (not logged), fold-change, and row names
pair <- data_limma
pair$ave_early <- rowMeans(early)
pair$ave_late <- rowMeans(late)
pair$logFC <- log2(pair$ave_late / pair$ave_early)
row.names(pair) <- annotate_df$`Protein Accession No.`
# get the p-values and FDRs from the top tests
pair$p_value <- tt_limma$P.Value
pair$fdr <- tt_limma$adj.P.Val
# see if we have what we want
# head(pair)
# add a DE candidate status column
pair$candidate <- "no"
pair[which(pair$fdr <= 0.10 & pair$fdr > 0.05), dim(pair)[2]] <- "low"
pair[which(pair$fdr <= 0.05 & pair$fdr > 0.01), dim(pair)[2]] <- "med"
pair[which(pair$fdr <= 0.01), dim(pair)[2]] <- "high"
pair$candidate <- factor(pair$candidate, levels = c("high", "med", "low", "no"))
# count up, down and the rest (FDR less than 0.05)
all <- dim(pair)[1]
up <- dim(pair[(pair$fdr <= 0.05) & (pair$logFC > 0.0), ])[1]
down <- dim(pair[(pair$fdr <= 0.05) & (pair$logFC <= 0.0), ])[1]
up
all - up - down
down
# what does the test p-value distribution look like?
ggplot(pair, aes(p_value)) +
geom_histogram(bins = 100, fill = "white", color = "black") +
geom_hline(yintercept = mean(hist(pair$p_value, breaks = 100, plot = FALSE)$counts[26:100])) +
ggtitle("log2 ratios t-test: p-value distribution")
# make the combined candidate corelation plot
de_ratio <- data.frame(pair$ave_early, pair$ave_late, pair$candidate)
colnames(de_ratio) <- c("early", "late", "candidate")
ggplot(de_ratio, aes(x = early, y = late)) +
geom_point(aes(color = candidate, shape = candidate)) +
ggtitle("limma log ratios: early vs late")
# make separate corelation plots
ggplot(de_ratio, aes(x = early, y = late)) +
geom_point(aes(color = candidate, shape = candidate)) +
facet_wrap(~ candidate) +
ggtitle("limma log ratios: separated by candidate")
# make a volcano plot
vc <- data.frame(-1*pair$logFC, -1*log10(pair$fdr), pair$candidate)
colnames(vc) <- c("logFC", "fdr", "candidate")
ggplot(vc, aes(x = logFC, y = fdr)) +
geom_point(aes(color = candidate, shape = candidate)) +
xlab("Fold-Change (Log2)") +
ylab("-Log10 FDR") +
ggtitle("limma log ratios: Volcano Plot")
Normalization aspects of ratios were less easy to understand. Box plots and density plots did not look that good, but the cluster plot seemed OK.
The candidate plots are much less informative to my eye. Any relative protein abundance information (between different proteins) is lost. The general expression pattern is still pretty clear. The majority of the proteins have higher expression ratios in the early lenses compared to the later time points.
limma p-values seem to have more dynamic range than the basic two-sample t-test and that can be seen in the volcano plots. limma is more designed for these types of data sets that a basic t-test. Six replicates per condition is probably too few for the t-test. limma looks like a better choice.
To do further work with the results, some of the data frames should be combined, such as the original data, the computed "standard" columns, ratios, and statistical results. The final data frame can then be written to a CSV file for further work (adding annotations, etc.). There are plenty of other options available in limma and in BioConductor for annotations and enrichment analyses.
There are some good resources out there for RNA-Seq analysis with edgeR and limma: http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html
# collect up some results to write out
final <- cbind(annotate_df, data_raw, data_sl, refs, data_ratio, pair)
write.csv(final, file = "final_part3.csv")
It is kind of hard to figure out what to try and plot. We do not have before and after like we did with IRS. Intensities are stretched out over a wide range and that lends itself to scatter plots and correlation coefficients. Ratios are more fuzzy dots instead of fuzzy lines. They may be harder to interpret.
library(psych)
# lets compare the combination of SL and TMM normalizations to SL/IRS/TMM
# again using the idea that replicates of the same time point should similar
pairs.panels(log2(data_ratio[c(1, 7, 13)]), lm = TRUE, main = "SL/Ratio E15")
pairs.panels(log2(data_ratio[c(2, 8, 14)]), lm = TRUE, main = "SL/Ratio E18")
pairs.panels(log2(data_ratio[c(3, 9, 15)]), lm = TRUE, main = "SL/Ratio P0")
pairs.panels(log2(data_ratio[c(4, 10, 16)]), lm = TRUE, main = "SL/Ratio P3")
pairs.panels(log2(data_ratio[c(5, 11, 17)]), lm = TRUE, main = "SL/Ratio P6")
pairs.panels(log2(data_ratio[c(6, 12, 18)]), lm = TRUE, main = "SL/Ratio P9")
Scatter plots (lower diagonal elements), linear fits (the red lines), and correlation coefficients (upper diagonal elements) make more sense if the data has an actual trend tendency. We need actual changes in the x-value to see what the y-value does in response. The correlation coefficients do not really tell us that the biological replicates in each condition are similar to each other. I think this is an artifact of the ratio transformation since the clustering of the data by time point seems reasonable.
Ratios kind of put everything on a similar scale. The general expression patterns with developmental time are preserved (see end of Part 2). The lens proteins have ratios less than 1 at early times and ratios that are greater than 1 at later times. Any insights into relative abundances between different proteins are lost, however.
# create some functions
subset_proteins <- function(df, proteins, labels, value_col) {
df_subset <- subset(df, df$protein %in% proteins)
df_subset <- df_subset[match(proteins, df_subset$protein), ]
df_subset$proteins <- as.factor(labels)
keycol <- "time"
valuecol <- value_col
gathercols <- colnames(df)[1:6]
return(gather_(df_subset, keycol, valuecol, gathercols))
}
plot_proteins <- function(df_list, annotate_df) {
ave_df <- df_list[[1]]
sd_df <- df_list[[2]]
# lists of lens proteins to extract and plot
alphas <- c("P24622", "P23927")
alpha_labels <- c("alphaA", "alphaB")
betas <- c("P02525", "Q9JJV1", "Q9JJV0","Q9WVJ5", "P62696", "Q9JJU9", "O35486")
beta_labels <- c("betaA1", "betaA2", "betaA4", "betaB1", "betaB2", "betaB3", "betaS")
gammas <- c("P04345", "P04344", "Q61597", "P04342", "Q03740", "Q9CXV3", "Q8VHL5")
gamma_labels <- c("gammaA", "gammaB", "gammaC", "gammaD", "gammaE", "gammaF", "gammaN")
ave_df$protein <- annotate_df$`Protein Accession No.`
sd_df$protein <- annotate_df$`Protein Accession No.`
# plot alphas
ave_df_wide <- subset_proteins(ave_df, alphas, alpha_labels, "intensity")
sd_df_wide <- subset_proteins(sd_df, alphas, alpha_labels, "sd")
ave_df_wide$sd <- sd_df_wide$sd
alpha_plot <- ggplot(ave_df_wide, aes(time, intensity, fill=proteins)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
geom_errorbar(aes(ymin = intensity-sd, ymax = intensity+sd), position = position_dodge(0.9), width = 0.4) +
xlab("Developmental Time") + ylab("Protein Intensity Ratio") +
ggtitle("Alpha crystallin expression")
print(alpha_plot)
# plot betas
ave_df_wide <- subset_proteins(ave_df, betas, beta_labels, "intensity")
sd_df_wide <- subset_proteins(sd_df, betas, beta_labels, "sd")
ave_df_wide$sd <- sd_df_wide$sd
beta_plot <- ggplot(ave_df_wide, aes(time, intensity, fill=proteins)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
geom_errorbar(aes(ymin = intensity-sd, ymax = intensity+sd), position = position_dodge(0.9), width = 0.4) +
xlab("Developmental Time") + ylab("Protein Intensity Ratio") +
ggtitle("Beta crystallin expression")
print(beta_plot)
# plot gammas
ave_df_wide <- subset_proteins(ave_df, gammas, gamma_labels, "intensity")
sd_df_wide <- subset_proteins(sd_df, gammas, gamma_labels, "sd")
ave_df_wide$sd <- sd_df_wide$sd
gamma_plot <- ggplot(ave_df_wide, aes(time, intensity, fill=proteins)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
geom_errorbar(aes(ymin = intensity-sd, ymax = intensity+sd), position = position_dodge(0.9), width = 0.4) +
xlab("Developmental Time") + ylab("Protein Intensity Ratio") +
ggtitle("Gamma crystallin expression")
print(gamma_plot)
return(NULL)
}
plot_proteins(list_ratio, annotate_df)
The expression of each protein is kind of normalized to itself. Proteins that have a different expression as a function of time have different patterns. However, most of the proteins look similar to each other in the expression pattern. There is probably nothing "wrong" with these expression ratios. They are just different and some careful thinking about what they represent is needed when interpreting their changes.
I think the advantages of retaining the natural intensity scale speak for themselves. The data in the visualizations in Parts 1 and 2 are more intuitive. Ratios are less informative with those same visualizations. Perhaps different visualizations are needed for ratio data; however, nothing I have read about seems appropriate.
# print out the R session for our records
sessionInfo()