Re-analysis of TMT data MORG-75: QE MS2

Analysis performed by Phil Wilmarth

January 19, 2020



Overview

This is data from the Terry Morgan lab at OHSU. Tandem mass tags were used to label human vulvar vestibulitis tissue samples. There were 4 control samples, 8 primary vestibulitis, and 2 secondary vestibulitis. The 14 samples were distributed across two TMT 10-plex channels. A master pool of all samples was created, and duplicate channels of the pooled standard were included in each 10-plex run. Because we had two extra channels out of the 16 available, the 41 and 42 control samples were labeled twice and split between the plexes.

The Internal Reference Scaling method was used to adjust each TMT experiment onto a common reporter ion scale. The data was first acquired on a first generation Thermo Fusion Tribrid instrument using the SPS MS3 method in August 2015. The frozen samples were re-ran on a Q-Exactive HF in May 2017. This notebook will look at the QE MS2 data. An on-line high pH separation was used to fractionate the samples into 17 first dimensions. A low pH 2-hour gradient was used for the second dimension. Data analysis used the open source PAW pipeline to confidently identify proteins with high sensitivity and aggregate reporter ion signals.

Statistical analysis used the Bioconductor package edgeR in a Jupyter notebook running an R kernel. The PAW pipeline produces manageable, informative summary tables. The PAW results and statistical testing results were integrated into a final Excel spreadsheet.

The data has not yet been published and is not publicly available.

Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.

Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.

Plubell, D.L., Wilmarth, P.A., Zhao, Y., Fenton, A.M., Minnier, J., Reddy, A.P., Klimek, J., Yang, X., David, L.L. and Pamir, N., 2017. Extended multiplexing of tandem mass tags (TMT) labeling reveals age and high fat diet specific proteome changes in mouse epididymal adipose tissue. Molecular & Cellular Proteomics, 16(5), pp.873-890.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.

Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.


Load the necessary R libraries

In [1]:
# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘scales’

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor


Attaching package: ‘psych’

The following objects are masked from ‘package:scales’:

    alpha, rescale

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

Define common functions for notebook use

In [2]:
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}


# ============== CV function ===================================================
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

# =========== Boxplot with median label ========================================
labeled_boxplot <- function(df, ylim, title) {
    # Makes a box plot with the median value labeled
        # df - data frame with data to compute CVs of
        # ylim - upper limit for y-axis
        # title - plot title
    cv = CV(df)
    boxplot(cv, ylim = c(0, ylim), notch = TRUE, main = title)
    text(x = 0.65, y = boxplot.stats(cv)$stats[3], 
         labels = round(boxplot.stats(cv)$stats[3], 1))
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# =============== p-value plots ================================================
pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + ggtitle("p-values expanded")
    grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}

# ============= log2 fold-change distributions =================================
log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    identifier[,3]
    "X"
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top ipregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1))
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

Load the IRS-normalized TMT intensity data

The pandas Python script that does the IRS normalization arranges the tab-delimited table so that importing into R is relatively straightforward.

In [3]:
# load the IRS-normalized data and check the table
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_9_IRS_normalized.txt", guess_max = 5632)

# "Filter" column flags contams and decoys
# "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the table from pandas is sorted so the rows we want come first
data_all <- filter(data_import, is.na(Filter), is.na(Missing))
accessions <- data_all$Accession # save gene names for edgeR so we can double check that results line up

ncol(data_all)
nrow(data_all)
Parsed with column specification:
cols(
  .default = col_double(),
  Accession = col_character(),
  Identical = col_character(),
  Similar = col_character(),
  OtherLoci = col_character(),
  Filter = col_character(),
  Missing = col_character(),
  Description = col_character()
)
See spec(...) for full column specifications.
97
4900

Check the imported RAW data channels

There are always multiple normalizations in these workflows. We have the same total amount of protein based on protein assays before the labeling reactions. We might have peptide assays after digestion to determine equal amounts of digest to label. In our core, we do a separate "normalization" run where equal volumes of the TMT-labeled reaction mixtures are mixed and analyzed in a single LC run. The total reporter ion signal per channel is used as a proxy for total digest for each sample. We adjust final mixing volumes to try and achieve more equal intensity totals across the sample.

Despite all of those "bench" normalizations, we will still need to do several software normalizations. The intensity totals per channel within each plex (after excluding contaminants) will still not be the same. We first take care of that for each plex. Each TMT plex may have different average total intensities per experiment (instruments vary over time). We can also make those match. These are the gross normalization adjustments. These single factor scalings are designed to get all of the data into the same ballpark so that more sophisticated normalizations can be more safely applied.

The IRS method to correct for pseudo-random MS2 scan selection is necessary to combine data from multiple TMT experiments (plexes). It should be done after sample-loading corrections (and overall instrument level matching between plexes). This gets all of the individual channel measurements over all of the plexes onto the same measurement scale so we can work with the data like we had a single really big TMT kit.

Liu, H., Sadygov, R.G. and Yates, J.R., 2004. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical chemistry, 76(14), pp.4193-4201.

We adjust actual reporter ion intensities. We do not work with any ratios at any of these critical early normalization steps. Once we have all of the sample measurements adjusted to the same valid measurement scale, different biological conditions (groups) can have compositional changes. Those compositional changes can throw off grand total values (how we mostly do the normalizations). Algorithms like the trimmed mean of M-values are designed to find and correct any compositional differences.

To do good science, we should thoroughly check our data before and after these normalization steps. Visualizations are effective strategies because a picture really is worth a thousand words. The human brain is very good at processing visual patterns. We can look at the starting (RAW) reporter ion intensities before we do any software normalizations.

In [4]:
# see how raw data clusters - we have 11 channels per plex (one unused)
data_raw <- select(data_all, starts_with("TotInt_"))

# see colors by plex: black and  blue are Fusion; red and green are QE
colors_all <- c(rep("black", 11), rep("red", 11))

# check the intensity distributions
boxplot(log10(data_raw), col = colors_all, notch = TRUE, main = "RAW Data")

There are two unused channels

The PAW pipeline assumes 11-plex TMT reagents (6-plex and 10-plex are subsets of 11-plex). We had 10-plex reagents, so the 11th channel in each plex was not used. Get the sample loading (SLNorm) and IRS normalized data (IRSNorm) for the used channels.

In [5]:
data_irs <- select(data_all, contains("IRSNorm"), -contains("unused"))
data_sl <- select(data_all, contains("SLNorm"), -contains("unused"))

# define the 3 groups
control_sl <- select(data_sl, contains("Control"))
primary_sl <- select(data_sl, contains("Primary"))
secondary_sl <- select(data_sl, contains("Secondary"))

control_irs <- select(data_irs, contains("Control"))
primary_irs <- select(data_irs, contains("Primary"))
secondary_irs <- select(data_irs, contains("Secondary"))

pools_before <- select(data_sl, contains("Pool"))
pools_after <- select(data_irs, contains("Pool"))

We have more quantifiable proteins with the QE

We had a table with 3650 rows for the original Fusion data. We have 34% more proteins with the faster duty cycle on the QE. The first generation Fusion is a powerful instrument with the tribrid configuration. The newer platforms are more sensitive and faster, however.

Compare clustering before and after IRS

We should have the biological groups clustering together. We have Control samples 41 and 42 that are present in both plexes as another sanity check.

In [6]:
# put groups together into a single data frame
tmt_sl <- bind_cols(control_sl, primary_sl, secondary_sl)
tmt_irs <- bind_cols(control_irs, primary_irs, secondary_irs)

control <- 1:6
primary <- 7:14
secondary <- 15:16

# set some colors by condition
colors = c(rep('red', length(control)), rep('blue', length(primary)), rep('green', length(secondary)))

# check clustering before starting analysis
plotMDS(log2(tmt_sl), main = "Before IRS", col = colors)

We have a strong batch effect for the sample loading only corrected data

The SLNorm data has strong left and right separation between the two plexes.

In [7]:
# clustering after IRS
plotMDS(log2(tmt_irs), main = "After IRS", col = colors)

IRS removes the TMT-plex effect

After using the pooled reference channels in each plex for the IRS method, we remove the TMT plex effect and have some separations between the control samples (in red) from the syndrome samples (blue and green). It is always worth mentioning that the data (the reference channels) used to make the adjustments are independent of the actual biological sample data. This reduces any possibilities of confounding the correction factors and the expression changes. It also eliminates any need for carefully balanced allocations of samples by plex, something that can be hard with relatively small numbers of channels per TMT plex.

We also see that the duplicate channels for samples 41 and 42 are somewhat close together after IRS. However, they do not cluster together to the same degree that we saw in the Fusion SPS MS3 data. We should keep this in mind as we compare/contrast the platforms in the rest of this notebook.

Compare reference channels to each other before and after IRS

We can see the effect of the pseudo-random MS2 sampling in the data before IRS and how effective the IRS method is at correcting that. The upper left scatter plot and lower right scatter plots are within a single plex. The lower left set of 4 scatter plots, with dramatically larger scatter, are channels between plexes.

In [8]:
pairs.panels(log10(pools_before), main = "Standards before IRS")
pairs.panels(log10(pools_after), main = "Standards after IRS")

IRS method seems to be doing its thing

The pooled reference sample is identical in each of its channels. If our measurements are valid, we should have repeated measurements of the same thing yield nearly identical intensities. We achieve that with the IRS method. The data from the QE's MS2 scans almost looks too good.

Check MDS clustering and TMM normalization

Some data normalizations are integral to the IRS procedure. There can still be compositional differences between samples from different biological conditions that IRS would not address. The trimmed mean of M-values normalization in edgeR is designed for these large-scale experiments.

Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.

We load the data into edgeR data structures and called the calcNormFactors function to perform library size and the trimmed mean of M-values (TMM) normalization. We will double check if the TMM normalization changed the clustering.

We need to use the edgeR normalization factors to produce the TMM normalized data that the statistical testing will be working with. EdgeR uses the normalization factors in its statistical modeling but does not output the normalized intensities. We have to compute the normalized intensities with the apply_tmm_factors function.

We kept the duplicate channels for samples 41 and 42 in the data for the above sanity checks. We need to remove those extra technical-like replicates or they will artificially reduce the control group variance and throw off the testing p-values. We will take sample 41 from the first plex and sample 42 from the second plex.

In [9]:
# drop the extra control channels
control_sl <- control_sl %>% select(., -contains("42_1")) %>% select(., -contains("41_2"))
control_irs <- control_irs %>% select(., -contains("42_1")) %>% select(., -contains("41_2"))

# put groups together into a single data frame
tmt_sl <- bind_cols(control_sl, primary_sl, secondary_sl)
tmt_irs <- bind_cols(control_irs, primary_irs, secondary_irs)

control <- 1:4
primary <- 5:12
secondary <- 13:14

# set some colors by condition
colors = c(rep('red', length(control)), rep('blue', length(primary)), rep('green', length(secondary)))

Load the data into edgeR data structures

In [10]:
# get the biological sample data into a DGEList object
group = c(rep("C", length(control)), rep("P", length(primary)), rep("S", length(secondary)))
y <- DGEList(counts = tmt_irs, group = group, genes = accessions)

# run TMM normalization (also includes a library size factor)
y <- calcNormFactors(y)

tmt_tmm <- apply_tmm_factors(y, color = colors, plot = FALSE)

# check the clustering
plotMDS(y, col = colors, main = "all samples after TMM")
Library size factors:
 IRSNorm_Control-30_1 -> 0.999946
 IRSNorm_Control-41_1 -> 0.999278
 IRSNorm_Control-42_2 -> 0.995621
 IRSNorm_Control-40_2 -> 0.996226
 IRSNorm_Primary-31_1 -> 1.000010
 IRSNorm_Primary-26_1 -> 0.999168
 IRSNorm_Primary-29_1 -> 1.000689
 IRSNorm_Primary-35_2 -> 1.004292
 IRSNorm_Primary-34_2 -> 0.998541
 IRSNorm_Primary-39_2 -> 1.001377
 IRSNorm_Primary-38_2 -> 1.007373
 IRSNorm_Primary-37_2 -> 0.998034
 IRSNorm_Secondary-28_1 -> 0.997670
 IRSNorm_Secondary-36_1 -> 1.001901

Trimmed mean of M-values (TMM) factors:
 IRSNorm_Control-30_1 -> 1.028178
 IRSNorm_Control-41_1 -> 0.962427
 IRSNorm_Control-42_2 -> 0.974083
 IRSNorm_Control-40_2 -> 0.959207
 IRSNorm_Primary-31_1 -> 0.997632
 IRSNorm_Primary-26_1 -> 1.016388
 IRSNorm_Primary-29_1 -> 0.997045
 IRSNorm_Primary-35_2 -> 0.923686
 IRSNorm_Primary-34_2 -> 1.047231
 IRSNorm_Primary-39_2 -> 0.991888
 IRSNorm_Primary-38_2 -> 1.024091
 IRSNorm_Primary-37_2 -> 1.063597
 IRSNorm_Secondary-28_1 -> 1.008490
 IRSNorm_Secondary-36_1 -> 1.015065

Combined (lib size and TMM) normalization factors:
 IRSNorm_Control-30_1 -> 1.028123
 IRSNorm_Control-41_1 -> 0.961732
 IRSNorm_Control-42_2 -> 0.969817
 IRSNorm_Control-40_2 -> 0.955588
 IRSNorm_Primary-31_1 -> 0.997642
 IRSNorm_Primary-26_1 -> 1.015542
 IRSNorm_Primary-29_1 -> 0.997732
 IRSNorm_Primary-35_2 -> 0.927651
 IRSNorm_Primary-34_2 -> 1.045703
 IRSNorm_Primary-39_2 -> 0.993253
 IRSNorm_Primary-38_2 -> 1.031641
 IRSNorm_Primary-37_2 -> 1.061507
 IRSNorm_Secondary-28_1 -> 1.006141
 IRSNorm_Secondary-36_1 -> 1.016994

TMM factors are pretty close to 1.0

The IRS method kind of takes care of the library size type of normalizations. We could still have compositional differences between sample groups that the TMM factors would correct. Most TMM factors are not very different from 1.0 (smallest is 0.92 and largest is 1.06), so we do not seem to have any large compositional differences.

The box plots below are very similar across samples. TMM usually gets the medians (box centers) aligned pretty well. We also have the sizes of the boxes being very similar sample-to-sample.

Check the normalizations with box plots

Box plots for well-normalized data should be similar in size and the medians should align with each other. There can be problems with the data that may not be apparent with boxplots (some data distortions can average out and not appear different with boxplots). We will use ggplot2 this time and flip the graph on its side, so we can read the labels easier. We need tidy (long form) data for ggplot.

In [11]:
# need to make tidy data format
long_results <- gather(tmt_tmm, key = "sample", value = "intensity") %>%
  mutate(log_int = log10(intensity)) %>%
  extract(sample, into = 'group', "IRSNorm_(.*)-", remove = FALSE)
# head(long_results)
In [12]:
# plot the boxplots 
ggplot(long_results, aes(x = sample, y = log_int, fill = group)) +
  geom_boxplot(notch = TRUE) +
  coord_flip() + 
  ggtitle("Intensity distribution of each channel")

Check CV distributions

The distributions of Coefficients of Variation (CVs) are another way to get an idea of how individual proteins are behaving. This is an effective way to assess proper normalization in these experiments. We will compute CV distributions for each of the biological conditions.

We would expect samples within the same biological group to be similar to each other. It follows that when normalization methods are working to remove non-biological sources of variation, we should see median CVs and CV distributions improve after normalization.

In [13]:
# put CVs in data frames to simplify plots and summaries
cv_sl <- data.frame(C.sl = CV(tmt_sl[control]), P.sl = CV(tmt_sl[primary]),
                    S.sl = CV(tmt_sl[secondary])) 
cv_irs <- data.frame(C.irs = CV(tmt_irs[control]), P.irs = CV(tmt_irs[primary]),
                    S.irs = CV(tmt_irs[secondary])) 
cv_tmm <- data.frame(C.tmm = CV(tmt_tmm[control]), P.tmm = CV(tmt_tmm[primary]),
                    S.tmm = CV(tmt_tmm[secondary])) 
# see what the median CV values are
medians <- apply(cv_sl, 2, FUN = median)
print("SLNorm median CVs by condition (%)")
round(medians, 2)

medians <- apply(cv_irs, 2, FUN = median)
print("IRSNorm median CVs by condition (%)")
round(medians, 2)

medians <- apply(cv_tmm, 2, FUN = median)
print("Final median CVs by condition (%)")
round(medians, 2)
[1] "SLNorm median CVs by condition (%)"
C.sl
26.2
P.sl
24.24
S.sl
7.31
[1] "IRSNorm median CVs by condition (%)"
C.irs
9.48
P.irs
9.44
S.irs
7.31
[1] "Final median CVs by condition (%)"
C.tmm
9.16
P.tmm
8.09
S.tmm
7.15

The median CVs are quite low

We see some significant improvements in the median CVs after IRS compared to the sample total loading adjustments (SLNorm). The TMM step does not change the median CVs much over IRS.

We will likely have some dynamic range compression for the reporter ions in the MS2 scans. The median CVs above are about half of what we had for the original Fusion data (CVs between 13% and 14% after IRS).

Use ggplot to explore the CV data

In [14]:
# see what the CV distibutions look like
# need long form for ggplot
long_cv_sl <- gather(cv_sl, key = "condition", value = "cv") %>%
  extract(condition, into = 'group', "(.*)\\.sl", remove = FALSE)

# traditional boxplots
ggplot(long_cv_sl, aes(x = group, y = cv, fill = group)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("SL CV distributions")

# density plots
ggplot(long_cv_sl, aes(x = cv, color = group)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 150)) +
  ggtitle("SL CV distributions")

# need long form for ggplot
long_cv_tmm <- gather(cv_tmm, key = "condition", value = "cv") %>%
  extract(condition, into = 'group', "(.*)\\.tmm", remove = FALSE)

# traditional boxplots
ggplot(long_cv_tmm, aes(x = group, y = cv, fill = group)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Final CV distributions")

# density plots
ggplot(long_cv_tmm, aes(x = cv, color = group)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 100)) +
  ggtitle("Final CV distributions")

Finally, compare samples within each biological condition

We can also look at each biological condition with a multi-panel scatter plot grid and see how similar the replicates are to each other. We can look at the intensities before IRS and after IRS with TMM (final data).

In [15]:
# multi-panel scatter plot grids, SLNorm data
pairs.panels(log10(tmt_sl[control]), lm = TRUE, main = "Controls - SLNorm")
pairs.panels(log10(tmt_sl[primary]), lm = TRUE, main = "Primary - SLNorm")
pairs.panels(log10(tmt_sl[secondary]), lm = TRUE, main = "Secondary - SLNorm")

Sharp difference in scatter plots within plex versus between plex

The rather low-scatter scatter plots within plex are now very distinct from the wide, wedge-shaped plots from channels between plexes.

In [16]:
# multi-panel scatter plot grids, final data
pairs.panels(log10(tmt_tmm[control]), lm = TRUE, main = "Control - final")
pairs.panels(log10(tmt_tmm[primary]), lm = TRUE, main = "Primary - final")
pairs.panels(log10(tmt_tmm[secondary]), lm = TRUE, main = "Secondary - final")

IRS reduces sample-to-sample variation

We have considerable variance (large scatter and lower correlation coefficients) between plexes before IRS. That is removed by the IRS method. The QE MS2 data have pretty narrow scatter plots and very high correlation coefficients. This is a little different compared to the SPS MS3 data from the Fusion.


EdgeR statistical testing starts here

We have loaded data into some edgeR data structures so we could run the TMM normalization function. We have not yet run any of the edgeR functions that do the statistical testing.


(1) Compare Control to Primary

Compute the shared variance trend

One of the more powerful features of edgeR (and limma) is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 16 samples across all conditions to use to improve the variance estimates and reduce false positive differential expression (DE) candidates. We have an edgeR estimateDisp function that does all of this and a visualization function to check the result.

We loaded the IRS data into DGEList object y a few cells above and did the normalization step. We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test will take two conditions and compare them using the normalization factors and dispersion estimates saved in y.

In [17]:
# compute dispersions and plot BCV
y <- estimateDisp(y)
plotBCV(y, main = "BCV plot of IRS/TMM normalized data")
Design matrix not provided. Switch to the classic mode.

Testing of control versus primary

We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The decideTestsDGE call will tell us how many up and down regulated candidates we have at an FDR of 0.10. The topTags call does the Benjamini-Hochberg multiple testing corrections. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates, and then check the p-value distribution.

In [18]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("C", "P"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
# topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("Control vs Primary p-value distribution")
        P-C
Down    350
NotSig 4045
Up      505

Candidates and p-values look okay

The numbers of candidates are pretty large and split between up and down regulation, as can be seen in the MA plot. The p-value distribution has a nice flat distribution of larger p-values (from non-DE candidates) and a large spike at small p-values from true DE candidates. Very few candidates exceed a 2-fold difference.

More candidates that in the original Fusion data

We had 552 DE candiates for the Fusion and we have 855 here. That is an increase of 55%. We had 34% more quantifiable proteins (4,900 versus 3,650), so we are gaining DE candidates quite a bit faster than we gained proteins.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

In [19]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, control, "C", primary, "P")

# make column names unique by adding comparison (for the accumulated frame)
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_CvsP")

# accumulate the testing results
all_results <- results_temp

Count candidates and look at fold-change distributions

Log2 fold-change distribution plots are fairly common in proteomics. We can make those separately for the different levels of significant categories from the edgeR test results (no > 0.10 > low > 0.05 > med > 0.01 > high).

In [20]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("C vs P logFC distributions by candidate")

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("C vs P logFC distributions by candidate")
A tibble: 4 × 2
candidaten
<fct><int>
high 173
med 335
low 347
no 4045

Main summary plots

We have many comparisons to visualize, so we will use some of the functions at the top of the notebook to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.

MA plots

In [21]:
# make MA plots
MA_plots(results, "ave_C", "ave_P", "Control vs Primary")

Scatter plots

In [22]:
# make scatter plots
scatter_plots(results,  "ave_C", "ave_P", "Control vs Primary")

Volcano plot

In [23]:
# make a volcano plot
volcano_plot(results,  "ave_C", "ave_P", "Control vs Primary")

The candidates are more over-expressed in the primary samples

We have more candidates associated with over-expression in the primary samples, and the fold-changes seem larger. My eye sees less clear separation between DE candidates and non-candidates for the more compressed QE data. My internal scale of what constitutes a convincing expression difference seems to be calibrated differently than statistical models of significance.

Plot the top 25 up and top 25 down differentially expressed proteins

We will look at some protein in more detail. We will filter for the 25 most statistically significant over and under expressed proteins. The plots are colored by condition. The labels at the top of each plot have the identifier, the average reporter ion intensity, the B-H corrected p-value, and the fold change.

Note: this data has not yet been published (I know, why not?) so I have hidden the protein accessions. Sorry!

In [24]:
top_N <- 25
set_plot_dimensions(6, 3.5)
plot_top_tags(results, length(control), length(primary), top_N)
set_plot_dimensions(7, 7)

Individual protein plots are variable

The plots demonstrate that there are different ways for expression patterns to hit the same p-values. Some of the plots look convincing to me and others do not. I think the top DE candidates from the SPS MS3 Fusion data had clearer (i.e. larger) differences in means.

(2) Compare Control to Secondary

We have only 2 secondary samples

We will repeat similar analysis steps to compare the 4 control samples to the 2 secondary samples.

In [25]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("C", "S"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
# topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black") # 2-fold change lines

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("C vs S p-value distribution")
        S-C
Down     14
NotSig 4843
Up       43

Candidates and p-values look okay

We have reduced numbers of candidates and the data are a bit noisier. The p-value distribution has a flat distribution of larger p-values (from non-DE candidates) and a spike at small p-values from true DE candidates. We again have some candidate bias towards over-expression in the secondary samples.

Relatively fewer DE candidates compared to Fusion

We have 57 DE candidates here and we had 54 on the Fusion. That is less than 6% relative gain from 34% more proteins. The interplay between variance estimates and the magnitude of the difference in means is not so simple or easy to predict.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

In [26]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, control, "C", secondary, "S")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_CvsS")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions

In [27]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("C vs S logFC distributions by candidate")
A tibble: 4 × 2
candidaten
<fct><int>
high 3
med 10
low 44
no 4843

Main summary plots

MA plots

In [28]:
# make MA plots
MA_plots(results, "ave_C", "ave_S", "Control vs Secondary")

Scatter plots

In [29]:
# make scatter plots
scatter_plots(results,  "ave_C", "ave_S", "Control vs Secondary")

Volcano plot

In [30]:
# make a volcano plot
volcano_plot(results,  "ave_C", "ave_S", "Control vs Secondary")

Check top individual proteins (10 up and 10 down)

In [31]:
top_N <- 10
set_plot_dimensions(6, 3.5)
plot_top_tags(results, length(control), length(secondary), top_N)
set_plot_dimensions(7, 7)

(3) Compare Primary to Secondary

Compare primary to secondary vestibulitis samples

In [32]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("P", "S"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
# topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("P vs S p-value distribution")
        S-P
Down      0
NotSig 4900
Up        0

There are no DE candidates

There are no DE candidates and the p-value distribution is flat without any low p-value spike. We only had 4 DE candidates for the Fusion, so the comparisons are quite similar between instruments at some higher qualitative level.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

In [33]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, primary, "P", secondary, "S")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_PvsS")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions

In [34]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("P vs S logFC distributions by candidate")
A tibble: 1 × 2
candidaten
<fct><int>
no4900

MA plots

In [35]:
# make MA plots
MA_plots(results, "ave_P", "ave_S", "Primary vs Secondary")

Scatter plots

In [36]:
# make scatter plots
scatter_plots(results, "ave_P", "ave_S", "Primary vs Secondary")

Volcano plot

In [37]:
# make a volcano plot
volcano_plot(results, "ave_P", "ave_S", "Primary vs Secondary")

No significant difference between primary and secondary conditions

There are only two secondary samples, so it is hard to know if the lack of DE candidates has much real significance.


Summary

We demonstrated that the normalizations worked and that the general data quality was good. We had reasonable number of samples in tow of the groups (controls and primary). The two samples for the third group limit the comparisons.

The general characteristics of the Fusion and QE data are quite similar. If this notebook and the Fusion notebook are scrolled side-by-side, most of the data views are qualitatively similar. We have more total proteins quantified with the QE. However, the dynamic range and fold changes are different (reduced) for MS2 data compared to the cleaner MS3 data from the Tribrid. The extra round of isolation and reduced signal probably has some effect on the quality of the Fusion data. Newer, more sensitive Tribrids should generate improved data compared to the first-generation Fusion.

My gut feeling is that the SPS MS3 data feels like better (more believable) measurements. I find the increased dynamic range and accuracy in expression measurements more reassuring.

We will save the testing results table for integration with the proteomics results and log the session.


Save the all_results frame to TSV file

In [38]:
write.table(all_results, "QE_results.txt", sep = "\t",
           row.names = FALSE, na =  " ")

Log the session information

In [39]:
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] psych_1.8.12    edgeR_3.24.3    limma_3.38.3    scales_1.0.0   
 [5] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
 [9] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.1  
[13] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-3     locfit_1.5-9.1   tidyselect_0.2.5 repr_1.0.1      
 [5] splines_3.5.3    haven_2.1.1      lattice_0.20-38  colorspace_1.4-1
 [9] generics_0.0.2   vctrs_0.2.0      htmltools_0.3.6  base64enc_0.1-3 
[13] rlang_0.4.0      pillar_1.4.2     foreign_0.8-72   glue_1.3.1      
[17] withr_2.1.2      modelr_0.1.5     readxl_1.3.1     uuid_0.1-2      
[21] munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0 rvest_0.3.4     
[25] evaluate_0.14    labeling_0.3     parallel_3.5.3   broom_0.5.2     
[29] IRdisplay_0.7.0  Rcpp_1.0.2       backports_1.1.4  IRkernel_1.0.2  
[33] jsonlite_1.6     mnormt_1.5-5     hms_0.5.1        digest_0.6.20   
[37] stringi_1.4.3    grid_3.5.3       cli_1.1.0        tools_3.5.3     
[41] magrittr_1.5     lazyeval_0.2.2   crayon_1.3.4     pkgconfig_2.0.2 
[45] zeallot_0.1.0    xml2_1.2.2       lubridate_1.7.4  assertthat_0.2.1
[49] httr_1.4.1       rstudioapi_0.10  R6_2.4.0         nlme_3.1-141    
[53] compiler_3.5.3  
In [ ]: