Re-analysis of TMT data MORG-75: Fusion MS3

Analysis performed by Phil Wilmarth

January 19, 2020



Overview

This is data from the Terry Morgan lab at OHSU. Tandem mass tags were used to label human vulvar vestibulitis tissue samples. There were 4 control samples, 8 primary vestibulitis, and 2 secondary vestibulitis. The 14 samples were distributed across two TMT 10-plex experiments (plexes). A master pool of all samples was created, and duplicate channels of the pooled standard were included in each 10-plex run. Because two of the 16 channels available for samples were left over, control samples 41 and 42 were labeled twice and split between the plexes.

The Internal Reference Scaling (IRS) method was used to adjust each TMT experiment onto a common reporter ion scale. The data was acquired on a first-generation Thermo Fusion Tribrid instrument using the SPS MS3 method in August 2015. An on-line high pH separation was used to fractionate the samples into 17 first-dimension fractions. A low pH 2-hour gradient was used for the second dimension. Data analysis used the open source PAW pipeline to confidently identify proteins with high sensitivity and to aggregate the reporter ion signals.

Statistical analysis used the Bioconductor package edgeR in a Jupyter notebook running an R kernel. The PAW pipeline produces manageable, informative summary tables. The PAW results and statistical testing results were integrated into a final Excel spreadsheet.

The data has not yet been published and is not publicly available.

Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.

Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.

Plubell, D.L., Wilmarth, P.A., Zhao, Y., Fenton, A.M., Minnier, J., Reddy, A.P., Klimek, J., Yang, X., David, L.L. and Pamir, N., 2017. Extended multiplexing of tandem mass tags (TMT) labeling reveals age and high fat diet specific proteome changes in mouse epididymal adipose tissue. Molecular & Cellular Proteomics, 16(5), pp.873-890.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.

Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.


Load the necessary R libraries

In [1]:
# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘scales’

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor


Attaching package: ‘psych’

The following objects are masked from ‘package:scales’:

    alpha, rescale

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

Define common functions for notebook use

In [2]:
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}


# ============== CV function ===================================================
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

# =========== Boxplot with median label ========================================
labeled_boxplot <- function(df, ylim, title) {
    # Makes a box plot with the median value labeled
        # df - data frame with data to compute CVs of
        # ylim - upper limit for y-axis
        # title - plot title
    cv = CV(df)
    boxplot(cv, ylim = c(0, ylim), notch = TRUE, main = title)
    text(x = 0.65, y = boxplot.stats(cv)$stats[3], 
         labels = round(boxplot.stats(cv)$stats[3], 1))
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# =============== p-value plots ================================================
pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + ggtitle("p-values expanded")
    grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}

# ============= log2 fold-change distributions =================================
log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accession
get_identifier <- function(accession) {
    # accession parsing is disabled and a constant "X" is returned instead
    # because the protein accessions are hidden (data not yet published)
    # identifier <- str_split(accession, "\\|", simplify = TRUE)
    # identifier[,3]
    "X"
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top up-regulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1))
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

Load the IRS-normalized TMT intensity data

The pandas Python script that does the IRS normalization arranges the tab-delimited table so that importing into R is relatively straightforward.

In [3]:
# load the IRS-normalized data and check the table
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_9_IRS_normalized.txt", guess_max = 4312)

# "Filter" column flags contams and decoys
# "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the table from pandas is sorted so the rows we want come first
data_all <- filter(data_import, is.na(Filter), is.na(Missing))
accessions <- data_all$Accession # save accessions (edgeR "genes") so we can double check that results line up

ncol(data_all)
nrow(data_all)
Parsed with column specification:
cols(
  .default = col_double(),
  Accession = col_character(),
  Identical = col_character(),
  Similar = col_character(),
  OtherLoci = col_character(),
  Filter = col_character(),
  Missing = col_character(),
  Description = col_character()
)
See spec(...) for full column specifications.
97
3650

Check the imported RAW data channels

There are always multiple normalizations in these workflows. We start with the same total amount of protein per sample, based on protein assays done before the labeling reactions. We might have peptide assays after digestion to determine equal amounts of digest to label. In our core, we do a separate "normalization" run where equal volumes of the TMT-labeled reaction mixtures are mixed and analyzed in a single LC run. The total reporter ion signal per channel is used as a proxy for the total digest for each sample. We adjust the final mixing volumes to try to achieve more equal intensity totals across the samples.

Despite all of those "bench" normalizations, we will still need to do several software normalizations. The intensity totals per channel within each plex (after excluding contaminants) will still not be the same. We first take care of that for each plex. Each TMT plex may have different average total intensities per experiment (instruments vary over time). We can also make those match. These are the gross normalization adjustments. These single factor scalings are designed to get all of the data into the same ballpark so that more sophisticated normalizations can be more safely applied.
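
As an illustration only (the actual sample-loading adjustment is done upstream in the PAW/IRS Python script), the grand-total scaling idea can be sketched in R, assuming a numeric data frame called plex with one reporter ion column per channel and contaminants already excluded:

# hypothetical sketch of the sample-loading (grand total) normalization idea
# plex - numeric data frame, one column per TMT channel (contaminants/decoys removed)
sl_normalize <- function(plex) {
    target <- mean(colSums(plex))      # common target: the average channel total
    facs <- target / colSums(plex)     # one multiplicative factor per channel
    sweep(plex, 2, facs, FUN = "*")    # scale each channel to the same grand total
}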

The IRS method to correct for pseudo-random MS2 scan selection is necessary to combine data from multiple TMT experiments (plexes). It should be done after sample-loading corrections (and overall instrument level matching between plexes). This gets all of the individual channel measurements over all of the plexes onto the same measurement scale so we can work with the data like we had a single really big TMT kit.

Liu, H., Sadygov, R.G. and Yates, J.R., 2004. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical chemistry, 76(14), pp.4193-4201.
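
For reference, the core of the IRS calculation (performed here by the pandas script before this notebook) can be sketched in R roughly as follows. This is a sketch of one common variant, not the exact pipeline code, and it assumes two SL-normalized data frames plex1 and plex2 with rows aligned to the same proteins, character vectors pool1 and pool2 naming the pooled-standard columns in each plex, and no zero intensities:

# hypothetical IRS sketch - the real IRS step is done by the pandas script
irs_sketch <- function(plex1, plex2, pool1, pool2) {
    # per-protein reference within each plex (geometric mean of the pool channels)
    ref1 <- apply(plex1[pool1], 1, function(x) exp(mean(log(x))))
    ref2 <- apply(plex2[pool2], 1, function(x) exp(mean(log(x))))
    # common per-protein scale (geometric mean of the two plex references)
    ref <- sqrt(ref1 * ref2)
    # scale every channel in each plex onto the common scale
    list(plex1 = plex1 * (ref / ref1), plex2 = plex2 * (ref / ref2))
}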

We adjust actual reporter ion intensities. We do not work with any ratios at any of these critical early normalization steps. Once we have all of the sample measurements adjusted to the same valid measurement scale, different biological conditions (groups) can have compositional changes. Those compositional changes can throw off grand total values (how we mostly do the normalizations). Algorithms like the trimmed mean of M-values are designed to find and correct any compositional differences.
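
A toy example (made-up numbers, not this data set) shows the idea: if a few very abundant proteins increase in one sample, its grand total is inflated even though most proteins are unchanged, so total-based scaling alone would distort the unchanged majority. The TMM factors from calcNormFactors compensate for that shift:

# toy illustration only - simulated intensities, not the vestibulitis data
set.seed(1)
base <- rexp(1000, rate = 1/1e6)               # 1000 simulated protein intensities
A <- base
B <- base
B[1:10] <- B[1:10] * 20                        # compositional change in 10 proteins only
c(total_A = sum(A), total_B = sum(B))          # grand totals are no longer comparable
y_toy <- DGEList(counts = cbind(A = A, B = B))
calcNormFactors(y_toy)$samples                 # TMM factors adjust for the distorted totals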

To do good science, we should thoroughly check our data before and after these normalization steps. Visualizations are effective strategies because a picture really is worth a thousand words. The human brain is very good at processing visual patterns. We can look at the starting (RAW) reporter ion intensities before we do any software normalizations.

In [4]:
# get the raw data - we have 11 channels per plex (one unused)
data_raw <- select(data_all, starts_with("TotInt_"))

# set colors by plex: black is the first plex; red is the second plex
colors_all <- c(rep("black", 11), rep("red", 11))

# check the intensity distributions
boxplot(log10(data_raw), col = colors_all, notch = TRUE, main = "RAW Data")

There are two unused channels

The PAW pipeline assumes 11-plex TMT reagents (6-plex and 10-plex are subsets of 11-plex). We had 10-plex reagents, so the 11th channel in each plex was not used. Get the sample loading (SLNorm) and IRS normalized data (IRSNorm) for the used channels.

In [5]:
# drop unused channels
data_irs <- select(data_all, contains("IRSNorm"), -contains("unused"))
data_sl <- select(data_all, contains("SLNorm"), -contains("unused"))

# define the 3 groups
control_sl <- select(data_sl, contains("Control"))
primary_sl <- select(data_sl, contains("Primary"))
secondary_sl <- select(data_sl, contains("Secondary"))

control_irs <- select(data_irs, contains("Control"))
primary_irs <- select(data_irs, contains("Primary"))
secondary_irs <- select(data_irs, contains("Secondary"))

pools_before <- select(data_sl, contains("Pool"))
pools_after <- select(data_irs, contains("Pool"))

Compare clustering before and after IRS

We should have the biological groups clustering together. We have Control samples 41 and 42 that are present in both plexes as another sanity check.

In [6]:
# put groups together into a single data frame
tmt_sl <- bind_cols(control_sl, primary_sl, secondary_sl)
tmt_irs <- bind_cols(control_irs, primary_irs, secondary_irs)

control <- 1:6
primary <- 7:14
secondary <- 15:16

# set some colors by condition
colors = c(rep('red', length(control)), rep('blue', length(primary)), rep('green', length(secondary)))

# check clustering before starting analysis
plotMDS(log2(tmt_sl), main = "Before IRS", col = colors)

We have a strong batch effect for the sample loading only corrected data

The SLNorm data has strong left and right separation between the two plexes.

In [7]:
# clustering after IRS
plotMDS(log2(tmt_irs), main = "After IRS", col = colors)

IRS removes the TMT-plex effect

After using the pooled reference channels in each plex for the IRS method, the TMT plex effect is removed and we see some separation of the control samples (red) from the vestibulitis samples (blue and green). It is always worth mentioning that the data (the reference channels) used to make the adjustments are independent of the actual biological sample data. This reduces any possibility of confounding the correction factors with the expression changes. It also eliminates any need for carefully balanced allocation of samples by plex, something that can be hard with relatively small numbers of channels per TMT plex.

We also see that the duplicate channels for samples 41 and 42 are very close together after IRS. This is an important sanity check.

Compare reference channels to each other before and after IRS

We can see the effect of the pseudo-random MS2 sampling in the data before IRS and how effective the IRS method is at correcting it. The upper-left and lower-right scatter plots compare the two reference channels within a single plex. The lower-left set of 4 scatter plots, with dramatically larger scatter, compare reference channels between plexes.

In [8]:
pairs.panels(log10(pools_before), main = "Standards before IRS")
pairs.panels(log10(pools_after), main = "Standards after IRS")

IRS method seems to be doing its thing

The pooled reference sample is identical in each of its channels. If our measurements are valid, we should have repeated measurements of the same thing yield nearly identical intensities. We achieve that with the IRS method.

Check MDS clustering and TMM normalization

Some data normalizations are integral to the IRS procedure. There can still be compositional differences between samples from different biological conditions that IRS would not address. The trimmed mean of M-values normalization in edgeR is designed for these large-scale experiments.

Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.

We load the data into edgeR data structures and call the calcNormFactors function to perform the library size and trimmed mean of M-values (TMM) normalizations. We will double check whether the TMM normalization changed the clustering.

We need to use the edgeR normalization factors to produce the TMM normalized data that the statistical testing will be working with. EdgeR uses the normalization factors in its statistical modeling but does not output the normalized intensities. We have to compute the normalized intensities with the apply_tmm_factors function.

We kept the duplicate channels for samples 41 and 42 in the data for the above sanity checks. We need to remove those extra technical-like replicates or they will artificially reduce the control group variance and throw off the testing p-values. We will take sample 41 from the first plex and sample 42 from the second plex.

In [9]:
# drop the extra control channels
control_sl <- control_sl %>% select(., -contains("42_1")) %>% select(., -contains("41_2"))
control_irs <- control_irs %>% select(., -contains("42_1")) %>% select(., -contains("41_2"))

# put groups together into a single data frame
tmt_sl <- bind_cols(control_sl, primary_sl, secondary_sl)
tmt_irs <- bind_cols(control_irs, primary_irs, secondary_irs)

control <- 1:4
primary <- 5:12
secondary <- 13:14

# set some colors by condition
colors = c(rep('red', length(control)), rep('blue', length(primary)), rep('green', length(secondary)))

Load the data into edgeR data structures

In [10]:
# get the biological sample data into a DGEList object
group = c(rep("C", length(control)), rep("P", length(primary)), rep("S", length(secondary)))
y <- DGEList(counts = tmt_irs, group = group, genes = accessions)

# run TMM normalization (also includes a library size factor)
y <- calcNormFactors(y)

tmt_tmm <- apply_tmm_factors(y, color = colors, plot = FALSE)

# check the clustering
plotMDS(y, col = colors, main = "all samples after TMM")
Library size factors:
 IRSNorm_Control-30_1 -> 0.997024
 IRSNorm_Control-41_1 -> 0.998269
 IRSNorm_Control-42_2 -> 1.000461
 IRSNorm_Control-40_2 -> 1.001235
 IRSNorm_Primary-31_1 -> 0.996914
 IRSNorm_Primary-26_1 -> 1.002133
 IRSNorm_Primary-29_1 -> 0.999871
 IRSNorm_Primary-35_2 -> 1.002239
 IRSNorm_Primary-34_2 -> 0.999594
 IRSNorm_Primary-39_2 -> 0.997539
 IRSNorm_Primary-38_2 -> 1.002072
 IRSNorm_Primary-37_2 -> 1.000757
 IRSNorm_Secondary-28_1 -> 1.001854
 IRSNorm_Secondary-36_1 -> 1.000086

Trimmed mean of M-values (TMM) factors:
 IRSNorm_Control-30_1 -> 0.961281
 IRSNorm_Control-41_1 -> 0.943278
 IRSNorm_Control-42_2 -> 0.893062
 IRSNorm_Control-40_2 -> 0.895756
 IRSNorm_Primary-31_1 -> 0.991900
 IRSNorm_Primary-26_1 -> 1.030358
 IRSNorm_Primary-29_1 -> 0.978794
 IRSNorm_Primary-35_2 -> 0.924442
 IRSNorm_Primary-34_2 -> 1.065491
 IRSNorm_Primary-39_2 -> 0.995969
 IRSNorm_Primary-38_2 -> 1.150680
 IRSNorm_Primary-37_2 -> 1.079382
 IRSNorm_Secondary-28_1 -> 1.087495
 IRSNorm_Secondary-36_1 -> 1.040062

Combined (lib size and TMM) normalization factors:
 IRSNorm_Control-30_1 -> 0.958421
 IRSNorm_Control-41_1 -> 0.941645
 IRSNorm_Control-42_2 -> 0.893473
 IRSNorm_Control-40_2 -> 0.896862
 IRSNorm_Primary-31_1 -> 0.988839
 IRSNorm_Primary-26_1 -> 1.032557
 IRSNorm_Primary-29_1 -> 0.978667
 IRSNorm_Primary-35_2 -> 0.926512
 IRSNorm_Primary-34_2 -> 1.065059
 IRSNorm_Primary-39_2 -> 0.993518
 IRSNorm_Primary-38_2 -> 1.153064
 IRSNorm_Primary-37_2 -> 1.080199
 IRSNorm_Secondary-28_1 -> 1.089511
 IRSNorm_Secondary-36_1 -> 1.040151

TMM factors are pretty close to 1.0

The IRS method largely takes care of the library-size type of normalization. We could still have compositional differences between sample groups that the TMM factors would correct. Most TMM factors are not very different from 1.0 (the smallest is 0.89 and the largest is 1.15), so we do not seem to have any large compositional differences.

The box plots (below) are very similar across samples. TMM usually gets the medians (box centers) aligned pretty well. We also have the sizes of the boxes being very similar sample-to-sample.

Check the normalizations with box plots

Box plots for well-normalized data should be similar in size and the medians should align with each other. There can be problems with the data that may not be apparent with boxplots (some data distortions can average out and not appear different with boxplots). We will use ggplot2 this time and flip the graph on its side, so we can read the labels easier. We need tidy (long form) data for ggplot.

In [11]:
# need to make tidy data format
long_results <- gather(tmt_tmm, key = "sample", value = "intensity") %>%
  mutate(log_int = log10(intensity)) %>%
  extract(sample, into = 'group', "IRSNorm_(.*)-", remove = FALSE)
# head(long_results)
In [12]:
# plot the boxplots 
ggplot(long_results, aes(x = sample, y = log_int, fill = group)) +
  geom_boxplot(notch = TRUE) +
  coord_flip() + 
  ggtitle("Intensity distribution of each channel")

Check CV distributions

The distributions of Coefficients of Variation (CVs) are another way to get an idea of how individual proteins are behaving. This is an effective way to assess proper normalization in these experiments. We will compute CV distributions for each of the biological conditions.

We would expect samples within the same biological group to be similar to each other. It follows that when normalization methods are working to remove non-biological sources of variation, we should see median CVs and CV distributions improve after normalization.

In [13]:
# put CVs in data frames to simplify plots and summaries
cv_sl <- data.frame(C.sl = CV(tmt_sl[control]), P.sl = CV(tmt_sl[primary]),
                    S.sl = CV(tmt_sl[secondary])) 
cv_irs <- data.frame(C.irs = CV(tmt_irs[control]), P.irs = CV(tmt_irs[primary]),
                    S.irs = CV(tmt_irs[secondary])) 
cv_tmm <- data.frame(C.tmm = CV(tmt_tmm[control]), P.tmm = CV(tmt_tmm[primary]),
                    S.tmm = CV(tmt_tmm[secondary])) 
# see what the median CV values are
medians <- apply(cv_sl, 2, FUN = median)
print("SLNorm median CVs by condition (%)")
round(medians, 2)

medians <- apply(cv_irs, 2, FUN = median)
print("IRSNorm median CVs by condition (%)")
round(medians, 2)

medians <- apply(cv_tmm, 2, FUN = median)
print("Final median CVs by condition (%)")
round(medians, 2)
[1] "SLNorm median CVs by condition (%)"
C.sl
29.45
P.sl
28.61
S.sl
13.37
[1] "IRSNorm median CVs by condition (%)"
C.irs
14.13
P.irs
15.9
S.irs
13.37
[1] "Final median CVs by condition (%)"
C.tmm
14.13
P.tmm
13.91
S.tmm
13.34

The median CVs are quite low

We see significant improvements in the median CVs after IRS compared to the sample total loading adjustments (SLNorm) for the control and primary groups. The secondary median CV is unchanged because both secondary samples are in the same plex, and IRS applies the same per-protein factor to every channel within a plex. The TMM step does not change the median CVs much over IRS for these samples.

Use ggplot to explore the CV data

In [14]:
# see what the CV distributions look like
# need long form for ggplot
long_cv_sl <- gather(cv_sl, key = "condition", value = "cv") %>%
  extract(condition, into = 'group', "(.*)\\.sl", remove = FALSE)

# traditional boxplots
ggplot(long_cv_sl, aes(x = group, y = cv, fill = group)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("SL CV distributions")

# density plots
ggplot(long_cv_sl, aes(x = cv, color = group)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 150)) +
  ggtitle("SL CV distributions")

# need long form for ggplot
long_cv_tmm <- gather(cv_tmm, key = "condition", value = "cv") %>%
  extract(condition, into = 'group', "(.*)\\.tmm", remove = FALSE)

# traditional boxplots
ggplot(long_cv_tmm, aes(x = group, y = cv, fill = group)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Final CV distributions")

# density plots
ggplot(long_cv_tmm, aes(x = cv, color = group)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 100)) +
  ggtitle("Final CV distributions")

We have different numbers of samples in each group: 4 control, 8 primary, and only two secondary. CVs for an n=2 group might not be very reliable.

Finally, compare samples within each biological condition

We can also look at each biological condition with a multi-panel scatter plot grid and see how similar the replicates are to each other. We can look at the intensities before IRS and after IRS with TMM (final data).

In [15]:
# multi-panel scatter plot grids, SLNorm data
pairs.panels(log10(tmt_sl[control]), lm = TRUE, main = "Controls - SLNorm")
pairs.panels(log10(tmt_sl[primary]), lm = TRUE, main = "Primary - SLNorm")
pairs.panels(log10(tmt_sl[secondary]), lm = TRUE, main = "Secondary - SLNorm")
In [16]:
# multi-panel scatter plot grids, final data
pairs.panels(log10(tmt_tmm[control]), lm = TRUE, main = "Control - final")
pairs.panels(log10(tmt_tmm[primary]), lm = TRUE, main = "Primary - final")
pairs.panels(log10(tmt_tmm[secondary]), lm = TRUE, main = "Secondary - final")

IRS reduces sample-to-sample variation

We have considerable variance (large scatter and lower correlation coefficients) between plexes before IRS. Within-plex correlation coefficients are 0.97 to 0.98; between-plex values are around 0.89. The plex effect is removed by the IRS method, after which all coefficients are 0.97 or greater. There is still some sample-to-sample scatter even after IRS, which is not surprising given the biological variability of human samples.


EdgeR statistical testing starts here

We have loaded data into some edgeR data structures so we could run the TMM normalization function. We have not yet run any of the edgeR functions that do the statistical testing.


(1) Compare Control to Primary

Compute the shared variance trend

One of the more powerful features of edgeR (and limma) is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 14 samples across all conditions to improve the variance estimates and reduce false positive differential expression (DE) candidates. The edgeR estimateDisp function does all of this, and a visualization function lets us check the result.

We loaded the IRS data into DGEList object y a few cells above and did the normalization step. We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test will take two conditions and compare them using the normalization factors and dispersion estimates saved in y.

In [17]:
# compute dispersions and plot BCV
y <- estimateDisp(y)
plotBCV(y, main = "BCV plot of IRS/TMM normalized data")
Design matrix not provided. Switch to the classic mode.

Testing of control versus primary

We will specify the pair of interest for the exact test in edgeR and use the experiment-wide trended dispersion. The decideTestsDGE call will tell us how many up and down regulated candidates we have at an FDR of 0.10. The topTags call does the Benjamini-Hochberg multiple testing corrections. We save the test results in tt. We use the handy mean-difference (MD) plotting function (plotMD, similar to an MA plot) to visualize the DE candidates, and then check the p-value distribution.

In [18]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("C", "P"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
# topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("Control vs Primary p-value distribution")
        P-C
Down    167
NotSig 3098
Up      385

Candidates and p-values look good

The numbers of candidates are moderate and split between up and down regulation, as can be seen in the MA plot. The p-value distribution has a nice flat distribution of larger p-values (from non-DE candidates) and a large spike at small p-values from true DE candidates.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and accumulate all of the three comparisons into all_results near the end of the notebook.

In [19]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, control, "C", primary, "P")

# make column names unique by adding comparison (for the accumulated frame)
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_CvsP")

# accumulate the testing results
all_results <- results_temp

Count candidates and look at fold-change distributions

Log2 fold-change distribution plots are fairly common in proteomics. We can make those separately for the different levels of significant categories from the edgeR test results (no > 0.10 > low > 0.05 > med > 0.01 > high).

In [20]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("C vs P logFC distributions by candidate")

# plot log2 fold-changes by category in facet plots
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("C vs P logFC distributions by candidate")
A tibble: 4 × 2
  candidate     n
  <fct>     <int>
  high        129
  med         222
  low         201
  no         3098

Main summary plots

We have many comparisons to visualize, so we will use some of the functions at the top of the notebook to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.

MA plots

In [21]:
# make MA plots
MA_plots(results, "ave_C", "ave_P", "Control vs Primary")

Scatter plots

In [22]:
# make scatter plots
scatter_plots(results,  "ave_C", "ave_P", "Control vs Primary")

Volcano plot

In [23]:
# make a volcano plot
volcano_plot(results,  "ave_C", "ave_P", "Control vs Primary")

The candidates are more over-expressed in the primary samples

We have more candidates associated with over-expression in the primary samples, and the fold-changes seem larger.

Plot the top 25 up and top 25 down differentially expressed proteins

We will look at some proteins in more detail. We will filter for the 25 most statistically significant over- and under-expressed proteins. The plots are colored by condition. The labels at the top of each plot have the identifier, the average reporter ion intensity, the B-H corrected p-value (FDR), and the fold change.

Note: this data has not yet been published (I know, why not?) so I have hidden the protein accessions. Sorry!

In [24]:
top_N <- 25
set_plot_dimensions(6, 3.5)
plot_top_tags(results, length(control), length(primary), top_N)
set_plot_dimensions(7, 7)

Individual protein plots are variable

The plots demonstrate that there are different ways for expression patterns to hit the same p-values. Some effort to tease apart biological significance from statistical significance is part of the landscape in these survey experiments.

(2) Compare Control to Secondary

We have only 2 secondary samples

We will repeat similar analysis steps to compare the 4 control samples to the 2 secondary samples.

In [25]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("C", "S"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
# topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black") # 2-fold change lines

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("C vs S p-value distribution")
        S-C
Down      6
NotSig 3596
Up       48

Candidates and p-values look okay

We have reduced numbers of candidates and the data are a bit noisier. The p-value distribution has a flat distribution of larger p-values (from non-DE candidates) and a spike at small p-values from true DE candidates. We again have some candidate bias towards over-expression in the secondary samples.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

In [26]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, control, "C", secondary, "S")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_CvsS")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions

In [27]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("C vs S logFC distributions by candidate")
A tibble: 4 × 2
  candidate     n
  <fct>     <int>
  high          4
  med           3
  low          47
  no         3596

Main summary plots

MA plots

In [28]:
# make MA plots
MA_plots(results, "ave_C", "ave_S", "Control vs Secondary")

Scatter plots

In [29]:
# make scatter plots
scatter_plots(results,  "ave_C", "ave_S", "Control vs Secondary")

Volcano plot

In [30]:
# make a volcano plot
volcano_plot(results,  "ave_C", "ave_S", "Control vs Secondary")

Check top individual proteins (10 up and 10 down)

In [31]:
top_N <- 10
set_plot_dimensions(6, 3.5)
plot_top_tags(results, length(control), length(secondary), top_N)
set_plot_dimensions(7, 7)

(3) Compare Primary to Secondary

Compare primary to secondary vestibulitis samples

In [32]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("P", "S"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
# topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("P vs S p-value distribution")
        S-P
Down      3
NotSig 3646
Up        1

There are few DE candidates

There are few DE candidates and the p-value distribution is flat without any low p-value spike.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

In [33]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, primary, "P", secondary, "S")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_PvsS")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions

In [34]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("P vs S logFC distributions by candidate")
A tibble: 3 × 2
  candidate     n
  <fct>     <int>
  high          2
  med           2
  no         3646

MA plots

In [35]:
# make MA plots
MA_plots(results, "ave_P", "ave_S", "Primary vs Secondary")

Scatter plots

In [36]:
# make scatter plots
scatter_plots(results, "ave_P", "ave_S", "Primary vs Secondary")

Volcano plot

In [37]:
# make a volcano plot
volcano_plot(results, "ave_P", "ave_S", "Primary vs Secondary")

Not much difference between primary and secondary conditions

There are only two secondary samples, so it is hard to know if the lack of DE candidates has much real significance. The two secondary samples were not too close together on the MDS plots.


Summary

We demonstrated that the normalizations worked and that the general data quality was good. We had a reasonable number of samples in two of the groups (control and primary). Having only two samples in the third group limits those comparisons. We save the testing results table for integration with the proteomics results.


Save the all_results frame to TSV file

In [38]:
write.table(all_results, "Fusion_results.txt", sep = "\t",
           row.names = FALSE, na =  " ")

Log the session information

In [39]:
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.2

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] psych_1.8.12    edgeR_3.24.3    limma_3.38.3    scales_1.0.0   
 [5] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
 [9] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.1  
[13] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-3     locfit_1.5-9.1   tidyselect_0.2.5 repr_1.0.1      
 [5] splines_3.5.3    haven_2.1.1      lattice_0.20-38  colorspace_1.4-1
 [9] generics_0.0.2   vctrs_0.2.0      htmltools_0.3.6  base64enc_0.1-3 
[13] rlang_0.4.0      pillar_1.4.2     foreign_0.8-72   glue_1.3.1      
[17] withr_2.1.2      modelr_0.1.5     readxl_1.3.1     uuid_0.1-2      
[21] munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0 rvest_0.3.4     
[25] evaluate_0.14    labeling_0.3     parallel_3.5.3   broom_0.5.2     
[29] IRdisplay_0.7.0  Rcpp_1.0.2       backports_1.1.4  IRkernel_1.0.2  
[33] jsonlite_1.6     mnormt_1.5-5     hms_0.5.1        digest_0.6.20   
[37] stringi_1.4.3    grid_3.5.3       cli_1.1.0        tools_3.5.3     
[41] magrittr_1.5     lazyeval_0.2.2   crayon_1.3.4     pkgconfig_2.0.2 
[45] zeallot_0.1.0    xml2_1.2.2       lubridate_1.7.4  assertthat_0.2.1
[49] httr_1.4.1       rstudioapi_0.10  R6_2.4.0         nlme_3.1-141    
[53] compiler_3.5.3  