Re-analysis of TMT data MORG-75: Compare Fusion SPS MS3 to QE MS2

Analysis performed by Phil Wilmarth

January 20, 2020



Overview

This is data from the Terry Morgan lab at OHSU. Tandem mass tags were used to label human vulvar vestibulitis tissue samples. There were 4 control samples, 8 primary vestibulitis, and 2 secondary vestibulitis. The 14 samples were distributed across two TMT 10-plexes. Two of the controls (41 and 42) were repeated in the second 10-plex. A master pool of all samples was created, and duplicate channels of the pooled standard were included in each 10-plex run.

The Internal Reference Scaling (IRS) method was used to adjust each TMT experiment onto a common reporter ion scale. The data was acquired on a first-generation Thermo Fusion Tribrid instrument using the SPS MS3 method in August 2015. An on-line high pH separation was used to fractionate the samples into 17 first-dimension fractions. A low pH 2-hour gradient was used for the second dimension. Data analysis used the open source PAW pipeline to confidently identify proteins with high sensitivity and to aggregate the reporter ion signals.

The samples (saved in the minus 80 freezer) were run under similar LC conditions on a Q-Exactive HF in May 2017. The PAW pipeline is flexible enough to mix SPS MS3 data from the Fusion with MS2 data from the QE. We can even use the IRS method to put all of the data on the same intensity scale since the same pooled standard channels are present in all runs. IRS can only adjust proteins that are observed in each of the 4 plexes (the intersection of the results). That reduces the number of quantifiable proteins to about 3700, but with the advantage that we will have data for those proteins from both platforms. Then, we can safely do some head-to-head comparisons.
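
To make the scaling idea concrete, here is a minimal sketch of how IRS adjusts one plex. This is illustration only; the real work is done upstream by a pandas script, and the names pool_1, pool_2, and target are hypothetical. Whether the reference average is arithmetic or geometric is a detail of the actual implementation.

# minimal IRS sketch (illustration only, not the actual pipeline code)
# pool_1, pool_2 - reference-channel intensity vectors (one value per protein) for one plex
# target - study-wide reference level for each protein (e.g., the geometric mean of the per-plex references)
irs_factors <- function(pool_1, pool_2, target) {
    ref <- sqrt(pool_1 * pool_2)  # per-protein geometric mean of the two reference channels
    target / ref                  # per-protein scaling factor applied to every channel in this plex
}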

The study has three groups, but only two have reasonable numbers of replicates. To simplify the comparison between platforms, we will only look at the control versus primary biological comparison.

Statistical analysis used the Bioconductor package edgeR in a Jupyter notebook running an R kernel. The PAW pipeline produces manageable, informative summary tables. The PAW results and statistical testing results were integrated into a final Excel spreadsheet.

The data has not yet been published and is not publicly available.

Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.

Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.

Plubell, D.L., Wilmarth, P.A., Zhao, Y., Fenton, A.M., Minnier, J., Reddy, A.P., Klimek, J., Yang, X., David, L.L. and Pamir, N., 2017. Extended multiplexing of tandem mass tags (TMT) labeling reveals age and high fat diet specific proteome changes in mouse epididymal adipose tissue. Molecular & Cellular Proteomics, 16(5), pp.873-890.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.

Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.


Load the necessary R libraries

In [1]:
# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
library(ggExtra)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘scales’

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor


Attaching package: ‘psych’

The following objects are masked from ‘package:scales’:

    alpha, rescale

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

Define common functions for notebook use

In [2]:
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}


# ============== CV function ===================================================
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    100 * sd / ave         # CVs in percent (the last expression is returned)
}

# =========== Boxplot with median label ========================================
labeled_boxplot <- function(df, ylim, title) {
    # Makes a box plot with the median value labeled
        # df - data frame with data to compute CVs of
        # ylim - upper limit for y-axis
        # title - plot title
    cv = CV(df)
    boxplot(cv, ylim = c(0, ylim), notch = TRUE, main = title)
    text(x = 0.65, y = boxplot.stats(cv)$stats[3], 
         labels = round(boxplot.stats(cv)$stats[3], 1))
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# =============== p-value plots ================================================
pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + ggtitle("p-values expanded")
    gridExtra::grid.arrange(p1, p2, nrow = 2) # explicit namespace; gridExtra is not attached above
}

# ============= log2 fold-change distributions =================================
log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accession
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    # identifier[, 3] would normally be the returned identifier
    "X"  # return a masked identifier to keep the MORG-75 data anonymous
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top upregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1))
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

Load the IRS-normalized TMT intensity data

The pandas Python script that does the IRS normalization arranges the tab-delimited table so that importing into R is relatively straightforward. We have the biological samples distributed across two 10-plexes for each instrument/method. That will be a total of 4 plexes in the combined IRS results table. We have embedded instrument codes of "OT" for the Fusion and "QE" for the Q-Exactive in the LC run names.

In [3]:
# load the IRS-normalized data and check the table
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_9_IRS_normalized.txt", guess_max = 5874)

# "Filter" column flags contams and decoys
# "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the table from pandas is sorted so the rows we want come first
data_all <- filter(data_import, is.na(Filter), is.na(Missing))
accessions <- data_all$Accession # save accessions for edgeR so we can double-check that results line up

cat("Number of columns:", ncol(data_all))
cat("\nNumber of rows:", nrow(data_all))
Parsed with column specification:
cols(
  .default = col_double(),
  Accession = col_character(),
  Identical = col_character(),
  Similar = col_character(),
  OtherLoci = col_character(),
  Filter = col_character(),
  Missing = col_character(),
  Description = col_character()
)
See spec(...) for full column specifications.
Number of columns: 177
Number of rows: 3720

We have 3720 proteins to play with

MS3 data and MS2 data have rather different starting reporter ion intensity scales. I was lazy about implementing TMT data extraction in the PAW pipeline. We always acquire high-resolution scans for the reporter ion regions. Since 6-plex reagents are a subset of the 10-plex reagents, which are a subset of the 11-plex reagents, reporter ions for all 11 channels are extracted independently of which TMT kit was actually used. Here, we have 10-plex kits, so the 131C channels are unused. It can be interesting to look at the unused channels to assess channel cross-talk and other QC issues.
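
As a quick, optional QC peek (a sketch, not part of the original analysis), the unused 131C channels could be pulled out and compared to the used channels like this, assuming the raw total-intensity columns start with "TotInt_" and the 131C columns contain "131C":

# optional QC sketch: compare the unused 131C channels to the used channels
raw_totals <- select(data_all, starts_with("TotInt_"))
unused <- select(raw_totals, contains("131C"))
used <- select(raw_totals, -contains("131C"))
cat("median unused (131C) intensity:", median(unlist(unused)), "\n")
cat("median used channel intensity: ", median(unlist(used)), "\n")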

For fun, we will examine the raw reporter ion data before any normalizations.

In [4]:
# see how raw data clusters - we have 11 channels per plex (one unused)
data_raw <- select(data_all, starts_with("TotInt_"))

# set colors by plex: black and blue are Fusion; red and green are QE
colors_all <- c(rep("black", 11), rep("red", 11), rep("blue", 11), rep("green", 11))

# check the intensity distributions
boxplot(log10(data_raw), col = colors_all, notch = TRUE, main = "Raw Data")

MS2 intensities are about 2 orders of magnitude larger

The two Fusion plexes are in black and blue colors. The two QE plexes are in red and green (apologies to any color-blind folks). We can see that the last channels in each plex are much lower in intensity (as expected). We see that the QE MS2 data has intensities that are about 100 times larger than the Fusion SPS MS3 data.

In [5]:
# see how samples cluster
plotMDS(log2(data_raw), main = "Raw Data", col = colors_all)

The unused channels dominate the clustering

The two Fusion plexes are in black and blue colors. The two QE plexes are in red and green. We can drop the unused channels and redo the clustering.

In [6]:
# remove the 4 unused channels and redo the cluster plot
colors_used <- c(rep("black", 10), rep("red", 10), rep("blue", 10), rep("green", 10))
plotMDS(log2(select(data_raw, -contains("131C"))), main = "Raw Data (used channels)", col = colors_used)

Data clusters by TMT plex

No surprise. The pseudo random MS2 scan selection effect drives the clustering of the raw data. That is the whole reason for the IRS method. We need to correct that, or our precise isobaric labeling technique is practically useless. Based on the sizes of the clusters, the QE data has smaller differences between channels. We can extract the reference channels and see what effect IRS has on them.

In [7]:
# the SLNorm data is before IRS and the IRSNorm data is after
data_sl <- select(data_all, contains("SLNorm"), -contains("unused"))
data_irs <- select(data_all, contains("IRSNorm"), -contains("unused"))

# get just the pooled reference channels
pools_before <- select(data_sl, contains("Pool"))
pools_after <- select(data_irs, contains("Pool"))

# see how the data correlate before IRS
pairs.panels(log10(pools_before), main = "Pooled standards before IRS")

Data within plexes are very different from data between plexes

It should be noted that we have done the sample-loading corrections (SLNorm) to the above data. We have globally matched total intensities to take care of some of that 100-fold intensity difference. We see that the variance caused by the random sampling is severe enough that any statistical testing of non-IRS-corrected data would only be able to find very large expression changes.
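
For reference, the sample-loading (SL) step is simple grand-total matching. A minimal sketch of the idea is below; the actual SLNorm columns in the table were computed upstream by the pandas script, so this is only to show what the correction does.

# sample-loading (SL) normalization sketch: scale each channel so the column totals match
sl_normalize <- function(df) {
    target <- mean(colSums(df))       # common total-intensity target
    facs <- target / colSums(df)      # one factor per channel
    sweep(df, 2, facs, FUN = "*")     # apply the factors column-by-column
}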

In [8]:
pairs.panels(log10(pools_after), main = "Pooled standards after IRS")

The IRS method is amazing!!!

Or is it? While the differences in the scatter plots before and after IRS are pretty dramatic, this is kind of what the mathematics of the IRS method produces. It makes the reference channel intensities the same. We see some deviations from truly identical intensities because we have two reference channels per plex and we made the averages identical. If we plotted the average reference intensities per plex, those would be perfect.
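
A quick way to see that would be to average the two reference channels within each plex and plot those averages against each other. This is a sketch and assumes the pool columns in pools_after are grouped two per plex, in plex order.

# sanity-check sketch: per-plex reference averages should be essentially identical after IRS
pool_aves <- data.frame(plex1 = rowMeans(pools_after[1:2]),
                        plex2 = rowMeans(pools_after[3:4]),
                        plex3 = rowMeans(pools_after[5:6]),
                        plex4 = rowMeans(pools_after[7:8]))
pairs.panels(log10(pool_aves), main = "Per-plex reference averages after IRS")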

Keep duplicate channels for the QC checks

We have two duplicated controls, so we only have 4 control samples instead of 6. We will keep the two extra channels in the data for the quality control steps. It will give us another independent way to validate the IRS method. For fun, we can mix the Fusion SPS MS3 data and the QE MS2 data in one clustering plot. The four colors still denote the four plexes (as above).

In [9]:
plotMDS(log2(data_irs), main = "All channels after IRS", col = colors_used)

The TMT plex effect is removed by the IRS method.

We also see that the duplicated control samples (41 and 42) are close together by platform (the Fusion or the Q-Exactive).

We can make the cluster plots separately for each platform.

Before we do that, we will drop the duplicate control 41 and 42 channels so we have the data prepped for the statistical testing. We will keep one from each plex by platform.

In [10]:
# define the 2 groups for the Fusion
control_ot <- data_irs %>% select(., contains("Control")) %>% 
  select(., -ends_with("_2")) %>% 
  select(., -ends_with("_4")) %>%
  select(., -contains("42_1")) %>%
  select(., -contains("41_3"))
primary_ot <- data_irs %>% select(., contains("Primary")) %>% 
  select(., -ends_with("_2")) %>% select(., -ends_with("_4"))

# define the 2 groups for the Q-Exactive
control_qe <- data_irs %>% select(., contains("Control")) %>% 
  select(., -ends_with("_1")) %>% 
  select(., -ends_with("_3")) %>%
  select(., -contains("42_2")) %>%
  select(., -contains("41_4"))
primary_qe <- data_irs %>% select(., contains("Primary")) %>% 
  select(., -ends_with("_1")) %>% select(., -ends_with("_3"))

# put groups together into single data frames
tmt_ot <- bind_cols(control_ot, primary_ot)
tmt_qe <- bind_cols(control_qe, primary_qe)

# define the column indexes for each condition
control <- 1:4
primary <- 5:12

# set some colors by condition
color_groups = c(rep('red', length(control)), rep('blue', length(primary)))

# check clustering before starting analysis
plotMDS(log2(tmt_ot), main = "Fusion after IRS", col = color_groups)
plotMDS(log2(tmt_qe), main = "QE after IRS", col = color_groups)

Control (red) and primary (blue) samples have some separation

The two conditions show some separation and the general pattern is similar between platforms.

Run TMM normalization

Some data normalizations are integral to the IRS procedure. There can still be compositional differences between samples from different biological conditions that IRS would not address. The trimmed mean of M-values (TMM) normalization in edgeR is designed to correct such compositional differences and works well in these large-scale experiments.

Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.

We load the data into edgeR data structures and call the calcNormFactors function to perform the library size and trimmed mean of M-values (TMM) normalizations. We know from running separate analysis notebooks that the median CVs for the QE data are not the same as on the Fusion. We should not pool variances across data that behave differently, so we will need two DGEList objects, one for each platform.

We use the edgeR normalization factors to produce the TMM-normalized intensities for the plots and results tables. EdgeR applies the normalization factors internally in its statistical modeling but does not output the normalized intensities, so we compute them with the apply_tmm_factors function defined above.

In [11]:
# get the biological sample data into a DGEList object
group = c(rep("C", length(control)), rep("P", length(primary)))
y_ot <- DGEList(counts = tmt_ot, group = group, genes = accessions)
y_qe <- DGEList(counts = tmt_qe, group = group, genes = accessions)

# run TMM normalization (also includes a library size factor)
y_ot <- calcNormFactors(y_ot)
y_qe <- calcNormFactors(y_qe)

# compute the Fusion normalized intensities
cat("Fusion data -")
tmt_ot_tmm <- apply_tmm_factors(y_ot, color = color_groups)

# check the Fusion clustering
plotMDS(y_ot, col = color_groups, main = "Fusion samples after TMM")
Fusion data -
Library size factors:
 IRSNorm_Control-30_1 -> 0.995325
 IRSNorm_Control-41_1 -> 0.997747
 IRSNorm_Control-42_3 -> 1.001469
 IRSNorm_Control-40_3 -> 1.002549
 IRSNorm_Primary-31_1 -> 0.999103
 IRSNorm_Primary-26_1 -> 1.000905
 IRSNorm_Primary-29_1 -> 1.000657
 IRSNorm_Primary-35_3 -> 1.004297
 IRSNorm_Primary-34_3 -> 0.998760
 IRSNorm_Primary-39_3 -> 0.998459
 IRSNorm_Primary-38_3 -> 1.002824
 IRSNorm_Primary-37_3 -> 0.997977

Trimmed mean of M-values (TMM) factors:
 IRSNorm_Control-30_1 -> 0.967738
 IRSNorm_Control-41_1 -> 0.959007
 IRSNorm_Control-42_3 -> 0.914071
 IRSNorm_Control-40_3 -> 0.901118
 IRSNorm_Primary-31_1 -> 0.992809
 IRSNorm_Primary-26_1 -> 1.042597
 IRSNorm_Primary-29_1 -> 0.988774
 IRSNorm_Primary-35_3 -> 0.925324
 IRSNorm_Primary-34_3 -> 1.067927
 IRSNorm_Primary-39_3 -> 1.014095
 IRSNorm_Primary-38_3 -> 1.174413
 IRSNorm_Primary-37_3 -> 1.086038

Combined (lib size and TMM) normalization factors:
 IRSNorm_Control-30_1 -> 0.963213
 IRSNorm_Control-41_1 -> 0.956847
 IRSNorm_Control-42_3 -> 0.915414
 IRSNorm_Control-40_3 -> 0.903415
 IRSNorm_Primary-31_1 -> 0.991918
 IRSNorm_Primary-26_1 -> 1.043540
 IRSNorm_Primary-29_1 -> 0.989423
 IRSNorm_Primary-35_3 -> 0.929301
 IRSNorm_Primary-34_3 -> 1.066603
 IRSNorm_Primary-39_3 -> 1.012532
 IRSNorm_Primary-38_3 -> 1.177729
 IRSNorm_Primary-37_3 -> 1.083841

Normalization factors are all close to 1.0 and the clustering plot is not too different before TMM ("Fusion after IRS", a few cells above) or after (right above).

Check the TMM normalization for the Q-Exactive

In [12]:
# compute the Q-Exactive normalized intensities
cat("Q-Exactive data -")
tmt_qe_tmm <- apply_tmm_factors(y_qe, color = color_groups)

# check the Q-Exactive clustering
plotMDS(y_qe, col = color_groups, main = "Q-Exactive samples after TMM")
Q-Exactive data -
Library size factors:
 IRSNorm_Control-30_2 -> 1.002748
 IRSNorm_Control-41_2 -> 0.998475
 IRSNorm_Control-42_4 -> 0.997566
 IRSNorm_Control-40_4 -> 0.996088
 IRSNorm_Primary-31_2 -> 0.998841
 IRSNorm_Primary-26_2 -> 0.999570
 IRSNorm_Primary-29_2 -> 1.000759
 IRSNorm_Primary-35_4 -> 1.002617
 IRSNorm_Primary-34_4 -> 0.998629
 IRSNorm_Primary-39_4 -> 1.001352
 IRSNorm_Primary-38_4 -> 1.004555
 IRSNorm_Primary-37_4 -> 0.998866

Trimmed mean of M-values (TMM) factors:
 IRSNorm_Control-30_2 -> 1.018056
 IRSNorm_Control-41_2 -> 0.962224
 IRSNorm_Control-42_4 -> 0.960962
 IRSNorm_Control-40_4 -> 0.946647
 IRSNorm_Primary-31_2 -> 0.992723
 IRSNorm_Primary-26_2 -> 1.022902
 IRSNorm_Primary-29_2 -> 0.993096
 IRSNorm_Primary-35_4 -> 0.933735
 IRSNorm_Primary-34_4 -> 1.056840
 IRSNorm_Primary-39_4 -> 1.000932
 IRSNorm_Primary-38_4 -> 1.056299
 IRSNorm_Primary-37_4 -> 1.066548

Combined (lib size and TMM) normalization factors:
 IRSNorm_Control-30_2 -> 1.020853
 IRSNorm_Control-41_2 -> 0.960756
 IRSNorm_Control-42_4 -> 0.958623
 IRSNorm_Control-40_4 -> 0.942944
 IRSNorm_Primary-31_2 -> 0.991573
 IRSNorm_Primary-26_2 -> 1.022463
 IRSNorm_Primary-29_2 -> 0.993850
 IRSNorm_Primary-35_4 -> 0.936179
 IRSNorm_Primary-34_4 -> 1.055391
 IRSNorm_Primary-39_4 -> 1.002285
 IRSNorm_Primary-38_4 -> 1.061111
 IRSNorm_Primary-37_4 -> 1.065338

TMM factors are pretty close to 1.0 for either platform

The IRS method kind of takes care of the library size type of normalizations before doing the protein-by-protein reference scaling. We could still have some additional compositional differences between sample groups that TMM normalization would correct. Most TMM factors are not very different from 1.0, so we do not seem to have any large compositional differences.

The box plots are very similar across samples. TMM usually gets the medians (box centers) aligned pretty well. The box heights are also very similar from sample to sample. We do not see much difference in the cluster plots after IRS alone or after both IRS and TMM.

Check CV distributions

The distributions of Coefficients of Variation (CVs) are another way to get an idea of how individual proteins are behaving. This is an effective way to assess proper normalization in these experiments. We will compute CV distributions for each of the biological conditions.

In [13]:
# put CVs in data frames to simplify plots and summaries
cv_ot <- data.frame(control = CV(tmt_ot_tmm[control]), primary = CV(tmt_ot_tmm[primary])) 
cv_qe <- data.frame(control = CV(tmt_qe_tmm[control]), primary = CV(tmt_qe_tmm[primary])) 

# see what the median CV values are
medians <- apply(cv_ot, 2, FUN = median)
print("Fusion median CVs by condition (%)")
round(medians, 2)

medians <- apply(cv_qe, 2, FUN = median)
print("Q-Exactive median CVs by condition (%)")
round(medians, 2)
[1] "Fusion median CVs by condition (%)"
control
14.04
primary
14.03
[1] "Q-Exactive median CVs by condition (%)"
control
8.81
primary
7.84

The median CVs are quite low

We have considerably smaller median CVs for the Q-Exactive MS2 reporter ions compared to the Fusion's SPS MS3 data.

Use ggplot to explore the CV data distributions

In [14]:
# see what the CV distributions look like
# need long form for ggplot
long_cv_ot <- gather(cv_ot, key = "condition", value = "cv")

# traditional boxplots
ggplot(long_cv_ot, aes(x = condition, y = cv, fill = condition)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Fusion CV distributions")

# density plots
ggplot(long_cv_ot, aes(x = cv, color = condition)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 100)) +
  ggtitle("Fusion CV distributions")

CV distributions for both conditions are similar

Even though we have 4 control samples and 8 primary samples, the CV distributions are very similar.

In [15]:
# need long form for ggplot
long_cv_qe <- gather(cv_qe, key = "condition", value = "cv")

# traditional boxplots
ggplot(long_cv_qe, aes(x = condition, y = cv, fill = condition)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Q-Exactive CV distributions")

# density plots
ggplot(long_cv_qe, aes(x = cv, color = condition)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 100)) +
  ggtitle("Q-Exactive CV distributions")

Q-Exactive CV distributions are also similar

The CV distributions are shifted to smaller CV values for the Q-Exactive, but both conditions have similar CV distributions.

Finally, compare samples within each biological condition

We can also look at each biological condition with a multi-panel scatter plot grid and see how similar the replicates are to each other.

Fusion data first

In [16]:
# multi-panel scatter plot grids, Fusion
pairs.panels(log10(tmt_ot_tmm[control]), lm = TRUE, main = "Control - Fusion")
pairs.panels(log10(tmt_ot_tmm[primary]), lm = TRUE, main = "Primary - Fusion")

Q-Exactive data next

In [17]:
# multi-panel scatter plot grids, Q-Exactive
pairs.panels(log10(tmt_qe_tmm[control]), lm = TRUE, main = "Control - Q-Exactive")
pairs.panels(log10(tmt_qe_tmm[primary]), lm = TRUE, main = "Primary - Q-Exactive")

IRS reduces sample-to-sample variation

The different normalizations have reduced non-biological sources of variance and we have good similarity of samples within each condition. The scatter in the correlation plots is a little larger for the SPS MS3 Fusion data than for the QE MS2 data.


EdgeR statistical testing starts here

We have already loaded the TMT data into some edgeR data structures so that we could run the TMM normalization. We have not done any of the processing steps yet for the statistical analysis.

Compute the shared variance trend

One of the more powerful features of edgeR (and limma) is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 12 samples across two conditions to use to improve the variance estimates and reduce false positive differential expression (DE) candidates. There is, unfortunately, some flexibility in which samples are included in the trended variance estimate, and different choices produce different test results. We want a more robust estimate of the biological variability, so dropping the two duplicated control samples (more of a technical replicate situation) was an easy call. The two secondary condition samples are more complicated. We could include them in the edgeR data structure and use them in the trended variance estimate; that might be more conservative. We will only use samples from the two conditions here to keep things a little simpler.

We have an edgeR estimateDisp function that does the trended variance estimate and a visualization function to check the result. As mentioned above, we do not want to mix the MS3 and MS2 data given their different variances (and variance trends). We also do not want the two repeated control samples in the data for the edgeR analysis. To keep things balanced, we drop one copy of control-41 and one copy of control-42, one from each plex, so that each plex contributes one of the duplicated controls.

We loaded the IRS data into the DGEList objects (y_ot and y_qe) a few cells above and did the normalization step. We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test takes two conditions and compares them using the normalization factors and dispersion estimates saved in the DGEList object.

In [18]:
# compute dispersions and plot BCV
y_ot <- estimateDisp(y_ot)
plotBCV(y_ot, main = "Biological variance - Fusion SPS MS3")

y_qe <- estimateDisp(y_qe)
plotBCV(y_qe, main = "Biological variance - Q-Exactive MS2")
Design matrix not provided. Switch to the classic mode.
Design matrix not provided. Switch to the classic mode.

We see that the abundance-dependent variance trends are different on the two platforms. We also have smaller biological coefficients of variance (BCV) for the MS2 QE data compared to the SPS MS3 Fusion data.


(1) Compare Control to Primary for Fusion data

Testing of control versus primary

We usually specify the pair of interest for the exact test in edgeR if we have more than two conditions and use the experiment-wide dispersion. For this example, we only have two conditions. The decideTestsDGE call will tell us how many up- and down-regulated candidates we have at an FDR of 0.10. The topTags call does the Benjamini-Hochberg multiple testing corrections. We save the test results in tt_ot. We use the handy MD (mean-difference) plotting function from limma (plotMD, essentially an MA plot) to visualize the DE candidates, and then check the p-value distribution.

In [19]:
# data has been normalized and the trended variance computed
# the exact test object has columns like fold-change, CPM, and p-values
et_ot <- exactTest(y_ot, pair = c("C", "P"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et_ot, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
# topTags(et_ot)$table
tt_ot <- topTags(et_ot, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et_ot, p.value = 0.10)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt_ot$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et_ot$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("C vs P Fusion p-value distribution")
        P-C
Down    213
NotSig 3128
Up      379

Candidates and p-values look good

The numbers of candidates are moderate and split between up and down regulation, as can be seen in the MA plot. Many of the red and blue DE candidates have less than a 2-fold change. We tend to have larger fold-changes for the red candidates compared to the blue candidates. Both red and blue candidates tend to be separated from the bulk of the non-DE-candidate black points.

The p-value distribution has a nice flat (uniform) distribution of larger p-values (from non-DE candidates) and a large spike at small p-values from true DE candidates.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results_ot data frame (which also has the TMM-normalized data values).

In [20]:
# get the results summary
results <- collect_results(tmt_ot_tmm, tt_ot$table, control, "C", primary, "P")

# make column names unique by adding comparison (for the accumulated frame)
results_ot  <- results
colnames(results_ot) <- str_c(colnames(results), "_CvsP_ot")

Count candidates and look at fold-change distributions

In [21]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("C vs P Fusion logFC distributions by candidate")
# A tibble: 4 × 2
  candidate     n
  <fct>     <int>
1 high        123
2 med         251
3 low         218
4 no         3128

Main summary plots for Fusion SPS MS3 data

We have many comparisons to visualize, so we will use some of the functions at the top of the notebook to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.

MA plots

In [22]:
# make MA plots
MA_plots(results, "ave_C", "ave_P", "Control vs Primary - Fusion")

We see some correlation between the difference in means and the p-values. The "gap" around the 1-to-1 solid black line is progressively larger as one goes from low- to medium- to high-significance candidates. We also see the trend towards over-expression in primary condition (positive log2 FC values).

Scatter plots

In [23]:
# make scatter plots
scatter_plots(results,  "ave_C", "ave_P", "Control vs Primary - Fusion")

Volcano plot

In [24]:
# make a volcano plot
volcano_plot(results,  "ave_C", "ave_P", "Control vs Primary - Fusion")

The candidates are more over-expressed in the primary samples

We have more candidates associated with over-expression in the primary samples, and the fold-changes are larger in that direction.

Plot the top 10 up and top 10 down differentially expressed proteins

We will look at some proteins in more detail. We will filter for the 10 most statistically significant over- and under-expressed proteins. The plots are colored by condition. The labels at the top of each plot have the identifier, the average reporter ion intensity, the B-H corrected p-value, and the fold change.

Note: identifier is masked here because the data is not yet published.

In [25]:
top_N <- 10
set_plot_dimensions(6, 3.5)
plot_top_tags(results, length(control), length(primary), top_N)
set_plot_dimensions(7, 7)

Individual protein plots are variable

The top 10 candidates in each direction are based on the Benjamini-Hochberg corrected p-values. These are the most statistically significant proteins. That is not the same as sorting by fold-change, for example.

The plots demonstrate that there are different ways for expression patterns to hit the same p-values. The test p-values depend on both the difference in the means and the relative degree of variance (the scatter in the data around the mean). A small difference in means with small variance can produce a p-value similar to a large difference in means with large variance. Some effort to tease apart biological significance from statistical significance is part of the landscape in these survey experiments.
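
A tiny illustration of that point (made-up numbers and plain t-tests, not edgeR's exact test):

# illustration only: similar p-values from (small difference, small variance) and
# (large difference, large variance) - the numbers here are made up
set.seed(42)
small_diff_tight <- t.test(rnorm(8, mean = 10.0, sd = 0.2), rnorm(8, mean = 10.4, sd = 0.2))
large_diff_noisy <- t.test(rnorm(8, mean = 10.0, sd = 1.0), rnorm(8, mean = 12.0, sd = 1.0))
cat("small difference, small variance p-value:", signif(small_diff_tight$p.value, 2), "\n")
cat("large difference, large variance p-value:", signif(large_diff_noisy$p.value, 2), "\n")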


(2) Compare Control to Primary for Q-Exactive data

We will repeat similar analysis steps to compare the control and primary samples for the QE data.

In [26]:
# data has been normalized and the trended variance computed
# the exact test object has columns like fold-change, CPM, and p-values
et_qe <- exactTest(y_qe, pair = c("C", "P"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et_qe, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
# topTags(et_qe)$table
tt_qe <- topTags(et_qe, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et_qe, p.value = 0.10)
abline(h = c(-1, 1), col = "black") # 2-fold change lines

# check the p-value distribution
ggplot(tt_qe$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et_qe$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("C vs P QE p-value distribution")
        P-C
Down    276
NotSig 2992
Up      452

Candidates and p-values look okay

We had 592 candidates from the SPS MS3 Fusion data. We have 728 from the MS2 QE data, an increase of 23%. Very few DE candidates exceed a 2-fold change on the QE.

The p-value distribution has a flat distribution of larger p-values (from non-DE candidates) and a spike at small p-values from true DE candidates. We again have some candidate bias towards over-expression in the primary samples.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data). The two platform comparisons are combined into all_results further below.

In [27]:
# get the results summary
results <- collect_results(tmt_qe_tmm, tt_qe$table, control, "C", primary, "P")

# make column names unique by adding comparison
results_qe  <- results
colnames(results_qe) <- str_c(colnames(results), "_CvsP_qe")

Count candidates and look at fold-change distributions

In [28]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("C vs S QE logFC distributions by candidate")
# A tibble: 4 × 2
  candidate     n
  <fct>     <int>
1 high        195
2 med         323
3 low         210
4 no         2992

Main summary plots

MA plots

In [29]:
# make MA plots
MA_plots(results, "ave_C", "ave_P", "Control vs Primary - QE")

We see some correlation between the difference in means and the p-values (candidate categories). The "gap" around the 1-to-1 solid black line is a little larger as one goes from low- to medium- to high-significance candidates. We also see the trend towards over-expression in primary condition (positive log2 FC values). This is similar to what we saw for the Fusion, but with smaller fold-changes.

Scatter plots

In [30]:
# make scatter plots
scatter_plots(results,  "ave_C", "ave_P", "Control vs Primary - QE")

Volcano plot

In [31]:
# make a volcano plot
volcano_plot(results,  "ave_C", "ave_P", "Control vs Primary - QE")

Check top 10 up and top 10 down (by FDR) protein expression changes

In [32]:
top_N <- 10
set_plot_dimensions(6, 3.5)
plot_top_tags(results, length(control), length(primary), top_N)
set_plot_dimensions(7, 7)

(3) Compare Fusion (MS3) to QE (MS2)

Compare the two platforms and acquisition methods more head-to-head

Since we have the same set of proteins for each platform and have them on a common intensity scale, we can directly compare fold-changes, etc. Let's start with the log2 fold-changes. We can do a plot with marginal distributions and then one with some information about the trend line.

In [33]:
# see how the expression ratios correlate
library(ggExtra)

# combine the results from both platforms
all_results <- cbind(results_ot, results_qe)

logplot <- ggplot(all_results, aes(x = logFC_CvsP_ot, y = logFC_CvsP_qe)) +
  geom_point() +
  geom_smooth(method = lm) +
  ggtitle("Log2 Fold-Change: MS2 (y-axis) vs SPS MS3 (x-axis)")

ggMarginal(logplot, type = "density")
In [34]:
# see: https://community.rstudio.com/t/annotate-ggplot2-with-regression-equation-and-r-squared/6112
ggplotRegression <- function(dat, xvar, yvar){
  
  fml <- paste(yvar, "~", xvar)

  fit <- lm(fml, dat)
  
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
    geom_point() +
    stat_smooth(method = "lm", col = "red") +
    labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
                       "Intercept =",signif(fit$coef[[1]],5 ),
                       " Slope =",signif(fit$coef[[2]], 5),
                       " P =",signif(summary(fit)$coef[2,4], 5)))
}
ggplotRegression(all_results, "logFC_CvsP_ot", "logFC_CvsP_qe")

Expression ratios are correlated

I hate these plots because all of the data is collapsed to the origin. The correlation and trend line depend on a small number of outlier ratios. The slope of the trend line is in log2 ratio units, which is harder to interpret without doing some math.
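
To put the log2 slope in fold-change terms (the slope value below is hypothetical, just to show the arithmetic): if logFC_qe is roughly m times logFC_ot, then FC_qe is roughly FC_ot raised to the power m.

# what a log2-space slope means in fold-change units (slope value here is hypothetical)
slope <- 0.5                               # suppose the trend-line slope were 0.5
fc_fusion <- 4                             # a 4-fold change measured on the Fusion
fc_qe <- 2^(slope * log2(fc_fusion))       # implied change on the QE
fc_qe                                      # 2, i.e. roughly half the dynamic range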

Can also use the conventional fold-change values (simple ratios with signs)

We can also work with the ratios rather than their logs. We will do the same two types of plots.

In [35]:
# can also use conventional FC units
fcplot <- ggplot(all_results, aes(x = FC_CvsP_ot, y = FC_CvsP_qe)) +
  geom_point() +
  geom_smooth(method = lm) + 
  ggtitle("Fold change: MS2 (y-axis) vs SPS MS3 (x-axis)")

ggMarginal(fcplot, type = "density")
In [36]:
# see: https://community.rstudio.com/t/annotate-ggplot2-with-regression-equation-and-r-squared/6112
ggplotRegression <- function(dat, xvar, yvar){
  
  fml <- paste(yvar, "~", xvar)

  fit <- lm(fml, dat)
  
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
    geom_point() +
    stat_smooth(method = "lm", col = "red") +
    labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
                       "Intercept =",signif(fit$coef[[1]],5 ),
                       " Slope =",signif(fit$coef[[2]], 5),
                       " P =",signif(summary(fit)$coef[2,4], 5)))
}
ggplotRegression(all_results, "FC_CvsP_ot", "FC_CvsP_qe")

Most expression changes trend the same on both platforms

Most of the data points are in the upper right or lower left quadrants indicating that the expression changes trended the same on both platforms. We do have some ratios that differed between platforms and are in the other two quadrants. They are likely ratios close to 1-to-1 from the non-DE-candidate proteins.

The slope of the trend line indicates that the expression changes measured from MS2 scans on the Q-Exactive are about half as large as measured by the SPS MS3 data from the Fusion.

Get some more comparison stats from the results file

I opened the results file (written below) in Excel to get some additional comparison data. I cannot overstate how non-trivial comparisons like this can be. Anyone who thinks the outcome is going to be black and white or a slam dunk should find a new profession.

Are more candidates better? We have more candidates on the QE. We identified more total proteins on the QE compared to the Fusion. We have an even larger difference in candidates when we analyze the two experiments separately.

Here, we are working with the proteins seen in both experiments with usable reporter ion signals. The Fusion results are essentially a subset of the QE results. That makes sense: the extensive two-dimensional separation gives good proteome depth, so most proteins above the trigger threshold on the Fusion will also be seen on the QE. The faster-scanning QE can dig a little deeper into the proteome, and we pick up some lower abundance proteins on the QE.

Are larger fold changes (presumably also more accurate) what we want? The fold changes on the SPS MS3 data from the Fusion are about twice as large as the QE data. Do we want as few false positives as possible? Maybe taking only the differential expression candidates confirmed by both platforms would be the safest? There are many conflicting potential optimization choices. Are these choices independent of the samples and the experimental design, or are they tightly coupled? It is never simple or easy.

I decided to count and compare candidates between the platforms using an FDR cutoff of 0.05 (medium and high candidates). Any use of a hard cutoff can be problematic. Are the p-values calibrated exactly the same in both datasets? A protein could have an FDR of 0.04 on one platform and be 0.06 on the other. That would be a candidate on one platform but not the other. There will always be proteins that have FDRs that straddle the cutoff. There needs to be some fuzziness in how things are counted to do it properly, but that is harder to do.

We are measuring the same samples on two platforms. Ideally, we would ID the same proteins and measure the same expression levels. Obviously, that is not the case. We can use confirmation between platforms as a validation metric, so we can see what fraction of DE candidates on each platform were confirmed by the other platform. Since we are working on the proteins seen in both instruments, the higher sensitivity of the QE should not bias the counting.

Platform      DE candidates   Confirmed   Fraction
Fusion             374            250        67%
Q-Exactive         518            250        48%
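
The counts above were tallied by hand in Excel. A sketch of how similar counts could be pulled from the all_results frame in R is below, using the FDR <= 0.05 cutoff and requiring the same fold-change direction on both platforms; the spreadsheet logic may differ slightly.

# sketch of the confirmation counting (the table above was built by hand in Excel)
cand_ot <- all_results$FDR_CvsP_ot <= 0.05
cand_qe <- all_results$FDR_CvsP_qe <= 0.05
same_dir <- all_results$direction_CvsP_ot == all_results$direction_CvsP_qe
confirmed <- cand_ot & cand_qe & same_dir
cat("Fusion candidates:", sum(cand_ot), " confirmed:", sum(confirmed), "\n")
cat("QE candidates:    ", sum(cand_qe), " confirmed:", sum(confirmed), "\n")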

This suggests that the SPS MS3 data might be more accurate than the MS2 data. That matches the literature consensus where many issues with MS2 reporter ions have been reported. The more complicated SPS MS3 method, which is restricted to a very expensive class of instruments, was developed to improve accuracy at the expense of duty factor and sensitivity. We get fewer quantifiable proteins on the Fusion with the SPS MS3 method, but the fold changes are larger, and I like the feel of the data a lot better. I have looked at a lot of SPS MS3 data. I have only done 3 projects (and one is this data) with MS2 reporter ions. My gut feeling is that the MS2 data is compressed in an unnatural way. That compression reduces the differences in the expression levels; however, it also shrinks the variance. The mean difference and variance contribute to the p-values differently. Shrinkage in differences between means can increase false negatives. Shrinkage in variance can increase false positives. Both factors are probably at play. There could be losses of true candidates that are compensated by increased false positives so that overall candidate numbers still look reasonable.

I think the wider dynamic range in SPS MS3 data is very important. There can be many statistically significant results that are not biologically significant. Some moderate fold change is a pretty commonly used concept to help distinguish these two cases. The fold change numbers for the SPS MS3 data just seem more sensible to me. I took the 374 DE candidates for the Fusion and the 518 candidates from the QE and checked the largest fold changes (up or down) observed.

Platform   Direction   Fold Change
Fusion     Up                  7.2
QE         Up                  3.1
Fusion     Down              -17.5
QE         Down               -9.7
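
Those extremes can also be pulled programmatically (a sketch, using the same FDR <= 0.05 candidate definition as above):

# sketch: largest up and down fold changes among the FDR <= 0.05 candidates on each platform
cand_ot <- all_results$FDR_CvsP_ot <= 0.05
cand_qe <- all_results$FDR_CvsP_qe <= 0.05
cat("Fusion FC extremes:", range(all_results$FC_CvsP_ot[cand_ot]), "\n")
cat("QE FC extremes:    ", range(all_results$FC_CvsP_qe[cand_qe]), "\n")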

We again see the roughly factor of two wider dynamic range for the SPS MS3 data compared to the MS2 data.


Summary

We took the same set of samples and generated TMT reporter ion data on two Orbitrap platforms (a Fusion Tribrid and a Q-Exactive HF) with two different acquisition methods (SPS MS3 scans for the Fusion and MS2 scans for the QE). We could quantify 3650 proteins with the Fusion compared to 4900 for the QE. We were able to combine data from both platforms using the pooled internal standards, the IRS method, and the PAW pipeline. We had 3720 quantifiable proteins from the combined data. The combined data allowed true head-to-head comparisons without the complication of different numbers of proteins from each platform.

Comparisons between platforms were qualitatively similar in most aspects. The intensity differences between conditions (fold-changes) were smaller for the QE MS2 data by about a factor of two compared to the SPS MS3 Fusion data. The first-generation Fusion is less sensitive than newer Tribrids from Thermo. I think weaker reporter ion signals are more of an issue on this early Fusion than one would have on Lumos or Eclipse instruments.

My recommendation is to use an SPS MS3 method if you can. TMT in MS2 scans on platforms like the Q-Exactive HF looks mostly okay. I do not have as much confidence in the results based on the nature of the data (dynamic range seems artificially compressed and the sample-to-sample variation looks too small). I think the effects that compress the reporter ion dynamic range are variable and that is worrisome. I think instrument cleanliness, tune, and calibration may play a role. I think sample composition, degree of chromatography separation, column carryover, sample load, etc. are also factors. If there are a lot of scans where the reporter ions are near the noise level in intensity, I think the compression will be worse.

In large-scale experiments, there are biological noise filters that scientists often use. Candidates have to fit into some sensible context, and this can be used to tease out a signal in the presence of noise. I would be more careful if I was working up MS2 TMT data.


Save the all_results frame to TSV file

In [37]:
# combine the two platforms and then write table
write.table(all_results, "Combined_results.txt", sep = "\t",
           row.names = FALSE, na =  " ")

Log the session information

In [38]:
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.2

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggExtra_0.9     psych_1.8.12    edgeR_3.24.3    limma_3.38.3   
 [5] scales_1.0.0    forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
 [9] purrr_0.3.2     readr_1.3.1     tidyr_0.8.3     tibble_2.1.3   
[13] ggplot2_3.2.1   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2       locfit_1.5-9.1   lubridate_1.7.4  lattice_0.20-38 
 [5] assertthat_0.2.1 zeallot_0.1.0    digest_0.6.20    mime_0.7        
 [9] IRdisplay_0.7.0  R6_2.4.0         cellranger_1.1.0 repr_1.0.1      
[13] backports_1.1.4  evaluate_0.14    httr_1.4.1       pillar_1.4.2    
[17] rlang_0.4.0      lazyeval_0.2.2   uuid_0.1-2       readxl_1.3.1    
[21] rstudioapi_0.10  miniUI_0.1.1.1   splines_3.5.3    labeling_0.3    
[25] foreign_0.8-72   munsell_0.5.0    shiny_1.3.2      broom_0.5.2     
[29] compiler_3.5.3   httpuv_1.5.1     modelr_0.1.5     pkgconfig_2.0.2 
[33] base64enc_0.1-3  mnormt_1.5-5     htmltools_0.3.6  tidyselect_0.2.5
[37] crayon_1.3.4     withr_2.1.2      later_0.8.0      grid_3.5.3      
[41] nlme_3.1-141     jsonlite_1.6     xtable_1.8-4     gtable_0.3.0    
[45] magrittr_1.5     cli_1.1.0        stringi_1.4.3    promises_1.0.1  
[49] xml2_1.2.2       vctrs_0.2.0      generics_0.0.2   IRkernel_1.0.2  
[53] tools_3.5.3      glue_1.3.1       hms_0.5.1        parallel_3.5.3  
[57] colorspace_1.4-1 rvest_0.3.4      pbdZMQ_0.3-3     haven_2.1.1     