Schweppe et al. - Real Time Search TMT¶

Testing cell lines: Regular SPS MS3 before IRS¶

Phil Wilmarth, OHSU¶

May 27, 2020¶

Re-analysis of PXD017823 (human cell lines)¶

The PRIDE data was just the human cell line data presented in Figure 4 from the JPR publication. There are HTC116 cells (first 3 channels), MCF7 cells (the next 3 channels), and HEK293 cells (the last 4 channels) labeled with 10-plex TMT. A 180-minute RP LC method for a 12-fraction separation (high pH RP) using the regular SPS MS3 acquisition was compared to running the 12-fractions with a 90-minute SPS MS3 method using real time search (RTS) and protein close-out. Protein close-out is not described in the methods or in citation 4 (only described in a figure caption) but is (maybe) restricting MS3 scans after 4 successful scans have been acquired for a particular protein.

The data from PRIDE was downloaded and processed with my PAW pipeline:

Use MSConvert to make GZipped text files with MS2 and MS3 scans
Run wide tolerance Comet searches (simple configuration)
Compile top-hit PSM summaries (discriminant scores) with reporter ion peak heights
Accurate mass conditioned target/decoy score histograms to control 1% PSM FDR
Basic and extended parsimonious protein inference
Summed reporter ion totals computed for peptides and proteins
Internal Reference Scaling (IRS) normalization

There are multiple comparisons that can be made. We compared the two acquisition methods that were done in two separate TMT experiments with internal reference scaling (IRS) in our first notebook. We will continue with some statistical testing of the differences between the cell lines.

While IRS did seem to work, the real time search data with protein close-out was quite different. We started with the SLNormalized data from each experiment for the initial testing work. We will see if the data after IRS seems any different. We will take the data from the IRS normalization script table. That takes care of finding the intersection of the protein IDs. We will have the same starting set of proteins from each experiment for testing. We will work with the Regular SPS MS3 experiment after IRS and do our usual edgeR testing with notebooks.

References¶

Devin K. Schweppe, Jimmy K. Eng, Qing Yu, Derek Bailey, Ramin Rad, Jose Navarrete-Perea, Edward L. Huttlin, Brian K. Erickson, Joao A. Paulo, and Steven P. Gygi, 2020. Full-Featured, Real-Time Database Searching Platform Enables Fast and Accurate Multiplexed Quantitative Proteomics. Journal of proteome research, 19(3), pp.2026−2034.

Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.

Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.

Plubell, D.L., Wilmarth, P.A., Zhao, Y., Fenton, A.M., Minnier, J., Reddy, A.P., Klimek, J., Yang, X., David, L.L. and Pamir, N., 2017. Extended multiplexing of tandem mass tags (TMT) labeling reveals age and high fat diet specific proteome changes in mouse epididymal adipose tissue. Molecular & Cellular Proteomics, 16(5), pp.873-890.

Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.

www.jupyter.org

Load the necessary R libraries¶

# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
library(janitor)

# set global plot size (default is 7x7 inch)
options(repr.plot.width=9, repr.plot.height=9)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.0     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.5
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘scales’


The following object is masked from ‘package:purrr’:

    discard


The following object is masked from ‘package:readr’:

    col_factor


Attaching package: ‘psych’


The following objects are masked from ‘package:scales’:

    alpha, rescale


The following objects are masked from ‘package:ggplot2’:

    %+%, alpha


Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test

Define common functions for notebook use¶

# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# =============== p-value plots ================================================
pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + ggtitle("p-values expanded")
    grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}

# ============= log2 fold-change distributions =================================
log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    identifier[,3]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top ipregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1),
                       ", ", row$candidate)
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

Load the IRS-normalized TMT intensity data¶

The pandas Python script that does the IRS normalization arranges the tab-delimited table so that importing into R is relatively straightforward. Note than this experimental design has only a single reference channel per plex instead of two.

We need to drop contaminant and decoy proteins, and proteins with missing sets of reporter ions. We extract the accessions column, the SL and IRS normed data columns, and collect channels by biological condition. We also grab the reference channels to check the IRS procedure.

# load the IRS-normalized data and check the table
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_9_IRS_normalized.txt", guess_max = 8245)

# "Filter" column flags contams and decoys
# "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the table from pandas is sorted so the rows we want come first
data_all <- filter(data_import, is.na(Filter), is.na(Missing))

# get the pooled channels
data_pools <- data_all %>% select(., contains("Pool")) %>%
  select(., contains("Norm_"))

# clean up column names 
data_all <- data_all %>% rename_all(funs(str_replace(., "Norm_", "_"))) %>% 
  rename_all(funs(str_replace(., "RTS_1", "RTS"))) %>% 
  rename_all(funs(str_replace(., "Reg_2", "Reg"))) 

# get the data before and after IRS
data_sl <- select(data_all, contains("SL_"), -contains("Pool"))
data_irs <- select(data_all, contains("IRS_"), -contains("Pool"))

# save gene names for edgeR so we can double check that results line up
accessions <- data_all$Accession

# see how many rows of data we have
length(accessions)

Parsed with column specification:
cols(
  .default = col_double(),
  Accession = col_character(),
  Identical = col_character(),
  Similar = col_character(),
  OtherLoci = col_character(),
  Filter = col_character(),
  Missing = col_character(),
  Description = col_character()
)

See spec(...) for full column specifications.

Warning message:
“funs() is soft deprecated as of dplyr 0.8.0
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once per session.”

Get the experiment to compare cell lines within¶

We will get the data from the regular SPS MS3 experiment after IRS.

# define the acquisition mode of interest
acq_mode  <- "Reg"

# define data (SL or IRS) and define the cell line indices
df <- select(data_irs, contains(acq_mode))
HCT <- 1:3
MCF <- 4:6
HEK <- 7:10

# set some colors by cell line
colors = c(rep('red', length(HCT)), rep('blue', length(MCF)), rep('green', length(HEK)))

Run TMM normalization and check final clustering¶

We will the data into edgeR data structures and call the calcNormFactors function to perform library size and the trimmed mean of M-values (TMM) normalization. We will double check if the TMM normalization changed the clustering that we had above.

Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.

We need to use the edgeR normalization factors to produce the TMM normalized data that the statistical testing will be working with. EdgeR uses the normalization factors in its statistical modeling but does not output the normalized intensities. We compute the normalized intensities with the apply_tmm_factors function.

# get the biological sample data into a DGEList object
groups = c(rep("HCT", length(HCT)), rep("MCF", length(MCF)), 
          rep("HEK", length(HEK)))
y <- DGEList(counts = df, group = groups, genes = accessions)

# run TMM normalization (also includes a library size factor)
y <- calcNormFactors(y)

# compute the actual normalized intensities
tmt_tmm <- apply_tmm_factors(y, color = colors)

# check the clustering after TMM
plotMDS(y, col = colors, main = "all samples after TMM")

Library size factors:
 IRS_HCT116_1_Reg -> 1.011919
 IRS_HCT116_2_Reg -> 1.011008
 IRS_HCT116_3_Reg -> 1.012895
 IRS_MCF_1_Reg -> 1.019526
 IRS_MCF_2_Reg -> 1.017015
 IRS_MCF_3_Reg -> 1.016398
 IRS_HEK293_1_Reg -> 0.980158
 IRS_HEK293_2_Reg -> 0.979107
 IRS_HEK293_3_Reg -> 0.980158
 IRS_HEK293_4_Reg -> 0.975052

Trimmed mean of M-values (TMM) factors:
 IRS_HCT116_1_Reg -> 1.032526
 IRS_HCT116_2_Reg -> 1.028441
 IRS_HCT116_3_Reg -> 1.024365
 IRS_MCF_1_Reg -> 1.054369
 IRS_MCF_2_Reg -> 1.034909
 IRS_MCF_3_Reg -> 1.042825
 IRS_HEK293_1_Reg -> 0.951687
 IRS_HEK293_2_Reg -> 0.958510
 IRS_HEK293_3_Reg -> 0.946396
 IRS_HEK293_4_Reg -> 0.935825

Combined (lib size and TMM) normalization factors:
 IRS_HCT116_1_Reg -> 1.044832
 IRS_HCT116_2_Reg -> 1.039763
 IRS_HCT116_3_Reg -> 1.037574
 IRS_MCF_1_Reg -> 1.074956
 IRS_MCF_2_Reg -> 1.052518
 IRS_MCF_3_Reg -> 1.059926
 IRS_HEK293_1_Reg -> 0.932803
 IRS_HEK293_2_Reg -> 0.938484
 IRS_HEK293_3_Reg -> 0.927618
 IRS_HEK293_4_Reg -> 0.912479

TMM factors are close to 1.0¶

The IRS method already takes care of the library size type of normalizations. All of the library adjustment facters were very small. We could still have compositional differences between sample groups that the TMM factors would correct for. We do see some consistent differences by cell line for the TMM factors.

The box plots are very similar. TMM usually gets the medians (box centers) aligned pretty well.

EdgeR statistical testing starts here¶

Compute the shared variance trend¶

One of the most powerful features of edgeR (and limma) is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 10 samples across all conditions to use to improve the variance estimates and reduce false positive differential expression (DE) candidates. We have an edgeR estimateDisp function that does all of this and a visualization function to check the result.

We loaded the IRS data into DGEList object y a few cells above and did the normalization step. We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test will take two conditions and compare them using the normalization factors and dispersion estimates saved in y.

# compute dispersions and plot BCV
y <- estimateDisp(y)
plotBCV(y, main = "BCV plot of IRS/TMM normalized data")

Design matrix not provided. Switch to the classic mode.

(1) HCT116 versus MCF7¶

We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The decideTestsDGE call will tell us how many up and down regulated candidates we have at an FDR of 0.05. The topTags call does the Benjamini-Hochberg multiple testing corrections. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates, and then check the p-value distribution.

# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("HCT", "MCF"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.05))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.05)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("HCT116 versus MCF7 p-value distribution")

       MCF-HCT
Down      2978
NotSig    1569
Up        2922

We have many candidates!¶

We have 79% of the proteins being differentially expressed. There are likely real differences in these cell lines. We also have very low CVs for most proteins (median CVs are less than 5%), so pretty small expression differences are going to be statistically significant. EdgeR, with the trended dispersion, tends to be more conservate, i.e. requires a larger diffference in means, than a basic t-test.

Compute averages, flag candidates, and save the pair-wise test results¶

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

Even though we do not have many candidates, we will still make the MA plots, scatter plots, and volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.

# get the results summary
results <- collect_results(tmt_tmm, tt$table, HCT, "HCT", MCF, "MCF")

# make column names unique by adding comparison (for the accumulated frame)
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_HCT_MCF")

# accumulate the testing results
all_results <- results_temp

Count candidates and look at fold-change distributions¶

# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("HCT116 vs MCF7 logFC distributions by candidate")

Main summary plots¶

We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.

MA plots¶

# make MA plots
MA_plots(results, "ave_HCT", "ave_MCF", "HCT116 versus MCF7")

Scatter plots¶

# make scatter plots
scatter_plots(results, "ave_HCT", "ave_MCF", "HCT116 versus MCF7")

Volcano plot¶

# make a volcano plot
volcano_plot(results, "ave_HCT", "ave_MCF", "HCT116 versus MCF7")

# count proteins with FC more than 2-fold
nrow(filter(results, abs(FC) > 2 & candidate != "no"))

Candidates look strong¶

There are many candidates with more than 2-fold expression changes. We do have many candidates with smaller fold-changes due to the low CVs.

Check some individual protein expression¶

# look at the top candidates
set_plot_dimensions(8, 3.5)
plot_top_tags(results, 3, 3, 10)
set_plot_dimensions(9, 9)

The top candidates are convincing¶

We have ridiculously small testing p-values, large fold changes, and minimal variation between cell line biological replicates.

(2) HCT116 versus HEK293¶

We will do the same testing to compare HCT116 cells versus HEK293 cells.

# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("HCT", "HEK"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.05))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.05)
abline(h = c(-1, 1), col = "black") # 2-fold change lines

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("HCT116 vs HEK293 p-value distribution")

       HEK-HCT
Down      3039
NotSig    1489
Up        2941

We have 80% of the proteins as candidates¶

We have a similar, large number of DE candidates in this comparison.

Compute averages, flag candidates, and save the pair-wise test results¶

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

# get the results summary
results <- collect_results(tmt_tmm, tt$table, HCT, "HCT", HEK, "HEK")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_HCT_HEK")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions¶

# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("HCT116 versus HEK293 logFC distributions by candidate")

MA plots¶

# make MA plots
MA_plots(results, "ave_HCT", "ave_HEK", "HCT116 vs HEK293")

Scatter plots¶

# make scatter plots
scatter_plots(results,  "ave_HCT", "ave_HEK", "HCT116 vs HEK293")

Volcano plot¶

# make a volcano plot
volcano_plot(results, "ave_HCT", "ave_HEK", "HCT116 vs HEK293")

# count proteins with FC more than 2-fold
nrow(filter(results, abs(FC) > 2 & candidate != "no"))

Many candidates have larger fold-changes¶

The expression pattern is similar to the HCT116 versus MCF7 cells. We might have a smaller number of proteins with larger fold-changes in this comparison of HCT116 cells to HEK293 cells.

Check some individual protein expression¶

# look at the top candidates
set_plot_dimensions(8, 3.5)
plot_top_tags(results, 3, 4, 10)
set_plot_dimensions(9, 9)

Individual candidates are solid¶

We have very small p-values, large fold changes, and low variance like the first comparison.

(3) MCF7 versus HEK293¶

Compare the MCF7 cells to the HEK293 cells.

# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("MCF", "HEK"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.05))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.05)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("MCF7 vs HEK293 p-value distribution")

       HEK-MCF
Down      3046
NotSig    1367
Up        3056

We have even more DE candidates¶

We have 82% of the proteins differentially expressed for this comparison.

Compute averages, flag candidates, and save the pair-wise test results¶

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

Even though we have few candidates, we will still make the MA plot, scatter plot, and volcano plot.

# get the results summary
results <- collect_results(tmt_tmm, tt$table, MCF, "MCF", HEK, "HEK")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_MCF_HEK")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions¶

# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("MCF7 vs HEK293 logFC distributions by candidate")

MA plots¶

# make MA plots
MA_plots(results, "ave_MCF", "ave_HEK", "MCF7 vs HEK293")

Scatter plots¶

# make scatter plots
scatter_plots(results, "ave_MCF", "ave_HEK", "MCF7 vs HEK293")

Volcano plot¶

# make a volcano plot
volcano_plot(results, "ave_MCF", "ave_HEK", "MCF7 vs HEK293")

# count proteins with FC more than 2-fold
nrow(filter(results, abs(FC) > 2 & candidate != "no"))

Candidate visualization patterns are similar to other two comparisons¶

We may have more down-regulated proteins in HEK293 in this comparison. We do see some differences in three volcano plots. We have good overall symmetry between the numbers of up and down regulated proteins in each comparison; however, the pattern of fold changes does differ a little in each comparison.

Check individual protein expression¶

# look at the top candidates
set_plot_dimensions(8, 3.5)
plot_top_tags(results, 3, 4, 10)
set_plot_dimensions(9, 9)

Individual plots for top candidates are pretty ideal¶

All three cell lines are consistent. There is excellent reproducibility for biological replicates of each cell line. The three cell lines are different from each other. The questions about how to best rank the proteins that are different between cell lines would apply here. Invoking some fold-change cutoff would be a likely starting point. The linear modeling options in edgeR or limma have functions for adjusting FDR values in combination with a fold-change cutoff. There is also the question of what is the biological question? The three lines could also be compared with ANOVA-like tests.

Summary¶

The data seem to be well behaved. We have clear differences in expression levels for a majority of the proteins. The low CVs that can be obtained from cell cultures in combination with the precision of TMT results in very large numbers of statistically significant differentially expressed proteins. The regular SPS MS3 data took twice as much time to acquire. The median CVs were a little smaller than the RTS acquisition. We have even more statistically significant candidates with this data than we had for the RTS data. The big picture aspects of the two datasets seem very similar. The post IRS method data here produced essentially identical statistical testing results as the data before IRS.

Save the `all_results` frame to TSV file¶

write.table(all_results, "Regular_IRS_results.txt", sep = "\t",
           row.names = FALSE, na =  " ")

Log the session information¶

sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] janitor_2.0.1   psych_1.9.12.31 edgeR_3.24.3    limma_3.38.3   
 [5] scales_1.1.0    forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5    
 [9] purrr_0.3.3     readr_1.3.1     tidyr_1.0.2     tibble_2.1.3   
[13] ggplot2_3.3.0   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4       locfit_1.5-9.1   lubridate_1.7.4  lattice_0.20-40 
 [5] assertthat_0.2.1 digest_0.6.25    IRdisplay_0.7.0  R6_2.4.1        
 [9] cellranger_1.1.0 repr_1.1.0       backports_1.1.5  reprex_0.3.0    
[13] evaluate_0.14    httr_1.4.1       pillar_1.4.3     rlang_0.4.5     
[17] uuid_0.1-4       readxl_1.3.1     rstudioapi_0.11  labeling_0.3    
[21] splines_3.5.3    munsell_0.5.0    broom_0.5.2      compiler_3.5.3  
[25] modelr_0.1.6     pkgconfig_2.0.3  base64enc_0.1-3  mnormt_1.5-6    
[29] htmltools_0.4.0  tidyselect_1.0.0 fansi_0.4.1      crayon_1.3.4    
[33] dbplyr_1.4.2     withr_2.2.0      grid_3.5.3       nlme_3.1-145    
[37] jsonlite_1.6.1   gtable_0.3.0     lifecycle_0.2.0  DBI_1.1.0       
[41] magrittr_1.5     cli_2.0.2        stringi_1.4.6    farver_2.0.3    
[45] fs_1.3.2         snakecase_0.11.0 xml2_1.2.5       generics_0.0.2  
[49] vctrs_0.2.4      IRkernel_1.1     tools_3.5.3      glue_1.3.2      
[53] hms_0.5.3        parallel_3.5.3   colorspace_1.4-1 rvest_0.3.5     
[57] pbdZMQ_0.3-3     haven_2.2.0

	genes	logFC	logCPM	PValue	FDR
	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
1753	sp\|P42765\|THIM_HUMAN	3.306551	7.514757	0.000000e+00	0.000000e+00
3061	sp\|Q9NQ66\|PLCB1_HUMAN	2.964718	5.877560	0.000000e+00	0.000000e+00
3409	sp\|O00461\|GOLI4_HUMAN	2.204673	6.394427	0.000000e+00	0.000000e+00
3022	sp\|Q5JSL3\|DOC11_HUMAN	2.849306	6.684627	3.488103e-321	6.513159e-318
2362	sp\|P52564\|MP2K6_HUMAN	2.634931	7.078625	1.668262e-319	2.477953e-316
3512	sp\|Q92604\|LGAT1_HUMAN	2.276543	6.439820	1.990590e-319	2.477953e-316
5112	sp\|Q9UN81\|LORF1_HUMAN	2.559243	5.353546	2.890877e-319	3.084566e-316
4064	sp\|Q8NBX0\|SCPDL_HUMAN	2.597036	5.753404	4.613881e-319	4.307635e-316
2495	sp\|P82970\|HMGN5_HUMAN	2.842783	5.396735	1.366225e-305	1.133815e-302
3070	sp\|P26374\|RAE2_HUMAN	2.918057	6.684810	1.546676e-305	1.155213e-302

	genes	logFC	logCPM	PValue	FDR
	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
7453	sp\|Q9H4W6\|COE3_HUMAN	6.007105	0.9384442	0	0
2920	sp\|Q9Y3E1\|HDGR3_HUMAN	5.430656	6.6790365	0	0
4831	sp\|Q96T17\|MA7D2_HUMAN	4.786728	5.1520302	0	0
2914	sp\|P54687\|BCAT1_HUMAN	4.762933	6.5354502	0	0
5803	sp\|Q8ND83\|SLAI1_HUMAN	4.627566	4.7462137	0	0
1982	sp\|Q9H6S3\|ES8L2_HUMAN	-4.435790	7.2600417	0	0
4726	sp\|P0CG30\|GSTT2_HUMAN	-4.194166	5.6432397	0	0
991	sp\|Q14764\|MVP_HUMAN	-4.165393	7.5717759	0	0
2600	sp\|Q9Y446\|PKP3_HUMAN	-3.777932	6.5426405	0	0
6260	sp\|P24385\|CCND1_HUMAN	-3.753490	4.4619386	0	0

	genes	logFC	logCPM	PValue	FDR
	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
7421	sp\|Q5K4L6\|S27A3_HUMAN	5.396298	1.077376	0	0
6209	sp\|Q12840\|KIF5A_HUMAN	3.512278	3.520472	0	0
3723	sp\|Q16527\|CSRP2_HUMAN	3.272690	5.854648	0	0
907	sp\|P00918\|CAH2_HUMAN	3.119721	7.722321	0	0
3415	sp\|P15104\|GLNA_HUMAN	3.000252	6.328925	0	0
2667	sp\|Q96F45\|ZN503_HUMAN	2.889411	6.837123	0	0
4023	sp\|Q9H5H4\|ZN768_HUMAN	2.785618	5.932306	0	0
2726	sp\|P56377\|AP1S2_HUMAN	2.527542	6.475997	0	0
2194	sp\|P0DN79\|CBSL_HUMAN (+1)	2.423614	6.955916	0	0
189	sp\|P00390\|GSHR_HUMAN	-2.412886	9.748002	0	0

candidate	n
<fct>	<int>
high	5755
med	347
low	198
no	1169

Schweppe et al. - Real Time Search TMT¶

Testing cell lines: Regular SPS MS3 before IRS¶

Phil Wilmarth, OHSU¶

May 27, 2020¶

Re-analysis of PXD017823 (human cell lines)¶

References¶

Load the necessary R libraries¶

Define common functions for notebook use¶

Load the IRS-normalized TMT intensity data¶

Get the experiment to compare cell lines within¶

Run TMM normalization and check final clustering¶

TMM factors are close to 1.0¶

EdgeR statistical testing starts here¶

Compute the shared variance trend¶

(1) HCT116 versus MCF7¶

We have many candidates!¶

Compute averages, flag candidates, and save the pair-wise test results¶

Count candidates and look at fold-change distributions¶

Main summary plots¶

MA plots¶

Scatter plots¶

Volcano plot¶

Candidates look strong¶

Check some individual protein expression¶

The top candidates are convincing¶

(2) HCT116 versus HEK293¶

We have 80% of the proteins as candidates¶

Compute averages, flag candidates, and save the pair-wise test results¶

Count candidates and look at fold-change distributions¶

MA plots¶

Scatter plots¶

Volcano plot¶

Many candidates have larger fold-changes¶

Check some individual protein expression¶

Individual candidates are solid¶

(3) MCF7 versus HEK293¶

We have even more DE candidates¶

Compute averages, flag candidates, and save the pair-wise test results¶

Count candidates and look at fold-change distributions¶

MA plots¶

Scatter plots¶

Volcano plot¶

Candidate visualization patterns are similar to other two comparisons¶

Check individual protein expression¶

Individual plots for top candidates are pretty ideal¶

Summary¶

Save the all_results frame to TSV file¶

Log the session information¶

Save the `all_results` frame to TSV file¶