Schweppe et al. - Real Time Search TMT

Testing cell lines: Regular SPS MS3 before IRS

Phil Wilmarth, OHSU

May 27, 2020

Re-analysis of PXD017823 (human cell lines)

The PRIDE data was just the human cell line data presented in Figure 4 from the JPR publication. There are HTC116 cells (first 3 channels), MCF7 cells (the next 3 channels), and HEK293 cells (the last 4 channels) labeled with 10-plex TMT. A 180-minute RP LC method for a 12-fraction separation (high pH RP) using the regular SPS MS3 acquisition was compared to running the 12-fractions with a 90-minute SPS MS3 method using real time search (RTS) and protein close-out. Protein close-out is not described in the methods or in citation 4 (only described in a figure caption) but is (maybe) restricting MS3 scans after 4 successful scans have been acquired for a particular protein.

The data from PRIDE was downloaded and processed with my PAW pipeline:

  • Use MSConvert to make GZipped text files with MS2 and MS3 scans
  • Run wide tolerance Comet searches (simple configuration)
  • Compile top-hit PSM summaries (discriminant scores) with reporter ion peak heights
  • Accurate mass conditioned target/decoy score histograms to control 1% PSM FDR
  • Basic and extended parsimonious protein inference
  • Summed reporter ion totals computed for peptides and proteins
  • Internal Reference Scaling (IRS) normalization

There are multiple comparisons that can be made. We compared the two acquisition methods that were done in two separate TMT experiments with internal reference scaling (IRS) in our first notebook. We will continue with some statistical testing of the differences between the cell lines.

While IRS did seem to work, the real time search data with protein close-out was quite different. We started with the SLNormalized data from each experiment for the initial testing work. We will see if the data after IRS seems any different. We will take the data from the IRS normalization script table. That takes care of finding the intersection of the protein IDs. We will have the same starting set of proteins from each experiment for testing. We will work with the Regular SPS MS3 experiment after IRS and do our usual edgeR testing with notebooks.

References

Devin K. Schweppe, Jimmy K. Eng, Qing Yu, Derek Bailey, Ramin Rad, Jose Navarrete-Perea, Edward L. Huttlin, Brian K. Erickson, Joao A. Paulo, and Steven P. Gygi, 2020. Full-Featured, Real-Time Database Searching Platform Enables Fast and Accurate Multiplexed Quantitative Proteomics. Journal of proteome research, 19(3), pp.2026−2034.

Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.

Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.

Plubell, D.L., Wilmarth, P.A., Zhao, Y., Fenton, A.M., Minnier, J., Reddy, A.P., Klimek, J., Yang, X., David, L.L. and Pamir, N., 2017. Extended multiplexing of tandem mass tags (TMT) labeling reveals age and high fat diet specific proteome changes in mouse epididymal adipose tissue. Molecular & Cellular Proteomics, 16(5), pp.873-890.

Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.

www.jupyter.org


Load the necessary R libraries

In [1]:
# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
library(janitor)

# set global plot size (default is 7x7 inch)
options(repr.plot.width=9, repr.plot.height=9)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

 ggplot2 3.3.0      purrr   0.3.3
 tibble  2.1.3      dplyr   0.8.5
 tidyr   1.0.2      stringr 1.4.0
 readr   1.3.1      forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()


Attaching package: ‘scales’


The following object is masked from ‘package:purrr’:

    discard


The following object is masked from ‘package:readr’:

    col_factor



Attaching package: ‘psych’


The following objects are masked from ‘package:scales’:

    alpha, rescale


The following objects are masked from ‘package:ggplot2’:

    %+%, alpha



Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test


Define common functions for notebook use

In [2]:
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# =============== p-value plots ================================================
pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + ggtitle("p-values expanded")
    grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}

# ============= log2 fold-change distributions =================================
log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    identifier[,3]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top ipregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1),
                       ", ", row$candidate)
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

Load the IRS-normalized TMT intensity data

The pandas Python script that does the IRS normalization arranges the tab-delimited table so that importing into R is relatively straightforward. Note than this experimental design has only a single reference channel per plex instead of two.

We need to drop contaminant and decoy proteins, and proteins with missing sets of reporter ions. We extract the accessions column, the SL and IRS normed data columns, and collect channels by biological condition. We also grab the reference channels to check the IRS procedure.

In [3]:
# load the IRS-normalized data and check the table
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_9_IRS_normalized.txt", guess_max = 8245)

# "Filter" column flags contams and decoys
# "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the table from pandas is sorted so the rows we want come first
data_all <- filter(data_import, is.na(Filter), is.na(Missing))

# get the pooled channels
data_pools <- data_all %>% select(., contains("Pool")) %>%
  select(., contains("Norm_"))

# clean up column names 
data_all <- data_all %>% rename_all(funs(str_replace(., "Norm_", "_"))) %>% 
  rename_all(funs(str_replace(., "RTS_1", "RTS"))) %>% 
  rename_all(funs(str_replace(., "Reg_2", "Reg"))) 

# get the data before and after IRS
data_sl <- select(data_all, contains("SL_"), -contains("Pool"))
data_irs <- select(data_all, contains("IRS_"), -contains("Pool"))

# save gene names for edgeR so we can double check that results line up
accessions <- data_all$Accession

# see how many rows of data we have
length(accessions)
Parsed with column specification:
cols(
  .default = col_double(),
  Accession = col_character(),
  Identical = col_character(),
  Similar = col_character(),
  OtherLoci = col_character(),
  Filter = col_character(),
  Missing = col_character(),
  Description = col_character()
)

See spec(...) for full column specifications.

Warning message:
“funs() is soft deprecated as of dplyr 0.8.0
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once per session.
7469

Get the experiment to compare cell lines within

We will get the data from the regular SPS MS3 experiment after IRS.

In [4]:
# define the acquisition mode of interest
acq_mode  <- "Reg"

# define data (SL or IRS) and define the cell line indices
df <- select(data_irs, contains(acq_mode))
HCT <- 1:3
MCF <- 4:6
HEK <- 7:10

# set some colors by cell line
colors = c(rep('red', length(HCT)), rep('blue', length(MCF)), rep('green', length(HEK)))

Run TMM normalization and check final clustering

We will the data into edgeR data structures and call the calcNormFactors function to perform library size and the trimmed mean of M-values (TMM) normalization. We will double check if the TMM normalization changed the clustering that we had above.

Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.

We need to use the edgeR normalization factors to produce the TMM normalized data that the statistical testing will be working with. EdgeR uses the normalization factors in its statistical modeling but does not output the normalized intensities. We compute the normalized intensities with the apply_tmm_factors function.

In [5]:
# get the biological sample data into a DGEList object
groups = c(rep("HCT", length(HCT)), rep("MCF", length(MCF)), 
          rep("HEK", length(HEK)))
y <- DGEList(counts = df, group = groups, genes = accessions)

# run TMM normalization (also includes a library size factor)
y <- calcNormFactors(y)

# compute the actual normalized intensities
tmt_tmm <- apply_tmm_factors(y, color = colors)

# check the clustering after TMM
plotMDS(y, col = colors, main = "all samples after TMM")
Library size factors:
 IRS_HCT116_1_Reg -> 1.011919
 IRS_HCT116_2_Reg -> 1.011008
 IRS_HCT116_3_Reg -> 1.012895
 IRS_MCF_1_Reg -> 1.019526
 IRS_MCF_2_Reg -> 1.017015
 IRS_MCF_3_Reg -> 1.016398
 IRS_HEK293_1_Reg -> 0.980158
 IRS_HEK293_2_Reg -> 0.979107
 IRS_HEK293_3_Reg -> 0.980158
 IRS_HEK293_4_Reg -> 0.975052

Trimmed mean of M-values (TMM) factors:
 IRS_HCT116_1_Reg -> 1.032526
 IRS_HCT116_2_Reg -> 1.028441
 IRS_HCT116_3_Reg -> 1.024365
 IRS_MCF_1_Reg -> 1.054369
 IRS_MCF_2_Reg -> 1.034909
 IRS_MCF_3_Reg -> 1.042825
 IRS_HEK293_1_Reg -> 0.951687
 IRS_HEK293_2_Reg -> 0.958510
 IRS_HEK293_3_Reg -> 0.946396
 IRS_HEK293_4_Reg -> 0.935825

Combined (lib size and TMM) normalization factors:
 IRS_HCT116_1_Reg -> 1.044832
 IRS_HCT116_2_Reg -> 1.039763
 IRS_HCT116_3_Reg -> 1.037574
 IRS_MCF_1_Reg -> 1.074956
 IRS_MCF_2_Reg -> 1.052518
 IRS_MCF_3_Reg -> 1.059926
 IRS_HEK293_1_Reg -> 0.932803
 IRS_HEK293_2_Reg -> 0.938484
 IRS_HEK293_3_Reg -> 0.927618
 IRS_HEK293_4_Reg -> 0.912479

TMM factors are close to 1.0

The IRS method already takes care of the library size type of normalizations. All of the library adjustment facters were very small. We could still have compositional differences between sample groups that the TMM factors would correct for. We do see some consistent differences by cell line for the TMM factors.

The box plots are very similar. TMM usually gets the medians (box centers) aligned pretty well.

EdgeR statistical testing starts here

Compute the shared variance trend

One of the most powerful features of edgeR (and limma) is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 10 samples across all conditions to use to improve the variance estimates and reduce false positive differential expression (DE) candidates. We have an edgeR estimateDisp function that does all of this and a visualization function to check the result.

We loaded the IRS data into DGEList object y a few cells above and did the normalization step. We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test will take two conditions and compare them using the normalization factors and dispersion estimates saved in y.

In [6]:
# compute dispersions and plot BCV
y <- estimateDisp(y)
plotBCV(y, main = "BCV plot of IRS/TMM normalized data")
Design matrix not provided. Switch to the classic mode.

(1) HCT116 versus MCF7

We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The decideTestsDGE call will tell us how many up and down regulated candidates we have at an FDR of 0.05. The topTags call does the Benjamini-Hochberg multiple testing corrections. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates, and then check the p-value distribution.

In [7]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("HCT", "MCF"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.05))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.05)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("HCT116 versus MCF7 p-value distribution")
       MCF-HCT
Down      2978
NotSig    1569
Up        2922
A data.frame: 10 × 5
geneslogFClogCPMPValueFDR
<chr><dbl><dbl><dbl><dbl>
1753sp|P42765|THIM_HUMAN 3.3065517.514757 0.000000e+00 0.000000e+00
3061sp|Q9NQ66|PLCB1_HUMAN2.9647185.877560 0.000000e+00 0.000000e+00
3409sp|O00461|GOLI4_HUMAN2.2046736.394427 0.000000e+00 0.000000e+00
3022sp|Q5JSL3|DOC11_HUMAN2.8493066.6846273.488103e-3216.513159e-318
2362sp|P52564|MP2K6_HUMAN2.6349317.0786251.668262e-3192.477953e-316
3512sp|Q92604|LGAT1_HUMAN2.2765436.4398201.990590e-3192.477953e-316
5112sp|Q9UN81|LORF1_HUMAN2.5592435.3535462.890877e-3193.084566e-316
4064sp|Q8NBX0|SCPDL_HUMAN2.5970365.7534044.613881e-3194.307635e-316
2495sp|P82970|HMGN5_HUMAN2.8427835.3967351.366225e-3051.133815e-302
3070sp|P26374|RAE2_HUMAN 2.9180576.6848101.546676e-3051.155213e-302

We have many candidates!

We have 79% of the proteins being differentially expressed. There are likely real differences in these cell lines. We also have very low CVs for most proteins (median CVs are less than 5%), so pretty small expression differences are going to be statistically significant. EdgeR, with the trended dispersion, tends to be more conservate, i.e. requires a larger diffference in means, than a basic t-test.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

Even though we do not have many candidates, we will still make the MA plots, scatter plots, and volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.

In [8]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, HCT, "HCT", MCF, "MCF")

# make column names unique by adding comparison (for the accumulated frame)
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_HCT_MCF")

# accumulate the testing results
all_results <- results_temp

Count candidates and look at fold-change distributions

In [9]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("HCT116 vs MCF7 logFC distributions by candidate")
A tibble: 4 × 2
candidaten
<fct><int>
high5459
med 441
low 229
no 1340

Main summary plots

We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.

MA plots

In [10]:
# make MA plots
MA_plots(results, "ave_HCT", "ave_MCF", "HCT116 versus MCF7")

Scatter plots

In [11]:
# make scatter plots
scatter_plots(results, "ave_HCT", "ave_MCF", "HCT116 versus MCF7")

Volcano plot

In [12]:
# make a volcano plot
volcano_plot(results, "ave_HCT", "ave_MCF", "HCT116 versus MCF7")
In [13]:
# count proteins with FC more than 2-fold
nrow(filter(results, abs(FC) > 2 & candidate != "no"))
1293

Candidates look strong

There are many candidates with more than 2-fold expression changes. We do have many candidates with smaller fold-changes due to the low CVs.

Check some individual protein expression

In [14]:
# look at the top candidates
set_plot_dimensions(8, 3.5)
plot_top_tags(results, 3, 3, 10)
set_plot_dimensions(9, 9)

The top candidates are convincing

We have ridiculously small testing p-values, large fold changes, and minimal variation between cell line biological replicates.


(2) HCT116 versus HEK293

We will do the same testing to compare HCT116 cells versus HEK293 cells.

In [15]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("HCT", "HEK"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.05))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.05)
abline(h = c(-1, 1), col = "black") # 2-fold change lines

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("HCT116 vs HEK293 p-value distribution")
       HEK-HCT
Down      3039
NotSig    1489
Up        2941
A data.frame: 10 × 5
geneslogFClogCPMPValueFDR
<chr><dbl><dbl><dbl><dbl>
7453sp|Q9H4W6|COE3_HUMAN 6.0071050.938444200
2920sp|Q9Y3E1|HDGR3_HUMAN 5.4306566.679036500
4831sp|Q96T17|MA7D2_HUMAN 4.7867285.152030200
2914sp|P54687|BCAT1_HUMAN 4.7629336.535450200
5803sp|Q8ND83|SLAI1_HUMAN 4.6275664.746213700
1982sp|Q9H6S3|ES8L2_HUMAN-4.4357907.260041700
4726sp|P0CG30|GSTT2_HUMAN-4.1941665.643239700
991sp|Q14764|MVP_HUMAN -4.1653937.571775900
2600sp|Q9Y446|PKP3_HUMAN -3.7779326.542640500
6260sp|P24385|CCND1_HUMAN-3.7534904.461938600

We have 80% of the proteins as candidates

We have a similar, large number of DE candidates in this comparison.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

In [16]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, HCT, "HCT", HEK, "HEK")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_HCT_HEK")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions

In [17]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("HCT116 versus HEK293 logFC distributions by candidate")
A tibble: 4 × 2
candidaten
<fct><int>
high5561
med 419
low 195
no 1294

MA plots

In [18]:
# make MA plots
MA_plots(results, "ave_HCT", "ave_HEK", "HCT116 vs HEK293")

Scatter plots

In [19]:
# make scatter plots
scatter_plots(results,  "ave_HCT", "ave_HEK", "HCT116 vs HEK293")

Volcano plot

In [20]:
# make a volcano plot
volcano_plot(results, "ave_HCT", "ave_HEK", "HCT116 vs HEK293")
In [21]:
# count proteins with FC more than 2-fold
nrow(filter(results, abs(FC) > 2 & candidate != "no"))
1341

Many candidates have larger fold-changes

The expression pattern is similar to the HCT116 versus MCF7 cells. We might have a smaller number of proteins with larger fold-changes in this comparison of HCT116 cells to HEK293 cells.

Check some individual protein expression

In [22]:
# look at the top candidates
set_plot_dimensions(8, 3.5)
plot_top_tags(results, 3, 4, 10)
set_plot_dimensions(9, 9)

Individual candidates are solid

We have very small p-values, large fold changes, and low variance like the first comparison.


(3) MCF7 versus HEK293

Compare the MCF7 cells to the HEK293 cells.

In [23]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("MCF", "HEK"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.05))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.05)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("MCF7 vs HEK293 p-value distribution")
       HEK-MCF
Down      3046
NotSig    1367
Up        3056
A data.frame: 10 × 5
geneslogFClogCPMPValueFDR
<chr><dbl><dbl><dbl><dbl>
7421sp|Q5K4L6|S27A3_HUMAN 5.3962981.07737600
6209sp|Q12840|KIF5A_HUMAN 3.5122783.52047200
3723sp|Q16527|CSRP2_HUMAN 3.2726905.85464800
907sp|P00918|CAH2_HUMAN 3.1197217.72232100
3415sp|P15104|GLNA_HUMAN 3.0002526.32892500
2667sp|Q96F45|ZN503_HUMAN 2.8894116.83712300
4023sp|Q9H5H4|ZN768_HUMAN 2.7856185.93230600
2726sp|P56377|AP1S2_HUMAN 2.5275426.47599700
2194sp|P0DN79|CBSL_HUMAN (+1) 2.4236146.95591600
189sp|P00390|GSHR_HUMAN -2.4128869.74800200

We have even more DE candidates

We have 82% of the proteins differentially expressed for this comparison.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

Even though we have few candidates, we will still make the MA plot, scatter plot, and volcano plot.

In [24]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, MCF, "MCF", HEK, "HEK")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_MCF_HEK")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions

In [25]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("MCF7 vs HEK293 logFC distributions by candidate")
A tibble: 4 × 2
candidaten
<fct><int>
high5755
med 347
low 198
no 1169

MA plots

In [26]:
# make MA plots
MA_plots(results, "ave_MCF", "ave_HEK", "MCF7 vs HEK293")

Scatter plots

In [27]:
# make scatter plots
scatter_plots(results, "ave_MCF", "ave_HEK", "MCF7 vs HEK293")

Volcano plot

In [28]:
# make a volcano plot
volcano_plot(results, "ave_MCF", "ave_HEK", "MCF7 vs HEK293")
In [29]:
# count proteins with FC more than 2-fold
nrow(filter(results, abs(FC) > 2 & candidate != "no"))
1461

Candidate visualization patterns are similar to other two comparisons

We may have more down-regulated proteins in HEK293 in this comparison. We do see some differences in three volcano plots. We have good overall symmetry between the numbers of up and down regulated proteins in each comparison; however, the pattern of fold changes does differ a little in each comparison.

Check individual protein expression

In [30]:
# look at the top candidates
set_plot_dimensions(8, 3.5)
plot_top_tags(results, 3, 4, 10)
set_plot_dimensions(9, 9)

Individual plots for top candidates are pretty ideal

All three cell lines are consistent. There is excellent reproducibility for biological replicates of each cell line. The three cell lines are different from each other. The questions about how to best rank the proteins that are different between cell lines would apply here. Invoking some fold-change cutoff would be a likely starting point. The linear modeling options in edgeR or limma have functions for adjusting FDR values in combination with a fold-change cutoff. There is also the question of what is the biological question? The three lines could also be compared with ANOVA-like tests.


Summary

The data seem to be well behaved. We have clear differences in expression levels for a majority of the proteins. The low CVs that can be obtained from cell cultures in combination with the precision of TMT results in very large numbers of statistically significant differentially expressed proteins. The regular SPS MS3 data took twice as much time to acquire. The median CVs were a little smaller than the RTS acquisition. We have even more statistically significant candidates with this data than we had for the RTS data. The big picture aspects of the two datasets seem very similar. The post IRS method data here produced essentially identical statistical testing results as the data before IRS.


Save the all_results frame to TSV file

In [31]:
write.table(all_results, "Regular_IRS_results.txt", sep = "\t",
           row.names = FALSE, na =  " ")

Log the session information

In [32]:
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] janitor_2.0.1   psych_1.9.12.31 edgeR_3.24.3    limma_3.38.3   
 [5] scales_1.1.0    forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5    
 [9] purrr_0.3.3     readr_1.3.1     tidyr_1.0.2     tibble_2.1.3   
[13] ggplot2_3.3.0   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4       locfit_1.5-9.1   lubridate_1.7.4  lattice_0.20-40 
 [5] assertthat_0.2.1 digest_0.6.25    IRdisplay_0.7.0  R6_2.4.1        
 [9] cellranger_1.1.0 repr_1.1.0       backports_1.1.5  reprex_0.3.0    
[13] evaluate_0.14    httr_1.4.1       pillar_1.4.3     rlang_0.4.5     
[17] uuid_0.1-4       readxl_1.3.1     rstudioapi_0.11  labeling_0.3    
[21] splines_3.5.3    munsell_0.5.0    broom_0.5.2      compiler_3.5.3  
[25] modelr_0.1.6     pkgconfig_2.0.3  base64enc_0.1-3  mnormt_1.5-6    
[29] htmltools_0.4.0  tidyselect_1.0.0 fansi_0.4.1      crayon_1.3.4    
[33] dbplyr_1.4.2     withr_2.2.0      grid_3.5.3       nlme_3.1-145    
[37] jsonlite_1.6.1   gtable_0.3.0     lifecycle_0.2.0  DBI_1.1.0       
[41] magrittr_1.5     cli_2.0.2        stringi_1.4.6    farver_2.0.3    
[45] fs_1.3.2         snakecase_0.11.0 xml2_1.2.5       generics_0.0.2  
[49] vctrs_0.2.4      IRkernel_1.1     tools_3.5.3      glue_1.3.2      
[53] hms_0.5.3        parallel_3.5.3   colorspace_1.4-1 rvest_0.3.5     
[57] pbdZMQ_0.3-3     haven_2.2.0     
In [ ]: