Re-analysis of PXD004163 data

Proteomics of lung cancer U1810 cells treated with miRNAs

Quantitative summary file from Mascot Daemon

Notebook prepared by Phil Wilmarth, PSR Core, OHSU

June 24, 2020



Overview

This notebook reads the quantitative summary file created by the Mascot Daemon and shows some alternative visualizations, normalization, and differential expression testing. The data was used in a recent blog where the new quantitative summary functionality was explored. I also did a full re-analysis with my pipeline and have a PAW notebook for the IPG 3.7 to 4.9 data to compare to this notebook. I suggest opening both notebook files in side-by-side browser windows. The notebooks are very similar, and you can see the underlying data similarities and differences as you scroll down the notebooks.

The data is from the PXD004163 PRIDE archive. The U1810 cell line is a non-small cell lung cancer line. The cells were treated with several different microRNAs: a non-targeting control (n=3), a miR-372-3p mimic (n=3), a miR-191-5p mimic (n=2), or a miR-519c-3p mimic (n=2). Extracted proteins were digested with trypsin using a FASP protocol. The samples were labeled with tandem mass tag (TMT) 10-plex reagents and separated by large-scale isoelectric point (IPG) fractionation. One separation covered pH 3-10 and the other pH 3.7-4.9. Each separation produced 72 fractions that were run with a C18 reverse-phase column, electrospray ionization, and DDA mass spectrometry on a Thermo Q Exactive instrument. MS1 scans had a resolution of 70K and the top-5 MS2 scans had a resolution of 35K (which might be a little low for fully resolving the N- and C-forms of the TMT reporter ions). Other instrument settings seemed okay for a Q Exactive doing TMT.

Sample Key:

Channel   Sample Key    Simplified Key
126C      siCtrl A      A_Ctrl
127N      miR 191 B     B_191
127C      miR 372 A     A_372
128N      miR 519c A    A_519c
128C      siCtrl B      B_Ctrl
129N      miR 372 B     B_372
129C      miR 519c B    B_519c
130N      siCtrl C      C_Ctrl
130C      miR 191 A     A_191
131N      miR 372 C     C_372

The data is described in this Oncogene paper, and more details on the samples are in the PRIDE description. The Matrix Science blog has a link to download the quantitative summary tab-delimited file. That PXD004163.txt file for the IPG 3-10 range data was the starting point for this analysis. The summary has protein-aggregated reporter ion quantities: each row is a protein or protein group, and the columns at the right of the table are the quantities for the individual channels. There is also a column to flag possible contaminant proteins. Some additional human keratins should probably be added to the contaminants list.

The column totals from the PAW analysis averaged 12,099,627,853, while the column total average in the Mascot quantitative summary was 67,842,137, about 178 times smaller. The protein-level summary values from Mascot are reporter ion peak areas, whereas the PAW processing uses reporter ion peak heights. Like most TMT datasets, there are not many missing data values, and (like all data) the missingness is protein abundance dependent: the lowest abundance proteins have the missing data. I did some processing of the PXD004163.txt file in Excel to exclude contaminants and rank the proteins by decreasing average reporter ion value. A value of 5.0 looked like a good lower abundance cutoff. That left a handful of proteins with a few zero values; zeros were replaced with 0.5 to avoid mathematical errors in the notebook analysis.
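For anyone who prefers to avoid Excel, the same prep could be scripted in R. This is only a sketch: the contaminant flag and channel column names below are assumptions based on the description above, not the exact headers in the PXD004163.txt file.

# sketch of the Excel prep in R (column names are assumed, not verbatim from the file)
raw <- read.delim("PXD004163.txt", stringsAsFactors = FALSE)

tmt_cols <- c("siCtrl_A", "siCtrl_B", "siCtrl_C",
              "miR_372_A", "miR_372_B", "miR_372_C",
              "miR_191_B", "miR_191_A",
              "miR_519c_A", "miR_519c_B")

raw <- raw[raw$Contaminant %in% c("", NA), ]      # keep rows not flagged as contaminants (flag column name assumed)
raw$ave_signal <- rowMeans(raw[tmt_cols])         # average reporter ion value per protein
raw <- raw[order(-raw$ave_signal), ]              # rank by decreasing average signal
raw <- raw[raw$ave_signal >= 5.0, ]               # lower abundance cutoff
raw[tmt_cols][raw[tmt_cols] == 0] <- 0.5          # replace remaining zeros with 0.5

write.table(raw[c("Accession", tmt_cols)], "edgeR_input.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)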

Normalization and statistical testing were done in edgeR using a Jupyter notebook, similar to the analysis used for the PAW pipeline data.


Load the necessary R libraries

In [1]:
# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.0     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.5
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘scales’


The following object is masked from ‘package:purrr’:

    discard


The following object is masked from ‘package:readr’:

    col_factor



Attaching package: ‘psych’


The following objects are masked from ‘package:scales’:

    alpha, rescale


The following objects are masked from ‘package:ggplot2’:

    %+%, alpha


Define common functions for notebook use

In [2]:
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# =============== p-value plots ================================================
pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + ggtitle("p-values expanded")
    grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}

# ============= log2 fold-change distributions =================================
log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== MA plots by species using ggplot ==================================
MA_species <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = species, shape = species)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = species, shape = species)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ species) +
        ggtitle(str_c(title, " (separated by species)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Scatter plots by species using ggplot =============================
scatter_species <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = species, shape = species)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = species, shape = species)) +
        scatter_lines +
        facet_wrap(~ species) +
        ggtitle(str_c(title, " (separated by species)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accession
get_identifier <- function(accession) {
    identifier <- str_split(accession, "::", simplify = TRUE)
    identifier[,2]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top upregulated candidates
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1),
                       ", ", row$candidate)
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

# ============== CV function ===================================================
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

Load the TMT intensity data

The data was gently prepped in Excel to remove contaminants, decoys, and proteins without any reporter ion intensities. We set a lower abundance cutoff (average reporter signal of 5.0) to exclude a few of the lowest abundance proteins that had excessive missing data. We replaced any remaining zero values with 0.5. We have an accessions column and the 10 TMT channels.

In [3]:
# load the prepped TMT data and check the table
data_all <- read_tsv("edgeR_input.txt")

# save gene names for edgeR so we can double check that results line up
accessions <- data_all$Accession

# get just the TMT columns
data_tmt <- select(data_all, -Accession)

# see how many rows of data we have
length(accessions)
Parsed with column specification:
cols(
  Accession = col_character(),
  siCtrl_A = col_double(),
  siCtrl_B = col_double(),
  siCtrl_C = col_double(),
  miR_372_A = col_double(),
  miR_372_B = col_double(),
  miR_372_C = col_double(),
  miR_191_B = col_double(),
  miR_191_A = col_double(),
  miR_519c_A = col_double(),
  miR_519c_B = col_double()
)

7928

Use data from all treatments for edgeR analysis

We are defining the groups that will be compared explicitly and using all of the samples for variance estimates. We will define some column indexes for each condition and set some colors for plotting; a quick clustering check on the raw data is sketched after the next cell.

In [4]:
# define the groups
C <- 1:3
x372 <- 4:6
y191 <- 7:8
z519c <- 9:10

# set some colors by condition
colors = c(rep('red', length(C)), rep('blue', length(x372)), 
           rep('green', length(y191)), rep('black', length(z519c)))
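
Before any normalization, a quick look at how the raw channels behave could use box plots and limma's plotMDS function, which accepts a matrix of log-expression values. This is a sketch using the objects defined above:

# quick checks on the un-normalized intensities (sketch)
raw_log <- log2(as.matrix(data_tmt))
boxplot(raw_log, col = colors, notch = TRUE, main = "Raw data (log2)", las = 2)
plotMDS(raw_log, col = colors, main = "all samples before normalization")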

Run TMM normalization and check final clustering

We will load the data into edgeR data structures and call the calcNormFactors function to perform the library size and trimmed mean of M-values (TMM) normalizations. We will double check whether the TMM normalization changes the sample clustering.

We need to use the edgeR normalization factors to produce the TMM normalized data that the statistical testing will be working with. EdgeR uses the normalization factors in its statistical modeling but does not output the normalized intensities. We compute the normalized intensities with the apply_tmm_factors function.

In [5]:
# get the biological sample data into a DGEList object
group = c(rep("C", length(C)), rep("x372", length(x372)), 
          rep("y191", length(y191)), rep("z519c", length(z519c)))
y <- DGEList(counts = data_tmt, group = group, genes = accessions)

# run TMM normalization (also includes a library size factor)
y <- calcNormFactors(y)

tmt_tmm <- apply_tmm_factors(y, color = colors)

# check the clustering
plotMDS(y, col = colors, main = "all samples after TMM")
Library size factors:
 siCtrl_A -> 1.246655
 siCtrl_B -> 1.181233
 siCtrl_C -> 1.101838
 miR_372_A -> 0.995438
 miR_372_B -> 0.920755
 miR_372_C -> 0.669505
 miR_191_B -> 1.483958
 miR_191_A -> 0.800750
 miR_519c_A -> 1.064944
 miR_519c_B -> 1.002309

Trimmed mean of M-values (TMM) factors:
 siCtrl_A -> 1.005848
 siCtrl_B -> 1.004439
 siCtrl_C -> 0.971583
 miR_372_A -> 0.999375
 miR_372_B -> 1.015821
 miR_372_C -> 0.997089
 miR_191_B -> 0.999022
 miR_191_A -> 0.998031
 miR_519c_A -> 1.007892
 miR_519c_B -> 1.001501

Combined (lib size and TMM) normalization factors:
 siCtrl_A -> 1.253946
 siCtrl_B -> 1.186476
 siCtrl_C -> 1.070527
 miR_372_A -> 0.994816
 miR_372_B -> 0.935322
 miR_372_C -> 0.667555
 miR_191_B -> 1.482507
 miR_191_A -> 0.799173
 miR_519c_A -> 1.073348
 miR_519c_B -> 1.003813

TMM factors are very close to 1.0

We could have compositional differences between sample groups that the TMM factors would correct for. We have a couple of channels that need some modest library size adjustments. We do not have any TMM factors that are much different from 1.0, so there are no compositional differences of note. The box plots are very similar in size and well aligned. TMM usually gets the medians (box centers) aligned well.
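
A quick numeric check of that alignment is to compare the column medians before and after the normalization (a small sketch using the data frames already in memory):

# compare column medians before and after TMM normalization
round(apply(data_tmt, 2, median), 2)
round(apply(tmt_tmm, 2, median), 2)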

Check coefficients of variation (CV) by condition

The CVs are a very useful metric in these large-scale experiments. Cell cultures often have lower CVs than tissue samples, and CVs also differ between TMT acquisition methods (MS2 versus SPS MS3).

In [6]:
# put CVs in data frames to simplify plots and summaries
cv_tmm <- data.frame(C = CV(tmt_tmm[C]), x372 = CV(tmt_tmm[x372]), 
                     y191 = CV(tmt_tmm[y191]), z519c = CV(tmt_tmm[z519c]))

# see what the median CV values are
medians <- apply(cv_tmm, 2, FUN = median)
print("Final median CVs by condition (%)")
round(medians, 2)

# also look at CV distributions
boxplot(cv_tmm, col = c("red", "blue", "green", "black"), notch = TRUE, 
        main = "TMM Normalized data")
[1] "Final median CVs by condition (%)"
    C  x372  y191 z519c 
 7.04  7.86  4.45  4.60 

Median CVs are low

Condition      This data   PAW processing
Control            7.0%        6.1%
miR-372-3p         7.9%        6.9%
miR-191-5p         4.5%        4.2%
miR-519c-3p        4.6%        3.8%

CVs were smaller from the PAW processing.

EdgeR statistical testing starts here

Compute the shared variance trend

One of the most powerful features of edgeR (and limma) is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 10 samples across all conditions that can be used to improve the variance estimates and reduce false positive differential expression (DE) candidates. The edgeR estimateDisp function does all of this, and a plotting function lets us check the result.

We loaded the TMT data into the DGEList object y a few cells above and did the normalization step. We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test will take two conditions and compare them using the normalization factors and dispersion estimates saved in y.

In [7]:
# compute dispersions and plot BCV
y <- estimateDisp(y)
plotBCV(y, main = "BCV plot of IRS/TMM normalized data")
Design matrix not provided. Switch to the classic mode.

Trended dispersion looks more variable

The trended dispersion plot from the PAW processing looks "tighter".
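
The dispersion estimates behind the plot can also be checked numerically; the BCV axis is just the square root of the dispersion. This is a quick sketch using the DGEList object:

# common dispersion and its biological CV (square root of the dispersion)
y$common.dispersion
sqrt(y$common.dispersion)

# spread of the gene-wise (tagwise) BCV values
summary(sqrt(y$tagwise.dispersion))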

(1) Control versus 372 (3x3)

We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The decideTestsDGE call will tell us how many up and down regulated candidates we have at an FDR of 0.10. The topTags call does the Benjamini-Hochberg multiple testing corrections. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates, and then check the p-value distribution.

In [8]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("C", "x372"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("Control versus 372 p-value distribution")
       x372-C
Down      297
NotSig   7395
Up        236
A data.frame: 10 × 5
genes               logFC        logCPM      PValue        FDR
<chr>               <dbl>        <dbl>       <dbl>         <dbl>
49992::C9J3D7      -1.1912905    3.3584979   6.160876e-34  4.884342e-30
17952::Q53EL6      -0.9463603    6.4189451   1.126453e-26  4.465261e-23
38132::Q8NB46      -0.8075908    4.4652919   3.095826e-26  8.181237e-23
26832::Q9BQS8      -0.6992817    5.4973262   1.502161e-21  2.977284e-18
9762::P39880       -0.8113018    7.5213548   3.590613e-21  4.941458e-18
13052::Q13948-2    -0.8111638    7.0369063   3.739752e-21  4.941458e-18
70752::M0QXZ8      -1.2866791    0.5079074   3.125062e-19  3.539356e-16
21362::Q9C0E8      -0.6792595    6.0181508   4.798814e-19  4.755624e-16
9322::P29317       -0.7409281    7.6004999   2.934097e-18  2.584614e-15
62662::O95365      -0.8713318    1.9107286   3.907359e-18  3.097754e-15

We have over 500 DE candidates with small fold-changes

We have 533 candidates; however, the fold-changes are mostly much less than 2-fold. The top tags have quite small p-values which must be due to small variances. The p-value distribution looks typical when we have true DE candidates - a low p-value spike and a flat background distribution.

We had 785 candidates in the PAW processing. The non-changing proteins (the black points in the MD plot) seemed tighter as well.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

We will make MA plots, scatter plots, and a volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.

In [9]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, C, "C", x372, "x372")

# make column names unique by adding comparison (for the accumulated frame)
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_C_372")

# accumulate the testing results
all_results <- results_temp

Count candidates and look at fold-change distributions

In [10]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("Control vs 372 logFC distributions by candidate")
A tibble: 4 × 2
candidate       n
<fct>       <int>
high          198
med           190
low           145
no           7395

Main summary plots

We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.

MA plots

In [11]:
# make MA plots
MA_plots(results, "ave_C", "ave_x372", "Control versus 372")

Scatter plots

In [12]:
# make scatter plots
scatter_plots(results, "ave_C", "ave_x372", "Control versus 372")

Volcano plot

In [13]:
# make a volcano plot
volcano_plot(results, "ave_C", "ave_x372", "Control versus 372")

DE candidates have different expression from the main background

Although the DE candidates have rather small expression differences, they do seem to have over and under expression that differs from the unchanged background. We do seem to have a good balance between up and down regulation.

The data from the PAW processing looks qualitatively similar. There are more candidates and the purple non-changing proteins are tighter to the 1-to-1 lines.

Check some individual protein expression (top 10 in each direction)

We will rank the DE candidates by p-values (FDR) and look at the top 10 in each expression direction.

In [14]:
# look at the top candidates
set_plot_dimensions(6, 3.5)
plot_top_tags(results, 3, 3, 10)
set_plot_dimensions(7, 7)

Individual plots look okay

This data uses the full protein accessions as labels. The PAW processing extracted just the identifier part because it is a little more human friendly.

(2) Control versus 191 (3x2)

Compare the non-targeting control treatment to the miR-191-5p mimic.

In [15]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("C", "y191"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black") # 2-fold change lines

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("Control vs 191 p-value distribution")
       y191-C
Down       73
NotSig   7790
Up         65
A data.frame: 10 × 5
genes               logFC       logCPM      PValue        FDR
<chr>               <dbl>       <dbl>       <dbl>         <dbl>
34232::P09914       2.406977    4.8327935   4.280502e-69  3.393582e-65
40792::A0A087X279   1.961028    4.2303992   6.416997e-64  2.543698e-60
68412::Q15646       2.211137    0.9696468   3.658474e-46  9.668127e-43
45252::Q8TCB0       1.962589    3.8121370   1.092027e-45  2.164398e-42
38842::Q13325       1.408997    4.4071604   3.738113e-43  5.927152e-40
23722::O95786       1.216630    5.7728401   2.618797e-42  3.460304e-39
42242::Q16666       1.325603    4.0659425   6.052608e-42  6.855010e-39
59072::A0A3B3IRK8   1.549158    2.3303744   2.459405e-32  2.437270e-29
29582::O14879       1.876212    5.2114645   1.122489e-31  9.887881e-29
59392::Q00978       1.646146    2.2726383   1.789277e-29  1.418539e-26

We have fewer candidates

We have 138 candidates. The other plots look pretty okay. The PAW processing had 210 candidates and the p-value distribution looked better.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

In [16]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, C, "C", y191, "y191")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_C_191")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions

In [17]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("Control versus 191 logFC distributions by candidate")
A tibble: 4 × 2
candidate       n
<fct>       <int>
high           86
med            29
low            23
no           7790

MA plots

In [18]:
# make MA plots
MA_plots(results, "ave_C", "ave_y191", "Control versus 191")

Scatter plots

In [19]:
# make scatter plots
scatter_plots(results,  "ave_C", "ave_y191", "Control versus 191")

Volcano plot

In [20]:
# make a volcano plot
volcano_plot(results,  "ave_C", "ave_y191", "Control versus 191")

Expression patterns seem okay

We have fewer DE candidates, but the plots are all pretty similar to what we saw in the first comparison. We do have some larger fold-changes for over-expressed proteins. The plots for this comparison are also quite similar to what we saw with the PAW/Comet analysis.

Check some individual protein expression

In [21]:
# look at the top candidates
set_plot_dimensions(6, 3.5)
plot_top_tags(results, 3, 2, 10)
set_plot_dimensions(7, 7)

We are using trended variance to help with low replicate numbers

This is a 3 versus 2 comparison, but we are using the trended variance computed from all 10 channels. This should stabilize the p-values and be more conservative.
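
As a sanity check (not part of the original analysis), the same 3-by-2 comparison could be run with the dispersion estimated from only the control and miR-191 channels; comparing the candidate counts to the table above shows how much the shared trend from all 10 channels is helping. A sketch using the objects already defined:

# sketch: re-estimate dispersion using only the C and y191 samples
y_sub <- DGEList(counts = data_tmt[c(C, y191)], group = group[c(C, y191)], genes = accessions)
y_sub <- calcNormFactors(y_sub)
y_sub <- estimateDisp(y_sub)
et_sub <- exactTest(y_sub, pair = c("C", "y191"))
summary(decideTestsDGE(et_sub, p.value = 0.10))  # compare to the counts from the full 10-channel model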


(3) Control versus 519c (3x2)

Compare the non-targeting controls to the miR-519c-3p mimics. This is another 3 by 2 comparison.

In [22]:
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("C", "z519c"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("Control versus 519c p-value distribution")
       z519c-C
Down       440
NotSig    7194
Up         294
A data.frame: 10 × 5
genes               logFC        logCPM     PValue        FDR
<chr>               <dbl>        <dbl>      <dbl>         <dbl>
47192::G3V2D5       1.4763659    3.618011   1.206594e-39  9.565873e-36
13722::O14578-4    -1.0864707    6.884337   9.179329e-35  2.844086e-31
16432::H7BYJ3      -1.0888041    6.560491   1.076218e-34  2.844086e-31
27042::Q6PL18      -0.9194608    5.476025   2.212971e-23  4.386109e-20
16932::P29279       1.1474045    6.498067   3.395127e-22  5.383314e-19
32322::P58335-4    -0.8269649    4.954956   1.099767e-21  1.453159e-18
40792::A0A087X279   1.0958894    4.230399   1.566782e-20  1.774493e-17
14882::Q00534       0.8659236    6.755998   4.781440e-20  4.738407e-17
53472::P24385      -1.0705151    2.943683   6.534809e-19  5.756440e-16
46222::O15240       0.8594167    3.687841   3.288392e-16  2.607037e-13

We have 734 DE candidates

We have more DE candidates for this comparison (734). The MDS cluster plots and CVs indicated that the two 519c samples were very similar. We are still using the trended variance from all 10 channels. We might have an even larger number of positive testing results if we allowed the lower 519c variance to drive candidates. The increased number of candidates suggests that we might have some larger fold-changes here.

The PAW processing had over 1,100 candidates.

Compute averages, flag candidates, and save the pair-wise test results

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

In [23]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, C, "C", z519c, "z519c")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_C_519")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions

In [24]:
# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-3, 3)) +
  ggtitle("Control versus 519c logFC distributions by candidate")
A tibble: 4 × 2
candidate       n
<fct>       <int>
high          299
med           262
low           173
no           7194

MA plots

In [25]:
# make MA plots
MA_plots(results, "ave_C", "ave_z519c", "Control versus 519c")

Scatter plots

In [26]:
# make scatter plots
scatter_plots(results, "ave_C", "ave_z519c", "Control versus 519c")

Volcano plot

In [27]:
# make a volcano plot
volcano_plot(results, "ave_C", "ave_z519c", "Control versus 519c")

Expression patterns do not seem much different

We might have some more proteins with slightly larger fold-changes, but it is not anything very dramatic. The compressed intensities in MS2 TMT data really alter the statistical testing: because the variance is so low, even modest differences between condition means have a strong influence on the testing results.

Check individual protein expression

In [28]:
# look at the top candidates
set_plot_dimensions(6, 3.5)
plot_top_tags(results, 3, 2, 10)
set_plot_dimensions(7, 7)

We do seem to have a bit larger fold changes for the top candidates

The individual protein expression plots do seem to have larger expression changes for the top tags. We still do not get much more than 2-fold changes for any of the proteins.
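
A quick count backs that up (a sketch using the results frame from this last comparison):

# how many 519c candidates exceed a 2-fold change?
results %>%
    filter(candidate != "no") %>%
    summarize(candidates = n(), over_two_fold = sum(abs(FC) > 2))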


Summary

Notebooks let us provide more context on the analysis than a plain R script can. We can easily add quality control visualizations, normalization checks, and multiple types of expression visualizations. We are not really limited, although too many redundant data views may dilute the storytelling a little. Keep in mind that the plots you find compelling may be the ones that another reader finds appalling. Censoring a data analysis story down to your favorite visualizations is probably a worse choice than having a notebook be a bit longer. Navigation links can help.

Another notebook advantage is to compare different data processing results. The Mascot quantitative summary data for the protein-level reporter ion values are different from the PAW processing. The PAW processing uses summation for the protein aggregation. I am not sure what is being done in Mascot Daemon. The PAW processing produced lower median CVs and considerably more differential expression candidates. Many other aspects of the data characteristics seemed quite similar. How shotgun quantitative data is processed is pretty critical and there are many choices along the way. This blog on how we put Humpty Dumpty back together discusses some of the details.
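
For context, a summation rollup from PSMs (or peptides) to proteins is just a grouped sum. This toy sketch is for illustration only; it is not the actual PAW or Mascot code, and the psm_table and its columns are made up.

# toy sketch of a summed protein rollup (hypothetical psm_table, for illustration only)
psm_table <- tibble(Accession = c("P1", "P1", "P1", "P2", "P2"),
                    TMT_126   = c(10, 25, 5, 40, 60),
                    TMT_127N  = c(12, 22, 6, 38, 65))

protein_table <- psm_table %>%
    group_by(Accession) %>%
    summarize_all(sum)      # sum the reporter ion columns within each protein

protein_table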

What about the biology? I don't know. I do not do cancer research. The paper did not present any results from the second or third comparisons (only the controls and the miR-372-3p mimics). Performing all comparisons was easy to do here. Given the way that MS2 reporter ion data seems "squeezed" towards the diagonal, false negatives might be more of a concern with this type of TMT data.

One final note: reporter ion intensities from MS2 scans do not show much expression difference dynamic range. The bench protocols for these experiments are complicated and hard to do perfectly, so there should be some reasonable sample-to-sample variability. MS2 TMT data is so non-variable that it has to be an artifact. I wrote a blog about TMT ratio distortions. The more MS2 TMT data I look at, the more convinced I am that the data is much more biased than what the newer SPS MS3 methods produce.


Save the all_results frame to TSV file

In [29]:
write.table(all_results, "edgeR_Mascot_results.txt", sep = "\t",
           row.names = FALSE, na =  " ")

Log the session information

In [30]:
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] psych_1.9.12.31 edgeR_3.24.3    limma_3.38.3    scales_1.1.0   
 [5] forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5     purrr_0.3.3    
 [9] readr_1.3.1     tidyr_1.0.2     tibble_2.1.3    ggplot2_3.3.0  
[13] tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4       locfit_1.5-9.1   lubridate_1.7.4  lattice_0.20-40 
 [5] assertthat_0.2.1 digest_0.6.25    IRdisplay_0.7.0  R6_2.4.1        
 [9] cellranger_1.1.0 repr_1.1.0       backports_1.1.5  reprex_0.3.0    
[13] evaluate_0.14    httr_1.4.1       pillar_1.4.3     rlang_0.4.5     
[17] uuid_0.1-4       readxl_1.3.1     rstudioapi_0.11  labeling_0.3    
[21] splines_3.5.3    munsell_0.5.0    broom_0.5.2      compiler_3.5.3  
[25] modelr_0.1.6     pkgconfig_2.0.3  base64enc_0.1-3  mnormt_1.5-6    
[29] htmltools_0.4.0  tidyselect_1.0.0 fansi_0.4.1      crayon_1.3.4    
[33] dbplyr_1.4.2     withr_2.2.0      grid_3.5.3       nlme_3.1-145    
[37] jsonlite_1.6.1   gtable_0.3.0     lifecycle_0.2.0  DBI_1.1.0       
[41] magrittr_1.5     cli_2.0.2        stringi_1.4.6    farver_2.0.3    
[45] fs_1.3.2         xml2_1.2.5       generics_0.0.2   vctrs_0.2.4     
[49] IRkernel_1.1     tools_3.5.3      glue_1.3.2       hms_0.5.3       
[53] parallel_3.5.3   colorspace_1.4-1 rvest_0.3.5      pbdZMQ_0.3-3    
[57] haven_2.2.0     