Re-analysis of PXD014414: Metaplastic breast cancer¶

Sabra I. Djomehri, et al.¶

Nature Communications (2020) 11:1723 ¶

Data re-analysis performed by Phil Wilmarth, PSR Core OHSU¶

This notebook will be using limma instead of edgeR¶

July 26, 2020¶

Overview
Library and data loading
Clustering and normalization
1. Normal versus Triple Negative
2. Normal versus Metaplastic
3. Triple Negative versus Metaplastic
Summary

Overview ¶

The data were first presented in this publication:

Djomehri, S.I., Gonzalez, M.E., da Veiga Leprevost, F., Tekula, S.R., Chang, H.Y., White, M.J., Cimino-Mathews, A., Burman, B., Basrur, V., Argani, P. and Nesvizhskii, A.I., 2020. Quantitative proteomic landscape of metaplastic breast carcinoma pathological subtypes and their relationship to triple-negative tumors. Nature communications, 11(1), pp.1-15.

The IRS adjusted reporter ion intensities from the PAW pipeline were analyzed with Jupyter notebooks, an R kernel, the Bioconductor package edgeR, and its TMM normalization method. This notebook format allows rich data visualization and quality control checks.

Load the necessary R libraries¶

# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.0     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.5
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘scales’


The following object is masked from ‘package:purrr’:

    discard


The following object is masked from ‘package:readr’:

    col_factor


Attaching package: ‘psych’


The following objects are masked from ‘package:scales’:

    alpha, rescale


The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

Define common functions for notebook use¶

# ============== CV function ===================================================
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

# =========== Boxplot with median label ========================================
labeled_boxplot <- function(df, ylim, title) {
    # Makes a box plot with the median value labeled
        # df - data frame with data to compute CVs of
        # ylim - upper limit for y-axis
        # title - plot title
    cv = CV(df)
    boxplot(cv, ylim = c(0, ylim), notch = TRUE, main = title)
    text(x = 0.65, y = boxplot.stats(cv)$stats[3], 
         labels = round(boxplot.stats(cv)$stats[3], 1))
}

# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# ================= reformat edgeR test results ================================
collect_results_limma <- function(df, tt, x, xlab, y, ylab, genes) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$adj.P.Val, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$P.Value, FDR = tt$adj.P.Val, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# =============== p-value plots ================================================
pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + ggtitle("p-values expanded")
    grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}

# ============= log2 fold-change distributions =================================
log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    identifier[,3]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top ipregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1),
                       ", ", row$candidate)
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

Load the IRS-normalized TMT intensity data ¶

The pandas Python script that does the IRS normalization arranges the tab-delimited table so that importing into R is relatively straightforward. Note than this experimental design has only a single reference channel per plex instead of two.

We need to drop contaminant and decoy proteins, and proteins with missing sets of reporter ions. We extract the accessions column, the SL and IRS normed data columns, and collect channels by biological condition. We also grab the reference channels to check the IRS procedure.

# load the IRS-normalized data and check the table
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_9_IRS_normalized.txt", guess_max = 5250)

# "Filter" column flags contams and decoys
# "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the table from pandas is sorted so the rows we want come first
data_all <- filter(data_import, is.na(Filter), is.na(Missing))
data_sl <- data_all %>% select(., contains("SLNorm_")) %>% 
  select(., -contains("_Unused")) %>% 
  select(., -contains("_Pool"))
data_irs <- data_all %>% select(., contains("IRSNorm_")) %>% 
  select(., -contains("_Unused")) %>% 
  select(., -contains("_Pool"))
data_pool <- data_all %>% select(., contains("_Pool_"))

Parsed with column specification:
cols(
  .default = col_double(),
  Accession = col_character(),
  Identical = col_character(),
  Similar = col_character(),
  OtherLoci = col_character(),
  Filter = col_character(),
  Missing = col_character(),
  Description = col_character()
)

See spec(...) for full column specifications.

# save a few columns for the results table
all_results <- data_all %>% select(., ProtGroup, Counter, Accession, Description, starts_with("PSMs_Used"))

# save gene names for edgeR so we can double check that results line up
accessions <- data_all$Accession

# see how many rows of data we have
length(accessions)

Group samples into the three types of tissue¶

We will lump all the metaplastic breast cancer (MBC) samples together to contrast to triple negative breast cancer and normal tissue samples. There will be 14 mbc, 6 triple negative, and 7 normal samples.

# all categories of metaplastic breast cancer tissue
mbc_sl <- select(data_sl, contains("_C"), contains("_SP"), contains("_SQ"))
mbc_irs <- select(data_irs, contains("_C"), contains("_SP"), contains("_SQ"))

# triple negative breast cancer tissue
tn_sl <- select(data_sl, contains("_TN"))
tn_irs <- select(data_irs, contains("_TN"))

# Normal tissue
n_sl <- select(data_sl, contains("_N"))
n_irs <- select(data_irs, contains("_N"))

# collect the biological replicate channels
bio_sl <- cbind(n_sl, tn_sl, mbc_sl)
bio_irs <- cbind(n_irs, tn_irs, mbc_irs)

# set a color vector for plots
color <- c(rep("red", 7), rep("blue", 6), rep("green", 14))

set_plot_dimensions(9, 9)

Normalize the data ¶

Run TMM normalization on the IRS data so we have a starting point for limma.

# put groups together into a single data frame
tmt_sl <- bio_sl
tmt_irs <- bio_irs

# define the positions of the groups
N <- 1:7
TN <- 8:13
MP <- 14:27

# set some colors by condition
group <- c(rep("N", 7), rep("TN", 6), rep("MBC", 14))

# get the biological sample data into a DGEList object
y <- DGEList(counts = tmt_irs, group = group, genes = accessions)

# run TMM normalization (also includes a library size factor)
y <- calcNormFactors(y)

tmt_tmm <- apply_tmm_factors(y, color = color)

# check the clustering
plotMDS(y, col = color, main = "all samples after TMM")

Library size factors:
 IRSNorm_N1_1 -> 1.008535
 IRSNorm_N2_1 -> 1.011973
 IRSNorm_Nx_2 -> 0.989973
 IRSNorm_N3_2 -> 1.005097
 IRSNorm_N4_2 -> 0.997700
 IRSNorm_N5_3 -> 1.015249
 IRSNorm_N6_3 -> 1.008969
 IRSNorm_TN1_1 -> 1.001659
 IRSNorm_TN2_1 -> 1.001914
 IRSNorm_TN3_2 -> 1.016278
 IRSNorm_TN4_2 -> 0.993309
 IRSNorm_TN5_3 -> 1.001864
 IRSNorm_TN6_3 -> 1.004372
 IRSNorm_C1_1 -> 0.992717
 IRSNorm_C2_1 -> 0.999395
 IRSNorm_C4_2 -> 0.997888
 IRSNorm_C5_3 -> 0.997909
 IRSNorm_SP1_1 -> 0.999110
 IRSNorm_SP2_2 -> 0.998050
 IRSNorm_SP3_2 -> 0.995285
 IRSNorm_SP4_3 -> 0.988267
 IRSNorm_SP5_3 -> 0.988331
 IRSNorm_SP6_3 -> 0.993169
 IRSNorm_SQ1_1 -> 0.994292
 IRSNorm_SQ2_1 -> 1.003056
 IRSNorm_SQ3_2 -> 1.001224
 IRSNorm_SQ4_3 -> 0.995877

Trimmed mean of M-values (TMM) factors:
 IRSNorm_N1_1 -> 1.152091
 IRSNorm_N2_1 -> 1.299722
 IRSNorm_Nx_2 -> 1.578921
 IRSNorm_N3_2 -> 1.167884
 IRSNorm_N4_2 -> 1.016422
 IRSNorm_N5_3 -> 1.182453
 IRSNorm_N6_3 -> 0.991036
 IRSNorm_TN1_1 -> 0.927147
 IRSNorm_TN2_1 -> 0.885920
 IRSNorm_TN3_2 -> 0.999065
 IRSNorm_TN4_2 -> 1.027606
 IRSNorm_TN5_3 -> 0.872837
 IRSNorm_TN6_3 -> 0.803830
 IRSNorm_C1_1 -> 1.143793
 IRSNorm_C2_1 -> 0.841743
 IRSNorm_C4_2 -> 1.101707
 IRSNorm_C5_3 -> 0.957607
 IRSNorm_SP1_1 -> 0.957029
 IRSNorm_SP2_2 -> 0.890115
 IRSNorm_SP3_2 -> 1.085426
 IRSNorm_SP4_3 -> 0.944171
 IRSNorm_SP5_3 -> 0.912511
 IRSNorm_SP6_3 -> 0.950222
 IRSNorm_SQ1_1 -> 0.881965
 IRSNorm_SQ2_1 -> 0.916414
 IRSNorm_SQ3_2 -> 0.952798
 IRSNorm_SQ4_3 -> 0.867931

Combined (lib size and TMM) normalization factors:
 IRSNorm_N1_1 -> 1.161924
 IRSNorm_N2_1 -> 1.315284
 IRSNorm_Nx_2 -> 1.563089
 IRSNorm_N3_2 -> 1.173837
 IRSNorm_N4_2 -> 1.014085
 IRSNorm_N5_3 -> 1.200484
 IRSNorm_N6_3 -> 0.999924
 IRSNorm_TN1_1 -> 0.928685
 IRSNorm_TN2_1 -> 0.887615
 IRSNorm_TN3_2 -> 1.015327
 IRSNorm_TN4_2 -> 1.020731
 IRSNorm_TN5_3 -> 0.874465
 IRSNorm_TN6_3 -> 0.807344
 IRSNorm_C1_1 -> 1.135462
 IRSNorm_C2_1 -> 0.841233
 IRSNorm_C4_2 -> 1.099380
 IRSNorm_C5_3 -> 0.955605
 IRSNorm_SP1_1 -> 0.956177
 IRSNorm_SP2_2 -> 0.888379
 IRSNorm_SP3_2 -> 1.080309
 IRSNorm_SP4_3 -> 0.933093
 IRSNorm_SP5_3 -> 0.901863
 IRSNorm_SP6_3 -> 0.943731
 IRSNorm_SQ1_1 -> 0.876931
 IRSNorm_SQ2_1 -> 0.919214
 IRSNorm_SQ3_2 -> 0.953965
 IRSNorm_SQ4_3 -> 0.864353

Statistical testing starts here¶

We will run pairwise 2-sample 2-tailed t-test with equal variance. The data will be the log2 normalized intensities.

(1) Normal versus Triple Negative ¶

Compute averages, flag candidates, and save the pair-wise test results¶

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

# copy the data (log2 of intensities)
limma_PAW <- log2(tmt_tmm)
row.names(limma_PAW) <- accessions # add accessions as row names

# set up the design matrix
group <- as.factor(c(rep("N", length(N)), rep("TN", length(TN)), rep("MP", length(MP))))
group <- factor(group, levels = c("N", "TN", "MP")) # set the factor order
design <- model.matrix(~ 0 + group)
colnames(design) <- c("N", "TN", "MP")

# make the contrast
contrast <- makeContrasts(N-TN, levels = design)
contrast

# do the linear model fitting
fit <- lmFit(limma_PAW, design)

# get the fit for the contrast of interest
fit2 <- contrasts.fit(fit, contrast)

# do the empirical Bayes moderation of the test statistic (with trended variance)
fit2 <- eBayes(fit2, trend = TRUE)

# grab the information in topTable so we can get the data to plot candidates
# the coef parameter has to do with the contrast of interest
# specify no sorting of results and a number that is longer than the data table
tt_limma <- topTable(fit2, coef = 1, sort.by = "none", number = Inf)

# let's see how many up and down candidates, and the top tags
summary(decideTests(fit2, p.value = 0.05))
topTable(fit2)

       N - TN
Down      188
NotSig   3611
Up        333

# get the results summary
results <- collect_results_limma(tmt_tmm, tt_limma, N, "N", TN, "TN", accessions)

# check the p-value distribution
ggplot(results, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("N versus TN p-value distribution")

We have 521 DE candidates¶

We have a modest number of DE candidates. The p-value distribution suggests that the low p-value candidates are a little "soft" (not really small p-values) because the "spike" is more of a narrow distribution.

# make column names unique by adding comparison (for the accumulated frame)
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_N_TN")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions¶

We will still make the MA plots, scatter plots, and volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.

# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("N vs TN logFC distributions by candidate")

Main summary plots¶

We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.

MA plots¶

# make MA plots
MA_plots(results, "ave_N", "ave_TN", "N vs TN")

Pattern of non-DE candidates is qualitatively similar to DE candidates¶

We sort of have some difference in expression patterns for the non-DE candidates and the highly significant candidates.

Scatter plots¶

# make scatter plots
scatter_plots(results, "ave_N", "ave_TN", "N vs TN")

Volcano plot¶

# make a volcano plot
volcano_plot(results, "ave_N", "ave_TN", "N vs TN")

Many proteins are down regulated in breast cancer¶

Data views seem okay. Magnitude of p-values is not too small, but the volcano plot looks okay.

Check some individual protein expression¶

We can look at the data for the top 20 (by BH-corrected p-values) up and down regulated proteins.

# look at the top 20 candidates in each direction (up in TN, then down in TN)
set_plot_dimensions(9, 3.5)
plot_top_tags(results, 7, 6, 20)
set_plot_dimensions(9, 9)

(2) Normal versus Metaplastic ¶

We will do the same testing to compare normal tissue to metaplastic breast cancer samples.

# make the contrast
contrast <- makeContrasts(N-MP, levels = design)
contrast

# get the fit for the contrast of interest
fit2 <- contrasts.fit(fit, contrast)

# do the empirical Bayes moderation of the test statistic (with trended variance)
fit2 <- eBayes(fit2, trend = TRUE)

# grab the information in topTable so we can get the data to plot candidates
# the coef parameter has to do with the contrast of interest
# specify no sorting of results and a number that is longer than the data table
tt_limma <- topTable(fit2, coef = 1, sort.by = "none", number = Inf)

# let's see how many up and down candidates, and the top tags
summary(decideTests(fit2, p.value = 0.05))
topTable(fit2)

# get the results summary
results <- collect_results_limma(tmt_tmm, tt_limma, N, "N", MP, "MP", accessions)

# check the p-value distribution
ggplot(results, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("N versus MP p-value distribution") # check the p-value distribution

       N - MP
Down      369
NotSig   3312
Up        451

We have more candidates¶

We have 820 candidates in this comparison. The p-value distribution looks okay.

Compute averages, flag candidates, and save the pair-wise test results¶

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_N_MP")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions¶

# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("N vs MP logFC distributions by candidate")

MA plots¶

# make MA plots
MA_plots(results, "ave_N", "ave_MP", "N vs MP")

Scatter plots¶

# make scatter plots
scatter_plots(results,  "ave_N", "ave_MP", "N vs MP")

Volcano plot¶

# make a volcano plot
volcano_plot(results,  "ave_N", "ave_MP", "N vs MP")

The candidates have a similar pattern¶

We have robust down-regulation in metaplastic breast cancer compared to normal tissue. We may have more DE candidates compared to the first comparison because we have 14 MBC samples compared to 6 for triple negative. Generally, we should have more statistical power with larger replicate numbers (all things being equal). Expression patterns and volcano plot look better, too.

Check some individual protein expression¶

# look at the top 20 candidates (up in MP, then down in MP)
set_plot_dimensions(9, 3.5)
plot_top_tags(results, 7, 14, 20)
set_plot_dimensions(9, 9)

The top DE candidates look more convincing¶

There are some sample to sample variations in line with what one expects for human subjects. However, we do have clear differences in expression levels.

(3) Triple Negative versus Metaplastic ¶

We can compare the non-metaplastic breast cancer samples to the metaplastic samples. These samples were not very distinct from each other in the cluster plots.

# make the contrast
contrast <- makeContrasts(TN-MP, levels = design)
contrast

# get the fit for the contrast of interest
fit2 <- contrasts.fit(fit, contrast)

# do the empirical Bayes moderation of the test statistic (with trended variance)
fit2 <- eBayes(fit2, trend = TRUE)

# grab the information in topTable so we can get the data to plot candidates
# the coef parameter has to do with the contrast of interest
# specify no sorting of results and a number that is longer than the data table
tt_limma <- topTable(fit2, coef = 1, sort.by = "none", number = Inf)

# let's see how many up and down candidates, and the top tags
summary(decideTests(fit2, p.value = 0.05))
topTable(fit2)

# get the results summary
results <- collect_results_limma(tmt_tmm, tt_limma, TN, "TN", MP, "MP", accessions)

# check the p-value distribution
ggplot(results, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("TN versus MP p-value distribution") # check the p-value distribution

       TN - MP
Down         0
NotSig    4132
Up           0

We have no candidates¶

There is no point in plotting things.

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_TN_MP")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

head(results)

MA plots¶

# make MA plots
MA_plots(results, "ave_TN", "ave_MP", "TN vs MP")

Scatter plots¶

# make scatter plots
scatter_plots(results, "ave_TN", "ave_MP", "TN vs MP")

Volcano plot¶

# make a volcano plot
volcano_plot(results, "ave_TN", "ave_MP", "TN vs MP")

Reduced number of candidates, but still some down regulation for MBC¶

Check individual protein expression¶

These will all be non-candidates.

# look at the top 10 candidates (up in MP, then down in MP)
set_plot_dimensions(9, 3.5)
plot_top_tags(results, 6, 14, 10)
set_plot_dimensions(9, 9)

Summary ¶

We had significant differences between normal tissue samples and either set of cancer samples. limma testing is more conservative than edgeR testing. limma, with its moderated test statistic, looks better than the 2-sample t-test.

Save the `all_results` frame to TSV file¶

# write the results to disk
write.table(all_results, "results_limma.txt", sep = "\t",
           row.names = FALSE, na =  " ")

Log the session information¶

# log the session details
sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] psych_1.9.12.31 edgeR_3.24.3    limma_3.38.3    scales_1.1.0   
 [5] forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5     purrr_0.3.3    
 [9] readr_1.3.1     tidyr_1.0.2     tibble_2.1.3    ggplot2_3.3.0  
[13] tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4       locfit_1.5-9.1   lubridate_1.7.4  lattice_0.20-40 
 [5] assertthat_0.2.1 digest_0.6.25    IRdisplay_0.7.0  R6_2.4.1        
 [9] cellranger_1.1.0 repr_1.1.0       backports_1.1.5  reprex_0.3.0    
[13] evaluate_0.14    httr_1.4.1       pillar_1.4.3     rlang_0.4.5     
[17] uuid_0.1-4       readxl_1.3.1     rstudioapi_0.11  labeling_0.3    
[21] splines_3.5.3    munsell_0.5.0    broom_0.5.2      compiler_3.5.3  
[25] modelr_0.1.6     pkgconfig_2.0.3  base64enc_0.1-3  mnormt_1.5-6    
[29] htmltools_0.4.0  tidyselect_1.0.0 fansi_0.4.1      crayon_1.3.4    
[33] dbplyr_1.4.2     withr_2.2.0      grid_3.5.3       nlme_3.1-145    
[37] jsonlite_1.6.1   gtable_0.3.0     lifecycle_0.2.0  DBI_1.1.0       
[41] magrittr_1.5     cli_2.0.2        stringi_1.4.6    farver_2.0.3    
[45] fs_1.3.2         xml2_1.2.5       generics_0.0.2   vctrs_0.2.4     
[49] IRkernel_1.1     tools_3.5.3      glue_1.3.2       hms_0.5.3       
[53] parallel_3.5.3   colorspace_1.4-1 rvest_0.3.5      pbdZMQ_0.3-3    
[57] haven_2.2.0

	logFC	AveExpr	t	P.Value	adj.P.Val	B
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
sp\|P02750\|A2GL_HUMAN	2.256412	17.52211	6.537487	4.297078e-07	0.0007232031	6.424896
sp\|P15088\|CBPA3_HUMAN	2.768689	18.17815	6.288058	8.341576e-07	0.0007232031	5.808048
sp\|P07585\|PGS2_HUMAN	3.690774	24.00972	6.265944	8.849314e-07	0.0007232031	5.753035
sp\|P28906\|CD34_HUMAN	2.375388	14.92570	6.264920	8.873563e-07	0.0007232031	5.750487
sp\|P12109\|CO6A1_HUMAN	2.463413	23.69726	6.172837	1.135426e-06	0.0007232031	5.520867
sp\|P23946\|CMA1_HUMAN	2.907811	17.40106	6.156793	1.185346e-06	0.0007232031	5.480773
sp\|O14558\|HSPB6_HUMAN	2.992663	17.71030	6.144475	1.225175e-06	0.0007232031	5.449973
sp\|O95810\|CAVN2_HUMAN	2.498668	16.64844	6.068556	1.502374e-06	0.0007759760	5.259829
sp\|Q9UEY8\|ADDG_HUMAN	1.245871	16.83575	5.980019	1.906919e-06	0.0008177936	5.037413
sp\|P04275\|VWF_HUMAN	1.326607	18.36020	5.966230	1.979171e-06	0.0008177936	5.002712

	logFC	AveExpr	t	P.Value	adj.P.Val	B
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
sp\|O14558\|HSPB6_HUMAN	3.2293871	17.71030	7.968884	1.090722e-08	3.745302e-05	9.959497
sp\|Q969G5\|CAVN3_HUMAN	2.4089924	18.99492	7.649707	2.421942e-08	3.745302e-05	9.205136
sp\|P23946\|CMA1_HUMAN	2.9514503	17.40106	7.510595	3.442787e-08	3.745302e-05	8.871869
sp\|O95810\|CAVN2_HUMAN	2.5660577	16.64844	7.490204	3.625656e-08	3.745302e-05	8.822796
sp\|P15088\|CBPA3_HUMAN	2.7006022	18.17815	7.371457	4.905956e-08	4.054282e-05	8.535886
sp\|P07585\|PGS2_HUMAN	3.5444045	24.00972	7.232071	7.011865e-08	4.205167e-05	8.196699
sp\|P42330\|AK1C3_HUMAN	2.7251140	17.58119	7.195991	7.693932e-08	4.205167e-05	8.108483
sp\|P46782\|RS5_HUMAN	-0.9966082	21.35049	-7.174036	8.141659e-08	4.205167e-05	8.054719
sp\|P21399\|ACOC_HUMAN	1.3052261	20.54042	7.120030	9.359270e-08	4.296945e-05	7.922206
sp\|Q13642\|FHL1_HUMAN	2.8998197	19.65159	7.078871	1.041050e-07	4.301620e-05	7.820963

	logFC	AveExpr	t	P.Value	adj.P.Val	B
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
sp\|P02794\|FRIH_HUMAN	-1.0060801	18.99785	-5.232813	1.454396e-05	0.06009566	2.06153299
sp\|P49773\|HINT1_HUMAN	0.8113519	20.40913	4.526457	1.004624e-04	0.20755534	0.69780533
sp\|P36776\|LONM_HUMAN	0.7673013	20.23731	4.197682	2.455261e-04	0.31296037	0.05491719
sp\|Q15102\|PA1B3_HUMAN	0.6873740	18.17124	4.119876	3.029626e-04	0.31296037	-0.09724655
sp\|Q14980\|NUMA1_HUMAN	0.6025250	21.73054	3.997791	4.207979e-04	0.32150544	-0.33570216
sp\|Q96I59\|SYNM_HUMAN	0.7941657	14.65173	3.904308	5.405111e-04	0.32150544	-0.51788740
sp\|O75116\|ROCK2_HUMAN	-0.5508043	18.71127	-3.861046	6.066682e-04	0.32150544	-0.60203702
sp\|P35556\|FBN2_HUMAN	-1.5645686	17.58132	-3.851399	6.224694e-04	0.32150544	-0.62078586
sp\|P21283\|VATC1_HUMAN	-0.5645443	19.43877	-3.792112	7.288056e-04	0.32972523	-0.73586377
sp\|P42330\|AK1C3_HUMAN	1.4941076	17.58119	3.742907	8.303850e-04	0.32972523	-0.83116836

	IRSNorm_TN1_1_tmm	IRSNorm_TN2_1_tmm	IRSNorm_TN3_2_tmm	IRSNorm_TN4_2_tmm	IRSNorm_TN5_3_tmm	IRSNorm_TN6_3_tmm	IRSNorm_C1_1_tmm	IRSNorm_C2_1_tmm	IRSNorm_C4_2_tmm	IRSNorm_C5_3_tmm	⋯	IRSNorm_SQ4_3_tmm	logFC	FC	PValue	FDR	ave_TN	ave_MP	direction	candidate	Acc
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	⋯	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<fct>	<fct>	<fct>
1	54214170	91080466	63253969	252745580	53880555	64859355	258612806	114735507	83106941	163794818	⋯	28241984	-0.1670371	1.209868	0.70035814	0.9780014	96672349	116960739	up	no	sp\|P02787\|TRFE_HUMAN
2	103538734	101798625	87255042	46379332	90867807	145562724	127674879	77028273	278027439	137704872	⋯	96402320	-0.2773392	1.234958	0.24132365	0.8936171	95900377	118432976	up	no	sp\|P04406\|G3P_HUMAN
3	77263186	119701133	46254025	46821197	61329814	52790948	120089988	77997370	126427090	82349588	⋯	99111754	-0.6554589	1.577066	0.02069129	0.7020572	67360050	106231227	up	no	sp\|P08670\|VIME_HUMAN_family
4	34466145	57562194	49466943	268022428	24473686	60167183	130621107	85883891	64873419	96767089	⋯	24470530	-0.2526888	1.008481	0.57334230	0.9623796	82359763	83058283	up	no	sp\|P02647\|APOA1_HUMAN
5	64709281	119006821	106761899	57026161	79604307	36911988	59178934	57161301	86083334	100871870	⋯	106971423	-0.1202404	1.061691	0.66459577	0.9717303	77336743	82107718	up	no	sp\|Q09666\|AHNK_HUMAN
6	34841175	55113845	29269499	220626570	26504571	42012253	155182916	71495403	55148759	88028469	⋯	18085906	-0.3767222	1.088582	0.35687778	0.9180108	68061319	74090313	up	no	sp\|P01024\|CO3_HUMAN

A tibble: 4 × 2
candidate	n
<fct>	<int>
high	153
med	368
low	244
no	3367

A tibble: 4 × 2
candidate	n
<fct>	<int>
high	407
med	413
low	332
no	2980

Re-analysis of PXD014414: Metaplastic breast cancer¶

Sabra I. Djomehri, et al.¶

Data re-analysis performed by Phil Wilmarth, PSR Core OHSU¶

This notebook will be using limma instead of edgeR¶

July 26, 2020¶

Load the necessary R libraries¶

Define common functions for notebook use¶

Group samples into the three types of tissue¶

Statistical testing starts here¶

Compute averages, flag candidates, and save the pair-wise test results¶

We have 521 DE candidates¶

Count candidates and look at fold-change distributions¶

Main summary plots¶

MA plots¶

Pattern of non-DE candidates is qualitatively similar to DE candidates¶

Scatter plots¶

Volcano plot¶

Many proteins are down regulated in breast cancer¶

Check some individual protein expression¶

We have more candidates¶

Compute averages, flag candidates, and save the pair-wise test results¶

Count candidates and look at fold-change distributions¶

MA plots¶

Scatter plots¶

Volcano plot¶

The candidates have a similar pattern¶

Check some individual protein expression¶

The top DE candidates look more convincing¶

We have no candidates¶

MA plots¶

Scatter plots¶

Volcano plot¶

Reduced number of candidates, but still some down regulation for MBC¶

Check individual protein expression¶

Save the all_results frame to TSV file¶

Log the session information¶

Save the `all_results` frame to TSV file¶