Re-analysis of PXD014414: Metaplastic breast cancer¶

Sabra I. Djomehri, et al.¶

Nature Communications (2020) 11:1723 ¶

Data re-analysis performed by Phil Wilmarth, PSR Core OHSU¶

April 24, 2020¶

Overview
Library and data loading
Clustering and normalization
1. Normal versus Triple Negative
2. Normal versus Metaplastic
3. Triple Negative versus Metaplastic
Summary

Overview ¶

The data were first presented in this publication:

Djomehri, S.I., Gonzalez, M.E., da Veiga Leprevost, F., Tekula, S.R., Chang, H.Y., White, M.J., Cimino-Mathews, A., Burman, B., Basrur, V., Argani, P. and Nesvizhskii, A.I., 2020. Quantitative proteomic landscape of metaplastic breast carcinoma pathological subtypes and their relationship to triple-negative tumors. Nature communications, 11(1), pp.1-15.

A total of 27 breast tissue samples were studied with TMT (Ref. 1) labeling using the newer SPS MS3 method (Ref. 2). The instrument was a Thermo Fusion Tribrid mass spectrometer (Ref. 3). The paper was looking at the proteomic signature of metaplastic breast cancer and how different metaplastic subtypes might differ. Normal tissue and less aggressive triple negative breast cancer samples were also profiled. One of the metaplastic samples was only normal tissue. The final number of samples were 7 normal tissue, 6 triple negative non-metaplastic, and 14 metaplastic samples (in a few categories). The data in the above publication was processed with an MSFragger/Philosopher data analysis pipeline (github.com/Nesvilab/philosopher) using a reporter ion ratio oriented methodology. The use of ratios and log transformations make the analysis less transparent and not easy to visually check.

In this notebook, I will show that high quality analyses can be done without using ratios. The 27 samples in the experiment were split across 3 10-plex experiments (8-fraction separations for each). The 131 channel contained a pooled reference channel. The pooled channel was used for Internal Reference Scaling (IRS) (Ref. 4) so that all of the data could be kept in its original intensity scale. The Comet/PAW pipeline (Ref. 5) was used for highly sensitive PSM identification and robust TMT reporter ion processing.

The PAW pipeline uses accurate mass measurements from modern instruments in a flexible way maximize PSM identifications. Extended parsimony principles maximize the quantiative information content compared to cumbersome razor peptide approaches. The two-peptide rule is employed for accurate protein identification without any reliance on ad hoc protein ranking functions. Very low intensity reporter ion scans are filtered out and any remaining missing reporter ion data is handled in a simple way.

There were 754,130 MS scans from the 24 RAW files. There were 266,723 PSMs at 1% FDR from the PAW processing. The publication reported 251,305 at a similar FDR. The PAW protein inference had 5,127 non-contaminant, non-decoy proteins (proteins, protein groups, or protein families) with at least two distinct peptides per protein per plex. There were 5,087 proteins that had some reporter ion signals in at least one plex.

After the IRS step, there were 4,132 quantifiable proteins seen in each plex. This was 81% of the total number of protein identifications. Loss of 955 proteins might seem concerning at first glance. However, the 4,132 proteins account for 99.55% of the total reporter ion intensity in the 3-plex experiment. The 955 rejected proteins were only 0.45% of the total signal. If we compute the average IRS adjusted reporter ion intensity for each of the 4,132 proteins and sort them by decreasing average intensity, the last protein has an average intensity of 1,032. The smallest non-zero reporter ion signals in individual scans was around 500 (Orbitrap intensities have some small minimum value (we see 350-ish on our Fusion, this data had about 500) or the data is missing). We require two peptides and 2 times 500 is 1,000 and that is right where the cutoff is.

Another important point with TMT data and missing data for multi-plex experiments is that data end up missing by plex rather than randomly by individual data point. This is likely something no missing data imputation algorithm was designed to consider. The 4,132 proteins only have 242 missing intensities out of 111,564 data points (4132 times 27) or 0.2% missing values. The 955 proteins have 25,785 data points with 11,454 missing data points, for 44.4% missing. Practically speaking, nearly all of the missing data (98%) come from those 955 proteins.

Aside: It is human nature to focus on what is missing rather than what is present. Robust quantification of over 4,000 proteins was unheard of just a few years ago. Starting with more sample and using more fractions is the best way to quantify the 955 "lost" proteins. The experiment could also be repeated on a Lumos or Eclipse to get more identifications and more intense reporter ion signals. Performing a second, better experiment is always more productive than over-analyzing data.

The IRS adjusted reporter ion intensities were analyzed with Jupyter notebooks, an R kernel, the Bioconductor package edgeR (Ref. 6), and its TMM normalization method (Ref. 7). This notebook format allows rich data visualization and quality control checks.

References¶

Ref. 1. Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.

Ref. 2. McAlister, G.C., Nusinow, D.P., Jedrychowski, M.P., Wühr, M., Huttlin, E.L., Erickson, B.K., Rad, R., Haas, W. and Gygi, S.P., 2014. MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes. Analytical chemistry, 86(14), pp.7150-7158.

Ref. 3. Senko, M.W., Remes, P.M., Canterbury, J.D., Mathur, R., Song, Q., Eliuk, S.M., Mullen, C., Earley, L., Hardman, M., Blethrow, J.D. and Bui, H., 2013. Novel parallelized quadrupole/linear ion trap/Orbitrap tribrid mass spectrometer improving proteome coverage and peptide identification rates. Analytical chemistry, 85(24), pp.11710-11714.

Ref. 4. Plubell, D.L., Wilmarth, P.A., Zhao, Y., Fenton, A.M., Minnier, J., Reddy, A.P., Klimek, J., Yang, X., David, L.L. and Pamir, N., 2017. Extended multiplexing of tandem mass tags (TMT) labeling reveals age and high fat diet specific proteome changes in mouse epididymal adipose tissue. Molecular & Cellular Proteomics, 16(5), pp.873-890.

Ref. 5. Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.

Ref. 6. Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.

Ref. 7. Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.

Load the necessary R libraries¶

# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)

Warning message:
"package 'tidyverse' was built under R version 3.6.2"-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.2.1     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.4
v tidyr   1.0.2     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
Warning message:
"package 'ggplot2' was built under R version 3.6.2"Warning message:
"package 'tibble' was built under R version 3.6.2"Warning message:
"package 'tidyr' was built under R version 3.6.2"Warning message:
"package 'readr' was built under R version 3.6.2"Warning message:
"package 'purrr' was built under R version 3.6.2"Warning message:
"package 'dplyr' was built under R version 3.6.2"Warning message:
"package 'stringr' was built under R version 3.6.3"Warning message:
"package 'forcats' was built under R version 3.6.2"-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Warning message:
"package 'scales' was built under R version 3.6.2"
Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

Warning message:
"package 'limma' was built under R version 3.6.2"Warning message:
"package 'edgeR' was built under R version 3.6.2"Warning message:
"package 'psych' was built under R version 3.6.2"
Attaching package: 'psych'

The following objects are masked from 'package:scales':

    alpha, rescale

The following objects are masked from 'package:ggplot2':

    %+%, alpha

Define common functions for notebook use¶

# ============== CV function ===================================================
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

# =========== Boxplot with median label ========================================
labeled_boxplot <- function(df, ylim, title) {
    # Makes a box plot with the median value labeled
        # df - data frame with data to compute CVs of
        # ylim - upper limit for y-axis
        # title - plot title
    cv = CV(df)
    boxplot(cv, ylim = c(0, ylim), notch = TRUE, main = title)
    text(x = 0.65, y = boxplot.stats(cv)$stats[3], 
         labels = round(boxplot.stats(cv)$stats[3], 1))
}

# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# =============== p-value plots ================================================
pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + ggtitle("p-values expanded")
    grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}

# ============= log2 fold-change distributions =================================
log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    identifier[,3]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top ipregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1),
                       ", ", row$candidate)
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

Load the IRS-normalized TMT intensity data ¶

The pandas Python script that does the IRS normalization arranges the tab-delimited table so that importing into R is relatively straightforward. Note than this experimental design has only a single reference channel per plex instead of two.

We need to drop contaminant and decoy proteins, and proteins with missing sets of reporter ions. We extract the accessions column, the SL and IRS normed data columns, and collect channels by biological condition. We also grab the reference channels to check the IRS procedure.

# load the IRS-normalized data and check the table
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_9_IRS_normalized.txt", guess_max = 5250)

# "Filter" column flags contams and decoys
# "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the table from pandas is sorted so the rows we want come first
data_all <- filter(data_import, is.na(Filter), is.na(Missing))
data_sl <- data_all %>% select(., contains("SLNorm_")) %>% 
  select(., -contains("_Unused")) %>% 
  select(., -contains("_Pool"))
data_irs <- data_all %>% select(., contains("IRSNorm_")) %>% 
  select(., -contains("_Unused")) %>% 
  select(., -contains("_Pool"))
data_pool <- data_all %>% select(., contains("_Pool_"))

Parsed with column specification:
cols(
  .default = col_double(),
  Accession = col_character(),
  Identical = col_character(),
  Similar = col_character(),
  OtherLoci = col_character(),
  Filter = col_character(),
  Missing = col_character(),
  Description = col_character()
)
See spec(...) for full column specifications.

# save a few columns for the results table
all_results <- data_all %>% select(., ProtGroup, Counter, Accession, Description, starts_with("PSMs_Used"))

# save gene names for edgeR so we can double check that results line up
accessions <- data_all$Accession

# see how many rows of data we have
length(accessions)

Group samples into the three types of tissue¶

We will lump all the metaplastic breast cancer (MBC) samples together to contrast to triple negative breast cancer and normal tissue samples. There will be 14 mbc, 6 triple negative, and 7 normal samples.

# all categories of metaplastic breast cancer tissue
mbc_sl <- select(data_sl, contains("_C"), contains("_SP"), contains("_SQ"))
mbc_irs <- select(data_irs, contains("_C"), contains("_SP"), contains("_SQ"))

# triple negative breast cancer tissue
tn_sl <- select(data_sl, contains("_TN"))
tn_irs <- select(data_irs, contains("_TN"))

# Normal tissue
n_sl <- select(data_sl, contains("_N"))
n_irs <- select(data_irs, contains("_N"))

Check the data intensity distributions before and after IRS¶

As part of the IRS script, the total intensities per channel across all three plexes are matched (sample-loading normalization). The IRS method corrects for pseudo random scan selection by the instrument using the pooled reference channel(s). We can look at boxplots after SL-normalization and after IRS-normalization.

# collect the biological replicate channels
bio_sl <- cbind(n_sl, tn_sl, mbc_sl)
bio_irs <- cbind(n_irs, tn_irs, mbc_irs)

# set a color vector for plots
color <- c(rep("red", 7), rep("blue", 6), rep("green", 14))

# boxplots
boxplot(log10(bio_sl), col = color, notch = TRUE, main = "SL Normalized data")
boxplot(log10(bio_irs), col = color, notch = TRUE, main = "IRS Normalized data")

Check clustering of sample¶

The above boxplots did not look too different. Boxplots are not sufficient to verify that normalization worked as expected. Clustering is an orthogonal way to check things. We should have samples cluster by biological conditions if we have successfully removed other sources of measurement variation.

# clustering of the replicates before IRS
plotMDS(log2(bio_sl), col = color, main = "Replicates before IRS")

# clustering after IRS
plotMDS(log2(bio_irs), col = color, main = "Replicates after IRS")

IRS removes TMT plex effects¶

Without IRS, the samples cluster by TMT plex. After IRS we recover the biological groupings. While IRS might look like other "batch correction" methods, it is actually superior. IRS does not rely on balanced study designs and is more reliable with the relatively small numbers of channels in TMT experiments.

Compare sample-to-sample consistency within groups using scatter plots¶

We can also look at the scatter plots of samples to each other within each biological group. We will do that before and after IRS.

# the normal tissue samples (includes the one mis-classified sample)
pairs.panels(log10(n_sl), lm = TRUE, main = "Normal, SLNorm")
pairs.panels(log10(n_irs), lm = TRUE, main = "Normal, IRSNorm")

Scatter is reduced with IRS¶

We have correlation coefficients of around 0.95 for samples within the same plex, and correlation coefficients of around 0.85 between plexes before IRS. After IRS, all correlation coefficients are in the mid to upper 0.9s.

# the triple negative samples
pairs.panels(log10(tn_sl), lm = TRUE, main = "Triple Negative, SLNorm")
pairs.panels(log10(tn_irs), lm = TRUE, main = "Triple Negative, IRSNorm")

# the MBC C samples
pairs.panels(log10(mbc_sl[1:4]), lm = TRUE, main = "Metaplastic C, SLNorm")
pairs.panels(log10(mbc_irs[1:4]), lm = TRUE, main = "Metaplastic C, IRSNorm")

# the MBC Sp samples
pairs.panels(log10(mbc_sl[5:10]), lm = TRUE, main = "Metaplastic Sp, SLNorm")
pairs.panels(log10(mbc_irs[5:10]), lm = TRUE, main = "Metaplastic Sp, IRSNorm")

# the MBC Sq samples
pairs.panels(log10(mbc_sl[11:14]), lm = TRUE, main = "Metaplastic Sq, SLNorm")
pairs.panels(log10(mbc_irs[11:14]), lm = TRUE, main = "Metaplastic Sq, IRSNorm")

We can also check the pooled standard channels before and after IRS¶

This is a little artificial because IRS will make the reference channels identical. The before IRS plots (SLNorm) show how the random sampling distorts the data. These samples are chemically/biologically identical and should yield very similar intensities.

# check the pooled standards
pairs.panels(log10(data_pool[1:3]), lm = TRUE, main = "Pools, SLNorm")
pairs.panels(log10(data_pool[4:6]), lm = TRUE, main = "Pools, IRSNorm")

Use data from all 3 tissue types for the edgeR analysis¶

We are defining the groups that will be compared explicitly and using all of the samples for variance estimates (similar to pooled variance methods). We will put the data into a data frame, grouped by condition. We will define some column indexes for each condition, set some colors for plotting, and see how the data cluster by condition.

# put groups together into a single data frame
tmt_sl <- bio_sl
tmt_irs <- bio_irs

# define the positions of the groups
N <- 1:7
TN <- 8:13
MBC <- 14:27

# set some colors by condition
group <- c(rep("N", 7), rep("TN", 6), rep("MBC", 14))

Run TMM normalization and check final clustering ¶

We will load the data into edgeR data structures and call the calcNormFactors function to perform library size and the trimmed mean of M-values (TMM) normalization. We will double check if the TMM normalization changed the clustering that we had above.

We need to use the edgeR normalization factors to produce the TMM normalized data that the statistical testing will be working with. EdgeR uses the normalization factors in its statistical modeling but does not output the normalized intensities. We compute the normalized intensities with the apply_tmm_factors function.

# get the biological sample data into a DGEList object
y <- DGEList(counts = tmt_irs, group = group, genes = accessions)

# run TMM normalization (also includes a library size factor)
y <- calcNormFactors(y)

tmt_tmm <- apply_tmm_factors(y, color = color)

# check the clustering
plotMDS(y, col = color, main = "all samples after TMM")

Library size factors:
 IRSNorm_N1_1 -> 1.008535
 IRSNorm_N2_1 -> 1.011973
 IRSNorm_Nx_2 -> 0.989973
 IRSNorm_N3_2 -> 1.005097
 IRSNorm_N4_2 -> 0.997700
 IRSNorm_N5_3 -> 1.015249
 IRSNorm_N6_3 -> 1.008969
 IRSNorm_TN1_1 -> 1.001659
 IRSNorm_TN2_1 -> 1.001914
 IRSNorm_TN3_2 -> 1.016278
 IRSNorm_TN4_2 -> 0.993309
 IRSNorm_TN5_3 -> 1.001864
 IRSNorm_TN6_3 -> 1.004372
 IRSNorm_C1_1 -> 0.992717
 IRSNorm_C2_1 -> 0.999395
 IRSNorm_C4_2 -> 0.997888
 IRSNorm_C5_3 -> 0.997909
 IRSNorm_SP1_1 -> 0.999110
 IRSNorm_SP2_2 -> 0.998050
 IRSNorm_SP3_2 -> 0.995285
 IRSNorm_SP4_3 -> 0.988267
 IRSNorm_SP5_3 -> 0.988331
 IRSNorm_SP6_3 -> 0.993169
 IRSNorm_SQ1_1 -> 0.994292
 IRSNorm_SQ2_1 -> 1.003056
 IRSNorm_SQ3_2 -> 1.001224
 IRSNorm_SQ4_3 -> 0.995877

Trimmed mean of M-values (TMM) factors:
 IRSNorm_N1_1 -> 1.152091
 IRSNorm_N2_1 -> 1.299722
 IRSNorm_Nx_2 -> 1.578921
 IRSNorm_N3_2 -> 1.167884
 IRSNorm_N4_2 -> 1.016422
 IRSNorm_N5_3 -> 1.182453
 IRSNorm_N6_3 -> 0.991036
 IRSNorm_TN1_1 -> 0.927147
 IRSNorm_TN2_1 -> 0.885920
 IRSNorm_TN3_2 -> 0.999065
 IRSNorm_TN4_2 -> 1.027606
 IRSNorm_TN5_3 -> 0.872837
 IRSNorm_TN6_3 -> 0.803830
 IRSNorm_C1_1 -> 1.143793
 IRSNorm_C2_1 -> 0.841743
 IRSNorm_C4_2 -> 1.101707
 IRSNorm_C5_3 -> 0.957607
 IRSNorm_SP1_1 -> 0.957029
 IRSNorm_SP2_2 -> 0.890115
 IRSNorm_SP3_2 -> 1.085426
 IRSNorm_SP4_3 -> 0.944171
 IRSNorm_SP5_3 -> 0.912511
 IRSNorm_SP6_3 -> 0.950222
 IRSNorm_SQ1_1 -> 0.881965
 IRSNorm_SQ2_1 -> 0.916414
 IRSNorm_SQ3_2 -> 0.952798
 IRSNorm_SQ4_3 -> 0.867931

Combined (lib size and TMM) normalization factors:
 IRSNorm_N1_1 -> 1.161924
 IRSNorm_N2_1 -> 1.315284
 IRSNorm_Nx_2 -> 1.563089
 IRSNorm_N3_2 -> 1.173837
 IRSNorm_N4_2 -> 1.014085
 IRSNorm_N5_3 -> 1.200484
 IRSNorm_N6_3 -> 0.999924
 IRSNorm_TN1_1 -> 0.928685
 IRSNorm_TN2_1 -> 0.887615
 IRSNorm_TN3_2 -> 1.015327
 IRSNorm_TN4_2 -> 1.020731
 IRSNorm_TN5_3 -> 0.874465
 IRSNorm_TN6_3 -> 0.807344
 IRSNorm_C1_1 -> 1.135462
 IRSNorm_C2_1 -> 0.841233
 IRSNorm_C4_2 -> 1.099380
 IRSNorm_C5_3 -> 0.955605
 IRSNorm_SP1_1 -> 0.956177
 IRSNorm_SP2_2 -> 0.888379
 IRSNorm_SP3_2 -> 1.080309
 IRSNorm_SP4_3 -> 0.933093
 IRSNorm_SP5_3 -> 0.901863
 IRSNorm_SP6_3 -> 0.943731
 IRSNorm_SQ1_1 -> 0.876931
 IRSNorm_SQ2_1 -> 0.919214
 IRSNorm_SQ3_2 -> 0.953965
 IRSNorm_SQ4_3 -> 0.864353

TMM factors are close to 1 for most samples¶

The IRS method already takes care of the library size type of normalizations. We could still have compositional differences between sample groups that the TMM factors would correct for. We do not have any library size or TMM factors that are too much different from 1.0.

The box plots are very similar after TMM. TMM usually gets the medians (box centers) aligned pretty well.

# put CVs in data frames to simplify plots and summaries
cv_sl <- data.frame(N = CV(bio_sl[N]), TN = CV(bio_sl[TN]), MBC = CV(bio_sl[MBC])) 
cv_irs <- data.frame(N = CV(bio_irs[N]), TN = CV(bio_irs[TN]), MBC = CV(bio_irs[MBC]))
cv_tmm <- data.frame(N = CV(tmt_tmm[N]), TN = CV(tmt_tmm[TN]), MBC = CV(tmt_tmm[MBC]))

# see what the median CV values are
medians <- apply(cv_sl, 2, FUN = median)
print("SLNorm median CVs by condition (%)")
round(medians, 2)

medians <- apply(cv_irs, 2, FUN = median)
print("IRSNorm median CVs by condition (%)")
round(medians, 2)

medians <- apply(cv_tmm, 2, FUN = median)
print("Final median CVs by condition (%)")
round(medians, 2)

[1] "SLNorm median CVs by condition (%)"

[1] "IRSNorm median CVs by condition (%)"

[1] "Final median CVs by condition (%)"

IRS and TMM reduce the CV significantly¶

We have about 50% median CVs before IRS. IRS reduces those to around 30%. TMM makes some small additional improvements.

# see what the CV distibutions look like
# need long form for ggplot
long_cv_sl <- gather(cv_sl, key = "group", value = "cv")

# traditional boxplots
ggplot(long_cv_sl, aes(x = group, y = cv, fill = group)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("SL CV distributions")

# density plots
ggplot(long_cv_sl, aes(x = cv, color = group)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 200)) +
  ggtitle("SL CV distributions")

# need long form for ggplot
long_cv_tmm <- gather(cv_tmm, key = "group", value = "cv") 

# traditional boxplots
ggplot(long_cv_tmm, aes(x = group, y = cv, fill = group)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Final CV distributions")

# density plots
ggplot(long_cv_tmm, aes(x = cv, color = group)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 200)) +
  ggtitle("Final CV distributions")

EdgeR statistical testing starts here¶

Compute the shared variance trend¶

One of the most powerful features of edgeR (and limma) is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 27 samples across all conditions to use to improve the variance estimates and reduce false positive differential expression (DE) candidates. We have an edgeR estimateDisp function that does all of this and a visualization function to check the result.

We loaded the IRS data into DGEList object y a few cells above and did the normalization step. We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test will take two conditions and compare them using the normalization factors and dispersion estimates saved in y.

# compute dispersions and plot BCV
y <- estimateDisp(y)
plotBCV(y, main = "BCV plot of IRS/TMM normalized data")

Design matrix not provided. Switch to the classic mode.

These sample have some relatively high inherent variance¶

(1) Normal versus Triple Negative ¶

We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The decideTestsDGE call will tell us how many up and down regulated candidates we have at an FDR of 0.10. The topTags call does the Benjamini-Hochberg multiple testing corrections. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates, and then check the p-value distribution.

# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("N", "TN"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("N versus TN p-value distribution")

       TN-N
Down    505
NotSig 3222
Up      405

We have 910 DE candidates¶

We have about 22% of the proteins with differential expression. The p-value distribution looks good. We have a flat distribution from non-DE proteins and a large spike at low p-values from the true DE candidates.

Compute averages, flag candidates, and save the pair-wise test results¶

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

Even though we do not have many candidates, we will still make the MA plots, scatter plots, and volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.

# get the results summary
results <- collect_results(tmt_tmm, tt$table, N, "N", TN, "TN")

# make column names unique by adding comparison (for the accumulated frame)
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_N_TN")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions¶

# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("N vs TN logFC distributions by candidate")

Main summary plots¶

We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.

MA plots¶

# make MA plots
MA_plots(results, "ave_N", "ave_TN", "N vs TN")

Scatter plots¶

# make scatter plots
scatter_plots(results, "ave_N", "ave_TN", "N vs TN")

Volcano plot¶

# make a volcano plot
volcano_plot(results, "ave_N", "ave_TN", "N vs TN")

Many proteins are down regulated in breast cancer¶

We have robust differential expression changes. We have DE candidates that look convincing in several different views. There are many proteins that are down regulated significantly (large negative fold changes) in breast cancer.

Check some individual protein expression¶

We can look at the data for the top 20 (by edgeR BH-corrected p-values) up and down regulated proteins.

# look at the top 20 candidates in each direction (up in TN, then down in TN)
set_plot_dimensions(6, 3.5)
plot_top_tags(results, 7, 6, 20)
set_plot_dimensions(7, 7)

(2) Normal versus Metaplastic ¶

We will do the same testing to compare normal tissue to metaplastic breast cancer samples.

# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("N", "MBC"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black") # 2-fold change lines

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("N vs MBC p-value distribution")

       MBC-N
Down     671
NotSig  2858
Up       603

We have more candidates¶

We had about 900 candidates in the normal versus triple negative. We have around 1,270 here. The p-value distribution looks good.

Compute averages, flag candidates, and save the pair-wise test results¶

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

# get the results summary
results <- collect_results(tmt_tmm, tt$table, N, "N", MBC, "MBC")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_N_MBC")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions¶

# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("N vs MBC logFC distributions by candidate")

MA plots¶

# make MA plots
MA_plots(results, "ave_N", "ave_MBC", "N vs MBC")

Scatter plots¶

# make scatter plots
scatter_plots(results,  "ave_N", "ave_MBC", "N vs MBC")

Volcano plot¶

# make a volcano plot
volcano_plot(results,  "ave_N", "ave_MBC", "N vs MBC")

The candidates have a similar pattern¶

We have similar more robust down-regulation in breast cancer compared to normal tissue. We may have more DE candidates compared to the first comparison because we have 14 MBC samples compared to 6 for triple negative. Generally, we should have more statistical power with larger replicate numbers (all things being equal).

Check some individual protein expression¶

# look at the top 20 candidates (up in MBC, then down in MBC)
set_plot_dimensions(6, 3.5)
plot_top_tags(results, 7, 14, 20)
set_plot_dimensions(7, 7)

The top DE candidates look pretty convincing¶

There are some sample to sample variations in line with what one expects for human subjects. However, we do have clear differences in expression levels. Given the moderate numbers of DE candidates, enrichment and pathway analyses are likely to provide a good index into the biology.

(3) Triple Negative versus Metaplastic ¶

We can compare the non-metaplastic breast cancer samples to the metaplastic samples. These samples were not very distinct from each other in the cluster plots.

# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("TN", "MBC"))

# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order (the sort.by parameter)!
topTags(et)$table
tt <- topTags(et, n = Inf, sort.by = "none")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tt$table, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("TN vs MBC p-value distribution")

       MBC-TN
Down       38
NotSig   4086
Up          8

We have only 46 DE candidates¶

We have a reasonable p-value distribution that is consistent with the reduced number of DE candidates. The variance in the samples is likely a factor in the smaller number of candidates. We probably have smaller differences in means between these samples compared to normal tissue, so the p-values are smaller.

Compute averages, flag candidates, and save the pair-wise test results¶

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and also accumulate all of the three comparisons into all_results.

Even though we have fewer candidates, we will still make the MA plot, scatter plot, and volcano plot.

# get the results summary
results <- collect_results(tmt_tmm, tt$table, TN, "TN", MBC, "MBC")

# make column names unique by adding comparison
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_TN_MBC")

# accumulate the testing results
all_results <- cbind(all_results, results_temp)

Count candidates and look at fold-change distributions¶

# see how many candidates by category
results %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle("TN vs MBC logFC distributions by candidate")

MA plots¶

# make MA plots
MA_plots(results, "ave_TN", "ave_MBC", "TN vs MBC")

Scatter plots¶

# make scatter plots
scatter_plots(results, "ave_TN", "ave_MBC", "TN vs MBC")

Volcano plot¶

# make a volcano plot
volcano_plot(results, "ave_TN", "ave_MBC", "TN vs MBC")

Reduced number of candidates, but still some down regulation for MBC¶

Check individual protein expression¶

# look at the top 10 candidates (up in MBC, then down in MBC)
set_plot_dimensions(6, 3.5)
plot_top_tags(results, 6, 14, 10)
set_plot_dimensions(7, 7)

Variation between cancer samples might be higher¶

We seem to have a few individual samples in the groups that drive the mean differences for these top candidates (at least they seem less consistent that in the first two comparisons). One might expect cancer samples to be more divergent compared to normal tissue. Comparing cancer samples to cancer samples might be more difficult.

Summary ¶

We had significant differences between normal tissue samples and either set of cancer samples. It might take some careful comparing and contrasting of the DE candidates to see what proteins might be involved to make metaplastic cancer more aggressive than non-metaplastic breast cancer. We will see how the different sub-categories of the metaplastic samples compare in the second notebook.

Save the `all_results` frame to TSV file¶

# write the results to disk
write.table(all_results, "three_tissues_results.txt", sep = "\t",
           row.names = FALSE, na =  " ")

Log the session information¶

# log the session details
sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] psych_1.9.12.31 edgeR_3.28.1    limma_3.42.2    scales_1.1.0   
 [5] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.4     purrr_0.3.3    
 [9] readr_1.3.1     tidyr_1.0.2     tibble_2.1.3    ggplot2_3.2.1  
[13] tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     locfit_1.5-9.1   lubridate_1.7.4  lattice_0.20-38 
 [5] assertthat_0.2.1 digest_0.6.25    IRdisplay_0.7.0  R6_2.4.1        
 [9] cellranger_1.1.0 repr_0.19.2      backports_1.1.5  reprex_0.3.0    
[13] evaluate_0.14    httr_1.4.1       pillar_1.4.3     rlang_0.4.5     
[17] lazyeval_0.2.2   uuid_0.1-2       readxl_1.3.1     rstudioapi_0.11 
[21] splines_3.6.1    labeling_0.3     munsell_0.5.0    broom_0.5.4     
[25] compiler_3.6.1   modelr_0.1.6     pkgconfig_2.0.3  base64enc_0.1-3 
[29] mnormt_1.5-6     htmltools_0.4.0  tidyselect_1.0.0 fansi_0.4.1     
[33] crayon_1.3.4     dbplyr_1.4.2     withr_2.1.2      grid_3.6.1      
[37] nlme_3.1-144     jsonlite_1.6.1   gtable_0.3.0     lifecycle_0.1.0 
[41] DBI_1.1.0        magrittr_1.5     cli_2.0.2        stringi_1.4.6   
[45] farver_2.0.3     fs_1.3.1         xml2_1.2.2       ellipsis_0.3.0  
[49] generics_0.0.2   vctrs_0.2.3      IRkernel_0.8.15  tools_3.6.1     
[53] glue_1.4.0       hms_0.5.3        parallel_3.6.1   colorspace_1.4-1
[57] rvest_0.3.5      pbdZMQ_0.3-3     haven_2.2.0

	genes	logFC	logCPM	PValue	FDR
2340	sp\|O95810\|CAVN2_HUMAN	-2.768358	4.686582	1.525828e-10	6.304721e-07
3688	sp\|P28906\|CD34_HUMAN	-2.469589	2.671826	1.336460e-09	2.761125e-06
1398	sp\|Q969G5\|CAVN3_HUMAN	-2.240681	6.833232	1.633640e-08	2.142636e-05
111	sp\|P12109\|CO6A1_HUMAN	-2.453344	11.423918	2.074188e-08	2.142636e-05
1671	sp\|O14558\|HSPB6_HUMAN	-2.998748	6.069957	2.625225e-08	2.169486e-05
45	sp\|P07585\|PGS2_HUMAN	-3.586562	12.569686	3.429727e-08	2.361939e-05
2311	sp\|P02750\|A2GL_HUMAN	-2.071664	5.164807	5.155130e-08	2.612057e-05
1589	sp\|P55268\|LAMB2_HUMAN	-2.262869	5.726815	5.434628e-08	2.612057e-05
287	sp\|P20774\|MIME_HUMAN	-4.222056	10.443917	5.817671e-08	2.612057e-05
1507	sp\|P22105\|TENX_HUMAN	-2.906840	5.968815	6.321532e-08	2.612057e-05

	genes	logFC	logCPM	PValue	FDR
2340	sp\|O95810\|CAVN2_HUMAN	-2.918234	4.686582	4.429248e-20	1.830165e-16
1671	sp\|O14558\|HSPB6_HUMAN	-3.443438	6.069957	2.967644e-18	6.131152e-15
1398	sp\|Q969G5\|CAVN3_HUMAN	-2.547367	6.833232	1.072333e-17	1.476959e-14
273	sp\|Q9NZN4\|EHD2_HUMAN	-3.391545	10.244597	3.252129e-16	3.359449e-13
2061	sp\|P42330\|AK1C3_HUMAN	-2.929691	5.572255	6.624749e-16	5.474692e-13
535	sp\|P21695\|GPDA_HUMAN	-4.538453	9.277384	9.604273e-16	6.614143e-13
971	sp\|Q13642\|FHL1_HUMAN	-3.090465	7.856586	3.110590e-15	1.836137e-12
445	sp\|P05091\|ALDH2_HUMAN	-2.370792	9.233643	4.239073e-15	2.189481e-12
1523	sp\|Q9HBL0\|TENS1_HUMAN	-2.089781	6.454683	8.822628e-15	4.050567e-12
627	sp\|P23141\|EST1_HUMAN	-3.737797	8.603866	1.112169e-14	4.595484e-12

	genes	logFC	logCPM	PValue	FDR
1481	sp\|O95994\|AGR2_HUMAN	-3.6422893	4.692122	4.738334e-09	1.957879e-05
1412	sp\|P02794\|FRIH_HUMAN	1.0387952	6.309903	1.295923e-06	2.677376e-03
1870	sp\|P09455\|RET1_HUMAN	-1.4862928	5.246895	2.573481e-06	3.544541e-03
2061	sp\|P42330\|AK1C3_HUMAN	-1.6637297	5.572255	8.138761e-06	5.638049e-03
2162	sp\|O43405\|COCH_HUMAN	-2.4802346	4.851930	8.618347e-06	5.638049e-03
775	sp\|P49773\|HINT1_HUMAN	-0.8238740	7.700555	9.498031e-06	5.638049e-03
3360	sp\|Q96I59\|SYNM_HUMAN	-0.9209535	1.965586	1.128082e-05	5.638049e-03
113	sp\|P05090\|APOD_HUMAN	-3.6508504	11.547581	1.163343e-05	5.638049e-03
812	sp\|Q9NZT1\|CALL5_HUMAN	-3.1204665	6.578901	1.228036e-05	5.638049e-03
853	sp\|P36776\|LONM_HUMAN	-0.8212505	7.531435	2.198747e-05	9.085224e-03

Re-analysis of PXD014414: Metaplastic breast cancer¶

Sabra I. Djomehri, et al.¶

Data re-analysis performed by Phil Wilmarth, PSR Core OHSU¶

April 24, 2020¶

References¶

Load the necessary R libraries¶

Define common functions for notebook use¶

Group samples into the three types of tissue¶

Check the data intensity distributions before and after IRS¶

Check clustering of sample¶

IRS removes TMT plex effects¶

Compare sample-to-sample consistency within groups using scatter plots¶

Scatter is reduced with IRS¶

We can also check the pooled standard channels before and after IRS¶

Use data from all 3 tissue types for the edgeR analysis¶

TMM factors are close to 1 for most samples¶

IRS and TMM reduce the CV significantly¶

EdgeR statistical testing starts here¶

Compute the shared variance trend¶

These sample have some relatively high inherent variance¶

We have 910 DE candidates¶

Compute averages, flag candidates, and save the pair-wise test results¶

Count candidates and look at fold-change distributions¶

Main summary plots¶

MA plots¶

Scatter plots¶

Volcano plot¶

Many proteins are down regulated in breast cancer¶

Check some individual protein expression¶

We have more candidates¶

Compute averages, flag candidates, and save the pair-wise test results¶

Count candidates and look at fold-change distributions¶

MA plots¶

Scatter plots¶

Volcano plot¶

The candidates have a similar pattern¶

Check some individual protein expression¶

The top DE candidates look pretty convincing¶

We have only 46 DE candidates¶

Compute averages, flag candidates, and save the pair-wise test results¶

Count candidates and look at fold-change distributions¶

MA plots¶

Scatter plots¶

Volcano plot¶

Reduced number of candidates, but still some down regulation for MBC¶

Check individual protein expression¶

Variation between cancer samples might be higher¶

Save the all_results frame to TSV file¶

Log the session information¶

Save the `all_results` frame to TSV file¶