Spectral Counting (SpC) Analysis with R and edgeR

Phil Wilmarth, OHSU

February 2019

Objectives and overview

This notebook will demonstrate using Jupyter notebooks, R, and edgeR to perform statistical analysis of a very large spectral counting study using a paired study design.

The notebook will:

  • load libraries and set up the R session
  • load the proteomics result file
  • do some exploratory data analysis of starting data
  • load data into edgeR data structures
  • normalize the data
  • compute trended variances
  • perform a paired study design analysis using generalized linear models
  • visualize the test results
  • collect results into table and write to a file

Dataset description

The data is from a recent study (Ref-1) where human retinal and choroidal endothelial cells were compared. The study was 5 donor eyes where retinal and choroidal cells were collected and cultured in a paired design. The cell lysates from each of the 10 cell cultures were profiled using large-scale separations with a fast-scanning linear ion trap. There were about half a million MS2 scans per sample for a dataset size of a little over 5 million. The data are available at the PRIDE archive (PXD005972).

A few relevant files from the archive are present in the repository:

  • analysis_overview.pptx - summary of the data analysis steps
  • quant_protein_summary_8.txt - a grouped protein summary file
  • edgeR_input.txt - data extracted from results file for edgeR analysis
  • HCEC_HREC_quant_protein_summary_sprot.xlsx - final summary sheet
    • proteomics data from quant_protein_summary_8.txt
    • statistical results from edgeR
    • extra protein annotations

The goal here is to demonstrate edgeR (Ref-2) statistical testing that can be used for spectral counting (Ref-3) semi-quantitative experiments. The PAW pipeline has some features for shotgun quantitation that we will be using. One is a protein grouping algorithm where highly homologous proteins (majority of peptides are shared with few unique peptides) are grouped together and the context for shared or unique peptides are updated accordingly. This extended parsimony logic is described in more detail here. The other feature is the splitting of shared peptide spectral counts based on relative unique peptide evidence.

1. Smith JR, David LL, Appukuttan B, Wilmarth PA. Angiogenic and Immunologic Proteins Identified by Deep Proteomic Profiling of Human Retinal and Choroidal Vascular Endothelial Cells: Potential Targets for New Biologic Drugs. Am J Ophthalmol. 2018 Sep;193:197-229. doi: 10.1016/j.ajo.2018.03.020. Epub 2018 Mar17. PubMed PMID: 29559410; PubMed Central PMCID: PMC6109601.

2. Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.

3. Liu, H., Sadygov, R.G. and Yates, J.R., 2004. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical chemistry, 76(14), pp.4193-4201.


Load R libraries

In [1]:
library("tidyverse")
library("edgeR")
library("limma")
library("psych")
library("gridExtra")
library("stringr")
library("scales")
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
 ggplot2 3.2.1      purrr   0.3.2
 tibble  2.1.3      dplyr   0.8.3
 tidyr   0.8.3      stringr 1.4.0
 readr   1.3.1      forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
Loading required package: limma

Attaching package: ‘psych’

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha


Attaching package: ‘gridExtra’

The following object is masked from ‘package:dplyr’:

    combine


Attaching package: ‘scales’

The following objects are masked from ‘package:psych’:

    alpha, rescale

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor

Data was prepped in Excel and exported to text file

The PAW pipeline, like most proteomics summary tables, has a lot more information that we need for the statistical analysis. Selecting the relevant columns from a large table is not too hard to do in R. Row filtering can be trickier. Decoy protein sequences are easy to drop. Common contaminant proteins can also be easier to identify. The situation for contaminants is often not so cut and dried. There are many ways to have contaminants in your samples that are not in any common contaminants database.

I prefer to open the PAW summary files with Excel and then do some searches in the protein description column for things like "keratin", "albumin", "hemoglobin", "trypsin" and look at the data across samples to decide if I think any of the matches should go into the contaminants pile. I typically flag all proteins that I do not want to use in the quantitative analysis in a "Filter" column and leave the column blank for usable proteins. Then I can sort on the "Filter" column and move excluded proteins to the bottom of the table. I usually make a new table with the "keepers" that has just the protein accessions and the quantitative columns of interest.


Set an appropriate average SpC cutoff

In any shotgun experiment, more things can be identified than can be quantified. For quantification we need data points for all samples. The "bottom" of the dataset will be proteins seen in one or a few samples. We will need to exclude those. How do we identify them? An ad hoc approach is to think about a two-condition experiment. These are often some sort of biomarker experiment so there might be a protein in one condition that is not in the other. A good rule of thumb for protein identification is two peptides per protein. That means that an SpC (spectral count) of 2 is at the detection limit. We want something above that lower limit to define a protein as confidently "present". A value of 5 is a good robust spectral count value; few decoy proteins have an SpC of 5 or greater. If we have 5 counts in one condition and no counts in the other, then an average SpC of 2.5 could be a good cutoff value. Do the data agree with this cutoff logic?

Plot fraction of missing data points versus average SPC

We have 10 samples per protein. We can average the counts over the 10 samples for each protein. We can sort the count table by decreasing average SpC. For each row as we go down the table, we can count the number of cells with zero out of the total number of cells. At the top of the sorted table, we have no missing data and the missing fraction is always zero. As we get towards the bottom of the table, we will encounter zero count cells and the missing fraction will start to increase.

I do not know how to program that in R (without some Google and Stack Overflow searching), but I do know how to do that in a couple of minutes in Excel. I created a text file with two columns: the average SpC and the missing fraction. Let's read that and plot the data.

In [2]:
# read in the data
temp <- read_tsv("ave_missing.txt")

# make a basic plot
ggplot(temp, aes(x = Ave, y = FracMissing)) +
  geom_point() +
  ggtitle("Missing Fraction versus Average SpC") + 
  labs(x = "Ave SpC", y = "Missing Fraction")
Parsed with column specification:
cols(
  Ave = col_double(),
  FracMissing = col_double()
)

We have sharp increase in missing data at low average SPC

We need to expand the x-axis to see where the increase starts.

In [3]:
# expanded x-axis plot
ggplot(temp, aes(x = Ave, y = FracMissing)) +
  geom_line() + 
  coord_cartesian(xlim = c(0, 8)) +
  ggtitle("Missing Fraction versus Average SpC") + 
  labs( x = "Ave SpC", y = "Missing Fraction") + 
  geom_vline(xintercept = 2.5, linetype = "dotted")

Cutoff of 2.5 looks pretty good

The fraction of data with missing values starts to rise at an average SpC of about 5.0. The missing data really starts to rise at average SpC values below 2.0. We usually want to cast a wide net for differential expression candidates in discovery experiments. A cutoff of 2.5 seems like a reasonable compromise. We might get some false positives near the cutoff, but those will be easy to catch manually later.


Read in the data

The data exported in the edgeR_input.txt file excluded contaminants, decoys, and any proteins with an average SpC of less than 2.5. We will read that data in and separate the accessions from the count data.

In [4]:
# read in the prepped data
paw_spc <- read_tsv("edgeR_input.txt")

# save accessions in vector and remove from data table
accession <- paw_spc$Accession
paw_spc <- select(paw_spc, -Accession)
head(paw_spc)
nrow(paw_spc)
Parsed with column specification:
cols(
  Accession = col_character(),
  HCEC_189 = col_double(),
  HCEC_191 = col_double(),
  HCEC_194 = col_double(),
  HCEC_195 = col_double(),
  HCEC_199 = col_double(),
  HREC_189 = col_double(),
  HREC_191 = col_double(),
  HREC_194 = col_double(),
  HREC_195 = col_double(),
  HREC_199 = col_double()
)
A tibble: 6 × 10
HCEC_189HCEC_191HCEC_194HCEC_195HCEC_199HREC_189HREC_191HREC_194HREC_195HREC_199
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
3088382431572925394040374838302830964492
2166276724432678291328063413226434433204
2233287321632395338821083405193327683491
1400189919641665196610261662129415121264
1207135215251456154911201544108617871511
1237143014571474141710861700113216471319
3454

The sample codes are HCEC for human choroidal endothelial cells and HREC for human retinal endothelial cells. The numbers are the 5 subject codes.

Are we "losing out" by having a minimum SpC cutoff?

Depending a little on how you approach protein FDR, we have about 5000 protein IDs. After the 2.5 average SpC cutoff, we are down to 3454. Is this an issue? The answer is no. Counting things by the number of proteins can lead to erroneous conclusions. We have 10 samples and we can tally some things based on in how many samples were the proteins seen: all 10 out of the 10, any 9 out of the 10, etc. I did that in Excel and made a short table.

In [5]:
# read in short table
freq <- read_tsv("Fractions_SpC_ID.txt")

# make a bar plot to summarize the data
ggplot(freq, aes(InHowMany, Fraction, fill = Measure)) +
  geom_bar(stat = "identity", color = "black", position = position_dodge()) +
  geom_text(aes(label = Fraction), vjust = 1.6, color = "black",
            position = position_dodge(0.9), size = 2.5) +
  scale_x_continuous( breaks = 1:10) +
  ggtitle("Fractions of total spectra counts or of total identifications") +
  xlab("In How Many Samples") + ylab("Fraction (%)")
Parsed with column specification:
cols(
  InHowMany = col_double(),
  Proteins = col_double(),
  Fraction = col_double(),
  Measure = col_character()
)

There are 2990 protein seen in all samples. That is about 60% of the identified proteins. However, those proteins account for 97% of all of the spectral counts. The 3454 proteins with an average SpC of 2.5 or greater account for 98.3% of the total spectral counts.

These separations were pretty extensive, so we are capturing most of the proteome in each sample. That is why the "bulk" of the spectral counts are associated with the proteins seen in all samples. The proteins that are not seen in each sample are lower abundant proteins (those can be numerous) that do not add up to a very large fraction of the "total" proteome.

Check choroidal and retinal sample similarities

This is a paired study design and we will use that in the statistical testing below. To get a sense of the data characteristics, we will compare the 5 choroidal samples to each other, and do the same for the 5 retinal samples. Multi-panel scatter plots are a nice method for exploratory data analysis. We will try linear and log scales. We have some zeros in the data, so we will add 1 for the log plots.

In [6]:
# function for simple normalization
SL_Norm <- function(df, color = NULL, plot = TRUE) {
    # This makes each channel sum to the average grand total
        # df - data frame of TMT intensities
        # returns a new data frame with normalized values
    
    # compute scaling factors to make colsums match the average sum
    norm_facs <- mean(c(colSums(df))) / colSums(df)
    cat("SL Factors:\n", sprintf("%-5s -> %f\n", colnames(df), norm_facs))
    df_sl  <- sweep(df, 2, norm_facs, FUN = "*")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(df_sl + 1), col = color, notch = TRUE, main = "SL Normalized data")
    }
    df_sl
}

# normalize the data before plotting
color <- c(rep('blue', 5), rep('red', 5))
paw_sl <- SL_Norm(paw_spc, color)

# shortcuts for the cell types
C <- 1:5
R <- 6:10

pairs.panels(paw_sl[C], main = "Choroidal")
pairs.panels(log2(paw_sl[C]+1), main = "Choroidal")
pairs.panels(paw_sl[R], main = "Retinal")
pairs.panels(log2(paw_sl[R]+1), main = "Retinal")
SL Factors:
 HCEC_189 -> 1.204953
 HCEC_191 -> 0.926952
 HCEC_194 -> 0.918081
 HCEC_195 -> 1.017995
 HCEC_199 -> 0.888918
 HREC_189 -> 1.248862
 HREC_191 -> 0.956278
 HREC_194 -> 1.155367
 HREC_195 -> 0.839545
 HREC_199 -> 1.008395

Spectral counting is not real precise

The linear plots are not too informative. The log plots indicate considerable sample-to-sample scatter. These experiments have about 40 SCX fractions run in 2-hour RP gradients. That is a lot of instrument time per sample. We will need pretty large differences in means to achieve statistical significance given the sample-to-sample variances.


We can get started with edgeR testing

We will use the counts and accessions that we created in R above from the main summary file and load data into some edgeR data structures. We will run the built-in TMM normalization (Ref-4) and see if the samples have any structure in a clustering plot.

4. Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.

In [7]:
# load the data into edgeR data structures
# group labels need to be factors
y <- DGEList(counts = paw_spc, genes = accession)

# run the TMM normalization (and library size corrections)
y <- calcNormFactors(y)

apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute grand total (library size) scalings
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size

    # the TMM factors are library adjustment factors (so divide by them)
    norm_facs <- lib_facs / y$samples$norm.factors
    cat("Overall Factors (lib.size+TMM):\n", sprintf("%-5s -> %f\n", 
                                                     colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    df_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(df_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(df_tmm + 1), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    df_tmm
}

paw_spc_tmm <- apply_tmm_factors(y, color)

# check clustering
plotMDS(y, col = color, main = "Retina vs Choroid")
Overall Factors (lib.size+TMM):
 HCEC_189 -> 1.168534
 HCEC_191 -> 0.930945
 HCEC_194 -> 0.792689
 HCEC_195 -> 1.078526
 HCEC_199 -> 0.877565
 HREC_189 -> 1.283733
 HREC_191 -> 1.045656
 HREC_194 -> 1.080556
 HREC_195 -> 0.827676
 HREC_199 -> 1.106263

Clustering has left-to-right separation between choroidal and retinal cells

The overall normalization factors (library size plus TMM factors) were all pretty close to 1.0. The centers of the boxplots are pretty well aligned. TMM is more of a center of the distribution matching method, so we might expect that medians of the distributions would end up reasonably well matched.

Can also see how normalization affected CV distributions

In [8]:
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

labeled_boxplot <- function(df, ylim, title) {
    # Makes a box plot with the median value labeled
        # df - data frame with data to compute CVs of
        # ylim - upper limit for y-axis
        # title - plot title
    cv = CV(df)
    boxplot(cv, ylim = c(0, ylim), notch = TRUE, main = title)
    text(x = 0.65, y = boxplot.stats(cv)$stats[3], 
         labels = round(boxplot.stats(cv)$stats[3], 1))
}

# see what effect TMM had on CV distributions
par(mfrow = c(2, 2))
labeled_boxplot(paw_spc[C], 150, "Choroid before")
labeled_boxplot(paw_spc[R], 150, "Retina before")
labeled_boxplot(paw_spc_tmm[C], 150, "Choroid after")
labeled_boxplot(paw_spc_tmm[R], 150, "Retina after")
par(mfrow = c(1, 1))

These low 30% CV for a technique like spectral counting for human subjects give some ballpark for comparison to newer methods.


ASIDE: try non-paired test for fun...

We will do a basic exact test in edgeR comparing choroidal to retinal cells (ignoring the paired nature of the study). We have a modest number of DE candidates (121 down and 181 up).

In [9]:
group <- factor(c(rep("C", 5), rep("R", 5)))
yy <-  DGEList(counts = paw_spc, group = group, genes = accession)
yy <- calcNormFactors(yy)
yy <- estimateDisp(yy)
et <- exactTest(yy)
topTags(et)$table
summary(decideTests(et, p.value = 0.10))
Design matrix not provided. Switch to the classic mode.
A data.frame: 10 × 5
geneslogFClogCPMPValueFDR
<chr><dbl><dbl><dbl><dbl>
312sp|P20591|MX1_HUMAN 6.4069899.3263031.344724e-434.644676e-40
533sp|P00352|AL1A1_HUMAN-7.5738488.6303564.616319e-337.972384e-30
1406sp|Q16799|RTN1_HUMAN 8.0725507.0660931.247038e-281.435757e-25
864sp|Q12797|ASPH_HUMAN 3.0168457.9891546.499063e-235.611941e-20
1348sp|Q14766|LTBP1_HUMAN 4.3842167.1402195.071970e-193.503717e-16
254sp|Q92626|PXDN_HUMAN 1.7970129.5554802.962647e-171.705497e-14
2454sp|Q5EBM0|CMPK2_HUMAN 6.6327005.8481104.692311e-142.315320e-11
1466sp|Q53EP0|FND3B_HUMAN 2.5714006.9761948.960200e-143.868566e-11
2034sp|P09914|IFIT1_HUMAN 5.1648966.2930811.488682e-135.713230e-11
2310sp|O14879|IFIT3_HUMAN 4.8096356.0123221.801410e-126.222069e-10
        R-C
Down    121
NotSig 3152
Up      181

Set up design matrix for paired study

EdgeR can do more complicated paired study designs if we use the linear modeling options. The edgeR user's guide has several examples to use as templates. We will be mostly using Example 4.1 but with a switch to the quasi-likelihood versions of function instead of the likelihood ratio ones. The QL framework (example 4.4) is perhaps more conservative. These test options have some differences in the numbers of candidates and the magnitudes of p-values computed under different models.

The glm tests require experimental design matrices and contrasts. limma has some functions to help set those up. We will need samples categorized by donor and by cell type. The modeling will allow multiple tests to be made controlling for different factors. The paired test is basically comparing choroidal versus retinal controlling for donor.

In [10]:
# create the experimental design matrix
donor <- factor(rep(c(189, 191, 194, 195, 199), 2))
cell <- factor(c(rep("C", 5), rep("R", 5)))

# Example 4.1 in edgeR user's guide
design <- model.matrix(~donor+cell)
rownames(design) <- colnames(y)
design
A matrix: 10 × 6 of type dbl
(Intercept)donor191donor194donor195donor199cellR
HCEC_189100000
HCEC_191110000
HCEC_194101000
HCEC_195100100
HCEC_199100010
HREC_189100001
HREC_191110001
HREC_194101001
HREC_195100101
HREC_199100011
In [11]:
# extimate the dispersion parameters and check
y <- estimateDisp(y, design, robust = TRUE)
y$common.dispersion
plotBCV(y, main = "Variance Trends")
0.0145726333184053

There are more dispersion checks after the QL modeling has been done

In [12]:
# fit statistical models (design matrix already in y$design)
fit <- glmQLFit(y, design, robust = TRUE)
plotQLDisp(fit)
In [13]:
# if we do not specify a contrast, the default is the last column
# of the design matrix - a C versus R comparison
paired <- glmQLFTest(fit) # default comparison

# check test results
topTags(paired)$table
tt <- topTags(paired, n = Inf, sort.by = "none")$table
summary(decideTests(paired, p.value = 0.10))
A data.frame: 10 × 6
geneslogFClogCPMFPValueFDR
<chr><dbl><dbl><dbl><dbl><dbl>
533sp|P00352|AL1A1_HUMAN-7.6671108.651386494.08059.205467e-213.179568e-17
864sp|Q12797|ASPH_HUMAN 3.0174407.960610212.64521.601009e-152.764942e-12
1406sp|Q16799|RTN1_HUMAN 8.0313527.053481230.58707.396518e-158.515858e-12
254sp|Q92626|PXDN_HUMAN 1.8188219.544250171.18783.013074e-142.601790e-11
312sp|P20591|MX1_HUMAN 6.4308879.315547401.37434.502522e-142.741758e-11
1348sp|Q14766|LTBP1_HUMAN 4.3583657.129367165.39204.762752e-142.741758e-11
693sp|O00468|AGRIN_HUMAN 2.2635018.326636141.09163.813072e-131.881479e-10
294sp|Q63HN8|RN213_HUMAN 1.8005569.369372127.95271.605397e-126.199916e-10
2034sp|P09914|IFIT1_HUMAN 5.2428396.265716126.02551.615496e-126.199916e-10
366sp|Q9Y3Z3|SAMH1_HUMAN 1.6242419.136414120.85192.739599e-129.462577e-10
       cellR
Down     255
NotSig  2883
Up       316

We get more candidates with paired study design

We get quite a few more DE candidates using the paired study option and the glm modeling. In the publication, the likelihood ratio testing was used instead of the quasi-likelihood testing. That produced a few more candidates than we have here.

We need to reformat and visualize results

In [14]:
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# get the results summary
results <- collect_results(paw_spc_tmm, tt, C, "choroid", R, "retina")

pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + ggtitle("p-values expanded")
    grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}

# check the p-value distribution
pvalue_plots(results, 100, "Retina vs Choroid - SpC")

P-value distribution looks nice

We have a flat distribution at larger p-values and a spike at low p-values. We can see what the fold changes look like for the different levels of DE candidates and count candidates.

In [15]:
log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# make log2FC plots
log2FC_plots(results, 4, "Faceted log2FC")

# see how many candidates are in each category
results %>% count(candidate)
A tibble: 4 × 2
candidaten
<fct><int>
high 272
med 166
low 133
no 2883

We need to visualize the DE candidates (and non-candidates)

We will make:

  • MA plots
  • Scatter plots
  • A volcano plot
In [16]:
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# make MA plots
MA_plots(results, "ave_choroid", "ave_retina", "Choroid vs Retina")

edgeR robustly models the wider dispersion at low spectral counts


REALLY IMPORTANT ASIDE

The spectral count values have been maintained in their natural measurement scale. That means that they are actual counts that have a Poisson distribution. EdgeR models that correctly so that the large uncertainty in small count numbers does not result in large numbers of false positive DE candidates.

There are other commonly used transformations of spectral count data where this is not the case. Samples can be normalized to have all counts sum up to one by dividing each column by its total. That makes the measurement values be small fractions between 0.0 and 1.0. Taking logs of counts would also alter the natural measurement scale and change the mean-variance relationship. There can also be protein length correction (dividing by length or number of theoretical peptides) of counts to make relative abundances more like molar concentrations instead of weight percentages. Those have the effect of somewhat local scrambling of abundances and it disrupts the smooth dispersion trend making trended variance estimates less effective.

The bottom line is that just about anything you would do transformation-wise to spectral counts will hurt the data analysis. This is also likely true in other quantitative measures beside spectral counting. Determining a natural measurement scale is very important for data modeling. Do not arbitrarily transform data.

One more soap box moment. Often the distribution of all of the protein measures are examined and they typically have skewed distributions. Taking the logs of the measures can make the distributions appear more normally distributed (as taking logs does for just about anything). Arguments are made that this is appropriate when using a t-test. That is incorrect. The t-test is being applied to a single protein. It is the distribution of the measures of that particular protein that should be normally distributed, not the distribution of different proteins relative to each other in the whole dataset. We always have too few data points (the number of biological replicates) to say much of anything about whether individual protein measures are normally distributed. It can be a good thing to understand what a model is actually modeling.


In [17]:
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# make scatter plots
scatter_plots(results, "ave_choroid", "ave_retina", "Choroid vs Retina")
Warning message:
“Transformation introduced infinite values in continuous y-axis”Warning message:
“Transformation introduced infinite values in continuous x-axis”Warning message:
“Transformation introduced infinite values in continuous y-axis”Warning message:
“Transformation introduced infinite values in continuous x-axis”

We have some zeros in the average count vectors...

In [18]:
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# make a volcano plot
volcano_plot(results, "ave_choroid", "ave_retina", "Choroid vs Retina")

The DE candidates look convincing in all of the plot types

Look at 25 of the top up and down DE candidate proteins

We can plot data for individual proteins, too. We will pick the top 25 up- and down-regulated proteins where the selection was based on the edgeR FDR values.

In [19]:
# ============== individual protein expression plots ===========================

# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    identifier[,3]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_candidates <- function(results, nleft, nright, show_these = c("high", "med", "low")) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    
    # get desired candidates
    proteins <- proteins %>% 
        filter(candidate %in% show_these) %>% arrange(candidate)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", p-val: ", scientific(row$FDR, digits = 3), ", ", row$candidate)
        barplot(vec, col = color, main = title)
    } 
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top ipregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", p-val: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1))
        barplot(vec, col = color, main = title)
    }    
}
# plot the top 25 up and 20 down proteins
set_plot_dimensions(6, 3.5)
plot_top_tags(results, 5, 5, 25)
set_plot_dimensions(7, 7)

Save the results and log the session

Save results data frame to a disk file and log the R session. Generating the DE candidates is one of the earlier steps in the analysis and interpretation. Adding annotations, functional enrichment analysis, pathway analysis, and more are needed to extract the biological meaning from the data. The results table should be easy to add back to the main proteomics table before the next analysis steps.

In [20]:
# write results
write.table(results, "edgeR_results.txt", sep = "\t", row.names = FALSE, na = " ")

# log the session
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.0.0    gridExtra_2.3   psych_1.8.12    edgeR_3.24.3   
 [5] limma_3.38.3    forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
 [9] purrr_0.3.2     readr_1.3.1     tidyr_0.8.3     tibble_2.1.3   
[13] ggplot2_3.2.1   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-3     statmod_1.4.32   locfit_1.5-9.1   tidyselect_0.2.5
 [5] repr_1.0.1       splines_3.5.3    haven_2.1.1      lattice_0.20-38 
 [9] colorspace_1.4-1 generics_0.0.2   vctrs_0.2.0      htmltools_0.3.6 
[13] base64enc_0.1-3  rlang_0.4.0      pillar_1.4.2     foreign_0.8-72  
[17] glue_1.3.1       withr_2.1.2      modelr_0.1.5     readxl_1.3.1    
[21] uuid_0.1-2       munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0
[25] rvest_0.3.4      evaluate_0.14    labeling_0.3     parallel_3.5.3  
[29] broom_0.5.2      IRdisplay_0.7.0  Rcpp_1.0.2       backports_1.1.4 
[33] IRkernel_1.0.2   jsonlite_1.6     mnormt_1.5-5     hms_0.5.1       
[37] digest_0.6.20    stringi_1.4.3    grid_3.5.3       cli_1.1.0       
[41] tools_3.5.3      magrittr_1.5     lazyeval_0.2.2   crayon_1.3.4    
[45] pkgconfig_2.0.2  zeallot_0.1.0    xml2_1.2.2       lubridate_1.7.4 
[49] assertthat_0.2.1 httr_1.4.1       rstudioapi_0.10  R6_2.4.0        
[53] nlme_3.1-141     compiler_3.5.3  
In [ ]: