This notebook will demonstrate using Jupyter notebooks, R, and edgeR to perform statistical analysis of a very large spectral counting study using a paired study design.
The notebook will:
The data is from a recent study (Ref-1) where human retinal and choroidal endothelial cells were compared. The study was 5 donor eyes where retinal and choroidal cells were collected and cultured in a paired design. The cell lysates from each of the 10 cell cultures were profiled using large-scale separations with a fast-scanning linear ion trap. There were about half a million MS2 scans per sample for a dataset size of a little over 5 million. The data are available at the PRIDE archive (PXD005972).
A few relevant files from the archive are present in the repository:
The goal here is to demonstrate edgeR (Ref-2) statistical testing that can be used for spectral counting (Ref-3) semi-quantitative experiments. The PAW pipeline has some features for shotgun quantitation that we will be using. One is a protein grouping algorithm where highly homologous proteins (majority of peptides are shared with few unique peptides) are grouped together and the context for shared or unique peptides are updated accordingly. This extended parsimony logic is described in more detail here. The other feature is the splitting of shared peptide spectral counts based on relative unique peptide evidence.
1. Smith JR, David LL, Appukuttan B, Wilmarth PA. Angiogenic and Immunologic Proteins Identified by Deep Proteomic Profiling of Human Retinal and Choroidal Vascular Endothelial Cells: Potential Targets for New Biologic Drugs. Am J Ophthalmol. 2018 Sep;193:197-229. doi: 10.1016/j.ajo.2018.03.020. Epub 2018 Mar17. PubMed PMID: 29559410; PubMed Central PMCID: PMC6109601.
2. Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.
3. Liu, H., Sadygov, R.G. and Yates, J.R., 2004. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical chemistry, 76(14), pp.4193-4201.
library("tidyverse")
library("edgeR")
library("limma")
library("psych")
library("gridExtra")
library("stringr")
library("scales")
The PAW pipeline, like most proteomics summary tables, has a lot more information that we need for the statistical analysis. Selecting the relevant columns from a large table is not too hard to do in R. Row filtering can be trickier. Decoy protein sequences are easy to drop. Common contaminant proteins can also be easier to identify. The situation for contaminants is often not so cut and dried. There are many ways to have contaminants in your samples that are not in any common contaminants database.
I prefer to open the PAW summary files with Excel and then do some searches in the protein description column for things like "keratin", "albumin", "hemoglobin", "trypsin" and look at the data across samples to decide if I think any of the matches should go into the contaminants pile. I typically flag all proteins that I do not want to use in the quantitative analysis in a "Filter" column and leave the column blank for usable proteins. Then I can sort on the "Filter" column and move excluded proteins to the bottom of the table. I usually make a new table with the "keepers" that has just the protein accessions and the quantitative columns of interest.
In any shotgun experiment, more things can be identified than can be quantified. For quantification we need data points for all samples. The "bottom" of the dataset will be proteins seen in one or a few samples. We will need to exclude those. How do we identify them? An ad hoc approach is to think about a two-condition experiment. These are often some sort of biomarker experiment so there might be a protein in one condition that is not in the other. A good rule of thumb for protein identification is two peptides per protein. That means that an SpC (spectral count) of 2 is at the detection limit. We want something above that lower limit to define a protein as confidently "present". A value of 5 is a good robust spectral count value; few decoy proteins have an SpC of 5 or greater. If we have 5 counts in one condition and no counts in the other, then an average SpC of 2.5 could be a good cutoff value. Do the data agree with this cutoff logic?
We have 10 samples per protein. We can average the counts over the 10 samples for each protein. We can sort the count table by decreasing average SpC. For each row as we go down the table, we can count the number of cells with zero out of the total number of cells. At the top of the sorted table, we have no missing data and the missing fraction is always zero. As we get towards the bottom of the table, we will encounter zero count cells and the missing fraction will start to increase.
I do not know how to program that in R (without some Google and Stack Overflow searching), but I do know how to do that in a couple of minutes in Excel. I created a text file with two columns: the average SpC and the missing fraction. Let's read that and plot the data.
# read in the data
temp <- read_tsv("ave_missing.txt")
# make a basic plot
ggplot(temp, aes(x = Ave, y = FracMissing)) +
geom_point() +
ggtitle("Missing Fraction versus Average SpC") +
labs(x = "Ave SpC", y = "Missing Fraction")
We need to expand the x-axis to see where the increase starts.
# expanded x-axis plot
ggplot(temp, aes(x = Ave, y = FracMissing)) +
geom_line() +
coord_cartesian(xlim = c(0, 8)) +
ggtitle("Missing Fraction versus Average SpC") +
labs( x = "Ave SpC", y = "Missing Fraction") +
geom_vline(xintercept = 2.5, linetype = "dotted")
The fraction of data with missing values starts to rise at an average SpC of about 5.0. The missing data really starts to rise at average SpC values below 2.0. We usually want to cast a wide net for differential expression candidates in discovery experiments. A cutoff of 2.5 seems like a reasonable compromise. We might get some false positives near the cutoff, but those will be easy to catch manually later.
The data exported in the edgeR_input.txt
file excluded contaminants, decoys, and any proteins with an average SpC of less than 2.5. We will read that data in and separate the accessions from the count data.
# read in the prepped data
paw_spc <- read_tsv("edgeR_input.txt")
# save accessions in vector and remove from data table
accession <- paw_spc$Accession
paw_spc <- select(paw_spc, -Accession)
head(paw_spc)
nrow(paw_spc)
The sample codes are HCEC for human choroidal endothelial cells and HREC for human retinal endothelial cells. The numbers are the 5 subject codes.
Depending a little on how you approach protein FDR, we have about 5000 protein IDs. After the 2.5 average SpC cutoff, we are down to 3454. Is this an issue? The answer is no. Counting things by the number of proteins can lead to erroneous conclusions. We have 10 samples and we can tally some things based on in how many samples were the proteins seen: all 10 out of the 10, any 9 out of the 10, etc. I did that in Excel and made a short table.
# read in short table
freq <- read_tsv("Fractions_SpC_ID.txt")
# make a bar plot to summarize the data
ggplot(freq, aes(InHowMany, Fraction, fill = Measure)) +
geom_bar(stat = "identity", color = "black", position = position_dodge()) +
geom_text(aes(label = Fraction), vjust = 1.6, color = "black",
position = position_dodge(0.9), size = 2.5) +
scale_x_continuous( breaks = 1:10) +
ggtitle("Fractions of total spectra counts or of total identifications") +
xlab("In How Many Samples") + ylab("Fraction (%)")
There are 2990 protein seen in all samples. That is about 60% of the identified proteins. However, those proteins account for 97% of all of the spectral counts. The 3454 proteins with an average SpC of 2.5 or greater account for 98.3% of the total spectral counts.
These separations were pretty extensive, so we are capturing most of the proteome in each sample. That is why the "bulk" of the spectral counts are associated with the proteins seen in all samples. The proteins that are not seen in each sample are lower abundant proteins (those can be numerous) that do not add up to a very large fraction of the "total" proteome.
This is a paired study design and we will use that in the statistical testing below. To get a sense of the data characteristics, we will compare the 5 choroidal samples to each other, and do the same for the 5 retinal samples. Multi-panel scatter plots are a nice method for exploratory data analysis. We will try linear and log scales. We have some zeros in the data, so we will add 1 for the log plots.
# function for simple normalization
SL_Norm <- function(df, color = NULL, plot = TRUE) {
# This makes each channel sum to the average grand total
# df - data frame of TMT intensities
# returns a new data frame with normalized values
# compute scaling factors to make colsums match the average sum
norm_facs <- mean(c(colSums(df))) / colSums(df)
cat("SL Factors:\n", sprintf("%-5s -> %f\n", colnames(df), norm_facs))
df_sl <- sweep(df, 2, norm_facs, FUN = "*")
# visualize results and return data frame
if(plot == TRUE) {
boxplot(log10(df_sl + 1), col = color, notch = TRUE, main = "SL Normalized data")
}
df_sl
}
# normalize the data before plotting
color <- c(rep('blue', 5), rep('red', 5))
paw_sl <- SL_Norm(paw_spc, color)
# shortcuts for the cell types
C <- 1:5
R <- 6:10
pairs.panels(paw_sl[C], main = "Choroidal")
pairs.panels(log2(paw_sl[C]+1), main = "Choroidal")
pairs.panels(paw_sl[R], main = "Retinal")
pairs.panels(log2(paw_sl[R]+1), main = "Retinal")
The linear plots are not too informative. The log plots indicate considerable sample-to-sample scatter. These experiments have about 40 SCX fractions run in 2-hour RP gradients. That is a lot of instrument time per sample. We will need pretty large differences in means to achieve statistical significance given the sample-to-sample variances.
We will use the counts and accessions that we created in R above from the main summary file and load data into some edgeR data structures. We will run the built-in TMM normalization (Ref-4) and see if the samples have any structure in a clustering plot.
4. Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.
# load the data into edgeR data structures
# group labels need to be factors
y <- DGEList(counts = paw_spc, genes = accession)
# run the TMM normalization (and library size corrections)
y <- calcNormFactors(y)
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
# computes the tmm normalized data from the DGEList object
# y - DGEList object
# returns a dataframe with normalized intensities
# compute grand total (library size) scalings
lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
# the TMM factors are library adjustment factors (so divide by them)
norm_facs <- lib_facs / y$samples$norm.factors
cat("Overall Factors (lib.size+TMM):\n", sprintf("%-5s -> %f\n",
colnames(y$counts), norm_facs))
# compute the normalized data as a new data frame
df_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
colnames(df_tmm) <- str_c(colnames(y$counts), "_tmm")
# visualize results and return data frame
if(plot == TRUE) {
boxplot(log10(df_tmm + 1), col = color, notch = TRUE, main = "TMM Normalized data")
}
df_tmm
}
paw_spc_tmm <- apply_tmm_factors(y, color)
# check clustering
plotMDS(y, col = color, main = "Retina vs Choroid")
The overall normalization factors (library size plus TMM factors) were all pretty close to 1.0. The centers of the boxplots are pretty well aligned. TMM is more of a center of the distribution matching method, so we might expect that medians of the distributions would end up reasonably well matched.
CV <- function(df) {
# Computes CVs of data frame rows
# df - data frame,
# returns vector of CVs (%)
ave <- rowMeans(df) # compute averages
sd <- apply(df, 1, sd) # compute standard deviations
cv <- 100 * sd / ave # compute CVs in percent (last thing gets returned)
}
labeled_boxplot <- function(df, ylim, title) {
# Makes a box plot with the median value labeled
# df - data frame with data to compute CVs of
# ylim - upper limit for y-axis
# title - plot title
cv = CV(df)
boxplot(cv, ylim = c(0, ylim), notch = TRUE, main = title)
text(x = 0.65, y = boxplot.stats(cv)$stats[3],
labels = round(boxplot.stats(cv)$stats[3], 1))
}
# see what effect TMM had on CV distributions
par(mfrow = c(2, 2))
labeled_boxplot(paw_spc[C], 150, "Choroid before")
labeled_boxplot(paw_spc[R], 150, "Retina before")
labeled_boxplot(paw_spc_tmm[C], 150, "Choroid after")
labeled_boxplot(paw_spc_tmm[R], 150, "Retina after")
par(mfrow = c(1, 1))
These low 30% CV for a technique like spectral counting for human subjects give some ballpark for comparison to newer methods.
We will do a basic exact test in edgeR comparing choroidal to retinal cells (ignoring the paired nature of the study). We have a modest number of DE candidates (121 down and 181 up).
group <- factor(c(rep("C", 5), rep("R", 5)))
yy <- DGEList(counts = paw_spc, group = group, genes = accession)
yy <- calcNormFactors(yy)
yy <- estimateDisp(yy)
et <- exactTest(yy)
topTags(et)$table
summary(decideTests(et, p.value = 0.10))
EdgeR can do more complicated paired study designs if we use the linear modeling options. The edgeR user's guide has several examples to use as templates. We will be mostly using Example 4.1 but with a switch to the quasi-likelihood versions of function instead of the likelihood ratio ones. The QL framework (example 4.4) is perhaps more conservative. These test options have some differences in the numbers of candidates and the magnitudes of p-values computed under different models.
The glm tests require experimental design matrices and contrasts. limma has some functions to help set those up. We will need samples categorized by donor and by cell type. The modeling will allow multiple tests to be made controlling for different factors. The paired test is basically comparing choroidal versus retinal controlling for donor.
# create the experimental design matrix
donor <- factor(rep(c(189, 191, 194, 195, 199), 2))
cell <- factor(c(rep("C", 5), rep("R", 5)))
# Example 4.1 in edgeR user's guide
design <- model.matrix(~donor+cell)
rownames(design) <- colnames(y)
design
# extimate the dispersion parameters and check
y <- estimateDisp(y, design, robust = TRUE)
y$common.dispersion
plotBCV(y, main = "Variance Trends")
# fit statistical models (design matrix already in y$design)
fit <- glmQLFit(y, design, robust = TRUE)
plotQLDisp(fit)
# if we do not specify a contrast, the default is the last column
# of the design matrix - a C versus R comparison
paired <- glmQLFTest(fit) # default comparison
# check test results
topTags(paired)$table
tt <- topTags(paired, n = Inf, sort.by = "none")$table
summary(decideTests(paired, p.value = 0.10))
We get quite a few more DE candidates using the paired study option and the glm modeling. In the publication, the likelihood ratio testing was used instead of the quasi-likelihood testing. That produced a few more candidates than we have here.
collect_results <- function(df, tt, x, xlab, y, ylab) {
# Computes new columns and extracts some columns to make results frame
# df - data in data.frame
# tt - top tags from edgeR test
# x - columns for first condition
# xlab - label for x
# y - columns for second condition
# ylab - label for y
# returns a new dataframe
# condition average vectors
ave_x <- rowMeans(df[x])
ave_y <- rowMeans(df[y])
# FC, direction, candidates
fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
direction <- ifelse(ave_y > ave_x, "up", "down")
candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
labels = c("high", "med", "low", "no"))
# make data frame
temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc,
PValue = tt$PValue, FDR = tt$FDR,
ave_x = ave_x, ave_y = ave_y,
direction = direction, candidate = candidate,
Acc = tt$genes))
# fix column headers for averages
names(temp)[names(temp) %in% c("ave_x", "ave_y")] <- str_c("ave_", c(xlab, ylab))
temp # return the data frame
}
# get the results summary
results <- collect_results(paw_spc_tmm, tt, C, "choroid", R, "retina")
pvalue_plots <- function(results, ylim, title) {
# Makes p-value distribution plots
# results - results data frame
# ylim - ymax for expanded view
# title - plot title
p_plot <- ggplot(results, aes(PValue)) +
geom_histogram(bins = 100, fill = "white", color = "black") +
geom_hline(yintercept = mean(hist(results$PValue, breaks = 100,
plot = FALSE)$counts[26:100]))
# we will need an expanded plot
p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + ggtitle("p-values expanded")
grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}
# check the p-value distribution
pvalue_plots(results, 100, "Retina vs Choroid - SpC")
We have a flat distribution at larger p-values and a spike at low p-values. We can see what the fold changes look like for the different levels of DE candidates and count candidates.
log2FC_plots <- function(results, range, title) {
# Makes faceted log2FC plots by candidate
# results - results data frame
# range - plus/minus log2 x-axis limits
# title - plot title
ggplot(results, aes(x = logFC, fill = candidate)) +
geom_histogram(binwidth=0.1, color = "black") +
facet_wrap(~candidate) +
ggtitle(title) +
coord_cartesian(xlim = c(-range, range))
}
# make log2FC plots
log2FC_plots(results, 4, "Faceted log2FC")
# see how many candidates are in each category
results %>% count(candidate)
We will make:
transform <- function(results, x, y) {
# Make data frame with some transformed columns
# results - results data frame
# x - columns for x condition
# y - columns for y condition
# return new data frame
df <- data.frame(log10((results[x] + results[y])/2),
log2(results[y] / results[x]),
results$candidate,
-log10(results$FDR))
colnames(df) <- c("A", "M", "candidate", "P")
df # return the data frame
}
MA_plots <- function(results, x, y, title) {
# makes MA-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# returns a list of plots
# uses transformed data
temp <- transform(results, x, y)
# 2-fold change lines
ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))
# make main MA plot
ma <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
scale_x_continuous("Ave_intensity") +
ggtitle(title) +
ma_lines
# make separate MA plots
ma_facet <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
scale_x_continuous("log10 Ave_intensity") +
ma_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
# make the plots visible
print(ma)
print(ma_facet)
}
# make MA plots
MA_plots(results, "ave_choroid", "ave_retina", "Choroid vs Retina")
The spectral count values have been maintained in their natural measurement scale. That means that they are actual counts that have a Poisson distribution. EdgeR models that correctly so that the large uncertainty in small count numbers does not result in large numbers of false positive DE candidates.
There are other commonly used transformations of spectral count data where this is not the case. Samples can be normalized to have all counts sum up to one by dividing each column by its total. That makes the measurement values be small fractions between 0.0 and 1.0. Taking logs of counts would also alter the natural measurement scale and change the mean-variance relationship. There can also be protein length correction (dividing by length or number of theoretical peptides) of counts to make relative abundances more like molar concentrations instead of weight percentages. Those have the effect of somewhat local scrambling of abundances and it disrupts the smooth dispersion trend making trended variance estimates less effective.
The bottom line is that just about anything you would do transformation-wise to spectral counts will hurt the data analysis. This is also likely true in other quantitative measures beside spectral counting. Determining a natural measurement scale is very important for data modeling. Do not arbitrarily transform data.
One more soap box moment. Often the distribution of all of the protein measures are examined and they typically have skewed distributions. Taking the logs of the measures can make the distributions appear more normally distributed (as taking logs does for just about anything). Arguments are made that this is appropriate when using a t-test. That is incorrect. The t-test is being applied to a single protein. It is the distribution of the measures of that particular protein that should be normally distributed, not the distribution of different proteins relative to each other in the whole dataset. We always have too few data points (the number of biological replicates) to say much of anything about whether individual protein measures are normally distributed. It can be a good thing to understand what a model is actually modeling.
scatter_plots <- function(results, x, y, title) {
# makes scatter-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# returns a list of plots
# 2-fold change lines
scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
scale_y_log10(),
scale_x_log10())
# make main scatter plot
scatter <- ggplot(results, aes_string(x, y)) +
geom_point(aes(color = candidate, shape = candidate)) +
ggtitle(title) +
scatter_lines
# make separate scatter plots
scatter_facet <- ggplot(results, aes_string(x, y)) +
geom_point(aes(color = candidate, shape = candidate)) +
scatter_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
# make the plots visible
print(scatter)
print(scatter_facet)
}
# make scatter plots
scatter_plots(results, "ave_choroid", "ave_retina", "Choroid vs Retina")
volcano_plot <- function(results, x, y, title) {
# makes a volcano plot
# results - a data frame with edgeR results
# x - string for the x-axis column
# y - string for y-axis column
# title - plot title string
# uses transformed data
temp <- transform(results, x, y)
# build the plot
ggplot(temp, aes(x = M, y = P)) +
geom_point(aes(color = candidate, shape = candidate)) +
xlab("log2 FC") +
ylab("-log10 FDR") +
ggtitle(str_c(title, " Volcano Plot"))
}
# make a volcano plot
volcano_plot(results, "ave_choroid", "ave_retina", "Choroid vs Retina")
# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
identifier <- str_split(accession, "\\|", simplify = TRUE)
identifier[,3]
}
set_plot_dimensions <- function(width_choice, height_choice) {
options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}
plot_candidates <- function(results, nleft, nright, show_these = c("high", "med", "low")) {
# results should have data first, then test results (two condition summary table)
# nleft, nright are number of data points in each condition
# top_tags is number of up and number of down top DE candidates to plot
# get desired candidates
proteins <- proteins %>%
filter(candidate %in% show_these) %>% arrange(candidate)
color = c(rep("red", nleft), rep("blue", nright))
for (row_num in 1:nrow(proteins)) {
row <- proteins[row_num, ]
vec <- as.vector(unlist(row[1:(nleft + nright)]))
names(vec) <- colnames(row[1:(nleft + nright)])
title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2),
", p-val: ", scientific(row$FDR, digits = 3), ", ", row$candidate)
barplot(vec, col = color, main = title)
}
}
plot_top_tags <- function(results, nleft, nright, top_tags) {
# results should have data first, then test results (two condition summary table)
# nleft, nright are number of data points in each condition
# top_tags is number of up and number of down top DE candidates to plot
# get top ipregulated
up <- results %>%
filter(logFC >= 0) %>%
arrange(FDR)
up <- up[1:top_tags, ]
# get top down regulated
down <- results %>%
filter(logFC < 0) %>%
arrange(FDR)
down <- down[1:top_tags, ]
# pack them
proteins <- rbind(up, down)
color = c(rep("red", nleft), rep("blue", nright))
for (row_num in 1:nrow(proteins)) {
row <- proteins[row_num, ]
vec <- as.vector(unlist(row[1:(nleft + nright)]))
names(vec) <- colnames(row[1:(nleft + nright)])
title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2),
", p-val: ", scientific(row$FDR, digits = 3),
", FC: ", round(row$FC, digits = 1))
barplot(vec, col = color, main = title)
}
}
# plot the top 25 up and 20 down proteins
set_plot_dimensions(6, 3.5)
plot_top_tags(results, 5, 5, 25)
set_plot_dimensions(7, 7)
Save results data frame to a disk file and log the R session. Generating the DE candidates is one of the earlier steps in the analysis and interpretation. Adding annotations, functional enrichment analysis, pathway analysis, and more are needed to extract the biological meaning from the data. The results table should be easy to add back to the main proteomics table before the next analysis steps.
# write results
write.table(results, "edgeR_results.txt", sep = "\t", row.names = FALSE, na = " ")
# log the session
sessionInfo()