This is publicly available data from this publication:
D’Angelo, G., Chaerkady, R., Yu, W., Hizal, D.B., Hess, S., Zhao, W., Lekstrom, K., Guo, X., White, W.I., Roskos, L. and Bowen, M.A., 2017. Statistical models for the analysis of isobaric tags multiplexed quantitative proteomics. Journal of proteome research, 16(9), pp.3124-3136.
This notebook looks at the data from a commercial serum systemic lupus erythematosus (SLE) data set with 10-plex TMT labeling analyzed on a Q-Exactive platform. An Agilent MARS-6 immuno-depletion column was used to remove albumin, IgG, IgA, transferrin, haptoglobin, and antitrypsin. Proteins were digested with trypsin and labeled according to manufacturer's protocol. There were 10 normal controls and 10 SLE cases. Samples were split between plexes in a balanced fashion where the first 5 channels were controls and the second 5 channels were the SLE cases. There were no common reference channels.
Each plex was separated into 24 fractions (not clear by what technique) followed by 90-min RP gradients on a Dionex Ultimate 3000 RS system. The Q-Exactive used a top-15 method with MS1 resolution of 70K and MS2 resolution of 35K.
Peptides and proteins were identified using Comet and the PAW pipeline. A wider 1.25 Da parent ion mass tolerance was used; TMT labels and alkylated cysteine were specified as static modifications; oxidation of methionine was specified as a variable modification; semi-tryptic enzyme specificity was used; and a canonical UniProt reference human protein database was searched. The fragment ion tolerance was set for high-resolution MS2 data: 0.02 Da bins with a 0.0 offset. Confident peptide identifications were obtained using accurate mass conditional score histograms and the target/decoy method, and protein inference used basic parsimony principles.
The PAW pipeline was used to infer proteins, perform homologous protein grouping, establish peptide uniqueness relative to the inferred proteins, and sum unique PSM reporter ions into protein intensity totals. The criterion for protein detection was at least two fully tryptic, distinct peptide sequences per protein per plex. Semi-tryptic peptides contributed PSMs to the proteins that met that criterion.
Results tables were filtered to remove common contaminants (and the 6 MARS proteins), and the protein intensity tables were saved as tab-delimited text files for reading by the scripts below. Normalizations and statistical testing were performed using the Bioconductor package edgeR as detailed in the steps below. A Jupyter notebook with an R kernel was used to execute R commands and visualize the results.
Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.
Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.
Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.
Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.
# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
The pandas Python script that does the IRS normalization arranges the tab-delimited table so that importing into R is straightforward. The dplyr package from the tidyverse (https://r4ds.had.co.nz/) makes it easy to separate the data by biological condition, and to look at the data before and after the Internal Reference Scaling (IRS) normalization procedure. We can do some sanity checks with the pooled internal standard channels. We know that those should be highly similar within each TMT experiment and that IRS should make them also very similar between TMT experiments.
The internal reference scaling method outlined in Plubell et al. 2017 uses separate pooled internal reference standards as measuring sticks to put multiple isobaric plexes onto common reporter ion intensity scales (see these repositories for more details: general IRS concepts and validation of IRS).
This study design did not have any common reference channels, but the study design was balanced with 5 control and 5 SLE samples in each plex. The N3 sample in the first plex did not label well. It was about 1/8 of the average total signal of the other channels. That channel was excluded before the IRS script was run. We can use the average intensities across all channels in each plex as a mock reference channel.
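The IRS arithmetic itself happens in the pandas script, but the idea is compact enough to sketch in base R. Here the mock reference for each plex is the per-protein average of its channels, and each plex is scaled row-wise so its reference lands on the geometric mean of the references (the matrices and sizes below are made up for illustration):

```r
# toy sketch of IRS with mock (channel-average) references; this is an
# illustration, not the pandas script (matrices and sizes are made up)
set.seed(1)
plex1 <- matrix(rlnorm(20, meanlog = 10), nrow = 5)  # 5 proteins x 4 channels
plex2 <- plex1 * 3.7                                 # same proteins, shifted scale

# mock reference = average of all channels within each plex (per protein)
ref1 <- rowMeans(plex1)
ref2 <- rowMeans(plex2)

# the geometric mean of the references defines the common intensity scale
geo_mean <- exp((log(ref1) + log(ref2)) / 2)

# scale each plex (row-wise) so its reference lands on the common scale
irs1 <- plex1 * (geo_mean / ref1)
irs2 <- plex2 * (geo_mean / ref2)

# after IRS, the mock references agree between the two plexes
all.equal(rowMeans(irs1), rowMeans(irs2))
```

Because the scaling factors are per-protein vectors, IRS corrects each protein's between-plex offset independently, which is what makes it different from a single global scaling.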
# get the default plot width and height
width <- options()$repr.plot.width
height <- options()$repr.plot.height
# load the IRS-normalized data and check the table
data_import <- read_tsv("ave_labeled_grouped_protein_summary_TMT_9_IRS_normalized.txt", guess_max = 596)
# the "Filter" column flags contams and decoys
# the "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the prepped table from pandas is sorted so these are the upper rows
data_all <- filter(data_import, is.na(Filter), is.na(Missing))
# save gene names for edgeR so we can double check that results line up
accessions <- data_all$Accession
# see how many rows in the table
nrow(data_all)
The MARS-6 depletion was quite good with the sum of the depleted proteins accounting for only about 1% of the total signal. There were 24 fractions per plex (details on separation were missing from the publication), so a reasonably deep proteome might be expected. We end up with 467 total proteins identified by at least two peptides. This really is the way experiments turn out for serum/plasma. It is really, really hard to get down to the putative biomarker levels.
The publication used tryptic enzyme cleavage. Secreted proteins always have signal peptides, so the N-terminal region of the protein never matches the database sequence. There are also other protein processing cleavages for protein activation in some cases. There can also be many endogenous proteases in bio fluids. Semi-tryptic cleavage (or even non-specific cleavage) should always be used for saliva, serum, and other biological fluids.
I did both fully tryptic and semi-tryptic searches with Comet. There was a 14% gain in number of filtered PSMs with the semi-tryptic search. This was about half of what I would have guessed based on my previous experience, but it is still pretty significant.
Protein database choice and search parameters really should be considered as variables in most proteomics analyses. This is particularly important when systems, sample preps, and organisms have not been encountered previously. Proteomics is not a one-size-fits-all situation. This can be annoying at times; however, it does keep life interesting.
# we want to get the SL normed columns, and subsetted by condition
sl_all <- data_all %>%
select(starts_with("SLNorm"))
sl_N <- sl_all %>% select(contains("_N"))
sl_SLE <- sl_all %>% select(contains("_SLE"))
# and the IRS normed columns by condition
irs_all <- data_all %>%
select(starts_with("IRSNorm"))
irs_N <- irs_all %>% select(contains("_N"))
irs_SLE <- irs_all %>% select(contains("_SLE"))
# and collect the pooled channels before and after IRS
sl_pool <- sl_all %>% select(contains("pool"))
irs_pool <- irs_all %>% select(contains("pool"))
head(sl_N)
# multi-panel scatter plot grids from the psych package
pairs.panels(log2(sl_pool), lm = TRUE, main = "Pooled Std before IRS")
pairs.panels(log2(irs_pool), lm = TRUE, main = "Pooled Std after IRS")
The random sampling of MS2 scans is why the before IRS scatter plot is awful. We are using the average of all channels in each plex for a mock pooled channel in the IRS method. We can use the mock standard channel to check the IRS correction.
Since we have a lot of samples, we will randomly look at a few.
# multi-panel scatter plot grids
N_sample <- sample(1:9, 4)
pairs.panels(log2(sl_N[N_sample]), lm = TRUE, main = "Normal before IRS (random 4)")
pairs.panels(log2(irs_N[N_sample]), lm = TRUE, main = "Normal after IRS (same 4)")
# multi-panel scatter plot grids
SLE_sample <- sample(1:10, 4)
pairs.panels(log2(sl_SLE[SLE_sample]), lm = TRUE, main = "SLE before IRS (random 4)")
pairs.panels(log2(irs_SLE[SLE_sample]), lm = TRUE, main = "SLE after IRS (same 4)")
A multi-dimensional scaling plot (similar to hierarchical clustering or PCA plots) is a good way to check the data. We should expect the samples to group by condition. We can do the clustering before and after IRS to verify that we are able to recover the true biological differences between groups. We need to load the data into some edgeR data structures and make a couple of function calls to generate the cluster plots.
# get the biological sample data into a DGEList object
group = c(rep('N', 9), rep('SLE', 10))
y_sl <- DGEList(counts = cbind(sl_N, sl_SLE), group = group, genes = accessions)
y_irs <- DGEList(counts = cbind(irs_N, irs_SLE), group = group, genes = accessions)
# run TMM normalization (also includes a library size factor)
y_sl <- calcNormFactors(y_sl)
y_irs <- calcNormFactors(y_irs)
# set some colors by condition
colors = c(rep('red', 9), rep('blue', 10))
# check the clustering
plotMDS(y_sl, col = colors, main = "Before IRS: all samples")
plotMDS(y_irs, col = colors, main = "After IRS: all samples")
The top MDS plot shows the first TMT experiment on the left and the second experiment on the right. After IRS (bottom plot), the samples cluster by biological condition.
We need to get the data into some edgeR objects to work with. We can call the calcNormFactors function to perform library size and trimmed mean of M-values (TMM) normalizations; edgeR combines these two normalizations into one function call. The library size corrections are somewhat redundant here because that scaling was already done as part of the IRS procedure. The TMM normalization was designed for 'omics data, and it makes sense to apply it after the IRS correction. TMM corrects for compositional differences between biological conditions: if some abundant proteins change expression, then (because equal total amounts of protein are loaded per sample) all of the other protein abundances must shift to compensate. TMM normalization will detect and correct these situations.
Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.
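The compositional argument above can be sketched in base R. This is a simplification of the TMM idea only; edgeR's calcNormFactors also trims on A-values and uses precision weights, and all numbers here are made up:

```r
# toy sketch of the trimmed-mean-of-M-values idea; edgeR's calcNormFactors
# also trims on A-values and uses precision weights (all numbers made up)
set.seed(2)
ref  <- rlnorm(1000, meanlog = 8)     # reference sample intensities
test <- ref                           # start with an identical composition
test[1:5] <- test[1:5] * 50           # a few proteins become much more abundant
test <- test / sum(test) * sum(ref)   # equal total signal (equal loading)

# log ratios (M-values) of test vs reference after total-intensity matching
M <- log2(test / ref)

# the unchanged majority share one offset; a trimmed mean ignores the
# extreme changers and recovers that offset
tmm_offset <- mean(M, trim = 0.3)
tmm_factor <- 2^tmm_offset            # divide test by this to fix the bulk
```

Total-intensity scaling alone leaves the 995 unchanged proteins systematically low; dividing `test` by `tmm_factor` puts them back on the reference scale, which is exactly what a simple library-size factor cannot do.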
# we do not want the technical replicates in the mix for dispersion estimates
irs <- cbind(irs_N, irs_SLE)
# load a new DGEList object (need to update the groups)
y <- DGEList(counts = irs, group = group, genes = accessions) # group was set above
y <- calcNormFactors(y)
# see what the normalization factors look like
y$samples
Unless we have biological conditions with large compositional differences, we expect the library sizes and normalization factors after the calcNormFactors call to be around 1.0.
The deliverables in a proteomics experiment these days are all of the underlying data that generated any and all conclusions. Raw files are needed to allow independent confirmational analyses. The actual results for a particular analysis can be a bit simpler. Some sort of PSM report is needed to justify the accepted PSMs (PSMs below defined cutoffs can be excluded). That is really low-level intermediate information - a dot on an "i" or a cross on a "t". Peptide summaries start to have some biological utility and are important to generate. However, the meat of the results lies in the protein-level summary. There are important proteome-specific details, such as numbers of PSMs, numbers of peptides, sequence coverage, peptide set groupings, etc. There are annotation details such as accessions and descriptions, gene symbols, sequence lengths, molecular weights, etc.
When quantitative measurements are included, we need the sample keys, quantitative values, normalized quantitative values, statistical testing results, and maybe some qualitative assessment information (significant or not, etc.). It is nice to have all of the protein details in one table. Part of the notebook collects useful data from the normalizations and statistical testing so that a table can be written that can be added back to extend the proteomics protein summary table into a more complete experimental summary.
The first step in that process is to compute the TMM normalized intensities in their natural scale. EdgeR uses the normalization factors in its modeling but does not compute a table of normalized values.
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the TMM-normalized data from the DGEList object
    # y - DGEList object
    # returns a data frame with the normalized intensities

    # compute the grand total (library size) scalings
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size

    # the TMM factors are library adjustment factors (so divide by them)
    norm_facs <- lib_facs / y$samples$norm.factors
    cat("Overall Factors (lib.size+TMM):\n", sprintf("%-5s -> %f\n",
        colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")

    # visualize the results and return the data frame
    if (plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}
# compute the normalized data as a new data frame
irs_tmm <- apply_tmm_factors(y, colors)
Box plots for well normalized data should be similar in size and the medians should align with each other. There can be problems with the data that may not be apparent with boxplots (some data distortions can average out and not appear different with boxplots). Regardless, we should have boxplots with good median alignment. Good boxplot behavior is a necessary but not sufficient requirement of proper normalization.
The boxplots here look worse than we typically see from cell lysates where we have much larger proteomes. Fewer proteins with such a wide range of abundance is something that we need to keep in mind for serum samples.
We will use ggplot2 this time and flip the graph on its side, so we can read the labels easier. We need tidy (long form) data for ggplot:
long_results <- gather(irs_tmm, key = "sample", value = "intensity") %>%
    mutate(log_int = log10(intensity)) %>%
    extract(sample, into = 'group', ".*?_(.*?)\\d", remove = FALSE)
head(long_results)
ggplot(long_results, aes(x = sample, y = log_int, fill = group)) +
    geom_boxplot(notch = TRUE) +
    coord_flip() +
    ggtitle("edgeR normalized data")
Boxplots and density plots are closely related. We can easily see the smoothed density distributions for each sample.
ggplot(long_results, aes(x = log_int, color = sample)) +
    geom_density() +
    guides(color = FALSE) +
    ggtitle("edgeR normalized data (legend omitted: too busy)")
The distributions of Coefficients of Variation (CVs) are another way to get an idea of how individual proteins are behaving. This seems to be an effective way to assess proper normalization in these experiments. We will compute CV distributions for each of the two biological conditions.
# we can compare CVs before and after IRS
sl <- cbind(sl_N, sl_SLE)
# save column indexes for different conditions (indexes to data_raw frame)
# these make things easier (and reduce the chance for errors)
N <- 1:9
SLE <- 10:19
# create a CV computing function
CV <- function(df) {
    # computes the percent coefficient of variation for each row
    ave <- rowMeans(df)
    sd <- apply(df, 1, sd)
    100 * sd / ave
}
# put CVs in data frames to simplify plots and summaries
cv_frame <- data.frame(N_sl = CV(sl[N]), N_final = CV(irs_tmm[N]),
SLE_sl = CV(sl[SLE]), SLE_final = CV(irs_tmm[SLE]))
# see what the median CV values are
medians <- apply(cv_frame, 2, FUN = median)
print("Median CVs by condition, before/after IRS (%)")
round(medians, 1)
We see some big improvements in median CVs after IRS and TMM normalizations. We can also check the full distributions.
# see what the CV distributions look like
# need long form for ggplot
long_cv <- gather(cv_frame, key = "condition", value = "cv") %>%
    extract(condition, into = 'group', "(.*?)_+", remove = FALSE)
# traditional boxplots
cv_plot <- ggplot(long_cv, aes(x = condition, y = cv, fill = group)) +
    geom_boxplot(notch = TRUE) +
    ggtitle("CV distributions")
# vertical orientation
cv_plot
# horizontal orientation
#cv_plot + coord_flip()
# density plots
ggplot(long_cv, aes(x = cv, color = condition)) +
    geom_density() +
    coord_cartesian(xlim = c(0, 150)) +
    ggtitle("CV distributions")
The CVs improve with the IRS and TMM normalizations. The median CVs get smaller, and the full distributions are much improved, with considerably smaller inter-quartile ranges.
We have used scatter plots, correlation coefficients, clustering, and boxplots to verify that the IRS procedure and TMM normalization behaved as expected. None of the biological samples needed any excessive normalization factors or appeared as an outlier in the cluster views (the N3 sample was already removed). Everything looks ready for statistical testing with edgeR.
One of the most powerful features of edgeR (and limma) is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 19 samples across the two conditions to improve the variance estimates and reduce false positive differential expression (DE) candidates. EdgeR's estimateDisp function does all of this, and a visualization function checks the result.
We loaded the IRS data into DGEList object "y" a few cells above and did the normalization step. We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test will take two conditions and compare them using the normalization factors and dispersion estimates saved in "y".
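As a toy illustration of why sharing information across proteins helps (edgeR actually fits negative binomial dispersions with empirical Bayes moderation; this base-R caricature uses normal variances and made-up numbers):

```r
# caricature of variance moderation: shrink noisy per-protein variance
# estimates toward the typical variance seen across all proteins
# (this is NOT edgeR's negative binomial model; numbers are made up)
set.seed(3)
n_prot <- 500
small_reps <- replicate(3, rnorm(n_prot, sd = 0.25))   # only 3 replicates each

raw_var <- apply(small_reps, 1, var)   # very noisy with n = 3
prior_var <- mean(raw_var)             # information shared across proteins
d0 <- 10                               # prior weight (made up)
resid_df <- 2                          # residual degrees of freedom (n - 1)
moderated_var <- (d0 * prior_var + resid_df * raw_var) / (d0 + resid_df)

# the moderated estimates scatter far less around the common truth
c(raw = sd(raw_var), moderated = sd(moderated_var))
```

The weighted average pulls each noisy 3-replicate variance toward the trend, which is the basic reason moderated tests have fewer false positives than per-protein t-tests.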
# compute dispersions and plot BCV
y <- estimateDisp(y)
plotBCV(y, main = "BCV plot of IRS normed, TMM normed, all 19")
467 proteins is a much smaller number than the 2000 to 5000 we typically work with. There is also a lot of variability in the protein variances.
We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The decideTestsDGE call will tell us how many up- and down-regulated candidates we have at an FDR of 0.10. The topTags call does the Benjamini-Hochberg multiple testing corrections. We save the test results in tt. We use a handy mean-difference (MD) plotting function from limma to visualize the DE candidates, and then check the p-value distribution.
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("N", "SLE"))
# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))
# the topTags function adds the BH FDR values to an exactTest data frame
# make sure we do not change the row order (the sort.by parameter)!
topTags(et, n = 31)
tt <- topTags(et, n = Inf, sort.by = "none")
tt <- tt$table # tt is a list. We just need the "table" data frame
# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black")
# check the p-value distribution
ggplot(tt, aes(PValue)) +
    geom_histogram(bins = 100, fill = "white", color = "black") +
    geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100,
                                      plot = FALSE)$counts[26:100])) +
    ggtitle("N vs SLE PValue distribution")
The number of candidates is modest, and more are over-expressed in SLE, as can be seen in the MA plot. The p-value distribution has a nice flat distribution for larger p-values (from non-DE candidates) and a sharper spike at small p-values from true DE candidates.
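The flat-plus-spike shape can be turned into a rough count of true positives: histogram bins above the flat baseline are the excess over the null expectation. A self-contained sketch with simulated p-values (the mixture proportions are made up; in the notebook the same idea applies to tt$PValue):

```r
# rough count of true positives from a p-value histogram (simulated data;
# mixture proportions are made up for illustration)
set.seed(4)
null_p <- runif(450)              # non-DE proteins: uniform p-values
de_p   <- rbeta(50, 0.5, 10)      # DE proteins: p-values piled near zero
pvals  <- c(null_p, de_p)

counts <- hist(pvals, breaks = 100, plot = FALSE)$counts
baseline <- mean(counts[26:100])                  # flat region estimates the null level
excess <- round(sum(pmax(counts - baseline, 0)))  # bins above baseline ~ true positives
excess
```

This baseline is the same quantity drawn as the horizontal line on the histogram above; summing the area above it gives a ballpark for how many proteins are truly changing.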
The D'Angelo publication evaluated statistical models on this data. They looked at generalized linear models, mixed models, and limma (a close cousin to edgeR). Their data analysis pipeline was Mascot/Percolator in Proteome Discoverer 1.4. That includes single-peptide-per-protein IDs; the PAW processing requires 2 peptides per protein. They used fully tryptic cleavage for secreted proteins, where semi-tryptic would have been more appropriate. There were a lot of keratins and the families of the MARS-6 proteins that needed to be excluded from the quantitative analysis. All of this makes comparing protein identification numbers difficult. The numbers of DE candidates at a 5% FDR for four statistical models were listed in Table 6. They had a range of candidates from 10 to 13 (limma had 12). We have 21 at the 5% FDR cut. This is double the number of candidates with a more strict protein ID criterion.
We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data).
# get the averages within each condition
# results already has the normalized data in its left columns
tt$ave_N <- rowMeans(irs_tmm[N])
tt$ave_SLE <- rowMeans(irs_tmm[SLE])
# add the candidate status column
tt <- tt %>%
    mutate(candidate = cut(FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
                           labels = c("high", "med", "low", "no")))
tt %>% count(candidate) # count candidates
ggplot(tt, aes(x = logFC, fill = candidate)) +
    geom_histogram(binwidth = 0.1, color = "black") +
    facet_wrap(~candidate) +
    coord_cartesian(xlim = c(-2, 2)) +
    ggtitle("N vs SLE logFC distributions by candidate")
# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # computes new columns and extracts some columns to make a results frame
    # df - data in a data frame
    # tt - top tags table from the edgeR test
    # x - columns for the first condition
    # xlab - label for x
    # y - columns for the second condition
    # ylab - label for y
    # returns a new data frame

    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])

    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
                     labels = c("high", "med", "low", "no"))

    # make the data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc,
                                          PValue = tt$PValue, FDR = tt$FDR,
                                          ave_x = ave_x, ave_y = ave_y,
                                          direction = direction, candidate = candidate,
                                          Acc = tt$genes))

    # fix the column headers for the averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")] <- str_c("ave_", c(xlab, ylab))

    temp # return the data frame
}
# get the results
results <- collect_results(irs_tmm, tt, N, "N", SLE, "SLE")
We will make some helper functions to generate a series of plots. This can be handy if there are multiple comparisons to do in larger study designs. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color.
transform <- function(results, x, y) {
    # makes a data frame with some transformed columns
    # results - results data frame
    # x - column for the x condition
    # y - column for the y condition
    # returns a new data frame
    df <- data.frame(log10((results[x] + results[y]) / 2),
                     log2(results[y] / results[x]),
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    df # return the data frame
}
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
    # results - data frame with edgeR results and some condition average columns
    # x - string for the x-axis column
    # y - string for the y-axis column
    # title - title string to use in the plots
    # returns a list of plots

    # uses transformed data
    temp <- transform(results, x, y)

    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make the main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ggtitle(title) +
        ma_lines

    # make the separate (faceted) MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
    # results - data frame with edgeR results and some condition average columns
    # x - string for the x-axis column
    # y - string for the y-axis column
    # title - title string to use in the plots
    # returns a list of plots

    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make the main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) +
        scatter_lines

    # make the separate (faceted) scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
    # results - a data frame with edgeR results
    # x - string for the x-axis column
    # y - string for the y-axis column
    # title - plot title string

    # uses transformed data
    temp <- transform(results, x, y)

    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}
# make the DE plots
MA_plots(results, "ave_N", "ave_SLE", "Normal vs SLE")
scatter_plots(results, "ave_N", "ave_SLE", "Normal vs SLE")
volcano_plot(results, "ave_N", "ave_SLE", "Normal vs SLE")
The fold-changes for the candidates are not particularly large. Many of the statistically significant proteins have less than 2-fold changes. This could be due to using MS2 reporter ions instead of the newer SPS MS3 method. More of the larger fold-change candidates are over-expressed in SLE. The candidates are spread out over the five decades of protein intensities.
Note: Protein total reporter ion intensity is well correlated with protein total MS1 feature intensities. In other words, it is a good relative protein abundance measure.
The normal data points are in red and the SLE samples are in blue. The protein identifier, the average intensity, the individual test p-value (not the FDR), and the traditional fold-change are used as plot labels.
# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accession
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    identifier[, 3]
}
set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width = width_choice, repr.plot.height = height_choice)
}
plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two-condition summary table)
    # nleft, nright are the numbers of data points in each condition
    # top_tags is the number of up and number of down top DE candidates to plot

    # get the top up-regulated
    up <- results %>%
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]

    # get the top down-regulated
    down <- results %>%
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]

    # pack them into one data frame
    proteins <- rbind(up, down)
    color <- c(rep("red", nleft), rep("blue", nright))

    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2),
                       ", p-val: ", scientific(row$PValue, digits = 3),
                       ", FC: ", round(row$FC, digits = 1))
        barplot(vec, col = color, main = title)
    }
}
# set plot size, make plots, reset plot size
set_plot_dimensions(7, 4)
plot_top_tags(results, length(N), length(SLE), 10)
set_plot_dimensions(width, height)
I have no idea. I leave that to the biological experts. My goal is to create a framework to generate DE candidates with high confidence and full data transparency. It is clear from the above bar plots that some candidate proteins have more subject-to-subject biological variability than do other candidates. I do not have enough experience (or any at all) with the SLE disease, so I have no idea what characteristics define interesting markers. I do have some confidence that the measurement data has been properly processed and that the results are valid. I feel that I have met my data analysis professional obligation of having done no harm to the data.
Write the results data frame to a TSV file:
write.table(results, "IRS_R_ave_results.txt", sep = "\t",
            row.names = FALSE, na = " ")
sessionInfo()