Analysis of single-plex TMT experiments

MaxQuant

Phil Wilmarth, OHSU PSR Core, May 2019

Overview and objectives

Many common ways that isobaric labeling data are analyzed (collapsing measurements into ratios, for example) hinder statistical testing and data visualization. This notebook will analyze data from (Ref-1) to demonstrate a two-condition comparison (3 controls and 4 treatments) done with 10-plex TMT reagents using the SPS MS3 method (Ref-2) on a Thermo Fusion instrument. The notebook will show how to:

  • load the prepped results table and check the data
  • normalize the data and visualize the result
  • compare the two conditions using edgeR
  • visualize the differential expression candidates
  • compare reporter ion sums to MS1 intensities
  • compare MQ reporter ion values to reporter ion peak heights
  • export the statistical test results in a useful format
  • do a quick comparison to a two-sample t-test

Experiment background and MaxQuant data processing

The KUR1502 data are from mouse bone marrow cell cultures: the "media" samples are controls (one of the 4 did not label, leaving 3) and the "exosome" samples (4) are cells dosed with leukemia exosomes. 10-plex TMT was used on a Thermo Fusion using the SPS MS3 method. The data were processed using MaxQuant (Ref-3), version 1.6.5.0.

Most parameters were kept at default values. The type of experiment was "Reporter ion MS3" and the reporter mass tolerance was kept at the default of 0.003 Da. MaxQuant has precursor ion purity filters for MS2 reporter ion experiments but not for MS3 reporter ion experiments. The FASTA protein database was the same one used in the Comet/PAW processing. It is not clear exactly what the reporter ion "intensities" are in MaxQuant; most quantities called "intensities" in MaxQuant appear to be peak heights.

There are a couple of other significant differences in how MaxQuant behaves compared to the PAW pipeline. MaxQuant uses a protein ranking function and decoy protein matches to implement an ad hoc protein FDR filter. Single peptide per protein identifications are allowed. How correct and incorrect PSMs relate to score thresholds and how correct and incorrect proteins meet evidence requirements for reporting are very different processes. As such, a simplistic application of the target/decoy method from PSM identification is not really valid for proteins. The PAW pipeline tracks decoy protein identifications for protein FDR estimates, but does not do any explicit error control at the protein level. If a resulting protein FDR is deemed too high, the remedy is to filter PSMs more strictly to reduce the number of incorrect peptide sequences. The PAW pipeline uses a minimum protein reporting criterion of two peptides per protein.

Treatment of shared peptides is also different. MaxQuant uses an all-or-nothing razor peptide method. For sets of proteins linked by shared peptides, one protein (or group) will have the strongest evidence (what about ties?) and gets all of the shared peptides. The shared peptides are excluded from the proteins with less evidence. The PAW pipeline instead groups together proteins that have mostly shared peptides in common. The combined protein "family" gets all of the peptides from the individual family members.

Many housekeeping genes can have almost all of their peptides shared, with only sparse unique peptide evidence (actins, tubulins, etc.). In the razor peptide approach, one family member gets assigned the extensive shared peptide information, while the other family members only get the sparse unique evidence. These sparse family members can be problematic: their data are unreliable and they can appear to be at extremely different relative expression levels than other family members. The magnitude of these issues depends on underlying genomic complexity, choice of protein database, and sample composition. Grouping of homologous proteins is a more reliable and stable approach.

1. Huan, J., Hornick, N.I., Goloviznina, N.A., Kamimae-Lanning, A.N., David, L.L., Wilmarth, P.A., Mori, T., Chevillet, J.R., Narla, A., Roberts Jr, C.T. and Loriaux, M.M., 2015. Coordinate regulation of residual bone marrow function by paracrine trafficking of AML exosomes. Leukemia, 29(12), p.2285.

2. McAlister, G.C., Nusinow, D.P., Jedrychowski, M.P., Wühr, M., Huttlin, E.L., Erickson, B.K., Rad, R., Haas, W. and Gygi, S.P., 2014. MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes. Analytical chemistry, 86(14), pp.7150-7158.

3. Cox, J. and Mann, M., 2008. MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature biotechnology, 26(12), p.1367.


Load libraries and read in the data file

In [32]:
# load libraries
library("tidyverse")
library("psych")
library("gridExtra")
library("scales")
library("limma") 
library("edgeR") 

# read the grouped protein summary file
MQ_raw <- read_tsv("KUR1502_data-export.txt")
Parsed with column specification:
cols(
  Accession = col_character(),
  Media_2.1 = col_double(),
  Media_2.2 = col_double(),
  Media_3.2 = col_double(),
  Exo_2.1 = col_double(),
  Exo_2.2 = col_double(),
  Exo_3.1 = col_double(),
  Exo_3.2 = col_double(),
  Intensity = col_double(),
  iBAQ = col_double()
)

Extract the data tables

In [33]:
# separate accessions from the data
accession <- MQ_raw$Accession

# get MS1 quantities
intensities <- MQ_raw$Intensity
ibaqs <- MQ_raw$iBAQ

# get the reporter ion intensities
MQ_tmt <- MQ_raw %>% select(-Accession, -Intensity, -iBAQ)

head(MQ_tmt)
nrow(MQ_tmt)
Media_2.1  Media_2.2  Media_3.2  Exo_2.1  Exo_2.2  Exo_3.1  Exo_3.2
  1081500    1245400    1069200   990220   848080  1280900  1568400
   730260     907010     743560   759690   624090   896810  1048800
   687340     856020     685910   554000   435720   756950   863480
   571540     713260     598050   610810   474850   755830   850350
   624970     768790     563760   588560   500550   670370   722580
   872020    1093500     672060   450790   324490   410460   472880
4878

Check zero replacement situation

During data preparation in Excel, a minimum reporter ion intensity sum of 1000 was required. Missing data are mostly associated with low abundance, so a low abundance cutoff removes most missing values. A replacement value of 20 was used for the remaining zeros after filtering. The smallest non-zero data points were around 50. We can see how many proteins had different numbers of zero replacements.
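For reference, a minimal sketch of what that Excel prep could look like if done in R instead is below. The data frame name "raw_tmt" is hypothetical (the unfiltered reporter ion columns); the threshold and replacement values are the ones described above.

# sketch of the low-intensity filter and zero replacement done in R instead of
# Excel (assumes a hypothetical plain data frame "raw_tmt" of reporter intensities)
prep_tmt <- function(raw_tmt, min_sum = 1000, replacement = 20) {
    keep <- rowSums(raw_tmt) >= min_sum       # drop very low intensity proteins
    filtered <- raw_tmt[keep, ]
    filtered[filtered == 0] <- replacement    # replace any remaining zeros
    filtered
}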

In [34]:
counts <- 1:nrow(MQ_tmt) # create vector of appropriate length
for(i in 1:nrow(MQ_tmt)){
    # TRUE will be coerced to 1 for the summing
    counts[i] <- sum(MQ_tmt[i, ] == 20)
}
table(counts) # create the summary
counts
   0    1    2    3    4    5 
4746   79   28   19    5    1 

We have more missing data with MaxQuant

The Comet/PAW results had 68 proteins with some missing values; MaxQuant has 132. There are 53 proteins with more than one missing value (Comet/PAW had 17). The vast majority of the proteins have no missing values or just one. We can keep an eye out for those in the final results sheet after the analysis. We really only need to worry about them if they get flagged as significant differential expression (DE) candidates.
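Since we already have the counts vector from above, a quick way to flag those proteins now (so they are easy to find in the final table later) might be:

# flag proteins with more than one zero replacement so they can be checked in
# the final results table (uses the counts vector computed above)
flagged_acc <- accession[counts > 1]
length(flagged_acc) # 53 proteins for this data
head(flagged_acc)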


What do the unnormalized data distributions look like?

We will use box plots, a common way to summarize distributions, to visualize the log base 10 of the intensities. The 3 media samples are shown in red and the 4 exosome-dosed samples in blue.

In [35]:
# let's see what the starting data look like
color = c(rep("red", 3), rep("blue", 4))
boxplot(log10(MQ_tmt), col = color, notch = TRUE, main = "Starting MaxQuant data")

Do we need to do some data normalization?

The boxplots are not very well horizontally aligned, so a global scaling seems in order. Each channel is supposed to start from the same total amount of labeled protein digest. We would, therefore, expect the total sum of reporter ion intensities to be the same in each channel.

NOTE: The MaxQuant reporter ion intensity values are significantly smaller in magnitude than the PAW (or Proteome Discoverer) values (by about a factor of 8).

We can make a function that takes the column sums, computes the average sum, and scales each column so that its new sum will be the average.

In [36]:
SL_Norm <- function(df, color = NULL, plot = TRUE) {
    # This makes each channel sum to the average grand total
        # df - data frame of TMT intensities
        # returns a new data frame with normalized values
    
    # compute scaling factors to make colsums match the average sum
    norm_facs <- mean(c(colSums(df))) / colSums(df)
    cat("SL Factors:\n", sprintf("%-5s -> %f\n", colnames(df), norm_facs))
    df_sl  <- sweep(df, 2, norm_facs, FUN = "*")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(df_sl), col = color, notch = TRUE, main = "SL Normalized data")
    }
    df_sl
}

# SL norm the tmt data
MQ_tmt_sl <- SL_Norm(MQ_tmt, color)
SL Factors:
 Media_2.1 -> 0.877458
 Media_2.2 -> 0.712193
 Media_3.2 -> 0.997123
 Exo_2.1 -> 1.216413
 Exo_2.2 -> 1.494475
 Exo_3.1 -> 1.089059
 Exo_3.2 -> 0.957944

We can check the column sums before and after normalization

The normalization brought the boxplots into horizontal alignment. We can double check the computations by seeing if the column sums are equal after normalization.

In [37]:
# compute column sums before and after normalization
print("Before:")
format(round(colSums(MQ_tmt), digits = 0), big.mark = ",")
print("After:")
format(round(colSums(MQ_tmt_sl), digits = 0), big.mark = ",")
[1] "Before:"
Media_2.1
'110,182,550'
Media_2.2
'135,750,485'
Media_3.2
' 96,959,547'
Exo_2.1
' 79,480,035'
Exo_2.2
' 64,691,993'
Exo_3.1
' 88,774,401'
Exo_3.2
'100,925,042'
[1] "After:"
Media_2.1
'96,680,579'
Media_2.2
'96,680,579'
Media_3.2
'96,680,579'
Exo_2.1
'96,680,579'
Exo_2.2
'96,680,579'
Exo_3.1
'96,680,579'
Exo_3.2
'96,680,579'

Data was loaded, TMT data extracted, and normalization checked

The basic starting steps in any analysis are to make sure the data were read in correctly. Proteomics summary files usually have lots of information in addition to the quantitative data. Some work to extract the correct quantitative columns will be needed (this is often called the data wrangling part of an analysis). This can be done in Excel before reading into R (one possible source of errors) or in R (a different set of possible errors). Filtering of rows that should be excluded from the statistical testing is also necessary. Those include common contaminants, any decoy proteins, and proteins that might have too much missing data to be reliable. Some care has to be taken so that the testing results can be merged back with the proteomics results.
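As an aside, if the full MaxQuant proteinGroups.txt file were read into R rather than a pre-trimmed Excel export, the contaminant and decoy filtering might look something like the sketch below. The column names follow typical MaxQuant conventions and can differ by version; this is not the prep that was actually used here.

# hypothetical sketch: filtering a full MaxQuant proteinGroups.txt in R
# ("Reverse" and "Potential contaminant" are the usual MaxQuant flag columns)
pg <- read_tsv("proteinGroups.txt", guess_max = 10000)
pg_filtered <- pg %>%
    filter(is.na(Reverse) | Reverse != "+") %>%                              # drop decoy matches
    filter(is.na(`Potential contaminant`) | `Potential contaminant` != "+")  # drop contaminants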

The extracted data may need some further formatting depending on the downstream statistical testing requirements. Here, we are saving the protein accession information as a vector and the actual TMT reporter ions as a 2D table of numbers.

Some exploratory statistics on the starting data are always a good idea. R has many functions to do far more than we have done here. We did some very basic checks to see what the data looked like before and after one simple normalization method. The basic features of the data look okay, so we will move on to the statistical testing.


A brief history of quantitative proteomics

In the dark ages before proteomics, mass spectrometers were made of stone (sector magnets and chemical ionization) and the t-test was developed to test beer (an essential ingredient of proteomics from the very beginning). Then came the not-so-golden ratio age (Penning ion traps, Q-TOFs, ICAT, and SILAC) where quantitative proteomics became synonymous with ratios and fold-changes. Battle lines were drawn between complicated standard formats with lengthy command line processing pipelines and the enduring allure of mysterious "black box" commercial products.

Ironically, progress on the instrument side seems to have accelerated while advances on the processing side seem to be in an ice age glacial state. We have seen an increase in the number of standard formats and their complexity. They seem to have found the sweet spot of being quite hard for both humans and computers to read. We now have shiny "open source" black boxes to compete with more faded "commercial" black boxes. Statistical testing has evolved from one ratio (and no need for statistics) to boatloads of ratios as multiplexing capacity has increased. This leads to a literal "log2" jam that might involve beer consumption because it always ends up with a t-test.

Can we actually do quantitative proteomics without going down this ancient, overgrown path to failure? The answer is yes. Genomics had a head start on proteomics and has been doing the larger sample number, complicated experimental designs that biological studies require for many years. Modern instruments and methods are producing quantitative proteomics datasets that are much more similar to genomics data than they were in the past. If we do a little research (are we not scientists?) into genomic tools and data formats, we can learn how to prepare proteomics data to leverage these more mature statistical packages and stop trying to invent square wheels and making sacrifices to the Volcano plot Gods.

Joking aside, proteomics needs to recognize the similarity between proteomics and genomics data and abandon outdated thinking that was never all that useful. Computers and data science methods have changed just as much as new instruments like the timsTOF have changed compared to sector magnet mass spectrometers.


Differential expression (DE) testing

This is not the place for a review of hypothesis testing. It is, however, a good place to get the gist of statistical testing in this context and define some concepts for the rest of the notebook. There are many types of proteomics experiments, and they do not fit into just one pigeon hole. A common scenario is testing if a protein is expressed at the same or different levels in a background of many other proteins.

Complex protein mixtures from two or more biological conditions (such as control and treatment, or normal and disease) are compared by making measurements of relative abundances. There are some hidden constraints in these experimental designs. The total amount of protein is usually fixed at the same value across all samples (each an individual biological replicate) and there is a wide dynamic range within the proteomes. High abundance proteins are constrained by the fixed total protein amount and do not have the same freedom to differ in abundance as proteins that make up smaller fractions of the total. Higher abundance proteins must, by experimental design, have moderated differences in means and reduced variances. Lower abundance proteins have more freedom and will not really appear to be constrained.

The constrained nature of these experimental designs (both genomics and proteomics) also has implications in normalization algorithms. An increased expression in abundant proteins will "push" all other proteins down (their abundance total has to remain the same). Grand total normalization like we use above can be too simplistic. The Bioconductor (Ref-4) differential expression testing package edgeR (Ref-5) includes a trimmed mean of M values (TMM) normalization step (Ref-6) designed for these types of samples.

4. Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.

5. Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.

6. Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.


The next parts of the notebook will demonstrate using edgeR to do DE testing:

  • proteomics data can be very similar to genomics data
  • robust genomics statistical packages can be used
  • statistical models and test results can be checked with visualizations

Load data into edgeR data structures

Bioconductor uses its own data classes. The main data table is organized by each gene or protein in a separate row, and each biological sample in a separate column. Genes/proteins can have rich annotations and the samples can have metadata. Think of this like a table of information about the rows and a table of information about the columns. These are linked tables. If you change the data rows, you have to update the row table, etc.

EdgeR has extensive documentation and you can read more about its data classes in the user's guide. We need to get the reporter ion intensities into a DGEList object. In addition to the data table, we can add the protein accessions as row labels, and the group membership (plain media or exosome dosed) information. We will use variable names similar to what is in the user's guide to make learning more about edgeR easier.

In [38]:
# load data into DGEList object
group <- c(rep("media", 3), rep("exosome", 4))
y <- DGEList(counts = MQ_tmt, group = group, genes = accession)

# we can see what y looks like
#y
y$samples # this is shorter to view
           group     lib.size   norm.factors
Media_2.1  media     110182550  1
Media_2.2  media     135750485  1
Media_3.2  media      96959547  1
Exo_2.1    exosome    79480035  1
Exo_2.2    exosome    64691993  1
Exo_3.1    exosome    88774401  1
Exo_3.2    exosome   100925042  1

We loaded the unnormalized TMT data

EdgeR normalization is actually done in two steps. The first, called a library size adjustment, is like the SL normalization we did above. This gets rid of the big differences between samples so that the TMM algorithm has better starting data.

In [39]:
# run the TMM normalization
y <- calcNormFactors(y)

# check what changed
y$samples
           group     lib.size   norm.factors
Media_2.1  media     110182550  1.0418281
Media_2.2  media     135750485  1.0603080
Media_3.2  media      96959547  1.0351963
Exo_2.1    exosome    79480035  0.9848199
Exo_2.2    exosome    64691993  0.9951479
Exo_3.1    exosome    88774401  0.9376066
Exo_3.2    exosome   100925042  0.9516650

edgeR computes correction factors

EdgeR does not change the counts table. The "samples" list slot does get new norm.factor values. EdgeR uses the normalization factors as offsets in its modeling. We will want to check the TMM step and visualize the data results, so we need to get a table of normalized data values. A more detailed discussion of edgeR normalization can be found here. We will use a function that returns a data frame of normalized data from a DGEList object.

In [40]:
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute grand total (library size) scalings
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size

    # the TMM factors are library adjustment factors (so divide by them)
    norm_facs <- lib_facs / y$samples$norm.factors
    cat("Overall Factors (lib.size+TMM):\n", sprintf("%-5s -> %f\n", 
                                                     colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# get the normalized data values
MQ_tmt_tmm <- apply_tmm_factors(y, color)
Overall Factors (lib.size+TMM):
 Media_2.1 -> 0.842229
 Media_2.2 -> 0.671685
 Media_3.2 -> 0.963221
 Exo_2.1 -> 1.235163
 Exo_2.2 -> 1.501762
 Exo_3.1 -> 1.161531
 Exo_3.2 -> 1.006598

Examine returned data and compute column totals

In [41]:
# check returned table
head(MQ_tmt_tmm)

# check column totals
print("After TMM:")
format(round(colSums(MQ_tmt_tmm), digits = 0), big.mark = ",")
Media_2.1_tmm  Media_2.2_tmm  Media_3.2_tmm  Exo_2.1_tmm  Exo_2.2_tmm  Exo_3.1_tmm  Exo_3.2_tmm
     910871.0       836516.8      1029875.9    1223083.4    1273614.2    1487805.2    1578748.8
     615046.4       609225.3       716212.6     938341.2     937234.6    1041672.7    1055720.3
     578897.9       574976.0       660682.9     684280.5     654347.7     879221.0     869177.5
     481367.8       479086.2       576054.3     754450.1     713111.6     877920.1     855960.9
     526368.1       516384.9       543025.5     726967.7     751706.9     778655.6     727347.8
     734440.8       734487.8       647342.3     556799.3     487306.7     476762.1     476000.2
[1] "After TMM:"
Media_2.1_tmm
' 92,798,975'
Media_2.2_tmm
' 91,181,601'
Media_3.2_tmm
' 93,393,473'
Exo_2.1_tmm
' 98,170,820'
Exo_2.2_tmm
' 97,151,971'
Exo_3.1_tmm
'103,114,230'
Exo_3.2_tmm
'101,590,979'

Column totals are not all the same

The point of TMM is that changes in the abundance of more abundant proteins distort the grand totals. That is the trimmed part of the algorithm. With freedom to change the abundances of the more abundant proteins, the column totals will no longer stay constrained. Even though the column totals are not the same, the boxplots are nicely aligned.

If we have large adjustment factors, it means we have some compositional differences between samples. That has to be reconciled with what is known about the system under study. Generally, some compositional differences between conditions could make sense. Compositional differences between samples within the same condition could mean contaminating tissue in a dissection, for example.

It is critical to remember that most normalization methods assume large numbers of proteins with unchanged expression are present in the samples. These methods are for samples in conditions that are mostly similar in their proteomes. When that is not the case, much more care has to go into picking a normalization method.

CV distributions are a good metric to test normalizations

Now that we have the table of TMM normalized data, we can do a few more QC type visualizations. We should have smaller CV values when normalization has worked well. If we do not get relatively tight CV distributions with nice median CV values, there are two likely reasons. There is significant inherent biological variability between samples (humans), or the sample preparation is too variable (need to optimize more).

In [42]:
# define variables for the columns in each condition
M <- 1:3
E <- 4:7
In [43]:
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

labeled_boxplot <- function(df, ylim, title) {
    # Makes a box plot with the median value labeled
        # df - data frame with data to compute CVs of
        # ylim - upper limit for y-axis
        # title - plot title
    cv = CV(df)
    boxplot(cv, ylim = c(0, ylim), notch = TRUE, main = title)
    text(x = 0.65, y = boxplot.stats(cv)$stats[3], 
         labels = round(boxplot.stats(cv)$stats[3], 1))
}

# compare CV distributions
par(mfrow = c(2, 2))
labeled_boxplot(MQ_tmt[M], ylim = 150, title = "Starting Media CVs")
labeled_boxplot(MQ_tmt[E], ylim = 150, title = "Starting Exosome CVs")
labeled_boxplot(MQ_tmt_tmm[M], ylim = 75, title = "Media CVs after TMM")
labeled_boxplot(MQ_tmt_tmm[E], ylim = 75, title = "Exosome CVs after TMM")
par(mfrow = c(1, 1))

Median CVs with Comet/PAW are a little lower

The median CVs are much improved after normalization. The corresponding median CVs from the Comet/PAW analysis were 9.3% and 11.3% for the media and exosome conditions, respectively. Considerably more peptide-spectrum-matches (PSMs) were identified with Comet/PAW than with MaxQuant at a 1% FDR:

What                          Number
Acquired MS2 scans            275695
Identified PSMs, MaxQuant      60766
Identified PSMs, Comet/PAW     85936
Gain (Comet/PAW over MQ)         41%

Aggregation of more data at the protein level could be the reason that the Comet/PAW median CVs were a little lower. Lower PSM identification numbers from MaxQuant are consistently seen across all datasets and are discussed more in this blog entry.

Check how the samples cluster

This is a two-condition comparison. One of the 4 media only samples did not label. The cultures were done in two batches ("2.x" and "3.x" in the labels). We expect to see the media only samples cluster together and be separated from a cluster of the exosome dosed samples.

In [44]:
# see how things cluster after we have normalized data
plotMDS(log2(MQ_tmt_tmm), col = c(rep("red", 3), rep("blue", 4)), main = "MaxQuant TMM")

Conditions separate left-to-right in first dimension

Media only clearly separates from exosome dosed in the leading dimension. These MDS plots are like principal component analysis: the x-axis is the major dimension of differences between samples, and the y-axis is the second most important dimension of difference. The samples are separating left-to-right by condition. We probably also have a culture prep batch effect between batch 2 and 3. We should have more samples in each condition before attempting things like batch corrections, so we will focus on the differences between the media and exosome conditions.
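For reference, a batch term could be folded into an edgeR glm model if we had more samples; this is only a hypothetical sketch (not part of this analysis), with the batch factor read off the "2.x"/"3.x" sample labels.

# hypothetical sketch of a batch-adjusted edgeR glm model (not used here)
batch <- factor(c(2, 2, 3, 2, 2, 3, 3))   # from the sample labels
design_batch <- model.matrix(~ batch + group)
y_batch <- estimateDisp(y, design_batch)
fit_batch <- glmQLFit(y_batch, design_batch)
res_batch <- glmQLFTest(fit_batch)        # group effect adjusted for batch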

Compare samples to each other with scatter plots

Multi-panel scatter plot grids are another exploratory data analysis technique that is well supported in R. We can see how similar biological replicates are to each other within the media only condition or within the exosome dosed condition. We can also see if the samples between conditions seem different. We would expect to see larger differences (more scatter) between groups than within groups.

In [45]:
# scatter plots within groups and between groups
pairs.panels(log10(MQ_tmt_tmm), lm = TRUE, main = "After TMM")

Within condition samples are a bit tighter than between conditions

Correlation coefficients are larger within conditions (0.98-ish) than between conditions (0.93-ish), but the differences are not large. Correlation coefficients are not a great metric. Media only samples are a little more similar between batch 2 and 3 than are the exosome dosed samples. Samples within the same condition and within the same batch are pretty similar to each other.

Back to the statistical testing...

The typical analysis goes something like this:

  • load the data into DGEList object
  • run TMM and normalize the data
  • check for experimental issues with cluster plot
  • estimate dispersion trends for moderated statistics
  • plot the Biological Coefficient of Variation (BCV)
  • do the pair-wise tests and get p-values, FDRs
  • check the p-value distribution
  • visualize DE results
    • MA plots
    • scatter plots
    • volcano plot
  • gather normalized data and test results into table
  • write out the final results table

We have already done the first 3 steps.

In [46]:
# we need to get dispersion estimates
y <- estimateDisp(y)
plotBCV(y, main = "Dispersion trends")
Design matrix not provided. Switch to the classic mode.

Dispersion depends on the protein intensity

There are commonly abundance-dependent dispersion effects in many measurement methods. Higher abundance signals have better signal-to-noise (or are the result of more averaging), so they can appear more precise than lower abundance signals. We see that the blue line increases as the protein intensity decreases. The statistical testing will use trended variances (estimated from multiple proteins) as a function of protein intensity, which spans several orders of magnitude.

If we had summarized the data using ratios, we would have collapsed the range of protein values into a concentrated blob centered at 1:1 ratios. We would lose all ability to have trended variance estimates. The trended variance is how the test statistics are moderated and is why limma and edgeR outperform t-tests in typical study designs where replicate numbers can be small. I do not know how many replicates in each condition are needed before the trended variance stops improving results (10, 20, 50?). I think most proteomics studies are still considerably short of replicate numbers where the t-test will work well. Avoiding ratios is very important in getting the best results.

edgeR exact test

We will use the exact test in edgeR for this simple two-state comparison. EdgeR has general linear model extensions that allow for more complicated experimental designs. They are a bit more work to set up and run. We will avoid that here, but the user's guide has many examples.

If you read up on edgeR or know a bit about statistics, you may wonder why we are using statistical testing built on a negative binomial distribution intended for counting experiments. The short answer is that edgeR is easier to set up and run, and the modeling seems okay given the data. The longer story is that the variance is modeled by two terms: a Poisson counting-statistics term and an over-dispersion term. As counts get larger (a few hundred or so), the Poisson term becomes small and the variance is handled by the over-dispersion term. Reporter ion peak heights, particularly when summed into protein totals, are large enough for this to be the case. One could argue that limma would be a better choice; however, limma is a little more complicated to set up and run than edgeR.

Notebooks are a great framework for trying different analysis steps. In the previous version of this data set analysis, edgeR testing was compared to a two-sample t-test. Running different tests in R is not hard; what is hard is comparing different analysis methods and defining a metric to measure performance.
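For comparison, a minimal sketch of what a limma analysis of the same normalized table might look like is below (log2 of the TMM-normalized intensities with trended eBayes; this is an illustration, not the analysis used in this notebook).

# minimal limma sketch on log2 TMM-normalized intensities (illustration only)
design_limma <- model.matrix(~ 0 + group)
colnames(design_limma) <- levels(factor(group))
fit_limma <- lmFit(log2(MQ_tmt_tmm), design_limma)
contrast <- makeContrasts(exosome - media, levels = design_limma)
fit_limma2 <- eBayes(contrasts.fit(fit_limma, contrast), trend = TRUE)
limma_tt <- topTable(fit_limma2, number = Inf, sort.by = "none")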

In [47]:
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))

    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# compute the exact test models, p-values, FC, etc.
et <- exactTest(y, pair = c("media", "exosome"))

# make the results table 
tt <- topTags(et, n = Inf, sort.by = "none")$table
med_exo <- collect_results(MQ_tmt_tmm, tt, M, "med", E, "exo")

Run test and reformat results

The exactTest function does the modeling for the specified pair. More than one pair can be present in DGEList objects to improve the dispersion estimates. The topTags function does the multiple testing corrections and returns more concise summaries of the testing results. The collect_results function combines some of the normalized data and test results into a handy results data frame.

Check if testing looks okay

It is important to see if the modeling looks reasonable. Our general assumptions are that we have a large fraction of the proteins that are not differentially expressed. Those will have a uniform (flat) p-value distribution from 0.0 to 1.0. We also expect (hopefully) some true differential expression candidates. Those should have very small p-values and have a sharper distribution at low p-values.

In [48]:
pvalue_plots <- function(results, ylim, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # ylim - ymax for expanded view
        # title - plot title
    p_plot <- ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]))

    # we will need an expanded plot
    p1 <- p_plot + ggtitle(str_c(title, " p-value distribution"))
    p2 <- p_plot + coord_cartesian(xlim = c(0, 1.0), ylim = c(0, ylim)) + 
        ggtitle("p-values expanded")
    grid.arrange(p1, p2, nrow = 2) # from gridExtra package
}

# check the p-value distrubution
pvalue_plots(med_exo, 100, "Media vs Exosome-dosed")

We have the two distributions of p-values, so the testing seems reasonable. We can also see how the up-regulated protein number compares to the down-regulated number. We can use the topTags function to see which proteins have the smallest p-values.

In [49]:
# see how many up and down candidates (10% FDR)
summary(decideTests(et, p.value = 0.10))

# see which proteins have the smallest p-values
topTags(et)$table
       exosome-media
Down            1181
NotSig          2502
Up              1195
       genes                    logFC     logCPM     PValue         FDR
2151   sp|Q80V94|AP4E1_MOUSE    9.712092  6.351154   2.204329e-142  1.075272e-138
2518   sp|Q6PDS3|SARM1_MOUSE    9.257728  5.905050   6.788691e-48   1.655762e-44
1057   tr|E9Q6R7|E9Q6R7_MOUSE   3.914449  8.044726   9.796961e-47   1.592986e-43
177    sp|P54987|IRG1_MOUSE     2.578283  10.256279  5.549544e-45   6.767669e-42
2916   tr|Q8R5I6|Q8R5I6_MOUSE   8.735104  5.383087   5.384968e-44   5.253575e-41
480    sp|P35173|CYT3_MOUSE     3.012063  9.226840   7.239595e-44   5.885791e-41
2670   sp|Q9CRB6|TPPP3_MOUSE    9.078868  5.726764   9.860130e-44   6.871102e-41
3665   sp|P04918|SAA3_MOUSE     3.462006  4.138469   1.271082e-42   7.750422e-40
2484   sp|P33766|FPR1_MOUSE     3.225450  5.805522   4.588253e-42   2.486833e-39
1377   sp|Q8BGW1|FTO_MOUSE      3.441914  7.394756   1.492332e-41   7.279594e-39

We can categorize candidates by ranges of adjusted p-values

In many discovery experiments, a Benjamini-Hochberg adjusted p-value (FDR) cutoff of 0.05 (5%) might be pretty strict. We use a 10% cutoff to distinguish DE from non-DE candidates. We define three cuts on the FDR: 10% to 5% are "low" significance, 5% to 1% are medium significance, and less than 1% are more "highly" significant. Cut values can be adjusted depending on the experimental situation.
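The cut points live in the breaks argument of cut(), so tightening (or loosening) the categories is a one-line change; a sketch with stricter cuts might look like this (the break values here are illustrative only):

# sketch: stricter FDR categories (break points are illustrative)
candidate_strict <- cut(med_exo$FDR, breaks = c(-Inf, 0.005, 0.01, 0.05, 1.0),
                        labels = c("high", "med", "low", "no"))
table(candidate_strict)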

We can look at expression ratio distributions as a function of candidate category. If variance is not too variable protein-to-protein, then we would expect larger mean differences to be associated with lower FDR values. Faceted plotting in ggplot2 is a nice method for showing such things.

In [50]:
# see how many candidates are in each category
med_exo %>% count(candidate)

log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# can look at log2FC distributions as a check
log2FC_plots(med_exo, 3, "LogFC by candidate for Media vs Exosome-dosed")
candidate     n
high       1490
med         553
low         333
no         2502

Visualize the edgeR results several ways

  • MA plots
  • scatter plots
  • volcano plot

We need some transformed axes for MA plots and for volcano plots. We will make a function for that and also some functions for the plotting. MA plots first. The dotted lines indicate 2-fold changes.

In [51]:
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

MA_plots <- function(results, x, y, title, make_facet = TRUE) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # make_facet - flag to plot facet views
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    if (make_facet == TRUE) {
        ma_facet <- ggplot(temp, aes(x = A, y = M)) +
            geom_point(aes(color = candidate, shape = candidate)) +
            scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
            scale_x_continuous("log10 Ave_intensity") +
            ma_lines +
            facet_wrap(~ candidate) +
            ggtitle(str_c(title, " (separated)"))
    }

    # make the plots visible
    print(ma)
    if (make_facet == TRUE) {
         print(ma_facet)
    }
}    

# MA plots of DE candidates
MA_plots(med_exo, "ave_med", "ave_exo", "Media versus exosome-dosed")

Scatter plots

The solid diagonal line is 1:1, the dotted lines are 2-fold changes. The axes are in log scale.

In [52]:
scatter_plots <- function(results, x, y, title, make_facet = TRUE) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # make_facet - flag to plot facet views
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    if (make_facet == TRUE) {
        scatter_facet <- ggplot(results, aes_string(x, y)) +
            geom_point(aes(color = candidate, shape = candidate)) +
            scatter_lines +
            facet_wrap(~ candidate) +
            ggtitle(str_c(title, " (separated)")) 
    }

    # make the plots visible
    print(scatter)
    if (make_facet == TRUE) {
         print(scatter_facet)
    }
}

scatter_plots(med_exo, "ave_med", "ave_exo", "Media versus exosome-dosed")

Volcano plot

Volcano plots are another common way to visualize DE candidates.

In [53]:
volcano_plot <- function(results, x, y, title, ymax) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
        # ymax - upper limit for y-axis
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        coord_cartesian(xlim = c(-5, 5), ylim = c(0, ymax)) + 
        ggtitle(str_c(title, " Volcano Plot"))
}

# finally, a volcano plot
volcano_plot(med_exo, "ave_med", "ave_exo", "Media versus exosome-dosed", 50)

See what the top DE candidates look like

We can see how the intensities of the individual samples compare for the top 20 up- and down-regulated candidate proteins.

In [54]:
# function to extract the identifier part of the accession
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    identifier[,3]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top upregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", p-val: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1))
        barplot(vec, col = color, main = title)
    }    
}
# plot the top 20 up and 20 down proteins
set_plot_dimensions(7, 4)
plot_top_tags(med_exo, 3, 4, 20)
set_plot_dimensions(7, 7)

Comparison of MaxQuant and PAW reporter ions

One mystery is exactly what the reporter ion quantities reported by MaxQuant are. The smallest non-zero values are around 50, while in the PAW processing they are around 350. The data in the dilution series analysis have been processed with both MaxQuant and PAW. The column total intensities reflect the expected dilution factors, and those totals are highly correlated between MaxQuant and PAW (see next cell).

Correlation between MaxQuant and PAW reporter ion intensities

In [55]:
# column totals from the dilution series (25, 20, 15, 10, 5, 2.5) data
MQ <- c(252042061, 194468567, 147076787, 101203918, 49272033, 26071287)
PAW <- c(1996324081, 1562525113, 1207642244, 835353557, 400841992, 221591661)
dilution <- data.frame(PAW = PAW, MQ = MQ)

# from https://sejohnston.com/2012/08/09/a-quick-and-easy-function-to-plot-lm-results-in-r/
ggplotRegression <- function (fit, title) {
    require(ggplot2)
    ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
        geom_point(size = 3) +
        stat_smooth(method = "lm", col = "red") +
        labs(title = title, 
             subtitle = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5), 
                              "Intercept =",signif(fit$coef[[1]],5 ),
                              " Slope =",signif(fit$coef[[2]], 5),
                              " P =",signif(summary(fit)$coef[2,4], 5)))
}

ggplotRegression(lm(MQ ~ PAW, data = dilution), "MQ reporter ions versus peak heights")

MaxQuant values are about 8 times smaller than reporter ion peak heights

The Comet/PAW peak heights have been checked against Proteome Discoverer (1.4 and 2.x), RawQuant, and Qual Browser. They are the centroided peak intensities. The MaxQuant values must be derived from the peak heights to have such a correlated scatter plot. Why they are smaller is unclear.

Compare MS1 intensities to summed reporter ion intensities

MaxQuant produces summed MS1 feature intensities for proteins by default. iBAQ values can also be computed. These can be compared to summed reporter ion signals to see if those are good protein relative abundance measures.

In [56]:
# get the sum of reporter ions for each protein
totals  <- rowSums(MQ_tmt_tmm)

# add to data frame with MS1 abundances (drop zeros and take logs)
abundances <- data.frame(Total_TMT = totals, MS1_Intensity = intensities, iBAQ_Intensity = ibaqs)
abundances <- log10(abundances[abundances$MS1_Intensity > 0, ])

# make some scatter plots
pairs.panels(abundances, lm = TRUE, 
             main = "Protein relative abundance measures")
ggplotRegression(lm(MS1_Intensity ~ Total_TMT, data = abundances), 
                 "Correlation between MS1 and reporter ion intensities")

Summed reporter ions are a good relative abundance measure

Another argument for avoiding expression ratios is that the summed reporter ions are a good protein relative abundance measure. This adds another dimension to the data for biological interpretation. It can be very useful to know whether a DE candidate is a high abundance protein or a low abundance protein.

More missing data in MaxQuant

There seemed to be a few more missing data values on the low abundance end in MaxQuant compared to PAW. This could be due to overall lower sensitivity with MaxQuant compared to Comet/PAW (or many other search engines). The numbers of identified PSMs and proteins are in the table below.

Category              Comet/PAW  MaxQuant
MS2 Scans                272221    275695
PSMs (1% FDR)             85936     60766
Protein IDs                5462      5522
Proteins, 2 peptides       5462      4711

PAW/Comet identified 41% more PSMs. Including single peptide per protein identifications pads the MaxQuant protein IDs and helps hide the poorer identification performance at the PSM level. I have a blog entry that talks more about MaxQuant performance here.

Top candidate overlap not so great for the top 25 up and top 25 down proteins

If people understood just how much statistical model p-values can change depending on data, model choice, and model parameter choices, maybe they would worry less about them. There were about 1200 up and 1200 down candidates at a 10% FDR cutoff. MaxQuant was a little under 1200 and PAW a little over 1200. The difference was not really too large. Seeing how well the top 25 compare based on p-values is not really a fair test. The full lists would need to be examined. For the many proteins that were the same in the top 25 plots, the expression patterns looked qualitatively similar. Things are probably more similar than different.
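A fuller check would compare the complete candidate lists rather than the top 25. A hypothetical sketch is below: the PAW results file name is made up, and the accessions would need to be reconciled between the two pipelines (the protein grouping differs) before the overlap is meaningful.

# hypothetical sketch of a candidate-list overlap check (file name is made up)
# paw_results <- read_tsv("KUR1502_PAW-results.txt")
# mq_cand  <- med_exo %>% filter(candidate != "no") %>% pull(Acc)
# paw_cand <- paw_results %>% filter(candidate != "no") %>% pull(Acc)
# length(intersect(mq_cand, paw_cand))   # candidates in common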

An important point is that these lists are a bit dynamic at the single protein level. Drawing conclusions from single proteins might be a perilous exercise. Most biological process and overall system level changes should involve many proteins, so hypotheses should be based on the behavior of multiple proteins. Also of real importance is that the sets of proteins that drive the biological conclusions may not all be DE candidates at some arbitrary significance level. It is critical to have full results tables with statistical test results, normalized data, expression changes, and annotations for ALL quantifiable proteins. The tools to do the complex queries of these complex data sets for complex biological questions are desperately needed.

Summary

EdgeR normalizations and testing seemed reasonable for this data.

  • non-DE candidates have fold-changes much less than 2-fold (most are closer to 1:1)
  • larger fold changes associated with more significant candidates
  • wider dispersion at lower intensities is incorporated into edgeR testing
    • need larger fold changes at lower intensities to get the same significance
  • proteins with larger fold-changes have much more significant p-values
    • we have wide dynamic range in the p-values
  • we did not do any isotopic purity corrections
    • true fold-changes could be larger than measured

There were a large number of significant candidates. There are options in edgeR (if using the glm extensions) to add a minimum fold-change to the testing. That can be used to reduce the number of candidates. Stricter FDR cutoffs could also be used. Ranking candidates for follow-up study is not trivial. Statistical values like fold-changes and p-values may not be sufficient to determine biologically relevant rankings. Biological information for this system should be an important consideration.
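EdgeR's glmTreat function is how a minimum fold-change requirement can be folded into the testing itself (rather than filtering on fold-change after the fact). A sketch using the glm machinery, instead of the exact test used above, might look like this (the 1.5-fold threshold is illustrative):

# sketch: testing against a minimum fold-change with glmTreat (not run above)
design_glm <- model.matrix(~ group)
y_glm <- estimateDisp(y, design_glm)
fit_glm <- glmQLFit(y_glm, design_glm)
treat_res <- glmTreat(fit_glm, coef = 2, lfc = log2(1.5))   # coef 2 is the group term
topTags(treat_res)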

There is a very uneven distribution of where effort is directed in these workflows. Label reagents, instrument methods, search engines, and PSM validation have gotten the most attention. Normalizations and statistical models have received some attention. Visualizing results and making biological sense of things are the important steps that have been neglected for too long. This is the critical bottleneck.

Getting results out of R

One step towards more meaningful results summaries is getting the main proteomics results and the statistical results combined into a single table. If we are careful with how we bring the data into R and how we take it back out, we can keep tables in order so that merging results is easier. There are several strategies that can work well. If the data coming into R and the data coming out of R are not easy to align correctly, edit the notebooks until they are. More data prep can occur before importing into R, for example. We kept the accessions column as we worked with the data. To be really robust, we can use merge functions to weave together the starting tables with the final results tables.
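A hedged sketch of such a merge is below; the wider protein summary table and its file name are hypothetical, but the join key on the accession column is the idea.

# hypothetical sketch: merge test results back onto a wider protein summary
# table by accession, then write the combined table out
# full_summary <- read_tsv("KUR1502_protein_summary.txt")
# final <- left_join(full_summary, med_exo, by = c("Accession" = "Acc"))
# write_tsv(final, "KUR1502_MQ-results_merged.txt")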

We should always end notebooks with information about what packages and versions were used in the analysis.

In [57]:
# save the testing results
write.table(med_exo, file = "KUR1502_MQ-results.txt", sep = "\t",
            row.names = FALSE, na = " ")

# log the session
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] edgeR_3.24.3    limma_3.38.3    scales_1.0.0    gridExtra_2.3  
 [5] psych_1.8.12    forcats_0.4.0   stringr_1.4.0   dplyr_0.8.0.1  
 [9] purrr_0.3.2     readr_1.3.1     tidyr_0.8.3     tibble_2.1.1   
[13] ggplot2_3.1.1   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-3     locfit_1.5-9.1   tidyselect_0.2.5 repr_0.19.2     
 [5] splines_3.5.3    haven_2.1.0      lattice_0.20-38  colorspace_1.4-1
 [9] generics_0.0.2   htmltools_0.3.6  base64enc_0.1-3  rlang_0.3.4     
[13] pillar_1.3.1     foreign_0.8-71   glue_1.3.1       withr_2.1.2     
[17] modelr_0.1.4     readxl_1.3.1     uuid_0.1-2       plyr_1.8.4      
[21] munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0 rvest_0.3.3     
[25] evaluate_0.13    labeling_0.3     parallel_3.5.3   broom_0.5.2     
[29] IRdisplay_0.7.0  Rcpp_1.0.1       backports_1.1.4  IRkernel_0.8.15 
[33] jsonlite_1.6     mnormt_1.5-5     hms_0.4.2        digest_0.6.18   
[37] stringi_1.4.3    grid_3.5.3       cli_1.1.0        tools_3.5.3     
[41] magrittr_1.5     lazyeval_0.2.2   crayon_1.3.4     pkgconfig_2.0.2 
[45] xml2_1.2.0       lubridate_1.7.4  assertthat_0.2.1 httr_1.4.0      
[49] rstudioapi_0.10  R6_2.4.0         nlme_3.1-139     compiler_3.5.3  

One extra thing

Let's see how a basic t-test works with the MQ data

The best statistical option in Perseus is probably a two-sample t-test with a multiple testing correction. Let's see how that compares to the edgeR analysis. We will use the shared variance option on the t-test to get a little more variance pooling.

In [58]:
# do the t-test on log transformed intensities to be safe
ttest_MQ <- log2(MQ_tmt_tmm)
# add condition average columns (not log transformed), a log fold-change column, and row names
ttest_MQ$ave_med <- rowMeans(MQ_tmt_tmm[1:3])
ttest_MQ$ave_exo  <- rowMeans(MQ_tmt_tmm[4:7])
ttest_MQ$logFC <- log2(ttest_MQ$ave_exo / ttest_MQ$ave_med)
row.names(ttest_MQ) <- accession

# apply the basic two-sample t-test (we will pool variance)
t.result <- apply(ttest_MQ, 1, function(x) t.test(x[1:3], x[4:7], var.equal = TRUE))
# extract the p-value column from the t-test thingy 
ttest_MQ$PValue <- unlist(lapply(t.result, function(x) x$p.value))
# do a Benjamini-Hochberg multiple testing correction
ttest_MQ$FDR <- p.adjust(ttest_MQ$PValue, method = "BH")

# add a DE candidate status column
ttest_MQ$candidate <- cut(ttest_MQ$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                          labels = c("high", "med", "low", "no"))
    
# count up, down and the rest (FDR less than 0.05)
all <- dim(ttest_MQ)[1]
up <- dim(ttest_MQ[(ttest_MQ$FDR <= 0.10) & (ttest_MQ$logFC > 0.0), ])[1]
down <- dim(ttest_MQ[(ttest_MQ$FDR <= 0.10) & (ttest_MQ$logFC <= 0.0), ])[1]
print("This is like decideTest in edgeR - 10% FDR cut:")
up 
all - up - down
down
print("Candidate Counts:")
summary(ttest_MQ$candidate)
    
# what does the test p-value distribution look like?
ggplot(ttest_MQ, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(ttest_MQ$PValue, breaks = 100, plot = FALSE)$counts[26:100])) +
  ggtitle("MQ data with t-test p-value distribution")
[1] "This is like decideTest in edgeR - 10% FDR cut:"
1212
2407
1259
[1] "Candidate Counts:"
high
665
med
1254
low
552
no
2407

Candidate numbers and p-value distribution look okay

The total up and down candidate numbers are around 1200 like we had with edgeR. The number of candidates in the different FDR classes is different. The t-test has most candidates in the medium category in contrast to edgeR where most of the candidates were in the high class. The p-value distribution is qualitatively similar. These bird's eye views seem reasonable.

We can look at the more detailed plots (MA plots, scatter plots, and volcano plot) to see if a deeper look into the DE candidates still seems okay. We will repeat the edgeR plots here for comparison.

In [59]:
MA_plots(med_exo, "ave_med", "ave_exo", "MQ edgeR results", FALSE)
MA_plots(ttest_MQ, "ave_med", "ave_exo", "MQ t-test results")
In [60]:
scatter_plots(med_exo, "ave_med", "ave_exo", "MQ edgeR results", FALSE)
scatter_plots(ttest_MQ, "ave_med", "ave_exo", "MQ t-test results")
In [61]:
# compare volcano plots
volcano_plot(med_exo, "ave_med", "ave_exo", "EdgeR results", 4)
volcano_plot(ttest_MQ, "ave_med", "ave_exo", "t-test results", 4)

Plain t-test is noisier than edgeR

Many aspects of the t-test look pretty bad

  • too many proteins with small fold-changes are significant
  • weaker correlation between low p-values and large fold-changes
  • less useful candidate categorization
  • statistically significant candidates are a little biased towards higher intensities
  • ranking the top candidates for follow up work looks more difficult to do

More detailed comparisons between edgeR, the t-test, and limma are in other notebooks in this repository.
