# load the tidyverse (provides ggplot2 for plotting and readr for read_tsv)
library(tidyverse)
# read in the data
temp <- read_tsv("average_fractionMissing.txt")
# make a basic plot
ggplot(temp, aes(x = AverageSpC, y = FracMissing)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.05) +
  ggtitle("Fraction Missing Data versus Average SpC") +
  labs(x = "Average SpC", y = "Fraction Missing")
The fraction of missing data rises sharply as the average SpC decreases. To see at what average SpC that rise begins, we need to zoom in on the x-axis near the origin.
# expanded x-axis plot
ggplot(temp, aes(x = AverageSpC, y = FracMissing)) +
  coord_cartesian(xlim = c(0, 50)) +
  geom_line() +
  # geom_smooth(method = "loess", span = 0.05) +
  ggtitle("Fraction Missing versus Average SpC") +
  labs(x = "Average SpC", y = "Fraction Missing") +
  geom_vline(xintercept = 2.5, linetype = "dotted") +
  geom_vline(xintercept = 5.0, linetype = "dashed")
The rise in missing data seems to start at an average SpC of about 5.0 (the dashed line). That is a relatively high cutoff, and it would greatly reduce the number of testable proteins: of the 2262 confidently identified non-contaminant proteins, only 414 (about 18%) have an average SpC of 5.0 or greater. We often use an average SpC cutoff of 2.5 (the dotted line) in spectral counting experiments, which increases the number of testable proteins to 669 (about 30%). That is about the best we can do with this data. The single LC runs and the wide dynamic range of the proteome leave too many proteins with very small spectral counts or zeros. At the 5.0 cutoff, the fraction of missing data is 10%; at the 2.5 cutoff, it has risen to 20%. Overall, 53% of the data are missing across the 2262 proteins.
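The protein counts at each cutoff come from a simple threshold on the per-protein average SpC. Here is a minimal sketch of that tally, assuming a hypothetical per-protein table `proteins` with an `AverageSpC` column. That table is not the file read in above, and the values are made up for illustration.

```r
# Hypothetical per-protein table; the values are invented for illustration
# and are NOT taken from this experiment.
proteins <- data.frame(
  Accession  = paste0("P", 1:8),
  AverageSpC = c(0.3, 0.8, 1.5, 2.5, 3.2, 5.0, 12.4, 40.1)
)

# Candidate average SpC cutoffs (the dotted and dashed lines in the plot)
cutoffs <- c(2.5, 5.0)

# Number of proteins at or above each cutoff
n_testable <- sapply(cutoffs, function(k) sum(proteins$AverageSpC >= k))
names(n_testable) <- cutoffs

# Same tallies expressed as a fraction of all proteins in the table
frac_testable <- n_testable / nrow(proteins)

n_testable
frac_testable
```

Applied to the real per-protein averages, the same threshold would give the protein tallies quoted above.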
Another way to look at this is to compute the fraction of the total SpC that belongs to proteins above each of the two cutoffs. With the 5.0 cutoff, we would include 81.9% of the total counts. With the 2.5 cutoff, we would include 89.3%.
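The fraction of counts retained can be sketched the same way. This example assumes a hypothetical vector `avg_spc` of per-protein average SpC values (made up here, not from this experiment). Because every protein is averaged over the same number of samples, the fraction of summed average SpC equals the fraction of total counts.

```r
# Hypothetical per-protein average SpC values; invented for illustration only.
avg_spc <- c(0.5, 1.5, 3.0, 5.0, 10.0, 30.0)

# Candidate average SpC cutoffs
cutoffs <- c(2.5, 5.0)

# Fraction of the summed SpC carried by proteins at or above each cutoff
frac_counts <- sapply(cutoffs, function(k) {
  sum(avg_spc[avg_spc >= k]) / sum(avg_spc)
})
names(frac_counts) <- cutoffs

round(100 * frac_counts, 1)
```

Low-abundance proteins contribute few counts each, so a cutoff can drop many proteins while still keeping most of the total SpC.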
sessionInfo()