This is a 3 x 20 TMT experiment done with 7 11-plex TMT kits. The 10th and 11th channels (131-N and 131-C tags) were used for the pooled internal standards according to the sample keeping records. We will verify that the record keeping was correct using an edited copy of the auto_finder notebook.
# library imports
library(tidyverse)
library(stringr)
library(limma)
library(psych)
library(Matrix)
This can be either the grouped protein file or the labeled file (for reading by the IRS script). We have inspected the file using Excel and determined that the main table has 3675 rows (not counting the header row).
I need to figure out how to read the data files without having to see how long they are beforehand...
# load the raw data
# the table starts in Row 5
# there are extra lines at the bottom - need to stop before the end
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_8.txt",
skip = 4, n_max = 3675, guess_max = 3675)
# the "Filter" column flags contams and decoys
data_all <- filter(data_import, is.na(Filter))
# how big is the data frame?
print("loaded data dimensions:")
dim(data_import)
# check number of rows before and after filtering
print("Rows before and after filtering:")
cat(nrow(data_import), nrow(data_all))
We need a data frame for each TMT-plex that we are checking. The function we will be calling assumes each column in the data frame is a TMT channel from one plex (samples labeled with one TMT kit will be referred to as a "plex").
# make a data frame for each TMT-plex (and drop zeros)
get_data <- function(df, substring) {
# extracts PAW TMT columns containing a substring
# df: data frame of PAW results
# substring: a text substring that defines one TMT-plex
# get the columns
df <- df %>%
select(starts_with("TotInt")) %>%
select(contains(substring))
# drop rows with zeros
df[apply(df, 1, function(row) all(row !=0 )), ] # return filtered frame
}
# get each experiment and set column names to something easy to parse
exp1_raw <- get_data(data_all, "_exp1")
exp2_raw <- get_data(data_all, "_exp2")
exp3_raw <- get_data(data_all, "_exp3")
exp4_raw <- get_data(data_all, "_exp4")
exp5_raw <- get_data(data_all, "_exp5")
exp6_raw <- get_data(data_all, "_exp6")
exp7_raw <- get_data(data_all, "_exp7")
The find_best_pair function will make an interquartile range (IQR) grid of standardized differences for all channels against all channels after sample loading normalization. The top-N smallest IQR values will be found, the associated pair indices determined, channel versus channel scatter plots made, and box plots of the IQRs generated. The next cell has the function definitions.
# automatically find closest channel pairs
SL_norm <- function(df, suffix, print_factors = TRUE) {
# Normalizes each channel's sum to the average grand total
# df: data frame of TMT data (one column for each channel)
# suffix: a text string to append to 1-to-n numbers for column names
# print_factors: logical to control printing
# compute norm factors and scale columns
norm_facs <- mean(c(colSums(df))) / colSums(df)
df_sl <- sweep(df, 2, norm_facs, FUN = "*")
# simplify column names
colnames(df_sl) <- str_c(as.character(1:ncol(df)), suffix)
# print the normalization factors for QC check
if (print_factors == TRUE) {
cat("\nNormalization Factors: ", suffix, "\n ")
cat(sprintf("%s - %0.3f\n", colnames(df), norm_facs))
}
df_sl # return normalized data
}
diff <- function(df, x, y) {
# computes a standardized difference between two vectors
# df: data frame that contains x and y
# x: index of first column
# y: index of second column
# compute the difference and set the column name
diff <- 100 * abs(df[x] - df[y]) / rowMeans(df[c(x, y)])
colnames(diff) <- sprintf("%d_%d", x, y)
diff # return difference vector
}
iqr_grid <- function(df, print_grid = TRUE) {
# computes an NxN grid of interquartile ranges of diffs
# df: a data frame of TMT data
# print_grid: logical to control printing
# make an empty matrix ncol-by-ncol
iqrs <- matrix(0.0, ncol(df), ncol(df))
# populate an upper diagonal "diff" matrix
for (i in 1:ncol(df)) {
for (j in (i+1):ncol(df)) {
if (j > ncol(df)) {
break
}
iqrs[i, j] <- IQR(diff(df, i, j)[[1]])
}
}
# print the grid
if( print_grid == TRUE) {
cat("\nInterquartile range grid:\n")
print(round(iqrs, 2))
}
iqrs # return the iqr grid
}
find_best_pair <- function(df, nmax, suffix, title) {
# finds channel pairs having the smallest standardized differences
# df: data frame with TMT channels
# nmax: how many channels/pairs to check
# suffix: suffix to add to channel labels
# title: title string to use in plots
# first do SL normalization
df_sl <- SL_norm(df, suffix)
# compute the ird grid
iqrs <- iqr_grid(df_sl)
# find the nmax number of smallest IQR values
# from https://stackoverflow.com/questions/38664241/ranking-and-counting-matrix-elements-in-r
iqrs_long <- summary(Matrix(iqrs, sparse = TRUE))
iqrs_ordered <- iqrs_long[order(-iqrs_long[, 3], decreasing = TRUE),]
candidates <- iqrs_ordered[1, 1] # starting vector of channels
for (row in 1:nrow(iqrs_long)){
candidates <- c(candidates, iqrs_ordered[row, 1])
candidates <- c(candidates, iqrs_ordered[row, 2])
}
# get the set of indexes and print the best pairs
for (pair in 1:nmax) {
top_n <- sort(unique(candidates)[1:nmax])
i <- iqrs_ordered[pair, 1]
j <- iqrs_ordered[pair, 2]
cat(sprintf("\nPair #%d: (%d, %d) => %0.2f", pair, i, j, iqrs[i, j]))
}
# make the box plots for the top-n pairs
distributions <- list()
for (row in 1:nmax){
distributions[[row]] <- diff(df_sl, iqrs_ordered[row, 1], iqrs_ordered[row, 2])
}
boxplot(do.call(cbind, distributions), main = title)
# multi-panel scatter plots for the top-n channels
cat(pairs.panels(log10(df_sl[top_n]), lm = TRUE, main = title))
}
# nmax is how many top pairs to consider (3 or 4 are recommended values)
top_n <- 4
# find the standards in experiment 1
find_best_pair(exp1_raw, top_n, "_Exp1", "Exp 1")
# experiment 2
find_best_pair(exp2_raw, top_n, "_Exp2", "Exp 2")
This experiment has an extra pooled channel in 3 of the 7 plexes. The pooled channels are generally in channels 10 and 11 in all seven plexes. In experiments 2, 4, and 6, there is an extra channel of the standard in channel 9 (130-C tag).
# experiment 3
find_best_pair(exp3_raw, top_n, "_Exp3", "Exp 3")
# Experiment 4
find_best_pair(exp4_raw, top_n, "_Exp4", "Exp 4")
# Experiment 5
find_best_pair(exp5_raw, top_n, "_Exp5", "Exp 5")
# Experiment 6
find_best_pair(exp6_raw, top_n, "_Exp6", "Exp 6")
# and Experiment 7
find_best_pair(exp7_raw, top_n, "_Exp7", "Exp 7")
Channels 10 and 11 (and also 9 in three experiments) were the correct internal reference channels and that was verified here. The sample key record keeping was correct, at least for the pooled standard channels. Similarities between standard channels were excellent and consistent between experiments. We can safely perform IRS on these data using the 131-N and 131-C channels as the refrences.
# log the session
sessionInfo()