# Exploring data normalization and analysis in large TMT experimental designs¶

## Overview:¶

Many factors make data analyses challenging in experiments where more than one set of isobaric tag reagents are used. Even with up to 11 reagents available, many experiments require more replicates/conditions than can fit in one labeling. Three to five conditions (different groups, treatments, times, etc.) each with 3 to 5 replicates results in numbers of samples from 9 to 25. It can be difficult to squeeze a proper study design into just 11 samples.

Any series of complicated measurements is never going to be perfect. There will be many sources of variation in addition to the biological differences between groups. Are the measured differences between groups due to the biology or the other sources of variation? There are many methods for removing unwanted sources of variation such as inaccuracies in protein assays and pipetting errors. These are collectively known as normalization methods.

The underlying concept behind normalization is that some samples (such as technical replicates, or biological replicates within one group) should be similar to each other based on some criteria. If these samples are not as similar as they should be, that extra variance may obscure the biological differences we are trying to see. If we can devise schemes and algorithms (normalizations) to make these similar samples more similar without removing or affecting biological differences, we can improve the power to measure what we want to measure.

A misconception in data normalization is that complicated, multi-step procedures will require just one normalization step. Let's explore a typical TMT experiment. We start with some biological experiment where each biological condition is well defined and that some number of replicates in each condition have been defined. We assume that any treatments and/or time courses were all done correctly. We assume that the protein samples were collected and processed carefully. Just getting to the "start" of a TMT labeling requires many steps that took effort to optimize.

Having the set of protein samples is the starting point for the labeling protocol. We have to know how much protein is in each sample to aliquot the same amount of protein from each sample (this is effectively the first normalization step). That requires protein assays and pipetting. The proteins must be digested into peptides before labeling. Sometimes a peptide assay is done after digestion to make sure that the digestion was okay (this might result in another normalization step). The digested peptides are labeled with the TMT reagents. Sometimes the amounts of labeled peptides in each channel are checked with a short mass spec run to make sure the labeling reaction worked okay, and aliquot adjustments made so that the final mixture is more similar between channels (another normalization step). Next is the full-scale LC/MS experiment (often some sort of multi-dimensional separation) where deep sampling of the proteome is a typical goal. The total signal per channel can again be checked for consistency. Despite upstream corrections, the measured signal totals are seldom as identical as expected and will typically require additional adjustments during data analysis (more normalization steps). We will see below how the random sampling selection of peptides for MS2 analysis makes multiple TMT experiments difficult to combine without using a special IRS correction procedure (a specialized normalization step).

If we successfully get to the point where we have a table of protein expression values for a set of biological samples, we may still have compositional bias that needs to be corrected. Since we process similar total amounts of material (protein or mRNA), changes in expression of a small number of highly abundant proteins can make all of the rest of the protein appear to be down regulated. Depending on your point of view, the data can be a few abundant proteins increasing in expression and other proteins being unchanged, or a large number of proteins decreasing in expression and a small number of abundant proteins being mostly unchanged. Normalization methods, such as the trimmed mean of M values (TMM), developed for RNA-Seq data, can remove compositional bias:

Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.

To summarize, there are many experimental and analysis steps in these complicated measurements, and many have implicit or explicit associated normalizations. There are many software normalization algorithms that have been used in proteomics and an even larger number that have been developed for genomics. One strategy is to make the "size" of every sample the same since an equal amount of mRNA or protein is usually processed for each sample. Another strategy is to take something specific and make that "the same" between each sample. These "landmarks" can be the levels of housekeeping genes, the average or median signal, or the average or median expression ratio. These normalizations typically use a single multiplicative factor to adjust the samples to each other.

There are many other more complicated methods such as cyclic loess or quantile normalization that have more degrees of freedom and apply more complicated adjustment factors. In genomics, platform variability is well known, and this can result in rather interesting distortions between datasets that are referred to as "batch" effects. In many cases, these distortions can be recognized and corrected with more elaborate batch correction methods. Data can be corrected for these batch effects by normalization steps before statistical modeling is done. Alternatively, some statistical packages can incorporate batch factors into their models and may perform better with uncorrected data.

We are not doing a review of all possible normalization methods, though. We just want to learn how to analyze experiments using isobaric labeling when the study design needs more than 10 or 11 samples, and multiple TMT experiments have to be combined. We will explore how to analyze these experiments using the internal reference scaling (IRS) methodology that was first described in this publication:

Plubell, D.L., Wilmarth, P.A., Zhao, Y., Fenton, A.M., Minnier, J., Reddy, A.P., Klimek, J., Yang, X., David, L.L. and Pamir, N., 2017. Extended multiplexing of tandem mass tags (TMT) labeling reveals age and high fat diet specific proteome changes in mouse epididymal adipose tissue. Molecular & Cellular Proteomics, 16(5), pp.873-890.

Plubell, et al. introduced a novel normalization methodology (IRS) capable of correcting the random MS2 sampling that occurs between TMT experiments. Analytes are sampled during their elution profile and the measured reporter ion signal intensity depends on the level of the analyte at the time of the sampling. This MS2 selection process can be modeled as a random sampling process:

Liu, H., Sadygov, R.G. and Yates, J.R., 2004. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical chemistry, 76(14), pp.4193-4201.

The random MS2 sampling affects the overall intensity of reporter ions, but not their relative intensities. The relative expression pattern of reporter ions is robust with respect to MS2 sampling. This random MS2 sampling creates a source of variation that is unique to isobaric tagging experiments. As we will see below, failure to correct for this source of variation makes combining data from multiple TMT experiments practically impossible.

## Example dataset:¶

When testing new methodologies, it is helpful to work on a dataset where something is already known, and some safe assumptions can be made. This paper is a developmental time course study of mice lens and has publicaly available data:

Khan, S.Y., Ali, M., Kabir, F., Renuse, S., Na, C.H., Talbot, C.C., Hackett, S.F. and Riazuddin, S.A., 2018. Proteome Profiling of Developing Murine Lens Through Mass Spectrometry. Investigative Ophthalmology & Visual Science, 59(1), pp.100-107.

There were 2 embryonic time points and 4 post-natal time points (E15, E18, P0, P3, P6, and P9) each three days apart that were TMT labeled (6 channels in each experiment). There were many lenses pooled at each time point and the experiments were repeated three times as three independent TMT experiments. The instrument was a Thermo Fusion Lumos and the SPS MS3 method was used to get reporter ion signals with reduced interference.

There were no common channels (standards or repeated samples) present in the three TMT experiments. In the publication by Plubell et al. where the IRS method was first described, duplicate channels of the same pooled reference mixtures were run in each TMT experiment and were used to match the protein reporter ion intensities between TMT experiments. That is the preferred experimental design. We will try and "mock up" a reference channel for each TMT experiment and perform a mock IRS method.

These are inbred laboratory mice and there are several lenses pooled at each time point in each TMT experiment. We can expect the three replicates at each time point to be pretty similar to each other (assuming good sample collection and processing). The same set of 6 time points were in each TMT experiment. If we average all 6 time points in each TMT experiment, this should be a good "mock" reference channel to use in the IRS procedure.

We know a lot about mouse lenses from years of study. The central part of the lens is composed of fiber cells that contain a dramatic over-expression of a small number of lens-specific proteins called crystallins. Fiber cells are not metabolically active, no longer have organelles, and there is no cellular turnover in the lens. There is a monolayer of metabolically active epithelial cells on the anterior face of the spherical shaped lens. There is an outer region of the lens, called the cortex, where the cells undergo differentiation from epithelial cells to mature fiber cells. One simple model of lens protein expression with time is that the complement of cellular proteins from the epithelial cells (and some cortical cells) will be progressively diluted out by the accumulated crystallins. In mature lenses, the crystallins make up roughly 90% of the lens.

Supplemental File 1 in the above study contains the total protein reporter ion intensities for each protein (the rows) in each experiment (6 channels per experiment for 18 columns). The data were processed with Proteome Discoverer 2.1 (PD) and exported as text files. A mouse Swiss-Prot database was used to keep redundant peptides to a minimum. Maximum parsimony was used to group proteins. Peak height intensities (not the PD default signal/noise ratios) were used for the reporter ions. Note that a signal-to-noise ratio is a compressed unitless number and is not a valid quantitative measurement. PD sums all usable PSM (peptide spectrum match) reporter ion signals into total protein signals for each TMT experiment (summing over any first-dimension fractions).

## Load libraries and the dataset:¶

R script development was initially done using RStudio. This document was created as a Jupyter notebook using an R kernel. R libraries will need to be installed before loading the dataset. We will only work with proteins where reporter ion values were observed in all 3 experiments. We will do some column name cleanup and get some of the annotation information (gene symbols and protein accessions) set aside to add back later.

In [139]:
# Analysis of IOVS mouse lens data (Supplemental Table S01):
# Khan, Shahid Y., et al. "Proteome Profiling of Developing Murine Lens Through Mass Spectrometry."
# Investigative Ophthalmology & Visual Science 59.1 (2018): 100-107.

# load libraries (this gets ggplot2 and related libraries)
library(tidyverse)

# these are from Bioconductor
library(limma)
library(edgeR)
library(sva)

# we use this for the pairwise correlation plots
library(psych)

# read the Supplemental 01 file (saved as a CSV export from XLSX file)

# filter out proteins not seen in all three runs
data_no_na <- na.omit(data_start)

# save the annotations (gene symbol and protein accession) and remove from data frame
annotate_df <- data_no_na[1:2]
data_raw <- as.data.frame(data_no_na[3:20])
row.names(data_raw) <- annotate_df\$Protein Accession No.

Parsed with column specification:
cols(
.default = col_double(),
Gene Symbol (NCBI) = col_character(),
Protein Accession No. = col_character()
)
See spec(...) for full column specifications.


### Let's see what we have in the data frame:¶

In [140]:
head(data_raw)

E15_Set1E18_Set1P0_Set1P3_Set1P6_Set1P9_Set1E15_Set2E18_Set2P0_Set2P3_Set2P6_Set2P9_Set2E15_Set3E18_Set3P0_Set3P3_Set3P6_Set3P9_Set3
P246221069664652 1457114681 1179045944 1455372629 1289042085 1407632648 794413496.8948014907.01937604914 1869740434 2319811405 1979600014 504923157.8799957030 694554606 815374345 1120186978 952547300
Q9JJU9 526906032 784281820 534395113 600416635 569878248 647355498 194176625.9197270382.6 460870620 331488498 478119986 367960325 296834237.4288048268 245193675 267307062 431955394 393248818
Q9WVJ5 586695635 792968077 553232976 626183065 528280042 578745749 480706341.9350268581.81018230717 600709392 851756335 674769074 411187998.3472248038 337236872 342318702 616158187 521683037
P04345 315209709 488623202 362367811 433275153 396120791 445134422 181105752.5202759948.5 416298217 460506970 556686630 450709511 140301513.8235519864 218656765 244302151 333440543 284491654
P62696 30064315 43240455 25447349 95541340 199771510 368264651 970485.4 965249.9 13289066 60115071 184639333 292264145 413635.3 2577753 5953925 29244166 84402818 134267384
Q61597 121970802 209140086 189655080 201060109 237678671 309076418 29836712.7 27384410.0 78502992 78695122 139666562 131618214 45393640.0 94515836 92566739 117465776 175146797 170114753

## What do the raw (un-normalized) data distributions look like?¶

We will use box plots, a common way to summarize distributions, and distribution density plots to visualize the Log base 2 of the intensities. If the data distributions do not have well aligned box plots, the need for normalization would be indicated. The 6 samples from each TMT experiment are adjacent and in the same color, and each of the three TMT experiments are in different colors (red, green, and blue).

In [141]:
# let's see what the starting data look like
boxplot(log2(data_raw), col = rep(c('red', 'green', 'blue'), each = 6),
notch = TRUE, main = 'RAW data: Exp1 (red), Exp2 (green), Exp3 (blue)',
xlab = 'TMT Samples', ylab = 'log2 of Intensity')

# can also look at density plots (like a distribution histogram)
plotDensities(log2(data_raw), col = rep(c('red', 'green', 'blue'), 6),
main = 'Raw data')

In [142]:
# check the column totals (per channel sums)
format(round(colSums(data_raw), digits = 0), big.mark = ",")

E15_Set1
'15,454,863,217'
E18_Set1
'14,852,067,937'
P0_Set1
' 9,946,167,870'
P3_Set1
'10,047,936,585'
P6_Set1
' 8,553,796,614'
P9_Set1
' 9,083,779,828'
E15_Set2
'11,952,290,485'
E18_Set2
' 9,749,057,664'
P0_Set2
'14,558,974,664'
P3_Set2
' 9,631,496,297'
P6_Set2
'12,401,976,363'
P9_Set2
'10,883,266,433'
E15_Set3
' 8,997,954,961'
E18_Set3
' 6,613,312,876'
P0_Set3
' 5,462,406,449'
P3_Set3
' 5,075,004,896'
P6_Set3
' 7,313,410,134'
P9_Set3
' 6,298,792,787'

## Do we need to do some data normalization?¶

The above box plots and density distributions are not well aligned. Let's check the total signal in each channel for the 18 samples (below). There was supposed to be the same amount of protein labeled in each sample. We should have the total signals in each channel summing to the same value. We have significant differences that we can correct with some basic normalizing. We can average the numbers below and compute normalization factors to make the sums end up the same.

In [143]:
# separate the TMT data by experiment
# we do not need to do this for the normalization factor calculation here,
# but we will need these data frames for the IRS step below.
exp1_raw <- data_raw[c(1:6)]
exp2_raw <- data_raw[c(7:12)]
exp3_raw <- data_raw[c(13:18)]

# first basic normalization is to adjust each TMT experiment to equal signal per channel
# figure out the global scaling value
target <- mean(c(colSums(exp1_raw), colSums(exp2_raw), colSums(exp3_raw)))

# there is a different correction factor for each column
norm_facs <- target / colSums(exp1_raw)
exp1_sl <- sweep(exp1_raw, 2, norm_facs, FUN = "*")
norm_facs <- target / colSums(exp2_raw)
exp2_sl <- sweep(exp2_raw, 2, norm_facs, FUN = "*")
norm_facs <- target / colSums(exp3_raw)
exp3_sl <- sweep(exp3_raw, 2, norm_facs, FUN = "*")