This is re-analysis of the data from "Proteogenomics and Hi-C reveal transcriptional dysregulation in high hyperdiploid childhood acute lymphoblastic leukemia" published in Nature Communications April 3, 2019.
Yang, M., Vesterlund, M., Siavelis, I., Moura-Castro, L.H., Castor, A., Fioretos, T., Jafari, R., Lilljebjörn, H., Odom, D.T., Olsson, L. and Ravi, N., 2019. Proteogenomics and Hi-C reveal transcriptional dysregulation in high hyperdiploid childhood acute lymphoblastic leukemia. Nature communications, 10(1), p.1519.
The paper is comparing samples from pediatric B-cell precursor acute lymphoblastic leukemia (BCP ALL) that were labeled with 10-plex TMT and analyzed on a Q-Exactive instrument. There were two leukemia conditions: high hyperdiploid (18 samples) and diploid/near-diploid ETV6/RUNX1-positive cases (9 samples). Samples were allocated in a balanced design with 6 and 3 samples per 10-plex, respectively. A pooled standard sample was also created and added to each plex (131N channel).
Digestion was via an eFASP protocol, peptide cleanup with an sp3 method, and TMT labeling was according to manufacturer's recommendations.
Each plex was first separated into 72 fractions by isoelectric point and each fraction run in a 60-min RP gradient. The QE was run in a top-10 mode at 70K resolution. MS2 HCD scans were acquired at 35K resolution.
Peptides and proteins were identified using Comet and the PAW pipeline. A wider 1.25 Da parent ion mass tolerance was used, TMT labels and alkylated cysteine were specified as static modifications, oxidation of methionine was specified as a variable modification, trypsin enzyme specificity was used, and a canonical UniProt reference human protein database was used. Fragment ion tolerance was set for high resolution MS2: 0.02 Da, offset of 0.0. Confident peptide identifications were obtained using accurate mass conditional score histograms, the target/decoy method, and the protein inference used basic parsimony principles.
The PAW pipeline was used to infer proteins, perform homologous protein grouping, establish peptide uniqueness to the inferred proteins, and sum unique PSM reporter ions into protein intensity totals.
Results tables were filtered to remove common contaminants and the protein intensity tables were saved as tab-delimited text files for reading by the scripts below. Normalizations and statistical testing were performed using the Bioconductor package edgeR as detailed in the steps below. A Jupyter notebook with an R kernel was used to execute R commands and visualize the results.
Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.
Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.
Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.
Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.
www.jupyter.org
# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
# get the default plot width and height
width <- options()$repr.plot.width
height <- options()$repr.plot.height
The pandas Python script that does the IRS normalization arranges the tab-delimited table so that importing into R is straightforward. The dplyr package from the tidyverse (https://r4ds.had.co.nz/) makes it easy to separate the data by biological condition, and to look at the data before and after the Internal Reference Scaling (IRS) normalization procedure. We can do some sanity checks with the pooled internal standard channels. We know that those should be highly similar within each TMT experiment and that IRS should make them also very similar between TMT experiments.
This study design only used a single reference channel. Any uncertainties in that single measurement in each plex propagate into the biological samples. The study design was balanced with 6 HeH and 3 ETV6-RUNX1 samples in each plex. We can also use the average intensities across the 10 channels in each plex as a mock reference channel. The approach that yields the better CV distributions within each condition will be the winner in this case.
# load the IRS-normalized data and check the table
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_9_AVE_IRS_normalized.txt", guess_max = 10326)
# the "Filter" column flags contams and decoys
# the "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the prepped table from pandas is sorted so these are the upper rows
data_all <- filter(data_import, is.na(Filter), is.na(Missing))
# save gene names for edgeR so we can double check that results line up
accessions <- data_all$Accession
# see how many rows in the table
nrow(data_all)
# we want to get the SL normed columns, and subsetted by condition
sl_all <- data_all %>%
select(starts_with("SLNorm"))
sl_HeH <- sl_all %>% select(contains("_HeH_"))
sl_ETV6 <- sl_all %>% select(contains("_ETV6-RUNX1_"))
# and the IRS normed columns by condition
irs_all <- data_all %>%
select(starts_with("IRSNorm"))
irs_HeH <- irs_all %>% select(contains("_HeH_"))
irs_ETV6 <- irs_all %>% select(contains("_ETV6-RUNX1_"))
# and collect the pooled channels before and after IRS
sl_pool <- sl_all %>% select(contains("QC"))
irs_pool <- irs_all %>% select(contains("QC"))
# multi-panel scatter plot grids from the psych package
pairs.panels(log2(sl_pool), lm = TRUE, main = "Pooled Std before IRS")
pairs.panels(log2(irs_pool), lm = TRUE, main = "Pooled Std after IRS")
The random sampling of MS2 scans is why the before IRS scatter plots are horrible. We are using the average of all 10 channels (which includes the pooled channel) for the IRS method in this notebook. We can use the pooled standard channel to check the IRS correction. The IRS method dramatically reduces the scatter in the data. Remember that this is an identical sample added to each plex. We should expect to get the same protein intensities in all plexes.
Since we have a lot of samples, we will randomly look at a few.
# multi-panel scatter plot grids
heh_sample <- sample(1:18, 5)
pairs.panels(log2(sl_HeH[heh_sample]), lm = TRUE, main = "HeH before IRS (random 5)")
pairs.panels(log2(irs_HeH[heh_sample]), lm = TRUE, main = "HeH after IRS (same 5)")
# multi-panel scatter plot grids
etv6_sample <- sample(1:9, 5)
pairs.panels(log2(sl_ETV6[etv6_sample]), lm = TRUE, main = "ETV6-RUNX1 before IRS (random 5)")
pairs.panels(log2(irs_ETV6[etv6_sample]), lm = TRUE, main = "ETV6-RUNX1 after IRS (same 5)")
A multi-dimensional scaling plot (similar to hierarchical clustering or PCA plots) is a good way to check the data. We should expect the samples to group by condition. We can do the clustering before and after IRS to verify that we are able to recovery the true biological differences between groups. We need to load the data into some edgeR data structures and make a couple of function calls to generate the cluster plots.
# get the biological sample data into a DGEList object
group = c(rep('HeH', 18), rep('ETV6', 9))
y_sl <- DGEList(counts = cbind(sl_HeH, sl_ETV6), group = group, genes = accessions)
y_irs <- DGEList(counts = cbind(irs_HeH, irs_ETV6), group = group, genes = accessions)
# run TMM normalization (also includes a library size factor)
y_sl <- calcNormFactors(y_sl)
y_irs <- calcNormFactors(y_irs)
# set some colors by condition
colors = c(rep('red', 18), rep('blue', 9))
# check the clustering
plotMDS(y_sl, col = colors, main = "SL: all samples")
plotMDS(y_irs, col = colors, main = "IRS: all samples")
The top MDS plot shows the first TMT experiment on the left and the second experiment on the right. After IRS, (bottom plot) we have the biological-condition clustering.
We need to get the data into some edgeR objects to work with. We do not want to include the pooled standard channels with the biological sample channels for the edgeR testing. The pooled standards are technical replicates and have a very different variance behavior compared to the biological replicates. We can call the calcNormFactors function to perform library size and the trimmed mean of M-values (TMM) normalization. EdgeR combines these two normalizations into one function call. The library size corrections are somewhat redundant because that was done as part of the IRS procedure. The TMM normalization was designed for 'omics data and it makes sense to apply that after we have done the IRS correction. The TMM normalization corrects for compositional differences between biological conditions. If some abundant proteins change expression, since we have equal total amounts of protein per sample, that will make all other protein abundances change to compensate. TMM normalization will detect and correct these situations.
Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.
# we do not want the technical replicates in the mix for dispersion estimates
irs <- cbind(irs_HeH, irs_ETV6)
# load a new DGEList object (need to update the groups)
y <- DGEList(counts = irs, group = group, genes = accessions) # group was set above
y <- calcNormFactors(y)
# see what the normalization factors look like
y$samples
Unless we have biological conditions with large compositional differences, we expect the library sizes and normalization factors after the calcNormFactors call to be around 1.0.
The visualizations in this notebook are informative, but we ultimately have to share the experimental results with the scientific community. It is important to be able to examine the underlying data that generates all of these plots. We can use the edgeR normalization factors to produce the TMM normalized data that is what the statistical testing will be working with. EdgeR uses the normalization factors in its statistical modeling but does not output the normalized intensities. We have to compute the normalized intensities by hand below. We will save the normalized data as the first part of a results data frame that we can write out at the end of the analysis.
# Compute the normalized intensities (start with the IRS data)
# sample loading adjusts each channel to the same average total
lib_facs <- mean(colSums(irs)) / colSums(irs)
# print("Sample loading normalization factors")
print("Library size factors")
round(lib_facs, 4)
# the TMM factors are library adjustment factors (so divide by them)
norm_facs <- lib_facs / y$samples$norm.factors
# print these final correction factors
print("Combined (lib size and TMM) normalization factors")
round(norm_facs, 4)
# compute the normalized data as a new data frame
irs_tmm <- sweep(irs, 2, norm_facs, FUN = "*")
colnames(irs_tmm) <- str_c(colnames(irs), "_TMMnorm") # add suffix to col names
# head(results) # check that the column headers are okay
Box plots for well normalized data should be similar in size and the medians should align with each other. There can be problems with the data that may not be apparent with boxplots (some data distortions can average out and not appear different with boxplots). Regardless, we should have boxplots with good median alignment. Good boxplot behavior is a necessary but not sufficient requirement of proper normalization.
We will use ggplot2 this time and flip the graph on its side, so we can read the labels easier. We need tidy (long form) data for ggplot:
long_results <- gather(irs_tmm, key = "sample", value = "intensity") %>%
mutate(log_int = log10(intensity)) %>%
extract(sample, into = 'group', ".*?_(.*?)_", remove = FALSE)
head(long_results)
ggplot(long_results, aes(x = sample, y = log_int, fill = group)) +
geom_boxplot(notch = TRUE) +
coord_flip() +
ggtitle("edgeR normalized data")
ggplot2 groups the distributions by the sample groups and does the colors automatically. The standard boxplot is easy to call. We have samples grouped and the color vector already defined.
# look at normalized intensity distributions for each sample
boxplot(log10(irs_tmm), col = colors,
xlab = 'TMT samples', ylab = 'log10 Intensity',
main = 'edgeR normalized data', notch = TRUE)
Boxplots and density plots are closely related. We can easily see the smoothed density distributions for each sample.
ggplot(long_results, aes(x = log_int, color = sample)) +
geom_density() +
guides(color = FALSE) +
ggtitle("edgeR normalized data (with legend is too busy)")
The distributions of Coefficients of Variation (CVs) are another way to get an idea of how individual proteins are behaving. This seems to be an effective way to assess proper normalization in these experiments. We will compute CV distributions for each of the two biological conditions.
# we can compare CVs before and after IRS
sl <- cbind(sl_HeH, sl_ETV6)
# save column indexes for different conditions (indexes to data_raw frame)
# these make things easier (and reduce the chance for errors)
HeH <- 1:18
ETV6 <- (1:9) + 18
# create a CV computing function
CV <- function(df) {
ave <- rowMeans(df)
sd <- apply(df, 1, sd)
cv <- 100 * sd / ave
}
# put CVs in data frames to simplify plots and summaries
cv_frame <- data.frame(HeH_sl = CV(sl[HeH]), HeH_final = CV(irs_tmm[HeH]),
ETV6_sl = CV(sl[ETV6]), ETV6_final = CV(irs_tmm[ETV6]))
# see what the median CV values are
medians <- apply(cv_frame, 2, FUN = median)
print("Median CVs by condition, before/after IRS (%)")
round(medians, 1)
# see what the CV distibutions look like
# need long form for ggplot
long_cv <- gather(cv_frame, key = "condition", value = "cv") %>%
extract(condition, into = 'group', "(.*?)_+", remove = FALSE)
# traditional boxplots
cv_plot <- ggplot(long_cv, aes(x = condition, y = cv, fill = group)) +
geom_boxplot(notch = TRUE) +
ggtitle("CV distributions")
# vertical orientation
cv_plot
# horizontal orientation
cv_plot + coord_flip()
# density plots
ggplot(long_cv, aes(x = cv, color = condition)) +
geom_density() +
coord_cartesian(xlim = c(0, 150)) +
ggtitle("CV distributions")
The CVs improve with IRS and TMM normalizations. The median CVs get smaller, but the distributions of the CV values are much improved with considerably smaller inter-quartile ranges. We get a little better final median CVs (about 1% lower) using the plex averages compared to the single pooled standard.
We have used scatter plots, correlation coefficients, clustering, and boxplots to verify that the IRS procedure and TMM normalization behaved as expected. None of the biological samples needed any excessive normalization factors or appeared as an outlier in the cluster views. Everything looks ready for statistical testing with edgeR.
One of the most powerful features of edgeR (and limma) is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 27 samples across all conditions to use to improve the variance estimates and reduce false positive differential expression (DE) candidates. We have an edgeR estimateDisp function that does all of this and a visualization function to check the result.
We loaded the IRS data into DGEList object "y" a few cells above and did the normalization step. We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test will take two conditions and compare them using the normalization factors and dispersion estimates saved in "y".
# compute dispersions and plot BCV
y <- estimateDisp(y)
plotBCV(y, main = "BCV plot of IRS normed, TMM normed, all 27")
We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The decideTestsDGE call will tell us how many up and down regulated candidates we have at an FDR of 0.10. The topTags call does the Benjamini-Hochberg multiple testing corrections. We save the test results in tt
. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates, and then check the p-value distribution.
# the exact test object has columns like fold-change, CPM, and p-values
et <- exactTest(y, pair = c("HeH", "ETV6"))
# this counts up, down, and unchanged genes (proteins) at 10% FDR
summary(decideTestsDGE(et, p.value = 0.10))
# the topTags function adds the BH FDR values to an exactTest data frame
# make sure we do not change the row order (the sort.by parameter)!
topTags(et, n = 25)
tt <- topTags(et, n = Inf, sort.by = "none")
tt <- tt$table # tt is a list. We just need the "table" data frame
# make an MD plot (like MA plot)
plotMD(et, p.value = 0.10)
abline(h = c(-1, 1), col = "black")
# check the p-value distribution
ggplot(tt, aes(PValue)) +
geom_histogram(bins = 100, fill = "white", color = "black") +
geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100,
plot = FALSE)$counts[26:100])) +
ggtitle("HeH vs ETV6 PValue distribution")
The numbers of candidates are large and balanced between up and down regulation, as can be seen in the MA plot. The p-value distribution has a nice flat distribution for larger p-values (from non-DE candidates) and a sharp spike at small p-values from true DE candidates.
We end up with about 100 more candidates in each direction here with the 10-channel averages compared to using just the single pooled standard in the IRS method.
We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results
data frame (which has the TMM-normalized data).
# get the averages within each condition
# results already has the normalized data in its left columns
tt$ave_HeH <- rowMeans(irs_tmm[HeH])
tt$ave_ETV6 <- rowMeans(irs_tmm[ETV6])
# add the cadidate status column
tt <- tt %>%
mutate(candidate = cut(FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
labels = c("high", "med", "low", "no")))
tt %>% count(candidate) # count candidates
ggplot(tt, aes(x = logFC, fill = candidate)) +
geom_histogram(binwidth=0.1, color = "black") +
facet_wrap(~candidate) +
coord_cartesian(xlim = c(-4, 4)) +
ggtitle("HeH vs ETV6-RUNX1 logFC distributions by candidate")
# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
# Computes new columns and extracts some columns to make results frame
# df - data in data.frame
# tt - top tags table from edgeR test
# x - columns for first condition
# xlab - label for x
# y - columns for second condition
# ylab - label for y
# returns a new dataframe
# condition average vectors
ave_x <- rowMeans(df[x])
ave_y <- rowMeans(df[y])
# FC, direction, candidates
fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
direction <- ifelse(ave_y > ave_x, "up", "down")
candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
labels = c("high", "med", "low", "no"))
# make data frame
temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc,
PValue = tt$PValue, FDR = tt$FDR,
ave_x = ave_x, ave_y = ave_y,
direction = direction, candidate = candidate,
Acc = tt$genes))
# fix column headers for averages
names(temp)[names(temp) %in% c("ave_x", "ave_y")] <- str_c("ave_", c(xlab, ylab))
temp # return the data frame
}
# get the results
results <- collect_results(irs, tt, HeH, "HeH", ETV6, "ETV6")
We have many comparisons to visualize, so we will make a function to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color.
transform <- function(results, x, y) {
# Make data frame with some transformed columns
# results - results data frame
# x - columns for x condition
# y - columns for y condition
# return new data frame
df <- data.frame(log10((results[x] + results[y])/2),
log2(results[y] / results[x]),
results$candidate,
-log10(results$FDR))
colnames(df) <- c("A", "M", "candidate", "P")
df # return the data frame
}
MA_plots <- function(results, x, y, title) {
# makes MA-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# returns a list of plots
# uses transformed data
temp <- transform(results, x, y)
# 2-fold change lines
ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))
# make main MA plot
ma <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
scale_x_continuous("Ave_intensity") +
ggtitle(title) +
ma_lines
# make separate MA plots
ma_facet <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
scale_x_continuous("log10 Ave_intensity") +
ma_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
# make the plots visible
print(ma)
print(ma_facet)
}
scatter_plots <- function(results, x, y, title) {
# makes scatter-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# returns a list of plots
# 2-fold change lines
scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
scale_y_log10(),
scale_x_log10())
# make main scatter plot
scatter <- ggplot(results, aes_string(x, y)) +
geom_point(aes(color = candidate, shape = candidate)) +
ggtitle(title) +
scatter_lines
# make separate scatter plots
scatter_facet <- ggplot(results, aes_string(x, y)) +
geom_point(aes(color = candidate, shape = candidate)) +
scatter_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
# make the plots visible
print(scatter)
print(scatter_facet)
}
volcano_plot <- function(results, x, y, title) {
# makes a volcano plot
# results - a data frame with edgeR results
# x - string for the x-axis column
# y - string for y-axis column
# title - plot title string
# uses transformed data
temp <- transform(results, x, y)
# build the plot
ggplot(temp, aes(x = M, y = P)) +
geom_point(aes(color = candidate, shape = candidate)) +
xlab("log2 FC") +
ylab("-log10 FDR") +
ggtitle(str_c(title, " Volcano Plot"))
}
# make the DE plots
MA_plots(results, "ave_HeH", "ave_ETV6", "HeH vs ETV6/RUNX1")
scatter_plots(results, "ave_HeH", "ave_ETV6", "HeH vs ETV6/RUNX1")
volcano_plot(results, "ave_HeH", "ave_ETV6", "HeH vs ETV6/RUNX1")
The fold-changes for candidates are not particularly large. Many of the statistically significant proteins have less than 2-fold changes. This could be due to using MS2 reporter ions instead of the newer SPS MS3 method. There are more candidates with larger fold changes that have over-expression in ETV6-RUNX1. The candidates are well spread out over the 5-decades of protein intensities.
Note: Protein total reporter ion intensity is well correlated with protein total MS1 feature intensites. In other words, it is a good relative protein abundance measure.
The HeH data points are in red and the ETV6-RUNX1 samples are in blue. The protein identifier, the average intensity, the individual test p-value (not the FDR), and the traditional fold-change are used as plot labels.
# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
identifier <- str_split(accession, "\\|", simplify = TRUE)
identifier[,3]
}
set_plot_dimensions <- function(width_choice, height_choice) {
options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}
plot_top_tags <- function(results, nleft, nright, top_tags) {
# results should have data first, then test results (two condition summary table)
# nleft, nright are number of data points in each condition
# top_tags is number of up and number of down top DE candidates to plot
# get top ipregulated
up <- results %>%
filter(logFC >= 0) %>%
arrange(FDR)
up <- up[1:top_tags, ]
# get top down regulated
down <- results %>%
filter(logFC < 0) %>%
arrange(FDR)
down <- down[1:top_tags, ]
# pack them into one data frame
proteins <- rbind(up, down)
color = c(rep("red", nleft), rep("blue", nright))
for (row_num in 1:nrow(proteins)) {
row <- proteins[row_num, ]
vec <- as.vector(unlist(row[1:(nleft + nright)]))
names(vec) <- colnames(row[1:(nleft + nright)])
title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2),
", p-val: ", scientific(row$FDR, digits = 3),
", FC: ", round(row$FC, digits = 1))
barplot(vec, col = color, main = title)
}
}
# set plot size, make plots, reset plot size
set_plot_dimensions(6, 4)
plot_top_tags(results, length(HeH), length(ETV6), 25)
set_plot_dimensions(width, height)
results
frame to TSV file¶write.table(results, "IRS_R_averages_results.txt", sep = "\t",
row.names = FALSE, na = " ")
sessionInfo()