RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Single-cell RNA-seq

Exercise: A Complete Seurat Workflow

Drawn in part from the Seurat vignettes at https://satijalab.org/seurat/vignettes.html

Step 1: Preparation

Note that output files generated here can be downloaded to your local machine using the scp command. For example, for userN to download the precomputed DEG file to the current working directory on your laptop:

scp userN@login1.CBW.calculquebec.cloud:/project/50042/share/RNA_data/scRNA_data/DEGs.Wilcox.PCA.10.precomputed.xls ./

Now, working at the Linux command line in your home/workspace directory on the cluster, create a new directory for your output files called scrna. The full path to this directory will be /home/userN/workspace/scrna/, where N is your assigned userID. The command is:

mkdir scrna

Now start an interactive job that will run for 8 hours, or as long as needed:

salloc --mem 32000M -c 4 -t 8:0:0

Load R 4.0 and Python:

module load r/4.0.0 python/3.7.4

Open R (by typing R at the command line) and load some R libraries as follows:

library("Seurat");
library("sctransform");
library("dplyr");
library("RColorBrewer");
library("ggthemes");
library("ggplot2");
library("cowplot");

Create a vector of convenient sample names, such as “A”, “B”, and “C”:

samples = c("A","B","C");

Create a variable outdir for your output directory. Be sure to use your user ID in place of N:

outdir = "/home/userN/workspace/scrna";

Step 2: Read in the feature-barcode matrices generated by the cellranger pipeline.

data.10x = list(); # first declare an empty list in which to hold the feature-barcode matrices
data.10x[[1]] <- Read10X(data.dir = "~/CourseData/scRNA_data/ND050119_CD34_3pV3/filtered_feature_bc_matrix"); 
data.10x[[2]] <- Read10X(data.dir = "~/CourseData/scRNA_data/ND050119_WBM_3pV3/filtered_feature_bc_matrix");
data.10x[[3]] <- Read10X(data.dir = "~/CourseData/scRNA_data/ND050819_WBM_3pV3/filtered_feature_bc_matrix");

Step 3: Convert each feature-barcode matrix to a Seurat object.

This simultaneously performs some initial filtering in order to exclude genes that are expressed in fewer than 100 cells, and to exclude cells that contain fewer than 700 expressed genes. Note that min.cells=10 and min.features=100 are more common parameters at this stage, but we are filtering more aggressively in order to make the data set smaller. At this step, we also assign a name to the “DataSet” identity for each cell.

scrna.list = list(); # First create an empty list to hold the Seurat objects
scrna.list[[1]] = CreateSeuratObject(counts = data.10x[[1]], min.cells=100, min.features=700, project=samples[1]);
scrna.list[[1]][["DataSet"]] = samples[1];
scrna.list[[2]] = CreateSeuratObject(counts = data.10x[[2]], min.cells=100, min.features=700, project=samples[2]);
scrna.list[[2]][["DataSet"]] = samples[2];
scrna.list[[3]] = CreateSeuratObject(counts = data.10x[[3]], min.cells=100, min.features=700, project=samples[3]);
scrna.list[[3]][["DataSet"]] = samples[3];

Aside: Note that you can do this more efficiently, especially if you have many samples, using a ‘for’ loop:

for (i in 1:length(data.10x)) {
    scrna.list[[i]] = CreateSeuratObject(counts = data.10x[[i]], min.cells=100, min.features=700, project=samples[i]);
    scrna.list[[i]][["DataSet"]] = samples[i];
}

Finally, remove the raw data to save space:

rm(data.10x);

Step 4. Merge the Seurat objects into a single object.

We will call this object scrna. We also give it a project name (here, “CBW”), and prepend the appropriate data set name to each cell barcode. For example, if a barcode from data set “B” is originally AATCTATCTCTC, it will now be B_AATCTATCTCTC. Then clean up some space by removing scrna.list. Finally, save the merged object as an RDS file. Should you need to load this file into R at any time, it can be done using the readRDS command.

scrna <- merge(x=scrna.list[[1]], y=c(scrna.list[[2]],scrna.list[[3]]), add.cell.ids = c("A","B","C"), project="CBW");
rm(scrna.list); # save some space
str(scrna@meta.data) # examine the structure of the Seurat object meta data
saveRDS(scrna, file = sprintf("%s/MergedSeuratObject.rds", outdir));
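
For example, to reload the merged object later:

scrna <- readRDS(sprintf("%s/MergedSeuratObject.rds", outdir)); # reload the saved Seurat object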

Aside on accessing the Seurat object meta data, which is stored in scrna@meta.data

The meta data contains one row per cell. At this point it includes the identity assigned at object creation (orig.ident), the number of UMIs per cell (nCount_RNA), the number of detected genes per cell (nFeature_RNA), and the DataSet label added above; later steps will add QC metrics, cell cycle scores, and cluster assignments.

It can be accessed and queried using commands such as:

scrna[[]];
scrna@meta.data;
str(scrna@meta.data); # Examine structure and contents of meta data
head(scrna@meta.data$nFeature_RNA); # Access genes (“Features”) for each cell
head(scrna@meta.data$nCount_RNA); # Access number of UMIs for each cell:
levels(x=scrna); # List the items in the current default cell identity class 
length(unique(scrna@meta.data$seurat_clusters)); # How many clusters are there?
unique(scrna@meta.data$Batch); # What batches are included in this data set?
scrna$NewIdentity <- vector_of_annotations; # Assign new cell annotations to a new "identity class" in the meta data

Step 5. Quality control plots

Plot the distributions of several quality-control variables in order to choose appropriate filtering thresholds. The number of detected genes and UMIs per cell (nFeature_RNA and nCount_RNA) are calculated automatically by Seurat for every object. However, you will need to calculate the mitochondrial transcript percentage and ribosomal transcript percentage for each cell yourself, and add them to the Seurat object meta data, as shown below.

Calculate the mitochondrial transcript percentage for each cell:

mito.genes <- grep(pattern = "^MT-", x = rownames(x = scrna), value = TRUE);
percent.mito <- Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts')[mito.genes, ]) / Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts'));
scrna[['percent.mito']] <- percent.mito;

Calculate the ribosomal transcript percentage for each cell:

ribo.genes <- grep(pattern = "^RP[SL][[:digit:]]", x = rownames(x = scrna), value = TRUE);
percent.ribo <- Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts')[ribo.genes, ]) / Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts'));
scrna[['percent.ribo']] <- percent.ribo;
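
Aside: recent Seurat versions (v3 and later) also provide the PercentageFeatureSet convenience function, which performs the same calculation. Note that it returns a percentage (0-100) rather than a fraction (0-1), so any filtering thresholds would need to be scaled accordingly; the column names used below are arbitrary.

# Equivalent calculation with PercentageFeatureSet (values on a 0-100 scale, not 0-1):
scrna[["percent.mito.pct"]] <- PercentageFeatureSet(object = scrna, pattern = "^MT-");
scrna[["percent.ribo.pct"]] <- PercentageFeatureSet(object = scrna, pattern = "^RP[SL][[:digit:]]");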

Plot as follows. Plots can be viewed by downloading using the scp command, or in the Jupyter notebook (https://jupyter.cbw.calculquebec.cloud)

jpeg(sprintf("%s/VlnPlot.jpg", outdir), width = 13, height = 6, units="in", res=300);
vln <- VlnPlot(object = scrna, features = c("percent.mito", "percent.ribo"), pt.size=0, ncol = 2, group.by="DataSet");
print(vln);
dev.off();

jpeg(sprintf("%s/VlnPlot.nCount.25Kmax.jpg", outdir), width = 10, height = 10, units="in", res=300)
vln <- VlnPlot(object = scrna, features = "nCount_RNA", pt.size=0, group.by="DataSet", y.max=25000)
print(vln)
dev.off();

jpeg(sprintf("%s/VlnPlot.nFeature.jpg", outdir), width = 10, height = 10, units="in", res=300)
vln <- VlnPlot(object = scrna, features = "nFeature_RNA", pt.size=0, group.by="DataSet")
print(vln)
dev.off()

Here, we use Seurat's FeatureScatter function to create scatterplots of the relationships among QC variables. Note that this is a general-purpose wrapper that can be used to visualize the relationship between any two quantitative variables in the Seurat object (including expression levels, etc.).

jpeg(sprintf("%s/Scatter1.jpg", outdir), width = 8, height = 6, units="in", res=300);
scatter <- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "percent.mito", pt.size=0.1)
print(scatter);
dev.off();

jpeg(sprintf("%s/Scatter2.jpg", outdir), width = 8, height = 6, units="in", res=300);
scatter <- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "percent.ribo", pt.size=0.1)
print(scatter);
dev.off();

jpeg(sprintf("%s/Scatter3.jpg", outdir), width = 8, height = 6, units="in", res=300);
scatter <- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", pt.size=0.1)
print(scatter);
dev.off();

Step 6. Calculate a cell cycle score for each cell

This can be used to determine whether heterogeneity in cell cycle phase is driving the tSNE/UMAP layout and/or clustering. This may or may not be obscuring the signal you care about, depending on your analysis goals and the nature of the data. (If necessary, the cell cycle signal can be removed in a later step.) The list of cell cycle genes and the scoring method were taken from Tirosh I, et al. (2016).

cell.cycle.tirosh <- read.table("/project/50042/share/RNA_data/scRNA_data/CellCycleTirosh.txt", sep='\t', header=FALSE); # read in the list of genes
s.genes = cell.cycle.tirosh$V2[which(cell.cycle.tirosh$V1 == "G1/S")]; # create a vector of S-phase genes
g2m.genes = cell.cycle.tirosh$V2[which(cell.cycle.tirosh$V1 == "G2/M")]; # create a vector of G2/M-phase genes
scrna <- CellCycleScoring(object=scrna, s.features=s.genes, g2m.features=g2m.genes, set.ident=FALSE) 
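
CellCycleScoring adds S.Score, G2M.Score, and Phase columns to the meta data. For example, to count the cells assigned to each phase:

table(scrna@meta.data$Phase); # number of cells called G1, G2M, and S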

Step 7. Filter the cells to remove debris, dead cells, and probable doublets

QUESTION: How many cells are there in each sample before filtering? The ‘table’ function may come in handy.

First calculate some basic statistics on the various QC parameters, which can be helpful for choosing cutoffs. For example:

min <- min(scrna@meta.data$nFeature_RNA);
m <- median(scrna@meta.data$nFeature_RNA)
max <- max(scrna@meta.data$nFeature_RNA)    
s <- sd(scrna@meta.data$nFeature_RNA)
min1 <- min(scrna@meta.data$nCount_RNA)
max1 <- max(scrna@meta.data$nCount_RNA)
m1 <- mean(scrna@meta.data$nCount_RNA)
s1 <- sd(scrna@meta.data$nCount_RNA)
Count93 <- quantile(scrna@meta.data$nCount_RNA, 0.93) # calculate value in the 93rd percentile
print(paste("Feature stats:",min,m,max,s));
print(paste("UMI stats:",min1,m1,max1,s1,Count93)); 

Now, filter the data using the subset function and your chosen thresholds. Note that for large data sets with diverse samples, it may be beneficial to use sample-specific thresholds for some parameters. For the sake of this course, use the following:

scrna <- subset(x = scrna, subset = nFeature_RNA > 700  & nCount_RNA < Count93 & percent.mito < 0.1)

QUESTION: How many cells are there in each sample after filtering?

Step 8. [Optional] Subset the data.

If necessary, you can subset the data set to N cells (2000, 5000, etc) to make it more manageable:

subcells <- sample(Cells(scrna), size=N, replace=F)
scrna <- subset(scrna, cells=subcells)

Step 9. Normalize the data, detect variable genes, and scale the data.

Normalize the data:

scrna <- NormalizeData(object = scrna, normalization.method = "LogNormalize", scale.factor = 1e6);

Identify and plot the most variable genes, which will be used for downstream analyses. Restricting downstream analyses to highly variable genes has been empirically shown to reduce the contribution of noise and to improve identification of rare cell types.

scrna <- FindVariableFeatures(object = scrna, selection.method = 'vst', mean.cutoff = c(0.1,8), dispersion.cutoff = c(1, Inf))
print(paste("Number of Variable Features: ",length(x = VariableFeatures(object = scrna))));

pdf(sprintf("%s/VG.pdf", outdir), useDingbats=FALSE)
vg <- VariableFeaturePlot(scrna)
print(vg);
dev.off()
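
Optionally, label the most highly variable genes on this plot using Seurat's LabelPoints helper, as in the standard Seurat vignettes (repel=TRUE requires the ggrepel package; the output file name here is arbitrary):

top10 <- head(x = VariableFeatures(object = scrna), 10); # the 10 most variable genes
pdf(sprintf("%s/VG.labeled.pdf", outdir), useDingbats=FALSE)
vg.labeled <- LabelPoints(plot = vg, points = top10, repel = TRUE);
print(vg.labeled);
dev.off()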

Scale and center the data:

scrna <- ScaleData(object = scrna, features = rownames(x = scrna), display.progress=FALSE);

Alternatively, you can scale the data and simultaneously remove unwanted signal associated with variables such as cell cycle phase, ribosomal transcript content, etc. (This is slow, and cannot be done in the time allotted for this course.) To remove cell cycle signal, for instance:

scrna <- ScaleData(object = scrna, features = rownames(x = scrna), vars.to.regress = c("S.Score","G2M.Score"), display.progress=FALSE);

Save the normalized, scaled Seurat object:

saveRDS(scrna, file = sprintf("%s/VST.rds", outdir));

DIGRESSION: How can you use Seurat-processed data with packages that are not compatible with Seurat? Other packages may require the data to be normalized in a specific way, and often require an expression matrix (not a Seurat object) as input. As an example, here we prepare an expression data matrix for use with the popular CNV-detection package CONICSmat:

scrna.cnv <- NormalizeData(object = scrna, normalization.method = "RC", scale.factor = 1e5);
data.cnv <- GetAssayData(object=scrna.cnv, slot="data"); # get the normalized data
log2data = log2(data.cnv+1); # add 1 then take log2
df <- as.data.frame(as.matrix(log2data)); # convert it to a data frame
cells <- as.data.frame(colnames(df));
genes <- as.data.frame(rownames(df));
library("data.table"); # needed for fwrite(), which is not among the libraries loaded in Step 1
# save as text files:
fwrite(x = genes, file = "genes.csv", col.names=FALSE);
fwrite(x = cells, file = "cells.csv", col.names=FALSE);
fwrite(x = df, file = "exp.csv", col.names=FALSE);

Step 10. Reduce the dimensionality of the data using Principal Component Analysis.

Subsequent calculations, such as those used to derive the tSNE and UMAP projections, and the k-Nearest Neighbor graph used for clustering, are performed in a new space with fewer dimensions, namely, the principal components. Here, specify a relatively large number of principal components – more than you anticipate using for downstream analyses. Then use several techniques to characterize the components and estimate the number of principal components that captures the signal of interest while minimizing noise.

Perform Principal Component Analysis (PCA), and save the first 30 components:

scrna <- RunPCA(object = scrna, npcs = 30, verbose = FALSE);

OPTIONAL: Then run ProjectDim, which scores each gene in the dataset (including genes not used in the PCA) based on its correlation with the calculated components. This is not used elsewhere in this pipeline, but it can be useful for exploring genes that are not among the 2000 most highly variable genes selected above.

scrna <- ProjectDim(object = scrna)
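
In addition to the two plotting approaches described below, a quick text-based way to see which genes drive each component is to print the PCA reduction directly:

print(x = scrna[["pca"]], dims = 1:5, nfeatures = 10); # top positively and negatively loaded genes for PCs 1-5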

QUESTION: What do the principal components “mean” from a biological standpoint? What genes contribute to the principal components? Do they represent biological processes of interest, or technical variables (such as mitochondrial transcripts) that suggest the data may need to be filtered or scaled differently? (Note: this tool offers a quick and easy way to determine functional enrichment from a list of genes: https://toppgene.cchmc.org/enrichment.jsp)

There are two easy ways to investigate these questions. First, visualize the PCA “loadings.” Each “component” identified by PCA is a linear combination, or weighted sum, of the genes in the data set. Here, the “loadings” represent the weights of the genes in any given component. These plots tell you which genes contribute most to each component:

jpeg(sprintf("%s/VizDimLoadings.jpg", outdir), width = 8, height = 30, units="in", res=300);
vdl <- VizDimLoadings(object = scrna, dims = 1:3)
print(vdl);
dev.off();

Second, use the DimHeatmap function to generate heatmaps that summarize the expression of the most highly weighted genes in each principal component. As noted in the Seurat documentation, “both cells and genes are ordered according to their PCA scores. Setting cells.use to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated gene sets.” (In current Seurat versions the argument is simply cells, as used below.)

jpeg(sprintf("%s/PCA.heatmap.multi.jpg", outdir), width = 8.5, height = 24, units="in", res=300);
hm.multi <- DimHeatmap(object = scrna, dims = 1:12, cells = 500, balanced = TRUE);
print(hm.multi);
dev.off();

Now, decide how many components to use in downstream analyses. This number usually varies from 5-50, depending on the number of cells and the complexity of the data set. Although there is no “correct” answer, using too few components risks missing meaningful signal, and using too many risks diluting meaningful signal with noise.

There are several ways to make an informed decision. The first is to use the principal component heatmaps generated above: components that produce noisy heatmaps likely correspond to noise. The second is to examine a plot of the standard deviations of the principal components, and to choose a cutoff to the left of the bend in this so-called “elbow plot.”

Generate an elbow plot of principal component standard deviations:

elbow <- ElbowPlot(object = scrna)
jpeg(sprintf("%s/PCA.elbow.jpg", outdir), width = 6, height = 8, units="in", res=300);
print(elbow);
dev.off();

Next, use a permutation-based resampling technique called JackStraw analysis to estimate a p-value for each component, print out a plot, and save the p-values to a file:

scrna <- JackStraw(object = scrna, num.replicate = 100, dims=30); # takes around 4 minutes
scrna <- ScoreJackStraw(object = scrna, dims = 1:30)
jpeg(sprintf("%s/PCA.jackstraw.jpg", outdir), width = 10, height = 6, units="in", res=300);
js <- JackStrawPlot(object = scrna, dims = 1:30)
print(js);
dev.off();
pc.pval <- scrna@reductions$pca@jackstraw@overall.p.values; # get p-value for each PC
write.table(pc.pval, file=sprintf("%s/PCA.jackstraw.scores.xls", outdir), quote=FALSE, sep='\t', col.names=TRUE);

Step 11. Generate the t-SNE and UMAP projections.

Use the number of principal components (nPC) you selected above:

nPC = 10;
scrna <- RunUMAP(object = scrna, reduction = "pca", dims = 1:nPC);
scrna <- RunTSNE(object = scrna, reduction = "pca", dims = 1:nPC);

Now, plot the tSNE and UMAP plots next to each other in one figure, and color each data set separately:

jpeg(sprintf("%s/UMAP.%d.jpg", outdir, nPC), width = 10, height = 8, units="in", res=300);
p1 <- DimPlot(object = scrna, reduction = "tsne", group.by = "DataSet", pt.size=0.1)
p2 <- DimPlot(object = scrna, reduction = "umap", group.by = "DataSet", pt.size=0.1)
print(plot_grid(p1, p2));
dev.off();

QUESTION: How do the data sets compare to each other? (We will further investigate these differences in subsequent steps.)

QUESTION: What are the chief sources of variation in this data, as suggested by the t-SNE and UMAP layouts? Are there confounding technical variables that may be driving the layouts? What are some likely technical variables?

Color the t-SNE and UMAP plots by some potential confounding variables. Here’s an example in which we color each cell according to the number of UMIs it contains:

feature.pal = rev(colorRampPalette(brewer.pal(11,"Spectral"))(50)); # a useful color palette
jpeg(sprintf("%s/umap.%d.colorby.UMI.jpg", outdir, nPC), width = 10, height = 8,  units="in", res=300);
fp <- FeaturePlot(object = scrna, features = c("nCount_RNA"), cols = feature.pal, pt.size=0.1, reduction = "umap") + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank()); # the text after the ‘+’ simply removes the axis using ggplot syntax
print(fp);
dev.off();
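
Cell cycle phase is another common confounder. Because Phase is a categorical variable, use DimPlot (grouping by Phase) rather than FeaturePlot; the output file name below is arbitrary:

jpeg(sprintf("%s/umap.%d.colorby.Phase.jpg", outdir, nPC), width = 10, height = 8, units="in", res=300);
p <- DimPlot(object = scrna, reduction = "umap", group.by = "Phase", pt.size=0.1);
print(p);
dev.off();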

QUESTION: What is the relationship between the principal components and the t-SNE/UMAP layout?

To investigate this, plot several principal components on the t-SNE/UMAP, for example the following code plots the first principal component and prints the plot to a file:

jpeg(sprintf("%s/UMAP.%d.colorby.PCs.jpg", outdir, nPC), width = 12, height = 6, units="in", res=100);
redblue=c("blue","gray","red"); # another useful color scheme
fp1 <- FeaturePlot(object = scrna, features = 'PC_1', cols=redblue, pt.size=0.1, reduction = "umap")+ theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank());
print(fp1);
dev.off();

Step 12: Cluster the data and interpret the clusters.

The first step is to generate the k-Nearest Neighbor (KNN) graph using the number of principal components chosen above (nPC). The second step is to partition the graph into communities (clusters) using the Louvain modularity optimization algorithm. At this step, the cluster resolution (cluster.res) may be specified. (Larger values generate more clusters.) While there is no “correct” number of clusters, I recommend erring on the side of too many clusters. For this exercise, please use the following:

nPC = 10;
cluster.res = 0.7;
scrna <- FindNeighbors(object=scrna, dims=1:nPC);
scrna <- FindClusters(object=scrna, resolution=cluster.res); # takes time

The output of FindClusters is saved in scrna@meta.data$seurat_clusters. Note that this is reset each time clustering is performed on the data set. To ensure that each clustering result is saved, save the result as a new identity class, and give it a custom name that reflects the clustering resolution and number of principal components:

scrna[[sprintf("ClusterNames_%.1f_%dPC", cluster.res, nPC)]] <- Idents(object = scrna);

Inspect the structure of the meta data, then save the Seurat object, which now contains t-SNE and UMAP coordinates, and clustering results:

str(scrna@meta.data);
saveRDS(scrna, file = sprintf("%s/VST.PCA.UMAP.TSNE.CLUST.rds", outdir));

QUESTION: How many graph-based clusters are there? (This number is ‘n.graph’ and is used below.) How do they relate to the 2-D layouts?

First plot graph-based clusters on 2-D layouts:

n.graph = length(unique(scrna[[sprintf("ClusterNames_%.1f_%dPC",cluster.res, nPC)]][,1])); # automatically get the number of clusters from a specific clustering run

Or more simply, use the default clustering result:

n.graph = length(unique(scrna@meta.data$seurat_clusters));

rainbow.colors = rainbow(n.graph, s=0.6, v=0.9); # color palette
jpeg(sprintf("%s/UMAP.clusters.jpg", outdir), width = 10, height = 8, units="in", res=300);
p <- DimPlot(object = scrna, reduction = "umap", group.by = "seurat_clusters", cols = rainbow.colors, pt.size=0.1, label=TRUE) + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank());
print(p);
dev.off();

QUESTION: Are there sample-specific clusters? How many cells are in each cluster and each sample? For example:

cluster.breakdown <- table(scrna@meta.data$DataSet, scrna@meta.data$seurat_clusters);

QUESTION: What do the clusters represent? How do they differ from each other? Start with a differential expression analysis:

Step 13: Differential Gene Expression (DEG) analysis.

Perform DEG analysis for each cluster (comparing the cells in that cluster to all other cells) using the default differential expression test (the Wilcoxon rank-sum test), then save the results to a file. This takes too long for the time allotted to this course, so we will read in a pre-computed file of DEGs.

To compute the DEGs and save to a file (which we will NOT do):

DEGs <- FindAllMarkers(object=scrna); # output is a data frame of markers for every cluster. This is slow.
write.table(DEGs, file=sprintf("%s/DEGs.Wilcox.xls", outdir), quote=FALSE, sep="\t", row.names=FALSE)

Read in the file of DEGs:

DEGs <- read.table("/project/50042/share/RNA_data/scRNA_data/DEGs.Wilcox.PCA.10.precomputed.xls", header=TRUE)
head(DEGs) # inspect the structure of the DEG matrix
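
For a quick overview of the strongest markers, you can extract the top DEGs per cluster with dplyr (loaded in Step 1). Note that the log fold-change column is named avg_logFC in Seurat 3.x but avg_log2FC in Seurat 4 and later; adjust the column name to match your file:

top5 <- DEGs %>% group_by(cluster) %>% top_n(n = 5, wt = avg_logFC); # top 5 markers per cluster by log fold-change
head(as.data.frame(top5), 20);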

QUESTION: Do you believe the results? Plot some DEGs using FeaturePlot. What is more reliable or informative: the statistical significance or the log fold-change? Are your thresholds introducing false negatives?

QUESTION: How can we compare specific clusters? (Hint: First set the default identity class using Idents(scrna) <- “seurat_clusters”, then use the FindMarkers function.)
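
For example, a minimal sketch of the hinted approach, comparing two specific clusters (the cluster labels 0 and 1 are placeholders; substitute the clusters you want to compare):

Idents(scrna) <- "seurat_clusters"; # make the graph-based clusters the default identity class
degs.0.vs.1 <- FindMarkers(object = scrna, ident.1 = 0, ident.2 = 1); # DEGs between cluster 0 and cluster 1
head(degs.0.vs.1);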

Step 14. Infer cell types.

There are many sophisticated methods for doing this, but the simplest and most common approach is to plot the expression levels of marker genes for known cell types. A list of some marker genes and immune-related cell types is provided in the file /project/50042/share/RNA_data/scRNA_data/gene_lists_human_180502.csv. Example for GENE1, GENE2, and GENE3:

jpeg(sprintf("%s/geneplot.jpg", outdir), units="in", res=300, height=6, width=6);
fp <- FeaturePlot(object = scrna, features = c("GENE1", "GENE2", "GENE3"), cols = c("gray","red"), ncol=2, reduction = "umap") + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank());
print(fp);
dev.off();
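
A complementary (optional) view is a dot plot of the same marker genes across clusters, which shows both the fraction of cells expressing each gene and its average expression. DotPlot and RotatedAxis are part of Seurat; GENE1-GENE3 are placeholders as above, and the output file name is arbitrary:

jpeg(sprintf("%s/markers.dotplot.jpg", outdir), units="in", res=300, height=6, width=8);
dp <- DotPlot(object = scrna, features = c("GENE1", "GENE2", "GENE3"), group.by = "seurat_clusters") + RotatedAxis();
print(dp);
dev.off();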

QUESTION: What tissue type are we looking at?