Welcome to the blog

Posts

My thoughts and ideas

Differential expression analysis full | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Differential expression analysis full


In this section we will use the previously generated Seurat object that has gone through the various preprocessing steps, clustering, and celltyping, and use it for gene expression and differential expression analyses. We will carry out two sets of differential expression analyses in this module. Firstly, since we know that the tumor cells should be epithelial cells, we will begin by trying to identify epithelial cells in our data using expression of Epcam as a marker. Subsequently, we will carry out a DE analysis within the Epcam-positive population(s). Secondly, we will compare the T cell populations of mice treated with ICB therapy against mice with their T cells depleted that underwent ICB therapy to determine differences in T cell phenotypes. We will also briefly explore pseudobulk differential expression analysis. Pseudobulk DE analysis may be more robust in situations where one is comparing conditions and has multiple replicates in their experiment.


Read-in the saved seurat object from the previous step if it is not already loaded in your current R session.

library(Seurat)
library(dplyr)
library(EnhancedVolcano)
library(presto)
merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

#alternatively we have a seurat object from the last step saved here if you need it.
merged <- readRDS('/cloud/project/data/single_cell_rna/backup_files/preprocessed_object.rds')

Gene expression analysis for epithelial cells

Use Seurat’s FeaturePlot function to color each cell by its Epcam expression on a UMAP. FeaturePlot requires at least 2 arguments- the seurat object, and the ‘feature’ you want to plot (where a ‘feature’ can be a gene, PC scores, any of the metadata columns, etc.). To customize the FeaturePlot, please refer to Seurat’s documentation here

FeaturePlot(merged, features = 'Epcam')

While there are some Epcam positive cells scattered on the UMAP, there appear to be 2 clusters of cells in the UMAP that we may be able to pull apart as potentially being malignant cells.

There are a few different ways to go about identifying what those clusters are. We can start by trying to use the DimPlot plotting function from before along with the FeaturePlot function. Separating plots by the + symbol allows us to plot multiple plots side-by-side.

DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
FeaturePlot(merged, features = 'Epcam') + 
DimPlot(merged, group.by = 'immgen_singler_main')

EPCAM cluster singler feature and dimplot

While the plots generated by the above commands make it pretty clear that the clusters of interest are clusters 9 and 12, sometimes it is trickier to determine which cluster we are interested in solely from the UMAP as the clusters may be overlapping. In this case, a violin plot VlnPlot may be more helpful. Similar to FeaturePlot, VlnPlot also takes the Seurat object and features as input. It also requires a group.by argument that determines the categorical variable that will be used to group the cells.
To learn more about customizing a Violin plot, please refer to the Seurat documentation

VlnPlot(merged, group.by = 'seurat_clusters_res0.8', features = 'Epcam')

EPCAM cluster singler vlnplot

Great! Looks like we can confirm that clusters 9 and 12 have the highest expression of Epcam. However, it is interesting that they are split into 2 clusters. This is a good place to use differential expression analysis to determine how these clusters differ from each other.

Differential expression for epithelial cells

We can begin by restricting the Seurat object to the cells we are interested in. We will do so using Seurat’s subset function, that allows us to create an object that is filtered to any values of interest in the metadata column. However, we first need to set the default identity of the Seurat object to the metadata column we want to use for the ‘subset’, and we can do so using the SetIdent function. After subsetting the object, we can plot the original object and subsetted object side-by-side to ensure the subsetting happened as expected. We can also count the number of cells of each type to confirm the same.

#set ident to seurat clusters metadata column and subset object to Epcam positive clusters
merged <- SetIdent(merged, value = 'seurat_clusters_res0.8')
merged_epithelial <- subset(merged, idents = c('9', '12'))

#confirm that we have subset the object as expected visually using a UMAP
DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
DimPlot(merged_epithelial, group.by = 'seurat_clusters_res0.8', label = TRUE)

#confirm that we have subset the object as expected by looking at the individual cell counts
table(merged$seurat_clusters_res0.8)
table(merged_epithelial$seurat_clusters_res0.8)

Epithelial subset object dimplot

Now we will use Seurat’s FindMarkers function to carry out a differential expression analysis between both groups. FindMarkers also requires that we use SetIdent to change the default ‘Ident’ to the metadata column we want to use for our comparison. More information about FindMarkers is available here.

Note that here we use FindMarkers to compare clusters 9 and 12. The default syntax of FindMarkers requires that we provide each group of cells as ident.1 and ident.2. The output of FindMarkers is a table with each gene that is differentially expressed and its corresponding log2FC. The direction of the log2FC is of ident.1 with respect to ident.2. Therefore, genes upregulated in ident.1 have positive log2FC, while those downregulated in ident.1 have negative log2FC. The min.pct=0.25 argument tests genes that are expressed in 25% of cells in either of the ident.1 or ident.2 groups. This can help reduce false positives as the genes must be expressed in a greater proportion of the cells compared to the default value of 1%. The logfc.threshold=0.1 parameter ensures our results only include genes that have a fold change of less than -0.1 or more than 0.1. Increasing the min.pct and logfc.threshold parameters can also result in the function running faster by reducing the number of genes being tested.

#carry out DE analysis between both groups
merged_epithelial <- SetIdent(merged_epithelial, value = "seurat_clusters_res0.8")
epithelial_de <- FindMarkers(merged_epithelial, ident.1 = "9", ident.2 = "12", min.pct=0.25, logfc.threshold=0.1) #how cluster 9 changes wrt cluster 12

On opening epithelial_de in your RStudio session, you’ll see that it is a dataframe with the genes as rownames, and the following columns- p_val, avg_log2FC, pct.1, pct.2, p_val_adj. The p-values are dependent on the test used while running FindMarkers, and the adjusted p-value is based on the bonferroni correction test. pct.1 and pct.2 are the percentages of cells where the gene is detected in the ident.1 and ident.2 groups respectively.

Next we can subset this dataframe to only include DE genes that have a significant p-value, and then further subset the ‘significant DE genes only’ dataframe to the top 20 genes with the highest absolute log2FC. Looking at the absolute log2FC allows us to capture both, upregulated and downregulated genes.

#restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
epithelial_de_sig <- epithelial_de[epithelial_de$p_val_adj < 0.001,] 

#get the top 20 genes by fold change
epithelial_de_sig_top20 <- epithelial_de_sig %>%
  top_n(n = 20, wt = abs(avg_log2FC)) 

epithelial_de_sig_top20 is a dataframe that is restricted to the top20 most differentially expressed genes by log2FC.

There are a few ways we can visualize the differentially expressed genes. We’ll start with the Violin and Feature plots from before. We can also visualize DEs using a DotPlot that allows us to capture both the average expression of a gene and the % of cells expressing it. In addition to these in-built Seurat functions, we can also generate a volcano plot using the EnhancedVolcano package. For the volcano plot, we can use the unfiltered DE results as the function colors and labels genes based on parameters (pCutoff, FCcutoff) we specify.

#get list of top 20 DE genes for ease
epithelial_de_sig_top20_genes <- rownames(epithelial_de_sig_top20)

#plot all 20 genes in violin plots
VlnPlot(merged_epithelial, features = epithelial_de_sig_top20_genes, 
  group.by = 'seurat_clusters_res0.8', ncol = 5, pt.size = 0)

#plot all 20 genes in UMAP plots
FeaturePlot(merged_epithelial, features = epithelial_de_sig_top20_genes, ncol = 5)

#plot all 20 genes in a DotPlot
DotPlot(merged_epithelial, features = epithelial_de_sig_top20_genes, 
  group.by = 'seurat_clusters_res0.8') + RotatedAxis()

#plot all differentially expressed genes in a volcano plot
EnhancedVolcano(epithelial_de,
  lab = rownames(epithelial_de),
  x = 'avg_log2FC',
  y = 'p_val_adj',
  title = 'Cluster9 wrt Cluster 12',
  pCutoff = 0.05,
  FCcutoff = 0.5,
  pointSize = 3.0,
  labSize = 5.0,
  colAlpha = 0.3)

To find out how we can figure out what these genes mean, stay tuned! The next module on pathway analysis will help shed some light on that. For now, let’s create a TSV file containing our DE results for use later on. We will need to rerun FindMarkers with slightly different parameters for this- we will change the logfc.threshold parameter to 0, as one of the pathway analysis tools requires all genes to be included in the analysis (more on that later).

#rerun FindMarkers
epithelial_de_gsea <- FindMarkers(merged_epithelial, ident.1 = "9", 
  ident.2 = "12", min.pct=0.25, logfc.threshold=0)
#save this table as a TSV file (first move index to first column)
epithelial_de_gsea <- tibble::rownames_to_column(epithelial_de_gsea, var = "gene")
write.table(x = epithelial_de_gsea, file = 'outdir_single_cell_rna/epithelial_de_gsea.tsv', sep='\t', row.names = FALSE)

Classic single-cell based differential expression analysis is notorious for having a high number of false positives. This has been attributed to a few reasons such as each cell being considered an independent observation resulting in inflated p-values and it not taking replicates into account. Psuedobulk based differential expression analyses can be incorporated to at least in part get around these (Junttila S, Smolander J, et al. (2022); Squair, J.W. et al. (2021)). Depending on the type of experiment, you’re unlikely to always have replicates, but here we do, therefore, we will carry out a pseudobulk DE analysis and compare the results against our conventional single-cell DE analysis. The first step is to generate a pseudobulk Seurat object using Seurat’s in-built function AggregateExpression, more information on the function is available here and Seurat’s pseudobulk DE vignette is here. The function makes groups based on the metadata columns we specify, sums the counts for each gene within the group, and then log normalizes and scales the data. The resulting object is structured like our single-cell Seurat object with different layers for the raw counts (counts), log normalized (data) and scaled (scale.data) data. However, instead of the rows being our cell barcodes, they are a combination of the grouping categories we used as an input. Here, we will use the sample (given by orig.ident) as our replicates and the differential expression analysis will use DESeq2 to compare clusters 9 and 12. Specifying DESeq2 in the test.use argument will allow us to use Seurat’s implementation of DESeq2. There are quite a few other tests that can be used, we primarily chose DESeq2 here because it was in Seurat’s DE vignette linked previously, but we encourage you to look into the various other tests as well.

# Aggregate counts for each sample and cluster combination
pb_epithelial <- AggregateExpression(merged_epithelial, assays = 'RNA', 
                                         return.seurat = T, group.by = c('orig.ident', 'seurat_clusters_res0.8'))
# See the first few rows of the log normalized data layer
print(head(pb_epithelial@assays$RNA$data))

#Change ident to seurat clusters
Idents(pb_epithelial) <- "seurat_clusters_res0.8"

# Use FindMarkers with DESeq2 as the test to compare cluster 9 wrt cluster 12
pb_epithelial_de <- FindMarkers(object = pb_epithelial, test.use = "DESeq2",
                              ident.1 = "9", ident.2 = "12")


# The DESeq2 analysis results in NAs in the pvalue columns for some cases 
pb_epithelial_de <- na.omit(pb_epithelial_de)

# Restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
pb_epithelial_de_sig <- pb_epithelial_de[pb_epithelial_de$p_val_adj < 0.01,] 


# Compare significantly differentially expressed genes
pb_and_sc_genes <- intersect(rownames(epithelial_de_sig), rownames(pb_epithelial_de_sig))
only_sc_genes <- setdiff(rownames(epithelial_de_sig), rownames(pb_epithelial_de_sig))
only_pb_genes <- setdiff(rownames(pb_epithelial_de_sig), rownames(epithelial_de_sig))

print(paste0("Genes differentially expressed in both single-cell and pseudobulk: ", length(pb_and_sc_genes)))
print(paste0("Genes differentially expressed in single-cell but not pseudobulk: ", length(only_sc_genes)))
print(paste0("Genes differentially expressed in pseudobulk but not single-cell: ", length(only_pb_genes)))

Genes that were detected by single-cell and pseudobulk (~2800) are the most likely to be true positives. Far fewer genes are detected by pseudobulk only (~600) compared to single-cell only (~3000). The pseudobulk genes may also be false positives though, as by pseudobulking we lose information about the percent of cells expressing the genes. Therefore, some of these genes may be differentially expressed because they are only expressed in a few cells. Also, it is unlikely that all ~3000 genes detected only by single-cell are false positives- therefore, it’s difficult to advocate for only one approach, but based on the downstream application one approach may be better than the other. (If the goal is to come up with a shorter list of genes to follow up on in the wet lab, then the consensus DE gene list may be appropriate. But if the goal is more exploratory, the single cell genes can be used, but you’ll likely want to use violin plots and feature plots to make sure the genes are indeed differentially expressed).

Differential expression for T cells

For the T cell focused analysis, we will ask how T cells from mice treated with ICB compare against T cells from mice with (some of) their T cells depleted treated with ICB (ICBdT). We will start by subsetting our merged object to only have T cells, by combining the various T cell annotations from celltyping section. We’ll start by seeing all the possible celltypes we have, and picking the ones that are related to T cells. Next, we will SetIdent to the celltype metadata column, and subset to the celltypes that correspond to T cells. Finally, we’ll doublecheck that the subsetting happened as we expected it to.

#check all the annotated celltypes
unique(merged$immgen_singler_main)

#pick the ones that are related to T cells
t_celltypes_names <- c('T cells', 'NKT', 'Tgd')
merged <- SetIdent(merged, value = 'immgen_singler_main')
merged_tcells <- subset(merged, idents = t_celltypes_names)

#confirm that we have subset the object as expected visually using a UMAP
DimPlot(merged, group.by = 'immgen_singler_main') + 
DimPlot(merged_tcells, group.by = 'immgen_singler_main')

#confirm that we have subset the object as expected by looking at the individual cell counts
table(merged$immgen_singler_main)
table(merged_tcells$immgen_singler_main)

Tcells subset object dimplot

Now we want to compare T cells from mice treated with ICB vs ICBdT. First, we need to distinguish the ICB and ICBdT cells from each other. Start by clicking on the object in RStudio and expand meta.data to get a snapshot of the columns and the what kind of data they hold.

We can leverage the orig.ident column again as it has information about the condition and replicates. For the purposes of this DE analysis, we want a meta.data column that combines the replicates of each condition. That is, we want to combine the replicates of each condition together into a single category (ICB vs ICBdT).

#we'll start by checking the possible names each replicate has.
unique(merged_tcells$orig.ident)

#there are 6 possible values, 3 replicates for the ICB treatment condition, and 3 for the ICBdT condition
#so we can combine "Rep1_ICB", "Rep3_ICB", "Rep5_ICB" to ICB, and "Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT" to ICBdT. 
#first initialize a metadata column for experimental_condition
merged_tcells@meta.data$experimental_condition <- NA

#Now we can take all cells that are in each replicate-condition, 
#and assign them to the appropriate condition
merged_tcells@meta.data$experimental_condition[merged_tcells@meta.data$orig.ident %in% c("Rep1_ICB", "Rep3_ICB", "Rep5_ICB")] <- "ICB"
merged_tcells@meta.data$experimental_condition[merged_tcells@meta.data$orig.ident %in% c("Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT")] <- "ICBdT"

#double check that the new column we generated makes sense 
#(each replicate should correspond to its experimental condition)
table(merged_tcells@meta.data$orig.ident, merged_tcells@meta.data$experimental_condition)

With the experimental conditions now defined, we can compare the T cells from both groups. We’ll start by using FindMarkers using similar parameters to last time, and see how ICBdT changes with respect to ICB. Next, restrict the dataframe to significant genes only, and then look at the top 5 most upregulated and downregulated DE genes by log2FC.

#carry out DE analysis between both groups
merged_tcells <- SetIdent(merged_tcells, value = "experimental_condition")
tcells_de <- FindMarkers(merged_tcells, ident.1 = "ICBdT", ident.2 = "ICB", min.pct=0.25)

#restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
tcells_de_sig <- tcells_de[tcells_de$p_val_adj < 0.001,]

#find the top 5 most downregulated genes
tcells_de_sig %>%
  top_n(n = 5, wt = -avg_log2FC)
#find the top 5 most upregulated genes
tcells_de_sig %>%
  top_n(n = 5, wt = avg_log2FC)

The most downregulated gene in the ICBdT condition based on foldchange is Cd4. That makes sense, because the monoclonal antibody (GK1.5) used for T cell depletion specifically targets CD4 T cells! You can read more about the antibody here.

Interestingly, for the list of genes that are upregulated in the ICBdT condition, we see Cd8b1 show up. It could be interesting to see if the CD8 T cells’ phenotype changes based on the treatment condition. Let’s subset the object to CD8 T cells only, find DE genes to see how ICBdT CD8 T cells change compared to ICB CD8 T cells, and visualize these similar to before.

# Subset object to CD8 T cells. Since we already showed how to subset cells using the clusters earlier, 
# This time we'll subset to CD8 T cells by selecting for cells with high expression of Cd8 genes and low expression of Cd4 genes using violin plots to find thresholds for filtering
VlnPlot(merged_tcells, features = c('Cd8a', 'Cd8b1', 'Cd4'))
merged_cd8tcells <- subset(merged_tcells, subset= Cd8b1 > 1 & Cd8a > 1 & Cd4 < 0.1)

#carry out DE analysis between both groups
merged_cd8tcells <- SetIdent(merged_cd8tcells, value = "experimental_condition")
cd8tcells_de <- FindMarkers(merged_cd8tcells, ident.1 = "ICBdT", ident.2 = "ICB", min.pct=0.25) #how ICBdT changes wrt ICB

#restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
cd8tcells_de_sig <- cd8tcells_de[cd8tcells_de$p_val_adj < 0.001,]

#get the top 20 genes by fold change
cd8tcells_de_sig %>%
  top_n(n = 20, wt = abs(avg_log2FC)) -> cd8tcells_de_sig_top20

#get list of top 20 DE genes for ease
cd8tcells_de_sig_top20_genes <- rownames(cd8tcells_de_sig_top20)


#plot all 20 genes in violin plots
VlnPlot(merged_cd8tcells, features = cd8tcells_de_sig_top20_genes, group.by = 'experimental_condition', ncol = 5, pt.size = 0)

#plot all 20 genes in UMAP plots
FeaturePlot(merged_cd8tcells, features = cd8tcells_de_sig_top20_genes, ncol = 5)

#plot all 20 genes in a DotPlot
DotPlot(merged_cd8tcells, features = cd8tcells_de_sig_top20_genes, group.by = 'experimental_condition') + RotatedAxis()

#plot all differentially expressed genes in a volcano plot
EnhancedVolcano(cd8tcells_de,
  lab = rownames(cd8tcells_de),
  x = 'avg_log2FC',
  y = 'p_val_adj',
  title = 'ICBdT wrt ICB',
  pCutoff = 0.05,
  FCcutoff = 0.5,
  pointSize = 3.0,
  labSize = 5.0,
  colAlpha = 0.3)

CD8Tcells DE volcanoplot

At this point, you can either start doing literature searches for some of these genes and find that for genes like Bcl2, Cdk8, and Znrf3 that were upregulated in ICB CD8 Tcells, the literature suggests- antigen-specific memory CD8 T cells have higher expression of Bcl2 and Cdk8; and Znrf3 has been implicated in CD8 T cells’ anti-tumor response following anti-PD-1 therapy. Also, we replicate the original paper’s findings of ICBdT CD8 T cells upregulating T cell exhaustion markers like Tigit and Pdcd1.

But we can also use gene set and pathway analyses to try and determine what processes the cells may be involved in.

Same as the epithelial cells, let’s create a TSV file containing our DE results for use in pathway analysis by rerunning FindMarkers with the logfc.threshold parameter set to 0.

#rerun FindMarkers
cd8tcells_de_gsea <- FindMarkers(merged_cd8tcells, ident.1 = "ICBdT", 
  ident.2 = "ICB", min.pct=0.25, logfc.threshold=0)
#save this table as a TSV file (first move index to first column)
cd8tcells_de_gsea <- tibble::rownames_to_column(cd8tcells_de_gsea, var = "gene")
write.table(x = cd8tcells_de_gsea, file = 'outdir_single_cell_rna/cd8tcells_de_gsea.tsv', sep='\t', row.names = FALSE)

Try using the pseudobulk aggregated expression for the CD8 T cells yourself!

Hint: Aggregate by orig.ident and experimental_condition. (Click here to get the answer)


# Aggregate counts for each sample and cluster combination
pb_cd8tcells <- AggregateExpression(merged_cd8tcells, assays = 'RNA', 
                                         return.seurat = T, group.by = c('orig.ident', 'experimental_condition'))

# See the first few rows of the log normalized data layer
print(head(pb_cd8tcells@assays$RNA$data))

# Change ident to seurat clusters
Idents(pb_cd8tcells) <- "experimental_condition"

# Use FindMarkers with DESeq2 as the test to compare cluster 9 wrt cluster 12
pb_cd8tcells_de <- FindMarkers(object = pb_cd8tcells, test.use = "DESeq2",
                                  ident.1 = "ICBdT", ident.2 = "ICB")

# The DESeq2 analysis results in NAs in the pvalue columns for some cases 
pb_cd8tcells_de <- na.omit(pb_cd8tcells_de)

# Restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
pb_cd8tcells_de_sig <- pb_cd8tcells_de[pb_cd8tcells_de$p_val_adj < 0.01,] 

# Compare significantly differentially expressed genes
pb_and_sc_genes <- intersect(rownames(cd8tcells_de_sig), rownames(pb_cd8tcells_de_sig))
only_sc_genes <- setdiff(rownames(cd8tcells_de_sig), rownames(pb_cd8tcells_de_sig))
only_pb_genes <- setdiff(rownames(pb_cd8tcells_de_sig), rownames(cd8tcells_de_sig))

print(paste0("Genes differentially expressed in both single-cell and pseudobulk: ", length(pb_and_sc_genes)))
print(paste0("Genes differentially expressed in single-cell but not pseudobulk: ", length(only_sc_genes)))
print(paste0("Genes differentially expressed in pseudobulk but not single-cell: ", length(only_pb_genes)))

Hmm looks like we barely have any significant DE genes for the pseudobulk data. Let's look at a few of the genes we pulled out earlier:


VlnPlot(merged_cd8tcells, features = c('Bcl2', 'Cdk8', 'Znrf3'), group.by = 'orig.ident')
print(pb_cd8tcells_de[rownames(pb_cd8tcells_de) %in% c('Bcl2', 'Cdk8', 'Znrf3'), ])

It looks like those genes didn't get past the multiple hypothesis correcting. In practice, it would be a judgement call on whether or not you expect these to be 'real' and how you want to validate them.

Single-cell RNA-seq - CSHL legacy version | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Single-cell RNA-seq - CSHL legacy version

Exercise: A Complete Seurat Workflow

In this exercise, we will analyze and interpret a small scRNA-seq data set consisting of three bone marrow samples. Two of the samples are from the same patient, but differ in that one sample was enriched for a particular cell type. The goal of this analysis is to determine what cell types are present in the three samples, and how the samples and patients differ. This was drawn in part from the Seurat vignettes at https://satijalab.org/seurat/vignettes.html.

Step 1: Preparation

Working at the linux command line in your home directory (/home/ubuntu/workspace), create a new directory for your output files called “scrna”. The full path to this directory will be /home/ubuntu/workspace/scrna. The command is:

mkdir ~/workspace/scRNA_data
cd ~/workspace/scRNA_data
wget -r -N --no-parent -nH --reject zip -R "index.html*" --cut-dirs=2 http://genomedata.org/rnaseq-tutorial/scrna/
cd ~/workspace
mkdir scrna
cd scrna
wget http://genomedata.org/rnaseq-tutorial/scrna/PlotMarkers.r

Start R, then load some R libraries as follows

library("Seurat");
library("sctransform");
library("dplyr");
library("RColorBrewer");
library("ggthemes");
library("ggplot2");
library("cowplot");
library("data.table");

Create a vector of convenient sample names, such as “A”, “B”, and “C”:

samples = c("A","B","C");

Create a variable called outdir to specify your output directory:

outdir = "/home/ubuntu/workspace/scrna";

Step 2: Read in the feature-barcode matrices generated by the cellranger pipeline

data.10x = list(); # first declare an empty list in which to hold the feature-barcode matrices
data.10x[[1]] <- Read10X(data.dir = "~/workspace/scRNA_data/ND050119_CD34_3pV3/filtered_feature_bc_matrix");
data.10x[[2]] <- Read10X(data.dir = "~/workspace/scRNA_data/ND050119_WBM_3pV3/filtered_feature_bc_matrix");
data.10x[[3]] <- Read10X(data.dir = "~/workspace/scRNA_data/ND050819_WBM_3pV3/filtered_feature_bc_matrix");

Step 3: Convert each feature-barcode matrix to a Seurat object

This simultaneously performs some initial filtering in order to exclude genes that are expressed in fewer than 100 cells, and to exclude cells that contain fewer than 700 expressed genes. Note that min.cells=10 and min.features=100 are more common parameters at this stage, but we are filtering more aggressively in order to make the data set smaller. At this step, we also create a “DataSet” identity for each cell.

scrna.list = list(); # First create an empty list to hold the Seurat objects
scrna.list[[1]] = CreateSeuratObject(counts = data.10x[[1]], min.cells=100, min.features=700, project=samples[1]);
scrna.list[[1]][["DataSet"]] = samples[1];
scrna.list[[2]] = CreateSeuratObject(counts = data.10x[[2]], min.cells=100, min.features=700, project=samples[2]);
scrna.list[[2]][["DataSet"]] = samples[2];
scrna.list[[3]] = CreateSeuratObject(counts = data.10x[[3]], min.cells=100, min.features=700, project=samples[3]);
scrna.list[[3]][["DataSet"]] = samples[3];

Aside: Note that you can do this more efficiently, especially if you have many samples, using a ‘for’ loop:

for (i in 1:length(data.10x)) {
    scrna.list[[i]] = CreateSeuratObject(counts = data.10x[[i]], min.cells=100, min.features=700, project=samples[i]);
    scrna.list[[i]][["DataSet"]] = samples[i];
}

Finally, remove the raw data to save memory (these objects get large!):

rm(data.10x);

Step 4. Merge the Seurat objects into a single object

We will call this object scrna. We also give it a project name (here, “CSHL”), and prepend the appropriate data set name to each cell barcode. For example, if a barcode from data set “B” is originally AATCTATCTCTC, it will now be B_AATCTATCTCTC. Then clean up some space by removing scrna.list. Finally, save the merged object as an RDS file. Should you need to load this file into R at any time, it can be done using the readRDS command.

scrna <- merge(x=scrna.list[[1]], y=c(scrna.list[[2]],scrna.list[[3]]), add.cell.ids = c("A","B","C"), project="CSHL");
rm(scrna.list); # save some memory
str(scrna@meta.data) # examine the structure of the Seurat object meta data
saveRDS(scrna, file = sprintf("%s/MergedSeuratObject.rds", outdir));

Aside on accessing the Seurat object meta data, which is stored in scrna@meta.data

Meta data can be used to hold the following information (and more) for your data set:

You can access and query the meta data using commands such as:

scrna[[]];
scrna@meta.data;
str(scrna@meta.data); # Examine structure and contents of meta data
head(scrna@meta.data$nFeature_RNA); # Access genes (“Features”) for each cell
head(scrna@meta.data$nCount_RNA); # Access number of UMIs for each cell:
levels(x=scrna); # List the items in the current default cell identity class
length(unique(scrna@meta.data$seurat_clusters)); # How many clusters are there? Note that there will not be any clusters in the meta data until you perform clustering.
unique(scrna@meta.data$Batch); # What batches are included in this data set?
scrna$NewIdentity <- vector_of_annotations; # Assign new cell annotations to a new "identity class" in the meta data

Step 5. Quality control plots

Plot the distributions of several quality-control variables in order to choose appropriate filtering thresholds. The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat. However, you will need to manually calculate the mitochondrial transcript percentage and ribosomal transcript percentage for each cell, and add them to the Seurat object meta data, as shown below.

Calculate the mitochondrial transcript percentage for each cell:

mito.genes <- grep(pattern = "^MT-", x = rownames(x = scrna), value = TRUE);
percent.mito <- Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts')[mito.genes, ]) / Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts'));
scrna[['percent.mito']] <- percent.mito;

Calculate the ribosomal transcript percentage for each cell:

ribo.genes <- grep(pattern = "^RP[SL][[:digit:]]", x = rownames(x = scrna), value = TRUE);
percent.ribo <- Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts')[ribo.genes, ]) / Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts'));
scrna[['percent.ribo']] <- percent.ribo;

Plot as violin plots, which will be located in, for example, ~/workspace/scrna/VlnPlot.pdf All figures can be downloaded using the scp command, or viewed on the AWS server.

pdf(sprintf("%s/VlnPlot.pdf", outdir), width = 13, height = 6);
vln <- VlnPlot(object = scrna, features = c("percent.mito", "percent.ribo"), pt.size=0, ncol = 2, group.by="DataSet");
print(vln);
dev.off();

pdf(sprintf("%s/VlnPlot.nCount.25Kmax.pdf", outdir), width = 10, height = 10)
vln <- VlnPlot(object = scrna, features = "nCount_RNA", pt.size=0, group.by="DataSet", y.max=25000)
print(vln)
dev.off();

pdf(sprintf("%s/VlnPlot.nFeature.pdf", outdir), width = 10, height = 10)
vln <- VlnPlot(object = scrna, features = "nFeature_RNA", pt.size=0, group.by="DataSet")
print(vln)
dev.off()

QUESTIONS:

  1. Excessive mitochondrial transcripts can indicate the presence of dead cells, which tend to cluster together. Based on the distribution of mitochondrial transcripts, what filter threshold would you set for mitochondrial transcripts? One approach is to start with a lenient threshold, work through the analysis, and determine later whether your data still contains clusters of dead cells.
  2. Compare the distribution of ribosomal transcripts, total transcripts, and genes in each sample. Are differences in these parameters necessarily a technical artifact, or might they contain information about the biology of the samples?

Next, we will use Seurat’s FeatureScatter function to create scatterplots of the relationships among QC variables. This can be helpful in selecting filtering thresholds. More generally, this is a very useful wrapper function that can be used to visualize relationships between any pair of quantitative variables in the Seurat object (including expression levels, etc).

pdf(sprintf("%s/Scatter1.pdf", outdir), width = 8, height = 6);
scatter <- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "percent.mito", pt.size=0.1)
print(scatter);
dev.off();

pdf(sprintf("%s/Scatter2.pdf", outdir), width = 8, height = 6);
scatter <- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "percent.ribo", pt.size=0.1)
print(scatter);
dev.off();

pdf(sprintf("%s/Scatter3.pdf", outdir), width = 8, height = 6);
scatter <- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", pt.size=0.1)
print(scatter);
dev.off();

Step 6. Calculate a cell cycle score for each cell

This can be used to determine whether heterogeneity in cell cycle phase is driving the tSNE/UMAP layout and/or clustering. This may or may not be obscuring the signal you care about, depending on your analysis goals and the nature of the data. (If necessary, it can be removed in a later step.) It is also useful for determining whether certain populations of cells are more proliferative than others. The list of cell cycle genes, and the scoring method, was taken from Tirosh I, et al. (2016).

cell.cycle.tirosh <- read.csv("http://genomedata.org/rnaseq-tutorial/scrna/CellCycleTiroshSymbol2ID.csv", header=TRUE); # read in the list of genes
s.genes = cell.cycle.tirosh$Gene.Symbol[which(cell.cycle.tirosh$List == "G1/S")]; # create a vector of S-phase genes
g2m.genes = cell.cycle.tirosh$Gene.Symbol[which(cell.cycle.tirosh$List == "G2/M")]; # create a vector of G2/M-phase genes
scrna <- CellCycleScoring(object=scrna, s.features=s.genes, g2m.features=g2m.genes, set.ident=FALSE)

Step 7. Filter the cells to remove debris, dead cells, and probable doublets

QUESTION: How many cells are there in each sample before filtering? The ‘table’ function may come in handy.

First calculate some basic statistics on the various QC parameters, which can be helpful for choosing cutoffs. For example:

min <- min(scrna@meta.data$nFeature_RNA);
m <- median(scrna@meta.data$nFeature_RNA)
max <- max(scrna@meta.data$nFeature_RNA)    
s <- sd(scrna@meta.data$nFeature_RNA)
min1 <- min(scrna@meta.data$nCount_RNA)
max1 <- max(scrna@meta.data$nCount_RNA)
m1 <- mean(scrna@meta.data$nCount_RNA)
s1 <- sd(scrna@meta.data$nCount_RNA)
Count93 <- quantile(scrna@meta.data$nCount_RNA, 0.93) # calculate value in the 93rd percentile
print(paste("Feature stats:",min,m,max,s));
print(paste("UMI stats:",min1,m1,max1,s1,Count93));

Now, filter the data using the subset function and your chosen thresholds. Note that for large data sets with diverse samples, it may be beneficial to use sample-specific thresholds for some parameters. If you are not sure what thresholds to use, the following will work well for the purposes of this course:

scrna <- subset(x = scrna, subset = nFeature_RNA > 700  & nCount_RNA < Count93 & percent.mito < 0.1)

QUESTION: How many cells are there in each sample after filtering?

Step 8. [Optional] Subset the data

If necessary, you can subset the data set to N cells (2000, 5000, etc) to make it more manageable:

subcells <- sample(Cells(scrna), size=N, replace=F)
scrna <- subset(scrna, cells=subcells)

Step 9. Normalize the data, detect variable genes, and scale the data

Normalize the data:

scrna <- NormalizeData(object = scrna, normalization.method = "LogNormalize", scale.factor = 1e6);

QUESTION: What does LogNormalize do mathematically? Are there other normalization options available?

Now identify and plot the most variable genes, which will be used for downstream analyses. This is a critical step that reduces the contribution of noise. Consider adjusting the cutoffs if you think (often based on prior knowledge of your experimental system) that important genes are being excluded.

scrna <- FindVariableFeatures(object = scrna, selection.method = 'vst', mean.cutoff = c(0.1,8), dispersion.cutoff = c(1, Inf))
print(paste("Number of Variable Features: ",length(x = VariableFeatures(object = scrna))));

pdf(sprintf("%s/VG.pdf", outdir), useDingbats=FALSE)
vg <- VariableFeaturePlot(scrna)
print(vg);
dev.off()

Scale and center the data:

scrna <- ScaleData(object = scrna, features = rownames(x = scrna), verbose=FALSE);

Alternatively, you can scale the data and simultaneously remove unwanted signal associated with variables such as cell cycle phase, ribosomal transcript content, etc. (This is slow, and cannot be done in the time allotted for this course.) To remove cell cycle signal, for instance:

# scrna <- ScaleData(object = scrna, features = rownames(x = scrna), vars.to.regress = c("S.Score","G2M.Score"), display.progress=FALSE);

Save the normalized, scaled Seurat object:

saveRDS(scrna, file = sprintf("%s/VST.rds", outdir));

DIGRESSION: How can you use Seurat-processed data with packages that are not compatible with Seurat? Other packages may require the data to be normalized in a specific way, and often require an expression matrix (not a Seurat object) as input. As an example, here we prepare an expression data matrix for use with the popular CNV-detection package CONICSmat:

scrna.cnv <- NormalizeData(object = scrna, normalization.method = "RC", scale.factor = 1e5);
data.cnv <- GetAssayData(object=scrna.cnv, slot="data"); # get the normalized data
log2data = log2(data.cnv+1); # add 1 then take log2
df <- as.data.frame(as.matrix(log2data)); # convert it to a data frame
cells <- as.data.frame(colnames(df));
genes <- as.data.frame(rownames(df));
# save as text files:
fwrite(x = genes, file = "genes.csv", col.names=FALSE);
fwrite(x = cells, file = "cells.csv", col.names=FALSE);
fwrite(x = df, file = "exp.csv", col.names=FALSE);

Step 10. Reduce the dimensionality of the data using Principal Component Analysis

Subsequent calculations, such as those used to derive the tSNE and UMAP projections, and the k-Nearest Neighbor graph used for clustering, are performed in a new space with fewer dimensions, namely, the principal components. Here, specify a relatively large number of principal components – more than you anticipate using for downstream analyses. Then use several techniques to characterize the components and estimate the number of principal components that captures the signal of interest while minimizing noise.

Perform Principal Component Analysis (PCA), and save the first 100 components:

scrna <- RunPCA(object = scrna, npcs = 100, verbose = FALSE);

OPTIONAL: Then run ProjectDim, which scores each gene in the dataset (including genes not included in the PCA) based on their correlation with the calculated components. This is not used elsewhere in this pipeline, but it can be useful for exploring genes that are not among the 2000 most highly variable genes selected above.

scrna <- ProjectDim(object = scrna)

QUESTION: What do the principal components “mean” from a biological standpoint? What genes contribute to the principal components? Do they represent biological processes of interest, or technical variables (such as mitochondrial transcripts) that suggest the data may need to be filtered differently?

There are several easy ways to investigate these questions. First, visualize the PCA “loadings.” Each “component” identified by PCA is a linear combination, or weighted sum, of the genes in the data set. Here, the “loadings” represent the weights of the genes in any given component. These plots tell you which genes contribute most to each component:

pdf(sprintf("%s/VizDimLoadings.pdf", outdir), width = 8, height = 30);
vdl <- VizDimLoadings(object = scrna, dims = 1:3)
print(vdl);
dev.off();

Second, use the DimHeatmap function to generate heatmaps that summarize the expression of the most highly weighted genes in each principal component. As noted in the Seurat documentation, “both cells and genes are ordered according to their PCA scores. Setting cells.use to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated gene sets.

pdf(sprintf("%s/PCA.heatmap.multi.pdf", outdir), width = 8.5, height = 24);
hm.multi <- DimHeatmap(object = scrna, dims = 1:12, cells = 500, balanced = TRUE);
print(hm.multi);
dev.off();

Finally, you can generate ranked lists of the genes in each principal component and perform functional enrichment or Gene Set Enrichment Analysis. (This tool offers a quick and easy way to determine functional enrichment from a list of genes.) For example, for the first principal component:

PClist_1 <- names(sort(Loadings(object=scrna, reduction="pca")[,1], decreasing=TRUE));

Now, decide how many components to use in downstream analyses. This number usually varies from 5-50, depending on the number of cells and the complexity of the data set. Although there is no “correct” answer, using too few components risks missing meaningful signal, and using too many risks diluting meaningful signal with noise.

There are several ways to make an informed decision. The first is to use the principal component heatmaps generated above. Components that generate noisy heatmaps likely correspond to noise. The second method is to examine a plot of the standard deviations of the principle components, and to choose a cutoff to the left of the bend in this so-called “elbow plot.”

Generate an elbow plot of principal component standard deviations:

elbow <- ElbowPlot(object = scrna)
pdf(sprintf("%s/PCA.elbow.pdf", outdir), width = 6, height = 8);
print(elbow);
dev.off();

Next, use a bootstrapping technique called Jackstraw analysis to estimate a p-value for each component, print out a plot, and save the p-values to a file:

scrna <- JackStraw(object = scrna, num.replicate = 100, dims=30); # takes around 4 minutes
scrna <- ScoreJackStraw(object = scrna, dims = 1:30)
pdf(sprintf("%s/PCA.jackstraw.pdf", outdir), width = 10, height = 6);
js <- JackStrawPlot(object = scrna, dims = 1:30)
print(js);
dev.off();
pc.pval <- scrna@reductions$pca@jackstraw@overall.p.values; # get p-value for each PC
write.table(pc.pval, file=sprintf("%s/PCA.jackstraw.scores.xls", outdir, date), quote=FALSE, sep='\t', col.names=TRUE);

Use the number of principal components (nPC) you selected above.

nPC = 10;
scrna <- RunUMAP(object = scrna, reduction = "pca", dims = 1:nPC);
scrna <- RunTSNE(object = scrna, reduction = "pca", dims = 1:nPC);

Now, plot the tSNE and UMAP plots next to each other in one figure, and color each data set separately:

pdf(sprintf("%s/UMAP.%d.pdf", outdir, nPC), width = 10, height = 8);
p1 <- DimPlot(object = scrna, reduction = "tsne", group.by = "DataSet", pt.size=0.1)
p2 <- DimPlot(object = scrna, reduction = "umap", group.by = "DataSet", pt.size=0.1)
print(plot_grid(p1, p2));
dev.off();

QUESTIONS:

  1. How do the data sets compare to each other? (We will further investigate these differences in subsequent steps.)
  2. How does the number of principal components used affect the layout?
  3. What are the chief sources of variation in this data, as suggested by the t-SNE and UMAP layouts? Are there confounding technical variables that may be driving the layouts? What are some likely technical variables?

Color the t-SNE and UMAP plots by some potential confounding variables. Here’s an example in which we color each cell according to the number of UMIs it contains:

feature.pal = rev(colorRampPalette(brewer.pal(11,"Spectral"))(50)); # a useful color palette
pdf(sprintf("%s/umap.%d.colorby.UMI.pdf", outdir, nPC), width = 10, height = 8);
fp <- FeaturePlot(object = scrna, features = c("nCount_RNA"), cols = feature.pal, pt.size=0.1, reduction = "umap") + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank()); # the text after the ‘+’ simply removes the axis using ggplot syntax
print(fp);
dev.off();

QUESTION: What is the relationship between the principal components and the t-SNE/UMAP layout?

To investigate this, plot several principal components on the t-SNE/UMAP, for example the following code plots the first principal component and prints the plot to a file:

pdf(sprintf("%s/UMAP.%d.colorby.PCs.pdf", outdir, nPC), width = 12, height = 6);
redblue=c("blue","gray","red"); # another useful color scheme
fp1 <- FeaturePlot(object = scrna, features = 'PC_1', cols=redblue, pt.size=0.1, reduction = "umap")+ theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank());
print(fp1);
dev.off();

Step 12: Infer cell types

There are many sophisticated methods for doing this (e.g. SingleR). But the simplest and most common approach is to plot the expression levels of marker genes for known cell types. Markers for bone-marrow-relevant cell types are provided in the file ~/workspace/scRNA_data/gene_lists_human_180502.csv. To plot three genes of your choice, GENE1, GENE2, and GENE3:

pdf(sprintf("%s/geneplot.pdf", outdir), height=6, width=6);
fp <- FeaturePlot(object = scrna, features = c(GENE1, GENE2, GENE3), cols = c("gray","red"), ncol=2, reduction = "umap") + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank());
print(fp);
dev.off();

Now use the code that we downloaded from here to color the UMAP according to the expression of the markers in gene_lists_human_180502.csv:

source("~/workspace/scrna/PlotMarkers.r")

During the differential expression analysis in Step 14, which will take about 10 minutes to run, use these plots to make inferences about cell type.

Step 13: Cluster the cells using a graph-based clustering algorithm

The first step is to generate the k-Nearest Neighbor (KNN) graph using the number of principal components chosen above (nPC). The second step is to partition the graph into “cliques” or clusters using the Louvain modularity optimization algorithm. At this step, the cluster resolution (cluster.res) may be specified. (Larger numbers generate more clusters.) While there is no “correct” number of clusters, it can be preferable to err on the side of too many clusters. For this exercise, please use the following:

nPC = 10;
cluster.res = 0.2;
scrna <- FindNeighbors(object=scrna, dims=1:nPC);
scrna <- FindClusters(object=scrna, resolution=cluster.res);

The output of FindClusters is saved in scrna@meta.data$seurat_clusters. Note that this is reset each time clustering is performed. To ensure that each clustering result is saved, save the result as a new identity class, and give it a custom name that reflects the clustering resolution and number of principal components:

scrna[[sprintf("ClusterNames_%.1f_%dPC", cluster.res, nPC)]] <- Idents(object = scrna);

Inspect the structure of the meta data, then save the Seurat object, which now contains t-SNE and UMAP coordinates, and clustering results:

str(scrna@meta.data);
saveRDS(scrna, file = sprintf("%s/VST.PCA.UMAP.TSNE.CLUST.rds", outdir));

QUESTION: How many graph-based clusters are there? (This number is ‘n.graph’ and is used below.) How do they relate to the 2-D layouts? How does this depend on the number of components and the clustering resolution?

First plot graph-based clusters on 2-D layouts:

n.graph = length(unique(scrna[[sprintf("ClusterNames_%.1f_%dPC",cluster.res, nPC)]][,1])); # automatically get the number of clusters from a specific clustering run

Or more simply, use the most recent default clustering result:

n.graph = length(unique(scrna@meta.data$seurat_clusters));

rainbow.colors = rainbow(n.graph, s=0.6, v=0.9); # color palette
pdf(sprintf("%s/UMAP.clusters.pdf", outdir), width = 10, height = 6);
p <- DimPlot(object = scrna, reduction = "umap", group.by = "seurat_clusters", cols = rainbow.colors, pt.size=0.1, label=TRUE) + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank());
print(p);
dev.off();

QUESTION: Are there sample-specific clusters? How many cells are in each cluster and each sample?

cluster.breakdown <- table(scrna@meta.data$DataSet, scrna@meta.data$seurat_clusters);

QUESTION: What do the clusters represent? How do they differ from each other? Start with a differential expression analysis:

Step 14: Interpret the clustering using a differential gene expression (DEG) analysis

Perform DEG analysis on all clusters simultaneously using the default differential expression test (Wilcoxon), then save the results to a file. This will take 10-15 minutes using the parameters here, which were chosen to make this step run faster, and are not necessarily ideal for all situations. (While this is running, use the plots generated in Step 12 to figure out what types of cells are present in this data set.) Now compute the DEGs and save to a file:

DEGs <- FindAllMarkers(object=scrna, logfc.threshold=1, min.diff.pct=.2);
write.table(DEGs, file=sprintf("%s/DEGs.Wilcox.xls", outdir), quote=FALSE, sep="\t", row.names=FALSE);

Examine the contents and structure of DEGs. How many DEGs are there? What values are contained in this output? Now choose the top 10 DEGs in each cluster, and print them to a heatmap using DoHeatmap and a red/white/blue color scheme:

top10 <- DEGs %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC);
pdf(sprintf("%s/heatmap.pdf", outdir), height=20, width=15);
DoHeatmap(scrna, features=top10$gene, slot="scale.data", disp.min=-2, disp.max=2, group.by="ident", group.bar=TRUE) + scale_fill_gradientn(colors = c("blue", "white", "red")) + theme(axis.text.y = element_text(size = 10));
dev.off();

Questions pertaining to cell type inference and DEG analysis:

  1. What cell types are present?
  2. Plot some DEGs using FeaturePlot. What is more reliable or informative: the statistical significance or the log fold-change of the DEG?
  3. Do different clusters correspond to different cell types? (Should they?)
  4. Are the DEGs helpful for identifying cell types?
  5. Does cell type correlate with other parameters (e.g. UMI, number of genes, cell cycle phase, etc?)

Independent exercises, if time permits