Welcome to the blog

Posts

My thoughts and ideas

Quality Assessment | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Quality Assessment

TO DO

  • add images
  • add explantion

Quality assessment

We are going to begin our single cell analysis by loading in the output from CellRanger. We will load in our different samples, create a Seurat object with then, and take a look at the quality of the cells.

#Note, we have provided the raw data for this exercise in your cloud workspace. They are also available at: http://genomedata.org/cri-workshop/counts_gex/

Step 1: Load in Data

First load the needed packages. The most important package for this step is Seurat. Seurat provides a unqiue data structure and tools for qaulity control, analysis, and exploration of sincle cell RNA sequencing data. Seurat is very popular and is considered a standard tool for single-cell RNA analysis. Another example of a toolkit with similar functionality is Scanpy which is implemented in python.

<<<<<<< HEAD
library("Seurat")
library("ggplot2")
library("cowplot")
library("dplyr")
library("Matrix")
library("viridis")
library("hdf5r")

Create and set some necessary directories

library("Seurat") 
library("ggplot2") # creating plots
library("cowplot") # add-on to ggplot, we use the plot_grid function to put mutple violin plots in one image file, TRY WITH FACET WRAP
library("dplyr")   # a set of functions for dataframe manipulation -- core package of Tidyverse (THE R data manipualtion package)
library("Matrix")  # a set of function for operating on matrices
library("viridis") # color maps for graphs that are more readable then default colors

Lets make sure out workspace is configured the way we like. We will set out working directory where files will be saved. IT is always good practive to verify what directory you are in and to assure that you set the directory to where you want your files to be saved.

getwd()                         # list the contents of your current working directory
setwd("/cloud/project/")        # set your working directory to where your data files are located
getwd()                         # verify that your working directory has been set correctly 


outdir="/cloud/project/figures" # a path to where you want you plots to be saved
dir.create(outdir)

Next, we will read in our data matrixs generated by cellranger. Cellranger provides several output files but the one we need sample_filtered_feature_bc_matrix.h5 which is a matrix of the raw read counts for all cells per sample. This file is not human readable but can be visualized as a tabel where the cell barcodes are the column headers and the genes (or features) are the row names.

We will be using 3 replicates, which are separated into paris of ICB and ICB depleted T samples.

We will use a for-loop to accomplish two major steps. The first is the read in the cellranger files and the seocnd is to create a seurat object. So the for-loop can be read: “For every sample in the sample_names list, paste together a path where the h5 file for that sample can be found, print that path for verification, use the function Read10X_h5 to read the h5 file, create a seurat object using the function CreateSeuratObject, and finally all that sample’s suerat object to a list of seurat objects. When the for-loop completes, our sample.data list should contain all 6 samples.

The CreateSeuratObject function is given 4 parameters. The first is the counts which we supply with the data read in from the h5 files. The project parameter is basiclly what you want this seurat obejct’s identification to be, so we pass it the sample name. We also apply a little bit of filtering in this step by saying that we only want to create an object where each gene is found in a minimum of 10 cells and each cell has a minimum of 100 features.

# List of sample names
sample_names <- c("Rep1_ICBdT", "Rep1_ICB", "Rep3_ICBdT", "Rep3_ICB",
                  "Rep5_ICBdT", "Rep5_ICB")

sample.data  = list()
for (sample in sample_names) {
  path = paste("data/", sample, "-sample_filtered_feature_bc_matrix.h5", sep="") # paste together a path name for the h5 file, this is done to save time and have cleaner code 
  print(path)
  data = Read10X_h5(path) # read in the cellranger h5 file
  seurat_obj = CreateSeuratObject(counts = data, project = sample, min.cells = 10, min.features = 100)
  sample.data[[sample]] = seurat_obj
}

Calculate the Percent of Mitochondiral Genes within each cell

This will be used to identify dying cells

for (sample in sample_names) {
  sample.data[[sample]][["percent.mt"]] = 
    PercentageFeatureSet(sample.data[[sample]], pattern = "^mt-", assay = "RNA")
}

View Number of Cells Before Filtering

for (sample in sample_names) {
  print(table(Idents(sample.data[[sample]]))) 
}

Basic QC Plots

for (sample in sample_names) {
  print(sample)
  jpeg(sprintf("outdir/%s_unfilteredQC.jpg", sample), width = 16, height = 5, units = 'in', res = 150)
  p1 <- VlnPlot(sample.data[[sample]], features = c("nCount_RNA"), pt.size = 0) 
  p2 <- VlnPlot(sample.data[[sample]], features = c("nFeature_RNA"), pt.size = 0) + scale_y_continuous(breaks = c(0, 300, 500, 1000, 2000, 4000))
  p3 <- VlnPlot(sample.data[[sample]], features = c("percent.mt"), pt.size = 0) + scale_y_continuous(breaks = c(0, 12.5, 25, 50))
  p <- plot_grid(p1, p2, p3, ncol = 3)
  
  print(p)
  dev.off()
}

Remove low quality cells based off the QC plots

Mark cells before removal Might need to discuss these numbers

for (sample in sample_names) {
  sample.data[[sample]] <- subset(sample.data[[sample]], nFeature_RNA > 1000 & percent.mt <= 12) 
}

Merge the Samples


sample_names <- c("Rep1_ICBdT", "Rep1_ICB", "Rep3_ICBdT", "Rep3_ICB",
                  "Rep5_ICBdT", "Rep5_ICB")

merged <- merge(x = sample.data[["Rep1_ICBdT"]], y = c(sample.data[["Rep1_ICB"]], 
                                                       sample.data[["Rep3_ICBdT"]], sample.data[["Rep3_ICB"]], 
                                                       sample.data[["Rep5_ICBdT"]], sample.data[["Rep5_ICB"]]), 
                add.cell.ids = sample_names)

Post-filtration number of cells

for (sample in sample_names) {
  print(table(Idents(sample.data[[sample]]))) 
}
for (sample in sample_names) {
  print(sample)
  jpeg(sprintf("%s_filteredQC.jpg", sample), width = 16, height = 5, units = 'in', res = 150)
  p1 <- VlnPlot(sample.data[[sample]], features = c("nCount_RNA"), pt.size = 0) 
  p2 <- VlnPlot(sample.data[[sample]], features = c("nFeature_RNA"), pt.size = 0) + scale_y_continuous(breaks = c(0, 300, 500, 1000, 2000, 4000))
  p3 <- VlnPlot(sample.data[[sample]], features = c("percent.mt"), pt.size = 0) + scale_y_continuous(breaks = c(0, 12.5, 25, 50))
  p <- plot_grid(p1, p2, p3, ncol = 3)
  dev.off()
  print(p)
}

Normalize Data and other

As of now the various replicates are in their own layers. They need to be merged into 1 single layer for further analysis. Also add (draft) SingleR labels to the object.

merged
merged <- JoinLayers(merged)
merged
merged <- NormalizeData(merged, assay = "RNA", normalization.method = "LogNormalize", scale.factor = 10000)

merged <- FindVariableFeatures(merged, assay = "RNA", selection.method = "vst", nfeatures = 2000, mean.cutoff = c(0.1, 8), dispersion.cutoff = c(1,Inf))

cell.cycle.tirosh <- read.csv("http://genomedata.org/rnaseq-tutorial/scrna/CellCycleTiroshSymbol2ID.csv", header=TRUE); # read in the list of genes
s.genes = cell.cycle.tirosh$Gene.Symbol[which(cell.cycle.tirosh$List == "G1/S")]; # create a vector of S-phase genes
g2m.genes = cell.cycle.tirosh$Gene.Symbol[which(cell.cycle.tirosh$List == "G2/M")]; # create a vector of G2/M-phase genes
merged <- CellCycleScoring(object=merged, s.features=s.genes, g2m.features=g2m.genes, set.ident=FALSE) # NOT WORKING???

merged <- ScaleData(merged, verbose = TRUE) 
merged <- RunPCA(merged, npcs = 50, assay = "RNA") 
# merged <- JackStraw(merged, num.replicate = 100, dims = 30) # this takes a while to run (add photo)
# merged <- ScoreJackStraw(merged, dims = 1:30)

Jackstaw is an older and computationally expensive way to analyze the variability of PCs. However, it could be useful for further understanding what PCs to use. Usually p-values over -100 are indicative of useful PCs.

# plot <- JackStrawPlot(merged, dims = 1:30)
# jpeg(sprintf("JackStraw.jpg",), width = 8, height = 6, units = 'in', res = 150)
# print(plot)
# dev.off()

jpeg(sprintf("outdir/DimHm1_12.jpg"), width = 10, height = 20, units = 'in', res = 150)
DimHeatmap(merged, dims = 1:12, balanced = TRUE, cells = 500)
dev.off()

jpeg(sprintf("outdir/DimHm13_24.jpg"), width = 10, height = 20, units = 'in', res = 150)
DimHeatmap(merged, dims = 13:24, balanced = TRUE, cells = 500)
dev.off()

jpeg(sprintf("outdir/DimHm25_36.jpg"), width = 10, height = 20, units = 'in', res = 150)
DimHeatmap(merged, dims = 25:36, balanced = TRUE, cells = 500)
dev.off()
elbow <- ElbowPlot(merged)
jpeg(sprintf("outdir/Elbow.jpg"), width = 8, height = 6, units = 'in', res = 150)
print(elbow)
dev.off()

Determine how many PCA should be used for clustering

someone should explain how this works

View the Elbow plot, jackstraw plot, and heatmaps. Look for where the elbow plot levels out. See what PCAs the jackstraw plot pvalues show as significant (pvalue over -100), and also view the heatmaps to see what genes are driving the PCAs.

ndims = length(which(merged@reductions$pca@stdev > 2)) # determines which PCs are important (stdev>2) 
ndims

After looking at the elbow plot, the PC heatmaps, and JackStraw plot (maybe this takes a long time to run so we might want to get rid of it) we decided to select 26 PCs.

Visualize Cell Clustering

PC = 26
merged <- FindNeighbors(merged, dims = 1:PC)

merged <- FindClusters(merged, resolution = 1.2, cluster.name = 'seurat_clusters_res1.2')

merged <- RunUMAP(merged, dims = 1:PC)
jpeg("outdir/UMAP.jpg", width = 5, height = 4, units = 'in', res = 150)
DimPlot(merged, label = TRUE, group.by = 'seurat_clusters_res1.2')
dev.off()

# UMAP by sample/timepoint
jpeg("outdir/UMAP_orig.ident.jpg", width = 5, height = 4, units = 'in', res = 150)
DimPlot(merged, label = TRUE, group.by = "orig.ident")
dev.off()

# UMAP with one day highlighted (not saved)
highlighted_cells <- WhichCells(merged, expression = orig.ident == "Rep1_ICBdT")
DimPlot(merged, reduction = 'umap', group.by = 'orig.ident', cells.highlight = highlighted_cells)

Choosing cluster resolution is somewhat arbitrary and effects the number of clusters called (higher resolution calls more clusters). The shape of the UMAP does not change if you change the cluster resolution.

merged <- FindClusters(merged, resolution = 0.8, cluster.name = 'seurat_clusters_res0.8')

merged <- FindClusters(merged, resolution = 0.5, cluster.name = 'seurat_clusters_res0.5')

jpeg("outdir/UMAP_compare_res.jpg", width = 5, height = 4, units = 'in', res = 150)
DimPlot(merged, label = TRUE, group.by = 'seurat_clusters_res0.5') +
  DimPlot(merged, label = TRUE, group.by = 'seurat_clusters_res0.8') + 
  DimPlot(merged, label = TRUE, group.by = 'seurat_clusters_res1.2') 
dev.off()

The shape of the UMAP is determined by the number of PCs used to create the UMAP.

merged <- RunUMAP(merged, dims = 1:5)

jpeg("UMAP_5PCs.jpg", width = 5, height = 4, units = 'in', res = 150)
DimPlot(merged, label = TRUE, group.by = 'seurat_clusters_res1.2')
dev.off()

jpeg("DimHm1_5.jpg", width = 10, height = 20, units = 'in', res = 150)
DimHeatmap(merged, dims = 1:5, balanced = TRUE, cells = 500)
dev.off()

Introduction to Single-cell RNA-seq | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Introduction to Single-cell RNA-seq

Module 8 - Key concepts

Module 8 - Learning objectives

Module 8 - Lecture