Griffith Lab

Single-cell RNA-seq - CSHL legacy version


Exercise: A Complete Seurat Workflow

In this exercise, we will analyze and interpret a small scRNA-seq data set consisting of three bone marrow samples. Two of the samples are from the same patient, but differ in that one sample was enriched for a particular cell type. The goal of this analysis is to determine what cell types are present in the three samples, and how the samples and patients differ. This was drawn in part from the Seurat vignettes at https://satijalab.org/seurat/vignettes.html.

Step 1: Preparation

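Working at the Linux command line, first create a directory for the input data (~/workspace/scRNA_data) and download the data into it. Then create a new directory for your output files called “scrna” inside your workspace (/home/ubuntu/workspace), and download the PlotMarkers.r script into it. The full path to the output directory will be /home/ubuntu/workspace/scrna. The commands are: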

mkdir ~/workspace/scRNA_data
cd ~/workspace/scRNA_data
wget -r -N --no-parent -nH --reject zip -R "index.html*" --cut-dirs=2 http://genomedata.org/rnaseq-tutorial/scrna/
cd ~/workspace
mkdir scrna
cd scrna
wget http://genomedata.org/rnaseq-tutorial/scrna/PlotMarkers.r

Start R, then load some R libraries as follows:

library("Seurat");
library("sctransform");
library("dplyr");
library("RColorBrewer");
library("ggthemes");
library("ggplot2");
library("cowplot");
library("data.table");

Create a vector of convenient sample names, such as “A”, “B”, and “C”:

samples = c("A","B","C");

Create a variable called outdir to specify your output directory:

outdir = "/home/ubuntu/workspace/scrna";

Step 2: Read in the feature-barcode matrices generated by the cellranger pipeline

data.10x = list(); # first declare an empty list in which to hold the feature-barcode matrices
data.10x[[1]] <- Read10X(data.dir = "~/workspace/scRNA_data/ND050119_CD34_3pV3/filtered_feature_bc_matrix");
data.10x[[2]] <- Read10X(data.dir = "~/workspace/scRNA_data/ND050119_WBM_3pV3/filtered_feature_bc_matrix");
data.10x[[3]] <- Read10X(data.dir = "~/workspace/scRNA_data/ND050819_WBM_3pV3/filtered_feature_bc_matrix");

Step 3: Convert each feature-barcode matrix to a Seurat object

This simultaneously performs some initial filtering in order to exclude genes that are expressed in fewer than 100 cells, and to exclude cells that contain fewer than 700 expressed genes. Note that min.cells=10 and min.features=100 are more common parameters at this stage, but we are filtering more aggressively in order to make the data set smaller. At this step, we also create a “DataSet” identity for each cell.

scrna.list = list(); # First create an empty list to hold the Seurat objects
scrna.list[[1]] = CreateSeuratObject(counts = data.10x[[1]], min.cells=100, min.features=700, project=samples[1]);
scrna.list[[1]][["DataSet"]] = samples[1];
scrna.list[[2]] = CreateSeuratObject(counts = data.10x[[2]], min.cells=100, min.features=700, project=samples[2]);
scrna.list[[2]][["DataSet"]] = samples[2];
scrna.list[[3]] = CreateSeuratObject(counts = data.10x[[3]], min.cells=100, min.features=700, project=samples[3]);
scrna.list[[3]][["DataSet"]] = samples[3];

Aside: Note that you can do this more efficiently, especially if you have many samples, using a ‘for’ loop:

for (i in seq_along(data.10x)) { # seq_along() is safe even if the list is empty
    scrna.list[[i]] = CreateSeuratObject(counts = data.10x[[i]], min.cells=100, min.features=700, project=samples[i]);
    scrna.list[[i]][["DataSet"]] = samples[i];
}

Finally, remove the raw data to save memory (these objects get large!):

rm(data.10x);

Step 4. Merge the Seurat objects into a single object

We will call this object scrna. We also give it a project name (here, “CSHL”), and prepend the appropriate data set name to each cell barcode. For example, if a barcode from data set “B” is originally AATCTATCTCTC, it will now be B_AATCTATCTCTC. Then clean up some space by removing scrna.list. Finally, save the merged object as an RDS file. Should you need to load this file into R at any time, it can be done using the readRDS command.

scrna <- merge(x=scrna.list[[1]], y=c(scrna.list[[2]],scrna.list[[3]]), add.cell.ids = c("A","B","C"), project="CSHL");
rm(scrna.list); # save some memory
str(scrna@meta.data) # examine the structure of the Seurat object meta data
saveRDS(scrna, file = sprintf("%s/MergedSeuratObject.rds", outdir));
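
The barcode renaming described above can be illustrated in base R. This is a minimal sketch on made-up barcodes, not Seurat's internal code; the helper name prefix_barcodes and the barcode strings are hypothetical.

```r
# Hypothetical helper: prepend a sample ID to each barcode, separated by "_"
prefix_barcodes <- function(barcodes, sample_id) {
  paste(sample_id, barcodes, sep = "_")
}

# Made-up barcodes for illustration only
barcodes_b <- c("AATCTATCTCTC", "GGTACCATTGCA")
renamed <- prefix_barcodes(barcodes_b, "B")
print(renamed)

# The prefix also makes it easy to recover the sample of origin later
sample_of_origin <- sub("_.*$", "", renamed)
print(sample_of_origin)
```

Because the prefix is part of the cell name, cells from different samples that happen to share a barcode sequence remain distinguishable after merging.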

Aside on accessing the Seurat object meta data, which is stored in scrna@meta.data

Meta data can be used to hold the following information (and more) for your data set:

Summary statistics
Sample name
Cluster membership for each cell
Cell cycle phase for each cell
Batch or sample for each cell
Other custom annotations for each cell

You can access and query the meta data using commands such as:

scrna[[]];
scrna@meta.data;
str(scrna@meta.data); # Examine structure and contents of meta data
head(scrna@meta.data$nFeature_RNA); # Access the number of genes (“Features”) detected in each cell
head(scrna@meta.data$nCount_RNA); # Access the number of UMIs for each cell
levels(x=scrna); # List the items in the current default cell identity class
length(unique(scrna@meta.data$seurat_clusters)); # How many clusters are there? Note that there will not be any clusters in the meta data until you perform clustering.
unique(scrna@meta.data$Batch); # What batches are included in this data set? (Assumes a "Batch" column has been added; in this exercise the sample label is stored in "DataSet".)
scrna$NewIdentity <- vector_of_annotations; # Assign new cell annotations (one per cell) to a new "identity class" in the meta data

Step 5. Quality control plots

Plot the distributions of several quality-control variables in order to choose appropriate filtering thresholds. The number of genes and UMIs per cell (nFeature_RNA and nCount_RNA in the meta data) are automatically calculated for every object by Seurat. However, you will need to manually calculate the mitochondrial and ribosomal transcript content of each cell, and add them to the Seurat object meta data, as shown below. Note that the code below stores these values as fractions between 0 and 1 rather than as percentages, which is why the mitochondrial filter in Step 7 uses a threshold of 0.1.

Calculate the mitochondrial transcript percentage for each cell:

mito.genes <- grep(pattern = "^MT-", x = rownames(x = scrna), value = TRUE);
percent.mito <- Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts')[mito.genes, ]) / Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts'));
scrna[['percent.mito']] <- percent.mito;

Calculate the ribosomal transcript percentage for each cell:

ribo.genes <- grep(pattern = "^RP[SL][[:digit:]]", x = rownames(x = scrna), value = TRUE);
percent.ribo <- Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts')[ribo.genes, ]) / Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts'));
scrna[['percent.ribo']] <- percent.ribo;
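
To make the calculation concrete, here is a minimal base-R sketch of the same arithmetic on a toy count matrix. The gene names are real symbols used only as labels, and all counts are made up; a dense matrix stands in for the sparse matrix returned by GetAssayData.

```r
# Toy count matrix: rows are genes, columns are cells (values are made up)
counts <- matrix(
  c( 5,  0, 20,
     5,  1,  0,
    80,  9, 20,
    10, 10, 60),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("MT-CO1", "MT-ND1", "ACTB", "RPS3"),
                  c("cell1", "cell2", "cell3"))
)

# Select mitochondrial genes by name prefix, as in the tutorial code
mito.genes <- grep("^MT-", rownames(counts), value = TRUE)

# Fraction of each cell's counts that come from mitochondrial genes
frac.mito <- colSums(counts[mito.genes, , drop = FALSE]) / colSums(counts)
print(frac.mito)

# Multiply by 100 if you prefer a true percentage
print(100 * frac.mito)
```

The drop = FALSE argument keeps the subset a matrix even if only one gene matches the pattern, so colSums still works.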

Plot as violin plots, which will be located in, for example, ~/workspace/scrna/VlnPlot.pdf. All figures can be downloaded using the scp command, or viewed on the AWS server.

pdf(sprintf("%s/VlnPlot.pdf", outdir), width = 13, height = 6);
vln <- VlnPlot(object = scrna, features = c("percent.mito", "percent.ribo"), pt.size=0, ncol = 2, group.by="DataSet");
print(vln);
dev.off();

pdf(sprintf("%s/VlnPlot.nCount.25Kmax.pdf", outdir), width = 10, height = 10)
vln <- VlnPlot(object = scrna, features = "nCount_RNA", pt.size=0, group.by="DataSet", y.max=25000)
print(vln)
dev.off();

pdf(sprintf("%s/VlnPlot.nFeature.pdf", outdir), width = 10, height = 10)
vln <- VlnPlot(object = scrna, features = "nFeature_RNA", pt.size=0, group.by="DataSet")
print(vln)
dev.off()

QUESTIONS:

Excessive mitochondrial transcripts can indicate the presence of dead cells, which tend to cluster together. Based on the distribution of mitochondrial transcripts, what filter threshold would you set for mitochondrial transcripts? One approach is to start with a lenient threshold, work through the analysis, and determine later whether your data still contains clusters of dead cells.
Compare the distribution of ribosomal transcripts, total transcripts, and genes in each sample. Are differences in these parameters necessarily a technical artifact, or might they contain information about the biology of the samples?

Next, we will use Seurat’s FeatureScatter function to create scatterplots of the relationships among QC variables. This can be helpful in selecting filtering thresholds. More generally, this is a very useful wrapper function that can be used to visualize relationships between any pair of quantitative variables in the Seurat object (including expression levels, etc).

pdf(sprintf("%s/Scatter1.pdf", outdir), width = 8, height = 6);
scatter <- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "percent.mito", pt.size=0.1)
print(scatter);
dev.off();

pdf(sprintf("%s/Scatter2.pdf", outdir), width = 8, height = 6);
scatter <- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "percent.ribo", pt.size=0.1)
print(scatter);
dev.off();

pdf(sprintf("%s/Scatter3.pdf", outdir), width = 8, height = 6);
scatter <- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", pt.size=0.1)
print(scatter);
dev.off();

Step 6. Calculate a cell cycle score for each cell

This can be used to determine whether heterogeneity in cell cycle phase is driving the tSNE/UMAP layout and/or clustering. This may or may not be obscuring the signal you care about, depending on your analysis goals and the nature of the data. (If necessary, it can be removed in a later step.) It is also useful for determining whether certain populations of cells are more proliferative than others. The list of cell cycle genes and the scoring method were taken from Tirosh I, et al. (2016).

cell.cycle.tirosh <- read.csv("http://genomedata.org/rnaseq-tutorial/scrna/CellCycleTiroshSymbol2ID.csv", header=TRUE); # read in the list of genes
s.genes = cell.cycle.tirosh$Gene.Symbol[which(cell.cycle.tirosh$List == "G1/S")]; # create a vector of S-phase genes
g2m.genes = cell.cycle.tirosh$Gene.Symbol[which(cell.cycle.tirosh$List == "G2/M")]; # create a vector of G2/M-phase genes
scrna <- CellCycleScoring(object=scrna, s.features=s.genes, g2m.features=g2m.genes, set.ident=FALSE)

Step 7. Filter the cells to remove debris, dead cells, and probable doublets

QUESTION: How many cells are there in each sample before filtering? The ‘table’ function may come in handy.

First calculate some basic statistics on the various QC parameters, which can be helpful for choosing cutoffs. For example:

feat.min <- min(scrna@meta.data$nFeature_RNA);
feat.median <- median(scrna@meta.data$nFeature_RNA);
feat.max <- max(scrna@meta.data$nFeature_RNA);
feat.sd <- sd(scrna@meta.data$nFeature_RNA);
umi.min <- min(scrna@meta.data$nCount_RNA);
umi.max <- max(scrna@meta.data$nCount_RNA);
umi.mean <- mean(scrna@meta.data$nCount_RNA);
umi.sd <- sd(scrna@meta.data$nCount_RNA);
Count93 <- quantile(scrna@meta.data$nCount_RNA, 0.93) # calculate value in the 93rd percentile
print(paste("Feature stats (min, median, max, sd):", feat.min, feat.median, feat.max, feat.sd));
print(paste("UMI stats (min, mean, max, sd, 93rd percentile):", umi.min, umi.mean, umi.max, umi.sd, Count93));

Now, filter the data using the subset function and your chosen thresholds. Note that for large data sets with diverse samples, it may be beneficial to use sample-specific thresholds for some parameters. If you are not sure what thresholds to use, the following will work well for the purposes of this course:

scrna <- subset(x = scrna, subset = nFeature_RNA > 700  & nCount_RNA < Count93 & percent.mito < 0.1)
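
The logic of this filter can be sketched in base R on a toy meta data table. This is only an illustration with made-up values; the data frame meta stands in for scrna@meta.data, and the table function is used to count cells per sample before and after filtering.

```r
# Toy meta data for six cells from two samples (all values are made up)
meta <- data.frame(
  DataSet      = c("A", "A", "A", "B", "B", "B"),
  nFeature_RNA = c(650, 1200, 3000, 900, 2500, 5000),
  nCount_RNA   = c(1500, 4000, 9000, 2500, 8000, 60000),
  percent.mito = c(0.05, 0.02, 0.30, 0.04, 0.08, 0.03)
)

# Cells per sample before filtering
print(table(meta$DataSet))

# Upper UMI cutoff at the 93rd percentile, as in the tutorial
count93 <- quantile(meta$nCount_RNA, 0.93)

# A cell is kept only if it passes all three thresholds
keep <- meta$nFeature_RNA > 700 &
  meta$nCount_RNA < count93 &
  meta$percent.mito < 0.1
filtered <- meta[keep, ]

# Cells per sample after filtering
print(table(filtered$DataSet))
```

In this toy example one cell fails each criterion: too few genes (possible debris), too many UMIs (a possible doublet), and too high a mitochondrial fraction (a possible dead cell).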

QUESTION: How many cells are there in each sample after filtering?

Step 8. [Optional] Subset the data

If necessary, you can subset the data set to N cells (2000, 5000, etc) to make it more manageable:

subcells <- sample(Cells(scrna), size=N, replace=FALSE) # set N first, e.g. N <- 2000
scrna <- subset(scrna, cells=subcells)
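
Random subsampling gives a different set of cells on every run unless the random seed is fixed. The following base-R sketch shows one way to make the subset reproducible; the cell names and the seed value are arbitrary stand-ins, and the character vector all.cells plays the role of Cells(scrna).

```r
# Made-up cell names standing in for Cells(scrna)
all.cells <- paste0("cell", 1:10)

# Fixing the random seed makes the random subset reproducible
set.seed(42)
subcells <- sample(all.cells, size = 4, replace = FALSE)

# Re-running with the same seed gives the same cells
set.seed(42)
subcells.again <- sample(all.cells, size = 4, replace = FALSE)
print(identical(subcells, subcells.again))
```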

Step 9. Normalize the data, detect variable genes, and scale the data

Normalize the data:

scrna <- NormalizeData(object = scrna, normalization.method = "LogNormalize", scale.factor = 1e6);

QUESTION: What does LogNormalize do mathematically? Are there other normalization options available?
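
As a hint, the Seurat documentation describes LogNormalize as dividing each gene's count by the total counts for that cell, multiplying by the scale factor, and applying a natural-log transform with log1p. The base-R sketch below reproduces that arithmetic on a toy matrix with made-up counts; it is an illustration, not Seurat's implementation.

```r
# Toy count matrix: 3 genes x 2 cells (made-up values)
counts <- matrix(c(1, 3, 6,
                   0, 5, 5),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("cell1", "cell2")))

scale.factor <- 1e6  # the value used in this tutorial

# Divide each column (cell) by its total count, rescale, then take log(1 + x)
totals <- colSums(counts)
lognorm <- log1p(sweep(counts, 2, totals, "/") * scale.factor)
print(lognorm)
```

Adding 1 before taking the logarithm keeps zero counts at zero, so the sparsity of the matrix is preserved.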

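Now identify and plot the most variable genes, which will be used for downstream analyses. This is a critical step that reduces the contribution of noise. Consider adjusting the cutoffs if you think (often based on prior knowledge of your experimental system) that important genes are being excluded. Note that with selection.method = 'vst', Seurat selects the top nfeatures genes (2000 by default); according to the Seurat documentation, the mean.cutoff and dispersion.cutoff arguments only take effect with selection.method = 'mean.var.plot'.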

scrna <- FindVariableFeatures(object = scrna, selection.method = 'vst', mean.cutoff = c(0.1,8), dispersion.cutoff = c(1, Inf))
print(paste("Number of Variable Features: ",length(x = VariableFeatures(object = scrna))));

pdf(sprintf("%s/VG.pdf", outdir), useDingbats=FALSE)
vg <- VariableFeaturePlot(scrna)
print(vg);
dev.off()

Scale and center the data:

scrna <- ScaleData(object = scrna, features = rownames(x = scrna), verbose=FALSE);
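
Scaling and centering means converting each gene's expression values to z-scores across cells. The base-R sketch below shows the idea on a toy matrix with made-up values; it is not Seurat's code, and ScaleData additionally caps extreme scaled values by default (see its scale.max argument).

```r
# Toy normalized expression: 2 genes x 4 cells (made-up values)
expr <- matrix(c( 1,  2,  3,  4,
                 10, 10, 20, 20),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("g1", "g2"), paste0("cell", 1:4)))

# Center and scale each gene (row) across cells: z = (x - mean) / sd
# scale() works on columns, so transpose before and after
scaled <- t(scale(t(expr), center = TRUE, scale = TRUE))

print(round(rowMeans(scaled), 10))  # each gene now has mean 0
print(apply(scaled, 1, sd))         # and standard deviation 1
```

After scaling, highly expressed genes no longer dominate the PCA simply because their values are larger.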

Alternatively, you can scale the data and simultaneously remove unwanted signal associated with variables such as cell cycle phase, ribosomal transcript content, etc. (This is slow, and cannot be done in the time allotted for this course.) To remove cell cycle signal, for instance:

# scrna <- ScaleData(object = scrna, features = rownames(x = scrna), vars.to.regress = c("S.Score","G2M.Score"), verbose=FALSE);

Save the normalized, scaled Seurat object:

saveRDS(scrna, file = sprintf("%s/VST.rds", outdir));

DIGRESSION: How can you use Seurat-processed data with packages that are not compatible with Seurat? Other packages may require the data to be normalized in a specific way, and often require an expression matrix (not a Seurat object) as input. As an example, here we prepare an expression data matrix for use with the popular CNV-detection package CONICSmat:

scrna.cnv <- NormalizeData(object = scrna, normalization.method = "RC", scale.factor = 1e5);
data.cnv <- GetAssayData(object=scrna.cnv, slot="data"); # get the normalized data
log2data = log2(data.cnv+1); # add 1 then take log2
df <- as.data.frame(as.matrix(log2data)); # convert it to a data frame
cells <- as.data.frame(colnames(df));
genes <- as.data.frame(rownames(df));
# save as text files:
fwrite(x = genes, file = "genes.csv", col.names=FALSE);
fwrite(x = cells, file = "cells.csv", col.names=FALSE);
fwrite(x = df, file = "exp.csv", col.names=FALSE);

Step 10. Reduce the dimensionality of the data using Principal Component Analysis

Subsequent calculations, such as those used to derive the tSNE and UMAP projections, and the k-Nearest Neighbor graph used for clustering, are performed in a new space with fewer dimensions, namely, the principal components. Here, specify a relatively large number of principal components – more than you anticipate using for downstream analyses. Then use several techniques to characterize the components and estimate the number of principal components that captures the signal of interest while minimizing noise.

Perform Principal Component Analysis (PCA), and save the first 100 components:

scrna <- RunPCA(object = scrna, npcs = 100, verbose = FALSE);

OPTIONAL: Then run ProjectDim, which scores each gene in the dataset (including genes not included in the PCA) based on their correlation with the calculated components. This is not used elsewhere in this pipeline, but it can be useful for exploring genes that are not among the 2000 most highly variable genes selected above.

scrna <- ProjectDim(object = scrna)

QUESTION: What do the principal components “mean” from a biological standpoint? What genes contribute to the principal components? Do they represent biological processes of interest, or technical variables (such as mitochondrial transcripts) that suggest the data may need to be filtered differently?

There are several easy ways to investigate these questions. First, visualize the PCA “loadings.” Each “component” identified by PCA is a linear combination, or weighted sum, of the genes in the data set. Here, the “loadings” represent the weights of the genes in any given component. These plots tell you which genes contribute most to each component:

pdf(sprintf("%s/VizDimLoadings.pdf", outdir), width = 8, height = 30);
vdl <- VizDimLoadings(object = scrna, dims = 1:3)
print(vdl);
dev.off();
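
The statement that each component is a weighted sum of genes can be checked directly in base R. In this sketch, prcomp stands in for RunPCA and the data are random numbers, so the components have no biological meaning; the point is only the relationship between loadings and scores.

```r
# Toy data: 6 cells (rows) x 3 genes (columns), made-up values
set.seed(1)
expr <- matrix(rnorm(18), nrow = 6,
               dimnames = list(paste0("cell", 1:6), c("g1", "g2", "g3")))

pca <- prcomp(expr, center = TRUE, scale. = FALSE)

# The loadings: one weight per gene for each component
print(pca$rotation)

# Each cell's score on a component is the weighted sum of its centered
# expression values, using that component's loadings as the weights
centered <- scale(expr, center = TRUE, scale = FALSE)
scores.by.hand <- centered %*% pca$rotation
print(all.equal(unname(scores.by.hand), unname(pca$x)))
```

Genes with large positive or negative loadings are the ones that move a cell furthest along that component.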

Second, use the DimHeatmap function to generate heatmaps that summarize the expression of the most highly weighted genes in each principal component. As noted in the Seurat documentation, “both cells and genes are ordered according to their PCA scores. Setting cells.use to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated gene sets.” (In the Seurat version used here, the cells.use argument is named cells.)

pdf(sprintf("%s/PCA.heatmap.multi.pdf", outdir), width = 8.5, height = 24);
hm.multi <- DimHeatmap(object = scrna, dims = 1:12, cells = 500, balanced = TRUE);
print(hm.multi);
dev.off();

Finally, you can generate ranked lists of the genes in each principal component and perform functional enrichment or Gene Set Enrichment Analysis. (This tool offers a quick and easy way to determine functional enrichment from a list of genes.) For example, for the first principal component:

PClist_1 <- names(sort(Loadings(object=scrna, reduction="pca")[,1], decreasing=TRUE));

Now, decide how many components to use in downstream analyses. This number usually varies from 5-50, depending on the number of cells and the complexity of the data set. Although there is no “correct” answer, using too few components risks missing meaningful signal, and using too many risks diluting meaningful signal with noise.

There are several ways to make an informed decision. The first is to use the principal component heatmaps generated above. Components that generate noisy heatmaps likely correspond to noise. The second method is to examine a plot of the standard deviations of the principal components, and to choose a cutoff to the left of the bend in this so-called “elbow plot.”

Generate an elbow plot of principal component standard deviations:

elbow <- ElbowPlot(object = scrna)
pdf(sprintf("%s/PCA.elbow.pdf", outdir), width = 6, height = 8);
print(elbow);
dev.off();
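
The quantity behind an elbow plot can be computed by hand. The base-R sketch below uses prcomp on simulated data with one strong direction of variation, so the drop after the first component is easy to see; the data and gene names are made up for illustration.

```r
# Toy data with one strong direction of variation plus noise (made up)
set.seed(2)
signal <- rnorm(50)
expr <- cbind(g1 = signal + rnorm(50, sd = 0.1),
              g2 = signal + rnorm(50, sd = 0.1),
              g3 = rnorm(50, sd = 0.1),
              g4 = rnorm(50, sd = 0.1))

pca <- prcomp(expr, center = TRUE)

# Standard deviation of each component: the quantity shown in an elbow plot
print(pca$sdev)

# Fraction of total variance explained by each component
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var.explained, 3))
```

Here nearly all the variance sits in the first component, so the "elbow" falls immediately after it; real data sets usually show a more gradual bend.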

Next, use a permutation-based resampling technique called JackStraw analysis to estimate a p-value for each component, print out a plot, and save the p-values to a file:

scrna <- JackStraw(object = scrna, num.replicate = 100, dims=30); # takes around 4 minutes
scrna <- ScoreJackStraw(object = scrna, dims = 1:30)
pdf(sprintf("%s/PCA.jackstraw.pdf", outdir), width = 10, height = 6);
js <- JackStrawPlot(object = scrna, dims = 1:30)
print(js);
dev.off();
pc.pval <- scrna@reductions$pca@jackstraw@overall.p.values; # get p-value for each PC
write.table(pc.pval, file=sprintf("%s/PCA.jackstraw.scores.xls", outdir), quote=FALSE, sep='\t', col.names=TRUE);

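Step 11. Generate two-dimensional layouts using t-SNE and UMAP

Use the number of principal components (nPC) you selected above to compute the UMAP and t-SNE layouts: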

nPC = 10;
scrna <- RunUMAP(object = scrna, reduction = "pca", dims = 1:nPC);
scrna <- RunTSNE(object = scrna, reduction = "pca", dims = 1:nPC);

Now, plot the tSNE and UMAP plots next to each other in one figure, and color each data set separately:

pdf(sprintf("%s/UMAP.%d.pdf", outdir, nPC), width = 10, height = 8);
p1 <- DimPlot(object = scrna, reduction = "tsne", group.by = "DataSet", pt.size=0.1)
p2 <- DimPlot(object = scrna, reduction = "umap", group.by = "DataSet", pt.size=0.1)
print(plot_grid(p1, p2));
dev.off();

QUESTIONS:

How do the data sets compare to each other? (We will further investigate these differences in subsequent steps.)
How does the number of principal components used affect the layout?
What are the chief sources of variation in this data, as suggested by the t-SNE and UMAP layouts? Are there confounding technical variables that may be driving the layouts? What are some likely technical variables?

Color the t-SNE and UMAP plots by some potential confounding variables. Here’s an example in which we color each cell according to the number of UMIs it contains:

feature.pal = rev(colorRampPalette(brewer.pal(11,"Spectral"))(50)); # a useful color palette
pdf(sprintf("%s/umap.%d.colorby.UMI.pdf", outdir, nPC), width = 10, height = 8);
fp <- FeaturePlot(object = scrna, features = c("nCount_RNA"), cols = feature.pal, pt.size=0.1, reduction = "umap") + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank()); # the text after the ‘+’ simply removes the axis using ggplot syntax
print(fp);
dev.off();

QUESTION: What is the relationship between the principal components and the t-SNE/UMAP layout?

To investigate this, plot several principal components on the t-SNE/UMAP layout. For example, the following code plots the first principal component and prints the plot to a file:

pdf(sprintf("%s/UMAP.%d.colorby.PCs.pdf", outdir, nPC), width = 12, height = 6);
redblue=c("blue","gray","red"); # another useful color scheme
fp1 <- FeaturePlot(object = scrna, features = 'PC_1', cols=redblue, pt.size=0.1, reduction = "umap")+ theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank());
print(fp1);
dev.off();

Step 12: Infer cell types

There are many sophisticated methods for doing this (e.g. SingleR), but the simplest and most common approach is to plot the expression levels of marker genes for known cell types. Markers for bone-marrow-relevant cell types are provided in the file ~/workspace/scRNA_data/gene_lists_human_180502.csv. To plot three genes of your choice, replace GENE1, GENE2, and GENE3 in the code below with quoted gene symbols (for example "CD34"), or define them as character variables beforehand:

pdf(sprintf("%s/geneplot.pdf", outdir), height=6, width=6);
fp <- FeaturePlot(object = scrna, features = c(GENE1, GENE2, GENE3), cols = c("gray","red"), ncol=2, reduction = "umap") + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank());
print(fp);
dev.off();

Now use the PlotMarkers.r script that we downloaded in Step 1 to color the UMAP according to the expression of the markers in gene_lists_human_180502.csv:

source("~/workspace/scrna/PlotMarkers.r")

During the differential expression analysis in Step 14, which will take about 10 minutes to run, use these plots to make inferences about cell type.

Step 13: Cluster the cells using a graph-based clustering algorithm

The first step is to generate the k-Nearest Neighbor (KNN) graph using the number of principal components chosen above (nPC). The second step is to partition the graph into “cliques” or clusters using the Louvain modularity optimization algorithm. At this step, the cluster resolution (cluster.res) may be specified. (Larger numbers generate more clusters.) While there is no “correct” number of clusters, it can be preferable to err on the side of too many clusters. For this exercise, please use the following:

nPC = 10;
cluster.res = 0.2;
scrna <- FindNeighbors(object=scrna, dims=1:nPC);
scrna <- FindClusters(object=scrna, resolution=cluster.res);
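
The first step, finding each cell's nearest neighbors in principal component space, can be sketched in a few lines of base R. This toy example uses made-up coordinates and a tiny k, and it is not how Seurat implements the search; FindNeighbors also derives a shared nearest neighbor graph before clustering, which is not shown here.

```r
# Toy PCA coordinates for 6 cells forming two obvious groups (made up)
pcs <- rbind(c(0.0, 0.0), c(0.1, 0.0), c(0.0, 0.1),
             c(5.0, 5.0), c(5.1, 5.0), c(5.0, 5.1))
rownames(pcs) <- paste0("cell", 1:6)

k <- 2  # number of neighbors per cell (far smaller than in a real analysis)

# Pairwise Euclidean distances between cells in PC space
d <- as.matrix(dist(pcs))
diag(d) <- Inf  # a cell is not its own neighbor

# For each cell, the names of its k nearest neighbors
knn <- t(apply(d, 1, function(row) names(sort(row))[1:k]))
print(knn)
```

Every cell's neighbors fall within its own group, so a graph built from these edges splits cleanly into two communities, which is what the modularity optimization step looks for.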

The output of FindClusters is saved in scrna@meta.data$seurat_clusters. Note that this is reset each time clustering is performed. To ensure that each clustering result is saved, save the result as a new identity class, and give it a custom name that reflects the clustering resolution and number of principal components:

scrna[[sprintf("ClusterNames_%.1f_%dPC", cluster.res, nPC)]] <- Idents(object = scrna);

Inspect the structure of the meta data, then save the Seurat object, which now contains t-SNE and UMAP coordinates, and clustering results:

str(scrna@meta.data);
saveRDS(scrna, file = sprintf("%s/VST.PCA.UMAP.TSNE.CLUST.rds", outdir));

QUESTION: How many graph-based clusters are there? (This number is ‘n.graph’ and is used below.) How do they relate to the 2-D layouts? How does this depend on the number of components and the clustering resolution?

First plot graph-based clusters on 2-D layouts:

n.graph = length(unique(scrna[[sprintf("ClusterNames_%.1f_%dPC",cluster.res, nPC)]][,1])); # automatically get the number of clusters from a specific clustering run

Or more simply, use the most recent default clustering result:

n.graph = length(unique(scrna@meta.data$seurat_clusters));

rainbow.colors = rainbow(n.graph, s=0.6, v=0.9); # color palette
pdf(sprintf("%s/UMAP.clusters.pdf", outdir), width = 10, height = 6);
p <- DimPlot(object = scrna, reduction = "umap", group.by = "seurat_clusters", cols = rainbow.colors, pt.size=0.1, label=TRUE) + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank());
print(p);
dev.off();

QUESTION: Are there sample-specific clusters? How many cells are in each cluster and each sample?

cluster.breakdown <- table(scrna@meta.data$DataSet, scrna@meta.data$seurat_clusters);
print(cluster.breakdown); # rows are samples, columns are clusters
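
A sample-by-cluster table is easier to interpret as proportions. The following base-R sketch uses a toy meta data table with made-up assignments; the data frame meta stands in for scrna@meta.data.

```r
# Toy meta data: sample and cluster for eight cells (made-up values)
meta <- data.frame(
  DataSet = c("A", "A", "A", "B", "B", "B", "C", "C"),
  seurat_clusters = factor(c(0, 0, 1, 0, 1, 1, 2, 2))
)

# Cell counts: rows are samples, columns are clusters
cluster.breakdown <- table(meta$DataSet, meta$seurat_clusters)
print(cluster.breakdown)

# Fraction of each sample's cells that falls in each cluster (rows sum to 1)
print(round(prop.table(cluster.breakdown, margin = 1), 2))

# A cluster is sample-specific if only one sample contributes cells to it
print(colSums(cluster.breakdown > 0) == 1)
```

Proportions matter because samples often differ in total cell number, which makes raw counts hard to compare.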

QUESTION: What do the clusters represent? How do they differ from each other? Start with a differential expression analysis:

Step 14: Interpret the clustering using a differential gene expression (DEG) analysis

Perform DEG analysis on all clusters simultaneously using the default differential expression test (Wilcoxon), then save the results to a file. This will take 10-15 minutes using the parameters here, which were chosen to make this step run faster, and are not necessarily ideal for all situations. (While this is running, use the plots generated in Step 12 to figure out what types of cells are present in this data set.) Now compute the DEGs and save to a file:

DEGs <- FindAllMarkers(object=scrna, logfc.threshold=1, min.diff.pct=.2);
write.table(DEGs, file=sprintf("%s/DEGs.Wilcox.xls", outdir), quote=FALSE, sep="\t", row.names=FALSE);

Examine the contents and structure of DEGs. How many DEGs are there? What values are contained in this output? Now choose the top 10 DEGs in each cluster, and print them to a heatmap using DoHeatmap and a red/white/blue color scheme:

top10 <- DEGs %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC);
pdf(sprintf("%s/heatmap.pdf", outdir), height=20, width=15);
hm <- DoHeatmap(scrna, features=top10$gene, slot="scale.data", disp.min=-2, disp.max=2, group.by="ident", group.bar=TRUE) + scale_fill_gradientn(colors = c("blue", "white", "red")) + theme(axis.text.y = element_text(size = 10));
print(hm); # print explicitly so the plot is also written when the code is run via source()
dev.off();

Questions pertaining to cell type inference and DEG analysis:

What cell types are present?
Plot some DEGs using FeaturePlot. What is more reliable or informative: the statistical significance or the log fold-change of the DEG?
Do different clusters correspond to different cell types? (Should they?)
Are the DEGs helpful for identifying cell types?
Does cell type correlate with other parameters (e.g. UMI count, number of genes, cell cycle phase, etc.)?

Independent exercises, if time permits

Perform a sample-wise differential expression analysis. Then make a heatmap and perform functional enrichment analysis of the differentially expressed genes. Overall, how do the samples differ from each other?
Experiment with the number of components, the clustering resolution, and the DEG filtering thresholds to understand how these parameters affect the results. What set of parameters provides the closest correspondence between cell type and cluster?
Perform batch correction using the sample code provided during the lecture. Assign each sample to its own batch, and repeat the analysis. How does batch correction affect the result?
Subset the T-cells, assign them to a new Seurat object, and re-analyze them in isolation. Does this improve your ability to resolve T-cell subsets?

Log into Compute Canada


Signing into Compute Canada for the course

In order to sign into your Compute Canada instance, you will need a valid user ID and password for Compute Canada. These should have been provided to you by the instructors.

Logging in with ssh (Mac/Linux)

ssh user#@login1.CBW.calculquebec.cloud

user# is the name of a user on the system you are logging into. login1.CBW.calculquebec.cloud is the address of the Linux system on Compute Canada that you are logging into. Instead of using the public DNS name, you could also use the IP address if you know it. When prompted, enter your password.

Logging in with PuTTY (Windows)

To log in on Windows, you must first install PuTTY. Once you have PuTTY installed, you can log in using the following parameters. If you would like photos of where to input these parameters, please refer here.

Session-hostname: login1.CBW.calculquebec.cloud

Connection-Data-Auto-login username: user#

Copying files to your computer

To copy files from an instance, use scp in a similar fashion (in this case to copy a file called nice_alignments.bam):

scp user#@login1.CBW.calculquebec.cloud:nice_alignments.bam .

Using Jupyter Notebook or JupyterLab

Everything created in your workspace on the cloud is also available through a web server using Jupyter Notebook or JupyterLab. You can also perform Python/R analysis and access an interactive command-line terminal via JupyterLab. Simply go to the following address in your browser and choose Jupyter Notebook (or JupyterLab) in the User Interface dropdown menu. For simply browsing and downloading files, select Number of cores = 1 and Memory (MB) = 3200. For analysis in JupyterLab, select Number of cores = 4 and Memory (MB) = 32000. NOTE: If you request resources both from your terminal/PuTTY (e.g., salloc requests) and via Jupyter, the requests are additive. Make sure to terminate any terminal or Jupyter session that is not in use. It is important to log out once you finish a Jupyter session in order to release its resources. If you only close the browser window, your Jupyter session keeps running and continues to use the resources.

https://jupyter.cbw.calculquebec.cloud/

File system layout

When you log in, you will be in your home directory (e.g., /home/user##). You will notice that you have three directories: “CourseData”, “projects”, and “scratch”. For the purposes of this course, we will mostly be working in your home directory and making use of some data files in the CourseData directory.

How to request and use a compute node

After you log into the cluster, you will be on the login node. This has very limited compute and memory resources. Do NOT run anything on the login node. You can access a compute node with an interactive session using the salloc command. For example: salloc --mem 24000M -c 4 -t 8:0:0

--mem: the real memory (in megabytes) required per node.
-c | --cpus-per-task: number of processors required.
-t | --time: limit on the total run time of the job allocation.

The above command requests an interactive session with 4 cores and 24000M of memory for 8 hours. Once the job is allocated, you will be on one of the compute nodes.

After you have received your compute node, you will need to load the software that we will be using for this workshop.

This can be done with the following command.

module load samtools/1.10 bam-readcount/0.8.0 hisat2/2.2.0 stringtie/2.1.0 gffcompare/0.11.6 tophat/2.1.1 kallisto/0.46.1 fastqc/0.11.8 multiqc/1.8 picard/2.20.6 flexbar/3.5.0 RSeQC/3.0.1 bedops/2.4.39 ucsctools/399 r/4.0.0 python/3.7.4 HTSeq/1.18.1 regtools/0.5.2

Getting information on your compute jobs

The following commands allow you to see all current jobs requested by your user and to cancel a job if needed. This may be necessary if you get disconnected from your compute session and wind up with “zombie” jobs that you are no longer connected to. The first command lists your jobs and their job IDs; substitute the relevant job ID for $jobid in the second command.

squeue -u $USER
scancel $jobid

When you are done with the compute node, make sure to type exit to exit the node and free up the resources you allocated for the node.

Strand Settings


There are various strand-related settings for RNA-seq tools that must be adjusted to account for library construction strategy. The following table provides read orientation codes and software settings for commonly used RNA-seq analysis tools including: IGV, TopHat, HISAT2, HTSeq, Picard, Kallisto, StringTie, and others. Each of these explanations/settings is provided for several commonly used RNA-seq library construction kits that produce either stranded or unstranded data.

NOTE: A useful tool to infer strandedness of your raw sequence data is the check_strandedness tool. We provide a tutorial for using this tool here.

NOTE: In the table below, the list of methods/kits for specific strand settings assumes that these kits are used as specified by their manufacturer. It is very possible that a sequencing provider/core may make modifications to these kits. For example, in one case we obtained RNAseq data processed with NEBNext Ultra II Directional kit (dUTP method). However instead of using the NEB hairpin adapters, IDT xGen UDI-UMI adapters were substituted, and this results in the insert strandedness being flipped (from RF/fr-firststrand to FR/fr-secondstrand). Because this level of detail is not always provided it is highly recommended to confirm your data’s strandedness empirically.

Tool	RF/fr-firststrand stranded (dUTP)	FR/fr-secondstrand stranded (Ligation)	Unstranded
check_strandedness (output)	RF/fr-firststrand	FR/fr-secondstrand	unstranded
IGV (5p to 3p read orientation code)	F2R1	F1R2	F2R1 or F1R2
TopHat (`--library-type` parameter)	`fr-firststrand`	`fr-secondstrand`	`fr-unstranded`
HISAT2 (`--rna-strandness` parameter)	`R/RF`	`F/FR`	NONE
HTSeq (`--stranded`/`-s` parameter)	`reverse`	`yes`	`no`
STAR	n/a (STAR doesn’t use library strandedness info for mapping)	NONE	NONE
Picard CollectRnaSeqMetrics (`STRAND_SPECIFICITY` parameter)	`SECOND_READ_TRANSCRIPTION_STRAND`	`FIRST_READ_TRANSCRIPTION_STRAND`	NONE
Kallisto quant (parameter)	`--rf-stranded`	`--fr-stranded`	NONE
StringTie (parameter)	`--rf`	`--fr`	NONE
FeatureCounts (`-s` parameter)	`2`	`1`	`0`
RSEM (`--forward-prob` parameter)	`0`	`1`	`0.5`
Salmon (`--libType` parameter)	`ISR` (assuming paired-end with inward read orientation)	`ISF` (assuming paired-end with inward read orientation)	`IU` (assuming paired-end with inward read orientation)
Trinity (`--SS_lib_type` parameter)	`RF`	`FR`	NONE
MGI CWL YAML (`strand` parameter)	`first`	`second`	NONE
WASHU WDL YAML (`strand` parameter)	`first`	`second`	`unstranded`
RegTools (`strand` parameter)	`-s RF`	`-s FR`	`-s XS`
Example kits	Example methods/kits: dUTP, NSR, NNSR, Illumina TruSeq Strand Specific Total RNA, NEBNext Ultra II Directional	Example methods/kits: Ligation, Standard SOLiD, NuGEN Encore, 10X 5’ scRNA data	Example kits/data: Standard Illumina, NuGEN OvationV2, SMARTer universal low input RNA kit (TaKara), GDC normalized TCGA data
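
As a rough illustration of how the table can be applied in a pipeline, the sketch below maps a strandedness label to the corresponding settings for four of the tools listed (HISAT2, HTSeq, featureCounts and StringTie). The function name strand_flags is hypothetical, the values are copied from the table above, and paired-end data is assumed for the HISAT2 setting.

```shell
# Print the table's settings for a few tools, given a strandedness label.
# Usage: strand_flags <RF|FR|unstranded>
strand_flags() {
    case "$1" in
        RF|fr-firststrand|RF/fr-firststrand)
            echo "hisat2: --rna-strandness RF"
            echo "htseq-count: --stranded reverse"
            echo "featureCounts: -s 2"
            echo "stringtie: --rf"
            ;;
        FR|fr-secondstrand|FR/fr-secondstrand)
            echo "hisat2: --rna-strandness FR"
            echo "htseq-count: --stranded yes"
            echo "featureCounts: -s 1"
            echo "stringtie: --fr"
            ;;
        unstranded)
            echo "hisat2: (omit --rna-strandness)"
            echo "htseq-count: --stranded no"
            echo "featureCounts: -s 0"
            echo "stringtie: (omit --rf/--fr)"
            ;;
        *)
            echo "unknown strandedness label: $1" >&2
            return 1
            ;;
    esac
}

strand_flags RF
```

The label passed in should come from an empirical check of your own data, for the reasons given in the note above the table.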

Notes

To identify which --library-type setting to use with TopHat, Illumina specifically documents the types in the ‘RNA Sequencing Analysis with TopHat’ Booklet. For the TruSeq RNA Sample Prep Kit, the appropriate library type is fr-unstranded. For TruSeq stranded sample prep kits, the library type is specified as fr-firststrand. These posts are also very informative: How to tell which library type to use (fr-firststrand or fr-secondstrand)?, How to determine if a library Is strand-specific, and Strandness in RNASeq by Hong Zheng.

Another suggestion is to view aligned reads in IGV and determine the read orientation by one of two methods. First, you can have IGV color alignments according to strand using the ‘Color alignments’ by ‘First-of-pair strand’ setting. Second, to get more detailed information, you can hover your cursor over a read aligned to an exon. ‘F2 R1’ means the second read in the pair aligns to the forward strand and the first read in the pair aligns to the reverse strand. For a positive DNA strand transcript (5’ to 3’) this would denote a fr-firststrand setting in TopHat, i.e. “the right-most end of the fragment (in transcript coordinates) is the first sequenced”. For a negative DNA strand transcript (3’ to 5’) this would denote a fr-secondstrand setting in TopHat. ‘F1 R2’ means the first read in the pair aligns to the forward strand and the second read in the pair aligns to the reverse strand. See above for the complete definitions, but it is simply the inverse for ‘F1 R2’ mapping.

Anything other than FR orientation is not covered here, and discussion with the individual responsible for library creation would be required. Typically ‘RF’ orientation is reserved for large-insert mate-pair libraries. Other orientations like ‘FF’ and ‘RR’ seem impossible with Illumina sequencing technology and suggest structural variation between the sample and reference. Additional details are provided in the TopHat manual.

For HTSeq, the htseq-count manual indicates that for the --stranded option, stranded=no means that a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed.

For the ‘CollectRnaSeqMetrics’ sub-command of Picard, the Picard manual indicates that one should use FIRST_READ_TRANSCRIPTION_STRAND if the reads are expected to be on the transcription strand.

Example data providers

Examples (from check_strandedness) that we have observed from different providers (note that these could be changed by the provider at any time, so you should always check your own data):

Boston Gene: RF/fr-firststrand
Personalis: RF/fr-firststrand
WASHU CLE Lab: RF/fr-firststrand
Caris: RF/fr-firststrand
Tempus: FR/fr-secondstrand
IGM @ Nationwide Children’s Hospital: FR/fr-secondstrand

Complete Result Sets

0009-11-01T00:00:00+00:00

Introduction

The following links provide examples of complete result sets for different iterations of this course. These are meant to be the complete set of result files obtained by the instructor running through all the commands of the course. The files are made available in the same file/directory structure as you should get from following the instructions yourself.

CBW June 2020 (Virtual)

Bioinformatics Best Practices

0009-10-01T00:00:00+00:00

Introduction

This best practices guide provides a basic overview of useful practices and tools for managing bioinformatics environments and analysis development.

Managing Your Analysis with Notebooks

Similar to the use of a laboratory notebook, taking notes about the procedures and analysis you performed is critical to reproducible science. There are a number of scientific computing notebooks available, but the most popular by far is the Jupyter Notebook.

Jupyter supports interactive data science and scientific computing across many languages, although the most popular use of Jupyter is with Python, as the Jupyter notebook is built upon the Python-based IPython Notebook.

Example notebooks

A live version of Jupyter is available to try online, and provides several example notebooks in a few different languages. You can also check out a real analysis of Guide to Pharmacology gene family data for incorporation into the Drug-Gene Interaction Database.

Versioning Code with Git and GitHub

Git is a distributed version control system that allows users to make changes to code while simultaneously documenting those changes and preserving a history, allowing code to be rolled back to a previous version quickly and safely. GitHub is a freemium, online repository hosting service. You may use GitHub to track projects, discuss issues, document applications, and review code. GitHub is one of the best ways to share your projects, and should be used from the very onset of a project. Some forethought should be given in creating and managing a repository, however, as GitHub is not a good place to share very large or sensitive data files. See the 10-minute introduction to using GitHub.

Managing Your Compute Environment

One of the most challenging aspects of bioinformatics workflows is reproducibility. In addition to documenting your analysis with a notebook, providing a copy of your compute environment limits variability in results, allowing for future reproduction of results. Many options exist to handle this; some of the most common are presented below.

AWS Elastic Compute Cloud (EC2) is a useful service for creating entire virtual machines that can easily be copied and distributed. This option does require a paid account with Amazon, and the costs of storing the images and running instances may add up over time, especially if every analysis is stored in a separate image. Additionally, this option does not isolate the analysis environment from the system environment, potentially leading to changes in analysis output as system libraries are updated over time. The RNA-seq wiki makes heavy use of AWS as a distribution platform.

VirtualBox is a general-purpose full virtualizer that allows you to emulate a computer, complete with virtual disks, a virtual operating system, and any data and applications stored therein. It has the advantage of creating machines that are stored and run on local hardware (e.g. your personal workstation), but the extra overhead of running a virtual computer on top of a host operating system can considerably slow performance of tools stored on the virtual machine, and thus is best used for testing or demonstration purposes.

Docker packages apps and their dependencies into containers which may be docked to a docker engine running on a computer. Docker engines are available on all major operating systems, and allow software to remain infrastructure independent while sharing a filespace and system resources with other docked containers. This is a much more efficient approach than guest virtual machines, and containers may be docked locally or on cloud-based infrastructure.

Conda is a language-agnostic package, dependency and environment management system. It is included in Anaconda, a data-science-focused distribution built around Conda. Anaconda bundles Python and R packages for the analysis of large-scale scientific data. Bioinformaticians also commonly use Bioconda, which adds a channel to Conda with bioinformatics tools (such as the popular sequence alignment tool BWA).

POSIT Setup

0009-09-03T00:00:00+00:00

Posit setup for use in CRI 2024 workshop

This tutorial explains how Posit Cloud RStudio was configured for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers who wish to conduct their hands-on exercises on Posit RStudio.

A Posit workspace was already created by the workshop organizers. We used Posit projects with 16GB RAM and 2 cores for the workshop, with Ubuntu 20.04 as the OS. Using these configurations, we created a template project that has all the raw data files uploaded along with the R packages needed for the workshop. On the student side, the intention is to make copies of this template so that each student has an RStudio environment with the raw data files and the packages pre-installed.

Upload raw data

Folders for uploading raw data were created using the RStudio terminal. Files were either uploaded from a local laptop/ storage1 location using the Upload feature in the bottom right pane of the RStudio window; or downloaded from genomedata.org using wget from the RStudio terminal.

mkdir data
mkdir outdir
mkdir outdir_single_cell_rna
mkdir package_installation

cd data
mkdir single_cell_rna
mkdir bulk_rna

Files in single_cell_rna

CellRanger outputs for reps1,3,5 (uploaded from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/counts_gex/sample_filtered_feature_bc_matrix.h5.zip)
BCR and TCR clonotypes (uploaded from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/clonotypes_b_posit.zip and /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/clonotypes_t_posit.zip)
MSigDB M8: cell type signature gene sets (downloaded GMT file from MSigDB website to laptop and then uploaded to single_cell_rna folder)
CONICSmat mm10 chr arms positions file (downloaded file from CONICSmat GitHub - chromosome_full_positions_mm10.txt to laptop and then uploaded to single_cell_rna folder)
VarTrix file with barcodes and tumor calls (uploaded from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/Tumor_Calls_per_Variants_for_CRI_Updated_Barcodes.tsv) -> might not need this so may remove.
VarTrix output files (uploaded all matrices and the barcodes files from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/vartrix_outputs_for_CRI.zip - uploaded to a cancer_cell_id folder in data/single_cell_rna/)
Mouse variants VCF file (uploaded file from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/exome/output_updated/final_basic_filtered_annotated.vcf)

Posit requires all files to be zipped prior to uploading and automatically unzips the folder after the upload. After uploading the files, we made a folder for the cellranger outputs and moved the .h5 files there. The inferCNV reference files are downloaded using wget.

#organize cellranger outputs
cd /cloud/project/data/single_cell_rna
mkdir cellranger_outputs
mv *.h5 cellranger_outputs

#download inferCNV reference files and organize all reference files
mkdir reference_files
mv m8.all.v2023.2.Mm.symbols.gmt reference_files
mv Tumor_Calls_per_Variants_for_CRI.tsv reference_files
cd reference_files
wget https://data.broadinstitute.org/Trinity/CTAT/cnv/mouse_gencode.GRCm38.p6.vM25.basic.annotation.by_gene_id.infercnv_positions
wget https://data.broadinstitute.org/Trinity/CTAT/cnv/mouse_gencode.GRCm38.p6.vM25.basic.annotation.by_gene_name.infercnv_positions

#organize vartrix files
cd /cloud/project/data/single_cell_rna
mkdir cancer_cell_id 
cd cancer_cell_id
wget http://genomedata.org/cri-workshop/somatic_variants_exome/mcb6c-exome-somatic.variants.annotated.clean.tsv

Files in bulk_rna

Batch correction file (downloaded from genomedata - GSE48035_ILMN.Counts.SampleSubset.ProteinCodingGenes.tsv)
DE analysis files (downloaded from genomedata - ENSG_ID2Name.txt and gene_read_counts_table_all_final.tsv)

cd /cloud/project/data/bulk_rna
wget http://genomedata.org/rnaseq-tutorial/batch_correction/GSE48035_ILMN.Counts.SampleSubset.ProteinCodingGenes.tsv
wget http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/ENSG_ID2Name.txt
wget http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/gene_read_counts_table_all_final.tsv

Back-up files

Created folder in outdir/single_cell_rna called backup_files. Ran through QA/QC assessment and celltyping modules and added preprocessed_object.rds Seurat object from there to backup_files.

Installing packages

All package installations are from CRAN, Bioconductor, or GitHub, except for CytoTRACE, which was downloaded to the package_installation folder and then installed using devtools.

#Download CytoTRACE tar.gz file
download.file("https://cytotrace.stanford.edu/CytoTRACE_0.3.3.tar.gz", destfile = "package_installation/CytoTRACE_0.3.3.tar.gz")

# Installing package installers
install.packages("devtools")
install.packages("BiocManager")

# Bulk RNA seq libraries
BiocManager::install("genefilter")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("data.table")
BiocManager::install("AnnotationDbi")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("GO.db")
BiocManager::install("gage")
BiocManager::install("sva")
install.packages("gridExtra")
BiocManager::install("edgeR")
install.packages("UpSetR")
BiocManager::install("DESeq2")
install.packages("gtable")
BiocManager::install("apeglm")

# Intro to R packages
install.packages("tidyr")
install.packages("stringr")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyverse")
install.packages("MASS")
install.packages("ggpubr")

# Single-cell RNA seq libraries
BiocManager::install("sva") #need this for cytotrace
devtools::install_local("package_installation/CytoTRACE_0.3.3.tar.gz")
install.packages("Seurat")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("Matrix")
install.packages("hdf5r")
install.packages("bench") # to mark time
install.packages("viridis")
install.packages("R.utils")
remotes::install_github("satijalab/seurat-wrappers")
BiocManager::install("celldex")
BiocManager::install("SingleR")
devtools::install_github("immunogenomics/presto")
BiocManager::install("EnhancedVolcano")
BiocManager::install("clusterProfiler")
BiocManager::install("org.Mm.eg.db")
install.packages("msigdbr")
BiocManager::install("scRepertoire")
BiocManager::install("BiocGenerics")
BiocManager::install("DelayedArray")
BiocManager::install("DelayedMatrixStats")
BiocManager::install("limma")
BiocManager::install("lme4")
BiocManager::install("S4Vectors")
BiocManager::install("SingleCellExperiment")
BiocManager::install("SummarizedExperiment")
BiocManager::install("batchelor")
BiocManager::install("HDF5Array")
BiocManager::install("terra")
BiocManager::install("ggrastr")
devtools::install_github("cole-trapnell-lab/monocle3")
install.packages("beanplot")
install.packages("mixtools")
install.packages("pheatmap")
install.packages("zoo")
install.packages("squash")
install.packages("showtext")
BiocManager::install("biomaRt")
BiocManager::install("scran")
devtools::install_github("diazlab/CONICS/CONICSmat", dep = FALSE)
install.packages("gprofiler2")
devtools::install_github(repo = "ncborcherding/scRepertoire")

GCP Setup

0009-09-02T00:00:00+00:00

UNDER DEVELOPMENT

Google Cloud Platform setup for use in workshop

This tutorial explains how a Google Cloud instance can be configured from scratch for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers who wish to conduct their hands-on exercises on Google Cloud Platform (GCP).

Create a Google Cloud account

You will need a Google account (personal or institutional)
Use the above email account to log into the Google Cloud Console: https://console.cloud.google.com/. Note: Any GCP account needs to be linked to an actual person/credit card account or institutional billing account.
Create a Google Cloud Project connected to a billing source
Optional - Set up an IAM account. Details to be resolved…
Request limit increases. You need to be able to spin up at least one instance for every student and TA/instructor. To find current limits and request increases in the console, go to: IAM & Admin -> Quotas.
In the GCP console: Go to Compute Engine -> VM instances.

Start with existing base image

Create Instance
Give the Instance a Name (e.g. rnabio-course-2023)
Select Machine Type (e.g. E2 Series: e2-standard-2)
Change the Boot disk to Ubuntu -> Ubuntu 20.04 LTS (x86/64). Change size to 250 GB.
Under Firewall, select these options: Allow HTTP traffic and Allow HTTPS traffic
Hit the Create button

Install the Google Cloud commandline interface following instructions here: https://cloud.google.com/sdk/docs/install
Use the following command and follow instructions to authenticate your user: gcloud auth login
Set the project to the google billing project created above as follows: gcloud config set project $project_name
Check the authentication configuration as follows: gcloud config list
Log into the instance using the instance name chosen above as follows:

gcloud compute ssh rnabio-course-2023

Set up the ubuntu user:

Logging into a Google VM running Ubuntu is a bit different from AWS. By default you will log in with your Google user name (or is it your username from the host machine you log in from?) instead of using the “ubuntu” user.

Set password for the ubuntu user (and make note of this somewhere safe). Then change users to the “ubuntu” user before proceeding with the rest of this setup. Note that later if you login as another sudo user, if you need to, you should be able to reset the password associated with the ubuntu user.

whoami
sudo passwd ubuntu
su ubuntu
cd ~

Perform basic linux configuration

To allow installation of bioinformatics tools some basic dependencies must be installed first.

sudo apt-get update
sudo apt-get upgrade
sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev libbz2-dev docker.io libpcre2-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libdbi-perl libdbd-mysql-perl libcurl4-openssl-dev
sudo ln -s /usr/include/jsoncpp/json/ /usr/include/json
sudo timedatectl set-timezone America/Chicago

Log out and log back in for changes to take effect.

exit
exit
gcloud compute ssh rnabio-course-2023
su ubuntu
cd ~

Add ubuntu user to docker group

sudo usermod -aG docker ubuntu

Then exit shell and log back into instance.

Install any desired informatics tools

NOTE: R in particular is a slow install.
NOTE: All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.

Paths to pre-installed tools can be added to the .bashrc file. It may also be convenient to add export RNA_HOME=~/workspace/rnaseq to the .bashrc file. See https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc.
NOTE: In some installations of R there is an executable called pager that clashes with the system pager. This causes man to fail. Check with a man ls and if the problem exists, add the following to .bashrc:

export MANPAGER=less

Install RNA-seq software

These install instructions should be identical to those found on https://github.com/griffithlab/rnaseq_tutorial/wiki/Installation except that each tool is installed in /home/ubuntu/bin/ and its install location is exported to the $PATH variable for easy access.
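
Since each tool below is added to PATH by hand, a quick check of which executables are actually found can catch a missed export. The following is a minimal sketch. The function name check_tools is hypothetical, and the tool list is only a sample of the executables installed in this section.

```shell
# Report whether each named executable can be found on the current PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool -> $(command -v "$tool")"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools samtools hisat2 stringtie gffcompare kallisto fastqc \
    || echo "Some tools are not on PATH yet."
```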

Create directory to install software to and setup path variables

mkdir ~/bin
cd ~/bin
WORKSPACE=/home/ubuntu/workspace
HOME=/home/ubuntu

Install SAMtools

cd ~/bin
wget https://github.com/samtools/samtools/releases/download/1.16.1/samtools-1.16.1.tar.bz2
bunzip2 samtools-1.16.1.tar.bz2
tar -xvf samtools-1.16.1.tar
cd samtools-1.16.1
make
./samtools
export PATH=/home/ubuntu/bin/samtools-1.16.1:$PATH

Install bam-readcount

cd ~/bin
export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.16.1
git clone https://github.com/genome/bam-readcount 
cd bam-readcount
mkdir build
cd build
cmake ..
make
export PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH

Install HISAT2

uname -m
cd ~/bin
curl -s https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download > hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
cd hisat2-2.2.1
./hisat2 -h
export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH

Install StringTie

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.1.6.tar.gz
tar -xzvf stringtie-2.1.6.tar.gz
cd stringtie-2.1.6
make release
export PATH=/home/ubuntu/bin/stringtie-2.1.6:$PATH

Install gffcompare

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.12.6.Linux_x86_64.tar.gz
cd gffcompare-0.12.6.Linux_x86_64/
./gffcompare
export PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH

Install htseq-count

sudo apt install python3-htseq

Make sure that OpenSSL is the correct version

TopHat will not install if the version of OpenSSL is too old.

To get version:

openssl version

If version is OpenSSL 1.1.1f, then it needs to be updated using the following steps.

cd ~/bin
wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz
tar -zxf openssl-1.1.1g.tar.gz && cd openssl-1.1.1g
./config
make
make test
sudo mv /usr/bin/openssl ~/tmp #in case install goes wrong
sudo make install
sudo ln -s /usr/local/bin/openssl /usr/bin/openssl
sudo ldconfig

Again, from the terminal issue the command:

openssl version

Your output should be as follows:

OpenSSL 1.1.1g  21 Apr 2020

Then create ~/.wgetrc file and add to it ca_certificate=/etc/ssl/certs/ca-certificates.crt using vim or nano.
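
If you prefer not to edit the file by hand, the same line can be appended from the shell without duplicating it on repeated runs. This is a sketch only: the helper name append_once is hypothetical, and the block demonstrates it on a temporary file rather than touching your real ~/.wgetrc.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrate on a throwaway file; calling it twice still leaves one line.
demo_file="$(mktemp)"
append_once "ca_certificate=/etc/ssl/certs/ca-certificates.crt" "$demo_file"
append_once "ca_certificate=/etc/ssl/certs/ca-certificates.crt" "$demo_file"
cat "$demo_file"
rm -f "$demo_file"
```

To apply it for real, call append_once with the same line and ~/.wgetrc as the file argument.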

Install TopHat

cd ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64/
./gtf_to_fasta
export PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH

Install kallisto

cd ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0/
./kallisto
export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH

Install FastQC

cd ~/bin
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
cd FastQC/
chmod 755 fastqc
./fastqc --help
export PATH=/home/ubuntu/bin/FastQC:$PATH

Install a particular version of numpy that hopefully works with all the dependencies that rely on it

cd ~/bin
pip install --force-reinstall -v "numpy==1.24.1"

Install MultiQC

cd ~/bin
export PATH=/home/ubuntu/.local/bin:$PATH
pip3 install multiqc
multiqc --help

Install Picard

cd ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar -O picard.jar
java -jar ~/bin/picard.jar

Install Flexbar

sudo apt install flexbar

Install Regtools

cd ~/bin
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools
export PATH=/home/ubuntu/bin/regtools/build:$PATH

Install RSeQC

pip3 install RSeQC
~/.local/bin/read_GC.py
export PATH=/home/ubuntu/.local/bin/:$PATH

Install bedops

cd ~/bin
mkdir bedops_linux_x86_64-v2.4.40
cd bedops_linux_x86_64-v2.4.40
wget -c https://github.com/bedops/bedops/releases/download/v2.4.40/bedops_linux_x86_64-v2.4.40.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.40.tar.bz2
./bin/bedops
export PATH=/home/ubuntu/bin/bedops_linux_x86_64-v2.4.40/bin:$PATH

Install gtfToGenePred

cd ~/bin
mkdir gtfToGenePred
cd gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred
export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH

Install genePredToBed

cd ~/bin
mkdir genePredToBed
cd genePredToBed
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod a+x genePredToBed
./genePredToBed
export PATH=/home/ubuntu/bin/genePredToBed:$PATH

Install Cell Ranger

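You must register with 10x Genomics to get a download link.

cd ~/bin
wget -O cellranger-7.1.0.tar.gz "download_link" # replace download_link with the URL you received after registering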
tar -xzvf cellranger-7.1.0.tar.gz
cd cellranger-7.1.0
./bin/cellranger
export PATH=/home/ubuntu/bin/cellranger-7.1.0:$PATH

Install TABIX

sudo apt-get install tabix

Install BWA

cd ~/bin
git clone https://github.com/lh3/bwa.git
cd bwa
make
export PATH=/home/ubuntu/bin/bwa:$PATH

Install bedtools

cd ~/bin
wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz
tar -zxvf bedtools-2.30.0.tar.gz
cd bedtools2
make
export PATH=/home/ubuntu/bin/bedtools2/bin:$PATH

Install BCFtools

cd ~/bin
wget https://github.com/samtools/bcftools/releases/download/1.16/bcftools-1.16.tar.bz2
bunzip2 bcftools-1.16.tar.bz2
tar -xvf bcftools-1.16.tar
cd bcftools-1.16
make
./bcftools
export PATH=/home/ubuntu/bin/bcftools-1.16:$PATH

Install htslib

cd ~/bin
wget https://github.com/samtools/htslib/releases/download/1.16/htslib-1.16.tar.bz2
bunzip2 htslib-1.16.tar.bz2
tar -xvf htslib-1.16.tar
cd htslib-1.16
make
./htsfile
export PATH=/home/ubuntu/bin/htslib-1.16:$PATH

Install peddy

cd ~/bin
git clone https://github.com/brentp/peddy
cd peddy
pip install -r requirements.txt
pip install --editable .
python -m peddy -h

Install slivar

cd ~/bin
wget https://github.com/brentp/slivar/releases/download/v0.2.7/slivar
chmod +x ./slivar
./slivar
export PATH=/home/ubuntu/bin:$PATH

Install STRling

cd ~/bin
wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.1/strling
chmod +x ./strling
./strling -h
export PATH=/home/ubuntu/bin:$PATH

Install freebayes

sudo apt install freebayes

Install vcflib

sudo apt install libvcflib-tools libvcflib-dev

Install Anaconda

cd ~/bin
wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
bash Anaconda3-2022.10-Linux-x86_64.sh

Press Enter to review the license agreement. Then press and hold Enter to scroll.

Enter “yes” to agree to the license agreement.

Saved the installation to /home/ubuntu/bin/anaconda3 and chose yes to initializing Anaconda3.

Install VEP

Describes dependencies for VEP 108, used in this course for variant annotation. When running the VEP installer follow the prompts specified:

Do you want to install any cache files (y/n)? n [ENTER] (select number for homo_sapiens_vep_108_GRCh38.tar.gz) [ENTER]
Do you want to install any FASTA files (y/n)? n [ENTER] (select number for homo_sapiens) [ENTER]
Do you want to install any plugins (y/n)? n [ENTER]

The VEP cache and FASTA files are very large ~25G or more. Probably do NOT want to install these as part of an image, but it should be possible to rerun this tool and install them later as needed.

mkdir ~/workspace
cd ~/bin
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl --CACHEDIR ~/workspace/ensembl-vep/
export PATH=/home/ubuntu/bin/ensembl-vep:$PATH

Set up Jupyter to render in web browser

Followed this website

First, we need to add Jupyter to the system’s path. You can check whether it is already on the path by running which jupyter; if no path is returned, you need to add it. To add Jupyter functionality to your terminal, add the following line to your .bashrc file:

export PATH=/home/ubuntu/anaconda3/bin:$PATH

Then you need to source the .bashrc for changes to take effect.

source ~/.bashrc

We then need to create our Jupyter configuration file. In order to create that file, you need to run:

jupyter notebook --generate-config

After creating your configuration file, you will need to generate a password for your Jupyter Notebook using ipython:

Enter the IPython command line:

ipython

Now follow these steps to generate your password:

from IPython.lib import passwd

passwd()

exit

You will be prompted to enter and re-enter your password. IPython will then generate a hash output, COPY THIS AND SAVE IT FOR LATER. We will need this for our configuration file.

Next go into your jupyter config file:

cd ~/.jupyter/

vim jupyter_notebook_config.py

Note: You may need first to run exit in order to exit IPython otherwise the vim command may not be recognized by the terminal.

And add the following code:

conf = get_config()

conf.NotebookApp.ip = '0.0.0.0'
conf.NotebookApp.password = u'YOUR PASSWORD HASH'
conf.NotebookApp.port = 8888
# Note: the code above should be placed at the beginning of the file.

We then need to create a directory for your notebooks. In order to make a folder to store all of your Jupyter Notebooks simply run:

cd ~/workspace
mkdir Jupyter_Notebooks

You can call this folder anything; in this example we call it Jupyter_Notebooks.

After the previous step, you should be ready to run your notebook and access your GCP instance. To run your notebook, simply run the command:

jupyter notebook

From there you should be able to access your server by going to:

http://(your GCP public IP):8888/

Note that in order for this to work, you need to have allowed external access to this machine over port 8888. In the GCP firewall settings, this means adding a new firewall rule:

Name: default-allow-jupyter
Direction of traffic: Ingress
Action on match: Allow
Targets: All instances in the network
Source IP ranges: 0.0.0.0/0
Specified protocols and ports: tcp:8888
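
The same rule can probably be created from the command line instead of the console. The sketch below uses gcloud compute firewall-rules create with the values listed above. It assumes the default network and an already configured gcloud project. The run wrapper is a hypothetical helper that only prints the command unless DRY_RUN=0 is set, so nothing is changed by accident. Note that a source range of 0.0.0.0/0 exposes the port to any address.

```shell
# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Ingress rule allowing TCP port 8888 from any source address.
# With no target tags given, the rule applies to all instances in the network.
run gcloud compute firewall-rules create default-allow-jupyter \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8888 \
    --source-ranges=0.0.0.0/0
```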

Install R

sudo apt-get -y remove r-base-core
sudo apt-get -y remove r-base
sudo apt install dirmngr gnupg apt-transport-https ca-certificates software-properties-common
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
sudo apt install r-base
R --version

#make R library location accessible
sudo chown -R ubuntu:ubuntu /usr/local/lib/R/
chmod -R 775 /usr/local/lib/R

R Libraries

For this tutorial we require:

R
install.packages(c("devtools","dplyr","gplots","ggplot2","Seurat","sctransform","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR"),repos="http://cran.us.r-project.org")
quit(save="no")

Bioconductor libraries

For this tutorial we require:

R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva","gage","org.Hs.eg.db"))
quit(save="no")

Install Sleuth

R
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
quit(save="no")

Path setup

Add the following lines to the .bashrc using vim to ensure that all installed tools are in the ubuntu user's path:

PATH=/home/ubuntu/bin/samtools-1.16.1:$PATH
PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH
PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH
PATH=/home/ubuntu/bin/stringtie-2.1.6:$PATH
PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH
PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH
PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH
PATH=/home/ubuntu/bin/FastQC:$PATH
PATH=/home/ubuntu/.local/bin:$PATH
PATH=/home/ubuntu/bin/regtools/build:$PATH
PATH=/home/ubuntu/bin/bedops_linux_x86_64-v2.4.40/bin:$PATH
PATH=/home/ubuntu/bin/gtfToGenePred:$PATH
PATH=/home/ubuntu/bin/genePredToBed:$PATH
PATH=/home/ubuntu/bin/cellranger-7.1.0:$PATH
PATH=/home/ubuntu/bin/bwa:$PATH
PATH=/home/ubuntu/bin/bedtools2/bin:$PATH
PATH=/home/ubuntu/bin/bcftools-1.14:$PATH
PATH=/home/ubuntu/bin/htslib-1.14:$PATH
PATH=/home/ubuntu/bin/ensembl-vep:$PATH
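
Because .bashrc is re-read by every new shell (and every time it is sourced), plain PATH= lines like those above add the same directories again and again. A minimal sketch of one way to avoid that is shown below; path_prepend is a hypothetical helper, not part of the course .bashrc, and only two of the directories are shown.

```shell
# Hypothetical helper (not part of the course .bashrc): prepend a directory
# to PATH only when it is not already listed, so re-sourcing .bashrc does
# not keep growing PATH.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present: do nothing
        *) PATH="$1:$PATH" ;;        # otherwise put it at the front
    esac
}

path_prepend /home/ubuntu/bin/samtools-1.16.1
path_prepend /home/ubuntu/bin/hisat2-2.2.1
export PATH
```

Calling path_prepend a second time with the same directory leaves PATH unchanged.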

For the 2021 version of the course, rather than exporting each tool’s individual path, I moved all of the subdirs to ~/src and copied all of the binaries from there to ~/bin so that PATH is less complex.

Set up Apache web server

We will start an apache2 service and serve the contents of the students’ home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by URL, etc. Note that when launching instances a security group will have to be selected/modified that allows http access via port 80.

Edit config to allow files to be served from outside /usr/share and /var/www

sudo vim /etc/apache2/apache2.conf

Add the following content to apache2.conf

<Directory /home/ubuntu/workspace/>
       Options Indexes FollowSymLinks
       AllowOverride None
       Require all granted
</Directory>

Edit vhost file

sudo vim /etc/apache2/sites-available/000-default.conf

Change document root in 000-default.conf

DocumentRoot /home/ubuntu/workspace

Restart apache

sudo service apache2 restart

Test by going to your instance’s public IP address in your browser.

Create a public Google cloud image using the GCP console

Under Compute Engine -> Virtual Machines -> VM Instances. Stop the instance.
Under Compute Engine -> Storage -> Images. Create Image.
Provide a name for the image (e.g. rnabio-course-2023-v1).
Select Source -> Disk
Under ‘Source disk’ -> Choose the name of the stopped instance (e.g. rnabio-course-2023)
Select Location -> Multi-regional
Select location -> us (multiple regions in the United States)
Leave Family blank, but add a description.
Encryption -> Google-managed encryption key.

To make the image fully public execute the following Google SDK command:

gcloud compute images add-iam-policy-binding rnabio-course-2023-v2 --member='allAuthenticatedUsers' --role='roles/compute.imageUser'

To list the image from the command line:

gcloud compute images list --filter="name=rnabio-course-2023-v2"

Current Public Google Images

rnabio-course-2023-v2

Launch student instance using this image

To start a new VM with the public image above, one can use the GCP console as was done above to create a new VM with vanilla ubuntu, except this time selecting the pre-configured image with all tools already installed.

We have been unable to get this to work using the Console. It seems listing custom public images is not working there… ?

From the command line you can launch an instance as follows (you should probably personalize the malachi-course-2023 name used in two places of this command):

gcloud compute instances create malachi-course-2023 --zone=us-central1-a --machine-type=e2-standard-4 --network-interface=network-tier=PREMIUM,subnet=default --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,image=projects/griffith-lab/global/images/rnabio-course-2023-v2,mode=rw,size=250,type=pd-balanced,device-name=malachi-course-2023 

You will want to do everything on this VM as the “ubuntu” user. First set the password for that user and then change to it.

gcloud compute ssh ubuntu@malachi-course-2023

Test environment

bwa mem
env

AWS Setup

0009-09-01T00:00:00+00:00

Amazon AWS/AMI setup for use in workshop

This tutorial explains how Amazon cloud instances were configured for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers that wish to conduct their hands on exercises on Amazon AWS.

Create AWS account

A helpful tutorial can be found here

Create a new gmail account to use for the course
Use the above email account to set up a new AWS/Amazon user account. Note: Any AWS account needs to be linked to an actual person and credit card account.
Optional - Set up an IAM account. Give this account full EC2 but no other permissions. This provides an account that can be shared with other instructors but does not have access to billing and other root account privileges.
Request limit increase for limit types you will be using. You need to be able to spin up at least one instance of the desired type for every student and TA/instructor. See: http://aws.amazon.com/about-aws/whats-new/2014/06/19/amazon-ec2-service-limits-report-now-available/. Note: You need to request an increase for each instance type and region you might use.
Sign into AWS Management Console: http://aws.amazon.com/console/
Go to EC2 services

Start with existing community AMI

Launch a fresh Ubuntu Image (Ubuntu Server 22.04 LTS at the time of writing this). Choose an instance type of m6a.xlarge. Increase root volume (e.g., 60GB) and add a second volume (e.g., 500GB). Choose appropriate security group (for 2023 course, choose an existing security group launch-wizard-14). Review and Launch. If necessary, create a new key pair, name and save somewhere safe. Select ‘View Instances’. Take note of public IP address of newly launched instance.
Change permissions on downloaded key pair with chmod 400 [instructor-key].pem
Login to instance with ubuntu user:

ssh -i [instructor-key].pem ubuntu@[public.ip.address]

Note: for 2023 course, choose security group launch-wizard-14

Perform basic linux configuration

To allow installation of bioinformatics tools some basic dependencies must be installed first.

sudo apt-get update
sudo apt-get upgrade
sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev libbz2-dev docker.io libpcre2-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libdbi-perl libdbd-mysql-perl
sudo ln -s /usr/include/jsoncpp/json/ /usr/include/json
sudo timedatectl set-timezone America/New_York

Log out and log back in for the changes to take effect.

Add ubuntu user to docker group

sudo usermod -aG docker ubuntu

Then exit shell and log back into instance.

Set up additional storage for workspace

We first need to setup the additional storage volume that we added when we created the instance.

# Create mountpoint for additional storage volume
cd /
sudo mkdir workspace

# Mount ephemeral storage
cd
sudo mkfs -t ext4 /dev/nvme1n1
sudo mount /dev/nvme1n1 /workspace

In order to make the workspace volume persistent, we need to edit the /etc/fstab file. AWS provides instructions for how to do this here.

# Make ephemeral storage mounts persistent
# See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html for guidance on setting up fstab records for AWS

# get UUID from sudo lsblk -f
UUID=$(sudo lsblk -f | grep nvme1n1 | awk '{print $4}')
#if want to double check, can do 'echo $UUID' to see the UUID. 

#then add that UUID to /etc/fstab
echo -e "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0\nUUID=$UUID /workspace ext4 defaults,nofail 0 2" | sudo tee /etc/fstab
#'less /etc/fstab' , to see if the new line has been added

# Change permissions on required drives
sudo chown -R ubuntu:ubuntu /workspace

# Create symlink to the added volume in your home directory
cd ~
ln -s /workspace workspace
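
Note that the tee command above rewrites /etc/fstab completely rather than adding a line to it. As an alternative sketch (not the method used for the course image), the helper below appends a mount entry only when the mount point is not already listed. add_fstab_entry is a hypothetical name, the UUID is a made-up placeholder, and the demonstration runs on a scratch file rather than on /etc/fstab.

```shell
# Hypothetical helper: append a mount entry to an fstab-style file unless an
# entry for that mount point already exists. Not the course's own method,
# which rewrites /etc/fstab with tee.
add_fstab_entry() {
    fstab_file=$1; uuid=$2; mount_point=$3
    # Field 2 of each fstab record is the mount point.
    if awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$fstab_file"; then
        echo "entry for $mount_point already present"
    else
        echo "UUID=$uuid $mount_point ext4 defaults,nofail 0 2" >> "$fstab_file"
    fi
}

# Demonstration on a scratch file; the UUID here is a made-up placeholder.
demo_fstab=$(mktemp)
echo "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0" > "$demo_fstab"
add_fstab_entry "$demo_fstab" "0000-PLACEHOLDER" /workspace
add_fstab_entry "$demo_fstab" "0000-PLACEHOLDER" /workspace   # second call is a no-op
cat "$demo_fstab"
```

On a real instance the target file would be /etc/fstab and the function would need to run with root privileges; running sudo mount -a afterwards is a common way to check that the new record parses.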

Install any desired informatics tools

NOTE: R in particular is a slow install.
NOTE:

- All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.

Paths to pre-installed tools can be added to the .bashrc file. It may also be convenient to add export RNA_HOME=~/workspace/rnaseq to the .bashrc file. See https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc and http://genomedata.org/rnaseq-tutorial/bashrc_copy.
NOTE: (This didn’t happen during installation for the year 2023, but) In some installations of R there is an executable called pager that clashes with the system pager. This causes man to fail. Check with a man ls and if the problem exists, add the following to .bashrc:

export MANPAGER=less

Install RNA-seq software

These install instructions should be identical to those found on https://github.com/griffithlab/rnaseq_tutorial/wiki/Installation except that each tool is installed in /home/ubuntu/bin/ and its install location is exported to the $PATH variable for easy access.

Create directory to install software to and setup path variables

mkdir ~/bin
cd bin
WORKSPACE=/home/ubuntu/workspace
HOME=/home/ubuntu

Install SAMtools

cd ~/bin
wget https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2
bunzip2 samtools-1.18.tar.bz2
tar -xvf samtools-1.18.tar
cd samtools-1.18
make
./samtools
#add the following line to .bashrc
export PATH=/home/ubuntu/bin/samtools-1.18:$PATH
export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.18
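
Many of the install steps in this section end with "add the following line to .bashrc". A minimal sketch of doing that from the command line without creating duplicate lines is shown below. append_once is a hypothetical helper, and the demonstration writes to a scratch file instead of the real ~/.bashrc.

```shell
# Hypothetical helper: append a line to a file only if that exact line is
# not already there, so repeated installs do not duplicate .bashrc entries.
append_once() {
    target_file=$1; line=$2
    touch "$target_file"
    # -x: match the whole line, -F: treat the text literally (no regex)
    grep -qxF -- "$line" "$target_file" || printf '%s\n' "$line" >> "$target_file"
}

# Demonstration on a scratch file standing in for ~/.bashrc.
demo_rc=$(mktemp)
append_once "$demo_rc" 'export PATH=/home/ubuntu/bin/samtools-1.18:$PATH'
append_once "$demo_rc" 'export PATH=/home/ubuntu/bin/samtools-1.18:$PATH'   # ignored: already present
append_once "$demo_rc" 'export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.18'
cat "$demo_rc"
```

The single quotes keep $PATH literal, so it is expanded when .bashrc is read rather than when the line is written.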

Install bam-readcount

cd ~/bin
git clone https://github.com/genome/bam-readcount 
cd bam-readcount
mkdir build
cd build
cmake ..
make
export PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH

Install HISAT2

uname -m
cd ~/bin
curl -s https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download > hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
cd hisat2-2.2.1
./hisat2 -h
export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH

Install StringTie

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.tar.gz
tar -xzvf stringtie-2.2.1.tar.gz
cd stringtie-2.2.1
make release
export PATH=/home/ubuntu/bin/stringtie-2.2.1:$PATH

Install gffcompare

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.12.6.Linux_x86_64.tar.gz
cd gffcompare-0.12.6.Linux_x86_64/
./gffcompare
export PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH

Install htseq-count

sudo apt install python3-htseq
# to check version,type : htseq-count --version

Make sure that OpenSSL is on correct version

TopHat will not install if the version of OpenSSL is too old.

To get version:

openssl version

If version is OpenSSL 1.1.1f, then it needs to be updated using the following steps.

cd ~/bin
wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz
tar -zxf openssl-1.1.1g.tar.gz && cd openssl-1.1.1g
./config
make
make test
sudo mv /usr/bin/openssl ~/tmp #in case install goes wrong
sudo make install
sudo ln -s /usr/local/bin/openssl /usr/bin/openssl
sudo ldconfig

Again, from the terminal issue the command:

openssl version

Your output should be as follows:

OpenSSL 1.1.1g  21 Apr 2020

Then create ~/.wgetrc file and add to it ca_certificate=/etc/ssl/certs/ca-certificates.crt using vim or nano.

Install TopHat

cd ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64
./gtf_to_fasta
export PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH

Install kallisto

Note: A couple of arguments are only supported in legacy versions of kallisto (versions before 0.50.0). Also, how_are_we_stranded_here uses kallisto == 0.44.x. Thus, the installation steps below are for one of the legacy versions. If you run into problems, consider using a more recent version.

cd ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0
./kallisto
export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH

Install FastQC

cd ~/bin
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
cd FastQC
chmod 755 fastqc
./fastqc --help
export PATH=/home/ubuntu/bin/FastQC:$PATH

Install MultiQC

cd ~/bin
pip3 install multiqc
export PATH=/home/ubuntu/.local/bin:$PATH
multiqc --help

Install Picard

cd ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar -O picard.jar
java -jar ~/bin/picard.jar

export PICARD=/home/ubuntu/bin/picard.jar

Install Flexbar

sudo apt install flexbar

Install Regtools

cd ~/bin
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools
export PATH=/home/ubuntu/bin/regtools/build:$PATH

Install RSeQC

pip3 install RSeQC
~/.local/bin/read_GC.py
export PATH=~/.local/bin/:$PATH

Install bedops

cd ~/bin
mkdir bedops_linux_x86_64-v2.4.41
cd bedops_linux_x86_64-v2.4.41
wget -c https://github.com/bedops/bedops/releases/download/v2.4.41/bedops_linux_x86_64-v2.4.41.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.41.tar.bz2
./bin/bedops

export PATH=~/bin/bedops_linux_x86_64-v2.4.41/bin:$PATH

Install gtfToGenePred

cd ~/bin
mkdir gtfToGenePred
cd gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred
export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH

Install genePredToBed

cd ~/bin
mkdir genePredtoBed
cd genePredtoBed
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod a+x genePredToBed
./genePredToBed
export PATH=/home/ubuntu/bin/genePredtoBed:$PATH 
#note: the directory name 'genePredtoBed' has a lowercase 't'
#genePredToBed 

Install how_are_we_stranded_here

pip3 install git+https://github.com/kcotto/how_are_we_stranded_here.git
check_strandedness

Install Cell Ranger

Must register to get download link

cd ~/bin
wget `download_link`
tar -xzvf cellranger-7.2.0.tar.gz
export PATH=/home/ubuntu/bin/cellranger-7.2.0:$PATH

Install TABIX

sudo apt-get install tabix

Install BWA

cd ~/bin
git clone https://github.com/lh3/bwa.git
cd bwa
make
export PATH=/home/ubuntu/bin/bwa:$PATH
#bwa mem #to call bwa

Install bedtools

cd ~/bin
wget https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools-2.31.0.tar.gz
tar -zxvf bedtools-2.31.0.tar.gz
cd bedtools2
make
export PATH=/home/ubuntu/bin/bedtools2/bin:$PATH

Install BCFtools

cd ~/bin
wget https://github.com/samtools/bcftools/releases/download/1.18/bcftools-1.18.tar.bz2
bunzip2 bcftools-1.18.tar.bz2
tar -xvf bcftools-1.18.tar
cd bcftools-1.18
make
./bcftools
export PATH=/home/ubuntu/bin/bcftools-1.18:$PATH

Install htslib

cd ~/bin
wget https://github.com/samtools/htslib/releases/download/1.18/htslib-1.18.tar.bz2
bunzip2 htslib-1.18.tar.bz2
tar -xvf htslib-1.18.tar
cd htslib-1.18
make
sudo make install
#htsfile --help
export PATH=/home/ubuntu/bin/htslib-1.18:$PATH

Install peddy

cd ~/bin
git clone https://github.com/brentp/peddy
cd peddy
pip install -r requirements.txt
pip install --editable .

Install slivar

cd ~/bin
wget https://github.com/brentp/slivar/releases/download/v0.3.0/slivar
chmod +x ./slivar

Install STRling

cd ~/bin
wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.2/strling
chmod +x ./strling

Install freebayes

sudo apt install freebayes

Install vcflib

sudo apt install libvcflib-tools libvcflib-dev

Install Anaconda

cd ~/bin
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh 
bash Anaconda3-2023.09-0-Linux-x86_64.sh

Press Enter to review the license agreement. Then press and hold Enter to scroll.

Enter “yes” to agree to the license agreement.

Saved the installation to /home/ubuntu/bin/anaconda3 and chose yes to initializing Anaconda3.

Add in bashrc:

export PATH=/home/ubuntu/bin/anaconda3/bin:$PATH

To see location of conda executable: which conda

Install VEP

Note: Install VEP in workspace because cache file for that takes a lot of space (~25G).

This section describes the installation of VEP 110, which is used in this course for variant annotation. When running the VEP installer, answer the prompts as follows:

Do you want to install any cache files (y/n)? n (If you do want to install the cache file, choose ‘y’ [ENTER] (select number for homo_sapiens_vep_110_GRCh38.tar.gz) [ENTER])
Do you want to install any FASTA files (y/n)? y [ENTER] (select number for homo_sapiens) [ENTER]
Do you want to install any plugins (y/n)? n [ENTER]

cd ~/workspace
sudo git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
sudo perl -MCPAN -e'install "LWP::Simple"'
sudo perl INSTALL.pl --CACHEDIR ~/workspace/ensembl-vep/
export PATH=/home/ubuntu/workspace/ensembl-vep:$PATH
#vep --help

Set up Jupyter to render in web browser

Followed this website and this website. Note: The old jupyter notebook was split into jupyter-server and nbclassic. The steps to set up jupyter on ec2 in the first link have therefore been adapted based on suggestions in the second link to accommodate this migration.

export PATH=/home/ubuntu/bin/anaconda3/bin:$PATH

Then you need to source the .bashrc for changes to take effect.

source ~/.bashrc

We then need to create our Jupyter configuration file. In order to create that file, you need to run:

jupyter notebook --generate-config

( ~~~~~~~ Optional: After creating your configuration file, you will need to generate a password for your Jupyter Notebook using ipython:

Enter the IPython command line:

ipython

Now follow these steps to generate your password:

from notebook.auth import passwd

passwd()

You will be prompted to enter and re-enter your password. IPython will then generate a hash output, COPY THIS AND SAVE IT FOR LATER. We will need this for our configuration file.

Run exit in order to exit IPython. ~~~~~~ )

Next go into your jupyter config file (/home/ubuntu/.jupyter/jupyter_server_config.py) :

cd .jupyter

vim jupyter_notebook_config.py

And add the following code at the beginning of the document:

c = get_config() #add this line if it's not already in jupyter_notebook_config.py

c.ServerApp.ip = '0.0.0.0'
#c.ServerApp.password = u'YOUR PASSWORD HASH' #uncomment this line if decide to use password
c.ServerApp.port = 8888

(~~~~~~ Optional:

We then need to create a directory for your notebooks. In order to make a folder to store all of your Jupyter Notebooks simply run:

mkdir Notebooks

You can call this folder anything, for this example we call it Notebooks ~~~~~~ )

After the previous step, you should be ready to run your notebook and access your EC2 server. To run your Notebook simply run the command:

jupyter nbclassic

From there you should be able to access your server by going to:

http://(your AWS public dns):8888/ or http://(your AWS public dns):8888/(tree?token=... - in the message generated while running 'jupyter nbclassic')

(Note: if you run into problems accessing the server, double check whether you are using http or https. If you didn’t add the https port in the security group configuration step when creating the instance, you won’t be able to access the server with https.)

Install R

Follow this guide website

cd ~/bin

wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo gpg --dearmor -o /usr/share/keyrings/r-project.gpg

echo "deb [signed-by=/usr/share/keyrings/r-project.gpg] https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/" | sudo tee -a /etc/apt/sources.list.d/r-project.list

sudo apt update

sudo apt install --no-install-recommends r-base

Note, linking the R-patched bin directory into your PATH may cause weird things to happen, such as man pages or git log to not display. This can be circumvented by directly linking the R* executables (R, RScript, RCmd, etc.) into a PATH directory.

R Libraries

For this tutorial we require:

R
install.packages(c("devtools","dplyr","gplots","ggplot2","sctransform","Seurat","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR"),repos="http://cran.us.r-project.org")
quit(save="no")

Note: if asked whether you want to install into a personal library, type ‘yes’.

Bioconductor libraries

For this tutorial we require:

R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva","gage","org.Hs.eg.db"))
quit(save="no")

Install Sleuth

R
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
quit(save="no")

Install software for germline analyses

gatk
minimap2
NanoPlot
Varscan

Install gatk

(Note: for the cshl2023 version of the course, install gatk 4.2.1.0 rather than a more recent version, since this version works with the current Java version, Java 11.)

cd ~/bin
wget https://github.com/broadinstitute/gatk/releases/download/4.2.1.0/gatk-4.2.1.0.zip
unzip gatk-4.2.1.0.zip

export PATH=/home/ubuntu/bin/gatk-4.2.1.0:$PATH #add to .bashrc
gatk --help
gatk --list

Install minimap2

cd ~/bin
curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf -
./minimap2-2.26_x64-linux/minimap2

export PATH=/home/ubuntu/bin/minimap2-2.26_x64-linux:$PATH #add to .bashrc
minimap2 --help

Install NanoPlot

pip install NanoPlot
#which NanoPlot
#NanoPlot -h

Install Varscan

cd ~/bin
curl -L -k -o VarScan.v2.4.2.jar https://github.com/dkoboldt/varscan/releases/download/2.4.2/VarScan.v2.4.2.jar
java -jar ~/bin/VarScan.v2.4.2.jar

Install packages for single-cell ATAC-seq lab

To prevent dependency conflicts, install packages for this lab in a conda environment.

Packages:

conda create --name snapatac2_env python=3.11
source activate snapatac2_env
conda activate snapatac2_env

pip install snapatac2
#pip show snapatac2 
pip install scanpy
pip install MACS2
pip install --user magic-impute
pip install deeptools 

conda deactivate

To use the conda environment in jupyter nbclassic, a few extra setup steps are needed:

#Step 1: Activate the Conda Environment of interest:
conda activate snapatac2_env
#Step 2: Install Ipykernel: 
conda install ipykernel
#Step 3: Create a Jupyter Kernel for the environment
python -m ipykernel install --user --name=snapatac2_env_kernel

Then run jupyter notebook as usual:

jupyter nbclassic

Access the server by entering http://(your AWS public dns):8888/ in the browser. We can either create a notebook using the desired environment kernel, or create a notebook using the default ipykernel and change the kernel within the notebook itself.

Install ATACseqQC

R
#if (!require("BiocManager", quietly = TRUE))
    #install.packages("BiocManager")
BiocManager::install("ATACseqQC")
quit(save="no")

Install packages for single-cell RNAseq lab

To prevent dependency conflicts, install packages for this lab in a conda environment.

Packages:

conda create --name scRNAseq_env python=3.11
source activate scRNAseq_env
conda activate scRNAseq_env
pip install 'scanpy[leiden]'
pip install gtfparse==1.2.0
pip install scrublet
pip install fast_matrix_market
pip install harmony-pytorch

conda install ipykernel
python -m ipykernel install --user --name=scRNAseq_env_kernel
conda deactivate

Path setup

For the 2021 version of the course, rather than exporting each tool’s individual path, I moved all of the subdirs to ~/src and copied all of the binaries from there to ~/bin so that PATH is less complex.

Set up Apache web server

Edit config to allow files to be served from outside /usr/share and /var/www

sudo vim /etc/apache2/apache2.conf

Add the following content to apache2.conf

<Directory /workspace/>
       Options Indexes FollowSymLinks
       AllowOverride None
       Require all granted
</Directory>

Edit vhost file

sudo vim /etc/apache2/sites-available/000-default.conf

Change document root in 000-default.conf to ‘/workspace’

DocumentRoot /workspace

Restart apache

sudo service apache2 restart

To check if the server works, type in browser of choice: http://[public ip address of ec2 instance]. You should see the content within /workspace .

Save a public AMI

Finally, save the instance as a new AMI by right clicking the instance and clicking on “Create Image”. Enter an appropriate name and description and then save. If desired, you may choose at this time to include the workspace snapshot in the AMI to avoid having to explicitly attach it later at launching of AMI instances. Change the permissions of the AMI to “public” if you would like it to be listed under the Community AMIs. Copy the AMI to any additional regions where you would like it to appear in Community AMI searches.

Current Public AMIs

cshl-seqtec-2022 (ami-09b613ae9751a96b1; N. Virginia)
cbw-rnabio-2023 (ami-09b3fd07d90812201; N. Virginia)
cshl-seqtec-2023 (ami-05d41e9b8c7eee2df; N. Virginia)

Create IAM account

From AWS Console select Services -> IAM. Go to Users, Create User, specify a user name, and Create. Download credentials to a safe location for later reference if needed. Select the new user and go to Security Credentials -> Manage Password -> ‘Assign a Custom Password’. Go to Groups -> Create a New Group, specify a group name and Next. Attach a policy to the group. In this case we give all EC2 privileges but no other AWS privileges by specifying “AmazonEC2FullAccess”. Hit Next, review and then Create Group. Select the Group -> Add Users to Group, select your new user to add it to the new group.

Launch student instance

Go to AWS console. Login. Select EC2.
Launch Instance, search for “cshl-seqtech-2021” in Community AMIs and Select.
Choose “m5.2xlarge” instance type.
Select one instance to launch (e.g., one per student and instructor), and select “Protect against accidental termination”
Make sure that you see two snapshots (e.g., the 32GB root volume and 80GB EBS volume you set up earlier)
Create a tag with Name=StudentName
Choose the existing security group called “SSH_HTTP”. Review and Launch.
Choose an existing key pair (cshl_2021_student.pem)
View instances and wait for them to finish initiating.
Find your instance in console and select it, then hit connect to get your public.ip.address.
Log in to the node: ssh -i cshl_2021_student.pem ubuntu@[public.ip.address].
Optional - set up DNS redirects (see below)

Set up a dynamic DNS service

Rather than handing out IP addresses for each student instance to each student, you can instead set up DNS records to redirect from a more human-readable name to the IP address. After spinning up all student instances, use a service like http://dyn.com (or http://entrydns.net, etc.) to create hostnames that point to the public IP address of each student instance.

Host necessary files for the course

Currently, all miscellaneous data files, annotations, etc. are hosted on an ftp server at the Genome Institute. In the future more data files could be pre-loaded onto the EBS snapshot.

Files copied to: /gscmnt/sata102/info/ftp-staging/pub/rnaseq/
Appear here: http://genome.wustl.edu/pub/rnaseq/

After course reminders

Delete the student IAM account created above otherwise students will continue to have EC2 privileges.
Terminate all instances and clean up any unnecessary volumes, snapshots, etc.

Integrated Assignment Answers

0009-08-01T00:00:00+00:00

Integrated Assignment answers

Background: Cell lines are often used to study different experimental conditions and to study the function of specific genes by various perturbation approaches. One such type of study involves knocking down expression of a target of interest by shRNA and then using RNA-seq to measure the impact on gene expression. These experiments often include use of a control shRNA to account for any expression changes that may occur from just the introduction of these molecules. Differential expression is performed by comparing biological replicates of shRNA knockdown vs shRNA control.

Objectives: In this assignment, we will be using a subset of the GSE114360 dataset, which consists of 6 RNA-seq datasets generated from a cell line (3 transfected with shRNA, and 3 controls). Our goal will be to determine differentially expressed genes.

Experimental information and other things to keep in mind:

The libraries are prepared as paired end.
The samples are sequenced on an Illumina 4000.
Each read is 150 bp long
The dataset is located here: GSE114360
3 samples transfected with target shRNA and 3 samples with control shRNA
Libraries were prepared using standard Illumina protocols
For this exercise we will be using a subset of the reads (first 1,000,000 reads from each pair).
The files are named based on their SRR id’s, and obey the following key:
- SRR7155055 = CBSLR knockdown sample 1 (T1 - aka transfected 1)
- SRR7155056 = CBSLR knockdown sample 2 (T2 - aka transfected 2)
- SRR7155057 = CBSLR knockdown sample 3 (T3 - aka transfected 3)
- SRR7155058 = control sample 1 (C1 - aka control 1)
- SRR7155059 = control sample 2 (C2 - aka control 2)
- SRR7155060 = control sample 3 (C3 - aka control 3)

Experimental descriptions from the study authors:

Experimental details from the paper: “An RNA transcriptome-sequencing analysis was performed in shRNA-NC or shRNA-CBSLR-1 MKN45 cells cultured under hypoxic conditions for 24 h (Fig. 2A).”

Experimental details from the GEO submission: “An RNA transcriptome sequencing analysis was performed in MKN45 cells that were transfected with tcons_00001221 shRNA or control shRNA.”

Note that according to GeneCards and HGNC, CBSLR and tcons_00001221 refer to the same gene.

Part 0 : Obtaining Data and References

Goals:

Obtain the files necessary for data processing
Familiarize yourself with reference and annotation file format
Familiarize yourself with sequence FASTQ format

Create a working directory ~/workspace/rnaseq/integrated_assignment/ to store this exercise. Then create a unix environment variable named RNA_INT_DIR that stores this path for convenience in later commands.

export RNA_HOME=~/workspace/rnaseq
cd $RNA_HOME
mkdir -p ~/workspace/rnaseq/integrated_assignment/
export RNA_INT_DIR=~/workspace/rnaseq/integrated_assignment

Obtain reference, annotation, adapter and data files and place them in the integrated assignment directory

Remember: when initializing an environment variable, we do NOT need the $; however, every time we call the variable, it needs to be preceded by a $.

echo $RNA_INT_DIR
cd $RNA_INT_DIR
wget http://genomedata.org/rnaseq-tutorial/Integrated_Assignment_RNA_Data.tar.gz
tar -xvf Integrated_Assignment_RNA_Data.tar.gz
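
The reminder about $ can be tried out in isolation with the short sketch below. DEMO_BASE and DEMO_INT_DIR are hypothetical names used only for illustration, and the directory is a temporary one so that nothing in the course workspace is touched.

```shell
# DEMO_BASE and DEMO_INT_DIR are hypothetical names used only for illustration.
DEMO_BASE=$(mktemp -d)                             # assignment: no $ before the name
export DEMO_INT_DIR="$DEMO_BASE/integrated_demo"   # $ is needed to read DEMO_BASE
mkdir -p "$DEMO_INT_DIR"                           # $ is needed to read DEMO_INT_DIR
echo "$DEMO_INT_DIR"
```

Writing $DEMO_BASE=... on the first line would fail, because the shell would expand the (empty) variable before attempting the assignment.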

Q1.) How many items are there under the “reference” directory (counting all files in all sub-directories)? What if this reference file was not provided for you - how would you obtain/create a reference genome fasta file? How about the GTF transcripts file from Ensembl?

A1.) The answer is 10. Review these files so that you are familiar with them. If the reference fasta or gtf was not provided, you could obtain them from the Ensembl website under their downloads > databases.

cd $RNA_INT_DIR/reference/
tree
find . -type f
find . -type f | wc -l

The . tells the find command to look in the current directory and -type f restricts the search to files only. The | passes the output of the find command to wc -l, which counts the lines of that output.
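
As a small self-contained sketch of the same find and wc -l pattern, the commands below build a throwaway directory tree and count its files. The directory and file names are made up for the demonstration and do not correspond to the real reference files.

```bash
# Build a tiny throwaway directory tree (all names are made up for this demo)
demo=$(mktemp -d)
mkdir -p "$demo/sub1" "$demo/sub2"
touch "$demo/a.fa" "$demo/sub1/b.gtf" "$demo/sub2/c.ht2"

# Files only, searched recursively: directories themselves are not counted
n_files=$(find "$demo" -type f | wc -l)

# Directories only, for comparison (this includes the top-level directory)
n_dirs=$(find "$demo" -type d | wc -l)

echo "files: $n_files, directories: $n_dirs"
rm -rf "$demo"
```

Swapping -type f for -type d is a quick way to see why the file count and the number of entries shown by tree can differ.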

Q2.) How many exons does the gene SOX4 have? Which PCA3 isoform has the most exons?

A2.) SOX4 only has 1 exon, while the longest isoform of PCA3 (ENST00000645704) has 7 exons. Review the GTF file so that you are familiar with it. What downstream steps will we need this gtf file for?

grep -w "SOX4" Homo_sapiens.GRCh38.92.gtf | less -S

grep -w "PCA3" Homo_sapiens.GRCh38.92.gtf | grep -w "exon" | cut -f 9 | cut -d ";" -f 3 | sort | uniq -c
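
The exon-counting pipeline above relies on transcript_id being the third attribute in column 9. A slightly more defensive sketch, shown here on a tiny synthetic GTF with invented gene and transcript names, pulls transcript_id out by name with awk instead of by position. Treat it as one possible approach rather than the course's method.

```bash
# Synthetic GTF: transcript T1 has 3 exons, transcript T2 has 1 exon
gtf=$(mktemp)
printf 'chr9\tdemo\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; gene_version "1"; transcript_id "T1"; gene_name "DEMO";\n' > "$gtf"
printf 'chr9\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_version "1"; transcript_id "T1"; gene_name "DEMO";\n' >> "$gtf"
printf 'chr9\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "G1"; gene_version "1"; transcript_id "T1"; gene_name "DEMO";\n' >> "$gtf"
printf 'chr9\tdemo\texon\t500\t900\t.\t+\t.\tgene_id "G1"; gene_version "1"; transcript_id "T1"; gene_name "DEMO";\n' >> "$gtf"
printf 'chr9\tdemo\texon\t100\t900\t.\t+\t.\tgene_id "G1"; gene_version "1"; transcript_id "T2"; gene_name "DEMO";\n' >> "$gtf"

# Count exon rows per transcript_id, wherever that attribute sits in column 9
exon_counts=$(grep -w "DEMO" "$gtf" | awk -F '\t' '
    $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
        # skip the 15 characters of: transcript_id "   and drop the closing quote
        id = substr($9, RSTART + 15, RLENGTH - 16)
        n[id]++
    }
    END { for (id in n) print n[id], id }
' | sort -k1,1nr)

echo "$exon_counts"
rm -f "$gtf"
```

Sorting numerically in reverse puts the isoform with the most exons on the first line.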

Q3.) How many samples do you see under the data directory?

A3.) The answer is 6 samples. The number of files is 12 because the sequence data is paired (an R1 and R2 file for each sample). The files are named based on their SRA accession number.

cd $RNA_INT_DIR/data/
ls -l
ls -1 | wc -l
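
Because each sample contributes two files, the number of samples can also be counted directly by stripping the read suffix and keeping unique names. The sketch below uses made-up accessions in a temporary directory and assumes the _1/_2 naming convention used by the data files in this exercise.

```bash
# Create empty stand-in files for three paired-end samples (made-up accessions)
demo=$(mktemp -d)
for acc in SRR0000001 SRR0000002 SRR0000003; do
    touch "$demo/${acc}_1.fastq.gz" "$demo/${acc}_2.fastq.gz"
done

# Number of files (two per sample)
n_files=$(ls -1 "$demo" | wc -l)

# Number of samples: remove the _1/_2 suffix, then count the unique names
n_samples=$(ls -1 "$demo" | sed 's/_[12]\.fastq\.gz$//' | sort -u | wc -l)

echo "$n_files files, $n_samples samples"
rm -rf "$demo"
```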

NOTE: The fastq files you have copied above contain only the first 1,000,000 reads. Keep this in mind when you are combing through the results of the differential expression analysis.

Part 1: Data preprocessing

Goals:

Run a quality check with fastqc before and after trimming
Familiarize yourself with the options for fastqc to be able to redirect your output
Perform adapter trimming and data cleanup on your data using fastp
Familiarize yourself with the output metrics from adapter trimming
Examine fastqc and/or multiqc reports for the pre- and post-trimmed data

Create a new folder that will house the outputs from FastQC. Use the -h option to view the available options (including how to redirect the output), then run FastQC on the data to assess its quality.

cd $RNA_INT_DIR
mkdir -p qc/raw_fastqc
fastqc $RNA_INT_DIR/data/*.fastq.gz -o qc/raw_fastqc/
cd qc/raw_fastqc
multiqc ./

Q4.) What metrics, if any, have the samples failed? Are the errors related?

A4.) The per base sequence content of the samples doesn’t show a flat distribution and is biased towards certain bases at the beginning of the reads. This bias could be caused by non-random priming during cDNA synthesis, giving rise to non-random bases near the beginning/end of each fragment. The QC reports also flag the presence of adapters in the reads.

Now based on the output of the html summary, proceed to clean up the reads and rerun fastqc to see if an improvement can be made to the data. Make sure to create a directory to hold any processed reads you may create.

cd $RNA_INT_DIR
mkdir trimmed_reads

fastp -i $RNA_INT_DIR/data/SRR7155055_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155055_2.fastq.gz -o trimmed_reads/SRR7155055_1.fastq.gz -O trimmed_reads/SRR7155055_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155055.fastp.json --html trimmed_reads/SRR7155055.fastp.html 2>trimmed_reads/SRR7155055.fastp.log
fastp -i $RNA_INT_DIR/data/SRR7155056_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155056_2.fastq.gz -o trimmed_reads/SRR7155056_1.fastq.gz -O trimmed_reads/SRR7155056_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155056.fastp.json --html trimmed_reads/SRR7155056.fastp.html 2>trimmed_reads/SRR7155056.fastp.log
fastp -i $RNA_INT_DIR/data/SRR7155057_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155057_2.fastq.gz -o trimmed_reads/SRR7155057_1.fastq.gz -O trimmed_reads/SRR7155057_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155057.fastp.json --html trimmed_reads/SRR7155057.fastp.html 2>trimmed_reads/SRR7155057.fastp.log
fastp -i $RNA_INT_DIR/data/SRR7155058_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155058_2.fastq.gz -o trimmed_reads/SRR7155058_1.fastq.gz -O trimmed_reads/SRR7155058_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155058.fastp.json --html trimmed_reads/SRR7155058.fastp.html 2>trimmed_reads/SRR7155058.fastp.log
fastp -i $RNA_INT_DIR/data/SRR7155059_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155059_2.fastq.gz -o trimmed_reads/SRR7155059_1.fastq.gz -O trimmed_reads/SRR7155059_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155059.fastp.json --html trimmed_reads/SRR7155059.fastp.html 2>trimmed_reads/SRR7155059.fastp.log
fastp -i $RNA_INT_DIR/data/SRR7155060_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155060_2.fastq.gz -o trimmed_reads/SRR7155060_1.fastq.gz -O trimmed_reads/SRR7155060_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155060.fastp.json --html trimmed_reads/SRR7155060.fastp.html 2>trimmed_reads/SRR7155060.fastp.log
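
The six fastp commands above differ only in the accession, so they can be generated with a loop. The sketch below is a dry run: it only prints the commands, so it can be inspected without fastp installed. The default value given to RNA_INT_DIR is an assumption used when the variable is not already set.

```bash
# Fall back to the course path if RNA_INT_DIR is not set (assumption for this sketch)
RNA_INT_DIR=${RNA_INT_DIR:-/home/ubuntu/workspace/rnaseq/integrated_assignment}

# Build one fastp command per sample; echo prints each command instead of running it
fastp_cmds=$(
for s in SRR7155055 SRR7155056 SRR7155057 SRR7155058 SRR7155059 SRR7155060; do
    echo fastp \
        -i "$RNA_INT_DIR/data/${s}_1.fastq.gz" -I "$RNA_INT_DIR/data/${s}_2.fastq.gz" \
        -o "trimmed_reads/${s}_1.fastq.gz" -O "trimmed_reads/${s}_2.fastq.gz" \
        -l 25 --adapter_fasta "$RNA_INT_DIR/adapter/illumina_multiplex.fa" \
        --trim_front1 13 --trim_front2 13 \
        --json "trimmed_reads/${s}.fastp.json" --html "trimmed_reads/${s}.fastp.html" \
        "2>trimmed_reads/${s}.fastp.log"
done
)
echo "$fastp_cmds"
```

To run the commands rather than print them, drop the leading echo and remove the quotes around the 2> redirection so the shell treats it as a redirection again.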

Q5.) What average percentage of reads remain after adapter trimming/cleanup with fastp? Why do reads get tossed out?

A5.) At this point, we could look in the log files individually. Alternatively, we could utilize the command line with a command like the one below.

grep -A 1 Read1 trimmed_reads/*.log

Doing this, we find that around 93-95% of reads survive after adapter trimming and cleanup with fastp. Reads are tossed out when they are too short after trimming (below our minimum read length threshold of 25), have poor sequence quality, or contain too many N’s.
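
The surviving percentage can also be computed rather than read off by eye. The sketch below works on a synthetic log whose layout (a “Read1 before filtering:” or “Read1 after filtering:” header followed by a “total reads:” line) is an assumption based on what the grep command above prints. The read counts are invented, so check the layout against your own fastp logs before relying on it.

```bash
# Synthetic log with invented numbers, mimicking the assumed fastp layout
log=$(mktemp)
cat > "$log" <<'EOF'
Read1 before filtering:
total reads: 1000000
total bases: 100000000

Read1 after filtering:
total reads: 940000
total bases: 80000000
EOF

# Remember which Read1 section we are in, then grab its "total reads" value
pct=$(awk '
    /^Read1 before filtering:/ { want = "before"; next }
    /^Read1 after filtering:/  { want = "after";  next }
    /^total reads:/ && want != "" { n[want] = $3; want = "" }
    END { printf "%.1f", 100 * n["after"] / n["before"] }
' "$log")

echo "Read1 surviving: ${pct}%"
rm -f "$log"
```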

Q6.) What sample has the largest number of reads after trimming?

A6.) The control sample 2 (SRR7155060) has the most reads (1,907,336 individual reads). An easy way to figure out the number of reads is to check the log file from the trimming step. Looking at the “reads passed filter” row, we see the reads (each read in a pair counted individually) that survive the trimming. We can also look at this from the command line.

grep "passed" trimmed_reads/*.log

Alternatively, you can make use of the command ‘wc’ with the -l option, which counts the number of lines in a file. Running the commands below only gives you the total number of lines in each fastq file; since fastq files have 4 lines per read, divide that total by 4 to get the number of reads. (Note that because the data is compressed, we need to use zcat to decompress it and print it to the screen before passing it on to the wc command.)

zcat $RNA_INT_DIR/data/SRR7155059_1.fastq.gz | wc -l
zcat $RNA_INT_DIR/trimmed_reads/SRR7155059_1.fastq.gz | wc -l
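
The division by 4 can be done in the shell as well. This minimal sketch builds a two-read compressed fastq with invented reads, then converts the line count to a read count. It uses gzip -dc, which behaves like zcat for this purpose.

```bash
# Write a tiny compressed fastq containing two invented reads (4 lines each)
fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$fq"

# Count lines in the decompressed stream, then divide by 4 lines per read
lines=$(gzip -dc "$fq" | wc -l)
reads=$((lines / 4))

echo "$lines lines = $reads reads"
rm -f "$fq"
```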

We could also run fastqc and multiqc on the trimmed data and visualize the remaining reads that way.

cd $RNA_INT_DIR
mkdir -p qc/trimmed_fastqc
fastqc $RNA_INT_DIR/trimmed_reads/*.fastq.gz -o qc/trimmed_fastqc/
cd qc/trimmed_fastqc
multiqc ./

Part 2: Data alignment

Goals:

Familiarize yourself with HISAT2 alignment options
Perform alignments using hisat2 and the trimmed version of the raw sequence data above
Sort your alignments and convert into compressed bam format using samtools sort
Obtain alignment summary information using samtools flagstat

To create HISAT2 alignment commands for all of the six samples and run alignments:

Create a directory to store the alignment results

echo $RNA_INT_DIR/alignments
mkdir -p $RNA_INT_DIR/alignments
cd $RNA_INT_DIR/alignments

Run alignment commands for each sample

hisat2 -p 8 --rg-id=T1 --rg SM:Transfected1 --rg LB:Transfected1_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155055_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155055_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155055.sam
hisat2 -p 8 --rg-id=T2 --rg SM:Transfected2 --rg LB:Transfected2_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155056_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155056_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155056.sam
hisat2 -p 8 --rg-id=T3 --rg SM:Transfected3 --rg LB:Transfected3_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155057_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155057_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155057.sam
hisat2 -p 8 --rg-id=C1 --rg SM:Control1 --rg LB:Control1_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155058_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155058_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155058.sam
hisat2 -p 8 --rg-id=C2 --rg SM:Control2 --rg LB:Control2_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155059_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155059_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155059.sam
hisat2 -p 8 --rg-id=C3 --rg SM:Control3 --rg LB:Control3_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155060_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155060_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155060.sam
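
Since each alignment command needs its own read group ID and sample name, a small lookup table can drive the loop. The sketch below is a dry run that only prints the hisat2 commands. The table simply restates the accession, read group, and sample name pairings used above, and the default for RNA_INT_DIR is an assumption used when the variable is not set.

```bash
# Fall back to the course path if RNA_INT_DIR is not set (assumption for this sketch)
RNA_INT_DIR=${RNA_INT_DIR:-/home/ubuntu/workspace/rnaseq/integrated_assignment}

# One row per sample: accession, read group ID, sample name
sample_table='SRR7155055 T1 Transfected1
SRR7155056 T2 Transfected2
SRR7155057 T3 Transfected3
SRR7155058 C1 Control1
SRR7155059 C2 Control2
SRR7155060 C3 Control3'

# Print (not run) one hisat2 command per row of the table
hisat2_cmds=$(printf '%s\n' "$sample_table" | while read -r acc rg sm; do
    echo hisat2 -p 8 --rg-id="$rg" --rg "SM:$sm" --rg "LB:${sm}_lib" --rg PL:ILLUMINA \
        -x "$RNA_INT_DIR/reference/Homo_sapiens.GRCh38" --dta --rna-strandness RF \
        -1 "$RNA_INT_DIR/trimmed_reads/${acc}_1.fastq.gz" \
        -2 "$RNA_INT_DIR/trimmed_reads/${acc}_2.fastq.gz" \
        -S "$RNA_INT_DIR/alignments/${acc}.sam"
done)
echo "$hisat2_cmds"
```

Keeping the pairings in one table makes it harder to mismatch an accession and its read group when the commands are edited later.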

Next, sort the sam alignments and convert them to bam.

cd $RNA_INT_DIR/alignments
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155055.bam $RNA_INT_DIR/alignments/SRR7155055.sam
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155056.bam $RNA_INT_DIR/alignments/SRR7155056.sam
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155057.bam $RNA_INT_DIR/alignments/SRR7155057.sam
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155058.bam $RNA_INT_DIR/alignments/SRR7155058.sam
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155059.bam $RNA_INT_DIR/alignments/SRR7155059.sam
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155060.bam $RNA_INT_DIR/alignments/SRR7155060.sam

Q7.) How can we obtain summary statistics for each aligned file?

A7.) There are many RNA-seq QC tools available that can provide you with detailed information about the quality of the aligned sample (e.g. FastQC and RSeQC). However, for a simple summary of aligned read counts you can use samtools flagstat.

cd $RNA_INT_DIR/alignments
samtools flagstat SRR7155055.bam > SRR7155055.flagstat.txt
samtools flagstat SRR7155056.bam > SRR7155056.flagstat.txt
samtools flagstat SRR7155057.bam > SRR7155057.flagstat.txt
samtools flagstat SRR7155058.bam > SRR7155058.flagstat.txt
samtools flagstat SRR7155059.bam > SRR7155059.flagstat.txt
samtools flagstat SRR7155060.bam > SRR7155060.flagstat.txt

Pull out summaries of mapped reads from the flagstat files

grep "mapped (" *.flagstat.txt
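
If only the percentage is needed, it can be extracted from the matching line. The sketch below uses a synthetic two-line flagstat file with invented counts. The exact line layout is assumed from typical samtools flagstat output, so adjust the pattern if your version prints something different.

```bash
# Synthetic flagstat output with invented counts
fs=$(mktemp)
cat > "$fs" <<'EOF'
2000 + 0 in total (QC-passed reads + QC-failed reads)
1900 + 0 mapped (95.00% : N/A)
EOF

# Keep the overall "mapped" line, then capture the number before the % sign
mapped_pct=$(grep -E '^[0-9]+ \+ [0-9]+ mapped \(' "$fs" | head -n 1 | sed 's/.*(\([0-9.]*\)%.*/\1/')

echo "mapped: ${mapped_pct}%"
rm -f "$fs"
```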

Q8.) Approximately how much space is saved by converting the sam to a bam format?

A8.) We get about a 5.5x compression by using the bam format instead of the sam format. This can be seen by adding the -lh option when listing the files in the alignments directory.

ls -lh $RNA_INT_DIR/alignments/

To specifically look at the sizes of the sam and bam files, we could use du -h, which shows us the disk space they are utilizing in human readable format.

du -h $RNA_INT_DIR/alignments/*.sam
du -h $RNA_INT_DIR/alignments/*.bam
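
A compression ratio like the one quoted above is just the uncompressed size divided by the compressed size. The sketch below shows the arithmetic on stand-in files: a repetitive text file and its gzip-compressed copy, not real sam/bam files, so the ratio it prints says nothing about real alignments.

```bash
# Stand-ins for an uncompressed and a compressed file (not real sam/bam data)
demo=$(mktemp -d)
awk 'BEGIN { for (i = 0; i < 10000; i++) print "ACGTACGTACGTACGT" }' > "$demo/demo.txt"
gzip -c "$demo/demo.txt" > "$demo/demo.txt.gz"

# Sizes in bytes
raw_bytes=$(wc -c < "$demo/demo.txt")
gz_bytes=$(wc -c < "$demo/demo.txt.gz")

# Ratio = uncompressed size / compressed size
ratio=$(awk -v s="$raw_bytes" -v b="$gz_bytes" 'BEGIN { printf "%.1f", s / b }')

echo "uncompressed: $raw_bytes bytes, compressed: $gz_bytes bytes, ratio: ${ratio}x"
rm -rf "$demo"
```

For the real files, the same division can be applied to the byte counts of a matching .sam and .bam pair.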

In order to make visualization easier, you should now merge each of your replicate sample bams into one combined BAM for each condition. Make sure to index these bams afterwards to be able to view them on IGV.

cd $RNA_INT_DIR/alignments
java -Xmx2g -jar $PICARD MergeSamFiles OUTPUT=transfected.bam INPUT=SRR7155055.bam INPUT=SRR7155056.bam INPUT=SRR7155057.bam
java -Xmx2g -jar $PICARD MergeSamFiles OUTPUT=control.bam INPUT=SRR7155058.bam INPUT=SRR7155059.bam INPUT=SRR7155060.bam

To visualize these merged bam files in IGV, we’ll need to index them. We can do so with the following commands.

cd $RNA_INT_DIR/alignments
samtools index $RNA_INT_DIR/alignments/control.bam
samtools index $RNA_INT_DIR/alignments/transfected.bam

Try viewing genes such as TP53 to get a sense of how the data is aligned. To do this:

Load up IGV
Change the reference genome to “Human hg38” in the top-left category
Click on File > Load from URL, and in the File URL enter: “http:///rnaseq/integrated_assignment/alignments/transfected.bam”. Repeat this step and enter “http:///rnaseq/integrated_assignment/alignments/control.bam” to load the other bam.
Right-click on the alignments track in the middle, and Group alignments by “Library”
Jump to TP53 by typing it into the search bar above

Q9.) What portion of the gene do the reads seem to be piling up on? What would be different if we were viewing whole-genome sequencing data?

A9.) The reads all pile up on the exonic regions of the gene since we’re dealing with RNA-Sequencing data. Not all exons have equal coverage, and this is due to different isoforms of the gene being sequenced. If the data was from a whole-genome experiment, we would ideally expect to see equal coverage across the whole gene length.

Right-click in the middle of the page, and click on “Expanded” to view the reads more easily.

Q10.) What are the lines connecting the reads trying to convey?

A10.) The lines show a spliced read, where one part of the read maps to one exon, while the other part maps to the next exon. This is important in RNA-Sequencing alignment, as aligners must be splice-aware in order to split a read across exons in this way.

Part 3: Expression Estimation

Goals:

Familiarize yourself with Stringtie options and how to run Stringtie in “reference-only” mode
Create an expression results directory, run stringtie on all 6 samples, and store the results in appropriately named subdirectories in this results dir
Obtain expression values for the gene SOX4

cd $RNA_INT_DIR/
mkdir -p $RNA_INT_DIR/expression

stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/transfected1/transcripts.gtf -A expression/transfected1/gene_abundances.tsv alignments/SRR7155055.bam
stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/transfected2/transcripts.gtf -A expression/transfected2/gene_abundances.tsv alignments/SRR7155056.bam
stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/transfected3/transcripts.gtf -A expression/transfected3/gene_abundances.tsv alignments/SRR7155057.bam
stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/control1/transcripts.gtf -A expression/control1/gene_abundances.tsv alignments/SRR7155058.bam
stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/control2/transcripts.gtf -A expression/control2/gene_abundances.tsv alignments/SRR7155059.bam
stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/control3/transcripts.gtf -A expression/control3/gene_abundances.tsv alignments/SRR7155060.bam

Q11.) How can you obtain the expression of the gene SOX4 across the transfected and control samples?

A11.) To look for the expression value of a specific gene, you can use the command ‘grep’ followed by the gene name and the path to the expression file

grep SOX4 $RNA_INT_DIR/expression/*/transcripts.gtf | cut -f 1,9 | grep FPKM
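
To keep only the FPKM number instead of the whole attribute column, the value can be pulled out by name. The sketch below runs on two synthetic GTF lines. The identifiers and values are invented, and the attribute layout (including a ref_gene_name field) is an assumption about StringTie output that may differ between versions.

```bash
# Synthetic StringTie-style lines: one transcript row and one exon row
gtf=$(mktemp)
printf 'chr6\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "G1"; transcript_id "T1"; ref_gene_name "SOX4"; cov "10.0"; FPKM "3.25"; TPM "5.10";\n' > "$gtf"
printf 'chr6\tStringTie\texon\t100\t900\t1000\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "1"; ref_gene_name "SOX4"; cov "10.0";\n' >> "$gtf"

# Keep transcript rows for the gene and extract the quoted value after FPKM
fpkm=$(grep -w "SOX4" "$gtf" | awk -F '\t' '
    $3 == "transcript" && match($9, /FPKM "[^"]+"/) {
        # skip the 6 characters of: FPKM "   and drop the closing quote
        print substr($9, RSTART + 6, RLENGTH - 7)
    }
')

echo "SOX4 FPKM: $fpkm"
rm -f "$gtf"
```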

Part 4: Differential Expression Analysis

Goals:

Perform differential analysis between the transfected and control samples

mkdir -p $RNA_INT_DIR/ballgown/
cd $RNA_INT_DIR/ballgown/

Perform transfected vs. control comparison, using all samples, for known transcripts:

Adapt the R tutorial code that was used in the Differential Expression section. Modify it to work on these data (which are also a 3x3 replicate comparison of two conditions).

First, start an R session:

R

Run the following R commands in your R session.

# load the required libraries
library(ballgown)
library(genefilter)
library(dplyr)
library(devtools)

# Create phenotype data needed for ballgown analysis. Recall that:
# "T1-T3" refers to "transfected" (CBSLR shRNA knockdown) replicates
# "C1-C3" refers to "control" (shRNA control) replicates

ids=c("transfected1","transfected2","transfected3","control1","control2","control3")
type=c("Transfected","Transfected","Transfected","Control","Control","Control")
results="/home/ubuntu/workspace/rnaseq/integrated_assignment/expression/"
path=paste(results,ids,sep="")
pheno_data=data.frame(ids,type,path)

pheno_data

# Load ballgown data structure and save it to a variable "bg"
bg = ballgown(samples=as.vector(pheno_data$path), pData=pheno_data)

# Display a description of this object
bg

# Load all attributes including gene name
bg_table = texpr(bg, 'all')
bg_gene_names = unique(bg_table[, 9:10])
bg_transcript_names = unique(bg_table[,c(1,6)])

# Save the ballgown object to a file for later use
save(bg, file='bg.rda')

# Perform differential expression (DE) analysis with no filtering, at both gene and transcript level
results_transcripts = stattest(bg, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM")
results_transcripts = merge(results_transcripts, bg_transcript_names, by.x=c("id"), by.y=c("t_id"))

results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, bg_gene_names, by.x=c("id"), by.y=c("gene_id"))

# Save a tab delimited file for both the transcript and gene results
write.table(results_transcripts, "Transfected_vs_Control_transcript_results.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(results_genes, "Transfected_vs_Control_gene_results.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Filter low-abundance genes. Here we remove all transcripts with a variance across the samples of less than one
bg_filt = subset (bg,"rowVars(texpr(bg)) > 1", genomesubset=TRUE)

# Load all attributes including gene name
bg_filt_table = texpr(bg_filt , 'all')
bg_filt_gene_names = unique(bg_filt_table[, 9:10])
bg_filt_transcript_names = unique(bg_filt_table[,c(1,6)])

# Perform DE analysis now using the filtered data
results_transcripts = stattest(bg_filt, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM")
results_transcripts = merge(results_transcripts, bg_filt_transcript_names, by.x=c("id"), by.y=c("t_id"))

results_genes = stattest(bg_filt, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, bg_filt_gene_names, by.x=c("id"), by.y=c("gene_id"))

# Output the filtered list of genes and transcripts and save to tab delimited files
write.table(results_transcripts, "Transfected_vs_Control_transcript_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(results_genes, "Transfected_vs_Control_gene_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Identify the significant genes with p-value < 0.05
sig_transcripts = subset(results_transcripts, results_transcripts$pval<0.05)
sig_genes = subset(results_genes, results_genes$pval<0.05)

sig_transcripts_ordered = sig_transcripts[order(sig_transcripts$pval),]
sig_genes_ordered = sig_genes[order(sig_genes$pval),]

# Output the significant gene results to a pair of tab delimited files
write.table(sig_transcripts_ordered, "Transfected_vs_Control_transcript_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(sig_genes_ordered, "Transfected_vs_Control_gene_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Exit the R session
quit(save="no")

Q12.) Are there any significant differentially expressed genes? How many in total do you see? If we expected SOX4 to be differentially expressed, why don’t we see it in this case?

A12.) Yes, there are about 523 significantly differentially expressed genes. Because we’re using a subset of the fully sequenced library for each sample, the SOX4 signal is not significant at the adjusted p-value level. You can try re-running the above exercise on your own using all the reads from each sample in the original data set, which will give you greater resolution of the expression of each gene to build mean and variance estimates for each gene’s expression.
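
The difference between the raw and adjusted p-value cutoffs can be checked from the command line on the saved results table. The sketch below uses a three-gene synthetic table with invented values. The header names (pval, qval) are assumed to match the columns written by the R code above, and count_below is a hypothetical helper defined only for this illustration.

```bash
# Synthetic results table with invented values
tsv=$(mktemp)
printf 'id\tfeature\tfc\tpval\tqval\tgene_name\n' > "$tsv"
printf 'G1\tgene\t2.10\t0.001\t0.04\tDEMO1\n' >> "$tsv"
printf 'G2\tgene\t0.95\t0.700\t0.90\tDEMO2\n' >> "$tsv"
printf 'G3\tgene\t0.40\t0.030\t0.20\tDEMO3\n' >> "$tsv"

# Hypothetical helper: count rows whose named column is below a cutoff
# usage: count_below COLUMN_NAME CUTOFF FILE
count_below() {
    awk -F '\t' -v col="$1" -v thr="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c && $c != "NA" && $c + 0 < thr + 0 { n++ }
        END { print n + 0 }
    ' "$3"
}

n_pval=$(count_below pval 0.05 "$tsv")
n_qval=$(count_below qval 0.05 "$tsv")

echo "pval < 0.05: $n_pval genes; qval < 0.05: $n_qval genes"
rm -f "$tsv"
```

Looking the column up by its header name avoids depending on the column order in the file.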

Part 5: Differential Expression Analysis Visualization

Q13.) What plots can you generate to help you visualize this gene expression profile?

A13.) The CummeRbund package provides a wide variety of plots that can be used to visualize a gene’s expression profile or genes that are differentially expressed. Some of these plots include heatmaps, boxplots, and volcano plots. Alternatively, you can create custom plots using ggplot2 or base R plotting commands such as those provided in the supplementary tutorials. Start with something very simple, such as a scatter plot of transfected vs. control FPKM values.

Make sure we are in the directory with our DE results

cd $RNA_INT_DIR/ballgown/

Restart an R session:

R

The following R commands create summary visualizations of the DE results from Ballgown

#Load libraries
library(ggplot2)
library(gplots)
library(GenomicRanges)
library(ballgown)
library(ggrepel)

#Import expression and differential expression results from the HISAT2/StringTie/Ballgown pipeline
load('bg.rda')

# View a summary of the ballgown object
bg

# Load gene names for lookup later in the tutorial
bg_table = texpr(bg, 'all')
bg_gene_names = unique(bg_table[, 9:10])

# Pull the gene_expression data frame from the ballgown object
gene_expression = as.data.frame(gexpr(bg))

#Set min value to 1
min_nonzero=1

# Set the columns for finding FPKM and create shorter names for figures
data_columns=c(1:6)
short_names=c("T1","T2","T3","C1","C2","C3")

#Calculate the FPKM sum for all 6 libraries
gene_expression[,"sum"]=apply(gene_expression[,data_columns], 1, sum)

#Identify genes where the sum of FPKM across all samples is above some arbitrary threshold
i = which(gene_expression[,"sum"] > 5)

#Calculate the correlation between all pairs of data
r=cor(gene_expression[i,data_columns], use="pairwise.complete.obs", method="pearson")

#Print out these correlation values
r

# Open a PDF file where we will save some plots. 
# We will save all figures and then view the PDF at the end
pdf(file="transfected_vs_control_figures.pdf")

data_colors=c("tomato1","tomato2","tomato3","royalblue1","royalblue2","royalblue3")

#Plot - Convert correlation to 'distance', and use 'multi-dimensional scaling' to display the relative differences between libraries
#This step calculates 2-dimensional coordinates to plot points for each library
#Libraries with similar expression patterns (highly correlated to each other) should group together

#note that the x and y display limits will have to be adjusted for each dataset depending on the amount of variability
d=1-r
mds=cmdscale(d, k=2, eig=TRUE)
par(mfrow=c(1,1))
plot(mds$points, type="n", xlab="", ylab="", main="MDS distance plot (all non-zero genes)", xlim=c(-0.01,0.01), ylim=c(-0.01,0.01))
points(mds$points[,1], mds$points[,2], col="grey", cex=2, pch=16)
text(mds$points[,1], mds$points[,2], short_names, col=data_colors)

# Calculate the differential expression results including significance
results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes,bg_gene_names,by.x=c("id"),by.y=c("gene_id"))

# Plot - Display the grand expression values from Transfected and Control and mark those that are significantly differentially expressed

sig=which(results_genes$pval<0.05)
results_genes[,"de"] = log2(results_genes[,"fc"])

gene_expression[,"Transfected"]=apply(gene_expression[,c(1:3)], 1, mean)
gene_expression[,"Control"]=apply(gene_expression[,c(4:6)], 1, mean)

x=log2(gene_expression[,"Transfected"]+min_nonzero)
y=log2(gene_expression[,"Control"]+min_nonzero)
plot(x=x, y=y, pch=16, cex=0.25, xlab="Transfected FPKM (log2)", ylab="Control FPKM (log2)", main="Transfected vs Control FPKMs")
abline(a=0, b=1)
xsig=x[sig]
ysig=y[sig]
points(x=xsig, y=ysig, col="magenta", pch=16, cex=0.5)
legend("topleft", "Significant", col="magenta", pch=16)

#Get the gene symbols for the top N (according to corrected p-value) and display them on the plot
#topn indexes into the significant subset, so use xsig/ysig and sig[topn] to keep points and labels aligned
topn = head(order(results_genes[sig,"qval"]), 25)
text(xsig[topn], ysig[topn], results_genes[sig[topn],"gene_name"], col="black", cex=0.75, srt=45)

#Plot - Volcano plot

# set default for all genes to "no change"
results_genes$diffexpressed <- "No"

# if log2Foldchange > 0.6 and pvalue < 0.05, set as "Up regulated"
results_genes$diffexpressed[results_genes$de > 0.6 & results_genes$pval < 0.05] <- "Up"

# if log2Foldchange < -0.6 and pvalue < 0.05, set as "Down regulated"
results_genes$diffexpressed[results_genes$de < -0.6 & results_genes$pval < 0.05] <- "Down"

results_genes$gene_label <- NA

# write the gene names of those significantly upregulated/downregulated to a new column
results_genes$gene_label[results_genes$diffexpressed != "No"] <- results_genes$gene_name[results_genes$diffexpressed != "No"]

ggplot(data=results_genes[results_genes$diffexpressed != "No",], aes(x=de, y=-log10(pval), label=gene_label, color = diffexpressed)) +
             xlab("log2Foldchange") +
             scale_color_manual(name = "Differentially expressed", values=c("blue", "red")) +
             geom_point() +
             theme_minimal() +
             geom_text_repel() +
             geom_vline(xintercept=c(-0.6, 0.6), col="red") +
             geom_hline(yintercept=-log10(0.05), col="red") +
             guides(colour = guide_legend(override.aes = list(size=5))) +
             geom_point(data = results_genes[results_genes$diffexpressed == "No",], aes(x=de, y=-log10(pval)), colour = "black")


dev.off()

# Exit the R session
quit(save="no")
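
The volcano plot above labels a gene “Up” or “Down” when its log2 fold change is beyond 0.6 in either direction and its p-value is below 0.05. As a rough sketch of what that cutoff means, the commands below convert 0.6 on the log2 scale back to a plain fold change and apply the same rule to a few invented fold-change and p-value pairs. classify is a hypothetical helper written for this illustration and is not part of Ballgown.

```bash
lfc_cut=0.6

# A log2 fold change of 0.6 corresponds to 2^0.6, i.e. roughly a 1.5-fold change
fold=$(awk -v l="$lfc_cut" 'BEGIN { printf "%.2f", exp(l * log(2)) }')
echo "log2 fold change $lfc_cut = ${fold}-fold"

# Hypothetical helper mirroring the Up/Down/No rule; usage: classify FOLD_CHANGE PVALUE
classify() {
    awk -v fc="$1" -v p="$2" -v l="$lfc_cut" 'BEGIN {
        de = log(fc) / log(2)          # log2 of the fold change
        if (p < 0.05 && de > l)        print "Up"
        else if (p < 0.05 && de < -l)  print "Down"
        else                           print "No"
    }'
}

classify 2.0 0.01    # beyond the upper cutoff and significant
classify 0.5 0.01    # beyond the lower cutoff and significant
classify 1.2 0.01    # significant, but the change is too small
```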

Team Assignment - ExpressionDE Answers

0009-07-01T00:00:00+00:00

The solutions below are for team A. Other team solutions will be very similar but each for their own unique chromosome dataset.

Estimate expression levels

Use stringtie to estimate gene/transcript abundance levels

cd $RNA_HOME/team_exercise
mkdir expression

stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/KO_sample1/transcripts.gtf -A expression/KO_sample1/gene_abundances.tsv alignments/SRR10045016.bam
stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/KO_sample2/transcripts.gtf -A expression/KO_sample2/gene_abundances.tsv alignments/SRR10045017.bam
stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/KO_sample3/transcripts.gtf -A expression/KO_sample3/gene_abundances.tsv alignments/SRR10045018.bam

stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/Rescue_sample1/transcripts.gtf -A expression/Rescue_sample1/gene_abundances.tsv alignments/SRR10045019.bam
stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/Rescue_sample2/transcripts.gtf -A expression/Rescue_sample2/gene_abundances.tsv alignments/SRR10045020.bam
stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/Rescue_sample3/transcripts.gtf -A expression/Rescue_sample3/gene_abundances.tsv alignments/SRR10045021.bam

Q1. Based on your stringtie results, what are the top 5 genes with highest average expression levels across all knockout samples? What about in your rescue samples? (Hint: You can use R, command-line tools, or download files to your desktop for this analysis)

A1. TO BE COMPLETED
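
One possible command-line approach to this kind of question is to average a column across the per-sample gene_abundances.tsv files and sort the result. The sketch below uses synthetic files with invented genes and values. It assumes the column layout Gene ID, Gene Name, Reference, Strand, Start, End, Coverage, FPKM, TPM, which should be checked against the header of your own files, and it is not the answer to the question itself.

```bash
# Synthetic abundance files for three samples (invented genes and values)
demo=$(mktemp -d)
for s in 1 2 3; do
    mkdir -p "$demo/KO_sample$s"
    {
        printf 'Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM\n'
        printf 'G1\tDEMO1\t11\t+\t1\t100\t5.0\t%s\t0\n' "$((10 * s))"
        printf 'G2\tDEMO2\t11\t-\t200\t300\t9.0\t%s\t0\n' "$((100 * s))"
    } > "$demo/KO_sample$s/gene_abundances.tsv"
done

# Average column 8 (FPKM) per gene name across files, then list the top 5
top_genes=$(awk -F '\t' '
    FNR == 1 { next }                      # skip the header of each file
    { sum[$2] += $8; n[$2]++ }
    END { for (g in sum) printf "%s\t%.1f\n", g, sum[g] / n[g] }
' "$demo"/KO_sample*/gene_abundances.tsv | sort -t "$(printf '\t')" -k2,2nr | head -n 5)

echo "$top_genes"
rm -rf "$demo"
```

Note that this averages over rows sharing a gene name, so a name that appears more than once per file would need separate handling. Pointing the file pattern at the Rescue sample directories would give the corresponding list for the rescue condition.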

Perform differential expression analysis

Use ballgown to identify differentially expressed genes between KO and Rescue samples

cd $RNA_HOME/team_exercise
mkdir de
cd de

First, start an R session:

R

Run the following R commands in your R session.

# load the required libraries
library(ballgown)
library(genefilter)
library(dplyr)
library(devtools)

# Create phenotype data needed for ballgown analysis.
ids=c("KO_sample1","KO_sample2","KO_sample3","Rescue_sample1","Rescue_sample2","Rescue_sample3")
type=c("KO","KO","KO","Rescue","Rescue","Rescue")
results="/home/ubuntu/workspace/rnaseq/team_exercise/expression/"
path=paste(results,ids,sep="")
pheno_data=data.frame(ids,type,path)

pheno_data

# Load ballgown data structure and save it to a variable "bg"
bg = ballgown(samples=as.vector(pheno_data$path), pData=pheno_data)

# Display a description of this object
bg

# Load all attributes including gene name
bg_table = texpr(bg, 'all')
bg_gene_names = unique(bg_table[, 9:10])
bg_transcript_names = unique(bg_table[,c(1,6)])

# Save the ballgown object to a file for later use
save(bg, file='bg.rda')

# Perform differential expression (DE) analysis with no filtering
results_transcripts = stattest(bg, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM")
results_transcripts = merge(results_transcripts, bg_transcript_names, by.x=c("id"), by.y=c("t_id"))

results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, bg_gene_names, by.x=c("id"), by.y=c("gene_id"))

# Save a tab delimited file for both the transcript and gene results
write.table(results_transcripts, "KO_vs_Rescue_transcript_results.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(results_genes, "KO_vs_Rescue_gene_results.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Filter low-abundance genes. Here we remove all transcripts with a variance across the samples of less than one
bg_filt = subset (bg,"rowVars(texpr(bg)) > 1", genomesubset=TRUE)

# Load all attributes including gene name
bg_filt_table = texpr(bg_filt , 'all')
bg_filt_gene_names = unique(bg_filt_table[, 9:10])
bg_filt_transcript_names = unique(bg_filt_table[,c(1,6)])

# Perform DE analysis now using the filtered data
results_transcripts = stattest(bg_filt, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM")
results_transcripts = merge(results_transcripts, bg_filt_transcript_names, by.x=c("id"), by.y=c("t_id"))

results_genes = stattest(bg_filt, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, bg_filt_gene_names, by.x=c("id"), by.y=c("gene_id"))

# Output the filtered list of genes and transcripts and save to tab delimited files
write.table(results_transcripts, "KO_vs_Rescue_transcript_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(results_genes, "KO_vs_Rescue_gene_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Identify the significant genes with p-value < 0.05
sig_transcripts = subset(results_transcripts, results_transcripts$pval<0.05)
sig_genes = subset(results_genes, results_genes$pval<0.05)

sig_transcripts_ordered = sig_transcripts[order(sig_transcripts$pval),]
sig_genes_ordered = sig_genes[order(sig_genes$pval),]

# Output the significant gene results to a pair of tab delimited files
write.table(sig_transcripts_ordered, "KO_vs_Rescue_transcript_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(sig_genes_ordered, "KO_vs_Rescue_gene_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Exit the R session
quit(save="no")

Q2. How many significant differentially expressed genes do you observe?

A2. TO BE COMPLETED

Q3. By referring back to the supplementary tutorial in the DE Visualization Module, can you construct a volcano plot showcasing the significantly differentially expressed genes?

A3. See below.

Perform differential expression analysis visualization

Make sure we are in the directory with our DE results

cd $RNA_HOME/team_exercise/de

Restart an R session:

R

The following R commands create summary visualizations of the DE results from Ballgown

#Load libraries
library(ggplot2)
library(gplots)
library(GenomicRanges)
library(ballgown)
library(ggrepel)

#Import expression and differential expression results from the HISAT2/StringTie/Ballgown pipeline
load('bg.rda')

# View a summary of the ballgown object
bg

# Load gene names for lookup later in the tutorial
bg_table = texpr(bg, 'all')
bg_gene_names = unique(bg_table[, 9:10])

# Pull the gene_expression data frame from the ballgown object
gene_expression = as.data.frame(gexpr(bg))

#Set a pseudocount of 1, added to FPKM values before log2 transformation to avoid log2(0)
min_nonzero=1

# Set the columns for finding FPKM and create shorter names for figures
data_columns=c(1:6)
short_names=c("KO1","KO2","KO3","R1","R2","R3")

#Calculate the FPKM sum for all 6 libraries
gene_expression[,"sum"]=apply(gene_expression[,data_columns], 1, sum)

#Identify genes where the sum of FPKM across all samples is above some arbitrary threshold
i = which(gene_expression[,"sum"] > 5)

#Calculate the correlation between all pairs of data
r=cor(gene_expression[i,data_columns], use="pairwise.complete.obs", method="pearson")

#Print out these correlation values
r

# Open a PDF file where we will save some plots. 
# We will save all figures and then view the PDF at the end
pdf(file="KO_vs_rescue_figures.pdf")

data_colors=c("tomato1","tomato2","tomato3","royalblue1","royalblue2","royalblue3")

#Plot - Convert correlation to 'distance', and use 'multi-dimensional scaling' to display the relative differences between libraries
#This step calculates 2-dimensional coordinates to plot points for each library
#Libraries with similar expression patterns (highly correlated to each other) should group together

#note that the x and y display limits will have to be adjusted for each dataset depending on the amount of variability
d=1-r
mds=cmdscale(d, k=2, eig=TRUE)
par(mfrow=c(1,1))
plot(mds$points, type="n", xlab="", ylab="", main="MDS distance plot (all non-zero genes)", xlim=c(-0.01,0.01), ylim=c(-0.01,0.01))
points(mds$points[,1], mds$points[,2], col="grey", cex=2, pch=16)
text(mds$points[,1], mds$points[,2], short_names, col=data_colors)

# Calculate the differential expression results including significance
results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes,bg_gene_names,by.x=c("id"),by.y=c("gene_id"))

# Plot - Display the mean expression values from KO and Rescue conditions and mark those that are significantly differentially expressed

sig=which(results_genes$pval<0.05)
results_genes[,"de"] = log2(results_genes[,"fc"])
gene_expression[,"KO"]=apply(gene_expression[,c(1:3)], 1, mean)
gene_expression[,"Rescue"]=apply(gene_expression[,c(4:6)], 1, mean)

# merge() can reorder rows, so look up each results_genes row in gene_expression by gene id before plotting
idx=match(results_genes$id, rownames(gene_expression))
x=log2(gene_expression[idx,"KO"]+min_nonzero)
y=log2(gene_expression[idx,"Rescue"]+min_nonzero)
plot(x=x, y=y, pch=16, cex=0.25, xlab="KO FPKM (log2)", ylab="Rescue FPKM (log2)", main="Rescue vs KO FPKMs")
abline(a=0, b=1)
xsig=x[sig]
ysig=y[sig]
points(x=xsig, y=ysig, col="magenta", pch=16, cex=0.5)
legend("topleft", "Significant", col="magenta", pch=16)

#Get the gene symbols for the top N (according to corrected p-value) and display them on the plot
#order() returns positions within the significant subset, so map them back to rows of results_genes
topn = head(sig[order(results_genes[sig,"qval"])], 25)
text(x[topn], y[topn], results_genes[topn,"gene_name"], col="black", cex=0.75, srt=45)

#Plot - Volcano plot

# set default for all genes to "no change"
results_genes$diffexpressed <- "No"

# if log2Foldchange > 0.6 and pvalue < 0.05, set as "Up" regulated
results_genes$diffexpressed[results_genes$de > 0.6 & results_genes$pval < 0.05] <- "Up"

# if log2Foldchange < -0.6 and pvalue < 0.05, set as "Down" regulated
results_genes$diffexpressed[results_genes$de < -0.6 & results_genes$pval < 0.05] <- "Down"

results_genes$gene_label <- NA

# write the gene names of those significantly upregulated/downregulated to a new column
results_genes$gene_label[results_genes$diffexpressed != "No"] <- results_genes$gene_name[results_genes$diffexpressed != "No"]

ggplot(data=results_genes[results_genes$diffexpressed != "No",], aes(x=de, y=-log10(pval), label=gene_label, color = diffexpressed)) +
             xlab("log2Foldchange") +
             scale_color_manual(name = "Differentially expressed", values=c("Down"="blue", "Up"="red")) +
             geom_point() +
             theme_minimal() +
             geom_text_repel() +
             geom_vline(xintercept=c(-0.6, 0.6), col="red") +
             geom_hline(yintercept=-log10(0.05), col="red") +
             guides(colour = guide_legend(override.aes = list(size=5))) +
             geom_point(data = results_genes[results_genes$diffexpressed == "No",], aes(x=de, y=-log10(pval)), colour = "black")


dev.off()

# Exit the R session
quit(save="no")
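
The volcano plot thresholds can be checked by hand on a few made-up values. The following is a minimal sketch, assuming a toy data frame (`toy`, with invented fold changes and p-values) in place of the real `results_genes`; it applies the same "Up"/"Down"/"No" rules as the code above and shows what a log2 fold change cutoff of 0.6 means on the linear scale.

```r
# Toy stand-in for results_genes (all values invented for illustration)
toy = data.frame(
  gene_name = c("a", "b", "c", "d"),
  fc        = c(4, 0.25, 1.2, 8),
  pval      = c(0.01, 0.02, 0.001, 0.30)
)

# log2 fold change, as in the tutorial's "de" column
toy$de = log2(toy$fc)

# Default every gene to "No", then apply the same two rules as the volcano plot
toy$diffexpressed = "No"
toy$diffexpressed[toy$de > 0.6 & toy$pval < 0.05] = "Up"
toy$diffexpressed[toy$de < -0.6 & toy$pval < 0.05] = "Down"

print(toy)

# A log2 fold change of 0.6 corresponds to roughly a 1.5-fold change
fold_cutoff = 2^0.6
print(fold_cutoff)
```

Gene "c" has a very small p-value but stays "No" because its fold change is below the cutoff, and gene "d" has a large fold change but stays "No" because its p-value is above 0.05. Both conditions must hold for a gene to be colored in the plot.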
