Welcome to the blog

Posts

My thoughts and ideas

DE Visualization with DESeq2 | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

DE Visualization with DESeq2


RNA-seq_Flowchart4


Differential Expression Visualzation

# ma plot
plotMA(resLFC, ylim=c(-2,2))

# ggplot ma plot

# volcano plot

# ggplot volcano plot

# counts plot
plotCounts(dds, gene=which.min(res$padj), intgroup="Disease")

# ggplot counts plot

# pheatmap using VST for normalized values 

# pairwise heatmap of samples to explore sample relatedness

# dispersion plot
plotDispEsts(dds)
Differential Expression with DESeq2 | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Differential Expression with DESeq2


RNA-seq_Flowchart4


Differential Expression mini lecture

If you would like a brief refresher on differential expression analysis, please refer to the mini lecture.

DESeq2 DE Analysis

In this tutorial you will:

First, create a directory for results:

cd $RNA_HOME/
mkdir -p de/htseq_counts
cd de/htseq_counts

Note that the htseq-count results provide counts for each gene but uses only the Ensembl Gene ID (e.g. ENSG00000054611). This is not very convenient for biological interpretation. This next step creates a mapping file that will help us translate from ENSG IDs to Symbols. It does this by parsing the GTF transcriptome file we got from Ensembl. That file contains both gene names and IDs. Unfortunately, this file is a bit complex to parse. Furthermore, it contains the ERCC transcripts, and these have their own naming convention which also complicated the parsing.

perl -ne 'if ($_ =~ /gene_id\s\"(ENSG\S+)\"\;/) { $id = $1; $name = undef; if ($_ =~ /gene_name\s\"(\S+)"\;/) { $name = $1; }; }; if ($id && $name) {print "$id\t$name\n";} if ($_=~/gene_id\s\"(ERCC\S+)\"/){print "$1\t$1\n";}' $RNA_REF_GTF | sort | uniq > ENSG_ID2Name.txt
head ENSG_ID2Name.txt

Determine the number of unique Ensembl Gene IDs and symbols. What does this tell you?

#count unique gene ids
cut -f 1 ENSG_ID2Name.txt | sort | uniq | wc -l
#count unique gene names
cut -f 2 ENSG_ID2Name.txt | sort | uniq | wc -l

#show the most repeated gene names
cut -f 2 ENSG_ID2Name.txt | sort | uniq -c | sort -r | head

Launch R:

R

R code has been provided below. If you wish to have a script with all of the code, it can be found here. Run the R commands below.


# # Install the latest version of DEseq2
# if (!requireNamespace("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# BiocManager::install("DESeq2", version = "3.8")

#Define working dir paths
datadir = "/cloud/project/data/bulk_rna"
outdir = "/cloud/project/outdir"

# load the library
library(DESeq2)
library(data.table)

#Set working directory to data dir
setwd(datadir)

# read in gene mappings
mapping <- fread("ENSG_ID2Name.txt", header=F)
setnames(mapping, c('ensemblID', 'Symbol'))

# read in counts
htseqCounts <- fread("gene_read_counts_table_all_final.tsv")
htseqCounts <- as.matrix(htseqCounts)
rownames(htseqCounts) <- htseqCounts[,"GeneID"]
htseqCounts <- htseqCounts[, colnames(htseqCounts) != "GeneID"]
class(htseqCounts) <- "integer"

#Set working directory to output dir
setwd(outdir)

# run filtering i.e. 1/6 samples must have counts greater than 10
# get index of rows with meet this criterion
htseqCounts <- htseqCounts[which(rowSums(htseqCounts >= 10) >=1),]

# construct mapping of meta data
metaData <- data.frame('Condition'=c('UHR', 'UHR', 'UHR', 'HBR', 'HBR', 'HBR'))
metaData$Condition <- factor(metaData$Condition, levels=c('HBR', 'UHR'))
rownames(metaData) <- colnames(htseqCounts)

# check that htseq count cols match meta data rows
all(rownames(metaData) == colnames(htseqCounts))

# make deseq2 data sets
dds <- DESeqDataSetFromMatrix(countData = htseqCounts, colData = metaData, design = ~Condition)

# run DE analysis () note look at results for direction of log2 fold-change
dds <- DESeq(dds)
#res <- results(dds) # not really need, should use shrinkage below

# shrink log2 fold change estimates
resultsNames(dds)
resLFC <- lfcShrink(dds, coef="Condition_UHR_vs_HBR", type="apeglm")

# merge on gene names #Note - should also merge original raw count values onto final dataframe
resLFC$ensemblID <- rownames(resLFC)
resLFC <- as.data.table(resLFC)
resLFC <- merge(resLFC, mapping, by='ensemblID', all.x=T)
fwrite(resLFC, file='DE_genes_DESeq2.tsv', sep="\t")


#To exit R type the following
quit(save="no")