Welcome to the blog


My thoughts and ideas

Team Assignment - Alignment Answers | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Team Assignment - Alignment Answers

The solutions below are for team A. Other team solutions will be very similar but each for their own unique chromosome dataset.

Download data

Create a new folder for raw (untrimmed) data, download, and unpack

mkdir -p team_exercise/untrimmed
cd team_exercise/untrimmed
wget -c http://genomedata.org/seq-tec-workshop/read_data/rna_alignment-de_exercise/dataset_A/dataset.tar.gz
tar -xzvf dataset.tar.gz

Download reference files

Download chromosome-specific reference fasta, gtf and also Illumina adaptor sequences

mkdir -p $RNA_HOME/team_exercise/references
cd $RNA_HOME/team_exercise/references

wget -c http://genomedata.org/seq-tec-workshop/references/RNA/chr11.fa
wget -c http://genomedata.org/seq-tec-workshop/references/RNA/chr11_Homo_sapiens.GRCh38.95.gtf
wget -c http://genomedata.org/seq-tec-workshop/references/RNA/illumina_multiplex.fa

Trim provided data

Create a new folder for trimmed data and run fastp to trim for adaptor and read ends

cd $RNA_HOME/team_exercise
mkdir trimmed

fastp -i untrimmed/SRR10045016_1.fastq.gz -I untrimmed/SRR10045016_2.fastq.gz -o trimmed/SRR10045016_1.fastq.gz -O trimmed/SRR10045016_2.fastq.gz -l 25 --adapter_fasta $RNA_HOME/team_exercise/references/illumina_multiplex.fa --trim_tail1 5 --trim_tail2 5 --json trimmed/SRR10045016.fastp.json --html trimmed/SRR10045016.fastp.html 2>trimmed/SRR10045016.fastp.log
fastp -i untrimmed/SRR10045017_1.fastq.gz -I untrimmed/SRR10045017_2.fastq.gz -o trimmed/SRR10045017_1.fastq.gz -O trimmed/SRR10045017_2.fastq.gz -l 25 --adapter_fasta $RNA_HOME/team_exercise/references/illumina_multiplex.fa --trim_tail1 5 --trim_tail2 5 --json trimmed/SRR10045017.fastp.json --html trimmed/SRR10045017.fastp.html 2>trimmed/SRR10045017.fastp.log
fastp -i untrimmed/SRR10045018_1.fastq.gz -I untrimmed/SRR10045018_2.fastq.gz -o trimmed/SRR10045018_1.fastq.gz -O trimmed/SRR10045018_2.fastq.gz -l 25 --adapter_fasta $RNA_HOME/team_exercise/references/illumina_multiplex.fa --trim_tail1 5 --trim_tail2 5 --json trimmed/SRR10045018.fastp.json --html trimmed/SRR10045018.fastp.html 2>trimmed/SRR10045018.fastp.log

fastp -i untrimmed/SRR10045019_1.fastq.gz -I untrimmed/SRR10045019_2.fastq.gz -o trimmed/SRR10045019_1.fastq.gz -O trimmed/SRR10045019_2.fastq.gz -l 25 --adapter_fasta $RNA_HOME/team_exercise/references/illumina_multiplex.fa --trim_tail1 5 --trim_tail2 5 --json trimmed/SRR10045019.fastp.json --html trimmed/SRR10045019.fastp.html 2>trimmed/SRR10045019.fastp.log
fastp -i untrimmed/SRR10045020_1.fastq.gz -I untrimmed/SRR10045020_2.fastq.gz -o trimmed/SRR10045020_1.fastq.gz -O trimmed/SRR10045020_2.fastq.gz -l 25 --adapter_fasta $RNA_HOME/team_exercise/references/illumina_multiplex.fa --trim_tail1 5 --trim_tail2 5 --json trimmed/SRR10045020.fastp.json --html trimmed/SRR10045020.fastp.html 2>trimmed/SRR10045020.fastp.log
fastp -i untrimmed/SRR10045021_1.fastq.gz -I untrimmed/SRR10045021_2.fastq.gz -o trimmed/SRR10045021_1.fastq.gz -O trimmed/SRR10045021_2.fastq.gz -l 25 --adapter_fasta $RNA_HOME/team_exercise/references/illumina_multiplex.fa --trim_tail1 5 --trim_tail2 5 --json trimmed/SRR10045021.fastp.json --html trimmed/SRR10045021.fastp.html 2>trimmed/SRR10045021.fastp.log

Run FastQC and multiQC on trimmed and untrimmed data

Create FastQC and MultiQC reports for all trimmed and untrimmed fastq files

cd $RNA_HOME/team_exercise/untrimmed
fastqc *.fastq.gz
multiqc ./

cd $RNA_HOME/team_exercise/trimmed
fastqc *.fastq.gz
multiqc ./

Create HISAT2 index

Create HISAT2 index specific to chromosome under analysis

cd $RNA_HOME/team_exercise/references
hisat2_extract_splice_sites.py chr11_Homo_sapiens.GRCh38.95.gtf > splicesites.tsv
hisat2_extract_exons.py chr11_Homo_sapiens.GRCh38.95.gtf > exons.tsv
hisat2-build -p 4 --ss splicesites.tsv --exon exons.tsv chr11.fa chr11

Perform alignment

Create new alignments folder and run HISAT2 on all fastq files

cd $RNA_HOME/team_exercise/
mkdir alignments
cd alignments  

hisat2 -p 4 --rg-id=KO_sample_1 --rg SM:KO --rg LB:KO_sample_1 --rg PL:ILLUMINA --rg PU:SRR10045016 -x $RNA_HOME/team_exercise/references/chr11 --dta --rna-strandness RF -1 $RNA_HOME/team_exercise/untrimmed/SRR10045016_1.fastq.gz -2 $RNA_HOME/team_exercise/untrimmed/SRR10045016_2.fastq.gz -S ./SRR10045016.sam 2> SRR10045016_alignment_metrics.txt
hisat2 -p 4 --rg-id=KO_sample_2 --rg SM:KO --rg LB:KO_sample_2 --rg PL:ILLUMINA --rg PU:SRR10045017 -x $RNA_HOME/team_exercise/references/chr11 --dta --rna-strandness RF -1 $RNA_HOME/team_exercise/untrimmed/SRR10045017_1.fastq.gz -2 $RNA_HOME/team_exercise/untrimmed/SRR10045017_2.fastq.gz -S ./SRR10045017.sam 2> SRR10045017_alignment_metrics.txt
hisat2 -p 4 --rg-id=KO_sample_3 --rg SM:KO --rg LB:KO_sample_3 --rg PL:ILLUMINA --rg PU:SRR10045018 -x $RNA_HOME/team_exercise/references/chr11 --dta --rna-strandness RF -1 $RNA_HOME/team_exercise/untrimmed/SRR10045018_1.fastq.gz -2 $RNA_HOME/team_exercise/untrimmed/SRR10045018_2.fastq.gz -S ./SRR10045018.sam 2> SRR10045018_alignment_metrics.txt

hisat2 -p 4 --rg-id=Rescue_sample_1 --rg SM:Rescue --rg LB:Rescue_sample_1 --rg PL:ILLUMINA --rg PU:SRR10045019 -x $RNA_HOME/team_exercise/references/chr11 --dta --rna-strandness RF -1 $RNA_HOME/team_exercise/untrimmed/SRR10045019_1.fastq.gz -2 $RNA_HOME/team_exercise/untrimmed/SRR10045019_2.fastq.gz -S ./SRR10045019.sam 2> SRR10045019_alignment_metrics.txt
hisat2 -p 4 --rg-id=Rescue_sample_2 --rg SM:Rescue --rg LB:Rescue_sample_2 --rg PL:ILLUMINA --rg PU:SRR10045020 -x $RNA_HOME/team_exercise/references/chr11 --dta --rna-strandness RF -1 $RNA_HOME/team_exercise/untrimmed/SRR10045020_1.fastq.gz -2 $RNA_HOME/team_exercise/untrimmed/SRR10045020_2.fastq.gz -S ./SRR10045020.sam 2> SRR10045020_alignment_metrics.txt
hisat2 -p 4 --rg-id=Rescue_sample_3 --rg SM:Rescue --rg LB:Rescue_sample_3 --rg PL:ILLUMINA --rg PU:SRR10045021 -x $RNA_HOME/team_exercise/references/chr11 --dta --rna-strandness RF -1 $RNA_HOME/team_exercise/untrimmed/SRR10045021_1.fastq.gz -2 $RNA_HOME/team_exercise/untrimmed/SRR10045021_2.fastq.gz -S ./SRR10045021.sam 2> SRR10045021_alignment_metrics.txt

Sort the alignments and convert to bam

Use samtools to sort sam files and output to bam

cd $RNA_HOME/team_exercise/alignments 
samtools sort -@ 4 -o SRR10045016.bam SRR10045016.sam
samtools sort -@ 4 -o SRR10045017.bam SRR10045017.sam
samtools sort -@ 4 -o SRR10045018.bam SRR10045018.sam
samtools sort -@ 4 -o SRR10045019.bam SRR10045019.sam
samtools sort -@ 4 -o SRR10045020.bam SRR10045020.sam
samtools sort -@ 4 -o SRR10045021.bam SRR10045021.sam

Perform post-alignment QC with FastQC and multiQC

Create FastQC and MultiQC reports for all bam files

cd $RNA_HOME/team_exercise/alignments 
fastqc *.bam
mkdir fastqc
mv *_fastqc* fastqc
cd fastqc 
multiqc ./

Merge the alignments

Use Picard to merge individual bam files into single bam file for each condition

cd $RNA_HOME/team_exercise/alignments 
java -Xmx2g -jar $PICARD MergeSamFiles -OUTPUT KO_merged.bam -INPUT SRR10045016.bam -INPUT SRR10045017.bam -INPUT SRR10045018.bam
java -Xmx2g -jar $PICARD MergeSamFiles -OUTPUT RESCUE_merged.bam -INPUT SRR10045019.bam -INPUT SRR10045020.bam -INPUT SRR10045021.bam

Index the alignments

Use samtools to index file to allow loading in IGV

cd $RNA_HOME/team_exercise/alignments
samtools index KO_merged.bam
samtools index RESCUE_merged.bam
Practical Exercise Solutions | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Practical Exercise Solutions

Practical Exercise 1 - Software installation

To install bedtools:

 cd $RNA_HOME/student_tools/
 wget https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools-2.31.0.tar.gz
 tar -zxvf bedtools-2.31.0.tar.gz
 cd bedtools2/


Practical Exercise 2 - Reference Genomes

cd $RNA_HOME/refs
#first produce a fasta with only the chr22 sequence (i.e. remove the ERCC sequences).
cat chr22_with_ERCC92.fa | perl -ne 'if ($_ =~ /\>22/){$chr22=1}; if ($_ =~ /\>ERCC/){$chr22=0}; if ($chr22){print "$_";}' > chr22_only.fa

#determine the count of all repeat bases
#skip the header lines containing the sequence names, count the lower case letters, count the total length, at the end summarize totals.
cat chr22_only.fa | grep -v ">" | perl -ne 'chomp $_; $r+= $_ =~ tr/a/A/; $r += $_ =~ tr/c/C/; $r += $_ =~ tr/g/G/; $r += $_ =~ tr/t/T/; $l += length($_); if (eof){$p = sprintf("%.2f", ($r/$l)*100); print "\nrepeat bases = $r\ntotal bases = $l\npercent repeat bases = $p%\n\n"}'

#determine the occurence of an arbitrary short sequence. don't forget to remove all end of line characters before searching for the string of interest.
cat chr22_only.fa | grep -v ">" | perl -ne 'chomp $_; $s = uc($_); print $_;' | perl -ne '$c += $_ =~ s/GAATTC/XXXXXX/g; if (eof){print "\nEcoRI site (GAATTC) count = $c\n\n";}'


Practical Exercise 3 - Data

mkdir -p practice/data
cd $RNA_HOME/practice/data
wget http://genomedata.org/rnaseq-tutorial/practical.tar
tar -xvf practical.tar
ll -1 *.fastq.gz | wc -l
zcat hcc1395_normal_rep1_r1.fastq.gz | head -n 1
zcat hcc1395_normal_rep1_r1.fastq.gz | head -n 2 | tail -n 1 | perl -ne '$_ = s/T/X/g; print "\n\n$_\n\n"'

zcat hcc1395_normal_rep1_r1.fastq.gz | head -n 2 | tail -n 1 | grep -o T | wc -l


Practical Exercise 4 - Data QC

Use FastQC, Fastp, and/or MultiQC to summarize data QC

cd $RNA_HOME/practice/data
fastqc *.fastq.gz
multiqc ./

Then, go to the following url in your browser:


Practical Exercise 5 - Trim

cd $RNA_HOME/practice/data/
mkdir trimmed
wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa

fastp -i hcc1395_normal_rep1_r1.fastq.gz -I hcc1395_normal_rep1_r2.fastq.gz -o trimmed/hcc1395_normal_rep1_r1.fastq.gz -O trimmed/hcc1395_normal_rep1_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_normal_rep1.fastp.json --html trimmed/hcc1395_normal_rep1.fastp.html 2>trimmed/hcc1395_normal_rep1.fastp.log
fastp -i hcc1395_normal_rep2_r1.fastq.gz -I hcc1395_normal_rep2_r2.fastq.gz -o trimmed/hcc1395_normal_rep2_r1.fastq.gz -O trimmed/hcc1395_normal_rep2_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_normal_rep2.fastp.json --html trimmed/hcc1395_normal_rep2.fastp.html 2>trimmed/hcc1395_normal_rep2.fastp.log
fastp -i hcc1395_normal_rep3_r1.fastq.gz -I hcc1395_normal_rep3_r2.fastq.gz -o trimmed/hcc1395_normal_rep3_r1.fastq.gz -O trimmed/hcc1395_normal_rep3_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_normal_rep3.fastp.json --html trimmed/hcc1395_normal_rep3.fastp.html 2>trimmed/hcc1395_normal_rep3.fastp.log

fastp -i hcc1395_tumor_rep1_r1.fastq.gz -I hcc1395_tumor_rep1_r2.fastq.gz -o trimmed/hcc1395_tumor_rep1_r1.fastq.gz -O trimmed/hcc1395_tumor_rep1_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_tumor_rep1.fastp.json --html trimmed/hcc1395_tumor_rep1.fastp.html 2>trimmed/hcc1395_tumor_rep1.fastp.log
fastp -i hcc1395_tumor_rep2_r1.fastq.gz -I hcc1395_tumor_rep2_r2.fastq.gz -o trimmed/hcc1395_tumor_rep2_r1.fastq.gz -O trimmed/hcc1395_tumor_rep2_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_tumor_rep2.fastp.json --html trimmed/hcc1395_tumor_rep2.fastp.html 2>trimmed/hcc1395_tumor_rep2.fastp.log
fastp -i hcc1395_tumor_rep3_r1.fastq.gz -I hcc1395_tumor_rep3_r2.fastq.gz -o trimmed/hcc1395_tumor_rep3_r1.fastq.gz -O trimmed/hcc1395_tumor_rep3_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_tumor_rep3.fastp.json --html trimmed/hcc1395_tumor_rep3.fastp.html 2>trimmed/hcc1395_tumor_rep3.fastp.log

Compare these files using FastQC and/or MultiQC:

cd $RNA_HOME/practice/data/trimmed/
fastqc *.fastq.gz
multiqc ./


Practical Exercise 6 - Alignment

Perform alignments:

export RNA_PRACTICE_DATA_DIR=$RNA_HOME/practice/data
cd $RNA_HOME/practice/

mkdir -p alignments/hisat2
cd alignments/hisat2

hisat2 -p 8 --rg-id=HCC1395_normal_rep1 --rg SM:HCC1395_normal_rep1 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep1_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep1_r2.fastq.gz -S ./HCC1395_normal_rep1.sam
hisat2 -p 8 --rg-id=HCC1395_normal_rep2 --rg SM:HCC1395_normal_rep2 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep2_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep2_r2.fastq.gz -S ./HCC1395_normal_rep2.sam
hisat2 -p 8 --rg-id=HCC1395_normal_rep3 --rg SM:HCC1395_normal_rep3 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep3_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep3_r2.fastq.gz -S ./HCC1395_normal_rep3.sam

hisat2 -p 8 --rg-id=HCC1395_tumor_rep1 --rg SM:HCC1395_tumor_rep1 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep1_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep1_r2.fastq.gz -S ./HCC1395_tumor_rep1.sam
hisat2 -p 8 --rg-id=HCC1395_tumor_rep2 --rg SM:HCC1395_tumor_rep2 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep2_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep2_r2.fastq.gz -S ./HCC1395_tumor_rep2.sam
hisat2 -p 8 --rg-id=HCC1395_tumor_rep3 --rg SM:HCC1395_tumor_rep3 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep3_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep3_r2.fastq.gz -S ./HCC1395_tumor_rep3.sam

Sort and convert SAM to BAM:

samtools sort -@ 8 -o HCC1395_normal_rep1.bam HCC1395_normal_rep1.sam
samtools sort -@ 8 -o HCC1395_normal_rep2.bam HCC1395_normal_rep2.sam
samtools sort -@ 8 -o HCC1395_normal_rep3.bam HCC1395_normal_rep3.sam
samtools sort -@ 8 -o HCC1395_tumor_rep1.bam HCC1395_tumor_rep1.sam
samtools sort -@ 8 -o HCC1395_tumor_rep2.bam HCC1395_tumor_rep2.sam
samtools sort -@ 8 -o HCC1395_tumor_rep3.bam HCC1395_tumor_rep3.sam

Merge HISAT2 BAM files

java -Xmx2g -jar $PICARD MergeSamFiles -OUTPUT HCC1395_normal.bam -INPUT HCC1395_normal_rep1.bam -INPUT HCC1395_normal_rep2.bam -INPUT HCC1395_normal_rep3.bam
java -Xmx2g -jar $PICARD MergeSamFiles -OUTPUT HCC1395_tumor.bam -INPUT HCC1395_tumor_rep1.bam -INPUT HCC1395_tumor_rep2.bam -INPUT HCC1395_tumor_rep3.bam


Practical Exercise 7 - Visualize

cd $RNA_HOME/practice/alignments/hisat2
samtools index HCC1395_normal.bam
samtools index HCC1395_tumor.bam

Start IGV on your laptop. Load the HCC1395_normal.bam & HCC1395_tumor.bam files in IGV. You can load the necessary files in IGV directly from your web accessible amazon workspace (see below) using ‘File’ -> ‘Load from URL’.

HCC1395BL (normal) alignment: http://YOUR_DNS_NAME/workspace/rnaseq/practice/alignments/hisat2/HCC1395_normal.bam

HCC1395 tumor alignment: http://YOUR_DNS_NAME/workspace/rnaseq/practice/alignments/hisat2/HCC1395_tumor.bam


Practical Exercise 8 - Expression

cd $RNA_HOME/practice/
mkdir -p expression/stringtie/ref_only/
cd expression/stringtie/ref_only/

stringtie --rf -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_tumor_rep1/transcripts.gtf -A HCC1395_tumor_rep1/gene_abundances.tsv $RNA_HOME/practice/alignments/hisat2/HCC1395_tumor_rep1.bam
stringtie --rf -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_tumor_rep2/transcripts.gtf -A HCC1395_tumor_rep2/gene_abundances.tsv $RNA_HOME/practice/alignments/hisat2/HCC1395_tumor_rep2.bam
stringtie --rf -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_tumor_rep3/transcripts.gtf -A HCC1395_tumor_rep3/gene_abundances.tsv $RNA_HOME/practice/alignments/hisat2/HCC1395_tumor_rep3.bam

stringtie --rf -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_normal_rep1/transcripts.gtf -A HCC1395_normal_rep1/gene_abundances.tsv $RNA_HOME/practice/alignments/hisat2/HCC1395_normal_rep1.bam
stringtie --rf -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_normal_rep2/transcripts.gtf -A HCC1395_normal_rep2/gene_abundances.tsv $RNA_HOME/practice/alignments/hisat2/HCC1395_normal_rep2.bam
stringtie --rf -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_normal_rep3/transcripts.gtf -A HCC1395_normal_rep3/gene_abundances.tsv $RNA_HOME/practice/alignments/hisat2/HCC1395_normal_rep3.bam

Practical Exercise 9 - Differential expression

Create a new folder for DE results and then start an R session

mkdir -p $RNA_HOME/practice/de/ballgown/ref_only/
cd $RNA_HOME/practice/de/ballgown/ref_only/


Run the following R commands in your R session.

# load the required libraries

# Create phenotype data needed for ballgown analysis

# Load ballgown data structure and save it to a variable "bg"
bg = ballgown(samples=as.vector(pheno_data$path), pData=pheno_data)

# Display a description of this object

# Load all attributes including gene name
bg_table = texpr(bg, 'all')
bg_gene_names = unique(bg_table[, 9:10])

# Save the ballgown object to a file for later use
save(bg, file='bg.rda')

# Perform differential expression (DE) analysis with no filtering
results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, bg_gene_names, by.x=c("id"), by.y=c("gene_id"))

# Save a tab delimited file for gene results
write.table(results_genes, "HCC1395_Tumor_vs_Normal_gene_results.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Filter low-abundance genes. Here we remove all transcripts with a variance across the samples of less than one
bg_filt = subset (bg,"rowVars(texpr(bg)) > 1", genomesubset=TRUE)

# Load all attributes including gene name
bg_filt_table = texpr(bg_filt , 'all')
bg_filt_gene_names = unique(bg_filt_table[, 9:10])

# Perform DE analysis now using the filtered data
results_genes = stattest(bg_filt, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, bg_filt_gene_names, by.x=c("id"), by.y=c("gene_id"))

# Identify the significant genes with p-value < 0.05
sig_genes = subset(results_genes, results_genes$pval<0.05)

# Output the significant gene results to a tab delimited file
write.table(sig_genes, "HCC1395_Tumor_vs_Normal_gene_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Exit the R session

Practical Exercise 10 - Volcano plot

Navigate to previously created folder for DE results and then start an R session

cd $RNA_HOME/practice/de/ballgown/ref_only/


Run the following R commands in your R session.

#Load libraries


#If X11 not available, open a pdf device for output of all plots

#### Import the gene expression data from the practical exercises (HISAT2/StringTie/Ballgown practicals)

#Set working directory where results files exist
working_dir = "~/workspace/rnaseq/practice/de/ballgown/ref_only"

# List the current contents of this directory

#Import expression and differential expression results from the HISAT2/StringTie/Ballgown pipeline

# View a summary of the ballgown object

# Load gene names for lookup later in the tutorial
bg_table = texpr(bg, 'all')
bg_gene_names = unique(bg_table[, 9:10])

# Calculate the differential expression results including significance
results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes,bg_gene_names,by.x=c("id"),by.y=c("gene_id"))
results_genes[,"de"] = log2(results_genes[,"fc"])

#### Plot a volcano plot

# Create a new diffexpressed column and set all genes to "no change"
results_genes$diffexpressed <- "No"

# if log2Foldchange > 2 and pvalue < 0.05, set as "Up regulated"
results_genes$diffexpressed[results_genes$de > 0.6 & results_genes$pval < 0.05] <- "Up"

# if log2Foldchange < -2 and pvalue < 0.05, set as "Down regulated"
results_genes$diffexpressed[results_genes$de < -0.6 & results_genes$pval < 0.05] <- "Down"

# Create a new gene label column and populate with NA values
results_genes$gene_label <- NA

# write the gene names of those significantly upregulated/downregulated to a new column
results_genes$gene_label[results_genes$diffexpressed != "No"] <- results_genes$gene_name[results_genes$diffexpressed != "No"]

ggplot(data=results_genes[results_genes$diffexpressed != "No",], aes(x=de, y=-log10(pval), label=gene_label, color = diffexpressed)) +
             xlab("log2Foldchange") +
             scale_color_manual(name = "Differentially expressed", values=c("blue", "red")) +
             geom_point() +
             theme_minimal() +
             geom_text_repel() +
             geom_vline(xintercept=c(-0.6, 0.6), col="red") +
             geom_hline(yintercept=-log10(0.05), col="red") +
             guides(colour = guide_legend(override.aes = list(size=5))) +
             geom_point(data = results_genes[results_genes$diffexpressed == "No",], aes(x=de, y=-log10(pval)), colour = "black")



To view your plot, go to the following url in your browser: