Welcome to the blog

# Posts

My thoughts and ideas

Integrated Assignment | Griffith Lab

## RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

# Integrated Assignment

Preamble: Note that the following integrated assignment asks you to work on new RNA-seq data and apply the concepts you have learned up to this point. To complete this assignment you will need to review commands we performed in many of the earlier sections. Try to construct these commands on your own and get all the way to the end of the assignment. If you get very stuck or would like to compare your solutions to those suggested by the instructors, refer to the answers page. The integrated assignment answers page is an expanded version of this page with all of the questions plus detailed code solutions to all problems. The answer page is available in the git repository for this wiki. It is slightly hidden to reduce temptation to look at it without trying on your own. Ask an instructor if you have trouble finding it.

Background: The use of cell lines are often implemented in order to study different experimental conditions. One such kind of study is the effects of shRNA on expression profiles, to determine whether these effects target specific genes. Experimental models for these include using control shRNA to account for any expression changes that may occur from just the introduction of these molecules.

Objectives: In this assignment, we will be using a subset of the GSE114360 dataset, which consists of 6 RNA sequence files on the SGC-7901 gastric cancer cell line, (3 transfected with tcons_00001221 shRNA, and 3 control shRNA), and determine the number of differentially expressed genes.

Experimental information and other things to keep in mind:

• The libraries are prepared as paired end.
• The samples are sequenced on a Illumina 4000.
• Each read is 150 bp long
• The dataset is located here: GSE114360
• 3 samples transfected with target shRNA and 3 samples with control shRNA
• Libraries were prepared using standard Illumina protocols
• For this exercise we will be using all a subset of the reads (first 1000000 reads from each pair).
• The files are named based on their SRR id’s, and obey the following key:
• SRR7155055 = transfected sample 1
• SRR7155056 = transfected sample 2
• SRR7155057 = transfected sample 3
• SRR7155058 = control sample 1
• SRR7155059 = control sample 2
• SRR7155060 = control sample 3

## PART 0 : Obtaining Data and References

Goals:

• Obtain the files necessary for data processing
• Familiarize yourself with reference and annotation file format
• Familiarize yourself with sequence FASTQ format

Create a working directory ~/workspace/rnaseq/integrated_assignment/ to store this exercise. Then create a unix environment variable named RNA_ASSIGNMENT that stores this path for convenience in later commands.

export RNA_HOME=~/workspace/rnaseq
cd $RNA_HOME mkdir -p ~/workspace/rnaseq/integrated_assignment/ export RNA_INT_ASSIGNMENT=~/workspace/rnaseq/integrated_assignment  You will also need the following environment variables througout the assignment: export RNA_INT_DATA_DIR=$RNA_INT_ASSIGNMENT/top_1mil
export RNA_INT_REFS_DIR=$RNA_INT_ASSIGNMENT/reference export RNA_INT_ILL_ADAPT=$RNA_INT_ASSIGNMENT/adapter
export RNA_INT_REF_INDEX=$RNA_INT_REFS_DIR/Homo_sapiens.GRCh38 export RNA_INT_REF_FASTA=$RNA_INT_REF_INDEX.dna.primary_assembly.fa
export RNA_INT_REF_GTF=$RNA_INT_REFS_DIR/Homo_sapiens.GRCh38.92.gtf export RNA_INT_ALIGN_DIR=$RNA_INT_ASSIGNMENT/hisat2


Obtain reference, annotation, adapter and data files and place them in the integrated assignment directory Note: when initiating an environment variable, we do not need the $; however, everytime we call the variable, it needs to be preceeded by a$.

echo $RNA_INT_ASSIGNMENT cd$RNA_INT_ASSIGNMENT
ln -s ~/CourseData/RNA_data/Integrative_Assignment_RNA/reference/
=======
ln -s ~/CourseData/CG_data/Integrative_Assignment_RNA/reference/
=======
ln -s ~/CourseData/RNA_data/Integrative_Assignment_RNA/reference/


Q1.) How many items are there under the “reference” directory (counting all files in all sub-directories)? What if this reference file was not provided for you - how would you obtain/create a reference genome fasta file. How about the GTF transcripts file from Ensembl?

Q2.) How many exons does the gene SOX4 have? How about the longest isoform of PCA3?

Q3.) How many samples do you see under the data directory?

NOTE: The fastq files you have copied above contain only the first 1000000 reads. Keep this in mind when you are combing through the results of the differential expression analysis.

## Part 1 : Data preprocessing

Goals:

• Run quality check before and after cleaning up your data
• Familiarize yourself with the options for Fastqc to be able to redirect your output
• Familiarize yourself with the output metrics from adapter trimming

Q4.) What metrics, if any, have the samples failed? Are the errors related?

Q6.) What sample has the largest number of reads after trimming?

## PART 2: Data alignment

Goals:

• Familiarize yourself with HISAT2 alignment options
• Perform alignments
• Obtain alignment summary
• Convert your alignment into compressed bam format

A useful option to add to the end of your commands is 2>, which redirects the stdout from any command into a specific file. This can be used to redirect your stdout into a summary file, and can be used as follows: My_alignment_script 2> alignment_metrics.txt. The advantage of this is being able to view the alignment metrics later on.

Q7.) How would you obtain summary statistics for each aligned file?

Q8.) Approximatly how much space is saved by converting the sam to a bam format?

In order to make visualization easier, we’re going to merge each of our bams into one using the following commands. Make sure to index these bams afterwards to be able to view them on IGV.

#merge the bams for visulization purposes
cd $RNA_INT_ALIGN_DIR java -Xmx2g -jar$PICARD MergeSamFiles OUTPUT=transfected.bam INPUT=SRR7155055.bam INPUT=SRR7155056.bam INPUT=SRR7155057.bam
java -Xmx2g -jar PICARD MergeSamFiles OUTPUT=control.bam INPUT=SRR7155058.bam INPUT=SRR7155059.bam INPUT=SRR7155060.bam  Try viewing genes such as TP53 to get a sense of how the data is aligned. To do this: • Load up IGV • Change the reference genome to “Human hg38” in the top-left category • Click on File > Load from URL, and in the File URL enter: “http://##.oicrcbw.ca/rnaseq/integrated_assignment/hisat2/transfected.bam”. Repeat this step and enter “http://##.oicrcbw.ca/rnaseq/integrated_assignment/hisat2/control.bam” to load the other bam, where ## is your student number for the AWS instance. • Right-click on the alignments track in the middle, and Group alignments by “Library” • Jump to TP53 by typing it into the search bar above Q9.) What portion of the gene do the reads seem to be piling up on? What would be different if we were viewing whole-genome sequencing data? Q10.) What are the lines connecting the reads trying to convey? ## PART 3: Expression Estimation Goals: • Familiarize yourself with Stringtie options • Run Stringtie to obtain expression values • Obtain expression values for the gene SOX4 • Create an expression results directory, run Stringtie on all samples, and store the results in appropriately named subdirectories in this results dir Q11.) How do you get the expression of the gene SOX4 across the transfect and control samples? ## PART 4: Differential Expression Analysis Goals: • Perform differential analysis between the transfected and control samples • Check if is differentially expressed First create a file that lists our 6 expression files, then view that file, then start an R session. Adapt the R tutorial file has been provided in the github repo for part 1 of the tutorial: Tutorial_Part1_ballgown.R. Modify it to fit the goals of this assignment then run it. Q12.) Are there any significant differentially expressed genes? How many in total do you see? If we expected SOX4 to be differentially expressed, why don’t we see it in this case? Q13.) What plots can you generate to help you visualize this gene expression profile? Team Assignment - Expression and DE | Griffith Lab ## RNA-seq Bioinformatics Introduction to bioinformatics for RNA sequence analysis # Team Assignment - Expression and DE In the previous exercise, teams have aligned their RNAseq data and performed QC evaluations. Using their aligned data, students will now apply the concepts they have learned regarding expression estimation and differential expression analysis. To complete this assignment, students will need to review commands we performed in earlier sections. Before starting this team exercise, first find the folder containing your 6 aligned bam files (along with the index files). Note: In the previous exercise, you merged bams files for easy visualization in IGV, we will not be using that for expression and de analysis. ## Expression Estimation Goals: • Familiarize yourself with Stringtie options • Run Stringtie to obtain expression values • Run provided stringtie helper perl script to combine results into a single file Teams can now use Stringtie to estimate the gene expression levels in their sample and answer the following questions: ### Remember to do this in a new directory under team_exercises mkdir -pRNA_HOME/team_exercise/expression/stringtie/ref_only
cd $RNA_HOME/team_exercise/expression/stringtie/ref_only  <L4> Q1. Based on your stringtie results, what are the top 5 genes with highest average expression levels across all knockout samples? What about in your rescue samples? How large is the overlap between the two sets of genes? (Hint: You can use R for this analysis) Here’s some R code to start you off: ### Start R R ### load your data into R exp_table=read.table('gene_fpkm_all_samples.tsv', header=TRUE) ### Can you explain what these two lines are doing? Check your data before and after running these commands. exp_table[,'mean_ko'] = apply(exp_table[,c(2:4)], 1, mean) exp_table[,'mean_rescue'] = apply(exp_table[,c(5:7)], 1, mean) ### What about the following two commands? exp_table[order(exp_table$mean_ko,decreasing=T)[1:5],]
exp_table[order(exp_table$mean_rescue,decreasing=T)[1:5],]  ## Differential Expression Goals: • Perform differential analysis between the knockout and rescued samples • Check which genes are differentially expressed with statistical significance • Visualize DE results Teams will now use ballgown to perform differential analysis followed by visualization of their results. <L3> Q2. Follow through the ballgown differential expression section by making modifications using your respective sample names. Hint: You will need to create a separate directory under your team_exercises folder for your ballgown outputs. You will also need to change the respective sample names and paths following the printf command. mkdir -p$RNA_HOME/team_exercise/de/ballgown/ref_only/
cd $RNA_HOME/team_exercise/de/ballgown/ref_only/ printf "\"ids\",\"type\",\"path\"\n\"KO_Rep1\",\"KO\",\"$RNA_HOME/team_exercise/expression/stringtie/ref_only/KO_Rep1\"\n\"KO_Rep2\",\"KO\",\"$RNA_HOME/team_exercise/expression/stringtie/ref_only/KO_Rep2\"\n\"KO_Rep3\",\"KO\",\"$RNA_HOME/team_exercise/expression/stringtie/ref_only/KO_Rep3\"\n\"RESCUE_Rep1\",\"RESCUE\",\"$RNA_HOME/team_exercise/expression/stringtie/ref_only/RESCUE_Rep1\"\n\"RESCUE_Rep2\",\"RESCUE\",\"$RNA_HOME/team_exercise/expression/stringtie/ref_only/RESCUE_Rep2\"\n\"RESCUE_Rep3\",\"RESCUE\",\"$RNA_HOME/team_exercise/expression/stringtie/ref_only/RESCUE_Rep3\"\n" > KO_vs_RESCUE.csv  <L3> Q3. How many significant differentially expressed genes do you observe? <L5> Q4. By referring back to the supplementary tutorial in the DE Visualization Module, can you construct a heatmap showcasing the significantly de genes? The code for Q4 is provided below, can you make sense of the following code? Do the code step by step and add in your own comments in a separate text file to explain to the TAs what each step is doing: #Load libraries library(ggplot2) library(gplots) library(GenomicRanges) library(ballgown) #Set your output pdf name pdf(file="YOUR_CHOICE_OF_FILENAME.pdf") #Set working directory where results files exist working_dir = "PATH_TO_FILES" setwd(working_dir) dir() #Load bg object load('bg.rda') bg_table = texpr(bg, 'all') bg_gene_names = unique(bg_table[, 9:10]) gene_expression = as.data.frame(gexpr(bg)) data_columns=c(1:6) results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM") results_genes = merge(results_genes,bg_gene_names,by.x=c("id"),by.y=c("gene_id")) results_genes[,"de"] = log2(results_genes[,"fc"]) sigpi = which(results_genes[,"pval"]<0.05) sigp = results_genes[sigpi,] sigde = which(abs(sigp[,"de"]) >= 2) sig_tn_de = sigp[sigde,] mydist=function(c) {dist(c,method="euclidian")} myclust=function(c) {hclust(c,method="average")} main_title="sig DE Transcripts" par(cex.main=0.8) sig_genes_de=sig_tn_de[,"id"] sig_gene_names_de=sig_tn_de[,"gene_name"] data=log2(as.matrix(gene_expression[as.vector(sig_genes_de),data_columns])+1) heatmap.2(data, hclustfun=myclust, distfun=mydist, na.rm = TRUE, scale="none", dendrogram="both", margins=c(10,4), Rowv=TRUE, Colv=TRUE, symbreaks=FALSE, key=TRUE, symkey=FALSE, density.info="none", trace="none", main=main_title, cexRow=0.3, cexCol=1, labRow=sig_gene_names_de,col=rev(heat.colors(75))) dev.off() quit()  Now. try playing around with the de filter to include more/less genes in your heatmap. Try to determine the best cutoff for your specific dataset. <L5> (OPTIONAL) Q5. Pick one of the significantly differentially expressed genes and visualize gene expression levels across the 6 samples as well as individual transcript expression levels for those corresponding to your gene of interest. (Hint: How can you modify the transcript expression plot in the DE Visualization section to showcase gene expression levels instead of transcript expression levels?) Below are hints and one version of the answer for Q5. Also note that part of the answers are specific to how the sample names and file names were constructed and will require appropriate modification. ### On the command line, you will need to first figure out which genes are most differentially expressed ### We can do this by looking at the gene_result_sig.tsv file grep -v feature KO_vs_RESCUE_gene_results_sig.tsv | sort -rnk 3 | head ### With that gene name, now lets use R to plot the expression levels of that gene across different samples. R library(ballgown) library(genefilter) library(dplyr) library(devtools) outfile="<PATH_TO_OUTPUT_PDF>/<YOUR_CHOICE_OF_GENENAME>_expression_level.pdf" pheno_data = read.csv("KO_vs_RESCUE.csv") pheno_data load('bg.rda') bg_table = texpr(bg, 'all') gene_expression = gexpr(bg) pdf(file=outfile) gene_name = "<YOUR_CHOICE_OF_GENENAME>" gene_id = bg_table[bg_table$gene_name == gene_name,]$gene_id[1] transcript_id = bg_table[bg_table$gene_id == gene_id,]$t_id boxplot(gene_expression[gene_id,] ~ pheno_data$type, border=c(2,3), main=paste('Gene Name: ', gene_name),pch=19, xlab="Type", ylab='gene_expression')
points(gene_expression[gene_id,] ~ jitter(c(1,1,1,2,2,2)), col=c(1,1,1,2,2,2)+1, pch=16)

plotTranscripts(gene_id, bg, main=c('Gene in sample RESCUE_S1'), sample=c('RESCUE_S1'))
plotTranscripts(gene_id, bg, main=c('Gene in sample RESCUE_S2'), sample=c('RESCUE_S2'))
plotTranscripts(gene_id, bg, main=c('Gene in sample RESCUE_S3'), sample=c('RESCUE_S3'))
plotTranscripts(gene_id, bg, main=c('Gene in sample KO_S1'), sample=c('KO_S1'))
plotTranscripts(gene_id, bg, main=c('Gene in sample KO_S2'), sample=c('KO_S2'))
plotTranscripts(gene_id, bg, main=c('Gene in sample KO_S3'), sample=c('KO_S3'))

dev.off()


Additionally, students should feel free to explore other visualization methods, including those they may have used in past research experiences and share with the class.

## OPTIONAL: Bonus questions

• Run htseq to get raw counts and then use edgeR for differential expression
• Compare results between ballgown de and edgeR

<L4> Q6. After obtaining your edgeR results, how does it agree with your previously obtained de results using ballgown?