Welcome to the blog


My thoughts and ideas

Integrated Assignment | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Integrated Assignment

Preamble: Note that the following integrated assignment asks you to work on new RNA-seq data and apply the concepts you have learned up to this point. To complete this assignment you will need to review commands we performed in many of the earlier sections. Try to construct these commands on your own and get all the way to the end of the assignment. If you get very stuck or would like to compare your solutions to those suggested by the instructors, refer to the answers page. The integrated assignment answers page is an expanded version of this page with all of the questions plus detailed code solutions to all problems. Avoid the temptation to look at it without trying on your own.

Background: Cell lines are often used to study different experimental conditions and to study the function of specific genes by various perturbation approaches. One such type of study involves knocking down expression of a target of interest by shRNA and then using RNA-seq to measure the impact on gene expression. These eperiments often include use of a control shRNA to account for any expression changes that may occur from just the introduction of these molecules. Differential expression is performed by comparing biological replicates of shRNA knockdown vs shRNA control.

Objectives: In this assignment, we will be using a subset of the GSE114360 dataset, which consists of 6 RNA-seq datasets generated from a cell line (3 transfected with shRNA, and 3 controls). Our goal will be to determine differentially expressed genes.

Experimental information and other things to keep in mind:

  • The libraries are prepared as paired end.
  • The samples are sequenced on an Illumina 4000.
  • Each read is 150 bp long
  • The dataset is located here: GSE114360
  • 3 samples transfected with target shRNA and 3 samples with control shRNA
  • Libraries were prepared using standard Illumina protocols
  • For this exercise we will be using a subset of the reads (first 1,000,000 reads from each pair).
  • The files are named based on their SRR id’s, and obey the following key:
    • SRR7155055 = CBSLR knockdown sample 1 (T1 - aka transfected 1)
    • SRR7155056 = CBSLR knockdown sample 2 (T2 - aka transfected 2)
    • SRR7155057 = CBSLR knockdown sample 3 (T3 - aka transfected 3)
    • SRR7155058 = control sample 1 (C1 - aka control 1)
    • SRR7155059 = control sample 2 (C2 - aka control 2)
    • SRR7155060 = control sample 3 (C3 - aka control 3)

Experimental descriptions from the study authors:

Experimental details from the paper: “An RNA transcriptome-sequencing analysis was performed in shRNA-NC or shRNA-CBSLR-1 MKN45 cells cultured under hypoxic conditions for 24 h (Fig. 2A).”

Experimental details from the GEO submission: “An RNA transcriptome sequencing analysis was performed in MKN45 cells that were transfected with tcons_00001221 shRNA or control shRNA.”

Note that according to GeneCards and HGNC, CBSLR and tcons_00001221 refer to the same gene.

PART 0 : Obtaining Data and References


  • Obtain the files necessary for data processing
  • Familiarize yourself with reference and annotation file format
  • Familiarize yourself with sequence FASTQ format

Create a working directory ~/workspace/rnaseq/integrated_assignment/ to store this exercise. Then create a unix environment variable named RNA_INT_ASSIGNMENT that stores this path for convenience in later commands.

export RNA_HOME=~/workspace/rnaseq
mkdir -p ~/workspace/rnaseq/integrated_assignment/
export RNA_INT_DIR=~/workspace/rnaseq/integrated_assignment

Obtain reference, annotation, adapter and data files and place them in the integrated assignment directory

Remember: when initiating an environment variable, we do NOT need the $; however, everytime we call the variable, it needs to be preceeded by a $.


wget http://genomedata.org/rnaseq-tutorial/Integrated_Assignment_RNA_Data.tar.gz

Q1.) How many items are there under the “reference” directory (counting all files in all sub-directories)? What if this reference file was not provided for you - how would you obtain/create a reference genome fasta file. How about the GTF transcripts file from Ensembl?

Q2.) How many exons does the gene SOX4 have? How about the longest isoform of PCA3?

Q3.) How many samples do you see under the data directory?

NOTE: The fastq files you have copied above contain only the first 1,000,000 reads. Keep this in mind when you are combing through the results of the differential expression analysis.

Part 1 : Data preprocessing


  • Run a quality check with fastqc before and after trimming
  • Familiarize yourself with the options for fastqc to be able to redirect your output
  • Perform adapter trimming and data cleanup on your data using fastp
  • Familiarize yourself with the output metrics from adapter trimming
  • Examine fastqc and/or multiqc reports for the pre- and post-trimmed data

Q4.) What metrics, if any, have the samples failed? Are the errors related?

Q5.) What average percentage of reads remain after adapter trimming? Why do reads get tossed out?

Q6.) What sample has the largest number of reads after trimming/cleanup?

Part 2: Data alignment


  • Familiarize yourself with HISAT2 alignment options
  • Perform alignments
  • Obtain alignment summary
  • Convert your alignment into compressed bam format

Q7.) How would you obtain summary statistics for each aligned file?

Q8.) Approximately how much space is saved by converting the sam to a bam format?

In order to make visualization easier, you should now merge each of your replicate sample bams into one combined BAM for each condition. Make sure to index these bams afterwards to be able to view them on IGV.

Try viewing genes such as TP53 to get a sense of how the data is aligned. To do this:

  • Load up IGV
  • Change the reference genome to “Human hg38” in the top-left category
  • Click on File > Load from URL, and in the File URL enter: “http:///rnaseq/integrated_assignment/alignments/transfected.bam". Repeat this step and enter "http:///rnaseq/integrated_assignment/alignments/control.bam" to load the other bam.
  • Right-click on the alignments track in the middle, and Group alignments by “Library”
  • Jump to TP53 by typing it into the search bar above

Q9.) What portion of the gene do the reads seem to be piling up on? What would be different if we were viewing whole-genome sequencing data?

Q10.) What are the lines connecting the reads trying to convey?

Part 3: Expression Estimation


  • Familiarize yourself with Stringtie options
  • Run Stringtie to obtain expression values
  • Obtain expression values for the gene SOX4
  • Create an expression results directory, run Stringtie on all samples, and store the results in appropriately named subdirectories in this results dir

Q11.) How can you obtain the expression of the gene SOX4 across the transfected and control samples?

Part 4: Differential Expression Analysis


  • Perform differential analysis between the transfected and control samples

Adapt the R tutorial code that was used in Differential Expression section. Modify it to work on these data (which are also a 3x3 replicate comparison of two conditions).

Q12.) Are there any significant differentially expressed genes? How many in total do you see? If we expected SOX4 to be differentially expressed, why don’t we see it in this case?

Part 5: Differential Expression Analysis Visualization

Q13.) What plots can you generate to help you visualize this gene expression profile?

Hint, a volcano plot is popular approach to provide a high level summary of a differential expression analysis. Refer to the DE Visualization section for example R code.

Team Assignment - Expression and DE | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Team Assignment - Expression and DE

In the previous Team Assignment, teams have trimmed, aligned, visualized and performed QC evaluations of their RNAseq data. Using this aligned data, students will now apply the concepts they have learned regarding expression estimation and differential expression analysis. To complete this assignment, students will need to review commands we performed in earlier sections.

Before starting this team exercise, first find the folder containing your 6 aligned bam files (along with the index files). Note: In the previous exercise, you merged bams files for easy visualization in IGV, we will not be using that for expression and de analysis.

Expression Estimation


Remember to do this in a new directory under team_exercises

mkdir -p $RNA_HOME/team_exercise/expression/
cd $RNA_HOME/team_exercise/expression/

Teams can now use Stringtie to estimate the gene expression levels in their sample and answer the following questions:

Q1. Based on your stringtie results, what are the top 5 genes with highest average expression levels across all knockout samples? What about in your rescue samples? (Hint: You can use R, command-line tools, or download files to your desktop for this analysis)

Differential Expression


Teams will now use ballgown to perform differential analysis followed by visualization of their results.

Hint: You will should create a separate directory under your team_exercises folder for your ballgown outputs.

Q2. How many significant differentially expressed genes do you observe?

Q3. By referring back to the supplementary tutorial in the DE Visualization Module, can you construct a volcano plot showcasing the significantly de genes?

Additionally, students should feel free to explore other visualization methods, including those they may have used in past research experiences and share with the class.

Presenting Your Results

At the end of this team exercise, students will show how they visualized their differential expression results to the class.