Welcome to the blog

# Posts

My thoughts and ideas

Integrated Assignment | Griffith Lab

## RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

# Integrated Assignment

Preamble: Note that the following integrated assignment asks you to work on new RNA-seq data and apply the concepts you have learned up to this point. To complete this assignment you will need to review commands we performed in many of the earlier sections. Try to construct these commands on your own and get all the way to the end of the assignment. If you get very stuck or would like to compare your solutions to those suggested by the instructors, refer to the answers page. The integrated assignment answers page is an expanded version of this page with all of the questions plus detailed code solutions to all problems. The answer page is available in the git repository for this wiki. It is slightly hidden to reduce temptation to look at it without trying on your own. Ask an instructor if you have trouble finding it.

Background: The use of cell lines are often implemented in order to study different experimental conditions. One such kind of study is the effects of shRNA on expression profiles, to determine whether these effects target specific genes. Experimental models for these include using control shRNA to account for any expression changes that may occur from just the introduction of these molecules.

Objectives: In this assignment, we will be using a subset of the GSE114360 dataset, which consists of 6 RNA sequence files on the SGC-7901 gastric cancer cell line, (3 transfected with tcons_00001221 shRNA, and 3 control shRNA), and determine the number of differentially expressed genes.

Experimental information and other things to keep in mind:

• The libraries are prepared as paired end.
• The samples are sequenced on a Illumina 4000.
• Each read is 150 bp long
• The dataset is located here: GSE114360
• 3 samples transfected with target shRNA and 3 samples with control shRNA
• Libraries were prepared using standard Illumina protocols
• For this exercise we will be using all a subset of the reads (first 1000000 reads from each pair).
• The files are named based on their SRR id’s, and obey the following key:
• SRR7155055 = transfected sample 1
• SRR7155056 = transfected sample 2
• SRR7155057 = transfected sample 3
• SRR7155058 = control sample 1
• SRR7155059 = control sample 2
• SRR7155060 = control sample 3

## PART 0 : Obtaining Data and References

Goals:

• Obtain the files necessary for data processing
• Familiarize yourself with reference and annotation file format
• Familiarize yourself with sequence FASTQ format

Create a working directory ~/workspace/rnaseq/integrated_assignment/ to store this exercise. Then create a unix environment variable named RNA_ASSIGNMENT that stores this path for convenience in later commands.

export RNA_HOME=~/workspace/rnaseq
cd $RNA_HOME mkdir -p ~/workspace/rnaseq/integrated_assignment/ export RNA_ASSIGNMENT=~/workspace/rnaseq/integrated_assignment/  You will also need the following environment variables througout the assignment: export RNA_DATA_DIR=$RNA_ASSIGNMENT/raw_reads
export RNA_REFS_DIR=$RNA_ASSIGNMENT/reference export RNA_ILL_ADAPT=$RNA_ASSIGNMENT/adapter
export RNA_REF_INDEX=$RNA_REFS_DIR/Homo_sapiens.GRCh38 export RNA_REF_FASTA=$RNA_REF_INDEX.dna.primary_assembly.fa
export RNA_REF_GTF=$RNA_REFS_DIR/Homo_sapiens.GRCh38.92.gtf export RNA_ALIGN_DIR=$RNA_ASSIGNMENT/hisat2


Obtain reference, annotation, adapter and data files and place them in the integrated assignment directory Note: when initiating an environment variable, we do not need the $; however, everytime we call the variable, it needs to be preceeded by a$.

echo $RNA_ASSIGNMENT cd$RNA_ASSIGNMENT
ln -s ~/CourseData/RNA_data/Integrative_Assignment/reference/


Q1.) How many items are there under the “reference” directory (counting all files in all sub-directories)? What if this reference file was not provided for you - how would you obtain/create a reference genome fasta file. How about the GTF transcripts file from Ensembl?

Q2.) How many exons does the gene SOX4 have? How about the longest isoform of PCA3?

Q3.) How many samples do you see under the data directory?

NOTE: The fastq files you have copied above contain only the first 1000000 reads. Keep this in mind when you are combing through the results of the differential expression analysis.

## Part 1 : Data preprocessing

Goals:

• Run quality check before and after cleaning up your data
• Familiarize yourself with the options for Fastqc to be able to redirect your output
• Familiarize yourself with the output metrics from adapter trimming

Q4.) What metrics, if any, have the samples failed? Are the errors related?

Q6.) What sample has the largest number of reads after trimming?

## PART 2: Data alignment

Goals:

• Familiarize yourself with HISAT2 alignment options
• Perform alignments
• Obtain alignment summary
• Convert your alignment into compressed bam format

A useful option to add to the end of your commands is 2>, which redirects the stdout from any command into a specific file. This can be used to redirect your stdout into a summary file, and can be used as follows: My_alignment_script 2> alignment_metrics.txt. The advantage of this is being able to view the alignment metrics later on.

Q7.) How would you obtain summary statistics for each aligned file?

Q8.) Approximatly how much space is saved by converting the sam to a bam format?

In order to make visualization easier, we’re going to merge each of our bams into one using the following commands. Make sure to index these bams afterwards to be able to view them on IGV.

#merge the bams for visulization purposes
cd RNA_ALIGN_DIR java -Xmx2g -jar ~/CourseData/RNA_data/Integrative_Assignment/picard.jar MergeSamFiles OUTPUT=transfected.bam INPUT=SRR7155055.bam INPUT=SRR7155056.bam INPUT=SRR7155057.bam java -Xmx2g -jar /usr/local/picard/picard.jar MergeSamFiles OUTPUT=control.bam INPUT=SRR7155058.bam INPUT=SRR7155059.bam INPUT=SRR7155060.bam  Try viewing genes such as TP53 to get a sense of how the data is aligned. To do this: • Load up IGV • Change the reference genome to “Human hg38” in the top-left category • Click on File > Load from URL, and in the File URL enter: “http://##.oicrcbw.ca/rnaseq/integrated_assignment/hisat2/transfected.bam”. Repeat this step and enter “http://##.oicrcbw.ca/rnaseq/integrated_assignment/hisat2/control.bam” to load the other bam, where ## is your student number for the AWS instance. • Right-click on the alignments track in the middle, and Group alignments by “Library” • Jump to TP53 by typing it into the search bar above Q9.) What portion of the gene do the reads seem to be piling up on? What would be different if we were viewing whole-genome sequencing data? Q10.) What are the lines connecting the reads trying to convey? ## PART 3: Expression Estimation Goals: • Familiarize yourself with Stringtie options • Run Stringtie to obtain expression values • Obtain expression values for the gene SOX4 • Create an expression results directory, run Stringtie on all samples, and store the results in appropriately named subdirectories in this results dir Q11.) How do you get the expression of the gene SOX4 across the transfect and control samples? ## PART 4: Differential Expression Analysis Goals: • Perform differential analysis between the transfected and control samples • Check if is differentially expressed First create a file that lists our 6 expression files, then view that file, then start an R session. Adapt the R tutorial file has been provided in the github repo for part 1 of the tutorial: Tutorial_Part1_ballgown.R. Modify it to fit the goals of this assignment then run it. Q12.) Are there any significant differentially expressed genes? How many in total do you see? If we expected SOX4 to be differentially expressed, why don’t we see it in this case? Q13.) What plots can you generate to help you visualize this gene expression profile? Practical Exercise Solutions | Griffith Lab ## RNA-seq Bioinformatics Introduction to bioinformatics for RNA sequence analysis # Practical Exercise Solutions ### Practical Exercise 1 - Software installation To install bedtools:  cdRNA_HOME/tools/
tar -zxvf bedtools-2.27.0.tar.gz
cd bedtools2/
make
./bin/bedtools


• What happens when you run bedtools without any options? The basic usage documentation is printed.

• Where can you find detailed documentation on how to use bedtools? http://bedtools.readthedocs.io/en/latest/

• How many general categories of analysis can you perform with bedtools? What are they? There are 8. They are ‘Genome arithmetic’, ‘Multi-way file comparisons’, ‘Paired-end manipulation’, ‘Format conversion’, ‘Fasta manipulation’, ‘BAM focused tools’, ‘Statistical relationships’, and ‘Miscellaneous tools’.

### Practical Exercise 2 - Reference Genomes

cd $RNA_HOME/refs cat chr22_with_ERCC92.fa | perl -ne 'if ($_ =~ /\>22/){$chr22=1}; if ($_ =~ /\>ERCC/){$chr22=0}; if ($chr22){print "$_";}' > chr22_only.fa cat chr22_only.fa | grep -v ">" | perl -ne 'chomp$_; $r+=$_ =~ tr/a/A/; $r +=$_ =~ tr/c/C/; $r +=$_ =~ tr/g/G/; $r +=$_ =~ tr/t/T/; $l += length($_); if (eof){$p = sprintf("%.2f", ($r/$l)*100); print "\nrepeat bases =$r\ntotal bases = $l\npercent repeat bases =$p%\n\n"}'
cat chr22_only.fa | grep -v ">" | perl -ne 'chomp $_;$s = uc($_); print$_;' | perl -ne '$c +=$_ =~ s/GAATTC/XXXXXX/g; if (eof){print "\nEcoRI site (GAATTC) count = $c\n\n";}'  Answers • How many bases on chromosome 22 correspond to repetitive elements? What is the percentage of the whole length? Of 50,818,468 total bases on chr 22, there are 21,522,339 that correspond to repetitive elements (42.35%). • How many occurences of the EcoRI restriction site are present in the chromosome 22 sequence? The EcoRI restriction enzyme recognition sequence is 5’-GAATTC-‘3. Since this is a palendrome, the reverse complement is the same and we only have to search for one sequence in our string. After accounting for end of line breaks and case sensitivity we find 3,935 occurences of this sequence. ### Practical Exercise 3 - Data cd$RNA_HOME
mkdir -p practice/data
cd $RNA_HOME/practice/data wget http://genomedata.org/rnaseq-tutorial/practical.tar tar -xvf practical.tar ll -1 *.fastq.gz | wc -l zcat hcc1395_normal_rep1_r1.fastq.gz | head -n 1 zcat hcc1395_normal_rep1_r1.fastq.gz | head -n 2 | tail -n 1 | perl -ne '$_ = s/T/X/g; print "\n\n$_\n\n"' #Alternatively: zcat hcc1395_normal_rep1_r1.fastq.gz | head -n 2 | tail -n 1 | grep -o T | wc  Answers • How many data files were contained in the ‘practical.tar’ archive? What commonly used sequence data file format are they? There are 12 data files in the package. Each is a FASTQ file that has been compressed. • In the first read of the hcc1395, normal, replicate 1, read 1 file, what was the physical location of the read on the flow cell (i.e. lane, tile, x, y)? Lane = 4, tile = 1101, x = 10003, y = 44458. • In the first read of this same file, how many ‘T’ bases are there? 32. ### Practical Exercise 4 - Data QC cd$RNA_HOME/practice/data
fastqc *.fastq.gz


Then, go to the following url in your browser:

• http://YOUR_DNS_NAME/workspace/rnaseq/practice/data/
• Note, you must replace YOUR_DNS_NAME with your own amazon instance IP or DNS (e.g., cbw##.dyndns.info)
• Click on any of the *_fastqc.html files to view the FastQC report (e.g., hcc1395_normal_rep1_r1_fastqc.html)

• How many total sequences are there? 331,958

• What is the range (x - y) of read lengths observed? 151

• What is the most common average sequence quality score? 41

• What does the Adaptor Content warning tell us? There is some evidence of Illumina Universal Adapter at the 3’ ends of reads. This suggests that adapter trimming might be advisable for this data.

### Practical Exercise 5 - Trim

cd $RNA_HOME/practice/data/ mkdir trimmed wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_normal_rep1_r1.fastq.gz --reads2 hcc1395_normal_rep1_r2.fastq.gz --target trimmed/hcc1395_normal_rep1 flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_normal_rep2_r1.fastq.gz --reads2 hcc1395_normal_rep2_r2.fastq.gz --target trimmed/hcc1395_normal_rep2 flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_normal_rep3_r1.fastq.gz --reads2 hcc1395_normal_rep3_r2.fastq.gz --target trimmed/hcc1395_normal_rep3 flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_tumor_rep1_r1.fastq.gz --reads2 hcc1395_tumor_rep1_r2.fastq.gz --target trimmed/hcc1395_tumor_rep1 flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_tumor_rep2_r1.fastq.gz --reads2 hcc1395_tumor_rep2_r2.fastq.gz --target trimmed/hcc1395_tumor_rep2 flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters illumina_multiplex.fa --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads hcc1395_tumor_rep3_r1.fastq.gz --reads2 hcc1395_tumor_rep3_r2.fastq.gz --target trimmed/hcc1395_tumor_rep3  Compare these files using FastQC: cd$RNA_HOME/practice/data/trimmed/
fastqc *.fastq.gz

• http://YOUR_DNS_NAME/workspace/rnaseq/practice/data/hcc1395_normal_rep1_r1_fastqc.html
• http://YOUR_DNS_NAME/workspace/rnaseq/practice/data/trimmed/hcc1395_normal_rep1_1_fastqc.html

• After trimming, what is the range of read lengths observed for hcc1395 normal replicate 1, read 1? 25-151

• Which sections of the FastQC report are most informative for observing the effect of trimming? ‘Basic Statistics’, ‘Sequence Length Distribution’, and ‘Adapter Content’

• In the ‘Per base sequence content section’, what pattern do you see? What could explain this pattern? The first 9 base positions show a spiky pattern, suggesting biased representation of each base near the beginning of our reads/fragments. One possible explanation is that random hexamer priming for cDNA synthesis during library prep is happening in a non-random way. i.e. certain random hexamers are favored, therefore the creation of fragments (and ultimately reads) has a non-random pattern near the beginning.

### Practical Exercise 6 - Alignment

Perform alignments:

export RNA_HOME=~/workspace/rnaseq
export RNA_PRACTICE_DATA_DIR=$RNA_HOME/practice/data cd$RNA_HOME/practice/

mkdir -p alignments/hisat2
cd alignments/hisat2

hisat2 -p 8 --rg-id=HCC1395_normal_rep1 --rg SM:HCC1395_normal_rep1 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1$RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep1_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep1_r2.fastq.gz -S ./HCC1395_normal_rep1.sam hisat2 -p 8 --rg-id=HCC1395_normal_rep2 --rg SM:HCC1395_normal_rep2 --rg PL:ILLUMINA -x$RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep2_r1.fastq.gz -2$RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep2_r2.fastq.gz -S ./HCC1395_normal_rep2.sam
hisat2 -p 8 --rg-id=HCC1395_normal_rep3 --rg SM:HCC1395_normal_rep3 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1$RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep3_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep3_r2.fastq.gz -S ./HCC1395_normal_rep3.sam hisat2 -p 8 --rg-id=HCC1395_tumor_rep1 --rg SM:HCC1395_tumor_rep1 --rg PL:ILLUMINA -x$RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep1_r1.fastq.gz -2$RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep1_r2.fastq.gz -S ./HCC1395_tumor_rep1.sam
hisat2 -p 8 --rg-id=HCC1395_tumor_rep2 --rg SM:HCC1395_tumor_rep2 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1$RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep2_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep2_r2.fastq.gz -S ./HCC1395_tumor_rep2.sam hisat2 -p 8 --rg-id=HCC1395_tumor_rep3 --rg SM:HCC1395_tumor_rep3 --rg PL:ILLUMINA -x$RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep3_r1.fastq.gz -2$RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep3_r2.fastq.gz -S ./HCC1395_tumor_rep3.sam


Sort and convert SAM to BAM:

samtools sort -@ 8 -o HCC1395_normal_rep1.bam HCC1395_normal_rep1.sam
samtools sort -@ 8 -o HCC1395_normal_rep2.bam HCC1395_normal_rep2.sam
samtools sort -@ 8 -o HCC1395_normal_rep3.bam HCC1395_normal_rep3.sam
samtools sort -@ 8 -o HCC1395_tumor_rep1.bam HCC1395_tumor_rep1.sam
samtools sort -@ 8 -o HCC1395_tumor_rep2.bam HCC1395_tumor_rep2.sam
samtools sort -@ 8 -o HCC1395_tumor_rep3.bam HCC1395_tumor_rep3.sam


Merge HISAT2 BAM files

java -Xmx2g -jar $RNA_HOME/tools/picard.jar MergeSamFiles OUTPUT=HCC1395_normal.bam INPUT=HCC1395_normal_rep1.bam INPUT=HCC1395_normal_rep2.bam INPUT=HCC1395_normal_rep3.bam java -Xmx2g -jar$RNA_HOME/tools/picard.jar MergeSamFiles OUTPUT=HCC1395_tumor.bam INPUT=HCC1395_tumor_rep1.bam INPUT=HCC1395_tumor_rep2.bam INPUT=HCC1395_tumor_rep3.bam


• What is the difference between a .sam and .bam file? The ‘.sam’ or SAM file is a plain text sequence alignment map file. The ‘.bam’ or BAM file is a binary compressed version of this same information.

• If you sorted the resulting BAM file as we did above, is the result sorted by read name? Or position? It is sorted by position.

• Which columns of the BAM file can be viewed to determine the style of sorting? The first, third and fourth columns contain the read name, chromosome, and position. Try samtools view HCC1395_normal.bam | head | cut -f 1,3,4 to confirm the sorting style.

• What command can you use to view only the BAM header? Try samtools view -H HCC1395_normal.bam

cd RNA_HOME/practice/alignments/hisat2 samtools index HCC1395_normal.bam samtools index HCC1395_tumor.bam  Start IGV on your laptop. Load the HCC1395_normal.bam & HCC1395_tumor.bam files in IGV. You can load the necessary files in IGV directly from your web accessible amazon workspace (see below) using ‘File’ -> ‘Load from URL’. HCC1395BL (normal) alignment: http://YOUR_DNS_NAME/workspace/rnaseq/practice/alignments/hisat2/HCC1395_normal.bam HCC1395 tumor alignment: http://YOUR_DNS_NAME/workspace/rnaseq/practice/alignments/hisat2/HCC1395_tumor.bam Answers • Load your merged normal and tumor BAM files into IGV. Navigate to this location on chromosome 22: ‘chr22:38,466,394-38,508,115’. What do you see here? How would you describe the direction of transcription for the two genes? Does the reported strand for the reads aligned to each of these genes appear to make sense? How do you modify IGV settings to see the strand clearly? This region contains two genes, ‘KDELR3’ and ‘DDX17’. With repect to direction of transcription, these genes are arranged in a tail-to-tail fashion (their transcription end points are coming together). KDELR3 is transcribed from the ‘+ve’ or ‘top’ strand (left to right) and DDX17 is transcribed from the ‘-ve’ or ‘bottom’ strand (right to left). Yes, the reads aligned appear to correspond to the expected strand of transcription. To view this pattern, do an option click within the alignment track and select ‘Color alignments by’ and ‘first-of-pair strand’ from the viewing options. You can do this for both normal and tumor alignment tracks seperately. • How can we modify IGV to color reads by Read Group? How many read groups are there for each sample (tumor & normal)? What are your read group names for the tumor sample? To see the read group of each read cleary, do an option click within the alignment track and select ‘Color alignments by’ and ‘read group’. By viewing the colors of reads and info for individual reads we can see there are 3 read groups for normal, and 3 for tumor. The names will be what you specified during your alignment command. For example: ‘HCC1395_tumor_rep1’, ‘HCC1395_tumor_rep2’, ‘HCC1395_tumor_rep3’. • What are the options for visualizing splicing or alternative splicing patterns in IGV? Navigate to this location on chromosome 22: ‘chr22:40,363,200-40,367,500’. What splicing event do you see? There are two main options for viewing splicing patterns in IGV. You can option click within the alignment track and select ‘Show Splice Junction Track’, or you can select the ‘Sashimi Plot’ option. In this region you should see an alternative splicing pattern for the gene ADSL, where a cassette exon is either included or skipped. The exon skipping isoform appears to be the minor isoform. ### Practical Exercise 8 - Expression cdRNA_HOME/practice/
mkdir -p expression/stringtie/ref_only/
cd expression/stringtie/ref_only/

stringtie -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_tumor_rep1/transcripts.gtf$RNA_HOME/practice/alignments/hisat2/HCC1395_tumor_rep1.bam
stringtie -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_tumor_rep2/transcripts.gtf$RNA_HOME/practice/alignments/hisat2/HCC1395_tumor_rep2.bam
stringtie -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_tumor_rep3/transcripts.gtf$RNA_HOME/practice/alignments/hisat2/HCC1395_tumor_rep3.bam

stringtie -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_normal_rep1/transcripts.gtf$RNA_HOME/practice/alignments/hisat2/HCC1395_normal_rep1.bam
stringtie -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_normal_rep2/transcripts.gtf$RNA_HOME/practice/alignments/hisat2/HCC1395_normal_rep2.bam
stringtie -p 8 -G $RNA_REF_GTF -e -B -o HCC1395_normal_rep3/transcripts.gtf$RNA_HOME/practice/alignments/hisat2/HCC1395_normal_rep3.bam

• ## Integrated Assignment

Preamble: Note that the following integrated assignment asks you to work on new RNA-seq data and apply the concepts you...

• ## Practical Exercise Solutions

### Practical Exercise 1 - Software installation

To install bedtools: