Welcome to the blog

Posts

My thoughts and ideas

AWS Setup | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

AWS Setup

Amazon AWS/AMI setup for use in workshop

This tutorial explains how Amazon cloud instances were configured for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers that wish to conduct their hands on exercises on Amazon AWS.

Create AWS account

A helpful tutorial can be found here: https://rnabio.org/module-00-setup/0000/04/01/Intro_to_AWS/

  1. Create a new gmail account to use for the course
  2. Use the above email account to set up a new AWS/Amazon user account. Note: Any AWS account needs to be linked to an actual person and credit card account.
  3. Optional - Set up an IAM account. Give this account full EC2 but no other permissions. This provides an account that can be shared with other instructors but does not have access to billing and other root account privelages.
  4. Request limit increase for limit types you will be using. You need to be able to spin up at least one instance of the desired type for every student and TA/instructor. See: http://aws.amazon.com/about-aws/whats-new/2014/06/19/amazon-ec2-service-limits-report-now-available/. Note: You need to request an increase for each instance type and region you might use.
  5. Sign into AWS Management Console: http://aws.amazon.com/console/
  6. Go to EC2 services

Start with existing community AMI

  1. Launch a fresh Ubuntu Image (Ubuntu Server 18.04 LTS at the time of writing this). Choose an instance type of m5.2xlarge. Increase root volume (e.g., 32GB) and add a second volume (e.g., 250GB). Review and Launch. If necessary, create a new key pair, name and save somewhere safe. Select ‘View Instances’. Take note of public IP address of newly launched instance.
  2. Change permissions on downloaded key pair with chmod 400 [instructor-key].pem
  3. Login to instance with ubuntu user:

ssh -i [instructor-key].pem ubuntu@[public.ip.address]

Perform basic linux configuration

  • To allow installation of bioinformatics tools some basic dependencies must be installed first.
sudo apt-get update
sudo apt-get upgrade
sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python-dev python-numpy python3-dev python3-pip gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev libcurl4-openssl-dev apache2 csh ruby-full gnuplot cpanminus r-base libssl-dev gcc-4.8 g++-4.8 gsl-bin libgsl-dev apt-transport-https software-properties-common
sudo timedatectl set-timezone America/New_York
  • logout and log back in for changes to take affect.

Set up additional storage for workspace

We first need to setup the additional storage volume that we added when we created the instance.

# Create mountpoint for additional storage volume
cd /
sudo mkdir workspace

# Mount ephemeral storage
cd
sudo mkfs -t ext4 /dev/nvme1n1
sudo mount /dev/nvme1n1 /workspace

# Make ephemeral storage mounts persistent
# See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html for guidance on setting up fstab records for AWS
echo -e "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0\n/dev/nvme1n1 /workspace ext4 defaults,nofail 0 2" | sudo tee /etc/fstab

# Change permissions on required drives
sudo chown -R ubuntu:ubuntu /workspace

# Create symlink to the added volume in your home directory
cd ~
ln -s /workspace workspace

Install any desired informatics tools

  • NOTE: R in particular is a slow install.
  • NOTE:
- All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.
  • Paths to pre-installed tools can be added to the .bashrc file. It may also be convenient to add export RNA_HOME=~/workspace/rnaseq to the .bashrc file. See https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc.
  • NOTE: In some installations of R there is an executable called pager that clashes with the system pager. This causes man to fail. Check with a man ls and if the problem exists, add the following to .bashrc:
export MANPAGER=less

Install RNA-seq software

Create directory to install software to and setup path variables

mkdir ~/bin
cd bin
WORKSPACE=/home/ubuntu/workspace
HOME=/home/ubuntu

Install SAMtools

wget https://github.com/samtools/samtools/releases/download/1.11/samtools-1.11.tar.bz2
bunzip2 samtools-1.11.tar.bz2
tar -xvf samtools-1.11.tar
cd samtools-1.11
make
./samtools
export PATH=/home/ubuntu/bin/samtools-1.11:$PATH

Install bam-readcount

cd ~/bin
export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.11
git clone https://github.com/genome/bam-readcount.git
cd bam-readcount
cmake -Wno-dev .
make
./bin/bam-readcount
export PATH=/home/ubuntu/bin/bam-readcount/bin:$PATH

Install HISAT2

uname -m
cd ~/bin
curl -s https://cloud.biohpc.swmed.edu/index.php/s/4pMgDq4oAF9QCfA/download > hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
cd hisat2-2.2.1
./hisat2 -h
export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH

Install StringTie

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.1.4.Linux_x86_64.tar.gz
tar -xzvf stringtie-2.1.4.Linux_x86_64.tar.gz
cd stringtie-2.1.4.Linux_x86_64
./stringtie -h
export PATH=/home/ubuntu/bin/stringtie-2.1.4.Linux_x86_64:$PATH

Install gffcompare

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.1.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.12.1.Linux_x86_64.tar.gz
cd gffcompare-0.12.1.Linux_x86_64/
./gffcompare
export PATH=/home/ubuntu/bin/gffcompare-0.12.1.Linux_x86_64:$PATH

Install htseq-count

cd ~/bin
git clone https://github.com/htseq/htseq.git
cd htseq/
git fetch --all --tags
git checkout release_0.12.4
python setup.py install --user
chmod +x scripts/htseq-count
./scripts/htseq-count -h
export PATH=/home/ubuntu/bin/htseq:$PATH

Install TopHat

cd ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64/
./gtf_to_fasta
export PATH=$/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH

Install kallisto

cd ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0/
./kallisto
export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH

Install FastQC

cd ~/bin
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
cd FastQC/
chmod 755 fastqc
./fastqc --help
export PATH=/home/ubuntu/bin/FastQC:$PATH

Install MultiQC

cd ~/bin
pip3 install --user multiqc
export PATH=/home/ubuntu/.local/bin:$PATH
python3 -m multiqc --help

Install Picard

cd ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.23.8/picard.jar -O picard.jar
java -jar ~/bin/picard.jar

Install Flexbar

cd ~/bin
wget https://github.com/seqan/flexbar/releases/download/v3.5.0/flexbar-3.5.0-linux.tar.gz
tar -xzvf flexbar-3.5.0-linux.tar.gz
cd flexbar-3.5.0-linux/
export LD_LIBRARY_PATH=~/bin/flexbar-3.5.0-linux:$LD_LIBRARY_PATH
./flexbar
export PATH=/home/ubuntu/bin/flexbar-3.5.0-linux:$PATH

Install Regtools

cd ~/bin
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools
export PATH=/home/ubuntu/bin/regtools/build:$PATH

Install RSeQC

pip3 install RSeQC
~/.local/bin/read_GC.py
export PATH=~/.local/bin/:$PATH

Install bedops

cd ~/bin
mkdir bedops_linux_x86_64-v2.4.39
cd bedops_linux_x86_64-v2.4.39
wget -c https://github.com/bedops/bedops/releases/download/v2.4.39/bedops_linux_x86_64-v2.4.39.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.39.tar.bz2
./bin/bedops
export PATH=~/bin/bedops_linux_x86_64-v2.4.39/bin:$PATH

Install gtfToGenePred

cd ~/bin
mkdir gtfToGenePred
cd gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred
export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH

Install genePredToBed

cd ~/bin
mkdir genePredtoBed
cd genePredtoBed
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod a+x genePredToBed
./genePredToBed
export PATH=/home/ubuntu/bin/genePredToBed:$PATH

Install how_are_we_stranded_here

pip3 install git+https://github.com/kcotto/how_are_we_stranded_here.git
check_strandedness

Install Cell Ranger

  • Must register to get download link
cd ~/bin
wget `download_link`
tar -xzvf cellranger-4.0.0.tar.gz
export PATH=/home/ubuntu/bin/cellranger-4.0.0:$PATH

Install R

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/'
sudo apt-get update
sudo apt-get install r-base r-base-core r-recommended

Note, if X11 libraries are not available you may need to use --with-x=no during config, on a regular linux system you would not use this option. Also, linking the R-patched bin directory into your PATH may cause weird things to happen, such as man pages or git log to not display. This can be circumvented by directly linking the R* executables (R, RScript, RCmd, etc.) into a PATH directory.

R Libraries

For this tutorial we require:

R
install.packages(c("devtools","dplyr","gplots","ggplot2","Seurat","sctransform","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR"),repos="http://cran.us.r-project.org")
quit(save="no")

Bioconductor libraries

For this tutorial we require:

R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva"))
quit(save="no")

Install CONICSmat

R
install.packages("devtools")
devtools::install_github("diazlab/CONICS/CONICSmat", dep = TRUE)

Install Signac

R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

# Tell R to also check bioconductor when installing dependencies
setRepositories(ind=1:2)

# Install Signac (GO.db must installed with Bioconductor)
install.packages("devtools")
BiocManager::install(c("GO.db","DirichletMultinomial"))
devtools::install_github("timoast/signac")
quit(save="no")

Install Sleuth

R
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
quit(save="no")

Install TABIX (GEMINI pre-req)

sudo apt-get install tabix

Install BWA

git clone https://github.com/lh3/bwa.git
cd bwa
make

Install bedtools

wget https://github.com/arq5x/bedtools2/releases/download/v2.29.1/bedtools-2.29.1.tar.gz
tar -zxvf bedtools-2.29.1.tar.gz
cd bedtools2
make

Set up Apache web server

We will start an apache2 service and serve the contents of the students home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by url, etc. Note that when launching instances a security group will have to be selected/modified that allows http access via port 80.

  • Edit config to allow files to be served from outside /usr/share and /var/www
sudo vim /etc/apache2/apache2.conf
  • Add the following content to apache2.conf
<Directory /home/ubuntu/>
       Options Indexes FollowSymLinks
       AllowOverride None
       Require all granted
</Directory>
  • Edit vhost file
sudo vim /etc/apache2/sites-available/000-default.conf
  • Change document root in 000-default.conf
DocumentRoot /home/ubuntu
  • Restart apache
sudo service apache2 restart

Save a public AMI

Finally, save the instance as a new AMI by right clicking the instance and clicking on “Create Image”. Enter an appropriate name and description and then save. If desired, you may choose at this time to include the workspace snapshot in the AMI to avoid having to explicitly attach it later at launching of AMI instances. Change the permissions of the AMI to “public” if you would like it to be listed under the Community AMIs. Copy the AMI to any additional regions where you would like it to appear in Community AMI searches.

Current Public AMIs

  • cshl-seqtec-2019 (ami-018b3bf40f9926ac5; N. Virginia)
  • cshl-seqtech-2020 (ami-09ecbedc3b79937e3; N. Virginia)

Modification for SeqTech 2020:

cd /
sudo mkdir workspace2
sudo mount /dev/nvme1n1 /workspace2
echo -e "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0\n/dev/nvme0n1 /workspace ext4 defaults,nofail 0 2\n/dev/nvme1n1 /workspace2 ext2 defaults,nofail 0 2" | sudo tee /etc/fstab
cd workspace2/
sudo chown -R ubuntu:ubuntu /workspace2
rsync -av /workspace/* /workspace2/
rm -rf lost+found/
cd
rm -rf workspace
ln -s /workspace2 workspace

Create IAM account

From AWS Console select Services -> IAM. Go to Users, Create User, specify a user name, and Create. Download credentials to a safe location for later reference if needed. Select the new user and go to Security Credentials -> Manage Password -> ‘Assign a Custom Password’. Go to Groups -> Create a New Group, specify a group name and Next. Attach a policy to the group. In this case we give all EC2 privileges but no other AWS privileges by specifying “AmazonEC2FullAccess”. Hit Next, review and then Create Group. Select the Group -> Add Users to Group, select your new user to add it to the new group.

Launch student instance

  1. Go to AWS console. Login. Select EC2.
  2. Launch Instance, search for “cshl-seqtech-2020” in Community AMIs and Select.
  3. Choose “m5.2xlarge” instance type.
  4. Select one instance to launch (e.g., one per student and instructor), and select “Protect against accidental termination”
  5. Make sure that you see two snapshots (e.g., the 32GB root volume and 80GB EBS volume you set up earlier)
  6. Create a tag with name=StudentName
  7. Choose existing security group call “SSH_HTTP_8081_IN_ALL_OUT”. Review and Launch.
  8. Choose an existing key pair (CSHL.pem)
  9. View instances and wait for them to finish initiating.
  10. Find your instance in console and select it, then hit connect to get your public.ip.address.
  11. Login to node ssh -i CSHL.pem ubuntu@[public.ip.address].
  12. Optional - set up DNS redirects (see below)

Set up a dynamic DNS service

Rather than handing out ip addresses for each student instance to each student you can instead set up DNS records to redirect from a more human readable name to the IP address. After spinning up all student instances, use a service like http://dyn.com (or http://entrydns.net, etc.) to create hostnames like , , etc that point to each public IP address of student instances.

Host necessary files for the course

Currently, all miscellaneous data files, annotations, etc. are hosted on an ftp server at the Genome Institute. In the future more data files could be pre-loaded onto the EBS snapshot.

After course reminders

  • Delete the student IAM account created above otherwise students will continue to have EC2 privileges.
  • Terminate all instances and clean up any unnecessary volumes, snapshots, etc.
Integrated Assignment Answers | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Integrated Assignment Answers

Integrated Assignment answers

Background: The use of cell lines are often implemented in order to study different experimental conditions. One such kind of study is the effects of shRNA on expression profiles, to determine whether these effects target specific genes. Experimental models for these include using control shRNA to account for any expression changes that may occur from just the introduction of these molecules.

Objectives: In this assignment, we will be using a subset of the GSE114360 dataset, which consists of 6 RNA sequence files on the SGC-7901 gastric cancer cell line, (3 transfected with tcons_00001221 shRNA, and 3 control shRNA), and determine the number of differentially expressed genes.

Experimental information and other things to keep in mind:

PART 0 : Obtaining Data and References

Goals:

Create a working directory ~/workspace/rnaseq/integrated_assignment/ to store this exercise. Then create a unix environment variable named RNA_INT_ASSIGNMENT that stores this path for convenience in later commands.

export RNA_HOME=~/workspace/rnaseq
cd $RNA_HOME
mkdir -p ~/workspace/rnaseq/integrated_assignment/
export RNA_INT_ASSIGNMENT=~/workspace/rnaseq/integrated_assignment

You will also need the following environment variables througout the assignment:

export RNA_INT_DATA_DIR=$RNA_INT_ASSIGNMENT/top_1mil
export RNA_INT_REFS_DIR=$RNA_INT_ASSIGNMENT/reference
export RNA_INT_ILL_ADAPT=$RNA_INT_ASSIGNMENT/adapter
export RNA_INT_REF_INDEX=$RNA_INT_REFS_DIR/Homo_sapiens.GRCh38
export RNA_INT_REF_FASTA=$RNA_INT_REF_INDEX.dna.primary_assembly.fa
export RNA_INT_REF_GTF=$RNA_INT_REFS_DIR/Homo_sapiens.GRCh38.92.gtf
export RNA_INT_ALIGN_DIR=$RNA_INT_ASSIGNMENT/hisat2

Obtain reference, annotation, adapter and data files and place them in the integrated assignment directory Note: when initiating an environment variable, we do not need the $; however, everytime we call the variable, it needs to be preceeded by a $.

echo $RNA_INT_ASSIGNMENT
cd $RNA_INT_ASSIGNMENT
ln -s ~/CourseData/RNA_data/Integrative_Assignment_RNA/reference/
ln -s ~/CourseData/RNA_data/Integrative_Assignment_RNA/raw_reads/top_1mil/
ln -s ~/CourseData/RNA_data/Integrative_Assignment_RNA/adapter/

Q1.) How many items are there under the “reference” directory (counting all files in all sub-directories)? What if this reference file was not provided for you - how would you obtain/create a reference genome fasta file. How about the GTF transcripts file from Ensembl?

A1.) The answer is 19. Review these files so that you are familiar with them. If the reference fasta or gtf was not provided, you could obtain them from the Ensembl website under their downloads > databases.

cd $RNA_INT_ASSIGNMENT/reference/
tree
find . -type f
find . -type f | wc -l

The . tells the find command to look in the current directory and -type f restricts the search to files only. The | uses the output from the find command and wc -l counts the lines of that output

Q2.) How many exons does the gene SOX4 have? How about the longest isoform of PCA3?

A2.) SOX4 only has 1 exon, while the longest isoform of PCA3 has 7 exons. Review the GTF file so that you are familiar with it. What downstream steps will we need this gtf file for?

grep -w "SOX4" Homo_sapiens.GRCh38.92.gtf
grep -w "PCA3" Homo_sapiens.GRCh38.92.gtf | grep "exon_number" | cut -f9 | awk '{split($0,a,";"); print a[5]}' | sort -r | head

Q3.) How many samples do you see under the data directory?

A3.) The answer is 6. The samples are paired per file, and are named based on their accession number.

cd $RNA_INT_ASSIGNMENT/raw_reads/
ls -l
ls -l | wc -l

NOTE: The fastq files you have copied above contain only the first 1000000 reads. Keep this in mind when you are combing through the results of the differential expression analysis.

Part 1 : Data preprocessing

Goals:

Now create a new folder that will house the outputs from FastQC. Use the -h option to view the potential output on the data to determine the quality of the data.

cd $RNA_INT_ASSIGNMENT
mkdir raw_fastqc
fastqc $RNA_INT_DATA_DIR/* -o raw_fastqc/
python3 -m multiqc .

Q4.) What metrics, if any, have the samples failed? Are the errors related?

A4.) The per base sequence content of the samples don’t show a flat distribution and do have a bias towards certain bases at particular positions. The reason for this is the presense of adapters in the reads, which also shows a warning if not a failure in the html output summary.

Now based on the output of the html summary, proceed to clean up the reads and rerun fastqc to see if an improvement can be made to the data. Make sure to create a directory to hold any processed reads you may create.

mkdir trimmed_reads
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNA_INT_ILL_ADAPT/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads $RNA_INT_DATA_DIR/SRR7155055_1.fastq.gz --reads2 $RNA_INT_DATA_DIR/SRR7155055_2.fastq.gz --target trimmed_reads/SRR7155055
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNA_INT_ILL_ADAPT/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads $RNA_INT_DATA_DIR/SRR7155056_1.fastq.gz --reads2 $RNA_INT_DATA_DIR/SRR7155056_2.fastq.gz --target trimmed_reads/SRR7155056
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNA_INT_ILL_ADAPT/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads $RNA_INT_DATA_DIR/SRR7155057_1.fastq.gz --reads2 $RNA_INT_DATA_DIR/SRR7155057_2.fastq.gz --target trimmed_reads/SRR7155057
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNA_INT_ILL_ADAPT/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads $RNA_INT_DATA_DIR/SRR7155058_1.fastq.gz --reads2 $RNA_INT_DATA_DIR/SRR7155058_2.fastq.gz --target trimmed_reads/SRR7155058
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNA_INT_ILL_ADAPT/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads $RNA_INT_DATA_DIR/SRR7155059_1.fastq.gz --reads2 $RNA_INT_DATA_DIR/SRR7155059_2.fastq.gz --target trimmed_reads/SRR7155059
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNA_INT_ILL_ADAPT/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 8 --zip-output GZ --reads $RNA_INT_DATA_DIR/SRR7155060_1.fastq.gz --reads2 $RNA_INT_DATA_DIR/SRR7155060_2.fastq.gz --target trimmed_reads/SRR7155060

Q5.) What average percentage of reads remain after adapter trimming? Why do reads get tossed out?

A5.) At this point, we could look in the log files individually. Alternatively, we could utilize the command line with a command like the one below.

tail -n 15 $RNA_INT_ASSIGNMENT/trimmed_reads/*.log

Doing this, we find that around 99% of reads still survive after adapter trimming. The reads that get tossed are due to being too short after trimming. They fall below our threshold of minimum read length of 25.

Q6.) What sample has the largest number of reads after trimming?

A6.) The control sample 2 (SRR7155059) has the most reads (1999678/2 = reads). An easy way to figure out the number of reads is to check the output log file from the trimming output. Looking at the “remaining reads” row, we see the reads (each read in a pair counted individually) that survive the trimming. We can also look at this from the command line.

grep 'Remaining reads' $RNA_INT_ASSIGNMENT/trimmed_reads/*.log

Alternatively, you can make use of the command ‘wc’. This command counts the number of lines in a file. Since fastq files have 4 lines per read, the total number of lines must be divided by 4. Running this command only give you the total number of lines in the fastq file (Note that because the data is compressed, we need to use zcat to unzip it and print it to the screen, before passing it on to the wc command):

zcat $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155059_1.fastq.gz | wc -l

We could also run multiqc and visualize the remaining reads that way.

PART 2: Data alignment

Goals:

A useful option to add to the end of your commands is 2>, which redirects the stdout from any command into a specific file. This can be used to redirect your stdout into a summary file, and can be used as follows: My_alignment_script 2> alignment_metrics.txt. The advantage of this is being able to view the alignment metrics later on.

To create HISAT2 alignment commands for all of the six samples and run alignments:

echo $RNA_INT_ALIGN_DIR
mkdir -p $RNA_INT_ALIGN_DIR
cd $RNA_INT_ALIGN_DIR

hisat2 -p 8 --rg-id=Transfect1 --rg SM:Transfect --rg LB:Transfect1_sub --rg PL:ILLUMINA -x $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155055_1.fastq.gz -2 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155055_2.fastq.gz -S $RNA_INT_ALIGN_DIR/SRR7155055.sam
hisat2 -p 8 --rg-id=Transfect2 --rg SM:Transfect --rg LB:Transfect2_sub --rg PL:ILLUMINA -x $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155056_1.fastq.gz -2 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155056_2.fastq.gz -S $RNA_INT_ALIGN_DIR/SRR7155056.sam
hisat2 -p 8 --rg-id=Transfect3 --rg SM:Transfect --rg LB:Transfect3_sub --rg PL:ILLUMINA -x $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155057_1.fastq.gz -2 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155057_2.fastq.gz -S $RNA_INT_ALIGN_DIR/SRR7155057.sam

hisat2 -p 8 --rg-id=Control1 --rg SM:Control --rg LB:Control1_sub --rg PL:ILLUMINA -x $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155058_1.fastq.gz -2 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155058_2.fastq.gz -S $RNA_INT_ALIGN_DIR/SRR7155058.sam
hisat2 -p 8 --rg-id=Control2 --rg SM:Control --rg LB:Control2_sub --rg PL:ILLUMINA -x $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155059_1.fastq.gz -2 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155059_2.fastq.gz -S $RNA_INT_ALIGN_DIR/SRR7155059.sam
hisat2 -p 8 --rg-id=Control3 --rg SM:Control --rg LB:Control3_sub --rg PL:ILLUMINA -x $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155060_1.fastq.gz -2 $RNA_INT_ASSIGNMENT/trimmed_reads/SRR7155060_2.fastq.gz -S $RNA_INT_ALIGN_DIR/SRR7155060.sam

Next, convert sam alignments to bam.

samtools sort -@ 8 -o $RNA_INT_ALIGN_DIR/SRR7155055.bam $RNA_INT_ALIGN_DIR/SRR7155055.sam
samtools sort -@ 8 -o $RNA_INT_ALIGN_DIR/SRR7155056.bam $RNA_INT_ALIGN_DIR/SRR7155056.sam
samtools sort -@ 8 -o $RNA_INT_ALIGN_DIR/SRR7155057.bam $RNA_INT_ALIGN_DIR/SRR7155057.sam
samtools sort -@ 8 -o $RNA_INT_ALIGN_DIR/SRR7155058.bam $RNA_INT_ALIGN_DIR/SRR7155058.sam
samtools sort -@ 8 -o $RNA_INT_ALIGN_DIR/SRR7155059.bam $RNA_INT_ALIGN_DIR/SRR7155059.sam
samtools sort -@ 8 -o $RNA_INT_ALIGN_DIR/SRR7155060.bam $RNA_INT_ALIGN_DIR/SRR7155060.sam

Q7.) How else could you obtain summary statistics for each aligned file?

A7.) There are many RNA-seq QC tools available that can provide you with detailed information about the quality of the aligned sample (e.g. FastQC and RSeQC). However, for a simple summary of aligned reads counts you can use samtools flagstat. You can also look for the logs generated by TopHat. These logs provide a summary of the aligned reads.

cd $RNA_INT_ALIGN_DIR/
samtools flagstat SRR7155055.bam > SRR7155055.flagstat.txt
samtools flagstat SRR7155056.bam > SRR7155056.flagstat.txt
samtools flagstat SRR7155057.bam > SRR7155057.flagstat.txt

samtools flagstat SRR7155058.bam > SRR7155058.flagstat.txt
samtools flagstat SRR7155059.bam > SRR7155059.flagstat.txt
samtools flagstat SRR7155060.bam > SRR7155060.flagstat.txt

grep "mapped (" *.flagstat.txt

Q8.) Approximatly how much space is saved by converting the sam to a bam format?

A8.) We get about a 5.5x compression by using the bam format instead of the sam format. This can be seen by adding the -lh option when listing the files in the aligntments directory.

ls -lh $RNA_INT_ALIGN_DIR/

To specifically look at the sizes of the sam and bam files, we could use du -h, which shows us the disk space they are utilizing in human readable format.

du -h $RNA_INT_ALIGN_DIR/*.sam
du -h $RNA_INT_ALIGN_DIR/*.bam

In order to make visualization easier, we’re going to merge each of our bams into one using the following commands. Make sure to index these bams afterwards to be able to view them on IGV.

#merge the bams for visulization purposes
cd $RNA_INT_ALIGN_DIR
java -Xmx2g -jar $PICARD MergeSamFiles OUTPUT=transfected.bam INPUT=SRR7155055.bam INPUT=SRR7155056.bam INPUT=SRR7155057.bam
java -Xmx2g -jar $PICARD MergeSamFiles OUTPUT=control.bam INPUT=SRR7155058.bam INPUT=SRR7155059.bam INPUT=SRR7155060.bam

To visualize these merged bam files in IGV, we’ll need to index them. We can do so with the following commands.

samtools index $RNA_INT_ALIGN_DIR/control.bam
samtools index $RNA_INT_ALIGN_DIR/transfected.bam

Try viewing genes such as TP53 to get a sense of how the data is aligned. To do this:

Q9.) What portion of the gene do the reads seem to be piling up on? What would be different if we were viewing whole-genome sequencing data?

A9.) The reads all pile up on the exonic regions of the gene since we’re dealing with RNA-Sequencing data. Not all exons have equal coverage, and this is due to different isoforms of the gene being sequenced. If the data was from a whole-genome experiment, we would ideally expect to see equal coverage across the whole gene length.

Right-click in the middle of the page, and click on “Expanded” to view the reads more easily.

Q10.) What are the lines connecting the reads trying to convey?

A10.) The lines show a connected read, where one part of the read begins mapping to one exon, while the other part maps to the next exon. This is important in RNA-Sequencing alignment as aligners must be aware to take this partial alignment strategy into account.

PART 3: Expression Estimation

Goals:

cd $RNA_INT_ASSIGNMENT/
mkdir -p $RNA_INT_ASSIGNMENT/expression
stringtie -p 8 -G $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38.92.gtf -e -B -o $RNA_INT_ASSIGNMENT/expression/transfect1/transcripts.gtf -A Tumor1/gene_abundances.tsv $RNA_INT_ALIGN_DIR/SRR7155055.bam
stringtie -p 8 -G $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38.92.gtf -e -B -o $RNA_INT_ASSIGNMENT/expression/transfect2/transcripts.gtf -A Tumor2/gene_abundances.tsv $RNA_INT_ALIGN_DIR/SRR7155056.bam
stringtie -p 8 -G $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38.92.gtf -e -B -o $RNA_INT_ASSIGNMENT/expression/transfect3/transcripts.gtf -A Tumor3/gene_abundances.tsv $RNA_INT_ALIGN_DIR/SRR7155057.bam
stringtie -p 8 -G $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38.92.gtf -e -B -o $RNA_INT_ASSIGNMENT/expression/control1/transcripts.gtf -A Normal1/gene_abundances.tsv $RNA_INT_ALIGN_DIR/SRR7155058.bam
stringtie -p 8 -G $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38.92.gtf -e -B -o $RNA_INT_ASSIGNMENT/expression/control2/transcripts.gtf -A Normal2/gene_abundances.tsv $RNA_INT_ALIGN_DIR/SRR7155059.bam
stringtie -p 8 -G $RNA_INT_REFS_DIR/Homo_sapiens.GRCh38.92.gtf -e -B -o $RNA_INT_ASSIGNMENT/expression/control3/transcripts.gtf -A Normal3/gene_abundances.tsv $RNA_INT_ALIGN_DIR/SRR7155060.bam

Q11.) How do you get the expression of the gene SOX4 across the transfect and control samples?

A11.) To look for the expression value of a specific gene, you can use the command ‘grep’ followed by the gene name and the path to the expression file

grep ENSG00000124766 $RNA_INT_ASSIGNMENT/expression/*/transcripts.gtf | cut -f1,9 | grep FPKM

PART 4: Differential Expression Analysis

Goals:

mkdir -p $RNA_INT_ASSIGNMENT/ballgown/
cd $RNA_INT_ASSIGNMENT/ballgown/

Perform transfect vs. control comparison, using all samples, for known (reference only mode) transcripts: First create a file that lists our 6 expression files, then view that file, then start an R session where we will examine these results:

printf "\"ids\",\"type\",\"path\"\n\"transfect1\",\"Transfect\",\"/home/ubuntu/workspace/rnaseq/integrated_assignment/expression/transfect1\"\n\"transfect2\",\"Transfect\",\"/home/ubuntu/workspace/rnaseq/integrated_assignment/expression/transfect2\"\n\"transfect3\",\"Transfect\",\"/home/ubuntu/workspace/rnaseq/integrated_assignment/expression/transfect3\"\n\"control1\",\"Control\",\"/home/ubuntu/workspace/rnaseq/integrated_assignment/expression/control1\"\n\"control2\",\"Control\",\"/home/ubuntu/workspace/rnaseq/integrated_assignment/expression/control2\"\n\"control3\",\"Control\",\"/home/ubuntu/workspace/rnaseq/integrated_assignment/expression/control3\"\n" > Transfect_vs_Control.csv

cat Transfect_vs_Control.csv

R

*Adapt the R tutorial file that has been provided in the github repo for part 1 of the tutorial: Tutorial_Part1_ballgown.R. Modify it to fit the goals of this assignment then run it.

Q12.) Are there any significant differentially expressed genes? How many in total do you see? If we expected SOX4 to be differentially expressed, why don’t we see it in this case?

A12.) Yes, there are about 523 significantly differntially expressed genes. Due to the fact that we’re using a subset of the fully sequenced library for each sample, the SOX4 signal is not significant at the adjusted p-value level. You can try re-running the above exercise on your own by using all the reads from each sample in the original data set, which will give you greater resolution of the expression of each gene to build mean and variance estimates for eacch gene’s expression.

Q13.) What plots can you generate to help you visualize this gene expression profile

A13.) The CummerBund package provides a wide variety of plots that can be used to visualize a gene’s expression profile or genes that are differentially expressed. Some of these plots include heatmaps, boxplots, and volcano plots. Alternatively you can use custom plots using ggplot2 command or base R plotting commands such as those provided in the supplementary tutorials. Start with something very simple such as a scatter plot of transfect vs. control FPKM values.