
AWS Setup Before Course Start | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

AWS Setup Before Course Start

Preamble - Amazon AWS/AMI setup for use in workshop

This tutorial explains how Amazon cloud instances were configured for the course. This exercise is not to be completed by the students; it is provided as a reference for future course developers who wish to conduct their hands-on exercises on Amazon AWS.

A helpful introduction to AWS can be found here.

Table of Contents

  1. Create AWS account
    1. Set up security group (if needed)
  2. Start with existing community AMI
  3. Perform basic linux configuration
  4. Add ubuntu user to docker group
  5. Set up additional storage for workspace
  6. Install any desired informatics tools
    1. Install RNA-seq software
      1. Create directory to install software to and setup path variables
      2. Install SAMtools
      3. Install bam-readcount
      4. Install HISAT2
      5. Install StringTie
      6. Install gffcompare
      7. Install HTSeq
      8. Make sure that OpenSSL is on correct version
      9. Install TopHat
      10. Install kallisto
      11. Install FastQC
      12. Install Fastp
      13. Install MultiQC
      14. Install Picard
      15. Install Flexbar
      16. Install Regtools
      17. Install RSeQC
      18. Install bedops
      19. Install gtfToGenePred
      20. Install genePredToBed
      21. Install how_are_we_stranded_here
      22. Install Cell Ranger
      23. Install TABIX
      24. Install BWA
      25. Install bedtools
      26. Install BCFtools
      27. Install htslib
      28. Install peddy
      29. Install slivar
      30. Install STRling
    2. Install freebayes
    3. Install vcflib
    4. Install Anaconda
    5. Install VEP
    6. Set up Jupyter to render in web browser
    7. Install R
      1. R Libraries
      2. Bioconductor libraries
      3. Install Sleuth
    8. Install software for germline analyses
      1. Install gatk
      2. Install minimap2
      3. Install NanoPlot
      4. Install Varscan
      5. Install faSplit
    9. Install packages for single-cell ATAC-seq lab
      1. Install ATACseqQC
    10. Install packages for single-cell RNAseq lab
    11. Install packages for Variant annotation and python visualization lab
  7. Path setup
  8. Set up Apache web server
  9. Save a public AMI
  10. Current Public AMIs
  11. Create IAM account
  12. Launch student instance
  13. Set up a dynamic DNS service
  14. Host necessary files for the course
  15. After course reminders

Create AWS account

  1. Create a new gmail account to use for the course
  2. Use the above email account to set up a new AWS/Amazon user account. Note: Any AWS account needs to be linked to an actual person and credit card account.
  3. Optional - Set up an IAM account. Give this account full EC2 but no other permissions. This provides an account that can be shared with other instructors but does not have access to billing and other root account privileges.
  4. Request a limit increase for the instance types you will be using. You need to be able to spin up at least one instance of the desired type for every student and TA/instructor. See: http://aws.amazon.com/about-aws/whats-new/2014/06/19/amazon-ec2-service-limits-report-now-available/. Note: You need to request an increase for each instance type and region you might use.
  5. Sign into AWS Management Console: http://aws.amazon.com/console/
  6. Go to EC2 services

Set up security group (if needed)

In general, if no new web server is needed, you may pick an existing security group. The security group used for 2025 was “SSH/HTTP/Jupyter/Rstudio - with outbound rule”. It has the following inbound and outbound rules, which allow connections to the Jupyter and RStudio servers:

| Rule type | IP version | Type | Protocol | Port range | Source / Destination |
|---|---|---|---|---|---|
| Inbound | IPv4 | Custom TCP | TCP | 8888 | 0.0.0.0/0 |
| Inbound | IPv4 | Custom TCP | TCP | 8787 | 0.0.0.0/0 |
| Inbound | IPv4 | HTTP | TCP | 80 | 0.0.0.0/0 |
| Inbound | IPv4 | HTTPS | TCP | 443 | 0.0.0.0/0 |
| Inbound | IPv6 | HTTP | TCP | 80 | ::/0 |
| Inbound | IPv4 | SSH | TCP | 22 | 0.0.0.0/0 |
| Outbound | IPv4 | All traffic | All | All | 0.0.0.0/0 |

Start with existing community AMI

  1. Set up an Ubuntu instance: 1) Launch a fresh Ubuntu instance (Ubuntu Server 22.04 LTS at the time of writing). 2) Choose an instance type of m6a.xlarge. 3) Increase the root volume (e.g., 60GB, type: gp3) and add a second volume (e.g., 500GB, type: gp3). 4) Choose an appropriate security group (for the 2025 course, choose security group “SSH/HTTP/Jupyter/Rstudio - with outbound rule”). 5) If necessary, create a new key pair; name it and save it locally somewhere safe. 6) Review and Launch. Select ‘View Instances’ and take note of the public IP address of the newly launched instance.
  2. Change permissions on the downloaded key pair with chmod 400 [instructor-key].pem
  3. Login to instance with ubuntu user:

ssh -i [instructor-key].pem ubuntu@[public.ip.address]

Note for TAs when setting up these instances

Usually the instances are set up in the following sequence: 1) Email instructors to request any packages/programs that need to be installed. 2) Make a draft instance with the instructions above; configure and install everything. 3) Make an AMI. 4) Make an instructor instance by launching a new Ubuntu instance with the same specifications as above, but using the AMI. 5) Inform instructors and TAs to test their code on the instructor instance, and send them the instructor key pem file. 6) Modify the draft instance as needed. 7) Once finalized, create an AMI. This AMI will be distributed to students for the course.

Perform basic linux configuration

  • To allow installation of bioinformatics tools, some basic dependencies must be installed first.
sudo apt-get update
sudo apt-get upgrade
sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev libbz2-dev docker.io libpcre2-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libdbi-perl libdbd-mysql-perl libcairo2-dev
sudo ln -s /usr/include/jsoncpp/json/ /usr/include/json
sudo timedatectl set-timezone America/New_York
  • Logout and log back in for changes to take effect.

Add ubuntu user to docker group

sudo usermod -aG docker ubuntu

Then exit shell and log back into instance.
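After logging back in, you can sanity-check the group membership with `id`; on the instance, `docker` should appear in the list (the exact group list varies by machine):

```shell
# List the current user's groups, one per line
id -nG | tr ' ' '\n' | sort
```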

Set up additional storage for workspace

We first need to set up the additional storage volume that we added when creating the instance.

# Create mountpoint for additional storage volume
cd /
sudo mkdir workspace

# Mount ephemeral storage
cd
sudo mkfs -t ext4 /dev/nvme1n1
sudo mount /dev/nvme1n1 /workspace

To make the workspace volume persistent, we need to edit the /etc/fstab file. AWS provides instructions for how to do this here.

# Make ephemeral storage mounts persistent
# See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html for guidance on setting up fstab records for AWS

# Get the UUID from 'sudo lsblk -f'
UUID=$(sudo lsblk -f | grep nvme1n1 | awk '{print $4}')
# If you want to double-check, 'echo $UUID' shows the UUID.

# Then write that UUID into /etc/fstab (note: tee without -a overwrites the existing file)
echo -e "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0\nUUID=$UUID /workspace ext4 defaults,nofail 0 2" | sudo tee /etc/fstab
# Run 'less /etc/fstab' to confirm the new line has been added
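The `awk '{print $4}'` extraction above works because the LABEL column of `lsblk -f` is empty for the freshly formatted volume, so the UUID lands in the fourth whitespace-separated field. A sketch with a made-up `lsblk` line:

```shell
# Hypothetical 'lsblk -f' line (NAME FSTYPE FSVER LABEL UUID ...; LABEL empty here)
line="nvme1n1 ext4 1.0 3f9acd2e-1111-2222-3333-444455556666 450G 0% /workspace"
echo "$line" | awk '{print $4}'   # prints the UUID field
```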

# Change permissions on required drives
sudo chown -R ubuntu:ubuntu /workspace

# Create symlink to the added volume in your home directory
cd ~
ln -s /workspace workspace

Install any desired informatics tools

  • NOTE: R in particular is a slow install.
  • NOTE: All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.
  • Paths to pre-installed tools can be added to the .bashrc file.
  • NOTE: (This didn’t happen during installation for the year 2023, but) in some installations of R there is an executable called pager that clashes with the system pager. This causes man to fail. Check with a man ls and if the problem exists, add the following to .bashrc:
export MANPAGER=less
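Why the PATH lines throughout this document prepend rather than append: directories earlier in `PATH` shadow later ones, so the pre-installed copies win over any same-named system versions. A throwaway sketch (all names hypothetical):

```shell
# Build a scratch bin directory containing a tiny fake tool, then prepend it to PATH
demo_bin=$(mktemp -d)
printf '#!/bin/sh\necho custom-version\n' > "$demo_bin/mytool"
chmod +x "$demo_bin/mytool"
export PATH="$demo_bin:$PATH"
mytool    # resolves to the copy in $demo_bin first
```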

Install RNA-seq software

Create directory to install software to and setup path variables

mkdir ~/bin
cd ~/bin
WORKSPACE=/home/ubuntu/workspace
HOME=/home/ubuntu

Install SAMtools

cd ~/bin
wget https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2
bunzip2 samtools-1.18.tar.bz2
tar -xvf samtools-1.18.tar
cd samtools-1.18
make
./samtools
#add the following line to .bashrc
export PATH=/home/ubuntu/bin/samtools-1.18:$PATH
export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.18

Install bam-readcount

cd ~/bin
git clone https://github.com/genome/bam-readcount 
cd bam-readcount
mkdir build
cd build
cmake ..
make
export PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH

Install HISAT2

uname -m
cd ~/bin
curl -s https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download > hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
cd hisat2-2.2.1
./hisat2 -h
export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH

Install StringTie

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.tar.gz
tar -xzvf stringtie-2.2.1.tar.gz
cd stringtie-2.2.1
make release
export PATH=/home/ubuntu/bin/stringtie-2.2.1:$PATH

Install gffcompare

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.12.6.Linux_x86_64.tar.gz
cd gffcompare-0.12.6.Linux_x86_64/
./gffcompare
export PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH

Install HTSeq

pip install HTSeq

# to check version of HTSeq
# pip show HTSeq

Make sure that OpenSSL is on correct version

TopHat will not install if the version of OpenSSL is too old.

To get version:

openssl version

If the version is OpenSSL 1.1.1f, it needs to be updated using the following steps.

cd ~/bin
wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz
tar -zxf openssl-1.1.1g.tar.gz && cd openssl-1.1.1g
./config
make
make test
sudo mv /usr/bin/openssl ~/tmp #in case install goes wrong
sudo make install
sudo ln -s /usr/local/bin/openssl /usr/bin/openssl
sudo ldconfig

Again, from the terminal issue the command:

openssl version

Your output should be as follows:

OpenSSL 1.1.1g  21 Apr 2020

Then create a ~/.wgetrc file and add ca_certificate=/etc/ssl/certs/ca-certificates.crt to it using vim or nano.
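If you prefer not to open an editor, the same line can be appended from the shell:

```shell
# Append the CA certificate setting to ~/.wgetrc (creates the file if missing)
echo "ca_certificate=/etc/ssl/certs/ca-certificates.crt" >> ~/.wgetrc
```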

Install TopHat

cd ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64
./gtf_to_fasta
export PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH

Install kallisto

Note: A couple of arguments are only supported in kallisto legacy versions (versions before 0.50.0). Also, how_are_we_stranded_here uses kallisto == 0.44.x. Thus, the installation steps below are for one of the legacy versions. If you run into problems, consider using a more up-to-date version.

cd ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0
./kallisto
export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH

Install FastQC

sudo apt-get install -y default-jre
cd ~/bin
sudo wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
sudo unzip fastqc_v0.12.1.zip
sudo chmod 755 FastQC/fastqc
./FastQC/fastqc --help
export PATH=/home/ubuntu/bin/FastQC:$PATH

Install Fastp

cd /home/ubuntu/bin
wget http://opengene.org/fastp/fastp
chmod a+x ./fastp
./fastp

# Add /home/ubuntu/bin (the directory containing the fastp binary) to PATH in .bashrc
export PATH=/home/ubuntu/bin:$PATH

Install MultiQC

cd ~/bin
pip3 install multiqc
export PATH=/home/ubuntu/.local/bin:$PATH
multiqc --help

Install Picard

cd ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar -O picard.jar
java -jar ~/bin/picard.jar

export PICARD=/home/ubuntu/bin/picard.jar

Install Flexbar

sudo apt install flexbar

Install Regtools

cd ~/bin
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools
export PATH=/home/ubuntu/bin/regtools/build:$PATH

Install RSeQC

pip3 install RSeQC
~/.local/bin/read_GC.py
export PATH=~/.local/bin/:$PATH

Install bedops

cd ~/bin
mkdir bedops_linux_x86_64-v2.4.41
cd bedops_linux_x86_64-v2.4.41
wget -c https://github.com/bedops/bedops/releases/download/v2.4.41/bedops_linux_x86_64-v2.4.41.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.41.tar.bz2
./bin/bedops

export PATH=~/bin/bedops_linux_x86_64-v2.4.41/bin:$PATH

Install gtfToGenePred

cd ~/bin
mkdir gtfToGenePred
cd gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred
export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH

Install genePredToBed

cd ~/bin
mkdir genePredtoBed
cd genePredtoBed
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod a+x genePredToBed
./genePredToBed
export PATH=/home/ubuntu/bin/genePredtoBed:$PATH 
#note: the path has lowercase 't' at in 'genePredtoBed'
#genePredToBed 

Install how_are_we_stranded_here

pip3 install git+https://github.com/kcotto/how_are_we_stranded_here.git
check_strandedness

Install Cell Ranger

  • Check if it has been updated
  • Must register to get download link
cd ~/bin
wget -O cellranger-9.0.1.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-9.0.1.tar.gz?Expires=1761476807&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=EzblHLl1V4c8zY~f1gqgkJDfwXHdpXeUUsoqiipDGTTRc1cDXifB5CZdV2te0aGah4VoEUssh8ERdRpmLzMNZzSBdsjmX9t6FE3PwZ83c-cOKAVJK3-v7z8GhMH9HZMqVxEb3w6SwztkZmipGhCyG95yT2fv-sNQZssJmUEzx8Wnc4t69iS~u0uMrd4rj9EDkyw6CYCfkRGGoxCclUAyPqMBQGRbcCogVlSoVk6sc~UsD5vpXXoxlgoVRrThdEZ4DpUUtInLST8cvtS127nrWIJcJ1e1Jk8dSndpKvAHEwTGB~U21oyiOb8lZhXVsY7VCXKsvivRKaRXWj0kh8whOA__"
tar -xzvf cellranger-9.0.1.tar.gz
export PATH=/home/ubuntu/bin/cellranger-9.0.1:$PATH

Install TABIX

sudo apt-get install tabix

Install BWA

cd ~/bin
git clone https://github.com/lh3/bwa.git
cd bwa
make
export PATH=/home/ubuntu/bin/bwa:$PATH
#bwa mem #to call bwa

Install bedtools

cd ~/bin
wget https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools-2.31.0.tar.gz
tar -zxvf bedtools-2.31.0.tar.gz
cd bedtools2
make
export PATH=/home/ubuntu/bin/bedtools2/bin:$PATH

Install BCFtools

cd ~/bin
wget https://github.com/samtools/bcftools/releases/download/1.18/bcftools-1.18.tar.bz2
bunzip2 bcftools-1.18.tar.bz2
tar -xvf bcftools-1.18.tar
cd bcftools-1.18
make
./bcftools
export PATH=/home/ubuntu/bin/bcftools-1.18:$PATH

Install htslib

cd ~/bin
wget https://github.com/samtools/htslib/releases/download/1.18/htslib-1.18.tar.bz2
bunzip2 htslib-1.18.tar.bz2
tar -xvf htslib-1.18.tar
cd htslib-1.18
make
sudo make install
#htsfile --help
export PATH=/home/ubuntu/bin/htslib-1.18:$PATH

Install peddy

cd ~/bin
git clone https://github.com/brentp/peddy
cd peddy
pip install -r requirements.txt
pip install --editable .

Install slivar

cd ~/bin
wget https://github.com/brentp/slivar/releases/download/v0.3.0/slivar
chmod +x ./slivar

Install STRling

cd ~/bin
wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.2/strling
chmod +x ./strling

Install freebayes

sudo apt install freebayes

Install vcflib

sudo apt install libvcflib-tools libvcflib-dev

Install Anaconda

cd ~/bin
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh 
bash Anaconda3-2023.09-0-Linux-x86_64.sh

Press Enter to review the license agreement. Then press and hold Enter to scroll.

Enter “yes” to agree to the license agreement.

Save the installation to /home/ubuntu/bin/anaconda3 and choose yes to initializing Anaconda3.

Add in bashrc:

export PATH=/home/ubuntu/bin/anaconda3/bin:$PATH

To see location of conda executable: which conda

Install VEP

Note: Install VEP in the workspace because its cache files take a lot of space (~25G).

This describes the dependencies for VEP 110, used in this course for variant annotation. When running the VEP installer, follow the prompts as specified:

  1. Do you want to install any cache files (y/n)? n (If you do want to install a cache file, choose ‘y’ [ENTER] (select the number for homo_sapiens_vep_110_GRCh38.tar.gz) [ENTER])
  2. Do you want to install any FASTA files (y/n)? y [ENTER] (select number for homo_sapiens) [ENTER]
  3. Do you want to install any plugins (y/n)? n [ENTER]
cd ~/workspace
sudo git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
sudo perl -MCPAN -e'install "LWP::Simple"'
sudo perl INSTALL.pl --CACHEDIR ~/workspace/ensembl-vep/
export PATH=/home/ubuntu/workspace/ensembl-vep:$PATH
#vep --help

Set up Jupyter to render in web browser

We followed this website and this website. Note: The old jupyter notebook was split into jupyter-server and nbclassic. The steps for setting up Jupyter on EC2 in the first link have therefore been adapted based on suggestions in the second link to accommodate this migration.

First, we need to add Jupyter to the system’s path (you can check whether it is already on the path by running which jupyter; if no path is returned you need to add it). To add Jupyter functionality to your terminal, add the following line to your .bashrc file:

export PATH=/home/ubuntu/bin/anaconda3/bin:$PATH

Then you need to source the .bashrc for changes to take effect.

source .bashrc

We then need to create our Jupyter configuration file. In order to create that file, you need to run:

jupyter notebook --generate-config

Optional: After creating your configuration file, you will need to generate a password for your Jupyter Notebook using ipython:

Enter the IPython command line:

ipython

Now follow these steps to generate your password:

from notebook.auth import passwd

passwd()

You will be prompted to enter and re-enter your password. IPython will then generate a hash output, COPY THIS AND SAVE IT FOR LATER. We will need this for our configuration file.

Run exit in order to exit IPython.


Next, go into your Jupyter config file (/home/ubuntu/.jupyter/jupyter_notebook_config.py):

cd ~/.jupyter

vim jupyter_notebook_config.py

And add the following code at the beginning of the document:

c = get_config() #add this line if it's not already in jupyter_notebook_config.py

c.ServerApp.ip = '0.0.0.0'
#c.ServerApp.password = u'YOUR PASSWORD HASH' #uncomment this line if decide to use password
c.ServerApp.port = 8888

Optional: We then need to create a directory for your notebooks. In order to make a folder to store all of your Jupyter Notebooks simply run:

mkdir Notebooks
# You can call this folder anything, for this example we call it `Notebooks`.

After the previous step, you should be ready to run your notebook and access your EC2 server. To run your Notebook simply run the command:

jupyter nbclassic
# to open jupyter lab on the instance
jupyter lab

From there you should be able to access your server by going to:

http://(your AWS public dns):8888/ or http://(your AWS public dns):8888/tree?token=... (the token appears in the message printed when running 'jupyter nbclassic')

(Note: if you ever run into problems accessing the server, double-check whether you are using http or https. If you didn’t add an https port in the security group configuration step when creating the instance, you won’t be able to access the server with https.)

Install R

Follow this guide from the CRAN website to install R version 4.4.

# update indices
sudo apt update -qq
# install two helper packages we need
sudo apt install --no-install-recommends software-properties-common dirmngr
# add the signing key (by Michael Rutter) for these repos
# To verify key, run gpg --show-keys /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc 
# Fingerprint: E298A3A825C0D65DFD57CBB651716619E084DAB9
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
# add the R 4.0 repo from CRAN -- adjust 'focal' to 'groovy' or 'bionic' as needed
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"

#install R and its dependencies
sudo apt install --no-install-recommends r-base

Note: linking the R-patched bin directory into your PATH may cause weird things to happen, such as man pages or git log not displaying. This can be circumvented by directly linking the R* executables (R, RScript, RCmd, etc.) into a PATH directory.

R Libraries

For this tutorial we require:

R
install.packages(c("devtools","dplyr","gplots","ggplot2","sctransform","Seurat","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR","tidyverse"),repos="http://cran.us.r-project.org")
quit(save="no")

Note: if asked whether you want to install into a personal library, type ‘yes’.

Bioconductor libraries

For this tutorial we require:

R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva","gage","org.Hs.eg.db","DESeq2","apeglm","clusterProfiler","enrichplot","pathview"))
quit(save="no")

Install Sleuth

R
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
quit(save="no")

Install software for germline analyses

  • gatk
  • minimap
  • NanoPlot
  • Varscan
  • faSplit

Install gatk

(Note: in the cshl2023 version of the course, install gatk 4.2.1.0 instead of a more recent version, since this version works with the current Java version, Java 11.)

cd ~/bin
wget https://github.com/broadinstitute/gatk/releases/download/4.2.1.0/gatk-4.2.1.0.zip
unzip gatk-4.2.1.0.zip

export PATH=/home/ubuntu/bin/gatk-4.2.1.0:$PATH #add to .bashrc
gatk --help
gatk --list

Install minimap2

cd ~/bin
curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf -
./minimap2-2.26_x64-linux/minimap2

export PATH=/home/ubuntu/bin/minimap2-2.26_x64-linux:$PATH #add to .bashrc
minimap2 --help

Install NanoPlot

pip install NanoPlot
#which NanoPlot
#NanoPlot -h

Install Varscan

cd ~/bin
curl -L -k -o VarScan.v2.4.2.jar https://github.com/dkoboldt/varscan/releases/download/2.4.2/VarScan.v2.4.2.jar
java -jar ~/bin/VarScan.v2.4.2.jar

Install faSplit

conda create -n fasplit_env bioconda::ucsc-fasplit
source activate fasplit_env
conda activate fasplit_env
#faSplit
conda deactivate

Install packages for single-cell ATAC-seq lab

To prevent dependency conflicts, install the packages for this lab in a conda environment.

Packages:

conda create --name snapatac2_env python=3.11
source activate snapatac2_env
conda activate snapatac2_env

pip install snapatac2
#pip show snapatac2 
pip install scanpy
pip install MACS2
pip install --user magic-impute
pip install deeptools 

conda deactivate

To run the virtual environment in jupyter nbclassic, there are a few extra setup steps:

#Step 1: Activate the Conda Environment of interest:
conda activate snapatac2_env
#Step 2: Install Ipykernel: 
conda install ipykernel
#Step 3: Create a Jupyter Kernel for the environment
python -m ipykernel install --user --name=snapatac2_env_kernel

Then run jupyter notebook as usual:

jupyter nbclassic

Access the server by pointing your browser to http://(your AWS public dns):8888/. We can either create a notebook using the desired environment kernel, or create a notebook using the default ipykernel and change the kernel within the notebook itself.

Install ATACseqQC

R
#if (!require("BiocManager", quietly = TRUE))
    #install.packages("BiocManager")
BiocManager::install("ATACseqQC")
quit(save="no")

Install packages for single-cell RNAseq lab

To prevent dependency conflicts, install the packages for this lab in a conda environment.

Packages:

conda create --name scRNAseq_env python=3.11
source activate scRNAseq_env
conda activate scRNAseq_env
pip install 'scanpy[leiden]'
pip install gtfparse==1.2.0
pip install scrublet
pip install fast_matrix_market
pip install harmony-pytorch

conda install ipykernel
python -m ipykernel install --user --name=scRNAseq_env_kernel
conda deactivate

Install packages for Variant annotation and python visualization lab

Packages:

#pyenv
#note: after installation, add pyenv configs in .bashrc (see below)
cd ~/bin
curl https://pyenv.run | bash

#virtualenv
sudo apt install pipx
pipx install virtualenv

#beautifulsoup4
pip install beautifulsoup4

#requests
pip install requests

#vcfpy
#installation note for vcfpy: https://github.com/KarchinLab/open-cravat/issues/98
conda install -c bioconda open-cravat
pip install vcfpy

#vcftools
#installation note: https://github.com/vcftools/vcftools/issues/188
cd ~/bin
sudo apt-get install autoconf
git clone https://github.com/vcftools/vcftools.git
cd vcftools
./autogen.sh
./configure
make
sudo make install

#pysam
#conda config --show channels #to see if 3 needed channels are already configured in the conda environment. if not, add:
#conda config --add channels defaults
#conda config --add channels conda-forge
#conda config --add channels bioconda
#conda install pysam
pip install pysam
#installed at: ./bin/anaconda3/lib/python3.11/site-packages

#civicpy
pip install civicpy

#pandas
pip install pandas

#jq
sudo apt-get install jq

add pyenv configs in .bashrc

## pyenv configs
export PYENV_ROOT="/home/ubuntu/bin/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"

if command -v pyenv 1>/dev/null 2>&1; then
  eval "$(pyenv init -)"
fi
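The `if command -v pyenv …` guard in the snippet above simply checks that the executable is on `PATH` before evaluating the init code, so shells on machines without pyenv still start cleanly. The same pattern with a command that is certainly present:

```shell
# Run the init step only if the command exists on PATH
if command -v ls 1>/dev/null 2>&1; then
  echo "ls is on PATH"
fi
```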

Path setup

For the 2021 version of the course, rather than exporting each tool’s individual path, I moved all of the subdirs to ~/src and copied all of the binaries from there to ~/bin so that PATH is less complex.

Set up Apache web server

We will start an apache2 service and serve the contents of the students home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by url, etc. Note that when launching instances a security group will have to be selected/modified that allows http access via port 80.

  • Edit config to allow files to be served from outside /usr/share and /var/www
sudo vim /etc/apache2/apache2.conf

Add the following content to apache2.conf

<Directory /workspace>
       Options Indexes FollowSymLinks
       AllowOverride None
       Require all granted
</Directory>
  • Edit vhost file
    sudo vim /etc/apache2/sites-available/000-default.conf
    

    Change document root in 000-default.conf to ‘/workspace’

    DocumentRoot /workspace
    
  • Restart apache
sudo service apache2 restart

To check that the server works, type http://[public ip address of ec2 instance] into a browser of choice. You should see the contents of /workspace.

Save a public AMI

Finally, save the instance as a new AMI by right clicking the instance and clicking on “Create Image”. Enter an appropriate name and description and then save. If desired, you may choose at this time to include the workspace snapshot in the AMI to avoid having to explicitly attach it later at launching of AMI instances. Change the permissions of the AMI to “public” if you would like it to be listed under the Community AMIs. Copy the AMI to any additional regions where you would like it to appear in Community AMI searches.

Current Public AMIs

  • cshl-seqtec-2022 (ami-09b613ae9751a96b1; N. Virginia)
  • cbw-rnabio-2023 (ami-09b3fd07d90812201; N. Virginia)
  • cshl-seqtec-2023 (ami-05d41e9b8c7eee2df; N. Virginia)
  • cshl-seqtec-2024 (ami-00029a06cacbe647c; N. Virginia)
  • cshl-seqtec-2025 (also named cshl_2025_AMI_final; ami-027b72b97520101bd; N. Virginia)

Create IAM account

From AWS Console select Services -> IAM. Go to Users, Create User, specify a user name, and Create. Download credentials to a safe location for later reference if needed. Select the new user and go to Security Credentials -> Manage Password -> ‘Assign a Custom Password’. Go to Groups -> Create a New Group, specify a group name and Next. Attach a policy to the group. In this case we give all EC2 privileges but no other AWS privileges by specifying “AmazonEC2FullAccess”. Hit Next, review and then Create Group. Select the Group -> Add Users to Group, select your new user to add it to the new group.

Launch student instance

  1. Go to AWS console. Login. Select EC2.
  2. Launch Instance, search for “cshl-seqtec-2025” in Community AMIs and Select.
  3. Choose “m6a.xlarge” instance type.
  4. Select the number of instances to launch (one per student and instructor), and select “Protect against accidental termination”
  5. Make sure that you see two snapshots (e.g., the 60GB root volume (gp3) and 500GB EBS volume (gp3) you set up earlier). Tick the boxes for “Delete on termination” for both.
  6. Create a tag with Name=StudentName
  7. Choose the existing security group called “SSH/HTTP/Jupyter/Rstudio - with outbound rule”. Review and Launch.
  8. Choose an existing key pair (cshl_2025_student.pem)
  9. View instances and wait for them to finish initiating.
  10. Find your instance in console and select it, then hit connect to get your public.ip.address.
  11. Login to node ssh -i cshl_2025_student.pem ubuntu@[public.ip.address].
  12. Optional - set up DNS redirects (see below)

Set up a dynamic DNS service

Rather than handing out IP addresses for each student instance, you can set up DNS records that redirect from a more human-readable name to each IP address. After spinning up all student instances, use a service like http://dyn.com (or http://entrydns.net, etc.) to create hostnames that point to the public IP address of each student instance.

Host necessary files for the course

Currently, all miscellaneous data files, annotations, etc. are hosted on an ftp server at the Genome Institute. In the future more data files could be pre-loaded onto the EBS snapshot.

After course reminders

  • Delete the student IAM account created above otherwise students will continue to have EC2 privileges.
  • Terminate all instances and clean up any unnecessary volumes, snapshots, etc.
Integrated Assignment Answers | Griffith Lab

Integrated Assignment Answers

Background: Cell lines are often used to study different experimental conditions and to study the function of specific genes by various perturbation approaches. One such type of study involves knocking down expression of a target of interest by shRNA and then using RNA-seq to measure the impact on gene expression. These experiments often include the use of a control shRNA to account for any expression changes that may occur simply from the introduction of these molecules. Differential expression is performed by comparing biological replicates of shRNA knockdown vs. shRNA control.

Objectives: In this assignment, we will be using a subset of the GSE114360 dataset, which consists of 6 RNA-seq datasets generated from a cell line (3 transfected with shRNA, and 3 controls). Our goal will be to determine differentially expressed genes.

Experimental information and other things to keep in mind:

Experimental descriptions from the study authors:

Experimental details from the paper: “An RNA transcriptome-sequencing analysis was performed in shRNA-NC or shRNA-CBSLR-1 MKN45 cells cultured under hypoxic conditions for 24 h (Fig. 2A).”

Experimental details from the GEO submission: “An RNA transcriptome sequencing analysis was performed in MKN45 cells that were transfected with tcons_00001221 shRNA or control shRNA.”

Note that according to GeneCards and HGNC, CBSLR and tcons_00001221 refer to the same gene.

Part 0 : Obtaining Data and References

Goals:

Create a working directory ~/workspace/rnaseq/integrated_assignment/ to store this exercise. Then create a unix environment variable named RNA_INT_DIR that stores this path for convenience in later commands.

export RNA_HOME=~/workspace/rnaseq
cd $RNA_HOME
mkdir -p ~/workspace/rnaseq/integrated_assignment/
export RNA_INT_DIR=~/workspace/rnaseq/integrated_assignment

Obtain reference, annotation, adapter and data files and place them in the integrated assignment directory

Remember: when setting an environment variable, we do NOT use the $; however, every time we reference the variable, it must be preceded by a $.
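As a minimal illustration of this rule (using a hypothetical variable name, MY_DIR, that is not part of the course setup):

```shell
# Hypothetical variable MY_DIR, purely for illustration.
MY_DIR=~/workspace/example   # setting: no $ on the left-hand side
echo "$MY_DIR"               # referencing: the $ is required to expand the value
```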

echo $RNA_INT_DIR
cd $RNA_INT_DIR
wget http://genomedata.org/rnaseq-tutorial/Integrated_Assignment_RNA_Data.tar.gz
tar -xvf Integrated_Assignment_RNA_Data.tar.gz

Q1.) How many items are there under the “reference” directory (counting all files in all sub-directories)? If this reference file were not provided for you, how would you obtain/create a reference genome fasta file? How about the GTF transcripts file from Ensembl?

A1.) The answer is 10. Review these files so that you are familiar with them. If the reference fasta or gtf was not provided, you could obtain them from the Ensembl website under their downloads > databases.

cd $RNA_INT_DIR/reference/
tree
find . -type f
find . -type f | wc -l

The . tells the find command to look in the current directory, and -type f restricts the search to files only. The | (pipe) passes the output of the find command to the next command, where wc -l counts the lines of that output.

Q2.) How many exons does the gene SOX4 have? Which PCA3 isoform has the most exons?

A2.) SOX4 only has 1 exon, while the longest isoform of PCA3 (ENST00000645704) has 7 exons. Review the GTF file so that you are familiar with it. What downstream steps will we need this gtf file for?

grep -w "SOX4" Homo_sapiens.GRCh38.92.gtf | less -S

grep -w "PCA3" Homo_sapiens.GRCh38.92.gtf | grep -w "exon" | cut -f 9 | cut -d ";" -f 3 | sort | uniq -c
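To see how the cut/sort/uniq pipeline above tallies exons per transcript, here is a toy illustration on made-up lines that mimic column 9 of a GTF record, with transcript_id as the 3rd “;”-separated field (the gene names and IDs here are invented for the example):

```shell
# Each input line mimics the attribute column of a GTF exon record.
# cut -d ";" -f 3 keeps only the transcript_id field; sort | uniq -c then
# counts how many exon lines belong to each transcript.
printf 'gene_id "g1"; gene_version "1"; transcript_id "t1"\ngene_id "g1"; gene_version "1"; transcript_id "t1"\ngene_id "g1"; gene_version "1"; transcript_id "t2"\n' \
  | cut -d ";" -f 3 | sort | uniq -c
# t1 is counted twice (2 exons) and t2 once (1 exon)
```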

Q3.) How many samples do you see under the data directory?

A3.) The answer is 6 samples. The number of files is 12 because the sequence data is paired (an R1 and R2 file for each sample). The files are named based on their SRA accession number.

cd $RNA_INT_DIR/data/
ls -l
ls -1 | wc -l

NOTE: The fastq files you have copied above contain only the first 1,000,000 reads. Keep this in mind when you are combing through the results of the differential expression analysis.

Part 1 : Data preprocessing

Goals:

Create a new folder that will house the outputs from FastQC. Use the -h option to view FastQC’s available options, then run it on the data to assess its quality.

cd $RNA_INT_DIR
mkdir -p qc/raw_fastqc
fastqc $RNA_INT_DIR/data/*.fastq.gz -o qc/raw_fastqc/
cd qc/raw_fastqc
multiqc ./

Q4.) What metrics, if any, have the samples failed? Are the errors related?

A4.) The per-base sequence content of the samples does not show a flat distribution and is biased towards certain bases at the beginning of the reads. The reason for this bias could be non-random priming during cDNA synthesis, giving rise to non-random bases near the beginning/end of each fragment. The QC reports also flag the presence of adapters in the reads.

Now based on the output of the html summary, proceed to clean up the reads and rerun fastqc to see if an improvement can be made to the data. Make sure to create a directory to hold any processed reads you may create.

cd $RNA_INT_DIR
mkdir trimmed_reads

fastp -i $RNA_INT_DIR/data/SRR7155055_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155055_2.fastq.gz -o trimmed_reads/SRR7155055_1.fastq.gz -O trimmed_reads/SRR7155055_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155055.fastp.json --html trimmed_reads/SRR7155055.fastp.html 2>trimmed_reads/SRR7155055.fastp.log
fastp -i $RNA_INT_DIR/data/SRR7155056_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155056_2.fastq.gz -o trimmed_reads/SRR7155056_1.fastq.gz -O trimmed_reads/SRR7155056_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155056.fastp.json --html trimmed_reads/SRR7155056.fastp.html 2>trimmed_reads/SRR7155056.fastp.log
fastp -i $RNA_INT_DIR/data/SRR7155057_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155057_2.fastq.gz -o trimmed_reads/SRR7155057_1.fastq.gz -O trimmed_reads/SRR7155057_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155057.fastp.json --html trimmed_reads/SRR7155057.fastp.html 2>trimmed_reads/SRR7155057.fastp.log
fastp -i $RNA_INT_DIR/data/SRR7155058_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155058_2.fastq.gz -o trimmed_reads/SRR7155058_1.fastq.gz -O trimmed_reads/SRR7155058_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155058.fastp.json --html trimmed_reads/SRR7155058.fastp.html 2>trimmed_reads/SRR7155058.fastp.log
fastp -i $RNA_INT_DIR/data/SRR7155059_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155059_2.fastq.gz -o trimmed_reads/SRR7155059_1.fastq.gz -O trimmed_reads/SRR7155059_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155059.fastp.json --html trimmed_reads/SRR7155059.fastp.html 2>trimmed_reads/SRR7155059.fastp.log
fastp -i $RNA_INT_DIR/data/SRR7155060_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155060_2.fastq.gz -o trimmed_reads/SRR7155060_1.fastq.gz -O trimmed_reads/SRR7155060_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155060.fastp.json --html trimmed_reads/SRR7155060.fastp.html 2>trimmed_reads/SRR7155060.fastp.log
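The six fastp commands above differ only in the SRA accession. As a sketch (assuming the same $RNA_INT_DIR layout and fastp options as above), they could equivalently be generated with a loop:

```shell
# Loop over the six SRA accessions instead of writing each command out by hand.
# Assumes $RNA_INT_DIR and the trimmed_reads directory exist as set up above.
for sra in SRR7155055 SRR7155056 SRR7155057 SRR7155058 SRR7155059 SRR7155060; do
  fastp -i $RNA_INT_DIR/data/${sra}_1.fastq.gz -I $RNA_INT_DIR/data/${sra}_2.fastq.gz \
    -o trimmed_reads/${sra}_1.fastq.gz -O trimmed_reads/${sra}_2.fastq.gz \
    -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa \
    --trim_front1 13 --trim_front2 13 \
    --json trimmed_reads/${sra}.fastp.json --html trimmed_reads/${sra}.fastp.html \
    2>trimmed_reads/${sra}.fastp.log
done
```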

Q5.) What average percentage of reads remain after adapter trimming/cleanup with fastp? Why do reads get tossed out?

A5.) At this point, we could look in the log files individually. Alternatively, we could utilize the command line with a command like the one below.

grep -A 1 Read1 trimmed_reads/*.log

Doing this, we find that around 93-95% of reads survive adapter trimming and cleanup with fastp. Reads get tossed because, after trimming, they fall below our minimum read-length threshold of 25 (too short), have poor sequence quality, or contain too many N’s.

Q6.) What sample has the largest number of reads after trimming?

A6.) The control sample 2 (SRR7155060) has the most reads (1,907,336 individual reads). An easy way to determine the number of reads is to check the log file from the trimming step. Looking at the reads that passed filtering (each read in a pair counted individually), we see how many reads survive the trimming. We can also check this from the command line.

grep "passed" trimmed_reads/*.log

Alternatively, you can make use of the command ‘wc’, which counts the number of lines in a file. Since fastq files have 4 lines per read, the total number of lines must be divided by 4. Running wc -l alone only gives you the total number of lines in the fastq file (note that because the data is compressed, we need to use zcat to unzip it and print it to the screen before passing it on to the wc command):

zcat $RNA_INT_DIR/data/SRR7155059_1.fastq.gz | wc -l
zcat $RNA_INT_DIR/trimmed_reads/SRR7155059_1.fastq.gz | wc -l

We could also run fastqc and multiqc on the trimmed data and visualize the remaining reads that way.

cd $RNA_INT_DIR
mkdir -p qc/trimmed_fastqc
fastqc $RNA_INT_DIR/trimmed_reads/*.fastq.gz -o qc/trimmed_fastqc/
cd qc/trimmed_fastqc
multiqc ./

Part 2: Data alignment

Goals:

Create HISAT2 alignment commands for all six samples and run the alignments:

Create a directory to store the alignment results

echo $RNA_INT_DIR/alignments
mkdir -p $RNA_INT_DIR/alignments
cd $RNA_INT_DIR/alignments

Run alignment commands for each sample

hisat2 -p 8 --rg-id=T1 --rg SM:Transfected1 --rg LB:Transfected1_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155055_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155055_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155055.sam
hisat2 -p 8 --rg-id=T2 --rg SM:Transfected2 --rg LB:Transfected2_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155056_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155056_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155056.sam
hisat2 -p 8 --rg-id=T3 --rg SM:Transfected3 --rg LB:Transfected3_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155057_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155057_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155057.sam
hisat2 -p 8 --rg-id=C1 --rg SM:Control1 --rg LB:Control1_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155058_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155058_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155058.sam
hisat2 -p 8 --rg-id=C2 --rg SM:Control2 --rg LB:Control2_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155059_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155059_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155059.sam
hisat2 -p 8 --rg-id=C3 --rg SM:Control3 --rg LB:Control3_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155060_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155060_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155060.sam

Next, convert sam alignments to bam.

cd $RNA_INT_DIR/alignments
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155055.bam $RNA_INT_DIR/alignments/SRR7155055.sam
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155056.bam $RNA_INT_DIR/alignments/SRR7155056.sam
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155057.bam $RNA_INT_DIR/alignments/SRR7155057.sam
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155058.bam $RNA_INT_DIR/alignments/SRR7155058.sam
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155059.bam $RNA_INT_DIR/alignments/SRR7155059.sam
samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155060.bam $RNA_INT_DIR/alignments/SRR7155060.sam

Q7.) How can we obtain summary statistics for each aligned file?

A7.) There are many RNA-seq QC tools available that can provide you with detailed information about the quality of the aligned sample (e.g. FastQC and RSeQC). However, for a simple summary of aligned read counts you can use samtools flagstat.

cd $RNA_INT_DIR/alignments
samtools flagstat SRR7155055.bam > SRR7155055.flagstat.txt
samtools flagstat SRR7155056.bam > SRR7155056.flagstat.txt
samtools flagstat SRR7155057.bam > SRR7155057.flagstat.txt
samtools flagstat SRR7155058.bam > SRR7155058.flagstat.txt
samtools flagstat SRR7155059.bam > SRR7155059.flagstat.txt
samtools flagstat SRR7155060.bam > SRR7155060.flagstat.txt

Pull out summaries of mapped reads from the flagstat files

grep "mapped (" *.flagstat.txt

Q8.) Approximately how much space is saved by converting the sam to a bam format?

A8.) We get about a 5.5x compression by using the bam format instead of the sam format. This can be seen by adding the -lh option when listing the files in the alignments directory.

ls -lh $RNA_INT_DIR/alignments/

To specifically look at the sizes of the sam and bam files, we could use du -h, which shows us the disk space they are utilizing in human readable format.

du -h $RNA_INT_DIR/alignments/*.sam
du -h $RNA_INT_DIR/alignments/*.bam

In order to make visualization easier, you should now merge the replicate sample bams into one combined BAM per condition. Make sure to index these bams afterwards so that they can be viewed in IGV.

cd $RNA_INT_DIR/alignments
java -Xmx2g -jar $PICARD MergeSamFiles OUTPUT=transfected.bam INPUT=SRR7155055.bam INPUT=SRR7155056.bam INPUT=SRR7155057.bam
java -Xmx2g -jar $PICARD MergeSamFiles OUTPUT=control.bam INPUT=SRR7155058.bam INPUT=SRR7155059.bam INPUT=SRR7155060.bam

To visualize these merged bam files in IGV, we’ll need to index them. We can do so with the following commands.

cd $RNA_INT_DIR/alignments
samtools index $RNA_INT_DIR/alignments/control.bam
samtools index $RNA_INT_DIR/alignments/transfected.bam

Try viewing genes such as TP53 in IGV to get a sense of how the data are aligned.

Q9.) What portion of the gene do the reads seem to be piling up on? What would be different if we were viewing whole-genome sequencing data?

A9.) The reads all pile up on the exonic regions of the gene since we’re dealing with RNA-sequencing data. Not all exons have equal coverage, and this is due to different isoforms of the gene being sequenced. If the data were from a whole-genome sequencing experiment, we would instead expect roughly even coverage across the entire gene, including introns.

Right-click in the middle of the page, and click on “Expanded” to view the reads more easily.

Q10.) What are the lines connecting the reads trying to convey?

A10.) The lines indicate a spliced (split) read: one part of the read maps to one exon while the other part maps to the next exon. This is important in RNA-seq alignment, as aligners must be splice-aware and able to handle reads that span exon-exon junctions.

Part 3: Expression Estimation

Goals:

cd $RNA_INT_DIR/
mkdir -p $RNA_INT_DIR/expression

stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/transfected1/transcripts.gtf -A expression/transfected1/gene_abundances.tsv alignments/SRR7155055.bam
stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/transfected2/transcripts.gtf -A expression/transfected2/gene_abundances.tsv alignments/SRR7155056.bam
stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/transfected3/transcripts.gtf -A expression/transfected3/gene_abundances.tsv alignments/SRR7155057.bam
stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/control1/transcripts.gtf -A expression/control1/gene_abundances.tsv alignments/SRR7155058.bam
stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/control2/transcripts.gtf -A expression/control2/gene_abundances.tsv alignments/SRR7155059.bam
stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/control3/transcripts.gtf -A expression/control3/gene_abundances.tsv alignments/SRR7155060.bam

Q11.) How can you obtain the expression of the gene SOX4 across the transfected and control samples?

A11.) To look for the expression value of a specific gene, you can use the command ‘grep’ followed by the gene name and the path to the expression file

grep SOX4 $RNA_INT_DIR/expression/*/transcripts.gtf | cut -f 1,9 | grep FPKM

Part 4: Differential Expression Analysis

Goals:

mkdir -p $RNA_INT_DIR/ballgown/
cd $RNA_INT_DIR/ballgown/

Perform transfected vs. control comparison, using all samples, for known transcripts:

Adapt the R tutorial code that was used in the Differential Expression section. Modify it to work on these data (which are also a 3x3 replicate comparison of two conditions).

First, start an R session:

R

Run the following R commands in your R session.


# load the required libraries
library(ballgown)
library(genefilter)
library(dplyr)
library(devtools)

# Create phenotype data needed for ballgown analysis. Recall that:
# "T1-T3" refers to "transfected" (CBSLR shRNA knockdown) replicates
# "C1-C3" refers to "control" (shRNA control) replicates

ids=c("transfected1","transfected2","transfected3","control1","control2","control3")
type=c("Transfected","Transfected","Transfected","Control","Control","Control")
results="/home/ubuntu/workspace/rnaseq/integrated_assignment/expression/"
path=paste(results,ids,sep="")
pheno_data=data.frame(ids,type,path)

pheno_data

# Load ballgown data structure and save it to a variable "bg"
bg = ballgown(samples=as.vector(pheno_data$path), pData=pheno_data)

# Display a description of this object
bg

# Load all attributes including gene name
bg_table = texpr(bg, 'all')
bg_gene_names = unique(bg_table[, 9:10])
bg_transcript_names = unique(bg_table[,c(1,6)])

# Save the ballgown object to a file for later use
save(bg, file='bg.rda')

# Perform differential expression (DE) analysis with no filtering, at both gene and transcript level
results_transcripts = stattest(bg, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM")
results_transcripts = merge(results_transcripts, bg_transcript_names, by.x=c("id"), by.y=c("t_id"))

results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, bg_gene_names, by.x=c("id"), by.y=c("gene_id"))

# Save a tab delimited file for both the transcript and gene results
write.table(results_transcripts, "Transfected_vs_Control_transcript_results.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(results_genes, "Transfected_vs_Control_gene_results.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Filter low-abundance genes. Here we remove all transcripts with a variance across the samples of less than one
bg_filt = subset(bg, "rowVars(texpr(bg)) > 1", genomesubset=TRUE)

# Load all attributes including gene name
bg_filt_table = texpr(bg_filt , 'all')
bg_filt_gene_names = unique(bg_filt_table[, 9:10])
bg_filt_transcript_names = unique(bg_filt_table[,c(1,6)])

# Perform DE analysis now using the filtered data
results_transcripts = stattest(bg_filt, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM")
results_transcripts = merge(results_transcripts, bg_filt_transcript_names, by.x=c("id"), by.y=c("t_id"))

results_genes = stattest(bg_filt, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, bg_filt_gene_names, by.x=c("id"), by.y=c("gene_id"))

# Output the filtered list of genes and transcripts and save to tab delimited files
write.table(results_transcripts, "Transfected_vs_Control_transcript_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(results_genes, "Transfected_vs_Control_gene_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Identify the significant genes with p-value < 0.05
sig_transcripts = subset(results_transcripts, results_transcripts$pval<0.05)
sig_genes = subset(results_genes, results_genes$pval<0.05)

sig_transcripts_ordered = sig_transcripts[order(sig_transcripts$pval),]
sig_genes_ordered = sig_genes[order(sig_genes$pval),]

# Output the significant gene results to a pair of tab delimited files
write.table(sig_transcripts_ordered, "Transfected_vs_Control_transcript_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(sig_genes_ordered, "Transfected_vs_Control_gene_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Exit the R session
quit(save="no")

Q12.) Are there any significant differentially expressed genes? How many in total do you see? If we expected SOX4 to be differentially expressed, why don’t we see it in this case?

A12.) Yes, there are about 523 significantly differentially expressed genes. Because we’re using a subset of the fully sequenced library for each sample, the SOX4 signal is not significant at the adjusted p-value level. You can try re-running the above exercise on your own using all the reads from each sample in the original data set, which will give you greater resolution of the expression of each gene to build mean and variance estimates for each gene’s expression.

Part 5: Differential Expression Analysis Visualization

Q13.) What plots can you generate to help you visualize this gene expression profile

A13.) The CummeRbund package provides a wide variety of plots that can be used to visualize a gene’s expression profile or genes that are differentially expressed. Some of these plots include heatmaps, boxplots, and volcano plots. Alternatively, you can create custom plots using ggplot2 or base R plotting commands such as those provided in the supplementary tutorials. Start with something very simple, such as a scatter plot of transfected vs. control FPKM values.

Make sure we are in the directory with our DE results

cd $RNA_INT_DIR/ballgown/

Restart an R session:

R

The following R commands create summary visualizations of the DE results from Ballgown


#Load libraries
library(ggplot2)
library(gplots)
library(GenomicRanges)
library(ballgown)
library(ggrepel)

#Import expression and differential expression results from the HISAT2/StringTie/Ballgown pipeline
load('bg.rda')

# View a summary of the ballgown object
bg

# Load gene names for lookup later in the tutorial
bg_table = texpr(bg, 'all')
bg_gene_names = unique(bg_table[, 9:10])

# Pull the gene_expression data frame from the ballgown object
gene_expression = as.data.frame(gexpr(bg))

#Set min value to 1
min_nonzero=1

# Set the columns for finding FPKM and create shorter names for figures
data_columns=c(1:6)
short_names=c("T1","T2","T3","C1","C2","C3")

#Calculate the FPKM sum for all 6 libraries
gene_expression[,"sum"]=apply(gene_expression[,data_columns], 1, sum)

#Identify genes where the sum of FPKM across all samples is above some arbitrary threshold
i = which(gene_expression[,"sum"] > 5)

#Calculate the correlation between all pairs of data
r=cor(gene_expression[i,data_columns], use="pairwise.complete.obs", method="pearson")

#Print out these correlation values
r

# Open a PDF file where we will save some plots. 
# We will save all figures and then view the PDF at the end
pdf(file="transfected_vs_control_figures.pdf")

data_colors=c("tomato1","tomato2","tomato3","royalblue1","royalblue2","royalblue3")

#Plot - Convert correlation to 'distance', and use 'multi-dimensional scaling' to display the relative differences between libraries
#This step calculates 2-dimensional coordinates to plot points for each library
#Libraries with similar expression patterns (highly correlated to each other) should group together

#note that the x and y display limits will have to be adjusted for each dataset depending on the amount of variability
d=1-r
mds=cmdscale(d, k=2, eig=TRUE)
par(mfrow=c(1,1))
plot(mds$points, type="n", xlab="", ylab="", main="MDS distance plot (all non-zero genes)", xlim=c(-0.01,0.01), ylim=c(-0.01,0.01))
points(mds$points[,1], mds$points[,2], col="grey", cex=2, pch=16)
text(mds$points[,1], mds$points[,2], short_names, col=data_colors)

# Calculate the differential expression results including significance
results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes,bg_gene_names,by.x=c("id"),by.y=c("gene_id"))

# Plot - Display the grand expression values from the transfected and control samples and mark those that are significantly differentially expressed

sig=which(results_genes$pval<0.05)
results_genes[,"de"] = log2(results_genes[,"fc"])

gene_expression[,"Transfected"]=apply(gene_expression[,c(1:3)], 1, mean)
gene_expression[,"Control"]=apply(gene_expression[,c(4:6)], 1, mean)

x=log2(gene_expression[,"Transfected"]+min_nonzero)
y=log2(gene_expression[,"Control"]+min_nonzero)
plot(x=x, y=y, pch=16, cex=0.25, xlab="Transfected FPKM (log2)", ylab="Control FPKM (log2)", main="Transfected vs Control FPKMs")
abline(a=0, b=1)
xsig=x[sig]
ysig=y[sig]
points(x=xsig, y=ysig, col="magenta", pch=16, cex=0.5)
legend("topleft", "Significant", col="magenta", pch=16)

#Get the gene symbols for the top N (according to corrected p-value) and display them on the plot
# topn = order(abs(results_genes[sig,"fc"]), decreasing=TRUE)[1:25] # alternative: rank by fold change
topn = order(results_genes[sig,"qval"])[1:25]
text(x[topn], y[topn], results_genes[topn,"gene_name"], col="black", cex=0.75, srt=45)

#Plot - Volcano plot

# set default for all genes to "no change"
results_genes$diffexpressed <- "No"

# if log2 fold change > 0.6 and p-value < 0.05, set as "Up"
results_genes$diffexpressed[results_genes$de > 0.6 & results_genes$pval < 0.05] <- "Up"

# if log2 fold change < -0.6 and p-value < 0.05, set as "Down"
results_genes$diffexpressed[results_genes$de < -0.6 & results_genes$pval < 0.05] <- "Down"

results_genes$gene_label <- NA

# write the gene names of those significantly upregulated/downregulated to a new column
results_genes$gene_label[results_genes$diffexpressed != "No"] <- results_genes$gene_name[results_genes$diffexpressed != "No"]

ggplot(data=results_genes[results_genes$diffexpressed != "No",], aes(x=de, y=-log10(pval), label=gene_label, color = diffexpressed)) +
             xlab("log2Foldchange") +
             scale_color_manual(name = "Differentially expressed", values=c("blue", "red")) +
             geom_point() +
             theme_minimal() +
             geom_text_repel() +
             geom_vline(xintercept=c(-0.6, 0.6), col="red") +
             geom_hline(yintercept=-log10(0.05), col="red") +
             guides(colour = guide_legend(override.aes = list(size=5))) +
             geom_point(data = results_genes[results_genes$diffexpressed == "No",], aes(x=de, y=-log10(pval)), colour = "black")


dev.off()

# Exit the R session
quit(save="no")