Welcome to the blog

Posts

My thoughts and ideas

Bioinformatics Best Practices | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Bioinformatics Best Practices

Introduction

This best practices guide provides a basic overview of useful practices and tools for managing bioinformatics environments and analysis development.

Managing Your Analysis with Notebooks

Similar to the use of a laboratory notebook, taking notes about the procedures and analysis you performed is critical to reproducible science. There are a number of scientific computing notebooks available, but the most popular by far is the Jupyter Notebook.

Jupyter supports interactive data science and scientific computer across a small number of languages, although the most popular use of Jupyter is with Python, as the Jupyter notebook is built upon the Python-based iPython Notebook.

Example notebooks

A live version of Jupyter is available to try online, and provides several example notebooks in a few different languages. You can also check out a real analysis of Guide to Pharmacology gene family data for incorporation into the Drug-Gene Interaction Database.

Versioning Code with Git and GitHub

Git is a distributed version control system that allows users to make changes to code while simultaneously documenting those changes and preserving a history, allowing code to be rolled back to a previous version quickly and safely. GitHub is a freemium, online repository hosting service. You may use GitHub to track projects, discuss issues, document applications, and review code. GitHub is one of the best ways to share your projects, and should be used from the very onset of a project. Some forethought should be given in creating and managing a repository, however, as GitHub is not a good place to share very large or sensitive data files. See the 10-minute introduction to using GitHub.

Managing Your Compute Environment

One of the most challenging aspects of bioinformatics workflows is reproducibility. In addition to documenting your analysis with a notebook, providing a copy of your compute environment limits variability in results, allowing for future reproduction of results. A world of options exist to handle this, although some of the most common options are presented.

AWS Elastic Cloud Computing is a useful service for creating entire virtual machines that can easily be copied and distributed. This option does require a paid account with Amazon, and the costs of storing the images and running instances may add up over time, especially if every analysis is stored in a separate image. Additionally, this option does not isolate the analysis environment from the system environment, potentially leading to changes in analysis output as system libraries are updated over time. The RNA-seq wiki makes heavy use of AWS as a distribution platform.

VirtualBox is a general-purpose full virtualizer that allows you to emulate a computer, complete with virtual disks, a virtual operating system, and any data and applications stored therein. It has the advantage of creating machines that are stored and run on local hardware (e.g. your personal workstation), but the extra overhead of running a virtual computer on top of a host operating system can considerably slow performance of tools stored on the virtual machine, and thus is best used for testing or demonstration purposes.

Docker packages apps and their dependencies into containers which may be docked to a docker engine running on a computer. Docker engines are available on all major operating systems, and allow software to remain infrastructure independent while sharing a filespace and system resources with other docked containers. This is a much more efficient approach than guest virtual machines, and containers may be docked locally or on cloud-based infrastructure.

Conda is a language-agnostic package, dependency and environment management system. It is included in the data-science-focused distribution of Conda, Anaconda. Anaconda is based on Python and R packages for the analysis of scientific, large-scale data. Bioinformaticians also commonly use Bioconda, which add channels to Conda with bioinformatics tools (such as the popular sequence alignment tool BWA).

AWS Setup | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

AWS Setup

Amazon AWS/AMI setup for use in workshop

This tutorial explains how Amazon cloud instances were configured for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers that wish to conduct their hands on exercises on Amazon AWS.

Create AWS account

A helpful tutorial can be found here: https://rnabio.org/module-00-setup/0000/04/01/Intro_to_AWS/

  1. Create a new gmail account to use for the course
  2. Use the above email account to set up a new AWS/Amazon user account. Note: Any AWS account needs to be linked to an actual person and credit card account.
  3. Optional - Set up an IAM account. Give this account full EC2 but no other permissions. This provides an account that can be shared with other instructors but does not have access to billing and other root account privelages.
  4. Request limit increase for limit types you will be using. You need to be able to spin up at least one instance of the desired type for every student and TA/instructor. See: http://aws.amazon.com/about-aws/whats-new/2014/06/19/amazon-ec2-service-limits-report-now-available/. Note: You need to request an increase for each instance type and region you might use.
  5. Sign into AWS Management Console: http://aws.amazon.com/console/
  6. Go to EC2 services

Start with existing community AMI

  1. Launch a fresh Ubuntu Image. Choose an instance type of m5.2xlarge. Increase root volume (e.g., 32GB) and add a second volume (e.g., 250gb). Review and Launch. If necessary, create a new key pair, name and save somewhere safe. Select ‘View Instances’. Take note of public IP address of newly launched instance.
  2. Change permissions on downloaded key pair with chmod 400 [instructor-key].pem
  3. Login to instance with ubuntu user:

ssh -i [instructor-key].pem ubuntu@[public.ip.address]

Perform basic linux configuration

sudo apt-get update
sudo apt-get upgrade
sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python-dev python-numpy python3-dev python3-pip gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev libcurl4-openssl-dev apache2 python-pip csh ruby-full gnuplot cpanminus r-base libssl-dev gcc-4.8 g++-4.8 gsl-bin libgsl-dev
sudo timedatectl set-timezone America/New_York

Set up additional storage for workspace

We first need to setup the additional storage volume that we added when we created the instance.

# Create mountpoint for additional storage volume
cd /
sudo mkdir workspace

# Mount ephemeral storage
cd
sudo mkfs /dev/nvme0n1
sudo mount /dev/nvme0n1 /workspace

# Make ephemeral storage mounts persistent
# See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html for guidance on setting up fstab records for AWS
echo -e "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0\n/dev/nvme0n1 /workspace ext4 defaults,nofail 0 2" | sudo tee /etc/fstab

# Change permissions on required drives
sudo chown -R ubuntu:ubuntu /workspace

# Create symlink to the added volume in your home directory
cd ~
ln -s /workspace workspace

Install any desired informatics tools

- All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.
export MANPAGER=less

Install RNA-seq software

Create directory to install software to and setup path variables

mkdir ~/bin
cd bin
WORKSPACE=/home/ubuntu/workspace
HOME=/home/ubuntu

Install SAMtools

wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
bunzip2 samtools-1.9.tar.bz2
tar -xvf samtools-1.9.tar
cd samtools-1.9
make
./samtools
export PATH=/home/ubuntu/bin/samtools-1.9:$PATH

Install bam-readcount

export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.9
git clone https://github.com/genome/bam-readcount.git
cd bam-readcount
cmake -Wno-dev .
make
./bin/bam-readcount
export PATH=/home/ubuntu/bin/bam-readcount/bin:$PATH

Install HISAT2

uname -m
cd ~/bin
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip
unzip hisat2-2.1.0-Linux_x86_64.zip
cd hisat2-2.1.0
./hisat2 -h
export PATH=/home/ubuntu/bin/hisat2-2.1.0:$PATH

Install StringTie

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-1.3.4d.Linux_x86_64.tar.gz
tar -xzvf stringtie-1.3.4d.Linux_x86_64.tar.gz
cd stringtie-1.3.4d.Linux_x86_64
./stringtie -h
export PATH=/home/ubuntu/bin/stringtie-1.3.4d.Linux_x86_64:$PATH

Install gffcompare

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.10.6.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.10.6.Linux_x86_64.tar.gz
cd gffcompare-0.10.6.Linux_x86_64
./gffcompare
export PATH=/home/ubuntu/bin/gffcompare-0.10.6.Linux_x86_64:$PATH

Install htseq-count

cd ~/bin
wget https://github.com/simon-anders/htseq/archive/release_0.11.0.tar.gz
tar -zxvf release_0.11.0.tar.gz
cd htseq-release_0.11.0/
python setup.py install --user
chmod +x scripts/htseq-count
./scripts/htseq-count -h
export PATH=/home/ubuntu/bin/htseq-release_0.11.0:$PATH

Install TopHat

cd ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64/
./gtf_to_fasta
export PATH=$/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH

Install kallisto

cd ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0/
./kallisto
export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH

Install FastQC

cd ~/bin
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.zip --no-check-certificate
unzip fastqc_v0.11.8.zip
cd FastQC/
chmod 755 fastqc
./fastqc --help
export PATH=/home/ubuntu/bin/FastQC:$PATH

Install MultiQC

pip3 install --user multiqc
multiqc --help

Install Picard

cd ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.18.15/picard.jar -O picard.jar
java -jar ~/bin/picard.jar

Install Flexbar

cd ~/bin
wget https://github.com/seqan/flexbar/releases/download/v3.4.0/flexbar-3.4.0-linux.tar.gz
tar -xzvf flexbar-3.4.0-linux.tar.gz
cd flexbar-3.4.0-linux/
export LD_LIBRARY_PATH=~/bin/flexbar-3.4.0-linux:$LD_LIBRARY_PATH
./flexbar
export PATH=/home/ubuntu/bin/flexbar-3.4.0-linux:$PATH

Install Regtools

cd ~/bin
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools
export PATH=/home/ubuntu/bin/regtools/build:$PATH

Install RSeQC

pip3 install RSeQC
~/.local/bin/read_GC.py
export PATH=~/.local/bin/:$PATH

Install bedops

cd ~/bin
mkdir bedops_linux_x86_64-v2.4.35
cd bedops_linux_x86_64-v2.4.35
wget -c https://github.com/bedops/bedops/releases/download/v2.4.35/bedops_linux_x86_64-v2.4.35.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.35.tar.bz2
./bin/bedops
export PATH=~/bin/bedops_linux_x86_64-v2.4.35/bin:$PATH

Install gtfToGenePred

cd ~/bin
mkdir gtfToGenePred
cd gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred
export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH

Install R

sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/'
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get update
sudo apt-get install r-base r-base-core r-recommended

Note, if X11 libraries are not available you may need to use --with-x=no during config, on a regular linux system you would not use this option. Also, linking the R-patched bin directory into your PATH may cause weird things to happen, such as man pages or git log to not display. This can be circumvented by directly linking the R* executables (R, RScript, RCmd, etc.) into a PATH directory.

R Libraries

For this tutorial we require:

R
install.packages(c("devtools","dplyr","gplots","ggplot2","Seurat"),repos="http://cran.us.r-project.org")
quit(save="no")

Bioconductor libraries

For this tutorial we require:

R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt"))
quit(save="no")

Install Signac

R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

# Tell R to also check bioconductor when installing dependencies
setRepositories(ind=1:2)

# Install Signac (GO.db must installed with Bioconductor)
install.packages("devtools")
BiocManager::install(c("GO.db","DirichletMultinomial"))
devtools::install_github("timoast/signac")
quit(save="no")

Install Sleuth

R
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
quit(save="no")

Install TABIX (GEMINI pre-req)

sudo apt-get install tabix

Install GEMINI

mkdir -p $WORKSPACE/lib/gemini
mkdir -p $HOME/bin
wget https://raw.github.com/arq5x/gemini/master/gemini/scripts/gemini_install.py
sudo python gemini_install.py $HOME $WORKSPACE/lib/gemini

Install MUMmer

wget http://downloads.sourceforge.net/project/mummer/mummer/3.23/MUMmer3.23.tar.gz
tar -zxvf MUMmer3.23.tar.gz
cd MUMmer3.23
make check
make install

Install sniffles

wget https://github.com/fritzsedlazeck/Sniffles/archive/master.tar.gz -O Sniffles.tar.gz
tar -xzvf Sniffles.tar.gz
cd Sniffles-master/
mkdir -p build/
cd build/
cmake ..
make

Install NGM-LR

wget https://github.com/philres/ngmlr/releases/download/v0.2.6/ngmlr-0.2.6-beta-linux-x86_64.tar.gz
tar -xvzf ngmlr-0.2.6-beta-linux-x86_64.tar.gz

Install BWA

git clone https://github.com/lh3/bwa.git
cd bwa
make

Install SURVIVOR

git clone https://github.com/fritzsedlazeck/SURVIVOR.git
cd SURVIVOR/Debug
make

Install salmon

wget https://github.com/COMBINE-lab/salmon/releases/download/v0.11.3/salmon-0.11.3-linux_x86_64.tar.gz
tar -xvzf salmon-0.11.3-linux_x86_64.tar.gz

Install bedtools

git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make
sudo make install

Install bcftools

wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
tar -xvjf bcftools-1.9.tar.bz2
cd bcftools-1.9/
./configure --prefix=$HOME
make install

Install mosdepth

# prereq -- htslib
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
tar -xvjf htslib-1.9.tar.bz2
cd htslib-1.9/
./configure --prefix=$HOME
make && make install
export LD_LIBRARY_PATH=$HOME/lib:$LD_LIBRARY_PATH

# mosdepth
wget https://github.com/brentp/mosdepth/releases/download/v0.2.3/mosdepth
chmod +x mosdepth

Install freebayes

git clone --recursive git://github.com/ekg/freebayes.git
cd freebayes/
make && sudo make install

Install samblaster

git clone git://github.com/GregoryFaust/samblaster.git
cd samblaster
make
cp samblaster $HOME/bin/.

Install NCBI SRA toolkit and NCBI E-Utilities

sudo cpanm HTML::Entities
sudo cpanm LWP::Simple
sudo cpanm XML::Simple
cd /home/ubuntu/bin/
wget ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.tar.gz
tar -zxvf edirect.tar.gz
wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
tar -zxvf sratoolkit.current-ubuntu64.tar.gz
export PATH=/home/ubuntu/bin/sratoolkit.2.9.2-ubuntu64/bin:$PATH
export PATH=/home/ubuntu/bin/edirect:$PATH
#For testing
fastq-dump -X 5 -Z SRR925811
esearch -db sra -query PRJNA40075  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -5 | xargs fastq-dump -X 10 --split-files

Set up Apache web server

We will start an apache2 service and serve the contents of the students home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by url, etc. Note that when launching instances a security group will have to be selected/modified that allows http access via port 80.

sudo vim /etc/apache2/apache2.conf
<Directory /home/ubuntu/>
       Options Indexes FollowSymLinks
       AllowOverride None
       Require all granted
</Directory>
sudo vim /etc/apache2/sites-available/000-default.conf
DocumentRoot /home/ubuntu
sudo service apache2 restart

Save a public AMI

Finally, save the instance as a new AMI by right clicking the instance and clicking on “Create Image”. Enter an appropriate name and description and then save. If desired, you may choose at this time to include the workspace snapshot in the AMI to avoid having to explicitly attach it later at launching of AMI instances. Change the permissions of the AMI to “public” if you would like it to be listed under the Community AMIs. Copy the AMI to any additional regions where you would like it to appear in Community AMI searches.

Current Public AMIs

Create IAM account

From AWS Console select Services -> IAM. Go to Users, Create User, specify a user name, and Create. Download credentials to a safe location for later reference if needed. Select the new user and go to Security Credentials -> Manage Password -> ‘Assign a Custom Password’. Go to Groups -> Create a New Group, specify a group name and Next. Attach a policy to the group. In this case we give all EC2 privileges but no other AWS privileges by specifying “AmazonEC2FullAccess”. Hit Next, review and then Create Group. Select the Group -> Add Users to Group, select your new user to add it to the new group.

Launch student instance

  1. Go to AWS console. Login. Select EC2.
  2. Launch Instance, search for “cshl-seqtech-2019” in Community AMIs and Select.
  3. Choose “m4.2xlarge” instance type.
  4. Select one instance to launch (e.g., one per student and instructor), and select “Protect against accidental termination”
  5. Make sure that you see two snapshots (e.g., the 32GB root volume and 80GB EBS volume you set up earlier)
  6. Create a tag with name=StudentName
  7. Choose existing security group call “SSH_HTTP_8081_IN_ALL_OUT”. Review and Launch.
  8. Choose an existing key pair (CSHL.pem)
  9. View instances and wait for them to finish initiating.
  10. Find your instance in console and select it, then hit connect to get your public.ip.address.
  11. Login to node ssh -i CSHL.pem ubuntu@[public.ip.address].
  12. Optional - set up DNS redirects (see below)

Set up a dynamic DNS service

Rather than handing out ip addresses for each student instance to each student you can instead set up DNS records to redirect from a more human readable name to the IP address. After spinning up all student instances, use a service like http://dyn.com (or http://entrydns.net, etc.) to create hostnames like , , etc that point to each public IP address of student instances.

Host necessary files for the course

Currently, all miscellaneous data files, annotations, etc. are hosted on an ftp server at the Genome Institute. In the future more data files could be pre-loaded onto the EBS snapshot.

After course reminders