Tool Installation | Griffith Lab

## RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

# Tool Installation

### Note:

First, make sure your environment is set up correctly.

Tools needed for this analysis are: samtools, bam-readcount, HISAT2, stringtie, gffcompare, htseq-count, flexbar, R, ballgown, fastqc and picard-tools. In the following installation example, the installs are local and will work whether you have root (i.e. admin) access or not. However, if root is available, some binaries may be copied to a shared location on your PATH (e.g., ~/bin/).

Set up tool installation location:

cd $RNA_HOME
mkdir student_tools
cd student_tools

## SAMtools

Installation type: build C++ binary from source code using make. Citation: PMID: 19505943.

The following tool is installed by downloading a compressed archive using wget, decompressing it using bunzip2, unpacking the archive using tar, and building the source code using make to run compiler commands in the “Makefile” provided with the tool. When make is run without options, it attempts the “default goal” in the Makefile, which is the first “target” defined. In this case the first “target” is all. Once the build is complete, we test that it worked by attempting to execute the samtools binary. Remember that the ./ in ./samtools tells the command line that you want to execute the samtools binary in the current directory. We do this because there may be other samtools binaries in our PATH. Try which samtools to see the samtools binary that appears first in our PATH and therefore will be the one used when we specify samtools without specifying a particular location of the binary.

cd $RNA_HOME/student_tools/
bunzip2 samtools-1.11.tar.bz2
tar -xvf samtools-1.11.tar
cd samtools-1.11
make
./samtools
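The difference between running a bare command name and running ./samtools can be illustrated with a small, self-contained sketch. It does not touch samtools itself: the script name mytool and the two temporary directories are made up for the demonstration.

```shell
# Create two throwaway directories, each holding a fake "mytool" script.
# "mytool" and both directories are hypothetical names used only for this demo.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/first" "$demo_dir/second"
printf '#!/bin/sh\necho "first copy"\n'  > "$demo_dir/first/mytool"
printf '#!/bin/sh\necho "second copy"\n' > "$demo_dir/second/mytool"
chmod +x "$demo_dir/first/mytool" "$demo_dir/second/mytool"

# Put both directories on PATH; "first" is listed earlier, so it wins.
old_path=$PATH
PATH="$demo_dir/first:$demo_dir/second:$PATH"

# A bare name is resolved by searching PATH from left to right.
resolved=$(command -v mytool)
by_name=$(mytool)

# An explicit path bypasses the PATH search entirely.
by_path=$("$demo_dir/second/mytool")

echo "bare name resolves to: $resolved"
echo "bare name prints:      $by_name"
echo "explicit path prints:  $by_path"

# Restore PATH and clean up the temporary files.
PATH=$old_path
rm -rf "$demo_dir"
```

The same reasoning applies to samtools: a bare samtools runs whichever copy appears first in PATH, while ./samtools always runs the one just built in the current directory.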


## bam-readcount

Installation type: build C++ binary from source code using cmake and make. Citation: genome/bam-readcount.

Installation of the bam-readcount tool involves “cloning” the source code with a code version control system called git. The code is then compiled using cmake and make. cmake is an application for managing the build process of software using a compiler-independent method. It is used in conjunction with native build environments such as make (cmake ref). Note that bam-readcount relies on another tool, samtools, as a dependency. An environment variable is used to specify the path to the samtools install.
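Why the samtools path is passed as an exported environment variable can be seen in a short sketch: only exported variables are inherited by child processes such as cmake and make. The names DEMO_PLAIN and DEMO_EXPORTED are hypothetical and exist only for this illustration.

```shell
# DEMO_PLAIN and DEMO_EXPORTED are made-up names for this illustration.
unset DEMO_PLAIN DEMO_EXPORTED
DEMO_PLAIN="only in this shell"
export DEMO_EXPORTED="visible to child processes"

# sh -c starts a child process, much as running cmake or make does.
plain_in_child=$(sh -c 'printf "%s" "${DEMO_PLAIN:-<unset>}"')
exported_in_child=$(sh -c 'printf "%s" "${DEMO_EXPORTED:-<unset>}"')

echo "plain variable in child:    $plain_in_child"
echo "exported variable in child: $exported_in_child"
```

Setting SAMTOOLS_ROOT without export would therefore leave the build unable to see it.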

cd $RNA_HOME/student_tools/
export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.11
git clone https://github.com/genome/bam-readcount.git
cd bam-readcount
cmake -Wno-dev .
make
./bin/bam-readcount

## HISAT2

Installation type: download a precompiled binary. Citation: PMID: 31375807.

The hisat2 aligner is installed below by simply downloading an archive of binaries using wget, unpacking them with unzip, and testing the tool to make sure it executes without error on the current system. This approach relies on understanding the architecture of your system and downloading the correct precompiled binary. The uname -m command lists the current system architecture.

uname -m
cd $RNA_HOME/student_tools/
unzip hisat2-2.2.1-Linux_x86_64.zip
cd hisat2-2.2.1
./hisat2 -h
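Before downloading a precompiled binary, the architecture check can be scripted. The following is a minimal sketch, assuming the only build of interest is the x86_64 one used above; arch_matches is a hypothetical helper, not part of HISAT2.

```shell
# expected_arch is the architecture named in the archive used above.
expected_arch="x86_64"
current_arch=$(uname -m)

# arch_matches is a hypothetical helper: it compares two architecture strings.
arch_matches() {
    [ "$1" = "$2" ]
}

if arch_matches "$current_arch" "$expected_arch"; then
    echo "This machine is $current_arch: the Linux_x86_64 archive should run."
else
    echo "This machine is $current_arch: look for a build matching it instead." >&2
fi
```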


## StringTie

The stringtie reference guided transcript assembly and abundance estimation tool is installed below by simply downloading an archive with wget, unpacking the archive with tar, and executing stringtie to confirm it runs without error on our system.

cd $RNA_HOME/student_tools/
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.1.4.Linux_x86_64.tar.gz
tar -xzvf stringtie-2.1.4.Linux_x86_64.tar.gz
cd stringtie-2.1.4.Linux_x86_64
./stringtie -h

## gffcompare

Installation type: download a precompiled binary. Citation: PMID: 25690850.

The gffcompare tool for comparing transcript annotations is installed below by simply downloading an archive with wget, unpacking it with tar, and executing gffcompare to ensure it runs without error on our system.

cd $RNA_HOME/student_tools/
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.1.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.12.1.Linux_x86_64.tar.gz
cd gffcompare-0.12.1.Linux_x86_64/
./gffcompare

## htseq-count

Installation type: use python setup script. Citation: PMID: 25260700.

The htseq-count read counting tool is installed below by cloning its source repository with git, checking out a tagged release, and running a setup script written in Python. After setup, chmod is used to change permissions of the htseq-count file to be executable.

cd $RNA_HOME/student_tools/
git clone https://github.com/htseq/htseq.git
cd htseq/
git fetch --all --tags
git checkout release_0.12.4
python setup.py install --user
chmod +x scripts/htseq-count
./scripts/htseq-count -h
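What chmod +x changes can be seen on a throwaway file. This sketch uses a made-up script called hello-tool in a temporary directory, so it does not modify the htseq checkout.

```shell
# Work in a throwaway directory; "hello-tool" is a made-up script name.
work_dir=$(mktemp -d)
script="$work_dir/hello-tool"
printf '#!/bin/sh\necho "hello from the script"\n' > "$script"

# Start from a known state with no execute bits set.
chmod 644 "$script"
if [ -x "$script" ]; then before="executable"; else before="not executable"; fi

# chmod +x adds the execute permission, so the file can be run directly.
chmod +x "$script"
if [ -x "$script" ]; then after="executable"; else after="not executable"; fi
output=$("$script")

echo "before chmod +x: $before"
echo "after chmod +x:  $after"
echo "script output:   $output"
rm -rf "$work_dir"
```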


## TopHat

Installation type: download a precompiled binary. Citation: PMID: 19289445.

Note, this tool is currently only installed for the gtf_to_fasta tool used in the kallisto section.

cd $RNA_HOME/student_tools/
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64/
./gtf_to_fasta

## kallisto

Installation type: download a precompiled binary. Citation: PMID: 27043002.

The kallisto alignment free expression estimation tool is installed below simply by downloading an archive with wget, unpacking the archive with tar, and testing the binary to ensure it runs on our system.

cd $RNA_HOME/student_tools/
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0/
./kallisto


## FastQC

cd $RNA_HOME/student_tools/
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
cd FastQC/
chmod 755 fastqc
./fastqc --help

## MultiQC

Installation type: use pip. Citation: PMID: 27312411.

MultiQC, a tool for assembling QC reports, is a Python package that can be installed using the Python package manager pip.

pip3 install --user multiqc
export PATH=/home/ubuntu/.local/bin:$PATH
python3 -m multiqc --help


## Picard

Picard is a rich tool kit for BAM file manipulation that is installed below simply by downloading a jar file. The jar file is tested using Java, a dependency that must also be installed (it should already be present in many systems).

cd $RNA_HOME/student_tools/
wget https://github.com/broadinstitute/picard/releases/download/2.23.8/picard.jar -O picard.jar
java -jar $RNA_HOME/student_tools/picard.jar
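Because Picard depends on Java, it can help to confirm that java is available before trying to run the jar. Below is a minimal sketch; require_cmd is a hypothetical helper written for this example, not something shipped with Picard.

```shell
# require_cmd is a hypothetical helper: it reports whether a command is on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (install it before continuing)" >&2
        return 1
    fi
}

# Picard needs Java, so check for it before trying to run the jar.
# "|| true" keeps the demo going on machines without Java.
require_cmd java || true
```

The same helper could be pointed at any other dependency named in this tutorial, such as git or cmake.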


## Flexbar

cd $RNA_HOME/student_tools/
wget https://github.com/seqan/flexbar/releases/download/v3.5.0/flexbar-3.5.0-linux.tar.gz
tar -xzvf flexbar-3.5.0-linux.tar.gz
cd flexbar-3.5.0-linux/
export LD_LIBRARY_PATH=$RNA_HOME/student_tools/flexbar-3.5.0-linux:$LD_LIBRARY_PATH
./flexbar

## Regtools

Installation type: compile from source code using cmake and make. Citation: bioRXiv: 10.1101/436634v2.

cd $RNA_HOME/student_tools/
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools


## RSeQC

Installation type: use pip. Citation: PMID: 22743226.

pip3 install RSeQC


## BEDOPS

cd $RNA_HOME/student_tools/
mkdir bedops_linux_x86_64-v2.4.39
cd bedops_linux_x86_64-v2.4.39
wget -c https://github.com/bedops/bedops/releases/download/v2.4.39/bedops_linux_x86_64-v2.4.39.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.39.tar.bz2
./bin/bedops
./bin/gff2bed

## gtfToGenePred

Installation type: download precompiled binary.

cd $RNA_HOME/student_tools/
mkdir gtfToGenePred
cd gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred


## genePredToBed

cd $RNA_HOME/student_tools/
mkdir genePredToBed
cd genePredToBed
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod a+x genePredToBed
./genePredToBed

## how_are_we_stranded_here

pip3 install git+https://github.com/betsig/how_are_we_stranded_here.git
check_strandedness

## R

Installation type: compile source code using make.

This install takes a while, so check if you have R installed already by typing which R. It is already installed on the Cloud, but for completeness, here is how it was done. Please skip all R installation!

#sudo apt-get install r-base-dev
#export R_LIBS=
#cd $RNA_HOME/student_tools/
#wget https://stat.ethz.ch/R/daily/R-patched.tar.gz
#tar -xzvf R-patched.tar.gz
#cd R-patched
#./configure --prefix=$RNA_HOME/student_tools/R-patched/ --with-x=no
#make
#make install
#./bin/Rscript

Note, if X11 libraries are not available you may need to use --with-x=no during config; on a regular linux system you would not use this option. Also, linking the R-patched bin directory into your PATH may cause weird things to happen, such as man pages or git log not displaying. This can be circumvented by directly linking the R* executables (R, RScript, RCmd, etc.) into a PATH directory.

## R Libraries

Installation type: add new base R libraries to an R installation.

For this tutorial we require devtools, dplyr, gplots, and ggplot2. Launch R (enter R at the linux command prompt) and type the following at an R command prompt. NOTE: This has been pre-installed for you, so these commands can be skipped.

#R
#install.packages(c("devtools","dplyr","gplots","ggplot2"),repos="http://cran.us.r-project.org")
#quit(save="no")

## Bioconductor

Installation type: add bioconductor libraries to an R installation. Citation: PMID: 15461798.

For this tutorial we require genefilter, ballgown, edgeR, GenomicRanges, rhdf5, and biomaRt. Launch R (enter R at the linux command prompt) and type the following at an R command prompt. If prompted, type “a” to update all old packages. NOTE: This has been pre-installed for you, so these commands can be skipped.

#R
#source("http://bioconductor.org/biocLite.R")
#biocLite(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt"))
#quit(save="no")

## Sleuth

Installation type: R package installation from a git repository. Citation: PMID: 28581496.

#R
#install.packages("devtools")
#devtools::install_github("pachterlab/sleuth")
#quit(save="no")

## PRACTICAL EXERCISE 1 - Software Installation

Assignment: Install bedtools on your own. Make sure you install it in your tools folder. Download, unpack, compile, and test the bedtools software. Citation: PMID: 20110278.

cd $RNA_HOME/student_tools/

• Hint: google “bedtools” to find the source code
• Hint: there is a README file that will give you hints on how to install
• Hint: If your install has worked you should be able to run bedtools as follows:
$RNA_HOME/student_tools/bedtools2/bin/bedtools

Questions

• What happens when you run bedtools without any options?
• Where can you find detailed documentation on how to use bedtools?
• How many general categories of analysis can you perform with bedtools? What are they?

Solution: When you are ready you can check your approach against the Solutions.

## Add locally installed tools to your PATH [OPTIONAL]

To use the locally installed version of each tool without having to specify complete paths, you could add the install directory of each tool to your ‘$PATH’ variable:

PATH=$RNA_HOME/student_tools/genePredToBed:$RNA_HOME/student_tools/gtfToGenePred:$RNA_HOME/student_tools/bedops_linux_x86_64-v2.4.39/bin:$RNA_HOME/student_tools/samtools-1.11:$RNA_HOME/student_tools/bam-readcount/bin:$RNA_HOME/student_tools/hisat2-2.2.1:$RNA_HOME/student_tools/stringtie-2.1.4.Linux_x86_64:$RNA_HOME/student_tools/gffcompare-0.12.1.Linux_x86_64:$RNA_HOME/student_tools/htseq/scripts:$RNA_HOME/student_tools/tophat-2.1.1.Linux_x86_64:$RNA_HOME/student_tools/kallisto_linux-v0.44.0:$RNA_HOME/student_tools/FastQC:$RNA_HOME/student_tools/flexbar-3.5.0-linux:$RNA_HOME/student_tools/regtools/build:/home/ubuntu/bin/bedtools2/bin:$PATH
export LD_LIBRARY_PATH=$RNA_HOME/student_tools/flexbar-3.5.0-linux:$LD_LIBRARY_PATH
echo $PATH
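Running a PATH assignment like the one above more than once (for example by sourcing a file repeatedly) adds the same directories again each time. One possible way to avoid that is sketched below; path_prepend is a hypothetical helper and /opt/example_tools/bin is a made-up directory.

```shell
# path_prepend is a hypothetical helper: it puts a directory at the front of
# PATH unless that directory is already listed somewhere in PATH.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, leave PATH alone
        *) PATH="$1:$PATH" ;;
    esac
}

# /opt/example_tools/bin is a made-up directory used only for illustration.
path_prepend /opt/example_tools/bin
path_prepend /opt/example_tools/bin   # second call is a no-op
export PATH
echo "$PATH"
```

Wrapping PATH in colons before matching means a directory is only treated as present when it appears as a whole entry, not as part of a longer path.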


You can make these changes permanent by adding the above lines to your .bashrc file. Use a text editor to open your .bashrc file. For example:

vi ~/.bashrc


### Vi instructions

1. Using your cursor, navigate down to the “export PATH” commands at the end of the file.
2. Delete the line starting with PATH using the vi command “dd”.
3. Press the “i” key to enter insert mode. Go to an empty line with your cursor and copy and paste the new RNA_HOME and PATH commands into the file.
4. Press the “esc” key to exit insert mode.
5. Press the “:” key to enter command mode.
6. Type “wq” to save and quit vi

cd ~
wget -N https://raw.githubusercontent.com/griffithlab/rnabio.org/master/assets/setup/.bashrc
source ~/.bashrc
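Lines can also be added to a startup file without opening an editor. The sketch below is one way to do that safely; append_once is a hypothetical helper, and it is demonstrated on a temporary file rather than the real ~/.bashrc.

```shell
# append_once is a hypothetical helper: it appends a line to a file only if
# that exact line is not already there, so repeated runs do not pile up.
append_once() {
    line=$1
    file=$2
    touch "$file"
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demonstrate on a throwaway file rather than the real ~/.bashrc.
demo_rc=$(mktemp)
append_once 'export RNA_HOME=~/workspace/rnaseq' "$demo_rc"
append_once 'export RNA_HOME=~/workspace/rnaseq' "$demo_rc"
cat "$demo_rc"
```

After the two calls the file holds the export line exactly once. Pointing the second argument at ~/.bashrc would apply the same logic to the real file.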


## Installing tools from official ubuntu packages [OPTIONAL]

Some useful tools are available as official ubuntu packages. These can be installed using the linux package management system apt. Most bioinformatic tools (especially the latest versions) are not available as official packages. Nevertheless, here is how you would update your apt library, upgrade existing packages, and install an Ubuntu tool called tree.

#sudo apt-get update
#sudo apt-get install tree
#tree


## Installing Docker

Sometimes you might not have root access in order to be able to install the tools as described above or you might not want to deal with figuring out a way to install all of the dependencies necessary for a tool to run. One alternative way to use tools is to use a docker image for that tool. Before we can do this, we must first install docker.

First we’ll want to update apt-get and remove any old Docker packages that might exist on our ubuntu install.

sudo apt-get update
sudo apt-get remove docker docker-engine docker.io containerd runc


Next we’ll want to make sure that some dependencies that docker needs are available.

sudo apt-get install \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common


Then we’ll need to add Docker’s official GPG key and verify that we now have the key with the fingerprint 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88, by searching for the last 8 characters of the fingerprint.

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
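The short ID passed to apt-key is simply the last 8 characters of the full fingerprint with the spaces removed. A minimal sketch of that derivation, using only the fingerprint quoted above:

```shell
# Full fingerprint as quoted in the text above.
fingerprint="9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88"

# Remove the spaces, then keep only the final 8 characters.
short_id=$(printf '%s' "$fingerprint" | tr -d ' ' | tail -c 8)

echo "$short_id"   # prints 0EBFCD88
```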


Next, we’ll use the following command to set up the stable repository.

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"

Now that we’ve set up the dependencies for docker, we can finally install it.

sudo apt-get install docker-ce docker-ce-cli containerd.io

We can now test our docker install.

sudo docker run hello-world

Notice that we had to use sudo to run the docker container. If you tried to run the above command without sudo, you would get a permission denied error. In order to not have to use root access every time we want to use docker, we can add the ubuntu user to the docker group. We’ll then have to reboot the instance in order for this change to take effect.

sudo usermod -a -G docker ubuntu
sudo reboot

After reboot, you should now be able to run docker run hello-world without using sudo before it.

## Installing tools by Docker image

Some tools have complex dependencies that are difficult to reproduce across systems or make work in the same environment with tools that require different versions of the same dependencies. Container systems such as Docker and Singularity allow you to isolate a tool’s environment, giving you almost complete control over dependency issues. For this reason, many tool developers have started to distribute their tools as docker images. Many of these are placed in container image repositories such as DockerHub. Here is an example tool installation using docker.

Install samtools:

docker pull biocontainers/samtools:v1.9-4-deb_cv1
docker run -t biocontainers/samtools:v1.9-4-deb_cv1 samtools --help

Install pvactools for personalized cancer vaccine designs:

#docker pull griffithlab/pvactools:latest
#docker run -t griffithlab/pvactools:latest pvacseq --help

## Installing tools by Docker image (using Singularity)

Some systems do not allow docker to be run for various reasons. Sometimes singularity is used instead.
The equivalent to the above but using singularity looks like the following:

#singularity pull docker://griffithlab/pvactools:latest
#singularity run docker://griffithlab/pvactools:latest pvacseq -h

Note that if you encounter errors with /tmp space usage or would like to control where singularity stores its temp files, you can set the environment variables:

#export SINGULARITY_CACHEDIR=/media/workspace/.singularity
#export TMPDIR=/media/workspace/temp

# Environment

### Getting Started

This tutorial assumes use of a Linux computer with an ‘x86_64’ architecture. The rest of the tutorial should be conducted in a linux Terminal session. In other words you must already be logged into the Amazon EC2 instance as described in the previous section.

Before proceeding you must define a global working directory by setting the environment variable ‘RNA_HOME’. Log into a server and SET THIS BEFORE RUNNING EVERYTHING.

Create a working directory and set the ‘RNA_HOME’ environment variable:

mkdir -p ~/workspace/rnaseq/
export RNA_HOME=~/workspace/rnaseq

Make sure that, whatever the working directory is, it is set and is valid:

echo $RNA_HOME
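Printing the variable shows whether it is set, but not whether it points somewhere usable. A slightly stricter check is sketched below; check_workdir is a hypothetical helper written for this example.

```shell
# check_workdir is a hypothetical helper: it verifies that a path is non-empty,
# exists as a directory, and is writable by the current user.
check_workdir() {
    if [ -z "$1" ]; then
        echo "error: RNA_HOME is empty or unset" >&2
        return 1
    fi
    if [ ! -d "$1" ] || [ ! -w "$1" ]; then
        echo "error: $1 is not a writable directory" >&2
        return 1
    fi
    echo "ok: $1"
}

# "|| true" keeps the demo going if RNA_HOME is not set on this machine.
check_workdir "${RNA_HOME:-}" || true
```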


You can place the RNA_HOME variable (and other environment variables) in your .bashrc and then logout and login again to avoid having to worry about it. A .bashrc file with these variables has already been created for you.

In order to view the contents of this file, you can type:

less ~/.bashrc


To exit the file, type q.

Environment variables used throughout this tutorial:

export RNA_HOME=~/workspace/rnaseq
export RNA_DATA_DIR=$RNA_HOME/data
export RNA_DATA_TRIM_DIR=$RNA_DATA_DIR/trimmed
export RNA_REFS_DIR=$RNA_HOME/refs
export RNA_REF_INDEX=$RNA_REFS_DIR/chr22_with_ERCC92
export RNA_REF_FASTA=$RNA_REF_INDEX.fa
export RNA_REF_GTF=$RNA_REF_INDEX.gtf
export RNA_ALIGN_DIR=$RNA_HOME/alignments/hisat2


We will be using picard tools throughout this workshop. To follow along, you will need to set an environment variable pointing to your picard installation.

export PICARD=~/bin/picard.jar


If these variables are not part of your .bashrc, you can type the following. First, you can open your .bashrc file with nano by simply typing:

nano ~/.bashrc


You can now see the contents of this file. Then, you want to add the above environment variables to the bottom of the file. You can do this by copying and pasting. Once you have the variables in the file, you’ll want to type ctrl + o to save the file, then enter to confirm you want the same filename, then ctrl + x to exit nano.

Since all the environment variables we set up for the RNA-seq workshop start with ‘RNA’ we can easily view them all by combined use of the env and grep commands as shown below. The env command shows all environment variables currently defined and the grep command identifies string matches.

env | grep RNA
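Note that a plain grep RNA matches the string anywhere on a line, including inside a variable's value. If only variables whose names begin with RNA_ are wanted, the pattern can be anchored to the start of the line. The variables RNA_DEMO_DIR and DEMO_NOTE below are made up for the illustration.

```shell
# RNA_DEMO_DIR and DEMO_NOTE are made-up variables for this illustration.
export RNA_DEMO_DIR=/tmp/rna_demo
export DEMO_NOTE="contains RNA in its value"

# Unanchored: matches "RNA" anywhere on the line, including in values.
loose=$(env | grep RNA)

# Anchored: matches only lines whose variable name starts with RNA_.
strict=$(env | grep '^RNA_')

echo "$loose"
echo "---"
echo "$strict"
```

The unanchored search lists DEMO_NOTE as well, while the anchored one leaves it out.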
