Introduction to bioinformatics for RNA sequence analysis
This tutorial explains how Amazon cloud instances were configured for the course. This exercise is not meant to be completed by students; it is provided as a reference for future course developers who wish to conduct their hands-on exercises on Amazon AWS.
A helpful tutorial can be found here
m5.2xlarge. Increase the root volume (e.g., 32GB) and add a second volume (e.g., 250GB). Review and Launch. If necessary, create a new key pair, name it, and save it somewhere safe. Select ‘View Instances’. Take note of the public IP address of the newly launched instance.
chmod 400 [instructor-key].pem
ssh -i [instructor-key].pem ubuntu@[public.ip.address]
sudo apt-get update
sudo apt-get upgrade
sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev libcurl4-openssl-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev
sudo ln -s /usr/include/jsoncpp/json/ /usr/include/json
sudo timedatectl set-timezone America/New_York
We first need to set up the additional storage volume that we added when we created the instance.
# Create mountpoint for additional storage volume
cd /
sudo mkdir workspace
# Format and mount the additional storage volume
cd
sudo mkfs -t ext4 /dev/nvme1n1
sudo mount /dev/nvme1n1 /workspace
# Make the additional storage volume mount persistent across reboots
# See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html for guidance on setting up fstab records for AWS
echo -e "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0\nUUID=98618ffc-ab82-4344-9f76-4ace7d263f59 /workspace ext4 defaults,nofail 0 2" | sudo tee /etc/fstab
/dev/nvme1n1: UUID="98618ffc-ab82-4344-9f76-4ace7d263f59" TYPE="ext4"
# Change permissions on required drives
sudo chown -R ubuntu:ubuntu /workspace
# Create symlink to the added volume in your home directory
cd ~
ln -s /workspace workspace
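The UUID in the fstab record above belongs to one particular volume, so it will differ on every newly created volume. A minimal sketch of building the /workspace record from whatever UUID the device reports, assuming the same ext4 options used above. The helper name `fstab_entry` is hypothetical, and nothing in this block writes to /etc/fstab:

```shell
# Print an fstab record for an ext4 volume identified by its UUID.
# fstab_entry is a hypothetical helper name, not part of any AWS tooling.
fstab_entry() {
    uuid="$1"
    mountpoint="$2"
    printf 'UUID=%s %s ext4 defaults,nofail 0 2\n' "$uuid" "$mountpoint"
}

# Example using the UUID shown above.
fstab_entry 98618ffc-ab82-4344-9f76-4ace7d263f59 /workspace

# On the instance, the UUID could be read from the device instead of pasted:
#   uuid=$(sudo blkid -s UUID -o value /dev/nvme1n1)
#   fstab_entry "$uuid" /workspace | sudo tee -a /etc/fstab
```

Note that `tee -a` in the commented usage appends to the existing file, whereas the command above replaces /etc/fstab entirely; which is appropriate depends on what the base image already contains.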
- All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.
export RNA_HOME=~/workspace/rnaseq
to the .bashrc file. See https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc.
man ls
and if the problem exists, add the following to .bashrc:
export MANPAGER=less
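Settings like these are easy to append twice when an image is rebuilt or a setup script is rerun. A minimal sketch of adding a line to a file only when it is not already there, demonstrated on a scratch file rather than the real ~/.bashrc. The helper name `append_once` is hypothetical:

```shell
# Append a line to a file only if that exact line is not already present.
# append_once is a hypothetical helper name.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrate on a scratch file standing in for ~/.bashrc.
demo_rc=$(mktemp)
append_once 'export RNA_HOME=~/workspace/rnaseq' "$demo_rc"
append_once 'export MANPAGER=less' "$demo_rc"
append_once 'export MANPAGER=less' "$demo_rc"   # second call changes nothing
cat "$demo_rc"
```

On the instance the second argument would be ~/.bashrc, and the same helper could be used for the PATH exports that follow each tool installation.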
/home/ubuntu/bin/
and its install location is exported to the $PATH variable for easy access.
mkdir ~/bin
cd ~/bin
WORKSPACE=/home/ubuntu/workspace
HOME=/home/ubuntu
wget https://github.com/samtools/samtools/releases/download/1.14/samtools-1.14.tar.bz2
bunzip2 samtools-1.14.tar.bz2
tar -xvf samtools-1.14.tar
cd samtools-1.14
make
./samtools
export PATH=/home/ubuntu/bin/samtools-1.14:$PATH
cd ~/bin
export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.14
git clone https://github.com/genome/bam-readcount
cd bam-readcount
mkdir build
cd build
cmake ..
make
export PATH=/home/ubuntu/bin/bam-readcount/bin:$PATH
uname -m
cd ~/bin
curl -s https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download > hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
cd hisat2-2.2.1
./hisat2 -h
export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH
cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.1.6.tar.gz
tar -xzvf stringtie-2.1.6.tar.gz
cd stringtie-2.1.6
make release
export PATH=/home/ubuntu/bin/stringtie-2.1.6:$PATH
cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.12.6.Linux_x86_64.tar.gz
cd gffcompare-0.12.6.Linux_x86_64/
./gffcompare
export PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH
sudo apt install python3-htseq
TopHat will not install if the version of OpenSSL is too old.
To get version:
openssl version
If the version is OpenSSL 1.1.1f, then it needs to be updated using the following steps.
wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz
tar -zxf openssl-1.1.1g.tar.gz && cd openssl-1.1.1g
./config
make
make test
sudo mv /usr/bin/openssl ~/tmp # backup in case the install goes wrong
sudo make install
sudo ln -s /usr/local/bin/openssl /usr/bin/openssl
sudo ldconfig
Again, from the terminal issue the command:
openssl version
Your output should be as follows:
OpenSSL 1.1.1g 21 Apr 2020
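The version check above can also be scripted. A minimal sketch, assuming GNU `sort -V` is available and treating anything older than 1.1.1g as needing the update (that threshold is an assumption based on the versions named above). The helper names `version_lt` and `openssl_version_field` are hypothetical, and the sample string stands in for real `openssl version` output:

```shell
# Succeed if version $1 sorts strictly before version $2.
# version_lt is a hypothetical helper; it relies on `sort -V` (GNU coreutils).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Extract the version field from a line such as "OpenSSL 1.1.1g 21 Apr 2020".
openssl_version_field() {
    printf '%s\n' "$1" | awk '{print $2}'
}

# Sample input; on the instance use: "$(openssl version)"
installed=$(openssl_version_field "OpenSSL 1.1.1f")
if version_lt "$installed" 1.1.1g; then
    echo "OpenSSL $installed is older than 1.1.1g; update needed"
else
    echo "OpenSSL $installed is recent enough"
fi
```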
Then create a ~/.wgetrc file using vim or nano and add to it:
ca_certificate=/etc/ssl/certs/ca-certificates.crt
cd ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64/
./gtf_to_fasta
export PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH
cd ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0/
./kallisto
export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH
cd ~/bin
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
cd FastQC/
chmod 755 fastqc
./fastqc --help
export PATH=/home/ubuntu/bin/FastQC:$PATH
cd ~/bin
pip3 install multiqc
export PATH=/home/ubuntu/.local/bin:$PATH
multiqc --help
cd ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar -O picard.jar
java -jar ~/bin/picard.jar
sudo apt install flexbar
cd ~/bin
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools
export PATH=/home/ubuntu/bin/regtools/build:$PATH
pip3 install RSeQC
~/.local/bin/read_GC.py
export PATH=~/.local/bin/:$PATH
cd ~/bin
mkdir bedops_linux_x86_64-v2.4.40
cd bedops_linux_x86_64-v2.4.40
wget -c https://github.com/bedops/bedops/releases/download/v2.4.40/bedops_linux_x86_64-v2.4.40.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.40.tar.bz2
./bin/bedops
export PATH=~/bin/bedops_linux_x86_64-v2.4.40/bin:$PATH
cd ~/bin
mkdir gtfToGenePred
cd gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred
export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH
cd ~/bin
mkdir genePredtoBed
cd genePredtoBed
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod a+x genePredToBed
./genePredToBed
export PATH=/home/ubuntu/bin/genePredtoBed:$PATH
pip3 install git+https://github.com/kcotto/how_are_we_stranded_here.git
check_strandedness
cd ~/bin
wget `download_link`
tar -xzvf cellranger-6.1.2.tar.gz
export PATH=/home/ubuntu/bin/cellranger-6.1.2:$PATH
sudo apt-get install tabix
cd ~/bin
git clone https://github.com/lh3/bwa.git
cd bwa
make
export PATH=/home/ubuntu/bin/bwa:$PATH
cd ~/bin
wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz
tar -zxvf bedtools-2.30.0.tar.gz
cd bedtools2
make
export PATH=/home/ubuntu/bin/bedtools2/bin:$PATH
cd ~/bin
wget https://github.com/samtools/bcftools/releases/download/1.14/bcftools-1.14.tar.bz2
bunzip2 bcftools-1.14.tar.bz2
tar -xvf bcftools-1.14.tar
cd bcftools-1.14
make
./bcftools
export PATH=/home/ubuntu/bin/bcftools-1.14:$PATH
cd ~/bin
wget https://github.com/samtools/htslib/releases/download/1.14/htslib-1.14.tar.bz2
bunzip2 htslib-1.14.tar.bz2
tar -xvf htslib-1.14.tar
cd htslib-1.14
make
./htsfile --version
export PATH=/home/ubuntu/bin/htslib-1.14:$PATH
cd ~/bin
git clone https://github.com/brentp/peddy
cd peddy
pip install -r requirements.txt
pip install --editable .
cd ~/bin
wget https://github.com/brentp/slivar/releases/download/v0.2.7/slivar
chmod +x ./slivar
cd ~/bin
wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.1/strling
chmod +x ./strling
git clone --recursive https://github.com/freebayes/freebayes.git
cd freebayes
meson build/ --buildtype debug
cd build
ninja
ninja test
export PATH=/home/ubuntu/bin/freebayes/build:$PATH
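With this many PATH exports it is easy to miss one. A minimal sketch of a check that reports which commands are not yet reachable on PATH. The helper name `missing_tools` is hypothetical, and the tool list is only a selection of the command names installed above:

```shell
# Print the name of every listed command that is not found on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# A selection of the tools installed above; extend the list as needed.
missing=$(missing_tools samtools bam-readcount hisat2 stringtie gffcompare \
    kallisto fastqc multiqc regtools bedops bwa bedtools bcftools freebayes)
if [ -n "$missing" ]; then
    echo "Not on PATH:"
    echo "$missing"
else
    echo "All listed tools were found on PATH"
fi
```

Running this in a fresh login shell, rather than the one used for installation, also confirms that the exports were saved to .bashrc and not just set for the current session.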
sudo apt-get remove r-base-core
sudo apt-get remove r-base
wget -c https://cran.r-project.org/src/base/R-4/R-4.0.0.tar.gz
tar -xf R-4.0.0.tar.gz
cd R-4.0.0
./configure
make -j9
sudo make install
Note: if X11 libraries are not available, you may need to use --with-x=no during configuration; on a regular Linux system you would not use this option. Also, linking the R-patched bin directory into your PATH may cause weird things to happen, such as man pages or git log failing to display. This can be circumvented by directly linking the R* executables (R, Rscript, RCmd, etc.) into a PATH directory.
For this tutorial we require:
R
install.packages(c("devtools","dplyr","gplots","ggplot2","Seurat","sctransform","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR"),repos="http://cran.us.r-project.org")
quit(save="no")
For this tutorial we require:
R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva","gage","org.Hs.eg.db"))
quit(save="no")
R
devtools::install_github("diazlab/CONICS/CONICSmat", dep = TRUE)
quit(save="no")
R
# Tell R to also check bioconductor when installing dependencies
setRepositories(ind=1:2)
# Install Signac (GO.db must be installed with Bioconductor)
BiocManager::install(c("GO.db","DirichletMultinomial"))
devtools::install_github("timoast/signac")
quit(save="no")
R
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
quit(save="no")
For the 2021 version of the course, rather than exporting each tool’s individual path, I moved all of the subdirs to ~/src and copied all of the binaries from there to ~/bin so that PATH is less complex.
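One way to do that consolidation is sketched below. It is not the exact procedure used for the course: it creates symlinks instead of copies (so tools that look for sibling files still find them), searches only three directory levels deep, and links every executable it finds, including build and test scripts, so the result may need pruning. The helper name `link_executables` is hypothetical, and the demonstration runs on scratch directories standing in for ~/src and ~/bin:

```shell
# Symlink every executable regular file found under src_dir (up to three
# levels deep) into bin_dir. Pass absolute paths so the links resolve.
# link_executables is a hypothetical helper name.
link_executables() {
    src_dir="$1"
    bin_dir="$2"
    mkdir -p "$bin_dir"
    find "$src_dir" -maxdepth 3 -type f | while IFS= read -r exe; do
        if [ -x "$exe" ]; then
            ln -sf "$exe" "$bin_dir/$(basename "$exe")"
        fi
    done
}

# Demonstration on scratch directories standing in for ~/src and ~/bin.
demo=$(mktemp -d)
mkdir -p "$demo/src/toolA" "$demo/src/toolB/bin"
printf '#!/bin/sh\necho toolA\n' > "$demo/src/toolA/toolA"
printf '#!/bin/sh\necho toolB\n' > "$demo/src/toolB/bin/toolB"
printf 'not a program\n' > "$demo/src/toolA/README"
chmod +x "$demo/src/toolA/toolA" "$demo/src/toolB/bin/toolB"
chmod -x "$demo/src/toolA/README"

link_executables "$demo/src" "$demo/bin"
ls "$demo/bin"
```

With a single directory holding all links, .bashrc needs only one `export PATH=/home/ubuntu/bin:$PATH` line.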
We will start an apache2 service and serve the contents of the students' home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by URL, etc. Note that when launching instances, a security group that allows HTTP access via port 80 will have to be selected or modified.
sudo vim /etc/apache2/apache2.conf
<Directory /home/ubuntu/>
Options Indexes FollowSymLinks
AllowOverride None
Require all granted
</Directory>
sudo vim /etc/apache2/sites-available/000-default.conf
DocumentRoot /home/ubuntu
sudo service apache2 restart
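The DocumentRoot change can also be made without an interactive editor, which helps when the setup is scripted. A minimal sketch, demonstrated on a scratch file that mimics the stock Ubuntu site file (whose DocumentRoot is /var/www/html). The helper name `set_document_root` is hypothetical:

```shell
# Rewrite the DocumentRoot directive in an Apache site config file,
# preserving the directive's indentation.
# set_document_root is a hypothetical helper name.
set_document_root() {
    new_root="$1"
    conf_file="$2"
    tmp_file=$(mktemp)
    sed "s|^\([[:space:]]*\)DocumentRoot[[:space:]].*|\1DocumentRoot $new_root|" \
        "$conf_file" > "$tmp_file" && cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}

# Demonstrate on a scratch copy rather than the live configuration.
demo_conf=$(mktemp)
printf '<VirtualHost *:80>\n\tDocumentRoot /var/www/html\n</VirtualHost>\n' > "$demo_conf"
set_document_root /home/ubuntu "$demo_conf"
cat "$demo_conf"

# On the instance the target would be
# /etc/apache2/sites-available/000-default.conf (root privileges required),
# and `sudo apache2ctl configtest` can check the syntax before restarting.
```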
Finally, save the instance as a new AMI by right-clicking the instance and clicking on “Create Image”. Enter an appropriate name and description and then save. If desired, you may choose at this time to include the workspace snapshot in the AMI to avoid having to explicitly attach it later when launching instances from the AMI. Change the permissions of the AMI to “public” if you would like it to be listed under the Community AMIs. Copy the AMI to any additional regions where you would like it to appear in Community AMI searches.
From AWS Console select Services -> IAM. Go to Users, Create User, specify a user name, and Create. Download credentials to a safe location for later reference if needed. Select the new user and go to Security Credentials -> Manage Password -> ‘Assign a Custom Password’. Go to Groups -> Create a New Group, specify a group name and Next. Attach a policy to the group. In this case we give all EC2 privileges but no other AWS privileges by specifying “AmazonEC2FullAccess”. Hit Next, review and then Create Group. Select the Group -> Add Users to Group, select your new user to add it to the new group.
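The same user and group setup can be done from the AWS CLI instead of the console. The sketch below only prints the commands so they can be reviewed before anything is run; it assumes the AWS CLI is installed and configured with administrator credentials. The helper name `print_iam_commands` and the example user and group names are hypothetical, and setting the console password is left to the console step described above:

```shell
# Print (without running) AWS CLI commands that mirror the console steps:
# create a user, create a group, attach AmazonEC2FullAccess, add the user.
# print_iam_commands and the example names below are hypothetical.
print_iam_commands() {
    user="$1"
    group="$2"
    policy_arn="arn:aws:iam::aws:policy/AmazonEC2FullAccess"
    echo "aws iam create-user --user-name $user"
    echo "aws iam create-group --group-name $group"
    echo "aws iam attach-group-policy --group-name $group --policy-arn $policy_arn"
    echo "aws iam add-user-to-group --user-name $user --group-name $group"
}

print_iam_commands rnaseq-student rnaseq-students
```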
ssh -i cshl_2021_student.pem ubuntu@[public.ip.address]
Rather than handing out an IP address to each student for their instance, you can instead set up DNS records to redirect from a more human-readable name to the IP address. After spinning up all student instances, use a service like http://dyn.com (or http://entrydns.net, etc.) to create hostnames like
Currently, all miscellaneous data files, annotations, etc. are hosted on an FTP server at the Genome Institute. In the future, more data files could be pre-loaded onto the EBS snapshot.