AWS Setup
Amazon AWS/AMI setup for use in workshop
This tutorial explains how Amazon cloud instances were configured for the course. This exercise is not meant to be completed by the students; it is provided as a reference for future course developers who wish to conduct their hands-on exercises on Amazon AWS.
Create AWS account
A helpful tutorial can be found here: https://rnabio.org/module-00-setup/0000/04/01/Intro_to_AWS/
- Create a new gmail account to use for the course
- Use the above email account to set up a new AWS/Amazon user account. Note: Any AWS account needs to be linked to an actual person and credit card account.
- Optional - Set up an IAM account. Give this account full EC2 access but no other permissions. This provides an account that can be shared with other instructors but does not have access to billing and other root account privileges.
- Request limit increase for limit types you will be using. You need to be able to spin up at least one instance of the desired type for every student and TA/instructor. See: http://aws.amazon.com/about-aws/whats-new/2014/06/19/amazon-ec2-service-limits-report-now-available/. Note: You need to request an increase for each instance type and region you might use.
- Sign into AWS Management Console: http://aws.amazon.com/console/
- Go to EC2 services
Start with existing community AMI
- Launch a fresh Ubuntu Image (Ubuntu Server 18.04 LTS at the time of writing this). Choose an instance type of m5.2xlarge. Increase root volume (e.g., 32GB) and add a second volume (e.g., 250GB). Review and Launch. If necessary, create a new key pair, name and save somewhere safe. Select ‘View Instances’. Take note of public IP address of newly launched instance.
- Change permissions on downloaded key pair with
chmod 400 [instructor-key].pem
- Login to instance with ubuntu user:
ssh -i [instructor-key].pem ubuntu@[public.ip.address]
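ssh refuses to use a private key file that other users can read, so it can help to check the key's mode before connecting. The following is a minimal sketch, assuming GNU `stat` (as shipped with Ubuntu); `key_is_private` is a hypothetical helper name and not part of ssh or AWS.

```shell
# Hypothetical helper: succeed only if the key file exists and is
# accessible by its owner alone (mode 400 or 600).
key_is_private() {
    key="$1"
    [ -f "$key" ] || return 1
    # GNU stat: %a prints the octal permission bits, e.g. 400
    mode=$(stat -c '%a' "$key")
    [ "$mode" = "400" ] || [ "$mode" = "600" ]
}

# Example, using the same placeholders as the commands above:
# key_is_private "[instructor-key].pem" && \
#     ssh -i "[instructor-key].pem" ubuntu@[public.ip.address]
```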
Perform basic linux configuration
- To allow installation of bioinformatics tools, some basic dependencies must be installed first.
sudo apt-get update
sudo apt-get upgrade
sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python-dev python-numpy python3-dev python3-pip gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev libcurl4-openssl-dev apache2 csh ruby-full gnuplot cpanminus r-base libssl-dev gcc-4.8 g++-4.8 gsl-bin libgsl-dev apt-transport-https software-properties-common
sudo timedatectl set-timezone America/New_York
- Log out and log back in for the changes to take effect.
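After the package installation it can be useful to confirm that the expected commands are actually on the PATH before building any tools. This is a small sketch; `missing_cmds` is a hypothetical helper, and the command names in the example are an assumption about which commands the packages above provide.

```shell
# Hypothetical helper: print the name of every listed command
# that cannot be found on the PATH.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example: no output means every command was found.
# missing_cmds make gcc git cmake unzip java gnuplot pip3
```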
Set up additional storage for workspace
We first need to set up the additional storage volume that we added when we created the instance.
# Create mountpoint for additional storage volume
cd /
sudo mkdir workspace
# Format and mount the additional storage volume
cd
sudo mkfs -t ext4 /dev/nvme1n1
sudo mount /dev/nvme1n1 /workspace
# Make the storage mount persistent across reboots
# See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html for guidance on setting up fstab records for AWS
echo -e "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0\n/dev/nvme1n1 /workspace ext4 defaults,nofail 0 2" | sudo tee /etc/fstab
# Change permissions on required drives
sudo chown -R ubuntu:ubuntu /workspace
# Create symlink to the added volume in your home directory
cd ~
ln -s /workspace workspace
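The AWS guide linked in the comments above suggests identifying volumes in /etc/fstab by UUID, because device names such as /dev/nvme1n1 are not guaranteed to stay the same across reboots. The sketch below shows one way to build such a record in place of the /dev/nvme1n1 record; `fstab_entry` is a hypothetical helper, and the mount options simply mirror the ones used above.

```shell
# Hypothetical helper: format one fstab record from a UUID and a mount point.
# The filesystem type defaults to ext4, matching the mkfs command above.
fstab_entry() {
    uuid="$1"; mountpoint="$2"; fstype="${3:-ext4}"
    printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$uuid" "$mountpoint" "$fstype"
}

# Example (on the instance; blkid reads the UUID of the formatted volume):
# uuid=$(sudo blkid -s UUID -o value /dev/nvme1n1)
# fstab_entry "$uuid" /workspace
# After editing /etc/fstab, "sudo mount -a" checks that all records mount cleanly.
```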
Install any desired informatics tools
- NOTE: R in particular is a slow install.
- NOTE: All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.
- Paths to pre-installed tools can be added to the .bashrc file. It may also be convenient to add
export RNA_HOME=~/workspace/rnaseq
to the .bashrc file. See https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc.
- NOTE: In some installations of R there is an executable called pager that clashes with the system pager. This causes man to fail. Check with a
man ls
and if the problem exists, add the following to .bashrc:
export MANPAGER=less
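When the setup is repeated or scripted, the same export line can end up in .bashrc several times. A minimal sketch of an idempotent append follows; `append_once` is a hypothetical helper name, not an existing utility.

```shell
# Hypothetical helper: append a line to a file only if that exact
# line is not already present.
append_once() {
    line="$1"; file="$2"
    touch "$file"
    # -x: match whole lines, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example:
# append_once 'export RNA_HOME=~/workspace/rnaseq' ~/.bashrc
# append_once 'export MANPAGER=less' ~/.bashrc
```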
Install RNA-seq software
- These install instructions should be identical to those found on https://github.com/griffithlab/rnaseq_tutorial/wiki/Installation except that each tool is installed in /home/ubuntu/bin/ and its install location is exported to the $PATH variable for easy access.
Create directory to install software to and set up path variables
mkdir ~/bin
cd ~/bin
WORKSPACE=/home/ubuntu/workspace
HOME=/home/ubuntu
Install SAMtools
wget https://github.com/samtools/samtools/releases/download/1.11/samtools-1.11.tar.bz2
bunzip2 samtools-1.11.tar.bz2
tar -xvf samtools-1.11.tar
cd samtools-1.11
make
./samtools
export PATH=/home/ubuntu/bin/samtools-1.11:$PATH
Install bam-readcount
cd ~/bin
export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.11
git clone https://github.com/genome/bam-readcount.git
cd bam-readcount
cmake -Wno-dev .
make
./bin/bam-readcount
export PATH=/home/ubuntu/bin/bam-readcount/bin:$PATH
Install HISAT2
uname -m
cd ~/bin
curl -s https://cloud.biohpc.swmed.edu/index.php/s/4pMgDq4oAF9QCfA/download > hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
cd hisat2-2.2.1
./hisat2 -h
export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH
Install StringTie
cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.1.4.Linux_x86_64.tar.gz
tar -xzvf stringtie-2.1.4.Linux_x86_64.tar.gz
cd stringtie-2.1.4.Linux_x86_64
./stringtie -h
export PATH=/home/ubuntu/bin/stringtie-2.1.4.Linux_x86_64:$PATH
Install gffcompare
cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.1.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.12.1.Linux_x86_64.tar.gz
cd gffcompare-0.12.1.Linux_x86_64/
./gffcompare
export PATH=/home/ubuntu/bin/gffcompare-0.12.1.Linux_x86_64:$PATH
Install htseq-count
cd ~/bin
git clone https://github.com/htseq/htseq.git
cd htseq/
git fetch --all --tags
git checkout release_0.12.4
python setup.py install --user
chmod +x scripts/htseq-count
./scripts/htseq-count -h
export PATH=/home/ubuntu/bin/htseq/scripts:$PATH
Install TopHat
cd ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64/
./gtf_to_fasta
export PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH
Install kallisto
cd ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0/
./kallisto
export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH
Install FastQC
cd ~/bin
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
cd FastQC/
chmod 755 fastqc
./fastqc --help
export PATH=/home/ubuntu/bin/FastQC:$PATH
Install MultiQC
cd ~/bin
pip3 install --user multiqc
export PATH=/home/ubuntu/.local/bin:$PATH
python3 -m multiqc --help
Install Picard
cd ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.23.8/picard.jar -O picard.jar
java -jar ~/bin/picard.jar
Install Flexbar
cd ~/bin
wget https://github.com/seqan/flexbar/releases/download/v3.5.0/flexbar-3.5.0-linux.tar.gz
tar -xzvf flexbar-3.5.0-linux.tar.gz
cd flexbar-3.5.0-linux/
export LD_LIBRARY_PATH=~/bin/flexbar-3.5.0-linux:$LD_LIBRARY_PATH
./flexbar
export PATH=/home/ubuntu/bin/flexbar-3.5.0-linux:$PATH
Install Regtools
cd ~/bin
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools
export PATH=/home/ubuntu/bin/regtools/build:$PATH
Install RSeQC
pip3 install RSeQC
~/.local/bin/read_GC.py
export PATH=~/.local/bin/:$PATH
Install bedops
cd ~/bin
mkdir bedops_linux_x86_64-v2.4.39
cd bedops_linux_x86_64-v2.4.39
wget -c https://github.com/bedops/bedops/releases/download/v2.4.39/bedops_linux_x86_64-v2.4.39.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.39.tar.bz2
./bin/bedops
export PATH=~/bin/bedops_linux_x86_64-v2.4.39/bin:$PATH
Install gtfToGenePred
cd ~/bin
mkdir gtfToGenePred
cd gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred
export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH
Install genePredToBed
cd ~/bin
mkdir genePredtoBed
cd genePredtoBed
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod a+x genePredToBed
./genePredToBed
export PATH=/home/ubuntu/bin/genePredtoBed:$PATH
Install how_are_we_stranded_here
pip3 install git+https://github.com/kcotto/how_are_we_stranded_here.git
check_strandedness
Install Cell Ranger
- Must register to get download link
cd ~/bin
wget -O cellranger-4.0.0.tar.gz "[download_link]"
tar -xzvf cellranger-4.0.0.tar.gz
export PATH=/home/ubuntu/bin/cellranger-4.0.0:$PATH
Install R
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/'
sudo apt-get update
sudo apt-get install r-base r-base-core r-recommended
Note: if X11 libraries are not available you may need to use --with-x=no during config; on a regular Linux system you would not use this option. Also, linking the R-patched bin directory into your PATH may cause weird things to happen, such as man pages or git log failing to display. This can be circumvented by directly linking the R* executables (R, Rscript, RCmd, etc.) into a PATH directory.
R Libraries
For this tutorial we require:
- devtools
- dplyr
- gplots
- ggplot2
- Seurat
- sctransform
- RColorBrewer
- ggthemes
- cowplot
- data.table
- Rtsne
- gridExtra
- UpSetR
R
install.packages(c("devtools","dplyr","gplots","ggplot2","Seurat","sctransform","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR"),repos="http://cran.us.r-project.org")
quit(save="no")
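Because some of these packages take a long time to build and can fail quietly, it may help to check from the shell which ones R can actually load. The sketch below is one way to do that, assuming `Rscript` is on the PATH; `missing_r_packages` is a hypothetical helper name.

```shell
# Hypothetical helper: print the listed R packages that cannot be loaded.
# Returns 127 if Rscript itself is not available.
missing_r_packages() {
    command -v Rscript >/dev/null 2>&1 || { echo "Rscript not found" >&2; return 127; }
    # Package names are passed as trailing arguments and read with commandArgs().
    Rscript \
        -e 'pkgs <- commandArgs(trailingOnly = TRUE)' \
        -e 'ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)' \
        -e 'cat(pkgs[!ok], sep = "\n")' \
        "$@"
}

# Example: no output means every package loaded.
# missing_r_packages devtools dplyr gplots ggplot2 Seurat sctransform
```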
Bioconductor libraries
For this tutorial we require genefilter, ballgown, edgeR, GenomicRanges, rhdf5, biomaRt, scran, and sva:
R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva"))
quit(save="no")
Install CONICSmat
R
install.packages("devtools")
devtools::install_github("diazlab/CONICS/CONICSmat", dep = TRUE)
Install Signac
R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()
# Tell R to also check bioconductor when installing dependencies
setRepositories(ind=1:2)
# Install Signac (GO.db must be installed with Bioconductor)
install.packages("devtools")
BiocManager::install(c("GO.db","DirichletMultinomial"))
devtools::install_github("timoast/signac")
quit(save="no")
Install Sleuth
R
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
quit(save="no")
Install TABIX (GEMINI pre-req)
sudo apt-get install tabix
Install BWA
git clone https://github.com/lh3/bwa.git
cd bwa
make
Install bedtools
wget https://github.com/arq5x/bedtools2/releases/download/v2.29.1/bedtools-2.29.1.tar.gz
tar -zxvf bedtools-2.29.1.tar.gz
cd bedtools2
make
Set up Apache web server
We will start an apache2 service and serve the contents of the students' home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by URL, etc. Note that when launching instances, a security group that allows HTTP access via port 80 will have to be selected or modified.
- Edit config to allow files to be served from outside /usr/share and /var/www
sudo vim /etc/apache2/apache2.conf
- Add the following content to apache2.conf
<Directory /home/ubuntu/>
Options Indexes FollowSymLinks
AllowOverride None
Require all granted
</Directory>
- Edit vhost file
sudo vim /etc/apache2/sites-available/000-default.conf
- Change document root in 000-default.conf
DocumentRoot /home/ubuntu
- Restart apache
sudo service apache2 restart
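If the image is rebuilt often, the DocumentRoot change can be made without opening an editor. The following is a sketch only, assuming GNU `sed`; `set_docroot` is a hypothetical helper, and the example works on a copy of the vhost file so the original stays untouched until the result has been reviewed.

```shell
# Hypothetical helper: point every DocumentRoot directive in a
# vhost file at a new directory, keeping the original indentation.
set_docroot() {
    conf="$1"; newroot="$2"
    sed -i -E "s|^([[:space:]]*DocumentRoot[[:space:]]+).*|\1${newroot}|" "$conf"
}

# Example (on the instance):
# cp /etc/apache2/sites-available/000-default.conf /tmp/000-default.conf
# set_docroot /tmp/000-default.conf /home/ubuntu
# sudo cp /tmp/000-default.conf /etc/apache2/sites-available/000-default.conf
# sudo apache2ctl configtest && sudo service apache2 restart
```

Running `apache2ctl configtest` before the restart reports syntax errors in the edited files without taking the server down.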
Save a public AMI
Finally, save the instance as a new AMI by right clicking the instance and clicking on “Create Image”. Enter an appropriate name and description and then save. If desired, you may choose at this time to include the workspace snapshot in the AMI to avoid having to explicitly attach it later at launching of AMI instances. Change the permissions of the AMI to “public” if you would like it to be listed under the Community AMIs. Copy the AMI to any additional regions where you would like it to appear in Community AMI searches.
Current Public AMIs
- cshl-seqtec-2019 (ami-018b3bf40f9926ac5; N. Virginia)
- cshl-seqtech-2020 (ami-09ecbedc3b79937e3; N. Virginia)
Modification for SeqTech 2020:
cd /
sudo mkdir workspace2
sudo mount /dev/nvme1n1 /workspace2
echo -e "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0\n/dev/nvme0n1 /workspace ext4 defaults,nofail 0 2\n/dev/nvme1n1 /workspace2 ext2 defaults,nofail 0 2" | sudo tee /etc/fstab
cd workspace2/
sudo chown -R ubuntu:ubuntu /workspace2
rsync -av /workspace/* /workspace2/
rm -rf lost+found/
cd
rm -rf workspace
ln -s /workspace2 workspace
Create IAM account
From AWS Console select Services -> IAM. Go to Users, Create User, specify a user name, and Create. Download credentials to a safe location for later reference if needed. Select the new user and go to Security Credentials -> Manage Password -> ‘Assign a Custom Password’. Go to Groups -> Create a New Group, specify a group name and Next. Attach a policy to the group. In this case we give all EC2 privileges but no other AWS privileges by specifying “AmazonEC2FullAccess”. Hit Next, review and then Create Group. Select the Group -> Add Users to Group, select your new user to add it to the new group.
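The same user, group, and policy setup can also be done with the AWS CLI. The sketch below only prints the commands so they can be reviewed first; it assumes the AWS CLI is installed and configured with credentials that may manage IAM, and `iam_setup_commands` is a hypothetical helper. Assigning a console password is not covered here.

```shell
# Hypothetical helper: print the AWS CLI commands that create a user and a
# group, attach the AmazonEC2FullAccess policy, and add the user to the group.
iam_setup_commands() {
    user="$1"; group="$2"
    echo "aws iam create-user --user-name $user"
    echo "aws iam create-group --group-name $group"
    echo "aws iam attach-group-policy --group-name $group --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess"
    echo "aws iam add-user-to-group --user-name $user --group-name $group"
}

# Example: review the output, then run the printed commands.
# iam_setup_commands [user-name] [group-name]
```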
Launch student instance
- Go to AWS console. Login. Select EC2.
- Launch Instance, search for “cshl-seqtech-2020” in Community AMIs and Select.
- Choose “m5.2xlarge” instance type.
- Set the number of instances to launch (e.g., one per student and instructor), and select “Protect against accidental termination”
- Make sure that you see two snapshots (e.g., the 32GB root volume and 80GB EBS volume you set up earlier)
- Create a tag with name=StudentName
- Choose the existing security group called “SSH_HTTP_8081_IN_ALL_OUT”. Review and Launch.
- Choose an existing key pair (CSHL.pem)
- View instances and wait for them to finish initiating.
- Find your instance in console and select it, then hit connect to get your public.ip.address.
- Login to node
ssh -i CSHL.pem ubuntu@[public.ip.address]
- Optional - set up DNS redirects (see below)
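With one instance per student, looking up each public IP address in the console gets tedious. The sketch below turns a list of name and IP pairs into ready-to-share login commands; `ssh_lines` is a hypothetical helper, and the commented AWS CLI call assumes the CLI is configured and that each instance carries a tag with the key `Name`.

```shell
# Hypothetical helper: read "name<TAB>ip" records on stdin and print
# one ssh command per record. The key file defaults to CSHL.pem.
ssh_lines() {
    key="${1:-CSHL.pem}"
    tab=$(printf '\t')
    while IFS="$tab" read -r name ip; do
        if [ -n "$ip" ]; then
            printf '%s: ssh -i %s ubuntu@%s\n' "$name" "$key" "$ip"
        fi
    done
}

# Example: list running instances as tab-separated name and public IP.
# aws ec2 describe-instances \
#     --filters Name=instance-state-name,Values=running \
#     --query 'Reservations[].Instances[].[Tags[?Key==`Name`]|[0].Value,PublicIpAddress]' \
#     --output text | ssh_lines CSHL.pem
```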
Set up a dynamic DNS service
Rather than handing out the IP address of each student instance, you can set up DNS records that redirect from a more human-readable name to the IP address. After spinning up all student instances, use a service like http://dyn.com (or http://entrydns.net, etc.) to create hostnames like
Host necessary files for the course
Currently, all miscellaneous data files, annotations, etc. are hosted on an ftp server at the Genome Institute. In the future more data files could be pre-loaded onto the EBS snapshot.
- Files copied to: /gscmnt/sata102/info/ftp-staging/pub/rnaseq/
- Appear here: http://genome.wustl.edu/pub/rnaseq/
After course reminders
- Delete the student IAM account created above; otherwise students will continue to have EC2 privileges.
- Terminate all instances and clean up any unnecessary volumes, snapshots, etc.