Welcome to the blog

Posts

My thoughts and ideas

GCP Setup | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

GCP Setup

UNDER DEVELOPMENT

Google Cloud Platform setup for use in workshop

This tutorial explains how a Google Cloud Instance can be configured from scratch for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers that wish to conduct their hands on exercises on Google GCP.

Create a Google Cloud account

  1. You will need a Google account (personal or institutional)
  2. Use the above email account to log into the Google Cloud Console: https://console.cloud.google.com/. Note: Any GCP account needs to be linked to an actual person/credit card account or institutional billing account.
  3. Create a Google Cloud Project connected to a billing source
  4. Optional - Set up an IAM account. Details to be resolved…
  5. Request limit increases. You need to be able to spin up at least one instance for every student and TA/instructor. To find current limits and request increases in the console, go to: IAM & Admin -> Quotas.
  6. In the GCP console: Go to Compute Engine -> VM instances.

Start with existing base image

  1. Create Instance
  2. Give the Instance a Name (e.g. rnabio-course-2023)
  3. Select Machine Type (e.g. E2 Series: e2-standard-2)
  4. Change the Boot disk to Ubuntu -> Ubuntu 20.04 LTS (x86/64). Change size to 250 GB.
  5. Under Firewall, select these options: Allow HTTP traffic and Allow HTTPS traffic
  6. Hit the Create button

Install the Google Cloud SDK (gcloud), authenticate your user, and login to your VM

  1. Install the Google Cloud commandline interface following instructions here: https://cloud.google.com/sdk/docs/install
  2. Use the following command and follow instructions to authenticate your user: gcloud auth login
  3. Set the project to the google billing project created above as follows: gcloud config set project $project_name
  4. Check the authentication configuration as follows: gcloud config list
  5. Log into the instance using the instance name chosen above follows
gcloud compute ssh rnabio-course-2023

Set up the ubuntu user:

Logging into a Google VM of Ubuntu is a bitter different from AWS. By default you will login with your Google user name (or is it your username from the host machine you login from?) instead of using the “ubuntu” user.

Set password for the ubuntu user (and make note of this somewhere safe). Then change users to the “ubuntu” user before proceeding with the rest of this setup. Note that later if you login as another sudo user, if you need to, you should be able to reset the password associated with the ubuntu user.

whoami
sudo passwd ubuntu
su ubuntu
cd ~

Perform basic linux configuration

  • To allow installation of bioinformatics tools some basic dependencies must be installed first.
sudo apt-get update
sudo apt-get upgrade
sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev libbz2-dev docker.io libpcre2-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libdbi-perl libdbd-mysql-perl libcurl4-openssl-dev
sudo ln -s /usr/include/jsoncpp/json/ /usr/include/json
sudo timedatectl set-timezone America/Chicago
  • logout and log back in for changes to take effect.
exit
exit
gcloud compute ssh rnabio-course-2023
su ubuntu
cd ~

Add ubuntu user to docker group

sudo usermod -aG docker ubuntu

Then exit shell and log back into instance.

Install any desired informatics tools

  • NOTE: R in particular is a slow install.
  • NOTE:
- All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.
  • Paths to pre-installed tools can be added to the .bashrc file. It may also be convenient to add export RNA_HOME=~/workspace/rnaseq to the .bashrc file. See https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc.
  • NOTE: In some installations of R there is an executable called pager that clashes with the system pager. This causes man to fail. Check with a man ls and if the problem exists, add the following to .bashrc:
export MANPAGER=less

Install RNA-seq software

Create directory to install software to and setup path variables

mkdir ~/bin
cd bin
WORKSPACE=/home/ubuntu/workspace
HOME=/home/ubuntu

Install SAMtools

cd ~/bin
wget https://github.com/samtools/samtools/releases/download/1.16.1/samtools-1.16.1.tar.bz2
bunzip2 samtools-1.16.1.tar.bz2
tar -xvf samtools-1.16.1.tar
cd samtools-1.16.1
make
./samtools
export PATH=/home/ubuntu/bin/samtools-1.16.1:$PATH

Install bam-readcount

cd ~/bin
export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.16.1
git clone https://github.com/genome/bam-readcount 
cd bam-readcount
mkdir build
cd build
cmake ..
make
export PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH

Install HISAT2

uname -m
cd ~/bin
curl -s https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download > hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
cd hisat2-2.2.1
./hisat2 -h
export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH

Install StringTie

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.1.6.tar.gz
tar -xzvf stringtie-2.1.6.tar.gz
cd stringtie-2.1.6
make release
export PATH=/home/ubuntu/bin/stringtie-2.1.6:$PATH

Install gffcompare

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.12.6.Linux_x86_64.tar.gz
cd gffcompare-0.12.6.Linux_x86_64/
./gffcompare
export PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH

Install htseq-count

sudo apt install python3-htseq

Make sure that OpenSSL is on correct version

TopHat will not install if the version of OpenSSL is too old.

To get version:

openssl version

If version is OpenSSL 1.1.1f, then it needs to be updated using the following steps.

cd ~/bin
wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz
tar -zxf openssl-1.1.1g.tar.gz && cd openssl-1.1.1g
./config
make
make test
sudo mv /usr/bin/openssl ~/tmp #in case install goes wrong
sudo make install
sudo ln -s /usr/local/bin/openssl /usr/bin/openssl
sudo ldconfig

Again, from the terminal issue the command:

openssl version

Your output should be as follows:

OpenSSL 1.1.1g  21 Apr 2020

Then create ~/.wgetrc file and add to it ca_certificate=/etc/ssl/certs/ca-certificates.crt using vim or nano.

Install TopHat

cd ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64/
./gtf_to_fasta
export PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH

Install kallisto

cd ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0/
./kallisto
export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH

Install FastQC

cd ~/bin
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
cd FastQC/
chmod 755 fastqc
./fastqc --help
export PATH=/home/ubuntu/bin/FastQC:$PATH

Intall a particular version of numpy that hopefully works with all the dependencies that rely on it

cd ~/bin
pip install --force-reinstall -v "numpy==1.24.1"

Install MultiQC

cd ~/bin
export PATH=/home/ubuntu/.local/bin:$PATH
pip3 install multiqc
multiqc --help

Install Picard

cd ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar -O picard.jar
java -jar ~/bin/picard.jar

Install Flexbar

sudo apt install flexbar

Install Regtools

cd ~/bin
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools
export PATH=/home/ubuntu/bin/regtools/build:$PATH

Install RSeQC

pip3 install RSeQC
~/.local/bin/read_GC.py
export PATH=/home/ubuntu/.local/bin/:$PATH

Install bedops

cd ~/bin
mkdir bedops_linux_x86_64-v2.4.40
cd bedops_linux_x86_64-v2.4.40
wget -c https://github.com/bedops/bedops/releases/download/v2.4.40/bedops_linux_x86_64-v2.4.40.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.40.tar.bz2
./bin/bedops
export PATH=/home/ubuntu/bin/bedops_linux_x86_64-v2.4.40/bin:$PATH

Install gtfToGenePred

cd ~/bin
mkdir gtfToGenePred
cd gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred
export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH

Install genePredToBed

cd ~/bin
mkdir genePredtoBed
cd genePredtoBed
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod a+x genePredToBed
./genePredToBed
export PATH=/home/ubuntu/bin/genePredToBed:$PATH

Install Cell Ranger

  • Must register to get download link
cd ~/bin
wget `download_link`
tar -xzvf cellranger-7.1.0.tar.gz
cd cellranger-7.1.0
./bin/cellranger
export PATH=/home/ubuntu/bin/cellranger-7.1.0:$PATH

Install TABIX

sudo apt-get install tabix

Install BWA

cd ~/bin
git clone https://github.com/lh3/bwa.git
cd bwa
make
export PATH=/home/ubuntu/bin/bwa:$PATH

Install bedtools

cd ~/bin
wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz
tar -zxvf bedtools-2.30.0.tar.gz
cd bedtools2
make
export PATH=/home/ubuntu/bin/bedtools2/bin:$PATH

Install BCFtools

cd ~/bin
wget wget https://github.com/samtools/bcftools/releases/download/1.16/bcftools-1.16.tar.bz2
bunzip2 bcftools-1.16.tar.bz2
tar -xvf bcftools-1.16.tar
cd bcftools-1.16
make
./bcftools
export PATH=/home/ubuntu/bin/bcftools-1.14:$PATH

Install htslib

cd ~/bin
wget https://github.com/samtools/htslib/releases/download/1.16/htslib-1.16.tar.bz2
bunzip2 htslib-1.16.tar.bz2
tar -xvf htslib-1.16.tar
cd htslib-1.16
make
./htsfile
export PATH=/home/ubuntu/bin/htslib-1.14:$PATH

Install peddy

cd ~/bin
git clone https://github.com/brentp/peddy
cd peddy
pip install -r requirements.txt
pip install --editable .
python -m peddy -h

Install slivar

cd ~/bin
wget https://github.com/brentp/slivar/releases/download/v0.2.7/slivar
chmod +x ./slivar
./slivar
export PATH=/home/ubuntu/bin:$PATH

Install STRling

cd ~/bin
wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.1/strling
chmod +x ./strling
./strling -h
export PATH=/home/ubuntu/bin:$PATH

Install freebayes

sudo apt install freebayes

Install vcflib

sudo apt install libvcflib-tools libvcflib-dev

Install Anaconda

cd ~/bin
wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
bash Anaconda3-2022.10-Linux-x86_64.sh

Press Enter to review the license agreement. Then press and hold Enter to scroll.

Enter “yes” to agree to the license agreement.

Saved the installation to /home/ubuntu/bin/anaconda3 and chose yes to initializng Anaconda3.

Install [VEP]

Describes dependencies for VEP 108, used in this course for variant annotation. When running the VEP installer follow the prompts specified:

  1. Do you want to install any cache files (y/n)? n [ENTER] (select number for homo_sapiens_vep_108_GRCh38.tar.gz) [ENTER]
  2. Do you want to install any FASTA files (y/n)? n [ENTER] (select number for homo_sapiens) [ENTER]
  3. Do you want to install any plugins (y/n)? n [ENTER]

The VEP cache and FASTA files are very large ~25G or more. Probably do NOT want to install these as part of an image, but it should be possible to rerun this tool and install them later as needed.

mkdir ~/workspace
cd ~/bin
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl --CACHEDIR ~/workspace/ensembl-vep/
export PATH=/home/ubuntu/bin/ensembl-vep:$PATH

Set up Jupyter to render in web brower

Followed this website

First, we need to add Jupyter to the system’s path (you can check if it is already on the path by running: which python, if no path is returned you need to add the path) To add Jupyter functionality to your terminal, add the following line of code to your .bashrc file:

export PATH=/home/ubuntu/anaconda3/bin:$PATH

Then you need to source the .bashrc for changes to take effect.

source .bashrc

We then need to create our Jupyter configuration file. In order to create that file, you need to run:

jupyter notebook --generate-config

After creating your configuration file, you will need to generate a password for your Jupyter Notebook using ipython:

Enter the IPython command line:

ipython

Now follow these steps to generate your password:

from IPython.lib import passwd

passwd()

exit

You will be prompted to enter and re-enter your password. IPython will then generate a hash output, COPY THIS AND SAVE IT FOR LATER. We will need this for our configuration file.

Next go into your jupyter config file:

cd ~/.jupyter/

vim jupyter_notebook_config.py

Note: You may need first to run exit in order to exit IPython otherwise the vim command may not be recognized by the terminal.

And add the following code:

conf = get_config()

conf.NotebookApp.ip = '0.0.0.0'
conf.NotebookApp.password = u'YOUR PASSWORD HASH'
conf.NotebookApp.port = 8888
# Note: this code below should be put at the beginning of the document.

We then need to create a directory for your notebooks. In order to make a folder to store all of your Jupyter Notebooks simply run:

cd ~/workspace
mkdir Jupyter_Notebooks

You can call this folder anything, for this example we call it Notebooks

After the previous step, you should be ready to run your notebook and access your EC2 server. To run your Notebook simply run the command:

jupyter notebook

From there you should be able to access your server by going to:

https://(your GCP public IP):8888/

Note that in order for this to work you need to have allowed external access to this machine over port 8888. In the GCP firewall settings this means adding a new firewall rull:

  • Name: default-allow-jupyter
  • Direction of traffic: Ingress
  • Action on match: Allow
  • Targets: All instances in the network
  • Source IP ranges: 0.0.0.0/0
  • Specified protocols and ports: tcp:8888

Install R

sudo apt-get -y remove r-base-core
sudo apt-get -y remove r-base
sudo apt install dirmngr gnupg apt-transport-https ca-certificates software-properties-common
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
sudo apt install r-base
R --version

#make R library location accessible
sudo chown -R ubuntu:ubuntu /usr/local/lib/R/
chmod -R 775 /usr/local/lib/R

R Libraries

For this tutorial we require:

R
install.packages(c("devtools","dplyr","gplots","ggplot2","Seurat","sctransform","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR"),repos="http://cran.us.r-project.org")
quit(save="no")

Bioconductor libraries

For this tutorial we require:

R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva","gage","org.Hs.eg.db"))
quit(save="no")

Install Sleuth

R
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
quit(save="no")

Path setup

Add the following lines to the .bashrc using vim to ensure that all tools install are in the ubuntu user path:

PATH=/home/ubuntu/bin/samtools-1.16.1:$PATH
PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH
PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH
PATH=/home/ubuntu/bin/stringtie-2.1.6:$PATH
PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH
PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH
PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH
PATH=/home/ubuntu/bin/FastQC:$PATH
PATH=/home/ubuntu/.local/bin:$PATH
PATH=/home/ubuntu/bin/regtools/build:$PATH
PATH=/home/ubuntu/bin/bedops_linux_x86_64-v2.4.40/bin:$PATH
PATH=/home/ubuntu/bin/gtfToGenePred:$PATH
PATH=/home/ubuntu/bin/genePredToBed:$PATH
PATH=/home/ubuntu/bin/cellranger-7.1.0:$PATH
PATH=/home/ubuntu/bin/bwa:$PATH
PATH=/home/ubuntu/bin/bedtools2/bin:$PATH
PATH=/home/ubuntu/bin/bcftools-1.14:$PATH
PATH=/home/ubuntu/bin/htslib-1.14:$PATH
PATH=/home/ubuntu/bin/ensembl-vep:$PATH

For 2021 version of the course, rather than exporting each tool’s individual path. I moved all of the subdirs to ~/src and cp all of the binaries from there to ~/bin so that PATH is less complex.

Set up Apache web server

We will start an apache2 service and serve the contents of the students home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by url, etc. Note that when launching instances a security group will have to be selected/modified that allows http access via port 80.

  • Edit config to allow files to be served from outside /usr/share and /var/www
sudo vim /etc/apache2/apache2.conf
  • Add the following content to apache2.conf
<Directory /home/ubuntu/workspace/>
       Options Indexes FollowSymLinks
       AllowOverride None
       Require all granted
</Directory>
  • Edit vhost file
sudo vim /etc/apache2/sites-available/000-default.conf
  • Change document root in 000-default.conf
DocumentRoot /home/ubuntu/workspace
  • Restart apache
sudo service apache2 restart

Test by going to your instance’s public IP address in your browser.

Create a public Google cloud image using the GCP console

  1. Under Compute Engine -> Virtual Machines -> VM Instances. Stop the instance.
  2. Under Compute Engine -> Storage -> Images. Create Image.
  3. Provide a name for the image (e.g. rnabio-course-2023-v1).
  4. Select Source -> Disk
  5. Under `Source disk’ -> Choose the name of the stopped instance (e.g. rnabio-course-2023)
  6. Select Location -> Multi-regional
  7. Select location -> us (multiple regions in the United States)
  8. Leave Family blank, but add a description.
  9. Encryption -> Google-managed encryption key.

To make the image fully public execute the following Google SDK command:


gcloud compute images add-iam-policy-binding rnabio-course-2023-v2 --member='allAuthenticatedUsers' --role='roles/compute.imageUser'

To list the image from the command line:

gcloud compute images list --filter="name=rnabio-course-2023-v2"

Current Public Google Images

  • rnabio-course-2023-v2

Launch student instance using this image

To start a new VM with the public image above one can use the GCP console as was done above to create a new VM with vanilla ubuntu, except this time selecting the pre-configured image with all tool installed already.

We have been unable to get this to work using the Console. It seems listing custom public images is not working there… ?

From the command line you can launch an instance as follows (you should probably personalize the malachi-course-2023 name used in two places of this command):

gcloud compute instances create malachi-course-2023 --zone=us-central1-a --machine-type=e2-standard-4 --network-interface=network-tier=PREMIUM,subnet=default --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,image=projects/griffith-lab/global/images/rnabio-course-2023-v2,mode=rw,size=250,type=pd-balanced,device-name=malachi-course-2023 

Initial setup upon first login

You will want to do everything on this VM as the “ubuntu” user. First set the password for that user and then change to it.

Login as follows

gcloud compute ssh ubuntu@malachi-course-2023

Test environment

bwa mem
env
AWS Setup | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

AWS Setup

Amazon AWS/AMI setup for use in workshop

This tutorial explains how Amazon cloud instances were configured for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers that wish to conduct their hands on exercises on Amazon AWS.

Create AWS account

A helpful tutorial can be found here

  1. Create a new gmail account to use for the course
  2. Use the above email account to set up a new AWS/Amazon user account. Note: Any AWS account needs to be linked to an actual person and credit card account.
  3. Optional - Set up an IAM account. Give this account full EC2 but no other permissions. This provides an account that can be shared with other instructors but does not have access to billing and other root account privelages.
  4. Request limit increase for limit types you will be using. You need to be able to spin up at least one instance of the desired type for every student and TA/instructor. See: http://aws.amazon.com/about-aws/whats-new/2014/06/19/amazon-ec2-service-limits-report-now-available/. Note: You need to request an increase for each instance type and region you might use.
  5. Sign into AWS Management Console: http://aws.amazon.com/console/
  6. Go to EC2 services

Start with existing community AMI

  1. Launch a fresh Ubuntu Image (Ubuntu Server 22.04 LTS at the time of writing this). Choose an instance type of m6a.xlarge. Increase root volume (e.g., 60GB)(type:gp3) and add a second volume (e.g., 500GB)(type:gp3). Choose appropriate security group (for 2024 course, choose security group SSH/HTTP/Jupyter (cshl-2023)). Review and Launch. If necessary, create a new key pair, name and save somewhere safe. Select ‘View Instances’. Take note of public IP address of newly launched instance.
  2. Change permissions on downloaded key pair with chmod 400 [instructor-key].pem
  3. Login to instance with ubuntu user:

ssh -i [instructor-key].pem ubuntu@[public.ip.address]

Perform basic linux configuration

sudo apt-get update
sudo apt-get upgrade
sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev libbz2-dev docker.io libpcre2-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libdbi-perl libdbd-mysql-perl
sudo ln -s /usr/include/jsoncpp/json/ /usr/include/json
sudo timedatectl set-timezone America/New_York

Add ubuntu user to docker group

sudo usermod -aG docker ubuntu

Then exit shell and log back into instance.

Set up additional storage for workspace

We first need to setup the additional storage volume that we added when we created the instance.

# Create mountpoint for additional storage volume
cd /
sudo mkdir workspace

# Mount ephemeral storage
cd
sudo mkfs -t ext4 /dev/nvme1n1
sudo mount /dev/nvme1n1 /workspace

In order to make the workspace volume persistent, we need to edit the etc/fstab file in order. AWS provides instructions for how to do this here.

# Make ephemeral storage mounts persistent
# See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html for guidance on setting up fstab records for AWS

# get UUID from sudo lsblk -f
UUID=$(sudo lsblk -f | grep nvme1n1 | awk {'print $4'})
#if want to double check, can do 'echo $UUID' to see the UUID. 

#then add that UUID to /etc/fstab
echo -e "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0\nUUID=$UUID /workspace ext4 defaults,nofail 0 2" | sudo tee /etc/fstab
#'less /etc/fstab' , to see if the new line has been added

# Change permissions on required drives
sudo chown -R ubuntu:ubuntu /workspace

# Create symlink to the added volume in your home directory
cd ~
ln -s /workspace workspace

Install any desired informatics tools

- All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.
export MANPAGER=less

Install RNA-seq software

Create directory to install software to and setup path variables

mkdir ~/bin
cd bin
WORKSPACE=/home/ubuntu/workspace
HOME=/home/ubuntu

Install SAMtools

~/bin
wget https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2
bunzip2 samtools-1.18.tar.bz2
tar -xvf samtools-1.18.tar
cd samtools-1.18
make
./samtools
#add the following line to .bashrc
export PATH=/home/ubuntu/bin/samtools-1.18:$PATH
export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.18

Install bam-readcount

cd ~/bin
git clone https://github.com/genome/bam-readcount 
cd bam-readcount
mkdir build
cd build
cmake ..
make
export PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH

Install HISAT2

uname -m
cd ~/bin
curl -s https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download > hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
cd hisat2-2.2.1
./hisat2 -h
export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH

Install StringTie

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.tar.gz
tar -xzvf stringtie-2.2.1.tar.gz
cd stringtie-2.2.1
make release
export PATH=/home/ubuntu/bin/stringtie-2.2.1:$PATH

Install gffcompare

cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.12.6.Linux_x86_64.tar.gz
cd gffcompare-0.12.6.Linux_x86_64/
./gffcompare
export PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH

Install htseq-count

sudo apt install python3-htseq
# to check version,type : htseq-count --version

Make sure that OpenSSL is on correct version

TopHat will not install if the version of OpenSSL is too old.

To get version:

openssl version

If version is OpenSSL 1.1.1f, then it needs to be updated using the following steps.

cd ~/bin
wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz
tar -zxf openssl-1.1.1g.tar.gz && cd openssl-1.1.1g
./config
make
make test
sudo mv /usr/bin/openssl ~/tmp #in case install goes wrong
sudo make install
sudo ln -s /usr/local/bin/openssl /usr/bin/openssl
sudo ldconfig

Again, from the terminal issue the command:

openssl version

Your output should be as follows:

OpenSSL 1.1.1g  21 Apr 2020

Then create ~/.wgetrc file and add to it ca_certificate=/etc/ssl/certs/ca-certificates.crt using vim or nano.

Install TopHat

cd ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64
./gtf_to_fasta
export PATH=$/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH

Install kallisto

Note: There are a couple of arguments only supported in kallisto legacy versions (version before 0.50.0). Also how_are_we_stranded_here uses kallisto == 0.44.x. Thus, installation steps below if for 1 of the legacy versions. But if run into problem, consider using a more updated version.

cd ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0
./kallisto
export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH

Install FastQC

cd ~/bin
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
cd FastQC
chmod 755 fastqc
./fastqc --help
export PATH=/home/ubuntu/bin/FastQC:$PATH

Install MultiQC

cd ~/bin
pip3 install multiqc
export PATH=/home/ubuntu/.local/bin:$PATH
multiqc --help

Install Picard

cd ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar -O picard.jar
java -jar ~/bin/picard.jar

export PICARD=/home/ubuntu/bin/picard.jar

Install Flexbar

sudo apt install flexbar

Install Regtools

cd ~/bin
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools
export PATH=/home/ubuntu/bin/regtools/build:$PATH

Install RSeQC

pip3 install RSeQC
~/.local/bin/read_GC.py
export PATH=~/.local/bin/:$PATH

Install bedops

cd ~/bin
mkdir bedops_linux_x86_64-v2.4.41
cd bedops_linux_x86_64-v2.4.41
wget -c https://github.com/bedops/bedops/releases/download/v2.4.41/bedops_linux_x86_64-v2.4.41.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.41.tar.bz2
./bin/bedops

export PATH=~/bin/bedops_linux_x86_64-v2.4.41/bin:$PATH

Install gtfToGenePred

cd ~/bin
mkdir gtfToGenePred
cd gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred
export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH

Install genePredToBed

cd ~/bin
mkdir genePredtoBed
cd genePredtoBed
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod a+x genePredToBed
./genePredToBed
export PATH=/home/ubuntu/bin/genePredtoBed:$PATH 
#note: the path has lowercase 't' at in 'genePredtoBed'
#genePredToBed 

Install how_are_we_stranded_here

pip3 install git+https://github.com/kcotto/how_are_we_stranded_here.git
check_strandedness

Install Cell Ranger

cd ~/bin
wget `download_link`
tar -xzvf cellranger-7.2.0.tar.gz
export PATH=/home/ubuntu/bin/cellranger-7.2.0:$PATH

Install TABIX

sudo apt-get install tabix

Install BWA

cd ~/bin
git clone https://github.com/lh3/bwa.git
cd bwa
make
export PATH=/home/ubuntu/bin/bwa:$PATH
#bwa mem #to call bwa

Install bedtools

cd ~/bin
wget https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools-2.31.0.tar.gz
tar -zxvf bedtools-2.31.0.tar.gz
cd bedtools2
make
export PATH=/home/ubuntu/bin/bedtools2/bin:$PATH

Install BCFtools

cd ~/bin
wget https://github.com/samtools/bcftools/releases/download/1.18/bcftools-1.18.tar.bz2
bunzip2 bcftools-1.18.tar.bz2
tar -xvf bcftools-1.18.tar
cd bcftools-1.18
make
./bcftools
export PATH=/home/ubuntu/bin/bcftools-1.18:$PATH

Install htslib

cd ~/bin
wget https://github.com/samtools/htslib/releases/download/1.18/htslib-1.18.tar.bz2
bunzip2 htslib-1.18.tar.bz2
tar -xvf htslib-1.18.tar
cd htslib-1.18
make
sudo make install
#htsfile --help
export PATH=/home/ubuntu/bin/htslib-1.18:$PATH

Install peddy

cd ~/bin
git clone https://github.com/brentp/peddy
cd peddy
pip install -r requirements.txt
pip install --editable .

Install slivar

cd ~/bin
wget https://github.com/brentp/slivar/releases/download/v0.3.0/slivar
chmod +x ./slivar

Install STRling

cd ~/bin
wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.2/strling
chmod +x ./strling

Install freebayes

sudo apt install freebayes

Install vcflib

sudo apt install libvcflib-tools libvcflib-dev

Install Anaconda

cd ~/bin
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh 
bash Anaconda3-2023.09-0-Linux-x86_64.sh

Press Enter to review the license agreement. Then press and hold Enter to scroll.

Enter “yes” to agree to the license agreement.

Saved the installation to /home/ubuntu/bin/anaconda3 and chose yes to initializng Anaconda3.

Add in bashrc:

export PATH=/home/ubuntu/bin/anaconda3/bin:$PATH

To see location of conda executable: which conda

Install VEP

Note: Install VEP in workspace because cache file for that takes a lot of space (~25G).

Describes dependencies for VEP 110, used in this course for variant annotation. When running the VEP installer follow the prompts specified:

  1. Do you want to install any cache files (y/n)? n (In case want to install cache file, choose ‘y’ [ENTER] (select number for homo_sapiens_vep_110_GRCh38.tar.gz) [ENTER] )
  2. Do you want to install any FASTA files (y/n)? y [ENTER] (select number for homo_sapiens) [ENTER]
  3. Do you want to install any plugins (y/n)? n [ENTER]
cd ~/workspace
sudo git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
sudo perl -MCPAN -e'install "LWP::Simple"'
sudo perl INSTALL.pl --CACHEDIR ~/workspace/ensembl-vep/
export PATH=/home/ubuntu/workspace/ensembl-vep:$PATH
#vep --help

Set up Jupyter to render in web brower

Followed this website and this website Note: The old jupyter notebook was split into jupyter-server and nbclassic. The steps to set up jupyter on ec2 in the first link therefore have been adapted based on suggestions in the second link to accommodate this migration.

First, we need to add Jupyter to the system’s path (you can check if it is already on the path by running: which python, if no path is returned you need to add the path) To add Jupyter functionality to your terminal, add the following line of code to your .bashrc file:

export PATH=/home/ubuntu/bin/anaconda3/bin:$PATH

Then you need to source the .bashrc for changes to take effect.

source .bashrc

We then need to create our Jupyter configuration file. In order to create that file, you need to run:

jupyter notebook --generate-config

( ~~~~~~~ Optional: After creating your configuration file, you will need to generate a password for your Jupyter Notebook using ipython:

Enter the IPython command line:

ipython

Now follow these steps to generate your password:

from notebook.auth import passwd

passwd()

You will be prompted to enter and re-enter your password. IPython will then generate a hash output, COPY THIS AND SAVE IT FOR LATER. We will need this for our configuration file.

Run exit in order to exit IPython. ~~~~~~ )

Next go into your jupyter config file (/home/ubuntu/.jupyter/jupyter_server_config.py) :

cd .jupyter

vim jupyter_notebook_config.py

And add the following code at the beginning of the document:

c = get_config() #add this line if it's not already in jupyter_notebook_config.py

c.ServerApp.ip = '0.0.0.0'
#c.ServerApp.password = u'YOUR PASSWORD HASH' #uncomment this line if decide to use password
c.ServerApp.port = 8888

(~~~~~~ Optional:

We then need to create a directory for your notebooks. In order to make a folder to store all of your Jupyter Notebooks simply run:

mkdir Notebooks

You can call this folder anything, for this example we call it Notebooks ~~~~~~ )

After the previous step, you should be ready to run your notebook and access your EC2 server. To run your Notebook simply run the command:

jupyter nbclassic

From there you should be able to access your server by going to:

http://(your AWS public dns):8888/ or http://(your AWS public dns):8888/(tree?token=... - in the message generated while running 'jupyter nbclassic')

(Note: if ever run into problem accessing server, double check whether you are using http or https. If you didnt add https port in security group configuration step when create the instance, then you wouldn’t be able to access server with https.)

Install R

Follow this guide from cran website to install R ver 4.4 website

# update indices
sudo apt update -qq
# install two helper packages we need
sudo apt install --no-install-recommends software-properties-common dirmngr
# add the signing key (by Michael Rutter) for these repos
# To verify key, run gpg --show-keys /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc 
# Fingerprint: E298A3A825C0D65DFD57CBB651716619E084DAB9
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
# add the R 4.0 repo from CRAN -- adjust 'focal' to 'groovy' or 'bionic' as needed
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"

#install R and its dependencies
sudo apt install --no-install-recommends r-base

Note, linking the R-patched bin directory into your PATH may cause weird things to happen, such as man pages or git log to not display. This can be circumvented by directly linking the R* executables (R, RScript, RCmd, etc.) into a PATH directory.

R Libraries

For this tutorial we require:

R
install.packages(c("devtools","dplyr","gplots","ggplot2","sctransform","Seurat","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR","tidyverse"),repos="http://cran.us.r-project.org")
quit(save="no")

Note: if asked if want to install in personal library, type ‘yes’.

Bioconductor libraries

For this tutorial we require:

R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva","gage","org.Hs.eg.db","DESeq2","apeglm"))
quit(save="no")

Install Sleuth

R
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
quit(save="no")

Install softwares for germline analyses

Install gatk

(Note: in cshl2023 version of the course, install this gatk 4.2.1.0 instead of an more updated ver since this work with the current Java version - Java ver 11)

cd ~/bin
wget https://github.com/broadinstitute/gatk/releases/download/4.2.1.0/gatk-4.2.1.0.zip
unzip gatk-4.2.1.0.zip

export PATH=/home/ubuntu/bin/gatk-4.2.1.0:$PATH #add to .bashrc
gatk --help
gatk --list

Install minimap2

cd ~/bin
curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf -
./minimap2-2.26_x64-linux/minimap2

export PATH=/home/ubuntu/bin/minimap2-2.26_x64-linux:$PATH #add to .bashrc
minimap2 --help

Install NanoPlot

pip install NanoPlot
#which NanoPlot
#NanoPlot -h

Install Varscan

cd ~/bin
curl -L -k -o VarScan.v2.4.2.jar https://github.com/dkoboldt/varscan/releases/download/2.4.2/VarScan.v2.4.2.jar
java -jar ~/bin/VarScan.v2.4.2.jar

Install faSplit

conda create -n fasplit_env bioconda::ucsc-fasplit
source activate fasplit_env
conda activate fasplit_env
#faSplit
conda deactivate

Install packages for single-cell ATAC-seq lab

To prevent dependencies conflicts, install packages for this lab in a conda environment.

Packages:

conda create --name snapatac2_env python=3.11
source activate snapatac2_env
conda activate snapatac2_env

pip install snapatac2
#pip show snapatac2 
pip install scanpy
pip install MACS2
pip install --user magic-impute
pip install deeptools 

conda deactivate

To run virtual environment in jupyter nbclassic, there are a few extra set up steps:

#Step 1: Activate the Conda Environment of interest:
conda activate snapatac2_env
#Step 2: Install Ipykernel: 
conda install ipykernel
#Step 3: Create a Jupyter Kernel for the environment
python -m ipykernel install --user --name=snapatac2_env_kernel
```bash
Then run jupyter notebook as usual:
```bash
jupyter nbclassic

Access server by adding to the browser: http://(your AWS public dns):8888/ We can either create a notebook using desired environment kernel, or just create a notebook using the default ipykernel and change kernel within the notebook itself.

Install ATACseqQC

R
#if (!require("BiocManager", quietly = TRUE))
    #install.packages("BiocManager")
BiocManager::install("ATACseqQC")
quit(save="no")

Install packages for single-cell RNAseq lab

To prevent dependencies conflicts, install packages for this lab in a conda environment.

Packages:

conda create --name scRNAseq_env python=3.11
source activate scRNAseq_env
conda activate scRNAseq_env
pip install 'scanpy[leiden]'
pip install gtfparse==1.2.0
pip install scrublet
pip install fast_matrix_market
pip install harmony-pytorch

conda install ipykernel
python -m ipykernel install --user --name=scRNAseq_env_kernel
conda deactivate

Install packages for Variant annotation and python visualization lab

Packages:

#pyenv
#note: after installation, add pyenv configs in .bashrc (see below)
cd ~/bin
curl https://pyenv.run | bash

#virtualenv
sudo apt install pipx
pipx install virtualenv

#beautifulsoup4
pip install beautifulsoup4

#requests
pip install requests

#vcfpy
#installation note for vcfpy: https://github.com/KarchinLab/open-cravat/issues/98
conda install -c bioconda open-cravat
pip install vcfpy

#vcftools
#installation note: https://github.com/vcftools/vcftools/issues/188
cd ~/bin
sudo apt-get install autoconf
git clone https://github.com/vcftools/vcftools.git
cd vcftools
./autogen.sh
./configure
make
sudo make install

#pysam
#conda config --show channels #to see if 3 needed channels are already configured in the conda environment. if not, add:
#conda config --add channels defaults
#conda config --add channels conda-forge
#conda config --add channels bioconda
#conda install pysam
pip install pysam
#installed at: ./bin/anaconda3/lib/python3.11/site-packages

#civicpy
pip install civicpy

#pandas
pip install pandas

#jq
sudo apt-get install jq

add pyenv configs in .bashrc

## pyenv configs
export PYENV_ROOT="/home/ubuntu/bin/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"

if command -v pyenv 1>/dev/null 2>&1; then
  eval "$(pyenv init -)"
fi

Path setup

For 2021 version of the course, rather than exporting each tool’s individual path. I moved all of the subdirs to ~/src and cp all of the binaries from there to ~/bin so that PATH is less complex.

Set up Apache web server

We will start an apache2 service and serve the contents of the students home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by url, etc. Note that when launching instances a security group will have to be selected/modified that allows http access via port 80.

sudo vim /etc/apache2/apache2.conf

Add the following content to apache2.conf

<Directory /workspace>
       Options Indexes FollowSymLinks
       AllowOverride None
       Require all granted
</Directory>
sudo service apache2 restart

To check if the server works, type in browser of choice: http://[public ip address of ec2 instance]. You should see the content within /workspace .

Save a public AMI

Finally, save the instance as a new AMI by right clicking the instance and clicking on “Create Image”. Enter an appropriate name and description and then save. If desired, you may choose at this time to include the workspace snapshot in the AMI to avoid having to explicitly attach it later at launching of AMI instances. Change the permissions of the AMI to “public” if you would like it to be listed under the Community AMIs. Copy the AMI to any additional regions where you would like it to appear in Community AMI searches.

Current Public AMIs

Create IAM account

From AWS Console select Services -> IAM. Go to Users, Create User, specify a user name, and Create. Download credentials to a safe location for later reference if needed. Select the new user and go to Security Credentials -> Manage Password -> ‘Assign a Custom Password’. Go to Groups -> Create a New Group, specify a group name and Next. Attach a policy to the group. In this case we give all EC2 privileges but no other AWS privileges by specifying “AmazonEC2FullAccess”. Hit Next, review and then Create Group. Select the Group -> Add Users to Group, select your new user to add it to the new group.

Launch student instance

  1. Go to AWS console. Login. Select EC2.
  2. Launch Instance, search for “cshl-seqtec-2024” in Community AMIs and Select.
  3. Choose “m6a.xlarge” instance type.
  4. Select one instance to launch (e.g., one per student and instructor), and select “Protect against accidental termination”
  5. Make sure that you see two snapshots (e.g., the 60GB root volume (gp3) and 500GB EBS volume (gp3) you set up earlier)
  6. Create a tag with Name=StudentName
  7. Choose existing security group call “SSH/HTTP/Jupyter”. Review and Launch.
  8. Choose an existing key pair (cshl_2024_student.pem)
  9. View instances and wait for them to finish initiating.
  10. Find your instance in console and select it, then hit connect to get your public.ip.address.
  11. Login to node ssh -i cshl_2024_student.pem ubuntu@[public.ip.address].
  12. Optional - set up DNS redirects (see below)

Set up a dynamic DNS service

Rather than handing out ip addresses for each student instance to each student you can instead set up DNS records to redirect from a more human readable name to the IP address. After spinning up all student instances, use a service like http://dyn.com (or http://entrydns.net, etc.) to create hostnames like , , etc that point to each public IP address of student instances.

Host necessary files for the course

Currently, all miscellaneous data files, annotations, etc. are hosted on an ftp server at the Genome Institute. In the future more data files could be pre-loaded onto the EBS snapshot.

After course reminders