GCP Setup
UNDER DEVELOPMENT
Google Cloud Platform setup for use in workshop
This tutorial explains how a Google Cloud Instance can be configured from scratch for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers that wish to conduct their hands on exercises on Google GCP.
Create a Google Cloud account
- You will need a Google account (personal or institutional)
- Use the above email account to log into the Google Cloud Console: https://console.cloud.google.com/. Note: Any GCP account needs to be linked to an actual person/credit card account or institutional billing account.
- Create a Google Cloud Project connected to a billing source
- Optional - Set up an IAM account. Details to be resolved…
- Request limit increases. You need to be able to spin up at least one instance for every student and TA/instructor. To find current limits and request increases in the console, go to:
IAM & Admin
->Quotas
. - In the GCP console: Go to
Compute Engine
->VM instances
.
Start with existing base image
Create Instance
- Give the Instance a Name (e.g.
rnabio-course-2023
) - Select Machine Type (e.g. E2 Series:
e2-standard-2
) - Change the Boot disk to Ubuntu -> Ubuntu 20.04 LTS (x86/64). Change size to 250 GB.
- Under Firewall, select these options:
Allow HTTP traffic
andAllow HTTPS traffic
- Hit the
Create
button
Install the Google Cloud SDK (gcloud), authenticate your user, and login to your VM
- Install the Google Cloud commandline interface following instructions here: https://cloud.google.com/sdk/docs/install
- Use the following command and follow instructions to authenticate your user:
gcloud auth login
- Set the project to the google billing project created above as follows:
gcloud config set project $project_name
- Check the authentication configuration as follows:
gcloud config list
- Log into the instance using the instance name chosen above follows
gcloud compute ssh rnabio-course-2023
Set up the ubuntu user:
Logging into a Google VM of Ubuntu is a bitter different from AWS. By default you will login with your Google user name (or is it your username from the host machine you login from?) instead of using the “ubuntu” user.
Set password for the ubuntu user (and make note of this somewhere safe). Then change users to the “ubuntu” user before proceeding with the rest of this setup. Note that later if you login as another sudo user, if you need to, you should be able to reset the password associated with the ubuntu user.
whoami
sudo passwd ubuntu
su ubuntu
cd ~
Perform basic linux configuration
- To allow installation of bioinformatics tools some basic dependencies must be installed first.
sudo apt-get update
sudo apt-get upgrade
sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev libbz2-dev docker.io libpcre2-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libdbi-perl libdbd-mysql-perl libcurl4-openssl-dev
sudo ln -s /usr/include/jsoncpp/json/ /usr/include/json
sudo timedatectl set-timezone America/Chicago
- logout and log back in for changes to take effect.
exit
exit
gcloud compute ssh rnabio-course-2023
su ubuntu
cd ~
Add ubuntu user to docker group
sudo usermod -aG docker ubuntu
Then exit shell and log back into instance.
Install any desired informatics tools
- NOTE: R in particular is a slow install.
- NOTE:
- All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.
- Paths to pre-installed tools can be added to the .bashrc file. It may also be convenient to add
export RNA_HOME=~/workspace/rnaseq
to the .bashrc file. See https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc. - NOTE: In some installations of R there is an executable called pager that clashes with the system pager. This causes man to fail. Check with a
man ls
and if the problem exists, add the following to .bashrc:
export MANPAGER=less
Install RNA-seq software
- These install instructions should be identical to those found on https://github.com/griffithlab/rnaseq_tutorial/wiki/Installation except that each tool is installed in
/home/ubuntu/bin/
and its install location is exported to the $PATH variable for easy access.
Create directory to install software to and setup path variables
mkdir ~/bin
cd bin
WORKSPACE=/home/ubuntu/workspace
HOME=/home/ubuntu
Install SAMtools
cd ~/bin
wget https://github.com/samtools/samtools/releases/download/1.16.1/samtools-1.16.1.tar.bz2
bunzip2 samtools-1.16.1.tar.bz2
tar -xvf samtools-1.16.1.tar
cd samtools-1.16.1
make
./samtools
export PATH=/home/ubuntu/bin/samtools-1.16.1:$PATH
Install bam-readcount
cd ~/bin
export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.16.1
git clone https://github.com/genome/bam-readcount
cd bam-readcount
mkdir build
cd build
cmake ..
make
export PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH
Install HISAT2
uname -m
cd ~/bin
curl -s https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download > hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
cd hisat2-2.2.1
./hisat2 -h
export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH
Install StringTie
cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.1.6.tar.gz
tar -xzvf stringtie-2.1.6.tar.gz
cd stringtie-2.1.6
make release
export PATH=/home/ubuntu/bin/stringtie-2.1.6:$PATH
Install gffcompare
cd ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.12.6.Linux_x86_64.tar.gz
cd gffcompare-0.12.6.Linux_x86_64/
./gffcompare
export PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH
Install htseq-count
sudo apt install python3-htseq
Make sure that OpenSSL is on correct version
TopHat will not install if the version of OpenSSL is too old.
To get version:
openssl version
If version is OpenSSL 1.1.1f
, then it needs to be updated using the following steps.
cd ~/bin
wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz
tar -zxf openssl-1.1.1g.tar.gz && cd openssl-1.1.1g
./config
make
make test
sudo mv /usr/bin/openssl ~/tmp #in case install goes wrong
sudo make install
sudo ln -s /usr/local/bin/openssl /usr/bin/openssl
sudo ldconfig
Again, from the terminal issue the command:
openssl version
Your output should be as follows:
OpenSSL 1.1.1g 21 Apr 2020
Then create ~/.wgetrc
file and add to it
ca_certificate=/etc/ssl/certs/ca-certificates.crt
using vim or nano.
Install TopHat
cd ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz
cd tophat-2.1.1.Linux_x86_64/
./gtf_to_fasta
export PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH
Install kallisto
cd ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
tar -zxvf kallisto_linux-v0.44.0.tar.gz
cd kallisto_linux-v0.44.0/
./kallisto
export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH
Install FastQC
cd ~/bin
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
cd FastQC/
chmod 755 fastqc
./fastqc --help
export PATH=/home/ubuntu/bin/FastQC:$PATH
Intall a particular version of numpy that hopefully works with all the dependencies that rely on it
cd ~/bin
pip install --force-reinstall -v "numpy==1.24.1"
Install MultiQC
cd ~/bin
export PATH=/home/ubuntu/.local/bin:$PATH
pip3 install multiqc
multiqc --help
Install Picard
cd ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar -O picard.jar
java -jar ~/bin/picard.jar
Install Flexbar
sudo apt install flexbar
Install Regtools
cd ~/bin
git clone https://github.com/griffithlab/regtools
cd regtools/
mkdir build
cd build/
cmake ..
make
./regtools
export PATH=/home/ubuntu/bin/regtools/build:$PATH
Install RSeQC
pip3 install RSeQC
~/.local/bin/read_GC.py
export PATH=/home/ubuntu/.local/bin/:$PATH
Install bedops
cd ~/bin
mkdir bedops_linux_x86_64-v2.4.40
cd bedops_linux_x86_64-v2.4.40
wget -c https://github.com/bedops/bedops/releases/download/v2.4.40/bedops_linux_x86_64-v2.4.40.tar.bz2
tar -jxvf bedops_linux_x86_64-v2.4.40.tar.bz2
./bin/bedops
export PATH=/home/ubuntu/bin/bedops_linux_x86_64-v2.4.40/bin:$PATH
Install gtfToGenePred
cd ~/bin
mkdir gtfToGenePred
cd gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
chmod a+x gtfToGenePred
./gtfToGenePred
export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH
Install genePredToBed
cd ~/bin
mkdir genePredtoBed
cd genePredtoBed
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod a+x genePredToBed
./genePredToBed
export PATH=/home/ubuntu/bin/genePredToBed:$PATH
Install Cell Ranger
- Must register to get download link
cd ~/bin
wget `download_link`
tar -xzvf cellranger-7.1.0.tar.gz
cd cellranger-7.1.0
./bin/cellranger
export PATH=/home/ubuntu/bin/cellranger-7.1.0:$PATH
Install TABIX
sudo apt-get install tabix
Install BWA
cd ~/bin
git clone https://github.com/lh3/bwa.git
cd bwa
make
export PATH=/home/ubuntu/bin/bwa:$PATH
Install bedtools
cd ~/bin
wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz
tar -zxvf bedtools-2.30.0.tar.gz
cd bedtools2
make
export PATH=/home/ubuntu/bin/bedtools2/bin:$PATH
Install BCFtools
cd ~/bin
wget wget https://github.com/samtools/bcftools/releases/download/1.16/bcftools-1.16.tar.bz2
bunzip2 bcftools-1.16.tar.bz2
tar -xvf bcftools-1.16.tar
cd bcftools-1.16
make
./bcftools
export PATH=/home/ubuntu/bin/bcftools-1.14:$PATH
Install htslib
cd ~/bin
wget https://github.com/samtools/htslib/releases/download/1.16/htslib-1.16.tar.bz2
bunzip2 htslib-1.16.tar.bz2
tar -xvf htslib-1.16.tar
cd htslib-1.16
make
./htsfile
export PATH=/home/ubuntu/bin/htslib-1.14:$PATH
Install peddy
cd ~/bin
git clone https://github.com/brentp/peddy
cd peddy
pip install -r requirements.txt
pip install --editable .
python -m peddy -h
Install slivar
cd ~/bin
wget https://github.com/brentp/slivar/releases/download/v0.2.7/slivar
chmod +x ./slivar
./slivar
export PATH=/home/ubuntu/bin:$PATH
Install STRling
cd ~/bin
wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.1/strling
chmod +x ./strling
./strling -h
export PATH=/home/ubuntu/bin:$PATH
Install freebayes
sudo apt install freebayes
Install vcflib
sudo apt install libvcflib-tools libvcflib-dev
Install Anaconda
cd ~/bin
wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
bash Anaconda3-2022.10-Linux-x86_64.sh
Press Enter to review the license agreement. Then press and hold Enter to scroll.
Enter “yes” to agree to the license agreement.
Saved the installation to /home/ubuntu/bin/anaconda3
and chose yes to initializng Anaconda3.
Install [VEP]
Describes dependencies for VEP 108, used in this course for variant annotation. When running the VEP installer follow the prompts specified:
- Do you want to install any cache files (y/n)? n [ENTER] (select number for homo_sapiens_vep_108_GRCh38.tar.gz) [ENTER]
- Do you want to install any FASTA files (y/n)? n [ENTER] (select number for homo_sapiens) [ENTER]
- Do you want to install any plugins (y/n)? n [ENTER]
The VEP cache and FASTA files are very large ~25G or more. Probably do NOT want to install these as part of an image, but it should be possible to rerun this tool and install them later as needed.
mkdir ~/workspace
cd ~/bin
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl --CACHEDIR ~/workspace/ensembl-vep/
export PATH=/home/ubuntu/bin/ensembl-vep:$PATH
Set up Jupyter to render in web brower
Followed this website
First, we need to add Jupyter to the system’s path (you can check if it is already on the path by running: which python, if no path is returned you need to add the path) To add Jupyter functionality to your terminal, add the following line of code to your .bashrc file:
export PATH=/home/ubuntu/anaconda3/bin:$PATH
Then you need to source the .bashrc for changes to take effect.
source .bashrc
We then need to create our Jupyter configuration file. In order to create that file, you need to run:
jupyter notebook --generate-config
After creating your configuration file, you will need to generate a password for your Jupyter Notebook using ipython:
Enter the IPython command line:
ipython
Now follow these steps to generate your password:
from IPython.lib import passwd
passwd()
exit
You will be prompted to enter and re-enter your password. IPython will then generate a hash output, COPY THIS AND SAVE IT FOR LATER. We will need this for our configuration file.
Next go into your jupyter config file:
cd ~/.jupyter/
vim jupyter_notebook_config.py
Note: You may need first to run exit
in order to exit IPython otherwise the vim command may not be recognized by the terminal.
And add the following code:
conf = get_config()
conf.NotebookApp.ip = '0.0.0.0'
conf.NotebookApp.password = u'YOUR PASSWORD HASH'
conf.NotebookApp.port = 8888
# Note: this code below should be put at the beginning of the document.
We then need to create a directory for your notebooks. In order to make a folder to store all of your Jupyter Notebooks simply run:
cd ~/workspace
mkdir Jupyter_Notebooks
You can call this folder anything, for this example we call it Notebooks
After the previous step, you should be ready to run your notebook and access your EC2 server. To run your Notebook simply run the command:
jupyter notebook
From there you should be able to access your server by going to:
https://(your GCP public IP):8888/
Note that in order for this to work you need to have allowed external access to this machine over port 8888. In the GCP firewall settings this means adding a new firewall rull:
- Name:
default-allow-jupyter
- Direction of traffic:
Ingress
- Action on match:
Allow
- Targets:
All instances in the network
- Source IP ranges:
0.0.0.0/0
- Specified protocols and ports:
tcp:8888
Install R
sudo apt-get -y remove r-base-core
sudo apt-get -y remove r-base
sudo apt install dirmngr gnupg apt-transport-https ca-certificates software-properties-common
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
sudo apt install r-base
R --version
#make R library location accessible
sudo chown -R ubuntu:ubuntu /usr/local/lib/R/
chmod -R 775 /usr/local/lib/R
R Libraries
For this tutorial we require:
- devtools
- dplyr
- gplots
- ggplot2
- Seurat
- sctransform
- RColorBrewer
- ggthemes
- cowplot
- data.table
- Rtsne
- gridExtra
- UpSetR
R
install.packages(c("devtools","dplyr","gplots","ggplot2","Seurat","sctransform","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR"),repos="http://cran.us.r-project.org")
quit(save="no")
Bioconductor libraries
For this tutorial we require:
R
# Install Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva","gage","org.Hs.eg.db"))
quit(save="no")
Install Sleuth
R
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
quit(save="no")
Path setup
Add the following lines to the .bashrc using vim to ensure that all tools install are in the ubuntu user path:
PATH=/home/ubuntu/bin/samtools-1.16.1:$PATH
PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH
PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH
PATH=/home/ubuntu/bin/stringtie-2.1.6:$PATH
PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH
PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH
PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH
PATH=/home/ubuntu/bin/FastQC:$PATH
PATH=/home/ubuntu/.local/bin:$PATH
PATH=/home/ubuntu/bin/regtools/build:$PATH
PATH=/home/ubuntu/bin/bedops_linux_x86_64-v2.4.40/bin:$PATH
PATH=/home/ubuntu/bin/gtfToGenePred:$PATH
PATH=/home/ubuntu/bin/genePredToBed:$PATH
PATH=/home/ubuntu/bin/cellranger-7.1.0:$PATH
PATH=/home/ubuntu/bin/bwa:$PATH
PATH=/home/ubuntu/bin/bedtools2/bin:$PATH
PATH=/home/ubuntu/bin/bcftools-1.14:$PATH
PATH=/home/ubuntu/bin/htslib-1.14:$PATH
PATH=/home/ubuntu/bin/ensembl-vep:$PATH
For 2021 version of the course, rather than exporting each tool’s individual path. I moved all of the subdirs to ~/src and cp all of the binaries from there to ~/bin so that PATH is less complex.
Set up Apache web server
We will start an apache2 service and serve the contents of the students home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by url, etc. Note that when launching instances a security group will have to be selected/modified that allows http access via port 80.
- Edit config to allow files to be served from outside /usr/share and /var/www
sudo vim /etc/apache2/apache2.conf
- Add the following content to apache2.conf
<Directory /home/ubuntu/workspace/>
Options Indexes FollowSymLinks
AllowOverride None
Require all granted
</Directory>
- Edit vhost file
sudo vim /etc/apache2/sites-available/000-default.conf
- Change document root in 000-default.conf
DocumentRoot /home/ubuntu/workspace
- Restart apache
sudo service apache2 restart
Test by going to your instance’s public IP address in your browser.
Create a public Google cloud image using the GCP console
- Under Compute Engine -> Virtual Machines -> VM Instances. Stop the instance.
- Under Compute Engine -> Storage -> Images. Create Image.
- Provide a name for the image (e.g.
rnabio-course-2023-v1
). - Select
Source
->Disk
- Under `Source disk’ -> Choose the name of the stopped instance (e.g. rnabio-course-2023)
- Select
Location
->Multi-regional
Select location
->us (multiple regions in the United States)
- Leave
Family
blank, but add a description. Encryption
->Google-managed encryption key
.
To make the image fully public execute the following Google SDK command:
gcloud compute images add-iam-policy-binding rnabio-course-2023-v2 --member='allAuthenticatedUsers' --role='roles/compute.imageUser'
To list the image from the command line:
gcloud compute images list --filter="name=rnabio-course-2023-v2"
Current Public Google Images
rnabio-course-2023-v2
Launch student instance using this image
To start a new VM with the public image above one can use the GCP console as was done above to create a new VM with vanilla ubuntu, except this time selecting the pre-configured image with all tool installed already.
We have been unable to get this to work using the Console. It seems listing custom public images is not working there… ?
From the command line you can launch an instance as follows (you should probably personalize the malachi-course-2023
name used in two places of this command):
gcloud compute instances create malachi-course-2023 --zone=us-central1-a --machine-type=e2-standard-4 --network-interface=network-tier=PREMIUM,subnet=default --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,image=projects/griffith-lab/global/images/rnabio-course-2023-v2,mode=rw,size=250,type=pd-balanced,device-name=malachi-course-2023
Initial setup upon first login
You will want to do everything on this VM as the “ubuntu” user. First set the password for that user and then change to it.
Login as follows
gcloud compute ssh ubuntu@malachi-course-2023
Test environment
bwa mem
env