Log into Compute Canada | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Log into Compute Canada

Signing into Compute Canada for the course

In order to sign into your Compute Canada instance, you will need a valid user ID and password for Compute Canada. These should have been provided to you by the instructors.

Logging in with ssh (Mac/Linux)

ssh user#@login1.CBW.calculquebec.cloud

user# is the name of a user on the system you are logging into. login1.CBW.calculquebec.cloud is the address of the Linux system on Compute Canada that you are logging into. Instead of using the public DNS name, you could also use the IP address if you know it. When prompted, enter your password.

Logging in with PuTTY (Windows)

To log in on Windows, you must first install PuTTY. Once you have PuTTY installed, you can log in using the following parameters. If you would like photos of where to input these parameters, please refer here.

Session-hostname: login1.CBW.calculquebec.cloud

Connection-Data-Auto-login username: user#

user# is the name of a user on the system you are logging into. login1.CBW.calculquebec.cloud is the address of the Linux system on Compute Canada that you are logging into. Instead of using the public DNS name, you could also use the IP address if you know it. When prompted, enter your password.

Copying files to your computer

  • To copy files from an instance, use scp in a similar fashion (in this case to copy a file called nice_alignments.bam):
scp user#@login1.CBW.calculquebec.cloud:nice_alignments.bam .
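
When several files or whole directories need to come back from the instance, the same scp call can be wrapped in a small function. The following is only a sketch: the helper name fetch_from_cluster, the REMOTE variable, and the SCP override are assumptions made for illustration, and "user#" still has to be replaced with your own account name.

```shell
# Assumption: substitute the account name you were given for "user#".
REMOTE="user#@login1.CBW.calculquebec.cloud"

# Hypothetical helper: copy remote files or directories into a local directory.
# Usage: fetch_from_cluster DEST_DIR REMOTE_PATH...
fetch_from_cluster() {
    dest="$1"
    shift
    mkdir -p "$dest" || return 1
    for path in "$@"; do
        # -r recurses into directories; -p preserves modification times.
        # SCP can be overridden (e.g. SCP=echo) to preview the commands.
        ${SCP:-scp} -r -p "${REMOTE}:${path}" "${dest}/" || return 1
    done
}
```

For example, fetch_from_cluster results nice_alignments.bam would copy that file into a local directory called results. Setting SCP=echo first prints the arguments each scp call would receive without connecting to the server.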

Using Jupyter Notebook or JupyterLab

Everything created in your workspace on the cloud is also available through a web server running Jupyter Notebook or JupyterLab. You can also perform Python/R analysis and access an interactive command-line terminal via JupyterLab. Go to the URL below in your browser and choose Jupyter Notebook (or JupyterLab) in the User Interface dropdown menu. For simply browsing and downloading files, select Number of cores = 1 and Memory (MB) = 3200. For analysis in JupyterLab, select Number of cores = 4 and Memory (MB) = 32000.

NOTE: Resources requested from your terminal/PuTTY (e.g., salloc requests) and resources requested via Jupyter are additive. Make sure to terminate any terminal or Jupyter session that is not in use. It is important to log out when you finish a Jupyter session in order to release its resources. If you only close the browser window, your Jupyter session keeps running and continues to hold those resources.

https://jupyter.cbw.calculquebec.cloud/

File system layout

When you log in, you will be in your home directory (e.g., /home/user##). You will notice that you have three directories: “CourseData”, “projects”, and “scratch”. For the purposes of this course, we will mostly be working in your home directory and making use of some data files in the CourseData directory.

How to request and use a compute node

After you log into the cluster, you will be on the login node, which has very limited compute and memory resources. Do NOT run anything on the login node. Instead, request an interactive session on a compute node using the salloc command. For example:

salloc --mem 24000M -c 4 -t 8:0:0

--mem: the real memory (in megabytes) required per node.
-c | --cpus-per-task: number of processors required.
-t | --time: limit on the total run time of the job allocation.

The above command requests an interactive session with 4 cores and 24000M of memory for 8 hours. Once the job is allocated, you will be on one of the compute nodes.
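
Before starting any heavy work, it can help to confirm that the shell really is inside an allocation. The sketch below assumes that Slurm sets the SLURM_JOB_ID, SLURM_CPUS_PER_TASK, and SLURM_MEM_PER_NODE environment variables inside the session (the last two typically only when -c and --mem were given); the function name describe_allocation is hypothetical.

```shell
# Print a one-line summary of the current Slurm allocation.
# Returns 1 (with a warning on stderr) when no job is detected,
# which usually means the shell is still on the login node.
describe_allocation() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "No Slurm job detected: this looks like the login node." >&2
        return 1
    fi
    echo "job=${SLURM_JOB_ID} cpus=${SLURM_CPUS_PER_TASK:-unknown} mem_mb=${SLURM_MEM_PER_NODE:-unknown}"
}
```

Running describe_allocation right after salloc returns gives a quick check that the requested cores and memory match what was asked for.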

After you have received your compute node, you will need to load the software that we will be using for this workshop.

This can be done with the following command.

module load samtools/1.10 bam-readcount/0.8.0 hisat2/2.2.0 stringtie/2.1.0 gffcompare/0.11.6 tophat/2.1.1 kallisto/0.46.1 fastqc/0.11.8 multiqc/1.8 picard/2.20.6 flexbar/3.5.0 RSeQC/3.0.1 bedops/2.4.39 ucsctools/399 r/4.0.0 python/3.7.4 HTSeq/1.18.1 regtools/0.5.2
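
After loading the modules, a quick way to confirm that they took effect is to check that the expected executables are on PATH. This is a minimal sketch; check_tools is a hypothetical helper, and the executable names passed to it are assumptions about what each module provides.

```shell
# Report any expected executables that are not on PATH.
# Returns 0 when every tool is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as check_tools samtools hisat2 stringtie kallisto fastqc multiqc prints one "missing:" line per tool that cannot be found and prints nothing when all of them are available.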

Getting information on your compute jobs

The following commands allow you to see all current jobs requested by your user and to cancel a job if needed. This may be necessary if you get disconnected from your compute session and wind up with “zombie” jobs that you are no longer connected to. The first command can be used to find the job ID needed for the second command. Here $user and $jobid stand for your user name and the job ID.

squeue -u $user
scancel $jobid

When you are done with the compute node, make sure to type exit to exit the node and free up the resources you allocated for the node.
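
If several zombie jobs have piled up, the two commands can be combined. The sketch below uses the squeue options -h (no header) and -o '%i' (print only the job ID); the function names my_job_ids and cancel_my_jobs are hypothetical, and the user name is taken from the shell's USER variable, falling back to id -un.

```shell
# List the IDs of all of the current user's jobs, one per line.
# -h drops the header row; -o '%i' prints only the job ID.
my_job_ids() {
    squeue -u "${USER:-$(id -un)}" -h -o '%i'
}

# Cancel every job returned by my_job_ids, reporting each one.
cancel_my_jobs() {
    for jobid in $(my_job_ids); do
        echo "cancelling job ${jobid}"
        scancel "${jobid}"
    done
}
```

Note that cancel_my_jobs also cancels the interactive session it is run from, if there is one, so it is best suited to cleaning up from the login node.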

Strand Settings

There are various strand-related settings for RNA-seq tools that must be adjusted to account for library construction strategy. The following table provides read orientation codes and software settings for commonly used RNA-seq analysis tools including: IGV, TopHat, HISAT2, HTSeq, Picard, Kallisto, StringTie, and others. Each of these explanations/settings is provided for several commonly used RNA-seq library construction kits that produce either stranded or unstranded data.

NOTE: A useful tool to infer strandedness of your raw sequence data is the check_strandedness tool. We provide a tutorial for using this tool here.

NOTE: In the table below, the list of methods/kits for specific strand settings assumes that these kits are used as specified by their manufacturer. It is very possible that a sequencing provider/core may make modifications to these kits. For example, in one case we obtained RNA-seq data processed with the NEBNext Ultra II Directional kit (dUTP method). However, instead of the NEB hairpin adapters, IDT xGen UDI-UMI adapters were substituted, and this resulted in the insert strandedness being flipped (from RF/fr-firststrand to FR/fr-secondstrand). Because this level of detail is not always provided, it is highly recommended to confirm your data’s strandedness empirically.

| Tool | RF/fr-firststrand stranded (dUTP) | FR/fr-secondstrand stranded (Ligation) | Unstranded |
| --- | --- | --- | --- |
| check_strandedness (output) | RF/fr-firststrand | FR/fr-secondstrand | unstranded |
| IGV (5p to 3p read orientation code) | F2R1 | F1R2 | F2R1 or F1R2 |
| TopHat (--library-type parameter) | fr-firststrand | fr-secondstrand | fr-unstranded |
| HISAT2 (--rna-strandness parameter) | R/RF | F/FR | NONE |
| HTSeq (--stranded/-s parameter) | reverse | yes | no |
| Picard CollectRnaSeqMetrics (STRAND_SPECIFICITY parameter) | SECOND_READ_TRANSCRIPTION_STRAND | FIRST_READ_TRANSCRIPTION_STRAND | NONE |
| Kallisto quant (parameter) | --rf-stranded | --fr-stranded | NONE |
| StringTie (parameter) | --rf | --fr | NONE |
| FeatureCounts (-s parameter) | 2 | 1 | 0 |
| RSEM (--forward-prob parameter) | 0 | 1 | 0.5 |
| Salmon (--libType parameter) | ISR (assuming paired-end with inward read orientation) | ISF (assuming paired-end with inward read orientation) | IU (assuming paired-end with inward read orientation) |
| Trinity (--SS_lib_type parameter) | RF | FR | NONE |
| MGI CWL YAML (strand parameter) | first | second | NONE |
| RegTools (strand parameter) | -s 1 | -s 2 | -s 0 |
| | Example methods/kits: dUTP, NSR, NNSR, Illumina TruSeq Strand Specific Total RNA, NEBNext Ultra II Directional | Example methods/kits: Ligation, Standard SOLiD, NuGEN Encore, 10X scRNA data | Example kits/data: Standard Illumina, NuGEN OvationV2, GDC normalized TCGA data |
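
In a pipeline script, a few of the table rows can be captured in a lookup function so that the library type is set once and every tool receives a consistent option. This is a sketch covering only some of the tools; the function name strand_flag and the lowercase tool keys are hypothetical, and the RF/FR values assume paired-end data.

```shell
# Print the strand option for a tool given the library type.
# Usage: strand_flag TOOL LIBTYPE   (LIBTYPE is RF, FR, or unstranded)
strand_flag() {
    case "$1:$2" in
        hisat2:RF)                echo "--rna-strandness RF" ;;
        hisat2:FR)                echo "--rna-strandness FR" ;;
        htseq:RF)                 echo "--stranded reverse" ;;
        htseq:FR)                 echo "--stranded yes" ;;
        htseq:unstranded)         echo "--stranded no" ;;
        stringtie:RF)             echo "--rf" ;;
        stringtie:FR)             echo "--fr" ;;
        kallisto:RF)              echo "--rf-stranded" ;;
        kallisto:FR)              echo "--fr-stranded" ;;
        featurecounts:RF)         echo "-s 2" ;;
        featurecounts:FR)         echo "-s 1" ;;
        featurecounts:unstranded) echo "-s 0" ;;
        salmon:RF)                echo "--libType ISR" ;;
        salmon:FR)                echo "--libType ISF" ;;
        salmon:unstranded)        echo "--libType IU" ;;
        # These tools take no strand option for unstranded data (NONE in the table).
        hisat2:unstranded|stringtie:unstranded|kallisto:unstranded) : ;;
        *) echo "unknown tool/library combination: $1 $2" >&2; return 1 ;;
    esac
}
```

For instance, strand_flag htseq RF prints --stranded reverse, which can be spliced into an htseq-count command line. Unknown combinations return a non-zero status rather than silently falling back to unstranded.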

Notes

To identify which ‘--library-type’ setting to use with TopHat, Illumina documents the types in the ‘RNA Sequencing Analysis with TopHat’ booklet. For the TruSeq RNA Sample Prep Kit, the appropriate library type is ‘fr-unstranded’. For TruSeq stranded sample prep kits, the library type is ‘fr-firststrand’. These posts are also very informative: ‘How to tell which library type to use (fr-firststrand or fr-secondstrand)?’, ‘How to determine if a library is strand-specific’, and ‘Strandness in RNASeq’ by Hong Zheng.

Another suggestion is to view aligned reads in IGV and determine the read orientation by one of two methods. First, you can have IGV color alignments according to strand using the ‘Color alignments’ by ‘First-of-pair strand’ setting. Second, to get more detailed information, you can hover your cursor over a read aligned to an exon.

‘F2 R1’ means the second read in the pair aligns to the forward strand and the first read in the pair aligns to the reverse strand. For a positive DNA strand transcript (5’ to 3’) this would denote a fr-firststrand setting in TopHat, i.e. “the right-most end of the fragment (in transcript coordinates) is the first sequenced”. For a negative DNA strand transcript (3’ to 5’) this would denote a fr-secondstrand setting in TopHat. ‘F1 R2’ means the first read in the pair aligns to the forward strand and the second read in the pair aligns to the reverse strand. See above for the complete definitions, but it’s simply the inverse for ‘F1 R2’ mapping.

Anything other than FR orientation is not covered here, and discussion with the individual responsible for library creation would be required. Typically ‘RF’ orientation is reserved for large-insert mate-pair libraries. Other orientations like ‘FF’ and ‘RR’ seem impossible with Illumina sequence technology and suggest structural variation between the sample and reference. Additional details are provided in the TopHat manual.

For HTSeq, the htseq-count manual indicates that for the ‘--stranded’ option, ‘stranded=no’ means that a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For ‘stranded=yes’ and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For ‘stranded=reverse’, these rules are reversed.

For the ‘CollectRnaSeqMetrics’ sub-command of Picard, the Picard manual indicates that one should use ‘FIRST_READ_TRANSCRIPTION_STRAND’ if the reads are expected to be on the transcription strand.