Welcome to the blog

Posts

My thoughts and ideas

Indexing | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Indexing


RNA-seq_Flowchart


Create a HISAT2 index

Create a HISAT2 index for chr22 and the ERCC spike-in sequences. HISAT2 can incorporate exons and splice sites into the index file for alignment. First create a splice site file, then an exon file. Finally make the aligner FM index.

To learn more about how the HISAT2 indexing strategy is distinct from other next gen aligners refer to the HISAT publication.

    cd $RNA_HOME
    hisat2_extract_splice_sites.py $RNA_REF_GTF > $RNA_REFS_DIR/splicesites.tsv
    hisat2_extract_exons.py $RNA_REF_GTF > $RNA_REFS_DIR/exons.tsv
    hisat2-build -p 8 --ss $RNA_REFS_DIR/splicesites.tsv --exon $RNA_REFS_DIR/exons.tsv $RNA_REF_FASTA $RNA_REF_INDEX

[OPTIONAL] To create an index for all chromosomes instead of just chr22 you would do the following:

NOTE: The below example does NOT take advantage of adding the splice sites and exons to the index. If desired, you would make those files using the full GTF and add them to the command using the appropriate options.

WARNING: In order to index the entire human genome, HISAT2 requires 160GB of RAM. Your AWS instance size will run out of RAM.

    #cd /home/ubuntu/workspace/data/fasta/GRCh38
    #hisat2-build -p 8 Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa Homo_sapiens.GRCh38.dna_sm.primary_assembly
Annotations | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Annotations


RNA-seq_Flowchart


Obtain Known Gene/Transcript Annotations

In this tutorial we will use annotations obtained from Ensembl (Homo_sapiens.GRCh38.86.gtf.gz) for chromosome 22 only. For time reasons, these are prepared for you and made available on your AWS instance. But you should get familiar with sources of gene annotations for RNA-seq analysis.

Copy the gene annotation files to the working directory.

    echo $RNA_REFS_DIR
    cd $RNA_REFS_DIR
    wget http://genomedata.org/rnaseq-tutorial/annotations/GRCh38/chr22_with_ERCC92.gtf

Take a look at the contents of the .gtf file. Press q to exit the less display.

    echo $RNA_REF_GTF
    less -p start_codon -S $RNA_REF_GTF

How many unique gene IDs are in the .gtf file?

We can use a perl command-line command to find out:

    perl -ne 'if ($_ =~ /(gene_id\s\"ENSG\w+\")/){print "$1\n"}' $RNA_REF_GTF | sort | uniq | wc -l

Now view the structure of a single transcript in GTF format. Press q to exit the less display when you are done.

    grep ENST00000342247 $RNA_REF_GTF | less -p "exon\s" -S

To learn more, see:

Definitions:

The Purpose of Gene Annotations (.gtf file)

When running the HISAT2/StringTie/Ballgown pipeline, known gene/transcript annotations are used for several purposes:

Sources for obtaining gene annotation files formatted for HISAT2/StringTie/Ballgown

There are many possible sources of .gtf gene/transcript annotation files. For example, from Ensembl, UCSC, RefSeq, etc. Several options and related instructions for obtaining the gene annotation files are provided below.

I. ENSEMBL FTP SITE

Based on Ensembl annotations only. Available for many species. http://useast.ensembl.org/info/data/ftp/index.html

II. UCSC TABLE BROWSER

Based on UCSC annotations or several other possible annotation sources collected by UCSC. You might chose this option if you want to have a lot of flexibility in the annotations you obtain. e.g. to grab only the transcripts from chromosome 22 as in the following example:

In addition to the .gtf file you may find uses for some extra files providing alternatively formatted or additional information on the same transcripts. For example: How to get a Gene bed file:

How to get an Exon bed file:

How to get gene symbols and descriptions for all UCSC genes:

III. HISAT2 Precomputed Genome Index

HISAT2 has prebuilt reference genome index files for both DNA and RNA alignment. Various versions of the index files include SNPs and/or transcript splice sites. Versions of both the Ensembl and UCSC genomes for human build 38 are linked from the main HISAT2 page: https://ccb.jhu.edu/software/hisat2/index.shtml

Or those same files are directly available from their FTP site: ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/

Important notes:

On chromosome naming conventions: In order for your RNA-seq analysis to work, the chromosome names in your .gtf file must match those in your reference genome (i.e. your reference genome fasta file). If you get a StringTie result where all transcripts have an expression value of 0, you may have overlooked this. Unfortunately, Ensembl, NCBI, and UCSC can not agree on how to name the chromosomes in many species, so this problem may come up often. You can avoid this by getting a complete reference genome and gene annotation package from the same source (e.g., Ensembl) to maintain consistency.

On reference genome builds: Your annotations must correspond to the same reference genome build as your reference genome fasta file. e.g., both correspond to UCSC human build ‘hg38’, NCBI human build ‘GRCh38’, etc. Even if both your reference genome and annotations are from UCSC or Ensembl they could still correspond to different versions of that genome. This would cause problems in any RNA-seq pipeline.