RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Annotations


RNA-seq_Flowchart


FASTA/FASTQ/GTF mini lecture

If you would like a refresher on common file formats such as FASTA, FASTQ, and GTF files, we have made a mini lecture briefly covering these.

Obtain Known Gene/Transcript Annotations

In this tutorial we will use annotations obtained from Ensembl (Homo_sapiens.GRCh38.86.gtf.gz) for chromosome 22 only. For time reasons, these are prepared for you and made available on your AWS instance. But you should get familiar with sources of gene annotations for RNA-seq analysis.

Copy the gene annotation files to the working directory.

echo $RNA_REFS_DIR
cd $RNA_REFS_DIR
wget http://genomedata.org/rnaseq-tutorial/annotations/GRCh38/chr22_with_ERCC92.gtf

Take a look at the contents of the .gtf file. Press q to exit the less display.

echo $RNA_REF_GTF
less -p start_codon -S $RNA_REF_GTF

Note how the -S option makes it easier to veiw this file with less. Make the formatting a bit nicer still:

cat chr22_with_ERCC92.gtf | column -t | less -p exon -S

How many unique gene IDs are in the .gtf file?

We can use a perl command-line command to find out:

perl -ne 'if ($_ =~ /(gene_id\s\"ENSG\w+\")/){print "$1\n"}' $RNA_REF_GTF | sort | uniq | wc -l

We can also use grep to find this same information.

cat chr22_with_ERCC92.gtf | grep -w gene | wc -l

Now view the structure of a single transcript in GTF format. Press q to exit the less display when you are done.

grep ENST00000342247 $RNA_REF_GTF | less -p "exon\s" -S

To learn more, see:

Definitions:

Reference genome - The nucleotide sequence of the chromosomes of a species. Genes are the functional units of a reference genome and gene annotations describe the structure of transcripts expressed from those gene loci.

Gene annotations - Descriptions of gene/transcript models for a genome. A transcript model consists of the coordinates of the exons of a transcript on a reference genome. Additional information such as the strand the transcript is generated from, gene name, coding portion of the transcript, alternate transcript start sites, and other information may be provided.

GTF (.gtf) file - A common file format referred to as Gene Transfer Format used to store gene and transcript annotation information. You can learn more about this format here: http://genome.ucsc.edu/FAQ/FAQformat#format4

The Purpose of Gene Annotations (.gtf file)

When running the HISAT2/StringTie/Ballgown pipeline, known gene/transcript annotations are used for several purposes:

Sources for obtaining gene annotation files formatted for HISAT2/StringTie/Ballgown

There are many possible sources of .gtf gene/transcript annotation files. For example, from Ensembl, UCSC, RefSeq, etc. Several options and related instructions for obtaining the gene annotation files are provided below.

I. ENSEMBL FTP SITE

Based on Ensembl annotations only. Available for many species. http://useast.ensembl.org/info/data/ftp/index.html

II. UCSC TABLE BROWSER

Based on UCSC annotations or several other possible annotation sources collected by UCSC. You might chose this option if you want to have a lot of flexibility in the annotations you obtain. e.g. to grab only the transcripts from chromosome 22 as in the following example:

In addition to the .gtf file you may find uses for some extra files providing alternatively formatted or additional information on the same transcripts. For example:

How to get a Gene bed file:
How to get an Exon bed file:
How to get gene symbols and descriptions for all UCSC genes:

III. HISAT2 Precomputed Genome Index

HISAT2 has prebuilt reference genome index files for both DNA and RNA alignment. Various versions of the index files include SNPs and/or transcript splice sites. Versions of both the Ensembl and UCSC genomes for human build 38 are linked from the main HISAT2 page: https://ccb.jhu.edu/software/hisat2/index.shtml

Or those same files are directly available from their FTP site: ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/

Important notes:

On chromosome naming conventions: In order for your RNA-seq analysis to work, the chromosome names in your .gtf file must match those in your reference genome (i.e. your reference genome fasta file). If you get a StringTie result where all transcripts have an expression value of 0, you may have overlooked this. Unfortunately, Ensembl, NCBI, and UCSC can not agree on how to name the chromosomes in many species, so this problem may come up often. You can avoid this by getting a complete reference genome and gene annotation package from the same source (e.g., Ensembl) to maintain consistency.

On reference genome builds: Your annotations must correspond to the same reference genome build as your reference genome fasta file. e.g., both correspond to UCSC human build ‘hg38’, NCBI human build ‘GRCh38’, etc. Even if both your reference genome and annotations are from UCSC or Ensembl they could still correspond to different versions of that genome. This would cause problems in any RNA-seq pipeline.

A more detailed discussion of commonly used version of the human reference genome can be found in a companion workshop PMBIO Reference Genomes.