Welcome to the blog

Posts

My thoughts and ideas

Alignment Visualization | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Alignment Visualization


RNA-seq_Flowchart3


Before we can view our alignments in the IGV browser we need to index our BAM files. We will use samtools index for this purpose. For convenience later, index all bam files.

echo $RNA_ALIGN_DIR
cd $RNA_ALIGN_DIR

samtools index HBR.bam
samtools index HBR_Rep1.bam
samtools index HBR_Rep2.bam
samtools index HBR_Rep3.bam
samtools index UHR.bam
samtools index UHR_Rep1.bam
samtools index UHR_Rep2.bam
samtools index UHR_Rep3.bam

# Note that we could have created and run a samtools index command for all files ending in .bam using the following construct:
# find *.bam -exec echo samtools index {} \; | sh

Optional:

Try to create an index file for one of your bam files using a samtools docker image rather than the locally installed version of samtools. Below is an example docker run command.

cp HBR.bam /tmp/
docker run -v /tmp:/docker_workspace biocontainers/samtools:v1.9-4-deb_cv1 samtools index /docker_workspace/HBR.bam
ls /tmp/HBR.bam*

docker run is how you initialize a docker container to run a command

-v is the parameter used to mount your workspace so that the docker container can see the files that you’re working with. In the example above, /tmp from the EC2 instance has been mounted as /docker_workspace within the docker container.

biocontainers/samtools is the docker container name. The :v1.9-4-deb_cv1 refers to the specific tag and release of the docker container.

Visualize alignments

Start IGV on your laptop. Load the UHR.bam & HBR.bam files in IGV. If you’re using AWS, you can load the necessary files in IGV directly from your web accessible amazon workspace (see below) using ‘File’ -> ‘Load from URL’.

Make sure you select the appropriate reference genome build in IGV (top left corner of IGV): in this case hg38.

UHR hisat2 alignment:

http://YOUR_PUBLIC_IPv4_ADDRESS/rnaseq/alignments/hisat2/UHR.bam

HBR hisat2 alignment:

http://YOUR_PUBLIC_IPv4_ADDRESS/rnaseq/alignments/hisat2/HBR.bam

You may wish to customize the track names as you load them in to keep them straight. Do this by right-clicking on the alignment track and choosing ‘Rename Track’.

Go to an example gene locus on chr22:

  • e.g. EIF3L, NDUFA6, and RBX1 have nice coverage
  • e.g. SULT4A1 and GTSE1 are differentially expressed. Are they up-regulated or down-regulated in the brain (HBR) compared to cancer cell lines (UHR)?
  • Mouse over some reads and use the read group (RG) flag to determine which replicate the reads come from. What other details can you learn about each read and its alignment to the reference genome.

Exercise

Try to find a variant position in the RNAseq data:

  • HINT: DDX17 is a highly expressed gene with several variants in its 3 prime UTR.
  • Other highly expressed genes you might explore are: NUP50, CYB5R3, and EIF3L (all have at least one transcribed variant).
  • Are these variants previously known (e.g., present in dbSNP)?
  • How should we interpret the allele frequency of each variant? Remember that we have rather unusual samples here in that they are actually pooled RNAs corresponding to multiple individuals (genotypes).
  • Take note of the genomic position of your variant. We will need this later.
IGV visualization example (DDX17 3 prime region)

IVG-DDX17

BAM Read Counting

Using one of the variant positions identified above, count the number of supporting reference and variant reads. First, use samtools mpileup to visualize a region of alignment with a variant.

cd $RNA_HOME
mkdir bam_readcount
cd bam_readcount

Create faidx indexed reference sequence file for use with mpileup

echo $RNA_REF_FASTA
samtools faidx $RNA_REF_FASTA

Run samtools mpileup on a region of interest

samtools mpileup -f $RNA_REF_FASTA -r 22:18918457-18918467 $RNA_ALIGN_DIR/UHR.bam $RNA_ALIGN_DIR/HBR.bam

Each line consists of chromosome, 1-based coordinate, reference base, the number of reads covering the site, read bases and base qualities. At the read base column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, ACGTN for a mismatch on the forward strand and acgtn for a mismatch on the reverse strand. A pattern \+[0-9]+[ACGTNacgtn]+ indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. See samtools pileup/mpileup documentation for more explanation of the output:

Now, use bam-readcount to count reference and variant bases at a specific position. First, create a bed file with some positions of interest (we will create a file called snvs.bed using the echo command).

It will contain a single line specifying a variant position on chr22 e.g.:

22:38483683-38483683

Create the bed file

echo "22 38483683 38483683"
echo "22 38483683 38483683" > snvs.bed

Run bam-readcount on this list for the tumor and normal merged bam files

bam-readcount -l snvs.bed -f $RNA_REF_FASTA $RNA_ALIGN_DIR/UHR.bam 2>/dev/null
bam-readcount -l snvs.bed -f $RNA_REF_FASTA $RNA_ALIGN_DIR/HBR.bam 2>/dev/null

Now, run it again, but ignore stderr and redirect stdout to a file:

bam-readcount -l snvs.bed -f $RNA_REF_FASTA $RNA_ALIGN_DIR/UHR.bam 2>/dev/null 1>UHR_bam-readcounts.txt
bam-readcount -l snvs.bed -f $RNA_REF_FASTA $RNA_ALIGN_DIR/HBR.bam 2>/dev/null 1>HBR_bam-readcounts.txt

From this output you could parse the read counts for each base

cat UHR_bam-readcounts.txt | perl -ne '@data=split("\t", $_); @Adata=split(":", $data[5]); @Cdata=split(":", $data[6]); @Gdata=split(":", $data[7]); @Tdata=split(":", $data[8]); print "UHR Counts\t$data[0]\t$data[1]\tA: $Adata[1]\tC: $Cdata[1]\tT: $Tdata[1]\tG: $Gdata[1]\n";'
cat HBR_bam-readcounts.txt | perl -ne '@data=split("\t", $_); @Adata=split(":", $data[5]); @Cdata=split(":", $data[6]); @Gdata=split(":", $data[7]); @Tdata=split(":", $data[8]); print "HBR Counts\t$data[0]\t$data[1]\tA: $Adata[1]\tC: $Cdata[1]\tT: $Tdata[1]\tG: $Gdata[1]\n";'

If reading perl code isn’t your favorite thing to do, here’s a bam-readcount tutorial that uses python to parse output from bam-readcount to identify a Omicron SARS-CoV-2 variant of concern from raw sequence data.


PRACTICAL EXERCISE 7

Assignment: Index your bam files from Practical Exercise 6 and visualize in IGV.

  • Hint: As before, it may be simplest to just index and visualize the combined/merged bam files HCC1395_normal.bam and HCC1395_tumor.bam.
  • If this works, you should have two BAM files that can be loaded into IGV from the following location on your cloud instance:
    • http://YOUR_PUBLIC_IPv4_ADDRESS/rnaseq/practice/alignments/hisat2/

Questions

  • Load your merged normal and tumor BAM files into IGV. Navigate to this location on chromosome 22: ‘chr22:38,466,394-38,508,115’. What do you see here? How would you describe the direction of transcription for the two genes? Does the reported strand for the reads aligned to each of these genes appear to make sense? How do you modify IGV settings to see the strand clearly?
  • How can we modify IGV to color reads by Read Group? How many read groups are there for each sample (tumor & normal)? What are your read group names for the tumor sample?
  • What are the options for visualizing splicing or alternative splicing patterns in IGV? Navigate to this location on chromosome 22: ‘chr22:40,363,200-40,367,500’. What splicing event do you see?

Solution: When you are ready you can check your approach against the Solutions.

IGV | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

IGV


Introduction

Description of the lab Welcome to the lab for Genome Visualization! This lab will introduce you to the Integrative Genomics Viewer, one of the most popular visualization tools for High Throughput Sequencing (HTS) data.

Lecture files that accompany this tutorial:

After this lab, you will be able to:

Things to know before you start:

Thought-provoking question goes here.`

Requirements

Compatibility

This tutorial was most recently updated for IGV v2.16.2, which is available on the IGV Download page. It is recommended that you use this version. Most other recent versions will work but their may be slight differences.

Data Set for IGV

We will be using publicly available Illumina sequence data from the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer. Additional information on this cell line can be found here: HCC1143 (tumor, TNM stage IIA, grade 3, primary ductal carcinoma) and HCC1143/BL (matched normal EBV transformed lymphoblast cell line).


Visualization Part 1: Getting familiar with IGV

We will be visualizing read alignments using IGV, a popular visualization tool for HTS data.

First, lets familiarize ourselves with it.

Get familiar with the interface

Load a Genome and some Data Tracks

By default, IGV loads the Human GRCh38/hg38 reference genome. If you work with another version of the human genome, or another organism altogether, you can change the genome by clicking the drop down menu in the upper-left. For this lab, we will be using Human GRCh37/hg19.

We will also load additional tracks from Server using (File -> Load from Server...):

Load hg19 genome and additional data tracks

Load hg19 genome and additional data tracks

You should see listing of chromosomes in this reference genome. Choose 1, for chromosome 1.

Chromosome chooser

Chromosome chooser

Navigate to chr1:10,000-11,000 by entering this into the location field (in the top-left corner of the interface) and clicking Go. This shows a window of chromosome 1 that is 1,000 base pairs wide and beginning at position 10,000.

Navigition using Location text field. Sequence displayed as thin coloured rectangles.

Navigition using Location text field. Sequence displayed as thin coloured rectangles.

IGV displays the sequence of letters in a genome as a sequence of colours (e.g. A = green, C = blue, etc.). This makes repetitive sequences, like the ones found at the start of this region, easy to identify. Zoom in a bit more using the + button to see the individual bases of the reference genome sequence.

You can navigate to a gene of interest by typing it in the same box the genomic coordinates are in and pressing Enter/Return. Try it for your favourite gene, or BRCA1 if you can not decide.

Gene model

Gene model

Genes are represented as lines and boxes. Lines represent intronic regions, and boxes represent exonic regions. The arrows indicate the direction/strand of transcription for the gene. When an exon box become narrower in height, this indicates a UTR.

When loaded, tracks are stacked on top of each other. You can identify which track is which by consulting the label to the left of each track.

Region Lists

Sometimes, it is really useful to save where you are, or to load regions of interest. For this purpose, there is a Region Navigator in IGV. To access it, click Regions > Region Navigator. While you browse around the genome, you can save some bookmarks by pressing the Add button at any time.

Bookmarks in IGV

Bookmarks in IGV

Loading Read Alignments

We will be using the breast cancer cell line HCC1143 to visualize alignments. For speed, only a small portion of chr21 will be loaded (19M:20M).

HCC1143 Alignments to hg19:

Copy the files to your local drive, and in IGV choose File > Load from File..., select the bam file, and click OK. Note that the bam and index files must be in the same directory for IGV to load these properly.

Load BAM track from File

Load BAM track from File

Visualizing read alignments

Navigate to a narrow window on chromosome 21: chr21:19,480,041-19,480,386.

To start our exploration, right click on the read alignment track, and select the following options:

Experiment with the various settings by right clicking the read alignment track and toggling the options. Think about which would be best for specific tasks (e.g. quality control, SNP calling, CNV finding).

Changing how read alignments are sorted, grouped, and colored

Changing how read alignments are sorted, grouped, and colored

You will see reads represented by grey or white bars stacked on top of each other, where they were aligned to the reference genome. The reads are pointed to indicate their orientation (i.e. the strand on which they are mapped). Mouse over any read and notice that a lot of information is available. To toggle read display from hover to click, select the yellow box and change the setting.

Changing how read information is shown (i.e. on hover, click, never)

Changing how read information is shown (i.e. on hover, click, never)

Once you select a read, you will learn what many of these metrics mean, and how to use them to assess the quality of your datasets. At each base that the read sequence mismatches the reference, the colour of the base represents the letter that exists in the read (using the same colour legend used for displaying the reference).

Viewing read information for a single aligned read

Viewing read information for a single aligned read


Visualization Part 2: Inspecting SNPs, SNVs, and SVs

In this section we will be looking in detail at 8 positions in the genome, and determining whether they represent real events or artifacts.

Two neighbouring SNPs

Example1. Good quality SNVs/SNPs

Example1. Good quality SNVs/SNPs

Notes:

Question(s):

* What does *Shade base by quality* do? How might this be helpful?
* How does Color by *read strand* help?

Homopolymer region with indel

Navigate to position chr21:19,518,412-19,518,497

Example 2a

Example 2a

Example 2b

Example 2b

Notes:

Coverage by GC

Navigate to position chr21:19,611,925-19,631,555. Note that the range contains areas where coverage drops to zero in a few places.

Example 3

Example 3

Question:

* Why are there blue and red reads throughout the alignments?

Heterozygous SNPs on different alleles

Navigate to region chr21:19,666,833-19,667,007

Example 4

Example 4

Note:

Low mapping quality

Navigate to region chr21:19,800,320-19,818,162

Load repeats

Load repeats

Example 5

Example 5

Notes:

Homozygous deletion

Navigate to region chr21:19,324,469-19,331,468

Example 6

Example 6

Notes:

Mis-alignment

Navigate to region chr21:19,102,154-19,103,108

Example 7

Example 7

Notes:

Translocation

Navigate to region chr21:19,089,694-19,095,362

Example 8

Example 8

Notes:


Visualization Part 3: Automating Tasks in IGV

We can use the Tools menu to invoke running a batch script. Batch scripts are described on the IGV website:

Download the batch script and the attribute file for our dataset:

Now run the file from the Tools menu:

Automation

Automation

Notes:

Contributors/acknowledgements

Malachi Griffith, Sorana Morrissy, Jim Robinson, Ben Ainscough, Jason Walker, Obi Griffith