Reference Genomes | Griffith Lab

## RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

# Reference Genomes

### FASTA/FASTQ/GTF mini lecture

If you would like a refresher on common file formats such as FASTA, FASTQ, and GTF files, we have made a mini lecture briefly covering these.

### Obtain a reference genome from Ensembl, iGenomes, NCBI or UCSC.

In this example analysis we will use the human GRCh38 version of the genome from Ensembl. Furthermore, we are actually going to perform the analysis using only a single chromosome (chr22) and the ERCC spike-in to make it run faster.

First we will create the necessary working directory.

cd $RNA_HOME
echo $RNA_REFS_DIR
mkdir -p $RNA_REFS_DIR

The complete data from which these files were obtained can be found at: ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/. You could use wget to download the Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz file and then decompress it with gunzip. We have prepared this simplified reference for you. It contains chr22 (and ERCC transcript) fasta files in both a single combined file and individual files.

Download the reference genome file to the rnaseq working directory.

cd $RNA_REFS_DIR
wget http://genomedata.org/rnaseq-tutorial/fasta/GRCh38/chr22_with_ERCC92.fa
ls



View the first 10 lines of this file. Why does it look like this?

head chr22_with_ERCC92.fa



How many lines and characters are in this file? How long is this chromosome (in bases and Mbp)?

wc chr22_with_ERCC92.fa
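
Keep in mind that wc also counts the header lines and the newline character at the end of every line, so its character count is larger than the number of bases. A minimal sketch of one way to count only the bases is shown here on a tiny made-up FASTA file (toy.fa and its contents are hypothetical and not part of the tutorial data). The same pipeline should work on chr22_with_ERCC92.fa, where the total would also include the ERCC sequences.

```shell
# Build a tiny, made-up FASTA file so the example is self-contained.
printf '>seqA\nACGTNN\nacgt\n>seqB\nGGCC\n' > toy.fa

# Drop header lines, strip newlines, then count the remaining characters.
total_bases=$(grep -v ">" toy.fa | tr -d '\n' | wc -c | tr -d ' ')
echo "bases: $total_bases"

# Convert to Mbp (1 Mbp = 1,000,000 bases); awk handles the decimal division.
mbp=$(awk -v n="$total_bases" 'BEGIN { printf "%.6f", n / 1000000 }')
echo "Mbp: $mbp"
```

The trailing `tr -d ' '` only removes the padding that some versions of wc put in front of the number.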



View 10 lines from approximately the middle of this file. What is the significance of the upper and lower case characters?

head -n 425000 chr22_with_ERCC92.fa | tail



What is the count of each base in the entire reference genome file (skipping the header lines for each sequence)?

cat chr22_with_ERCC92.fa | grep -v ">" | perl -ne 'chomp $_;$bases{$_}++ for split //; if (eof){print "$_ $bases{$_}\n" for sort keys %bases}'
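
If you prefer to avoid Perl, the same kind of tally can probably be produced with standard Unix tools alone. The sketch below is one possible alternative, not the tutorial's own method, and it runs on a small made-up file (toy.fa is hypothetical). Upper and lower case letters are counted separately, as in the Perl version.

```shell
# Build a tiny, made-up FASTA file so the example is self-contained.
printf '>seqA\nACGTNN\nacgt\n>seqB\nGGCC\n' > toy.fa

# Skip headers, put one character per line, then sort and count duplicates.
# LC_ALL=C forces plain byte ordering so the result does not depend on locale.
base_counts=$(grep -v ">" toy.fa | fold -w 1 | LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }')
echo "$base_counts"
```

On a whole chromosome this writes one line per base before sorting, so it is likely to be slower than the Perl one-liner.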



Note: Instead of the above, you might consider getting reference genomes and associated annotations from UCSC. e.g., UCSC GRCh38 download.

Wherever you get them from, remember that the names of your reference sequences (chromosomes) must match those in your annotation gtf files (described in the next section).
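
One way to check this is to compare the two sets of names directly. The following is a minimal sketch that assumes the sequence name is the first word of each FASTA header and the first column of each GTF line. Both input files here (toy.fa and toy.gtf) are made up for illustration, with a deliberate mismatch between "22" and "chr22".

```shell
# Made-up inputs: the FASTA names the chromosome "22", one GTF record says "chr22".
printf '>22 dna_sm:chromosome\nACGT\n>ERCC-00002\nGGCC\n' > toy.fa
printf '22\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\nchr22\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n' > toy.gtf

# Sequence names in the FASTA: first word of each header, without the ">".
grep ">" toy.fa | sed 's/^>//' | cut -d ' ' -f 1 | LC_ALL=C sort -u > fasta_names.txt

# Sequence names in the GTF: column 1 of every non-comment line.
grep -v "^#" toy.gtf | cut -f 1 | LC_ALL=C sort -u > gtf_names.txt

# Names used in the GTF that are absent from the FASTA (ideally none).
missing_from_fasta=$(LC_ALL=C comm -13 fasta_names.txt gtf_names.txt)
echo "GTF names missing from FASTA: $missing_from_fasta"
```

An empty result would suggest that every annotated sequence name can be found in the reference.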

View a list of all sequences in our reference genome fasta file.

grep ">" chr22_with_ERCC92.fa
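
It can also be handy to see the length of each sequence next to its name. The awk sketch below is one possible approach, shown on a small made-up file (toy.fa is hypothetical). It takes the first word of each header as the name and sums the lengths of the sequence lines that follow.

```shell
# Build a tiny, made-up FASTA file so the example is self-contained.
printf '>seqA some description\nACGTNN\nacgt\n>seqB\nGGCC\n' > toy.fa

# On each header: print the previous record, then start a new one.
# On each sequence line: add its length to the running total.
seq_lengths=$(awk '
  /^>/ { if (name != "") print name, len; name = substr($1, 2); len = 0; next }
  { len += length($0) }
  END { if (name != "") print name, len }
' toy.fa)
echo "$seq_lengths"
```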



Assignment: Use a command-line scripting approach of your choice to further examine our chr22 reference genome file and answer the following questions.

Questions:

• How many bases on chromosome 22 correspond to repetitive elements?
• What percentage of the whole chromosome length do they represent?
• How many occurrences of the EcoRI (GAATTC) restriction site are present in the chromosome 22 sequence?

Hints:

• Each question can be tackled using approaches similar to those above, using the file ‘chr22_with_ERCC92.fa’ as a starting point.
• To make things simpler, first produce a file with only the chr22 sequence.
• Remember that repetitive elements in the sequence are represented in lower case.

Solution: When you are ready you can check your approach against the Solutions.

Introduction to Inputs | Griffith Lab

# Introduction to Inputs

### Module 1 - Key concepts

• Review central dogma, RNA sequencing, RNA-seq study design, library construction strategies, biological vs technical replicates, alignment strategies, etc.

### Module 1 - Learning objectives

• Introduction to the theory and practice of RNA sequencing (RNA-seq) analysis
• Rationale for sequencing RNA
• Challenges specific to RNA-seq
• General goals and themes of RNA-seq analysis workflows
• Common technical questions related to RNA-seq analysis
• Getting help outside of this course
• Introduction to the RNA-seq hands-on tutorial
