<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://www.rnabio.org//feed.xml" rel="self" type="application/atom+xml" /><link href="http://www.rnabio.org//" rel="alternate" type="text/html" /><updated>2025-12-03T19:16:37+00:00</updated><id>http://www.rnabio.org//feed.xml</id><title type="html">Griffith Lab</title><subtitle>The RNAbio.org site is meant to accompany RNA-seq workshops delivered at various times during the year at various places (New York, Toronto, Germany, Glasgow, etc) in collaboration with various bioinformatics workshop organizations (CSHL, CBW, Physalia, PR Informatics, etc.). It can also be used as a standalone online course. The goal of the resource is to provide a comprehensive introduction to RNA-seq, NGS data, bioinformatics, cloud computing, BAM/BED/VCF file format, read alignment, data QC, expression estimation, differential expression analysis, reference-free analysis, data visualization, transcript assembly, etc.</subtitle><author><name>Zachary Skidmore</name></author><entry><title type="html">Single-cell RNA-seq - CSHL legacy version</title><link href="http://www.rnabio.org//module-10-archive/0010/01/01/scRNA/" rel="alternate" type="text/html" title="Single-cell RNA-seq - CSHL legacy version" /><published>0010-01-01T00:00:00+00:00</published><updated>0010-01-01T00:00:00+00:00</updated><id>http://www.rnabio.org//module-10-archive/0010/01/01/scRNA</id><content type="html" xml:base="http://www.rnabio.org//module-10-archive/0010/01/01/scRNA/"><![CDATA[<h2 id="exercise-a-complete-seurat-workflow">Exercise: A Complete Seurat Workflow</h2>

<p>In this exercise, we will analyze and interpret a small scRNA-seq data set consisting of three bone marrow samples. Two of the samples are from the same patient, but differ in that one sample was enriched for a particular cell type. The goal of this analysis is to determine what cell types are present in the three samples, and how the samples and patients differ. This exercise was drawn in part from the Seurat vignettes at <a href="https://satijalab.org/seurat/vignettes.html">https://satijalab.org/seurat/vignettes.html</a>.</p>

<h3 id="step-1-preparation">Step 1: Preparation</h3>

<p>Working at the Linux command line in your workspace directory (/home/ubuntu/workspace), first download the input data into a directory called “scRNA_data”, then create a new directory called “scrna” for your output files. The full path to the output directory will be /home/ubuntu/workspace/scrna. The commands are:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> ~/workspace/scRNA_data
<span class="nb">cd</span> ~/workspace/scRNA_data
wget <span class="nt">-r</span> <span class="nt">-N</span> <span class="nt">--no-parent</span> <span class="nt">-nH</span> <span class="nt">--reject</span> zip <span class="nt">-R</span> <span class="s2">"index.html*"</span> <span class="nt">--cut-dirs</span><span class="o">=</span>2 http://genomedata.org/rnaseq-tutorial/scrna/
<span class="nb">cd</span> ~/workspace
<span class="nb">mkdir </span>scrna
<span class="nb">cd </span>scrna
wget http://genomedata.org/rnaseq-tutorial/scrna/PlotMarkers.r
</code></pre></div></div>

<p>Start R, then load the required R libraries as follows:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="s2">"Seurat"</span><span class="p">);</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"sctransform"</span><span class="p">);</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"dplyr"</span><span class="p">);</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"RColorBrewer"</span><span class="p">);</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggthemes"</span><span class="p">);</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">);</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"cowplot"</span><span class="p">);</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"data.table"</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>

<p>Create a vector of convenient sample names, such as “A”, “B”, and “C”:</p>

<p><code class="language-plaintext highlighter-rouge">samples = c("A","B","C");</code></p>

<p>Create a variable called <code class="language-plaintext highlighter-rouge">outdir</code> to specify your output directory:</p>

<p><code class="language-plaintext highlighter-rouge">outdir = "/home/ubuntu/workspace/scrna";</code></p>
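
<p>If the output directory might not exist yet when you start R, a minimal sketch such as the following creates it before any files are written. The directory used here is a hypothetical stand-in under <code class="language-plaintext highlighter-rouge">tempdir()</code> and the name <code class="language-plaintext highlighter-rouge">outdir.demo</code> is an assumption for illustration; in the exercise you would use your own <code class="language-plaintext highlighter-rouge">outdir</code>.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Stand-in for the real output directory (hypothetical path under tempdir())
outdir.demo &lt;- file.path(tempdir(), "scrna");
# Create the directory if it is missing; recursive = TRUE also creates parent directories
if (!dir.exists(outdir.demo)) dir.create(outdir.demo, recursive = TRUE);
# Build an output file name the same way the plotting steps do
plot.path.demo &lt;- sprintf("%s/VlnPlot.pdf", outdir.demo);
print(plot.path.demo);
</code></pre></div></div>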

<h3 id="step-2-read-in-the-feature-barcode-matrices-generated-by-the-cellranger-pipeline">Step 2: Read in the feature-barcode matrices generated by the cellranger pipeline</h3>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data.10x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">();</span><span class="w"> </span><span class="c1"># first declare an empty list in which to hold the feature-barcode matrices</span><span class="w">
</span><span class="n">data.10x</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Read10X</span><span class="p">(</span><span class="n">data.dir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/workspace/scRNA_data/ND050119_CD34_3pV3/filtered_feature_bc_matrix"</span><span class="p">);</span><span class="w">
</span><span class="n">data.10x</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Read10X</span><span class="p">(</span><span class="n">data.dir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/workspace/scRNA_data/ND050119_WBM_3pV3/filtered_feature_bc_matrix"</span><span class="p">);</span><span class="w">
</span><span class="n">data.10x</span><span class="p">[[</span><span class="m">3</span><span class="p">]]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Read10X</span><span class="p">(</span><span class="n">data.dir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"~/workspace/scRNA_data/ND050819_WBM_3pV3/filtered_feature_bc_matrix"</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>

<h3 id="step-3-convert-each-feature-barcode-matrix-to-a-seurat-object">Step 3: Convert each feature-barcode matrix to a Seurat object</h3>

<p>This simultaneously performs some initial filtering in order to exclude genes that are expressed in fewer than 100 cells, and to exclude cells that contain fewer than 700 expressed genes. Note that min.cells=10 and min.features=100 are more common parameters at this stage, but we are filtering more aggressively in order to make the data set smaller. At this step, we also create a “DataSet” identity for each cell.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna.list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">();</span><span class="w"> </span><span class="c1"># First create an empty list to hold the Seurat objects</span><span class="w">
</span><span class="n">scrna.list</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CreateSeuratObject</span><span class="p">(</span><span class="n">counts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.10x</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="w"> </span><span class="n">min.cells</span><span class="o">=</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">min.features</span><span class="o">=</span><span class="m">700</span><span class="p">,</span><span class="w"> </span><span class="n">project</span><span class="o">=</span><span class="n">samples</span><span class="p">[</span><span class="m">1</span><span class="p">]);</span><span class="w">
</span><span class="n">scrna.list</span><span class="p">[[</span><span class="m">1</span><span class="p">]][[</span><span class="s2">"DataSet"</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">samples</span><span class="p">[</span><span class="m">1</span><span class="p">];</span><span class="w">
</span><span class="n">scrna.list</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CreateSeuratObject</span><span class="p">(</span><span class="n">counts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.10x</span><span class="p">[[</span><span class="m">2</span><span class="p">]],</span><span class="w"> </span><span class="n">min.cells</span><span class="o">=</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">min.features</span><span class="o">=</span><span class="m">700</span><span class="p">,</span><span class="w"> </span><span class="n">project</span><span class="o">=</span><span class="n">samples</span><span class="p">[</span><span class="m">2</span><span class="p">]);</span><span class="w">
</span><span class="n">scrna.list</span><span class="p">[[</span><span class="m">2</span><span class="p">]][[</span><span class="s2">"DataSet"</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">samples</span><span class="p">[</span><span class="m">2</span><span class="p">];</span><span class="w">
</span><span class="n">scrna.list</span><span class="p">[[</span><span class="m">3</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CreateSeuratObject</span><span class="p">(</span><span class="n">counts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.10x</span><span class="p">[[</span><span class="m">3</span><span class="p">]],</span><span class="w"> </span><span class="n">min.cells</span><span class="o">=</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">min.features</span><span class="o">=</span><span class="m">700</span><span class="p">,</span><span class="w"> </span><span class="n">project</span><span class="o">=</span><span class="n">samples</span><span class="p">[</span><span class="m">3</span><span class="p">]);</span><span class="w">
</span><span class="n">scrna.list</span><span class="p">[[</span><span class="m">3</span><span class="p">]][[</span><span class="s2">"DataSet"</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">samples</span><span class="p">[</span><span class="m">3</span><span class="p">];</span><span class="w">
</span></code></pre></div></div>

<p>Aside: Note that you can do this more efficiently, especially if you have many samples, using a ‘for’ loop:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq_along</span><span class="p">(</span><span class="n">data.10x</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">scrna.list</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CreateSeuratObject</span><span class="p">(</span><span class="n">counts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.10x</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="w"> </span><span class="n">min.cells</span><span class="o">=</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">min.features</span><span class="o">=</span><span class="m">700</span><span class="p">,</span><span class="w"> </span><span class="n">project</span><span class="o">=</span><span class="n">samples</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span><span class="w">
    </span><span class="n">scrna.list</span><span class="p">[[</span><span class="n">i</span><span class="p">]][[</span><span class="s2">"DataSet"</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">samples</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
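
<p>To make the two filters concrete, here is a minimal base-R sketch on a made-up count matrix. The matrix, the gene and cell names, and the small thresholds are all hypothetical. The sketch assumes that cells are filtered on <code class="language-plaintext highlighter-rouge">min.features</code> first and genes on <code class="language-plaintext highlighter-rouge">min.cells</code> second, which is our understanding of how Seurat applies them; treat that order as an assumption.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy count matrix: 4 genes (rows) x 5 cells (columns); all values are invented
counts.demo &lt;- matrix(c(1, 0, 0, 2, 0,
                        3, 1, 2, 0, 0,
                        0, 0, 0, 0, 4,
                        5, 2, 1, 1, 0),
                      nrow = 4, byrow = TRUE,
                      dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)));
min.features.demo &lt;- 2;  # keep cells with at least 2 detected genes
min.cells.demo &lt;- 3;     # keep genes detected in at least 3 cells
# Cell filter: count the genes with non-zero counts in each cell
genes.per.cell &lt;- colSums(counts.demo &gt; 0);
counts.demo &lt;- counts.demo[, genes.per.cell &gt;= min.features.demo, drop = FALSE];
# Gene filter: count the remaining cells in which each gene is detected
cells.per.gene &lt;- rowSums(counts.demo &gt; 0);
counts.demo &lt;- counts.demo[cells.per.gene &gt;= min.cells.demo, , drop = FALSE];
dim(counts.demo);  # genes x cells remaining after both filters
</code></pre></div></div>

<p>Raising either threshold shrinks the matrix further, which is why the aggressive values used in this exercise produce a smaller data set.</p>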

<p>Finally, remove the raw data to save memory (these objects get large!):</p>

<p><code class="language-plaintext highlighter-rouge">rm(data.10x);</code></p>

<h3 id="step-4-merge-the-seurat-objects-into-a-single-object">Step 4: Merge the Seurat objects into a single object</h3>

<p>We will call this object <code class="language-plaintext highlighter-rouge">scrna</code>. We also give it a project name (here, “CSHL”), and prepend the appropriate data set name to each cell barcode. For example, if a barcode from data set “B” is originally <code class="language-plaintext highlighter-rouge">AATCTATCTCTC</code>, it will now be <code class="language-plaintext highlighter-rouge">B_AATCTATCTCTC</code>. Then clean up some space by removing <code class="language-plaintext highlighter-rouge">scrna.list</code>. Finally, save the merged object as an RDS file. Should you need to load this file into R at any time, it can be done using the <code class="language-plaintext highlighter-rouge">readRDS</code> command.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">scrna.list</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="n">scrna.list</span><span class="p">[[</span><span class="m">2</span><span class="p">]],</span><span class="n">scrna.list</span><span class="p">[[</span><span class="m">3</span><span class="p">]]),</span><span class="w"> </span><span class="n">add.cell.ids</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="s2">"B"</span><span class="p">,</span><span class="s2">"C"</span><span class="p">),</span><span class="w"> </span><span class="n">project</span><span class="o">=</span><span class="s2">"CSHL"</span><span class="p">);</span><span class="w">
</span><span class="n">rm</span><span class="p">(</span><span class="n">scrna.list</span><span class="p">);</span><span class="w"> </span><span class="c1"># save some memory</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="p">)</span><span class="w"> </span><span class="c1"># examine the structure of the Seurat object meta data</span><span class="w">
</span><span class="n">saveRDS</span><span class="p">(</span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/MergedSeuratObject.rds"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">));</span><span class="w">
</span></code></pre></div></div>
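
<p>The barcode renaming and the RDS round trip can be illustrated without Seurat. The sketch below uses hypothetical barcodes and a temporary file as a stand-in for a path under <code class="language-plaintext highlighter-rouge">outdir</code>; the variable names are assumptions for illustration only.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical barcodes from data set "B"
barcodes.demo &lt;- c("AATCTATCTCTC", "GGCTAGCTAGGA");
# Prepend the data set name, separated by an underscore
prefixed.demo &lt;- paste("B", barcodes.demo, sep = "_");
print(prefixed.demo);
# Round trip through an RDS file; tempfile() stands in for a path under outdir
rds.demo &lt;- tempfile(fileext = ".rds");
saveRDS(prefixed.demo, file = rds.demo);
restored.demo &lt;- readRDS(rds.demo);
identical(prefixed.demo, restored.demo);  # the restored object matches the saved one
</code></pre></div></div>

<p>In the exercise itself, the merged object saved above could be reloaded in a later session with <code class="language-plaintext highlighter-rouge">scrna &lt;- readRDS(sprintf("%s/MergedSeuratObject.rds", outdir))</code>.</p>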

<h3 id="aside-on-accessing-the-seurat-object-meta-data-which-is-stored-in-scrnametadata">Aside on accessing the Seurat object meta data, which is stored in scrna@meta.data</h3>

<p>Meta data can be used to hold the following information (and more) for your data set:</p>

<ul>
  <li>Summary statistics</li>
  <li>Sample name</li>
  <li>Cluster membership for each cell</li>
  <li>Cell cycle phase for each cell</li>
  <li>Batch or sample for each cell</li>
  <li>Other custom annotations for each cell</li>
</ul>

<p>You can access and query the meta data using commands such as:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna</span><span class="p">[[]];</span><span class="w">
</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="p">;</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="p">);</span><span class="w"> </span><span class="c1"># Examine structure and contents of meta data</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nFeature_RNA</span><span class="p">);</span><span class="w"> </span><span class="c1"># Access the number of genes (“features”) detected in each cell</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nCount_RNA</span><span class="p">);</span><span class="w"> </span><span class="c1"># Access the number of UMIs in each cell</span><span class="w">
</span><span class="n">levels</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">scrna</span><span class="p">);</span><span class="w"> </span><span class="c1"># List the items in the current default cell identity class</span><span class="w">
</span><span class="nf">length</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">seurat_clusters</span><span class="p">));</span><span class="w"> </span><span class="c1"># How many clusters are there? Note that there will not be any clusters in the meta data until you perform clustering.</span><span class="w">
</span><span class="n">unique</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">Batch</span><span class="p">);</span><span class="w"> </span><span class="c1"># What batches are included in this data set? This assumes a Batch column has been added; in this exercise the sample label is stored in DataSet.</span><span class="w">
</span><span class="n">scrna</span><span class="o">$</span><span class="n">NewIdentity</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">vector_of_annotations</span><span class="p">;</span><span class="w"> </span><span class="c1"># Assign new cell annotations to a new "identity class" in the meta data</span><span class="w">
</span></code></pre></div></div>
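
<p>Because the meta data is an ordinary data frame with one row per cell, standard data frame operations apply to it. The sketch below works on a made-up stand-in rather than a real Seurat object: the column names <code class="language-plaintext highlighter-rouge">nCount_RNA</code>, <code class="language-plaintext highlighter-rouge">nFeature_RNA</code> and <code class="language-plaintext highlighter-rouge">DataSet</code> come from this exercise, while the values, the row names and the <code class="language-plaintext highlighter-rouge">UMIsPerGene</code> column are invented for illustration.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical stand-in for scrna@meta.data: one row per cell
meta.demo &lt;- data.frame(
    nCount_RNA = c(5200, 830, 12040, 2990),
    nFeature_RNA = c(1800, 410, 3100, 1200),
    DataSet = c("A", "A", "B", "C"),
    row.names = c("A_AAAC", "A_AAAG", "B_AAAC", "C_AAAC"));
table(meta.demo$DataSet);         # number of cells per data set
summary(meta.demo$nFeature_RNA);  # distribution of detected genes per cell
# Add a custom per-cell annotation as a new column
meta.demo$UMIsPerGene &lt;- meta.demo$nCount_RNA / meta.demo$nFeature_RNA;
# Select the cells with more than 1000 detected genes
meta.demo[meta.demo$nFeature_RNA &gt; 1000, ];
</code></pre></div></div>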

<h3 id="step-5-quality-control-plots">Step 5: Quality control plots</h3>

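<p>Plot the distributions of several quality-control variables in order to choose appropriate filtering thresholds. Seurat automatically calculates the number of genes and UMIs for every cell and stores them in the meta data as <code class="language-plaintext highlighter-rouge">nFeature_RNA</code> and <code class="language-plaintext highlighter-rouge">nCount_RNA</code>. However, you will need to manually calculate the mitochondrial and ribosomal transcript percentages for each cell, and add them to the Seurat object meta data, as shown below. Note that the code below stores these values as fractions between 0 and 1 rather than multiplying them by 100.</p>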

<p>Calculate the mitochondrial transcript percentage for each cell:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mito.genes</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">grep</span><span class="p">(</span><span class="n">pattern</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"^MT-"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">),</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">);</span><span class="w">
</span><span class="n">percent.mito</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Matrix</span><span class="o">::</span><span class="n">colSums</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">GetAssayData</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'counts'</span><span class="p">)[</span><span class="n">mito.genes</span><span class="p">,</span><span class="w"> </span><span class="p">])</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Matrix</span><span class="o">::</span><span class="n">colSums</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">GetAssayData</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'counts'</span><span class="p">));</span><span class="w">
</span><span class="n">scrna</span><span class="p">[[</span><span class="s1">'percent.mito'</span><span class="p">]]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">percent.mito</span><span class="p">;</span><span class="w">
</span></code></pre></div></div>

<p>Calculate the ribosomal transcript percentage for each cell:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ribo.genes</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">grep</span><span class="p">(</span><span class="n">pattern</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"^RP[SL][[:digit:]]"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">),</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">);</span><span class="w">
</span><span class="n">percent.ribo</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Matrix</span><span class="o">::</span><span class="n">colSums</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">GetAssayData</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'counts'</span><span class="p">)[</span><span class="n">ribo.genes</span><span class="p">,</span><span class="w"> </span><span class="p">])</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Matrix</span><span class="o">::</span><span class="n">colSums</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">GetAssayData</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'counts'</span><span class="p">));</span><span class="w">
</span><span class="n">scrna</span><span class="p">[[</span><span class="s1">'percent.ribo'</span><span class="p">]]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">percent.ribo</span><span class="p">;</span><span class="w">
</span></code></pre></div></div>
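
<p>The same calculation can be checked by hand on a tiny example. The sketch below uses an invented count matrix with two mitochondrial genes and one other gene; the counts and the <code class="language-plaintext highlighter-rouge">.demo</code> variable names are hypothetical. It also shows the conversion from a fraction to a percentage, should you prefer to store values on a 0 to 100 scale.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy count matrix: 3 genes (rows) x 2 cells (columns); all counts are invented
qc.counts.demo &lt;- matrix(c(10,  0,
                           30,  5,
                           60, 95),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(c("MT-CO1", "MT-ND1", "ACTB"), c("cell1", "cell2")));
# Select mitochondrial genes by their "MT-" prefix
mito.demo &lt;- grep(pattern = "^MT-", x = rownames(qc.counts.demo), value = TRUE);
# Mitochondrial counts divided by total counts, per cell (a fraction between 0 and 1)
frac.mito.demo &lt;- colSums(qc.counts.demo[mito.demo, , drop = FALSE]) / colSums(qc.counts.demo);
# Multiply by 100 to express the same quantity as a percentage
pct.mito.demo &lt;- 100 * frac.mito.demo;
print(frac.mito.demo);
print(pct.mito.demo);
</code></pre></div></div>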

<p>Plot these variables as violin plots, which will be saved to, for example, ~/workspace/scrna/VlnPlot.pdf. All figures can be downloaded using the scp command, or viewed on the AWS server.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/VlnPlot.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">13</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">);</span><span class="w">
</span><span class="n">vln</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">VlnPlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">features</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"percent.mito"</span><span class="p">,</span><span class="w"> </span><span class="s2">"percent.ribo"</span><span class="p">),</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">group.by</span><span class="o">=</span><span class="s2">"DataSet"</span><span class="p">);</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">vln</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">

</span><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/VlnPlot.nCount.25Kmax.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">vln</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">VlnPlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">features</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nCount_RNA"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">group.by</span><span class="o">=</span><span class="s2">"DataSet"</span><span class="p">,</span><span class="w"> </span><span class="n">y.max</span><span class="o">=</span><span class="m">25000</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">vln</span><span class="p">)</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">

</span><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/VlnPlot.nFeature.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">vln</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">VlnPlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">features</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nFeature_RNA"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">group.by</span><span class="o">=</span><span class="s2">"DataSet"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">vln</span><span class="p">)</span><span class="w">
</span><span class="n">dev.off</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>

<p>QUESTIONS:</p>

<ol>
  <li>Excessive mitochondrial transcripts can indicate the presence of dead cells, which tend to cluster together. Based on the distribution of mitochondrial transcripts, what filter threshold would you set for mitochondrial transcripts? One approach is to start with a lenient threshold, work through the analysis, and determine later whether your data still contains clusters of dead cells.</li>
  <li>Compare the distribution of ribosomal transcripts, total transcripts, and genes in each sample. Are differences in these parameters necessarily a technical artifact, or might they contain information about the biology of the samples?</li>
</ol>

<p>Next, we will use Seurat’s FeatureScatter function to create scatterplots of the relationships among QC variables. This can be helpful in selecting filtering thresholds. More generally, this is a very useful wrapper function that can be used to visualize relationships between any pair of quantitative variables in the Seurat object (including expression levels, etc).</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/Scatter1.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">);</span><span class="w">
</span><span class="n">scatter</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FeatureScatter</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">feature1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nCount_RNA"</span><span class="p">,</span><span class="w"> </span><span class="n">feature2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"percent.mito"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">scatter</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">

</span><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/Scatter2.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">);</span><span class="w">
</span><span class="n">scatter</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FeatureScatter</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">feature1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nCount_RNA"</span><span class="p">,</span><span class="w"> </span><span class="n">feature2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"percent.ribo"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">scatter</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">

</span><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/Scatter3.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">);</span><span class="w">
</span><span class="n">scatter</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FeatureScatter</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">feature1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nCount_RNA"</span><span class="p">,</span><span class="w"> </span><span class="n">feature2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nFeature_RNA"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">scatter</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span></code></pre></div></div>

<h3 id="step-6-calculate-a-cell-cycle-score-for-each-cell">Step 6. Calculate a cell cycle score for each cell</h3>

<p>This can be used to determine whether heterogeneity in cell cycle phase is driving the tSNE/UMAP layout and/or clustering. This may or may not be obscuring the signal you care about, depending on your analysis goals and the nature of the data. (If necessary, it can be removed in a later step.) It is also useful for determining whether certain populations of cells are more proliferative than others. The list of cell cycle genes and the scoring method were taken from Tirosh I, et al. (2016).</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cell.cycle.tirosh</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s2">"http://genomedata.org/rnaseq-tutorial/scrna/CellCycleTiroshSymbol2ID.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">);</span><span class="w"> </span><span class="c1"># read in the list of genes</span><span class="w">
</span><span class="n">s.genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cell.cycle.tirosh</span><span class="o">$</span><span class="n">Gene.Symbol</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">cell.cycle.tirosh</span><span class="o">$</span><span class="n">List</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"G1/S"</span><span class="p">)];</span><span class="w"> </span><span class="c1"># create a vector of S-phase genes</span><span class="w">
</span><span class="n">g2m.genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cell.cycle.tirosh</span><span class="o">$</span><span class="n">Gene.Symbol</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">cell.cycle.tirosh</span><span class="o">$</span><span class="n">List</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"G2/M"</span><span class="p">)];</span><span class="w"> </span><span class="c1"># create a vector of G2/M-phase genes</span><span class="w">
</span><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">CellCycleScoring</span><span class="p">(</span><span class="n">object</span><span class="o">=</span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">s.features</span><span class="o">=</span><span class="n">s.genes</span><span class="p">,</span><span class="w"> </span><span class="n">g2m.features</span><span class="o">=</span><span class="n">g2m.genes</span><span class="p">,</span><span class="w"> </span><span class="n">set.ident</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
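
<p>As a quick check on the result, the sketch below tabulates cell cycle phase by sample. It uses a small made-up data frame (meta) in place of scrna@meta.data. The Phase column mirrors the one that CellCycleScoring adds alongside S.Score and G2M.Score; the sample names and values are hypothetical.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy stand-in for scrna@meta.data (values are invented for illustration)
meta &lt;- data.frame(
  DataSet = c("A", "A", "A", "B", "B", "B"),
  Phase   = c("G1", "S", "G2M", "G1", "G1", "S")
)
# Count cells per phase in each sample
phase.counts &lt;- table(meta$DataSet, meta$Phase)
# Convert to row-wise proportions so samples of different sizes are comparable
phase.prop &lt;- prop.table(phase.counts, margin = 1)
print(round(phase.prop, 2))
</code></pre></div></div>

<p>With the real object, the same idea would be table(scrna@meta.data$DataSet, scrna@meta.data$Phase).</p>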

<h3 id="step-7-filter-the-cells-to-remove-debris-dead-cells-and-probable-doublets">Step 7. Filter the cells to remove debris, dead cells, and probable doublets</h3>

<p><strong>QUESTION:</strong> How many cells are there in each sample before filtering? The ‘table’ function may come in handy.</p>

<p>First calculate some basic statistics on the various QC parameters, which can be helpful for choosing cutoffs. For example:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">min</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nFeature_RNA</span><span class="p">);</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nFeature_RNA</span><span class="p">)</span><span class="w">
</span><span class="n">max</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nFeature_RNA</span><span class="p">)</span><span class="w">    
</span><span class="n">s</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sd</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nFeature_RNA</span><span class="p">)</span><span class="w">
</span><span class="n">min1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nCount_RNA</span><span class="p">)</span><span class="w">
</span><span class="n">max1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nCount_RNA</span><span class="p">)</span><span class="w">
</span><span class="n">m1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nCount_RNA</span><span class="p">)</span><span class="w">
</span><span class="n">s1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sd</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nCount_RNA</span><span class="p">)</span><span class="w">
</span><span class="n">Count93</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">quantile</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">nCount_RNA</span><span class="p">,</span><span class="w"> </span><span class="m">0.93</span><span class="p">)</span><span class="w"> </span><span class="c1"># calculate value in the 93rd percentile</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"Feature stats:"</span><span class="p">,</span><span class="n">min</span><span class="p">,</span><span class="n">m</span><span class="p">,</span><span class="n">max</span><span class="p">,</span><span class="n">s</span><span class="p">));</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"UMI stats:"</span><span class="p">,</span><span class="n">min1</span><span class="p">,</span><span class="n">m1</span><span class="p">,</span><span class="n">max1</span><span class="p">,</span><span class="n">s1</span><span class="p">,</span><span class="n">Count93</span><span class="p">));</span><span class="w">
</span></code></pre></div></div>

<p>Now, filter the data using the subset function and your chosen thresholds. Note that for large data sets with diverse samples, it may be beneficial to use sample-specific thresholds for some parameters. If you are not sure what thresholds to use, the following will work well for the purposes of this course:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">subset</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">subset</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nFeature_RNA</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">700</span><span class="w">  </span><span class="o">&amp;</span><span class="w"> </span><span class="n">nCount_RNA</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">Count93</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">percent.mito</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
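
<p>The note above about sample-specific thresholds can be sketched as follows. This is a minimal illustration on a made-up data frame (meta) standing in for scrna@meta.data; using the 93rd percentile within each sample is only one possible choice, not a recommendation.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy stand-in for scrna@meta.data (values are invented for illustration)
meta &lt;- data.frame(
  DataSet    = rep(c("A", "B"), each = 5),
  nCount_RNA = c(1000, 2000, 3000, 4000, 50000, 100, 200, 300, 400, 5000)
)
# Compute the 93rd percentile of UMI counts within each sample,
# repeated for every cell so it can be compared row by row
meta$cutoff &lt;- ave(meta$nCount_RNA, meta$DataSet,
                   FUN = function(x) quantile(x, 0.93))
# Keep cells that fall below their own sample's cutoff
keep &lt;- meta$nCount_RNA &lt; meta$cutoff
# Cells retained per sample
print(table(meta$DataSet[keep]))
</code></pre></div></div>

<p>With the real object, the resulting logical vector could in principle be passed on as subset(scrna, cells = rownames(scrna@meta.data)[keep]).</p>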

<p>QUESTION: How many cells are there in each sample after filtering?</p>

<h3 id="step-8-optional-subset-the-data">Step 8. <strong>[Optional]</strong> Subset the data</h3>

<p>If necessary, you can subset the data set to N cells (2000, 5000, etc) to make it more manageable:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">subcells</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">Cells</span><span class="p">(</span><span class="n">scrna</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">subset</span><span class="p">(</span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">cells</span><span class="o">=</span><span class="n">subcells</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
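
<p>Because sample draws cells at random, the subset will differ from run to run unless the random seed is fixed first. The sketch below shows the effect using hypothetical barcodes in place of Cells(scrna); the seed value and N are arbitrary examples.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical cell barcodes standing in for Cells(scrna)
barcodes &lt;- paste0("cell", 1:100)
N &lt;- 10  # number of cells to keep (example value)
set.seed(42)  # fix the random number generator so the draw is repeatable
first.draw &lt;- sample(barcodes, size = N, replace = FALSE)
set.seed(42)
second.draw &lt;- sample(barcodes, size = N, replace = FALSE)
# The same seed yields the same cells
print(identical(first.draw, second.draw))
</code></pre></div></div>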

<h3 id="step-9-normalize-the-data-detect-variable-genes-and-scale-the-data">Step 9. Normalize the data, detect variable genes, and scale the data</h3>

<p>Normalize the data:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">NormalizeData</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">normalization.method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"LogNormalize"</span><span class="p">,</span><span class="w"> </span><span class="n">scale.factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e6</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>

<p>QUESTION: What does LogNormalize do mathematically? Are there other normalization options available?</p>
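
<p>One way to check your answer is to reproduce the calculation by hand. As described in the Seurat documentation, LogNormalize divides each cell's counts by that cell's total, multiplies by the scale factor, and takes the natural log of the result plus one. The sketch below applies those steps in base R to a tiny invented count matrix; it does not call Seurat itself.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy count matrix: genes in rows, cells in columns (values are invented)
counts &lt;- matrix(c(1, 0, 3,
                   4, 2, 0),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))
scale.factor &lt;- 1e6
# Divide each column (cell) by its total count
frac &lt;- sweep(counts, 2, colSums(counts), "/")
# Multiply by the scale factor and take the natural log of (x + 1)
lognorm &lt;- log1p(frac * scale.factor)
print(round(lognorm, 3))
</code></pre></div></div>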

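<p>Now identify and plot the most variable genes, which will be used for downstream analyses. This is a critical step that reduces the contribution of noise. Consider adjusting the selection settings if you think (often based on prior knowledge of your experimental system) that important genes are being excluded. Note that, as far as we are aware, the ‘vst’ method used below ranks genes and keeps a fixed number of them (the nfeatures argument, 2000 by default), whereas the mean.cutoff and dispersion.cutoff arguments apply to the ‘mean.var.plot’ method; adjust whichever setting matches the selection method you use.</p>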

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FindVariableFeatures</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">selection.method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'vst'</span><span class="p">,</span><span class="w"> </span><span class="n">mean.cutoff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span><span class="m">8</span><span class="p">),</span><span class="w"> </span><span class="n">dispersion.cutoff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="kc">Inf</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"Number of Variable Features: "</span><span class="p">,</span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">VariableFeatures</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">))));</span><span class="w">

</span><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/VG.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">useDingbats</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">vg</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">VariableFeaturePlot</span><span class="p">(</span><span class="n">scrna</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">vg</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>

<p>Scale and center the data:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ScaleData</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">features</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">),</span><span class="w"> </span><span class="n">verbose</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>
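
<p>To make "scale and center" concrete, the sketch below standardizes each gene across cells in base R, so that every gene ends up with mean 0 and standard deviation 1. The matrix is invented for illustration. Seurat's ScaleData additionally caps extreme scaled values (its scale.max argument), which this sketch leaves out.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy normalized expression matrix: genes in rows, cells in columns
expr &lt;- matrix(c(1, 2, 3, 4,
                 10, 10, 20, 20),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4)))
# scale() works on columns, so transpose to standardize each gene across cells
scaled &lt;- t(scale(t(expr), center = TRUE, scale = TRUE))
# Each gene now has mean 0 and standard deviation 1 across cells
print(round(rowMeans(scaled), 10))
print(apply(scaled, 1, sd))
</code></pre></div></div>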

<p>Alternatively, you can scale the data and simultaneously remove unwanted signal associated with variables such as cell cycle phase, ribosomal transcript content, etc. (This is slow, and cannot be done in the time allotted for this course.) To remove cell cycle signal, for instance:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># scrna &lt;- ScaleData(object = scrna, features = rownames(x = scrna), vars.to.regress = c("S.Score","G2M.Score"), verbose=FALSE);</span><span class="w">
</span></code></pre></div></div>

<p>Save the normalized, scaled Seurat object:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">saveRDS</span><span class="p">(</span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/VST.rds"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">));</span><span class="w">
</span></code></pre></div></div>

<p><strong>DIGRESSION:</strong> How can you use Seurat-processed data with packages that are not compatible with Seurat? Other packages may require the data to be normalized in a specific way, and often require an expression matrix (not a Seurat object) as input. As an example, here we prepare an expression data matrix for use with the popular CNV-detection package CONICSmat:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna.cnv</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">NormalizeData</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">normalization.method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"RC"</span><span class="p">,</span><span class="w"> </span><span class="n">scale.factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e5</span><span class="p">);</span><span class="w">
</span><span class="n">data.cnv</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">GetAssayData</span><span class="p">(</span><span class="n">object</span><span class="o">=</span><span class="n">scrna.cnv</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="o">=</span><span class="s2">"data"</span><span class="p">);</span><span class="w"> </span><span class="c1"># get the normalized data</span><span class="w">
</span><span class="n">log2data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">log2</span><span class="p">(</span><span class="n">data.cnv</span><span class="m">+1</span><span class="p">);</span><span class="w"> </span><span class="c1"># add 1 then take log2</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">log2data</span><span class="p">));</span><span class="w"> </span><span class="c1"># convert it to a data frame</span><span class="w">
</span><span class="n">cells</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">colnames</span><span class="p">(</span><span class="n">df</span><span class="p">));</span><span class="w">
</span><span class="n">genes</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">rownames</span><span class="p">(</span><span class="n">df</span><span class="p">));</span><span class="w">
</span><span class="c1"># save as text files:</span><span class="w">
</span><span class="n">fwrite</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">genes</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"genes.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">col.names</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">);</span><span class="w">
</span><span class="n">fwrite</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cells</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cells.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">col.names</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">);</span><span class="w">
</span><span class="n">fwrite</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"exp.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">col.names</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>

<h3 id="step-10-reduce-the-dimensionality-of-the-data-using-principal-component-analysis">Step 10. Reduce the dimensionality of the data using Principal Component Analysis</h3>

<p>Subsequent calculations, such as those used to derive the tSNE and UMAP projections, and the k-Nearest Neighbor graph used for clustering, are performed in a new space with fewer dimensions, namely, the principal components. Here, specify a relatively large number of principal components – more than you anticipate using for downstream analyses. Then use several techniques to characterize the components and estimate the number of principal components that captures the signal of interest while minimizing noise.</p>

<p>Perform Principal Component Analysis (PCA), and save the first 100 components:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">RunPCA</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">npcs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>

<p><strong>OPTIONAL:</strong> Then run ProjectDim, which scores each gene in the dataset (including genes not included in the PCA) based on its correlation with the calculated components. This is not used elsewhere in this pipeline, but it can be useful for exploring genes that are not among the 2000 most highly variable genes selected above.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ProjectDim</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><strong>QUESTION:</strong> What do the principal components “mean” from a biological standpoint? What genes contribute to the principal components? Do they represent biological processes of interest, or technical variables (such as mitochondrial transcripts) that suggest the data may need to be filtered differently?</p>

<p>There are several easy ways to investigate these questions. First, visualize the PCA “loadings.” Each “component” identified by PCA is a linear combination, or weighted sum, of the genes in the data set. Here, the “loadings” represent the weights of the genes in any given component. These plots tell you which genes contribute most to each component:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/VizDimLoadings.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">);</span><span class="w">
</span><span class="n">vdl</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">VizDimLoadings</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">dims</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">vdl</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span></code></pre></div></div>
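
<p>To make the idea of a “weighted sum” concrete, here is a minimal sketch in base R that uses simulated data and prcomp rather than Seurat. The matrix expr and all of its values are made up for illustration. It shows that a cell’s score on the first component is the sum of its scaled gene values multiplied by the loadings:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy illustration in base R (not Seurat); all values are simulated
set.seed(1)
expr = matrix(rnorm(50 * 6), nrow = 50, ncol = 6,
              dimnames = list(paste0("cell", 1:50), paste0("gene", 1:6)))
pca = prcomp(expr, center = TRUE, scale. = TRUE)

loadings = pca$rotation   # genes x components: the weights
scaled = scale(expr)      # the same centering and scaling that prcomp applied

# Score of each cell on PC1, computed by hand as a weighted sum of its genes
pc1.by.hand = as.vector(scaled %*% loadings[, "PC1"])

# The hand-computed scores agree with the scores reported by prcomp
all.equal(pc1.by.hand, as.vector(pca$x[, "PC1"]))

# Genes ranked by the size of their weight in PC1
sort(abs(loadings[, "PC1"]), decreasing = TRUE)
</code></pre></div></div>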

<p>Second, use the DimHeatmap function to generate heatmaps that summarize the expression of the most highly weighted genes in each principal component. As noted in the Seurat documentation, “both cells and genes are ordered according to their PCA scores. Setting cells.use to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated gene sets.” (In the call below, this argument is named cells.)</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/PCA.heatmap.multi.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8.5</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">24</span><span class="p">);</span><span class="w">
</span><span class="n">hm.multi</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DimHeatmap</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">dims</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">cells</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="n">balanced</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">);</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">hm.multi</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span></code></pre></div></div>

<p>Finally, you can generate ranked lists of the genes in each principal component and perform functional enrichment or Gene Set Enrichment Analysis. (This <a href="https://toppgene.cchmc.org/enrichment.jsp">tool</a> offers a quick and easy way to determine functional enrichment from a list of genes.) For example, for the first principal component:</p>

<p><code class="language-plaintext highlighter-rouge">PClist_1 &lt;- names(sort(Loadings(object=scrna, reduction="pca")[,1], decreasing=TRUE));</code></p>
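
<p>Both ends of such a ranked list are informative, because genes with strongly negative weights contribute to a component as much as genes with strongly positive weights. The following sketch uses a made-up loading vector (the gene names and weights are hypothetical placeholders) to pull out both ends and write the ranked list to a plain-text file that can be pasted into an enrichment tool:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy loadings for one component; names and weights are placeholders
pc1 = c(GENE_A = 0.42, GENE_B = -0.37, GENE_C = 0.05, GENE_D = -0.11, GENE_E = 0.29)

ranked = names(sort(pc1, decreasing = TRUE))  # same ranking idea as PClist_1
top.positive = head(ranked, 2)                # most positively weighted genes
top.negative = rev(tail(ranked, 2))           # most negatively weighted genes

# One gene per line, in ranked order
writeLines(ranked, file.path(tempdir(), "PC1.ranked.genes.txt"))
</code></pre></div></div>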

<p>Now, decide how many components to use in downstream analyses. This number usually varies from 5-50, depending on the number of cells and the complexity of the data set. Although there is no “correct” answer, using too few components risks missing meaningful signal, and using too many risks diluting meaningful signal with noise.</p>

<p>There are several ways to make an informed decision. The first is to use the principal component heatmaps generated above. Components that generate noisy heatmaps likely correspond to noise. The second method is to examine a plot of the standard deviations of the principal components, and to choose a cutoff to the left of the bend in this so-called “elbow plot.”</p>

<p>Generate an elbow plot of principal component standard deviations:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">elbow</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ElbowPlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">)</span><span class="w">
</span><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/PCA.elbow.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">);</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">elbow</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span></code></pre></div></div>
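
<p>As a rough illustration of what an elbow plot displays, the sketch below builds a simulated data set in base R in which two hidden factors drive ten genes and the remaining genes are noise. The data, the variable names, and the “largest drop” heuristic are all assumptions made for this example. They are not part of the Seurat workflow, and the heuristic is no substitute for looking at the plot:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(2)
n.cells = 100

# Simulated data: two hidden factors drive 10 of 30 genes; the rest is noise
f1 = rnorm(n.cells)
f2 = rnorm(n.cells)
expr = matrix(rnorm(n.cells * 30), nrow = n.cells)
expr[, 1:5] = expr[, 1:5] + 3 * f1
expr[, 6:10] = expr[, 6:10] + 3 * f2

pca = prcomp(expr, center = TRUE, scale. = TRUE)
pc.sd = pca$sdev                        # the y-axis of an elbow plot
var.explained = pc.sd^2 / sum(pc.sd^2)  # fraction of variance per component

# Crude heuristic: the component after which the standard deviation drops most
drop = -diff(pc.sd)
elbow.after = which.max(drop)

plot(pc.sd, type = "b", xlab = "PC", ylab = "Standard deviation")
</code></pre></div></div>

<p>In real data the bend is usually less sharp than in a simulation like this one, which is why the heatmaps and the resampling test are useful complements.</p>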

<p>Next, use a permutation-based resampling technique called JackStraw analysis to estimate a p-value for each component, print out a plot, and save the p-values to a file:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">JackStraw</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">num.replicate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">dims</span><span class="o">=</span><span class="m">30</span><span class="p">);</span><span class="w"> </span><span class="c1"># takes around 4 minutes</span><span class="w">
</span><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ScoreJackStraw</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">dims</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">30</span><span class="p">)</span><span class="w">
</span><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/PCA.jackstraw.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">);</span><span class="w">
</span><span class="n">js</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">JackStrawPlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">dims</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">30</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">js</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span><span class="n">pc.pval</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">scrna</span><span class="o">@</span><span class="n">reductions</span><span class="o">$</span><span class="n">pca</span><span class="o">@</span><span class="n">jackstraw</span><span class="o">@</span><span class="n">overall.p.values</span><span class="p">;</span><span class="w"> </span><span class="c1"># get p-value for each PC</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">pc.pval</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/PCA.jackstraw.scores.xls"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s1">'\t'</span><span class="p">,</span><span class="w"> </span><span class="n">col.names</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>
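
<p>The intuition behind this kind of test can be sketched in a few lines of base R. The example below is a deliberately simplified illustration on simulated data, not Seurat’s implementation: it shuffles one gene at a time, repeats the PCA, and asks how large that gene’s loading on the first component can become by chance. All names and parameter values here are assumptions chosen for the example:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(3)
n.cells = 200
n.genes = 20

# Simulated data: genes 1 to 5 share a real signal, the others are noise
factor1 = rnorm(n.cells)
expr = matrix(rnorm(n.cells * n.genes), nrow = n.cells,
              dimnames = list(NULL, paste0("gene", 1:n.genes)))
expr[, 1:5] = expr[, 1:5] + 3 * factor1

# Absolute loading of every gene on the first component
pc1.loading = function(m) abs(prcomp(m, scale. = TRUE)$rotation[, 1])
observed = pc1.loading(expr)

# Null distribution: shuffle one randomly chosen gene across cells, redo the
# PCA, and record that gene's loading, which can now only reflect chance
n.perm = 200
null.loadings = replicate(n.perm, {
  g = sample(n.genes, 1)
  shuffled = expr
  shuffled[, g] = sample(shuffled[, g])
  pc1.loading(shuffled)[g]
})

# Empirical p-value per gene: fraction of null loadings larger than the
# observed loading, with a +1 correction so that no p-value is exactly zero
frac.larger = 1 - ecdf(null.loadings)(observed)
p.values = (1 + n.perm * frac.larger) / (1 + n.perm)
names(p.values) = names(observed)
round(p.values, 3)
</code></pre></div></div>

<p>A component whose top genes have much smaller p-values than expected by chance is likely to carry real structure, which is the idea that the JackStraw plot summarizes for each component.</p>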

<h3 id="step-11-generate-2-dimensional-layouts-of-the-data-using-two-related-algorithms-t-sne-and-umap">Step 11. Generate 2-dimensional layouts of the data using two related algorithms, t-SNE and UMAP</h3>

<p>Use the number of principal components (nPC) you selected above.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nPC</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">;</span><span class="w">
</span><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">RunUMAP</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">reduction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pca"</span><span class="p">,</span><span class="w"> </span><span class="n">dims</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nPC</span><span class="p">);</span><span class="w">
</span><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">RunTSNE</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">reduction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pca"</span><span class="p">,</span><span class="w"> </span><span class="n">dims</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nPC</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>

<p>Now, plot the t-SNE and UMAP layouts next to each other in one figure, and color the cells by data set:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/UMAP.%d.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">,</span><span class="w"> </span><span class="n">nPC</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">);</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DimPlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">reduction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tsne"</span><span class="p">,</span><span class="w"> </span><span class="n">group.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"DataSet"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">p2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DimPlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">reduction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"umap"</span><span class="p">,</span><span class="w"> </span><span class="n">group.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"DataSet"</span><span class="p">,</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">plot_grid</span><span class="p">(</span><span class="n">p1</span><span class="p">,</span><span class="w"> </span><span class="n">p2</span><span class="p">));</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span></code></pre></div></div>

<p><strong>QUESTIONS:</strong></p>

<ol>
  <li>How do the data sets compare to each other? (We will further investigate these differences in subsequent steps.)</li>
  <li>How does the number of principal components used affect the layout?</li>
  <li>What are the chief sources of variation in this data, as suggested by the t-SNE and UMAP layouts? Are there confounding technical variables that may be driving the layouts? What are some likely technical variables?</li>
</ol>

<p>Color the t-SNE and UMAP plots by some potential confounding variables. Here’s an example in which we color each cell according to the number of UMIs it contains:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">feature.pal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rev</span><span class="p">(</span><span class="n">colorRampPalette</span><span class="p">(</span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">11</span><span class="p">,</span><span class="s2">"Spectral"</span><span class="p">))(</span><span class="m">50</span><span class="p">));</span><span class="w"> </span><span class="c1"># a useful color palette</span><span class="w">
</span><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/umap.%d.colorby.UMI.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">,</span><span class="w"> </span><span class="n">nPC</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">);</span><span class="w">
</span><span class="n">fp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FeaturePlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">features</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"nCount_RNA"</span><span class="p">),</span><span class="w"> </span><span class="n">cols</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">feature.pal</span><span class="p">,</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="n">reduction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"umap"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.title.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.text.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.text.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.ticks.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.ticks.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">());</span><span class="w"> </span><span class="c1"># the text after the ‘+’ simply removes the axes using ggplot syntax</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">fp</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span></code></pre></div></div>
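
<p>For reference, the per-cell UMI total plotted above is just a column sum of the count matrix. The following base R sketch uses a tiny made-up matrix (genes in rows, cells in columns) to show how the two most common per-cell technical covariates are derived. The variable names are chosen for this example only:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy count matrix, genes x cells; the counts are made up
counts = matrix(c(5, 0, 2,
                  0, 0, 1,
                  3, 1, 0,
                  0, 4, 7),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

umi.per.cell = colSums(counts)         # analogous to nCount_RNA
genes.per.cell = colSums(counts != 0)  # analogous to nFeature_RNA

umi.per.cell
genes.per.cell
</code></pre></div></div>

<p>If covariates like these vary smoothly across a layout, or correlate strongly with one of the leading principal components, sequencing depth may be driving part of the structure.</p>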

<p><strong>QUESTION:</strong> What is the relationship between the principal components and the t-SNE/UMAP layout?</p>

<p>To investigate this, plot several principal components on the t-SNE/UMAP layout. For example, the following code colors the UMAP by the first principal component and prints the plot to a file:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/UMAP.%d.colorby.PCs.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">,</span><span class="w"> </span><span class="n">nPC</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">);</span><span class="w">
</span><span class="n">redblue</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"blue"</span><span class="p">,</span><span class="s2">"gray"</span><span class="p">,</span><span class="s2">"red"</span><span class="p">);</span><span class="w"> </span><span class="c1"># another useful color scheme</span><span class="w">
</span><span class="n">fp1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FeaturePlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">features</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'PC_1'</span><span class="p">,</span><span class="w"> </span><span class="n">cols</span><span class="o">=</span><span class="n">redblue</span><span class="p">,</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="n">reduction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"umap"</span><span class="p">)</span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.title.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.text.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.text.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.ticks.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.ticks.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">());</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">fp1</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span></code></pre></div></div>

<h3 id="step-12-infer-cell-types">Step 12: Infer cell types</h3>

<p>There are many sophisticated methods for doing this (e.g. SingleR). But the simplest and most common approach is to plot the expression levels of marker genes for known cell types. Markers for bone-marrow-relevant cell types are provided in the file ~/workspace/scRNA_data/gene_lists_human_180502.csv. To plot three genes of your choice, GENE1, GENE2, and GENE3:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/geneplot.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">height</span><span class="o">=</span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">6</span><span class="p">);</span><span class="w">
</span><span class="n">fp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FeaturePlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">features</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">GENE1</span><span class="p">,</span><span class="w"> </span><span class="n">GENE2</span><span class="p">,</span><span class="w"> </span><span class="n">GENE3</span><span class="p">),</span><span class="w"> </span><span class="n">cols</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"gray"</span><span class="p">,</span><span class="s2">"red"</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">reduction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"umap"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.title.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.text.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.text.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.ticks.x</span><span class="o">=</span><span class="n">element_blank</span><span 
class="p">(),</span><span class="n">axis.ticks.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">());</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">fp</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span></code></pre></div></div>
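
<p>When a cell type is defined by several markers, it can help to summarize them as a single score per cell instead of inspecting one gene at a time. The sketch below does this in base R on a made-up expression matrix. The marker sets shown are common T cell and B cell markers chosen for illustration and may not match the contents of the workshop file. Seurat’s AddModuleScore function serves a similar purpose on a real object:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy normalized expression, genes x cells; the values are made up
expr = matrix(c(2.1, 0.0, 1.8, 0.1,
                1.9, 0.2, 2.2, 0.0,
                0.0, 2.5, 0.1, 2.0,
                0.1, 1.7, 0.0, 2.4),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("CD3E", "CD3D", "CD79A", "MS4A1"),
                              paste0("cell", 1:4)))

# Illustrative marker sets; the workshop file may list different genes
markers = list(T.cell = c("CD3E", "CD3D"), B.cell = c("CD79A", "MS4A1"))

# Score = mean expression of each set's markers in each cell (cells x sets)
scores = sapply(markers, function(genes) colMeans(expr[genes, , drop = FALSE]))

# Tentative label = the marker set with the highest score in each cell
labels = colnames(scores)[max.col(scores, ties.method = "first")]
names(labels) = rownames(scores)
labels
</code></pre></div></div>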

<p>Now use the code that we downloaded from <a href="http://genomedata.org/rnaseq-tutorial/scrna/PlotMarkers.r">here</a> to color the UMAP according to the expression of the markers in gene_lists_human_180502.csv:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source</span><span class="p">(</span><span class="s2">"~/workspace/scrna/PlotMarkers.r"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>During the differential expression analysis in Step 14, which will take about 10 minutes to run, use these plots to make inferences about cell type.</p>

<h3 id="step-13-cluster-the-cells-using-a-graph-based-clustering-algorithm">Step 13: Cluster the cells using a graph-based clustering algorithm</h3>

<p>The first step is to generate the k-Nearest Neighbor (KNN) graph using the number of principal components chosen above (nPC). The second step is to partition the graph into “cliques” or clusters using the Louvain modularity optimization algorithm. At this step, the cluster resolution (cluster.res) may be specified. (Larger numbers generate more clusters.) While there is no “correct” number of clusters, it can be preferable to err on the side of too many clusters. For this exercise, please use the following:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nPC</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">;</span><span class="w">
</span><span class="n">cluster.res</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">;</span><span class="w">
</span><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FindNeighbors</span><span class="p">(</span><span class="n">object</span><span class="o">=</span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">dims</span><span class="o">=</span><span class="m">1</span><span class="o">:</span><span class="n">nPC</span><span class="p">);</span><span class="w">
</span><span class="n">scrna</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FindClusters</span><span class="p">(</span><span class="n">object</span><span class="o">=</span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">resolution</span><span class="o">=</span><span class="n">cluster.res</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>
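
<p>To see what the graph-building step does conceptually, here is a small base R sketch on simulated principal component scores. It finds each cell’s k nearest neighbors and then measures how much two cells’ neighbor sets overlap (a Jaccard index), which is the kind of edge weight a shared nearest neighbor graph uses. This is an illustration under simplified assumptions, not the FindNeighbors implementation, and it stops short of the Louvain partitioning step:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(4)

# Simulated PC scores: two well-separated groups of 10 cells, 2 components
pcs = rbind(matrix(rnorm(20, mean = 0), ncol = 2),
            matrix(rnorm(20, mean = 20), ncol = 2))
rownames(pcs) = paste0("cell", 1:20)

k = 3
d = as.matrix(dist(pcs))  # Euclidean distance between all pairs of cells
diag(d) = Inf             # a cell is not counted as its own neighbor

# For each cell (row), the indices of its k closest cells
knn = t(apply(d, 1, function(dists) order(dists)[1:k]))

# Overlap of two cells' neighbor sets (Jaccard index)
jaccard = function(i, j) {
  length(intersect(knn[i, ], knn[j, ])) / length(union(knn[i, ], knn[j, ]))
}

jaccard(1, 2)   # two cells from the same group
jaccard(1, 11)  # cells from different groups share no neighbors here
</code></pre></div></div>

<p>Clusters then correspond to sets of cells that are densely connected by high-overlap edges, and the resolution parameter controls how finely that graph is partitioned.</p>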

<p>The output of FindClusters is saved in scrna@meta.data$seurat_clusters. Note that this column is overwritten each time clustering is performed. To ensure that each clustering result is kept, save the result as a new metadata column, and give it a custom name that reflects the clustering resolution and number of principal components:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scrna</span><span class="p">[[</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"ClusterNames_%.1f_%dPC"</span><span class="p">,</span><span class="w"> </span><span class="n">cluster.res</span><span class="p">,</span><span class="w"> </span><span class="n">nPC</span><span class="p">)]]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">Idents</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>
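
<p>The naming pattern is easiest to see in isolation. In the sketch below, a plain data frame stands in for the Seurat meta data and the cluster labels are placeholders. Only the sprintf pattern is taken from the step above:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A plain data frame standing in for the meta data of a Seurat object
meta = data.frame(row.names = paste0("cell", 1:4))
nPC = 10

for (cluster.res in c(0.2, 0.8)) {
  col.name = sprintf("ClusterNames_%.1f_%dPC", cluster.res, nPC)
  # Placeholder labels; in the workflow these come from Idents() after FindClusters
  meta[[col.name]] = factor(c(0, 0, 1, 1))
}

colnames(meta)
</code></pre></div></div>

<p>Because the resolution and the number of components are encoded in the column name, results from several clustering runs can sit side by side and be compared later.</p>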

<p>Inspect the structure of the meta data, then save the Seurat object, which now contains t-SNE and UMAP coordinates, and clustering results:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">str</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="p">);</span><span class="w">
</span><span class="n">saveRDS</span><span class="p">(</span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/VST.PCA.UMAP.TSNE.CLUST.rds"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">));</span><span class="w">
</span></code></pre></div></div>

<p><strong>QUESTION:</strong> How many graph-based clusters are there? (This number is ‘n.graph’ and is used below.) How do they relate to the 2-D layouts? How does this depend on the number of components and the clustering resolution?</p>

<p>First, plot the graph-based clusters on the 2-D layouts. Begin by counting the clusters from a specific clustering run:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n.graph</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">scrna</span><span class="p">[[</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"ClusterNames_%.1f_%dPC"</span><span class="p">,</span><span class="n">cluster.res</span><span class="p">,</span><span class="w"> </span><span class="n">nPC</span><span class="p">)]][,</span><span class="m">1</span><span class="p">]));</span><span class="w"> </span><span class="c1"># automatically get the number of clusters from a specific clustering run</span><span class="w">
</span></code></pre></div></div>

<p>Or more simply, use the most recent default clustering result:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n.graph</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">seurat_clusters</span><span class="p">));</span><span class="w">

</span><span class="n">rainbow.colors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rainbow</span><span class="p">(</span><span class="n">n.graph</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="o">=</span><span class="m">0.6</span><span class="p">,</span><span class="w"> </span><span class="n">v</span><span class="o">=</span><span class="m">0.9</span><span class="p">);</span><span class="w"> </span><span class="c1"># color palette</span><span class="w">
</span><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/UMAP.clusters.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">);</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DimPlot</span><span class="p">(</span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">reduction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"umap"</span><span class="p">,</span><span class="w"> </span><span class="n">group.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"seurat_clusters"</span><span class="p">,</span><span class="w"> </span><span class="n">cols</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rainbow.colors</span><span class="p">,</span><span class="w"> </span><span class="n">pt.size</span><span class="o">=</span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.title.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.text.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.text.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.ticks.x</span><span class="o">=</span><span class="n">element_blank</span><span class="p">(),</span><span class="n">axis.ticks.y</span><span class="o">=</span><span class="n">element_blank</span><span class="p">());</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">p</span><span class="p">);</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span></code></pre></div></div>

<p><strong>QUESTION:</strong> Are there sample-specific clusters? How many cells are in each cluster and each sample?</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cluster.breakdown</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">table</span><span class="p">(</span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">DataSet</span><span class="p">,</span><span class="w"> </span><span class="n">scrna</span><span class="o">@</span><span class="n">meta.data</span><span class="o">$</span><span class="n">seurat_clusters</span><span class="p">);</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">cluster.breakdown</span><span class="p">);</span><span class="w"> </span><span class="c1"># rows are samples, columns are clusters</span><span class="w">
</span></code></pre></div></div>
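
<p>A cluster dominated by a single sample is often easier to spot as a proportion than as a raw count. The following is a minimal sketch on a small made-up data frame called meta (a hypothetical stand-in for scrna@meta.data, with invented values), showing how the counts table can be converted to per-cluster fractions with prop.table:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy metadata standing in for scrna@meta.data (values are made up for illustration)
meta &lt;- data.frame(
  DataSet = c("A","A","A","B","B","C","C","C"),
  seurat_clusters = factor(c(0,0,1,0,1,1,1,2))
);
# Cell counts: rows are samples, columns are clusters
cluster.breakdown &lt;- table(meta$DataSet, meta$seurat_clusters);
print(cluster.breakdown);
# Fraction of each cluster contributed by each sample (each column sums to 1)
cluster.fractions &lt;- prop.table(cluster.breakdown, margin = 2);
print(round(cluster.fractions, 2));
# Total cells per sample and per cluster
print(rowSums(cluster.breakdown));
print(colSums(cluster.breakdown));
</code></pre></div></div>

<p>With the real object, the same prop.table call can be applied directly to the table built from scrna@meta.data. Keep in mind that samples with more cells contribute more to every cluster, so compare the fractions against the overall sample sizes.</p>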

<p><strong>QUESTION:</strong> What do the clusters represent? How do they differ from each other? A differential expression analysis is a good place to start.</p>

<h3 id="step-14-interpret-the-clustering-using-a-differential-gene-expression-deg-analysis">Step 14: Interpret the clustering using a differential gene expression (DEG) analysis</h3>

<p>Perform DEG analysis on all clusters simultaneously using the default differential expression test (Wilcoxon rank-sum), then save the results to a file. With the parameters used here, this takes 10-15 minutes. The logfc.threshold and min.diff.pct arguments restrict testing to genes with a large average fold-change and a large difference in the fraction of expressing cells between groups; they were chosen to make this step run faster and are not necessarily ideal for all situations. (While this is running, use the plots generated in Step 12 to figure out what types of cells are present in this data set.) Compute the DEGs and save them to a file:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DEGs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FindAllMarkers</span><span class="p">(</span><span class="n">object</span><span class="o">=</span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">logfc.threshold</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">min.diff.pct</span><span class="o">=</span><span class="m">.2</span><span class="p">);</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">DEGs</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/DEGs.Wilcox.xls"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>

<p>Examine the contents and structure of DEGs. How many DEGs are there? What values are contained in this output? Now choose the top 10 DEGs in each cluster, ranked by average log2 fold-change, and plot them as a heatmap using DoHeatmap with a blue/white/red color scheme (blue for low scaled expression, red for high):</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top10</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DEGs</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">cluster</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">top_n</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">wt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">avg_log2FC</span><span class="p">);</span><span class="w">
</span><span class="n">pdf</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%s/heatmap.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="p">),</span><span class="w"> </span><span class="n">height</span><span class="o">=</span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">15</span><span class="p">);</span><span class="w">
</span><span class="n">DoHeatmap</span><span class="p">(</span><span class="n">scrna</span><span class="p">,</span><span class="w"> </span><span class="n">features</span><span class="o">=</span><span class="n">top10</span><span class="o">$</span><span class="n">gene</span><span class="p">,</span><span class="w"> </span><span class="n">slot</span><span class="o">=</span><span class="s2">"scale.data"</span><span class="p">,</span><span class="w"> </span><span class="n">disp.min</span><span class="o">=</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="n">disp.max</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">group.by</span><span class="o">=</span><span class="s2">"ident"</span><span class="p">,</span><span class="w"> </span><span class="n">group.bar</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">scale_fill_gradientn</span><span class="p">(</span><span class="n">colors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"white"</span><span class="p">,</span><span class="w"> </span><span class="s2">"red"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">));</span><span class="w">
</span><span class="n">dev.off</span><span class="p">();</span><span class="w">
</span></code></pre></div></div>
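
<p>To make the questions about the DEG table more concrete, here is a minimal sketch on a made-up table called toy.DEGs (hypothetical; it borrows a few of the column names that FindAllMarkers returns, with invented values). It shows one way to inspect the structure, count DEGs per cluster, and select the top genes per cluster. It uses slice_max, which is dplyr's newer replacement for top_n; either should work here.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library("dplyr");
# Toy table with a subset of the FindAllMarkers column names (values are made up)
toy.DEGs &lt;- data.frame(
  gene = c("G1","G2","G3","G4","G5","G6"),
  cluster = factor(c(0,0,0,1,1,1)),
  avg_log2FC = c(2.5, 1.2, 3.1, 1.8, 4.0, 1.1),
  p_val_adj = c(1e-10, 0.03, 1e-50, 0.2, 1e-8, 1e-3),
  stringsAsFactors = FALSE
);
str(toy.DEGs);                   # column names and types
print(nrow(toy.DEGs));           # total number of rows (one per gene per cluster)
print(table(toy.DEGs$cluster));  # number of DEGs in each cluster
# Keep the 2 genes with the largest average log2 fold-change in each cluster
top2 &lt;- toy.DEGs %&gt;% group_by(cluster) %&gt;% slice_max(order_by = avg_log2FC, n = 2);
print(top2);
</code></pre></div></div>

<p>Note that a gene can rank highly by fold-change while having a weak adjusted p-value (G4 in this toy table), which is worth remembering when weighing significance against fold-change in the questions below.</p>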

<p><strong>Questions pertaining to cell type inference and DEG analysis:</strong></p>

<ol>
  <li>What cell types are present?</li>
  <li>Plot some DEGs using FeaturePlot. What is more reliable or informative: the statistical significance or the log fold-change of the DEG?</li>
  <li>Do different clusters correspond to different cell types? (Should they?)</li>
  <li>Are the DEGs helpful for identifying cell types?</li>
  <li>Does cell type correlate with other parameters (e.g. number of UMIs, number of genes, cell cycle phase)?</li>
</ol>

<h2 id="independent-exercises-if-time-permits">Independent exercises, if time permits</h2>

<ul>
  <li>
    <p>Perform a sample-wise differential expression analysis. Then make a heatmap and perform functional enrichment analysis of the differentially expressed genes. Overall, how do the samples differ from each other?</p>
  </li>
  <li>
    <p>Experiment with the number of components, the clustering resolution, and the DEG filtering thresholds to understand how these parameters affect the results. What set of parameters provides the closest correspondence between cell type and cluster?</p>
  </li>
  <li>
    <p>Perform batch correction using the sample code provided during the lecture. Assign each sample to its own batch, and repeat the analysis. How does batch correction affect the result?</p>
  </li>
  <li>
    <p>Subset the T-cells, assign them to a new Seurat object, and re-analyze them in isolation. Does this improve your ability to resolve T-cell subsets?</p>
  </li>
</ul>]]></content><author><name>Zachary Skidmore</name></author><category term="Module-10-Archive" /><summary type="html"><![CDATA[Exercise: A Complete Seurat Workflow In this exercise, we will analyze and interpret a small scRNA-seq data set consisting of three bone marrow samples. Two of the samples are from the same patient, but differ in that one sample was enriched for a particular cell type. The goal of this analysis is to determine what cell types are present in the three samples, and how the samples and patients differ. This was drawn in part from the Seurat vignettes at https://satijalab.org/seurat/vignettes.html. Step 1: Preparation Working at the linux command line in your home directory (/home/ubuntu/workspace), create a new directory for your output files called “scrna”. The full path to this directory will be /home/ubuntu/workspace/scrna. The command is: mkdir ~/workspace/scRNA_data cd ~/workspace/scRNA_data wget -r -N --no-parent -nH --reject zip -R "index.html*" --cut-dirs=2 http://genomedata.org/rnaseq-tutorial/scrna/ cd ~/workspace mkdir scrna cd scrna wget http://genomedata.org/rnaseq-tutorial/scrna/PlotMarkers.r Start R, then load some R libraries as follows library("Seurat"); library("sctransform"); library("dplyr"); library("RColorBrewer"); library("ggthemes"); library("ggplot2"); library("cowplot"); library("data.table"); Create a vector of convenient sample names, such as “A”, “B”, and “C”: samples = c("A","B","C"); Create a variable called outdir to specify your output directory: outdir = "/home/ubuntu/workspace/scrna"; Step 2: Read in the feature-barcode matrices generated by the cellranger pipeline data.10x = list(); # first declare an empty list in which to hold the feature-barcode matrices data.10x[[1]] &lt;- Read10X(data.dir = "~/workspace/scRNA_data/ND050119_CD34_3pV3/filtered_feature_bc_matrix"); data.10x[[2]] &lt;- Read10X(data.dir = "~/workspace/scRNA_data/ND050119_WBM_3pV3/filtered_feature_bc_matrix"); data.10x[[3]] &lt;- 
Read10X(data.dir = "~/workspace/scRNA_data/ND050819_WBM_3pV3/filtered_feature_bc_matrix"); Step 3: Convert each feature-barcode matrix to a Seurat object This simultaneously performs some initial filtering in order to exclude genes that are expressed in fewer than 100 cells, and to exclude cells that contain fewer than 700 expressed genes. Note that min.cells=10 and min.features=100 are more common parameters at this stage, but we are filtering more aggressively in order to make the data set smaller. At this step, we also create a “DataSet” identity for each cell. scrna.list = list(); # First create an empty list to hold the Seurat objects scrna.list[[1]] = CreateSeuratObject(counts = data.10x[[1]], min.cells=100, min.features=700, project=samples[1]); scrna.list[[1]][["DataSet"]] = samples[1]; scrna.list[[2]] = CreateSeuratObject(counts = data.10x[[2]], min.cells=100, min.features=700, project=samples[2]); scrna.list[[2]][["DataSet"]] = samples[2]; scrna.list[[3]] = CreateSeuratObject(counts = data.10x[[3]], min.cells=100, min.features=700, project=samples[3]); scrna.list[[3]][["DataSet"]] = samples[3]; Aside: Note that you can do this more efficiently, especially if you have many samples, using a ‘for’ loop: for (i in 1:length(data.10x)) { scrna.list[[i]] = CreateSeuratObject(counts = data.10x[[i]], min.cells=100, min.features=700, project=samples[i]); scrna.list[[i]][["DataSet"]] = samples[i]; } Finally, remove the raw data to save memory (these objects get large!): rm(data.10x); Step 4. Merge the Seurat objects into a single object We will call this object scrna. We also give it a project name (here, “CSHL”), and prepend the appropriate data set name to each cell barcode. For example, if a barcode from data set “B” is originally AATCTATCTCTC, it will now be B_AATCTATCTCTC. Then clean up some space by removing scrna.list. Finally, save the merged object as an RDS file. 
Should you need to load this file into R at any time, it can be done using the readRDS command. scrna &lt;- merge(x=scrna.list[[1]], y=c(scrna.list[[2]],scrna.list[[3]]), add.cell.ids = c("A","B","C"), project="CSHL"); rm(scrna.list); # save some memory str(scrna@meta.data) # examine the structure of the Seurat object meta data saveRDS(scrna, file = sprintf("%s/MergedSeuratObject.rds", outdir)); Aside on accessing the Seurat object meta data, which is stored in scrna@meta.data Meta data can be used to hold the following information (and more) for your data set: Summary statistics Sample name Cluster membership for each cell Cell cycle phase for each cell Batch or sample for each cell Other custom annotations for each cell You can access and query the meta data using commands such as: scrna[[]]; scrna@meta.data; str(scrna@meta.data); # Examine structure and contents of meta data head(scrna@meta.data$nFeature_RNA); # Access genes (“Features”) for each cell head(scrna@meta.data$nCount_RNA); # Access number of UMIs for each cell: levels(x=scrna); # List the items in the current default cell identity class length(unique(scrna@meta.data$seurat_clusters)); # How many clusters are there? Note that there will not be any clusters in the meta data until you perform clustering. unique(scrna@meta.data$Batch); # What batches are included in this data set? scrna$NewIdentity &lt;- vector_of_annotations; # Assign new cell annotations to a new "identity class" in the meta data Step 5. Quality control plots Plot the distributions of several quality-control variables in order to choose appropriate filtering thresholds. The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat. However, you will need to manually calculate the mitochondrial transcript percentage and ribosomal transcript percentage for each cell, and add them to the Seurat object meta data, as shown below. 
Calculate the mitochondrial transcript percentage for each cell: mito.genes &lt;- grep(pattern = "^MT-", x = rownames(x = scrna), value = TRUE); percent.mito &lt;- Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts')[mito.genes, ]) / Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts')); scrna[['percent.mito']] &lt;- percent.mito; Calculate the ribosomal transcript percentage for each cell: ribo.genes &lt;- grep(pattern = "^RP[SL][[:digit:]]", x = rownames(x = scrna), value = TRUE); percent.ribo &lt;- Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts')[ribo.genes, ]) / Matrix::colSums(x = GetAssayData(object = scrna, slot = 'counts')); scrna[['percent.ribo']] &lt;- percent.ribo; Plot as violin plots, which will be located in, for example, ~/workspace/scrna/VlnPlot.pdf All figures can be downloaded using the scp command, or viewed on the AWS server. pdf(sprintf("%s/VlnPlot.pdf", outdir), width = 13, height = 6); vln &lt;- VlnPlot(object = scrna, features = c("percent.mito", "percent.ribo"), pt.size=0, ncol = 2, group.by="DataSet"); print(vln); dev.off(); pdf(sprintf("%s/VlnPlot.nCount.25Kmax.pdf", outdir), width = 10, height = 10) vln &lt;- VlnPlot(object = scrna, features = "nCount_RNA", pt.size=0, group.by="DataSet", y.max=25000) print(vln) dev.off(); pdf(sprintf("%s/VlnPlot.nFeature.pdf", outdir), width = 10, height = 10) vln &lt;- VlnPlot(object = scrna, features = "nFeature_RNA", pt.size=0, group.by="DataSet") print(vln) dev.off() QUESTIONS: Excessive mitochondrial transcripts can indicate the presence of dead cells, which tend to cluster together. Based on the distribution of mitochondrial transcripts, what filter threshold would you set for mitochondrial transcripts? One approach is to start with a lenient threshold, work through the analysis, and determine later whether your data still contains clusters of dead cells. Compare the distribution of ribosomal transcripts, total transcripts, and genes in each sample. 
Are differences in these parameters necessarily a technical artifact, or might they contain information about the biology of the samples? Next, we will use Seurat’s FeatureScatter function to create scatterplots of the relationships among QC variables. This can be helpful in selecting filtering thresholds. More generally, this is a very useful wrapper function that can be used to visualize relationships between any pair of quantitative variables in the Seurat object (including expression levels, etc). pdf(sprintf("%s/Scatter1.pdf", outdir), width = 8, height = 6); scatter &lt;- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "percent.mito", pt.size=0.1) print(scatter); dev.off(); pdf(sprintf("%s/Scatter2.pdf", outdir), width = 8, height = 6); scatter &lt;- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "percent.ribo", pt.size=0.1) print(scatter); dev.off(); pdf(sprintf("%s/Scatter3.pdf", outdir), width = 8, height = 6); scatter &lt;- FeatureScatter(object = scrna, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", pt.size=0.1) print(scatter); dev.off(); Step 6. Calculate a cell cycle score for each cell This can be used to determine whether heterogeneity in cell cycle phase is driving the tSNE/UMAP layout and/or clustering. This may or may not be obscuring the signal you care about, depending on your analysis goals and the nature of the data. (If necessary, it can be removed in a later step.) It is also useful for determining whether certain populations of cells are more proliferative than others. The list of cell cycle genes, and the scoring method, was taken from Tirosh I, et al. (2016). 
cell.cycle.tirosh &lt;- read.csv("http://genomedata.org/rnaseq-tutorial/scrna/CellCycleTiroshSymbol2ID.csv", header=TRUE); # read in the list of genes s.genes = cell.cycle.tirosh$Gene.Symbol[which(cell.cycle.tirosh$List == "G1/S")]; # create a vector of S-phase genes g2m.genes = cell.cycle.tirosh$Gene.Symbol[which(cell.cycle.tirosh$List == "G2/M")]; # create a vector of G2/M-phase genes scrna &lt;- CellCycleScoring(object=scrna, s.features=s.genes, g2m.features=g2m.genes, set.ident=FALSE) Step 7. Filter the cells to remove debris, dead cells, and probable doublets QUESTION: How many cells are there in each sample before filtering? The ‘table’ function may come in handy. First calculate some basic statistics on the various QC parameters, which can be helpful for choosing cutoffs. For example: min &lt;- min(scrna@meta.data$nFeature_RNA); m &lt;- median(scrna@meta.data$nFeature_RNA) max &lt;- max(scrna@meta.data$nFeature_RNA) s &lt;- sd(scrna@meta.data$nFeature_RNA) min1 &lt;- min(scrna@meta.data$nCount_RNA) max1 &lt;- max(scrna@meta.data$nCount_RNA) m1 &lt;- mean(scrna@meta.data$nCount_RNA) s1 &lt;- sd(scrna@meta.data$nCount_RNA) Count93 &lt;- quantile(scrna@meta.data$nCount_RNA, 0.93) # calculate value in the 93rd percentile print(paste("Feature stats:",min,m,max,s)); print(paste("UMI stats:",min1,m1,max1,s1,Count93)); Now, filter the data using the subset function and your chosen thresholds. Note that for large data sets with diverse samples, it may be beneficial to use sample-specific thresholds for some parameters. If you are not sure what thresholds to use, the following will work well for the purposes of this course: scrna &lt;- subset(x = scrna, subset = nFeature_RNA &gt; 700 &amp; nCount_RNA &lt; Count93 &amp; percent.mito &lt; 0.1) QUESTION: How many cells are there in each sample after filtering? Step 8. 
[Optional] Subset the data If necessary, you can subset the data set to N cells (2000, 5000, etc) to make it more manageable: subcells &lt;- sample(Cells(scrna), size=N, replace=F) scrna &lt;- subset(scrna, cells=subcells) Step 9. Normalize the data, detect variable genes, and scale the data Normalize the data: scrna &lt;- NormalizeData(object = scrna, normalization.method = "LogNormalize", scale.factor = 1e6); QUESTION: What does LogNormalize do mathematically? Are there other normalization options available? Now identify and plot the most variable genes, which will be used for downstream analyses. This is a critical step that reduces the contribution of noise. Consider adjusting the cutoffs if you think (often based on prior knowledge of your experimental system) that important genes are being excluded. scrna &lt;- FindVariableFeatures(object = scrna, selection.method = 'vst', mean.cutoff = c(0.1,8), dispersion.cutoff = c(1, Inf)) print(paste("Number of Variable Features: ",length(x = VariableFeatures(object = scrna)))); pdf(sprintf("%s/VG.pdf", outdir), useDingbats=FALSE) vg &lt;- VariableFeaturePlot(scrna) print(vg); dev.off() Scale and center the data: scrna &lt;- ScaleData(object = scrna, features = rownames(x = scrna), verbose=FALSE); Alternatively, you can scale the data and simultaneously remove unwanted signal associated with variables such as cell cycle phase, ribosomal transcript content, etc. (This is slow, and cannot be done in the time allotted for this course.) To remove cell cycle signal, for instance: # scrna &lt;- ScaleData(object = scrna, features = rownames(x = scrna), vars.to.regress = c("S.Score","G2M.Score"), display.progress=FALSE); Save the normalized, scaled Seurat object: saveRDS(scrna, file = sprintf("%s/VST.rds", outdir)); DIGRESSION: How can you use Seurat-processed data with packages that are not compatible with Seurat? 
Other packages may require the data to be normalized in a specific way, and often require an expression matrix (not a Seurat object) as input. As an example, here we prepare an expression data matrix for use with the popular CNV-detection package CONICSmat: scrna.cnv &lt;- NormalizeData(object = scrna, normalization.method = "RC", scale.factor = 1e5); data.cnv &lt;- GetAssayData(object=scrna.cnv, slot="data"); # get the normalized data log2data = log2(data.cnv+1); # add 1 then take log2 df &lt;- as.data.frame(as.matrix(log2data)); # convert it to a data frame cells &lt;- as.data.frame(colnames(df)); genes &lt;- as.data.frame(rownames(df)); # save as text files: fwrite(x = genes, file = "genes.csv", col.names=FALSE); fwrite(x = cells, file = "cells.csv", col.names=FALSE); fwrite(x = df, file = "exp.csv", col.names=FALSE); Step 10. Reduce the dimensionality of the data using Principal Component Analysis Subsequent calculations, such as those used to derive the tSNE and UMAP projections, and the k-Nearest Neighbor graph used for clustering, are performed in a new space with fewer dimensions, namely, the principal components. Here, specify a relatively large number of principal components – more than you anticipate using for downstream analyses. Then use several techniques to characterize the components and estimate the number of principal components that captures the signal of interest while minimizing noise. Perform Principal Component Analysis (PCA), and save the first 100 components: scrna &lt;- RunPCA(object = scrna, npcs = 100, verbose = FALSE); OPTIONAL: Then run ProjectDim, which scores each gene in the dataset (including genes not included in the PCA) based on their correlation with the calculated components. This is not used elsewhere in this pipeline, but it can be useful for exploring genes that are not among the 2000 most highly variable genes selected above. 
scrna &lt;- ProjectDim(object = scrna) QUESTION: What do the principal components “mean” from a biological standpoint? What genes contribute to the principal components? Do they represent biological processes of interest, or technical variables (such as mitochondrial transcripts) that suggest the data may need to be filtered differently? There are several easy ways to investigate these questions. First, visualize the PCA “loadings.” Each “component” identified by PCA is a linear combination, or weighted sum, of the genes in the data set. Here, the “loadings” represent the weights of the genes in any given component. These plots tell you which genes contribute most to each component: pdf(sprintf("%s/VizDimLoadings.pdf", outdir), width = 8, height = 30); vdl &lt;- VizDimLoadings(object = scrna, dims = 1:3) print(vdl); dev.off(); Second, use the DimHeatmap function to generate heatmaps that summarize the expression of the most highly weighted genes in each principal component. As noted in the Seurat documentation, “both cells and genes are ordered according to their PCA scores. Setting cells.use to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated gene sets. pdf(sprintf("%s/PCA.heatmap.multi.pdf", outdir), width = 8.5, height = 24); hm.multi &lt;- DimHeatmap(object = scrna, dims = 1:12, cells = 500, balanced = TRUE); print(hm.multi); dev.off(); Finally, you can generate ranked lists of the genes in each principal component and perform functional enrichment or Gene Set Enrichment Analysis. (This tool offers a quick and easy way to determine functional enrichment from a list of genes.) For example, for the first principal component: PClist_1 &lt;- names(sort(Loadings(object=scrna, reduction="pca")[,1], decreasing=TRUE)); Now, decide how many components to use in downstream analyses. 
This number usually varies from 5-50, depending on the number of cells and the complexity of the data set. Although there is no “correct” answer, using too few components risks missing meaningful signal, and using too many risks diluting meaningful signal with noise. There are several ways to make an informed decision. The first is to use the principal component heatmaps generated above. Components that generate noisy heatmaps likely correspond to noise. The second method is to examine a plot of the standard deviations of the principle components, and to choose a cutoff to the left of the bend in this so-called “elbow plot.” Generate an elbow plot of principal component standard deviations: elbow &lt;- ElbowPlot(object = scrna) pdf(sprintf("%s/PCA.elbow.pdf", outdir), width = 6, height = 8); print(elbow); dev.off(); Next, use a bootstrapping technique called Jackstraw analysis to estimate a p-value for each component, print out a plot, and save the p-values to a file: scrna &lt;- JackStraw(object = scrna, num.replicate = 100, dims=30); # takes around 4 minutes scrna &lt;- ScoreJackStraw(object = scrna, dims = 1:30) pdf(sprintf("%s/PCA.jackstraw.pdf", outdir), width = 10, height = 6); js &lt;- JackStrawPlot(object = scrna, dims = 1:30) print(js); dev.off(); pc.pval &lt;- scrna@reductions$pca@jackstraw@overall.p.values; # get p-value for each PC write.table(pc.pval, file=sprintf("%s/PCA.jackstraw.scores.xls", outdir, date), quote=FALSE, sep='\t', col.names=TRUE); Step 11. Generate 2-dimensional layouts of the data using two related algorithms, t-SNE and UMAP Use the number of principal components (nPC) you selected above. 
nPC = 10; scrna &lt;- RunUMAP(object = scrna, reduction = "pca", dims = 1:nPC); scrna &lt;- RunTSNE(object = scrna, reduction = "pca", dims = 1:nPC); Now, plot the tSNE and UMAP plots next to each other in one figure, and color each data set separately: pdf(sprintf("%s/UMAP.%d.pdf", outdir, nPC), width = 10, height = 8); p1 &lt;- DimPlot(object = scrna, reduction = "tsne", group.by = "DataSet", pt.size=0.1) p2 &lt;- DimPlot(object = scrna, reduction = "umap", group.by = "DataSet", pt.size=0.1) print(plot_grid(p1, p2)); dev.off(); QUESTIONS: How do the data sets compare to each other? (We will further investigate these differences in subsequent steps.) How does the number of principal components used affect the layout? What are the chief sources of variation in this data, as suggested by the t-SNE and UMAP layouts? Are there confounding technical variables that may be driving the layouts? What are some likely technical variables? Color the t-SNE and UMAP plots by some potential confounding variables. Here’s an example in which we color each cell according to the number of UMIs it contains: feature.pal = rev(colorRampPalette(brewer.pal(11,"Spectral"))(50)); # a useful color palette pdf(sprintf("%s/umap.%d.colorby.UMI.pdf", outdir, nPC), width = 10, height = 8); fp &lt;- FeaturePlot(object = scrna, features = c("nCount_RNA"), cols = feature.pal, pt.size=0.1, reduction = "umap") + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank()); # the text after the ‘+’ simply removes the axis using ggplot syntax print(fp); dev.off(); QUESTION: What is the relationship between the principal components and the t-SNE/UMAP layout? 
To investigate this, plot several principal components on the t-SNE/UMAP, for example the following code plots the first principal component and prints the plot to a file: pdf(sprintf("%s/UMAP.%d.colorby.PCs.pdf", outdir, nPC), width = 12, height = 6); redblue=c("blue","gray","red"); # another useful color scheme fp1 &lt;- FeaturePlot(object = scrna, features = 'PC_1', cols=redblue, pt.size=0.1, reduction = "umap")+ theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank()); print(fp1); dev.off(); Step 12: Infer cell types There are many sophisticated methods for doing this (e.g. SingleR). But the simplest and most common approach is to plot the expression levels of marker genes for known cell types. Markers for bone-marrow-relevant cell types are provided in the file ~/workspace/scRNA_data/gene_lists_human_180502.csv. To plot three genes of your choice, GENE1, GENE2, and GENE3: pdf(sprintf("%s/geneplot.pdf", outdir), height=6, width=6); fp &lt;- FeaturePlot(object = scrna, features = c(GENE1, GENE2, GENE3), cols = c("gray","red"), ncol=2, reduction = "umap") + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank()); print(fp); dev.off(); Now use the code that we downloaded from here to color the UMAP according to the expression of the markers in gene_lists_human_180502.csv: source("~/workspace/scrna/PlotMarkers.r") During the differential expression analysis in Step 14, which will take about 10 minutes to run, use these plots to make inferences about cell type. Step 13: Cluster the cells using a graph-based clustering algorithm The first step is to generate the k-Nearest Neighbor (KNN) graph using the number of principal components chosen above (nPC). 
The second step is to partition the graph into “cliques” or clusters using the Louvain modularity optimization algorithm. At this step, the cluster resolution (cluster.res) may be specified. (Larger numbers generate more clusters.) While there is no “correct” number of clusters, it can be preferable to err on the side of too many clusters. For this exercise, please use the following: nPC = 10; cluster.res = 0.2; scrna &lt;- FindNeighbors(object=scrna, dims=1:nPC); scrna &lt;- FindClusters(object=scrna, resolution=cluster.res); The output of FindClusters is saved in scrna@meta.data$seurat_clusters. Note that this is reset each time clustering is performed. To ensure that each clustering result is saved, save the result as a new identity class, and give it a custom name that reflects the clustering resolution and number of principal components: scrna[[sprintf("ClusterNames_%.1f_%dPC", cluster.res, nPC)]] &lt;- Idents(object = scrna); Inspect the structure of the meta data, then save the Seurat object, which now contains t-SNE and UMAP coordinates, and clustering results: str(scrna@meta.data); saveRDS(scrna, file = sprintf("%s/VST.PCA.UMAP.TSNE.CLUST.rds", outdir)); QUESTION: How many graph-based clusters are there? (This number is ‘n.graph’ and is used below.) How do they relate to the 2-D layouts? How does this depend on the number of components and the clustering resolution? 
First plot graph-based clusters on 2-D layouts: n.graph = length(unique(scrna[[sprintf("ClusterNames_%.1f_%dPC",cluster.res, nPC)]][,1])); # automatically get the number of clusters from a specific clustering run Or more simply, use the most recent default clustering result: n.graph = length(unique(scrna@meta.data$seurat_clusters)); rainbow.colors = rainbow(n.graph, s=0.6, v=0.9); # color palette pdf(sprintf("%s/UMAP.clusters.pdf", outdir), width = 10, height = 6); p &lt;- DimPlot(object = scrna, reduction = "umap", group.by = "seurat_clusters", cols = rainbow.colors, pt.size=0.1, label=TRUE) + theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank()); print(p); dev.off(); QUESTION: Are there sample-specific clusters? How many cells are in each cluster and each sample? cluster.breakdown &lt;- table(scrna@meta.data$DataSet, scrna@meta.data$seurat_clusters); QUESTION: What do the clusters represent? How do they differ from each other? Start with a differential expression analysis: Step 14: Interpret the clustering using a differential gene expression (DEG) analysis Perform DEG analysis on all clusters simultaneously using the default differential expression test (Wilcoxon), then save the results to a file. This will take 10-15 minutes using the parameters here, which were chosen to make this step run faster, and are not necessarily ideal for all situations. (While this is running, use the plots generated in Step 12 to figure out what types of cells are present in this data set.) Now compute the DEGs and save to a file: DEGs &lt;- FindAllMarkers(object=scrna, logfc.threshold=1, min.diff.pct=.2); write.table(DEGs, file=sprintf("%s/DEGs.Wilcox.xls", outdir), quote=FALSE, sep="\t", row.names=FALSE); Examine the contents and structure of DEGs. How many DEGs are there? What values are contained in this output? 
Now choose the top 10 DEGs in each cluster, and print them to a heatmap using DoHeatmap and a red/white/blue color scheme: top10 &lt;- DEGs %&gt;% group_by(cluster) %&gt;% top_n(n = 10, wt = avg_log2FC); pdf(sprintf("%s/heatmap.pdf", outdir), height=20, width=15); DoHeatmap(scrna, features=top10$gene, slot="scale.data", disp.min=-2, disp.max=2, group.by="ident", group.bar=TRUE) + scale_fill_gradientn(colors = c("blue", "white", "red")) + theme(axis.text.y = element_text(size = 10)); dev.off(); Questions pertaining to cell type inference and DEG analysis: What cell types are present? Plot some DEGs using FeaturePlot. What is more reliable or informative: the statistical significance or the log fold-change of the DEG? Do different clusters correspond to different cell types? (Should they?) Are the DEGs helpful for identifying cell types? Does cell type correlate with other parameters (e.g. UMI, number of genes, cell cycle phase, etc?) Independent exercises, if time permits Perform a sample-wise differential expression analysis. Then make a heatmap and perform functional enrichment analysis of the differentially expressed genes. Overall, how do the samples differ from each other? Experiment with the number of components, the clustering resolution, and the DEG filtering thresholds to understand how these parameters affect the results. What set of parameters provides the closest correspondence between cell type and cluster? Perform batch correction using the sample code provided during the lecture. Assign each sample to its own batch, and repeat the analysis. How does batch correction affect the result? Subset the T-cells, assign them to a new Seurat object, and re-analyze them in isolation. 
Does this improve your ability to resolve T-cell subsets?]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://www.rnabio.org//assets/logos/DNA.jpg" /><media:content medium="image" url="http://www.rnabio.org//assets/logos/DNA.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Log into Compute Canada</title><link href="http://www.rnabio.org//module-09-appendix/0009/12/02/Log_into_ComputeCanada/" rel="alternate" type="text/html" title="Log into Compute Canada" /><published>0009-12-02T00:00:00+00:00</published><updated>0009-12-02T00:00:00+00:00</updated><id>http://www.rnabio.org//module-09-appendix/0009/12/02/Log_into_ComputeCanada</id><content type="html" xml:base="http://www.rnabio.org//module-09-appendix/0009/12/02/Log_into_ComputeCanada/"><![CDATA[<h2 id="signing-into-compute-canada-for-the-course">Signing into Compute Canada for the course</h2>
<p>In order to sign into your Compute Canada instance, you will need a valid user ID and password for Compute Canada. These should have been provided to you by the instructors.</p>

<h2 id="logging-in-with-ssh-maclinux">Logging in with ssh (Mac/Linux)</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh user#@login1.CBW.calculquebec.cloud
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">user#</code> is the name of a user on the system you are logging into. <code class="language-plaintext highlighter-rouge">login1.CBW.calculquebec.cloud</code> is the address of the Linux system on Compute Canada that you are logging into. Instead of using the public DNS name, you could also use the IP address if you know it. When prompted, enter your password.</p>

<h2 id="logging-in-with-putty-windows">Logging in with putty (Windows)</h2>

<p>To log in on Windows, you must first install PuTTY. Once PuTTY is installed, you can log in using the following parameters. For screenshots showing where to enter these parameters, please refer to <a href="https://github.com/bioinformatics-ca/RNAseq_2020/blob/master/CC_cloud.md">this guide</a>.</p>

<p>Session-hostname: <code class="language-plaintext highlighter-rouge">login1.CBW.calculquebec.cloud</code></p>

<p>Connection-Data-Auto-login username: <code class="language-plaintext highlighter-rouge">user#</code></p>

<p><code class="language-plaintext highlighter-rouge">user#</code> is the name of a user on the system you are logging into. <code class="language-plaintext highlighter-rouge">login1.CBW.calculquebec.cloud</code> is the address of the Linux system on Compute Canada that you are logging into. Instead of using the public DNS name, you could also use the IP address if you know it. When prompted, enter your password.</p>

<h2 id="copying-files-to-your-computer">Copying files to your computer</h2>

<ul>
  <li>To copy files from an instance, use scp in a similar fashion (in this case to copy a file called nice_alignments.bam):</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scp user#@login1.CBW.calculquebec.cloud:nice_alignments.bam <span class="nb">.</span>
</code></pre></div></div>

<h2 id="using-jupyter-notebook-or-jupyterlab">Using Jupyter Notebook or JupyterLab</h2>

<p>Everything created in your workspace on the cloud is also available through a web server using Jupyter Notebook or JupyterLab. You can also perform Python/R analysis and access an interactive command-line terminal via JupyterLab. Simply go to the following address in your browser and choose Jupyter Notebook (or JupyterLab) in the User Interface dropdown menu. For simply browsing and downloading files, you can select Number of cores = 1 and Memory (MB) = 3200. For analysis in JupyterLab, select Number of cores = 4 and Memory (MB) = 32000. NOTE: Be aware that resources requested from your terminal/PuTTY (e.g., <code class="language-plaintext highlighter-rouge">salloc</code> requests) and resources requested via Jupyter are additive. Make sure to terminate any terminal or Jupyter session not in use. It is important to log out once you finish your Jupyter session to release the resources. If you only close the browser window, your Jupyter session is still running and using the resources.</p>

<p><a href="https://jupyter.cbw.calculquebec.cloud/">https://jupyter.cbw.calculquebec.cloud/</a></p>

<h2 id="file-system-layout">File system layout</h2>

<p>When you log in, you will be in your home directory (e.g., /home/user##). You will notice that you have three directories: “CourseData”, “projects”, and “scratch”. For the purposes of this course, we will mostly be working in your home directory and making use of some data files in the <code class="language-plaintext highlighter-rouge">CourseData</code> directory.</p>

<h2 id="how-to-request-and-use-a-compute-node">How to request and use a compute node</h2>

<p>After you log into the cluster, you will be on the login node. This has very limited compute and memory resources. Do NOT run anything on the login node. You can access a compute node with an interactive session using the <code class="language-plaintext highlighter-rouge">salloc</code> command. For example: <code class="language-plaintext highlighter-rouge">salloc --mem 24000M -c 4 -t 8:0:0</code></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">--mem</span>: the real memory <span class="o">(</span><span class="k">in </span>megabytes<span class="o">)</span> required per node.
<span class="nt">-c</span> | <span class="nt">--cpus-per-task</span>: number of processors required.
<span class="nt">-t</span> | <span class="nt">--time</span>: limit on the total run <span class="nb">time </span>of the job allocation.
</code></pre></div></div>

<p>The above command requests an interactive session with 4 cores and 24000M of memory for 8 hours. Once the job is allocated, you will be on one of the compute nodes.</p>
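
<p>If you request sessions of different sizes often, it can help to assemble the <code class="language-plaintext highlighter-rouge">salloc</code> command from its three settings before running it. The following is a minimal sketch, not part of the course material: the helper name <code class="language-plaintext highlighter-rouge">build_salloc_cmd</code> is hypothetical, and it only prints the command (using the same <code class="language-plaintext highlighter-rouge">--mem</code>, <code class="language-plaintext highlighter-rouge">-c</code>, and <code class="language-plaintext highlighter-rouge">-t</code> options described above) so that you can check it before submitting.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the course setup): build, but do not run,
# an salloc command line from memory (MB), cores, and wall time (hours).
build_salloc_cmd() {
    local mem_mb="$1"
    local cores="$2"
    local hours="$3"
    local value
    # Reject empty or non-numeric arguments before building the command.
    for value in "$mem_mb" "$cores" "$hours"; do
        case "$value" in
            ''|*[!0-9]*)
                echo "expected a non-negative integer, got '$value'" >&2
                return 1
                ;;
        esac
    done
    printf 'salloc --mem %sM -c %s -t %s:0:0\n' "$mem_mb" "$cores" "$hours"
}

# Reproduce the example from the text: 4 cores, 24000M memory, 8 hours.
build_salloc_cmd 24000 4 8
# prints: salloc --mem 24000M -c 4 -t 8:0:0
```

<p>Printing the command rather than running it keeps the sketch safe to try on the login node; copy the printed line into your terminal once the values look right.</p>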

<p>After you have received your compute node, you will need to load the software that we will be using for this workshop.</p>

<p>This can be done with the following command.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module load samtools/1.10 bam-readcount/0.8.0 hisat2/2.2.0 stringtie/2.1.0 gffcompare/0.11.6 tophat/2.1.1 kallisto/0.46.1 fastqc/0.11.8 multiqc/1.8 picard/2.20.6 flexbar/3.5.0 RSeQC/3.0.1 bedops/2.4.39 ucsctools/399 r/4.0.0 python/3.7.4 HTSeq/1.18.1 regtools/0.5.2

</code></pre></div></div>

<h2 id="getting-information-on-your-compute-jobs">Getting information on your compute jobs</h2>
<p>The following commands allow you to see all current jobs requested by your user and to cancel a job if needed. This can be necessary if you get disconnected from your compute session and wind up with “zombie” jobs that you are no longer connected to. The first command can be used to find the job ID needed for the second command.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>squeue <span class="nt">-u</span> <span class="nv">$user</span>
scancel <span class="nv">$jobid</span>

</code></pre></div></div>
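
<p>If you have several “zombie” jobs, the two commands above can be combined. The following is a rough sketch under stated assumptions: the function names <code class="language-plaintext highlighter-rouge">list_my_job_ids</code> and <code class="language-plaintext highlighter-rouge">cancel_my_jobs</code> are hypothetical, and the sketch relies on the standard <code class="language-plaintext highlighter-rouge">squeue</code> options <code class="language-plaintext highlighter-rouge">-h</code> (no header) and <code class="language-plaintext highlighter-rouge">-o "%i"</code> (print only the job ID). By default it only reports what it would cancel.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the course setup).

# Print the IDs of all jobs owned by the current user, one per line.
list_my_job_ids() {
    if [ -z "$(command -v squeue)" ]; then
        echo "squeue not found: are you logged into the cluster?" >&2
        return 1
    fi
    # -h suppresses the header line; -o "%i" prints only the job ID.
    squeue -h -u "${USER:-$(id -un)}" -o "%i"
}

# Cancel every job owned by the current user.
# Runs as a dry run unless called with --yes.
cancel_my_jobs() {
    local mode="${1:-dry-run}"
    local ids
    local jobid
    ids="$(list_my_job_ids)" || return 1
    for jobid in $ids; do
        if [ "$mode" = "--yes" ]; then
            scancel "$jobid"
        else
            echo "would cancel: $jobid"
        fi
    done
}
```

<p>Run <code class="language-plaintext highlighter-rouge">cancel_my_jobs</code> first to review the list, then <code class="language-plaintext highlighter-rouge">cancel_my_jobs --yes</code> to actually cancel. Note that this also cancels the interactive session you are currently in, if you run it from a compute node.</p>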

<p>When you are done with the compute node, make sure to type <code class="language-plaintext highlighter-rouge">exit</code> to exit the node and free up the resources you allocated for the node.</p>]]></content><author><name>Zachary Skidmore</name></author><category term="Module-09-Appendix" /><summary type="html"><![CDATA[Signing into Compute Canada for the course In order to sign into your Compute Canada instance, you will need a valid user ID and password for Compute Canada. These should have been provided to you by the instructors. Logging in with ssh (Mac/Linux) ssh user#@login1.CBW.calculquebec.cloud user# is the name of a user on the system you are logging into. login1.CBW.calculquebec.cloud is the address of the linux system on Compute Canada that you are logging into. Instead of the using public DNS name, you could also use the IP address if you know that. When you are prompted you will need to enter your password. Logging in with putty (Windows) To log in on windows, you must first install putty. Once you have putty installed, you can log in using the following parameters. If you would like photos of where to input these parameters, please refer here. Session-hostname: login1.CBW.calculquebec.cloud Connection-Data-Auto-login username: user# user# is the name of a user on the system you are logging into. login1.CBW.calculquebec.cloud is the address of the linux system on Compute Canada that you are logging into. Instead of the using public DNS name, you could also use the IP address if you know that. When you are prompted you will need to enter your password. Copying files to your computer To copy files from an instance, use scp in a similar fashion (in this case to copy a file called nice_alignments.bam): scp user#@login1.CBW.calculquebec.cloud:nice_alignments.bam . Using Jupyter Notebook or JupyterLab Everything created in your workspace on the cloud is also available by a web server using Jupyter Notebooks or JupyterLab. 
You can also perform Python/R analysis and access an interactive command-line terminal via JupyterLab. Simply go to the following address in your browser and choose Jupyter Notebook (or JupyterLab) in the User Interface dropdown menu. For simply browsing and downloading files, you can select Number of cores = 1 and Memory (MB) = 3200. For analysis in JupyterLab, select Number of cores = 4 and Memory (MB) = 32000. NOTE: Be aware that resources requested from your terminal/PuTTY (e.g., salloc requests) and resources requested via Jupyter are additive. Make sure to terminate any terminal or Jupyter session not in use. It is important to log out once you finish your Jupyter session to release the resources. If you only close the browser window, your Jupyter session is still running and using the resources. https://jupyter.cbw.calculquebec.cloud/ File system layout When you log in, you will be in your home directory (e.g., /home/user##). You will notice that you have three directories: “CourseData”, “projects”, and “scratch”. For the purposes of this course, we will mostly be working in your home directory and making use of some data files in the CourseData directory. How to request and use a compute node After you log into the cluster, you will be on the login node. This has very limited compute and memory resources. Do NOT run anything on the login node. You can access a compute node with an interactive session using the salloc command. For example: salloc --mem 24000M -c 4 -t 8:0:0 --mem: the real memory (in megabytes) required per node. -c | --cpus-per-task: number of processors required. -t | --time: limit on the total run time of the job allocation. The above command requests an interactive session with 4 cores and 24000M of memory for 8 hours. Once the job is allocated, you will be on one of the compute nodes. After you have received your compute node, you will need to load the software that we will be using for this workshop. This can be done with the following command. 
module load samtools/1.10 bam-readcount/0.8.0 hisat2/2.2.0 stringtie/2.1.0 gffcompare/0.11.6 tophat/2.1.1 kallisto/0.46.1 fastqc/0.11.8 multiqc/1.8 picard/2.20.6 flexbar/3.5.0 RSeQC/3.0.1 bedops/2.4.39 ucsctools/399 r/4.0.0 python/3.7.4 bam-readcount/0.8.0 HTSeq/1.18.1 regtools/0.5.2 Getting information on your compute jobs The following command allow you to see all current jobs requested by your user and cancel a job if needed. This could be needed if you get connected from your compute session and you wind up with “zombie” jobs that you are no longer connected to. The first command can be used to find the job id needed for the second command. squeue -u $user scancel $jobid When you are done with the compute node, make sure to type exit to exit the node and free up the resources you allocated for the node.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://www.rnabio.org//assets/logos/DNA.jpg" /><media:content medium="image" url="http://www.rnabio.org//assets/logos/DNA.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Strand Settings</title><link href="http://www.rnabio.org//module-09-appendix/0009/12/01/StrandSettings/" rel="alternate" type="text/html" title="Strand Settings" /><published>0009-12-01T00:00:00+00:00</published><updated>0009-12-01T00:00:00+00:00</updated><id>http://www.rnabio.org//module-09-appendix/0009/12/01/StrandSettings</id><content type="html" xml:base="http://www.rnabio.org//module-09-appendix/0009/12/01/StrandSettings/"><![CDATA[<h3 id="strand-related-settings">Strand-related settings</h3>

<p>There are various strand-related settings for RNA-seq tools that must be adjusted to account for library construction strategy. The following table provides read orientation codes and software settings for commonly used RNA-seq analysis tools including: IGV, TopHat, HISAT2, HTSeq, Picard, Kallisto, StringTie, and others. Each of these explanations/settings is provided for several commonly used RNA-seq library construction kits that produce either stranded or unstranded data.</p>

<p><em>NOTE</em>: A useful tool to infer strandedness of your raw sequence data is the <a href="https://github.com/betsig/how_are_we_stranded_here">check_strandedness tool</a>. We provide a tutorial for using this tool <a href="/module-01-inputs/0001/05/01/RNAseq_Data/#determining-the-strandedness-of-rna-seq-data">here</a>.</p>

<p><em>NOTE</em>: In the table below, the list of methods/kits for specific strand settings <strong>assumes that these kits are used as specified by their manufacturer</strong>. It is very possible that a sequencing provider/core may make modifications to these kits. For example, in one case we obtained RNA-seq data processed with the NEBNext Ultra II Directional kit (dUTP method). However, instead of using the NEB hairpin adapters, IDT xGen UDI-UMI adapters were substituted, and this <a href="https://www.idtdna.com/pages/support/faqs/can-the-xgen-unique-dual-index-umi-adapters-be-used-for-rna-seq">results in the insert strandedness being flipped</a> (from RF/fr-firststrand to FR/fr-secondstrand). Because this level of detail is not always provided, it is highly recommended to <a href="https://github.com/betsig/how_are_we_stranded_here">confirm your data’s strandedness empirically</a>.</p>

<table>
  <thead>
    <tr>
      <th><strong>Tool</strong></th>
      <th><strong>RF/fr-firststrand stranded (dUTP)</strong></th>
      <th><strong>FR/fr-secondstrand stranded (Ligation)</strong></th>
      <th><strong>Unstranded</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>check_strandedness (output)</strong></td>
      <td>RF/fr-firststrand</td>
      <td>FR/fr-secondstrand</td>
      <td>unstranded</td>
    </tr>
    <tr>
      <td><strong>IGV (5p to 3p read orientation code)</strong></td>
      <td>F2R1</td>
      <td>F1R2</td>
      <td>F2R1 or F1R2</td>
    </tr>
    <tr>
      <td><strong>TopHat (<code class="language-plaintext highlighter-rouge">--library-type</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">fr-firststrand</code></td>
      <td><code class="language-plaintext highlighter-rouge">fr-secondstrand</code></td>
      <td><code class="language-plaintext highlighter-rouge">fr-unstranded</code></td>
    </tr>
    <tr>
      <td><strong>HISAT2 (<code class="language-plaintext highlighter-rouge">--rna-strandness</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">R/RF</code></td>
      <td><code class="language-plaintext highlighter-rouge">F/FR</code></td>
      <td>NONE</td>
    </tr>
    <tr>
      <td><strong>HTSeq (<code class="language-plaintext highlighter-rouge">--stranded</code>/<code class="language-plaintext highlighter-rouge">-s</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">reverse</code></td>
      <td><code class="language-plaintext highlighter-rouge">yes</code></td>
      <td><code class="language-plaintext highlighter-rouge">no</code></td>
    </tr>
    <tr>
      <td><strong>STAR</strong></td>
      <td>n/a (STAR doesn’t use library strandedness info for mapping)</td>
      <td>NONE</td>
      <td>NONE</td>
    </tr>
    <tr>
      <td><strong>Picard CollectRnaSeqMetrics (<code class="language-plaintext highlighter-rouge">STRAND_SPECIFICITY</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">SECOND_READ_TRANSCRIPTION_STRAND</code></td>
      <td><code class="language-plaintext highlighter-rouge">FIRST_READ_TRANSCRIPTION_STRAND</code></td>
      <td>NONE</td>
    </tr>
    <tr>
      <td><strong>Kallisto quant (parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">--rf-stranded</code></td>
      <td><code class="language-plaintext highlighter-rouge">--fr-stranded</code></td>
      <td>NONE</td>
    </tr>
    <tr>
      <td><strong>StringTie (parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">--rf</code></td>
      <td><code class="language-plaintext highlighter-rouge">--fr</code></td>
      <td>NONE</td>
    </tr>
    <tr>
      <td><strong>FeatureCounts (<code class="language-plaintext highlighter-rouge">-s</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">2</code></td>
      <td><code class="language-plaintext highlighter-rouge">1</code></td>
      <td><code class="language-plaintext highlighter-rouge">0</code></td>
    </tr>
    <tr>
      <td><strong>RSEM (<code class="language-plaintext highlighter-rouge">--forward-prob</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">0</code></td>
      <td><code class="language-plaintext highlighter-rouge">1</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.5</code></td>
    </tr>
    <tr>
      <td><strong>Salmon (<code class="language-plaintext highlighter-rouge">--libType</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">ISR</code> (assuming paired-end with inward read orientation)</td>
      <td><code class="language-plaintext highlighter-rouge">ISF</code> (assuming paired-end with inward read orientation)</td>
      <td><code class="language-plaintext highlighter-rouge">IU</code> (assuming paired-end with inward read orientation)</td>
    </tr>
    <tr>
      <td><strong>Trinity (<code class="language-plaintext highlighter-rouge">--SS_lib_type</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">RF</code></td>
      <td><code class="language-plaintext highlighter-rouge">FR</code></td>
      <td>NONE</td>
    </tr>
    <tr>
      <td><strong>MGI CWL YAML (<code class="language-plaintext highlighter-rouge">strand</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">first</code></td>
      <td><code class="language-plaintext highlighter-rouge">second</code></td>
      <td>NONE</td>
    </tr>
    <tr>
      <td><strong>WASHU WDL YAML (<code class="language-plaintext highlighter-rouge">strand</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">first</code></td>
      <td><code class="language-plaintext highlighter-rouge">second</code></td>
      <td><code class="language-plaintext highlighter-rouge">unstranded</code></td>
    </tr>
    <tr>
      <td><strong>RegTools (<code class="language-plaintext highlighter-rouge">strand</code> parameter)</strong></td>
      <td><code class="language-plaintext highlighter-rouge">-s RF</code></td>
      <td><code class="language-plaintext highlighter-rouge">-s FR</code></td>
      <td><code class="language-plaintext highlighter-rouge">-s XS</code></td>
    </tr>
    <tr>
      <td><strong>Example kits</strong></td>
      <td><strong>Example methods/kits:</strong> dUTP, NSR, NNSR, Illumina TruSeq Strand Specific Total RNA, NEBNext Ultra II Directional, Watchmaker RNA Library Prep Kit with Polaris Depletion</td>
      <td><strong>Example methods/kits:</strong> Ligation, Standard SOLiD, NuGEN Encore, 10X 5’ scRNA data</td>
      <td><strong>Example kits/data:</strong> Standard Illumina, NuGEN OvationV2, SMARTer universal low input RNA kit (TaKara), GDC normalized TCGA data</td>
    </tr>
  </tbody>
</table>
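
<p>When scripting a pipeline, it can be convenient to look up the setting for a tool from the strandedness reported by check_strandedness. The following is a minimal sketch covering a few rows of the table above. The function name <code class="language-plaintext highlighter-rouge">strand_flag</code> and the lowercase tool keys are hypothetical, and the HISAT2 and Salmon values assume paired-end data. An empty result means that no strand option is passed for that tool.</p>

```shell
#!/usr/bin/env bash
# Hypothetical lookup helper: print the strand option for a tool.
# Usage: strand_flag TOOL LIBTYPE   (LIBTYPE is RF, FR, or unstranded)
# Values are taken from the table above; paired-end data is assumed.
strand_flag() {
    local tool="$1"
    local lib="$2"
    case "$tool:$lib" in
        hisat2:RF)                printf '%s\n' "--rna-strandness RF" ;;
        hisat2:FR)                printf '%s\n' "--rna-strandness FR" ;;
        htseq:RF)                 printf '%s\n' "--stranded reverse" ;;
        htseq:FR)                 printf '%s\n' "--stranded yes" ;;
        htseq:unstranded)         printf '%s\n' "--stranded no" ;;
        kallisto:RF)              printf '%s\n' "--rf-stranded" ;;
        kallisto:FR)              printf '%s\n' "--fr-stranded" ;;
        stringtie:RF)             printf '%s\n' "--rf" ;;
        stringtie:FR)             printf '%s\n' "--fr" ;;
        featurecounts:RF)         printf '%s\n' "-s 2" ;;
        featurecounts:FR)         printf '%s\n' "-s 1" ;;
        featurecounts:unstranded) printf '%s\n' "-s 0" ;;
        salmon:RF)                printf '%s\n' "--libType ISR" ;;
        salmon:FR)                printf '%s\n' "--libType ISF" ;;
        salmon:unstranded)        printf '%s\n' "--libType IU" ;;
        # Tools listed as NONE in the Unstranded column take no strand option.
        hisat2:unstranded|kallisto:unstranded|stringtie:unstranded)
            printf '\n' ;;
        *)
            echo "unknown tool/library type: $tool $lib" >&2
            return 1
            ;;
    esac
}

strand_flag hisat2 RF
# prints: --rna-strandness RF
```

<p>Keeping the mapping in one place makes it harder for different steps of a pipeline to disagree about strandedness. It does not replace checking the data empirically.</p>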

<h3 id="notes">Notes</h3>

<p>To identify which <code class="language-plaintext highlighter-rouge">--library-type</code> setting to use with TopHat, Illumina specifically documents the types in the ‘RNA Sequencing Analysis with TopHat’ Booklet. For the TruSeq RNA Sample Prep Kit, the appropriate library type is <code class="language-plaintext highlighter-rouge">fr-unstranded</code>. For TruSeq stranded sample prep kits, the library type is specified as <code class="language-plaintext highlighter-rouge">fr-firststrand</code>. These posts are also very informative: <a href="https://onetipperday.blogspot.com/2012/07/how-to-tell-which-library-type-to-use.html">How to tell which library type to use (fr-firststrand or fr-secondstrand)?</a> and <a href="https://www.biostars.org/p/56958/">How to determine if a library Is strand-specific</a>. Another suggestion is to view aligned reads in IGV and determine the read orientation by one of two methods. First, you can have IGV color alignments according to strand using the ‘Color alignments’ by ‘First-of-pair strand’ setting. Second, to get more detailed information you can hover your cursor over a read aligned to an exon. ‘F2 R1’ means the second read in the pair aligns to the forward strand and the first read in the pair aligns to the reverse strand. For a positive DNA strand transcript (5’ to 3’) this would denote an fr-firststrand setting in TopHat, i.e. “the right-most end of the fragment (in transcript coordinates) is the first sequenced”. For a negative DNA strand transcript (3’ to 5’) this would denote an fr-secondstrand setting in TopHat. ‘F1 R2’ means the first read in the pair aligns to the forward strand and the second read in the pair aligns to the reverse strand. See above for the complete definitions; the interpretation is simply the inverse for ‘F1 R2’ mapping. Anything other than FR orientation is not covered here and discussion with the individual responsible for library creation would be required. 
Typically ‘RF’ orientation is reserved for large-insert mate-pair libraries. Other orientations like ‘FF’ and ‘RR’ seem impossible with Illumina sequencing technology and suggest structural variation between the sample and reference. Additional details are provided in the TopHat manual.</p>

<p>For HTSeq, the htseq-count manual indicates that for the <code class="language-plaintext highlighter-rouge">--stranded</code> option, <code class="language-plaintext highlighter-rouge">stranded=no</code> means that a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For <code class="language-plaintext highlighter-rouge">stranded=yes</code> and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For <code class="language-plaintext highlighter-rouge">stranded=reverse</code>, these rules are reversed.</p>

<p>For the ‘CollectRnaSeqMetrics’ sub-command of Picard, the Picard manual indicates that one should use <code class="language-plaintext highlighter-rouge">FIRST_READ_TRANSCRIPTION_STRAND</code> if the reads are expected to be on the transcription strand.</p>

<h3 id="example-data-providers">Example data providers</h3>

<p>Examples (from check_strandedness) that we have observed from different providers (note that these could be changed by the provider at any time, so you should always check your own data):</p>

<ul>
  <li>Boston Gene: RF/fr-firststrand</li>
  <li>Personalis: RF/fr-firststrand</li>
  <li>WASHU CLE Lab: RF/fr-firststrand</li>
  <li>Caris: RF/fr-firststrand</li>
  <li>Tempus: FR/fr-secondstrand</li>
  <li>IGM @ Nationwide Children’s Hospital: FR/fr-secondstrand</li>
</ul>]]></content><author><name>Zachary Skidmore</name></author><category term="Module-09-Appendix" /><summary type="html"><![CDATA[Strand-related settings There are various strand-related settings for RNA-seq tools that must be adjusted to account for library construction strategy. The following table provides read orientation codes and software settings for commonly used RNA-seq analysis tools including: IGV, TopHat, HISAT2, HTSeq, Picard, Kallisto, StringTie, and others. Each of these explanations/settings is provided for several commonly used RNA-seq library construction kits that produce either stranded or unstranded data. NOTE: A useful tool to infer strandedness of your raw sequence data is the check_strandedness tool. We provide a tutorial for using this tool here. NOTE: In the table below, the list of methods/kits for specific strand settings assumes that these kits are used as specified by their manufacturer. It is very possible that a sequencing provider/core may make modifications to these kits. For example, in one case we obtained RNAseq data processed with NEBNext Ultra II Directional kit (dUTP method). However instead of using the NEB hairpin adapters, IDT xGen UDI-UMI adapters were substituted, and this results in the insert strandedness being flipped (from RF/fr-firststrand to FR/fr-secondstrand). Because this level of detail is not always provided it is highly recommended to confirm your data’s strandedness empirically. 
Tool RF/fr-firststrand stranded (dUTP) FR/fr-secondstrand stranded (Ligation) Unstranded check_strandedness (output) RF/fr-firststrand FR/fr-secondstrand unstranded IGV (5p to 3p read orientation code) F2R1 F1R2 F2R1 or F1R2 TopHat (--library-type parameter) fr-firststrand fr-secondstrand fr-unstranded HISAT2 (--rna-strandness parameter) R/RF F/FR NONE HTSeq (--stranded/-s parameter) reverse yes no STAR n/a (STAR doesn’t use library strandedness info for mapping) NONE NONE Picard CollectRnaSeqMetrics (STRAND_SPECIFICITY parameter) SECOND_READ_TRANSCRIPTION_STRAND FIRST_READ_TRANSCRIPTION_STRAND NONE Kallisto quant (parameter) --rf-stranded --fr-stranded NONE StringTie (parameter) --rf --fr NONE FeatureCounts (-s parameter) 2 1 0 RSEM (–forward-prob parameter) 0 1 0.5 Salmon (--libType parameter) ISR (assuming paired-end with inward read orientation) ISF (assuming paired-end with inward read orientation) IU (assuming paired-end with inward read orientation) Trinity (–SS_lib_type parameter) RF FR NONE MGI CWL YAML (strand parameter) first second NONE WASHU WDL YAML (strand parameter) first second unstranded RegTools (strand parameter) -s RF -s FR -s XS Example kits Example methods/kits: dUTP, NSR, NNSR, Illumina TruSeq Strand Specific Total RNA, NEBNext Ultra II Directional, Watchmaker RNA Library Prep Kit with Polaris Depletion Example methods/kits: Ligation, Standard SOLiD, NuGEN Encore, 10X 5’ scRNA data Example kits/data: Standard Illumina, NuGEN OvationV2, SMARTer universal low input RNA kit (TaKara), GDC normalized TCGA data Notes To identify which --library-type setting to use with TopHat, Illumina specifically documents the types in the ‘RNA Sequencing Analysis with TopHat’ Booklet. For the TruSeq RNA Sample Prep Kit, the appropriate library type is fr-unstranded. For TruSeq stranded sample prep kits, the library type is specified as fr-firststrand. 
These posts are also very informative: How to tell which library type to use (fr-firststrand or fr-secondstrand)? and How to determine if a library Is strand-specific. Another suggestion is to view aligned reads in IGV and determine the read orientation by one of two methods. First, you can have IGV color alignments according to strand using the ‘Color alignments’ by ‘First-of-pair strand’ setting. Second, to get more detailed information you can hover your cursor over a read aligned to an exon. ‘F2 R1’ means the second read in the pair aligns to the forward strand and the first read in the pair aligns to the reverse strand. For a positive DNA strand transcript (5’ to 3’) this would denote a fr-firststrand setting in TopHat, i.e. “the right-most end of the fragment (in transcript coordinates) is the first sequenced”. For a negative DNA strand transcript (3’ to 5’) this would denote a fr-secondstrand setting in TopHat. ‘F1 R2’ means the first read in the pair aligns to the forward strand and the second read in the pair aligns to the reverse strand. See above for the complete definitions, but its simply the inverse for ‘F1 R2’ mapping. Anything other than FR orientation is not covered here and discussion with the individual responsible for library creation would be required. Typically ‘RF’ orientation is reserved for large-insert mate-pair libraries. Other orientations like ‘FF’ and ‘RR’ seem impossible with Illumina sequence technology and suggest structural variation between the sample and reference. Additional details are provided in the TopHat manual. For HTSeq, the htseq-count manual indicates that for the --stranded option, stranded=no means that a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. 
For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed. For the ‘CollectRnaSeqMetrics’ sub-command of Picard, the Picard manual indicates that one should use FIRST_READ_TRANSCRIPTION_STRAND if the reads are expected to be on the transcription strand. Example data providers Examples (from check_strandedness) that we have observed from different providers (note that these could be changed by the provider at any time, so you should always check your own data): Boston Gene: RF/fr-firststrand Personalis: RF/fr-firststrand WASHU CLE Lab: RF/fr-firststrand Caris: RF/fr-firststrand Tempus: FR/fr-secondstrand IGM @ Nationwide Children’s Hospital: FR/fr-secondstrand]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://www.rnabio.org//assets/logos/DNA.jpg" /><media:content medium="image" url="http://www.rnabio.org//assets/logos/DNA.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Complete Result Sets</title><link href="http://www.rnabio.org//module-09-appendix/0009/11/01/CompleteResultSets/" rel="alternate" type="text/html" title="Complete Result Sets" /><published>0009-11-01T00:00:00+00:00</published><updated>0009-11-01T00:00:00+00:00</updated><id>http://www.rnabio.org//module-09-appendix/0009/11/01/CompleteResultSets</id><content type="html" xml:base="http://www.rnabio.org//module-09-appendix/0009/11/01/CompleteResultSets/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>The following links provide examples of complete result sets for different iterations of this course. These are meant to be the complete set of result files obtained by the instructor running through all the commands of the course. The files are made available in the same file/directory structure as you should get from following the instructions yourself.</p>

<ul>
  <li><a href="http://genomedata.org/rnaseq-tutorial/results/cbw2020/">CBW June 2020 (Virtual)</a></li>
</ul>]]></content><author><name>Zachary Skidmore</name></author><category term="Module-09-Appendix" /><summary type="html"><![CDATA[Introduction The following links provide examples of complete result sets for different interations of this coures. These are meant to be the complete set of result files obtained by the instructor running through all the commands of the course. The files are made available in the same file/directory structure as you should get from following the instructions yourself. CBW June 2020 (Virtual)]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://www.rnabio.org//assets/logos/DNA.jpg" /><media:content medium="image" url="http://www.rnabio.org//assets/logos/DNA.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Bioinformatics Best Practices</title><link href="http://www.rnabio.org//module-09-appendix/0009/10/01/Bioinformatics_Best_Practices/" rel="alternate" type="text/html" title="Bioinformatics Best Practices" /><published>0009-10-01T00:00:00+00:00</published><updated>0009-10-01T00:00:00+00:00</updated><id>http://www.rnabio.org//module-09-appendix/0009/10/01/Bioinformatics_Best_Practices</id><content type="html" xml:base="http://www.rnabio.org//module-09-appendix/0009/10/01/Bioinformatics_Best_Practices/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>This <em>best practices</em> guide provides a basic overview of useful practices and tools for managing bioinformatics environments and analysis development.</p>

<h2 id="managing-your-analysis-with-notebooks">Managing Your Analysis with Notebooks</h2>

<p>Similar to the use of a laboratory notebook, taking notes about the procedures and analysis you performed is critical to reproducible science. There are a number of scientific computing notebooks available, but the most popular by far is the <a href="http://jupyter.org/">Jupyter Notebook</a>.</p>

<p>Jupyter supports interactive data science and scientific computing across many programming languages, although its most popular use is with <a href="https://www.python.org/">Python</a>, as the Jupyter Notebook is built upon the Python-based <a href="https://ipython.org/">IPython Notebook</a>.</p>

<h3 id="example-notebooks">Example notebooks</h3>

<p>A <a href="https://try.jupyter.org">live version of Jupyter</a> is available to try online, and provides several example notebooks in a few different languages. You can also check out a <a href="https://gist.github.com/ahwagner/595291c53ddaf8da64e995ad3a555d54">real analysis</a> of Guide to Pharmacology gene family data for incorporation into the <a href="http://www.dgidb.org/faq">Drug-Gene Interaction Database</a>.</p>

<h2 id="versioning-code-with-git-and-github">Versioning Code with Git and GitHub</h2>

<p><a href="https://git-scm.com/">Git</a> is a distributed version control system that allows users to make changes to code while simultaneously documenting those changes and preserving a history, allowing code to be rolled back to a previous version quickly and safely. <a href="https://github.com/">GitHub</a> is a freemium, online repository hosting service. You may use GitHub to track projects, discuss issues, document applications, and review code. GitHub is one of the best ways to share your projects, and should be used from the very outset of a project. Some forethought should be given to creating and managing a repository, however, as GitHub is not a good place to share very large or sensitive data files. See the <a href="https://docs.github.com/en/get-started/quickstart/hello-world">10-minute introduction to using GitHub</a>.</p>
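<p>A minimal sketch of starting a repository while keeping large data files out of version control might look like the following. The directory, file names, ignore patterns, and commit identity are all illustrative assumptions, and the example runs in a temporary directory so that it does not touch a real project.</p>

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory (illustrative only)
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# Ignore large or sensitive data; these patterns are examples, not a complete list
printf '%s\n' 'data/' '*.bam' '*.fastq.gz' > .gitignore

# A placeholder data file (ignored) and a placeholder analysis script (tracked)
mkdir -p data
echo "placeholder reads" > data/sample.fastq.gz
echo 'echo "analysis step"' > analysis.sh

# Commit only the code and the ignore rules; the identity is set just for this command
git add .gitignore analysis.sh
git -c user.name="Example User" -c user.email="user@example.org" \
  commit -q -m "Add analysis script and ignore rules"

# Show tracked changes plus ignored files
git status --short --ignored
```

<p>Committing the ignore rules first means the data directory cannot be added by accident later in the project.</p>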

<h2 id="managing-your-compute-environment">Managing Your Compute Environment</h2>

<p>One of the most challenging aspects of bioinformatics workflows is reproducibility. In addition to documenting your analysis with a notebook, providing a copy of your compute environment limits variability in results, allowing for future reproduction of results. Many options exist to handle this; some of the most common are presented below.</p>

<p><a href="https://aws.amazon.com/ec2/">Amazon Elastic Compute Cloud (EC2)</a> is a useful service for creating entire virtual machines that can easily be copied and distributed. This option does require a paid account with Amazon, and the costs of storing the images and running instances may add up over time, especially if every analysis is stored in a separate image. Additionally, this option does not isolate the analysis environment from the system environment, potentially leading to changes in analysis output as system libraries are updated over time. The RNA-seq wiki makes heavy use of AWS as a distribution platform.</p>

<p><a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a> is a general-purpose full virtualizer that allows you to emulate a computer, complete with virtual disks, a virtual operating system, and any data and applications stored therein. It has the advantage of creating machines that are stored and run on local hardware (e.g. your personal workstation), but the extra overhead of running a virtual computer on top of a host operating system can considerably slow performance of tools stored on the virtual machine, and thus is best used for testing or demonstration purposes.</p>

<p><a href="https://docs.docker.com/engine/understanding-docker/">Docker</a> packages apps and their dependencies into <em>containers</em> which may be <em>docked</em> to a docker engine running on a computer. Docker engines are available on all major operating systems, and allow software to remain infrastructure independent while sharing a filespace and system resources with other docked containers.  This is a much more efficient approach than guest virtual machines, and containers may be docked locally or on cloud-based infrastructure.</p>

<p><a href="http://conda.pydata.org/docs/">Conda</a> is a language-agnostic package, dependency and environment management system. It is included in the data-science-focused distribution of Conda, <a href="https://www.anaconda.com/about-us">Anaconda</a>. Anaconda is based on Python and R packages for the analysis of scientific, large-scale data. Bioinformaticians also commonly use <a href="https://bioconda.github.io/">Bioconda</a>, which add channels to Conda with bioinformatics tools (such as the popular sequence alignment tool BWA).</p>]]></content><author><name>Zachary Skidmore</name></author><category term="Module-09-Appendix" /><summary type="html"><![CDATA[Introduction This best practices guide provides a basic overview of useful practices and tools for managing bioinformatics environments and analysis development. Managing Your Analysis with Notebooks Similar to the use of a laboratory notebook, taking notes about the procedures and analysis you performed is critical to reproducible science. There are a number of scientific computing notebooks available, but the most popular by far is the Jupyter Notebook. Jupyter supports interactive data science and scientific computer across a small number of languages, although the most popular use of Jupyter is with Python, as the Jupyter notebook is built upon the Python-based iPython Notebook. Example notebooks A live version of Jupyter is available to try online, and provides several example notebooks in a few different languages. You can also check out a real analysis of Guide to Pharmacology gene family data for incorporation into the Drug-Gene Interaction Database. Versioning Code with Git and GitHub Git is a distributed version control system that allows users to make changes to code while simultaneously documenting those changes and preserving a history, allowing code to be rolled back to a previous version quickly and safely. GitHub is a freemium, online repository hosting service. 
You may use GitHub to track projects, discuss issues, document applications, and review code. GitHub is one of the best ways to share your projects, and should be used from the very onset of a project. Some forethought should be given in creating and managing a repository, however, as GitHub is not a good place to share very large or sensitive data files. See the 10-minute introduction to using GitHub. Managing Your Compute Environment One of the most challenging aspects of bioinformatics workflows is reproducibility. In addition to documenting your analysis with a notebook, providing a copy of your compute environment limits variability in results, allowing for future reproduction of results. A world of options exist to handle this, although some of the most common options are presented. AWS Elastic Cloud Computing is a useful service for creating entire virtual machines that can easily be copied and distributed. This option does require a paid account with Amazon, and the costs of storing the images and running instances may add up over time, especially if every analysis is stored in a separate image. Additionally, this option does not isolate the analysis environment from the system environment, potentially leading to changes in analysis output as system libraries are updated over time. The RNA-seq wiki makes heavy use of AWS as a distribution platform. VirtualBox is a general-purpose full virtualizer that allows you to emulate a computer, complete with virtual disks, a virtual operating system, and any data and applications stored therein. It has the advantage of creating machines that are stored and run on local hardware (e.g. your personal workstation), but the extra overhead of running a virtual computer on top of a host operating system can considerably slow performance of tools stored on the virtual machine, and thus is best used for testing or demonstration purposes. 
Docker packages apps and their dependencies into containers which may be docked to a docker engine running on a computer. Docker engines are available on all major operating systems, and allow software to remain infrastructure independent while sharing a filespace and system resources with other docked containers. This is a much more efficient approach than guest virtual machines, and containers may be docked locally or on cloud-based infrastructure. Conda is a language-agnostic package, dependency and environment management system. It is included in the data-science-focused distribution of Conda, Anaconda. Anaconda is based on Python and R packages for the analysis of scientific, large-scale data. Bioinformaticians also commonly use Bioconda, which add channels to Conda with bioinformatics tools (such as the popular sequence alignment tool BWA).]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://www.rnabio.org//assets/logos/DNA.jpg" /><media:content medium="image" url="http://www.rnabio.org//assets/logos/DNA.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">POSIT Setup</title><link href="http://www.rnabio.org//module-09-appendix/0009/09/03/POSIT_Setup/" rel="alternate" type="text/html" title="POSIT Setup" /><published>0009-09-03T00:00:00+00:00</published><updated>0009-09-03T00:00:00+00:00</updated><id>http://www.rnabio.org//module-09-appendix/0009/09/03/POSIT_Setup</id><content type="html" xml:base="http://www.rnabio.org//module-09-appendix/0009/09/03/POSIT_Setup/"><![CDATA[<h2 id="posit-setup-for-use-in-cri-2024-and-2025-workshop">Posit setup for use in CRI 2024 and 2025 workshop</h2>

<p>This tutorial explains how Posit Cloud RStudio was configured for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers who wish to conduct their hands-on exercises on Posit RStudio.</p>

<p>A Posit workspace was already created by the workshop organizers. For the workshop, we used Posit projects with 16 GB RAM, 2 cores, and Ubuntu 20.04 as the operating system. Using these configurations, we created a template that has all the raw data files uploaded along with the R packages needed for the workshop. From the student side, the intention is to make copies of this template so that each student has an RStudio environment with the raw data files and the packages pre-installed.</p>

<h2 id="upload-raw-data">Upload raw data</h2>

<p>Folders for uploading raw data were created using the RStudio terminal. Files were either uploaded from a local laptop or storage1 location using the <code class="language-plaintext highlighter-rouge">Upload</code> feature in the bottom right pane of the RStudio window, or downloaded from <a href="http://genomedata.org">genomedata.org</a> using <code class="language-plaintext highlighter-rouge">wget</code> from the RStudio terminal.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>data
<span class="nb">mkdir </span>outdir
<span class="nb">mkdir </span>outdir_single_cell_rna
<span class="nb">mkdir </span>package_installation

<span class="nb">cd </span>data
<span class="nb">mkdir </span>single_cell_rna
<span class="nb">mkdir </span>bulk_rna
</code></pre></div></div>

<h3 id="files-in-single_cell_rna">Files in single_cell_rna</h3>
<ul>
  <li>CellRanger outputs for reps1,3,5 (uploaded from <code class="language-plaintext highlighter-rouge">/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/counts_gex/sample_filtered_feature_bc_matrix.h5.zip</code>)</li>
  <li>BCR and TCR clonotypes (uploaded from <code class="language-plaintext highlighter-rouge">/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/clonotypes_b_posit.zip</code> and <code class="language-plaintext highlighter-rouge">/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/clonotypes_t_posit.zip</code>)</li>
  <li>MSigDB <code class="language-plaintext highlighter-rouge">M8: cell type signature gene sets</code> (downloaded GMT file from <a href="https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/2023.2.Mm/m8.all.v2023.2.Mm.symbols.gmt">MSigDB website</a> to laptop and then uploaded to single_cell_rna folder)</li>
  <li>CONICSmat mm10 chr arms positions file (downloaded file from CONICSmat GitHub - <a href="https://github.com/diazlab/CONICS/blob/master/chromosome_full_positions_mm10.txt">chromosome_full_positions_mm10.txt</a> to laptop and then uploaded to single_cell_rna folder)</li>
  <li>VarTrix file with barcodes and tumor calls (uploaded from <code class="language-plaintext highlighter-rouge">/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/Tumor_Calls_per_Variants_for_CRI_Updated_Barcodes.tsv</code>). This file might not be needed and may be removed.</li>
  <li>VarTrix output files (uploaded all matrices and the barcodes files from <code class="language-plaintext highlighter-rouge">/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/vartrix_outputs_for_CRI.zip</code> - uploaded to a <code class="language-plaintext highlighter-rouge">cancer_cell_id</code> folder in <code class="language-plaintext highlighter-rouge">data/single_cell_rna/</code>)</li>
  <li>Mouse variants VCF file (uploaded file from <code class="language-plaintext highlighter-rouge">/storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/exome/output_updated/final_basic_filtered_annotated.vcf</code>)</li>
</ul>

<p>Posit requires all files to be zipped prior to uploading and automatically unzips them after the upload. After uploading the files, we made a folder for the CellRanger outputs and moved the <code class="language-plaintext highlighter-rouge">.h5</code> files there. We also downloaded the inferCNV reference files using <code class="language-plaintext highlighter-rouge">wget</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#organize cellranger outputs</span>
<span class="nb">cd</span> /cloud/project/data/single_cell_rna
<span class="nb">mkdir </span>cellranger_outputs
<span class="nb">mv</span> <span class="k">*</span>.h5 cellranger_outputs

<span class="c">#download inferCNV reference files and organize all reference files</span>
<span class="nb">mkdir </span>reference_files
<span class="nb">mv </span>m8.all.v2023.2.Mm.symbols.gmt reference_files
<span class="nb">mv </span>Tumor_Calls_per_Variants_for_CRI.tsv reference_files
<span class="nb">cd </span>reference_files
wget https://data.broadinstitute.org/Trinity/CTAT/cnv/mouse_gencode.GRCm38.p6.vM25.basic.annotation.by_gene_id.infercnv_positions
wget https://data.broadinstitute.org/Trinity/CTAT/cnv/mouse_gencode.GRCm38.p6.vM25.basic.annotation.by_gene_name.infercnv_positions

<span class="c">#organize vartrix files</span>
<span class="nb">cd</span> /cloud/project/data/single_cell_rna
<span class="nb">mkdir </span>cancer_cell_id 
<span class="nb">cd </span>cancer_cell_id
wget http://genomedata.org/cri-workshop/somatic_variants_exome/mcb6c-exome-somatic.variants.annotated.clean.tsv

</code></pre></div></div>

<h3 id="files-in-bulk_rna">Files in bulk_rna</h3>
<ul>
  <li>Batch correction file (downloaded from genomedata - <a href="http://genomedata.org/rnaseq-tutorial/batch_correction/GSE48035_ILMN.Counts.SampleSubset.ProteinCodingGenes.tsv">GSE48035_ILMN.Counts.SampleSubset.ProteinCodingGenes.tsv</a>)</li>
  <li>DE analysis files (downloaded from genomedata - <a href="http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/ENSG_ID2Name.txt">ENSG_ID2Name.txt</a> and <a href="http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/gene_read_counts_table_all_final.tsv">gene_read_counts_table_all_final.tsv</a>)</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /cloud/project/data/bulk_rna
wget http://genomedata.org/rnaseq-tutorial/batch_correction/GSE48035_ILMN.Counts.SampleSubset.ProteinCodingGenes.tsv
wget http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/ENSG_ID2Name.txt
wget http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/gene_read_counts_table_all_final.tsv
</code></pre></div></div>
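<p>When a template is rebuilt several times, it can help to skip files that are already present. The following is a sketch only: <code class="language-plaintext highlighter-rouge">fetch_if_missing</code> is a hypothetical helper name, and it assumes that a non-empty file with the same name as the URL's last path component is a complete download.</p>

```shell
#!/usr/bin/env bash
# fetch_if_missing is a hypothetical helper: for each URL, download the file
# with wget unless a non-empty file of the same name is already in the
# current directory.
fetch_if_missing() {
  local url name
  for url in "$@"; do
    name="$(basename "$url")"
    if [ -s "$name" ]; then
      echo "skip $name (already present)"
    else
      wget -q "$url" -O "$name" || { echo "failed: $url" >&2; return 1; }
    fi
  done
}

# Usage (run from /cloud/project/data/bulk_rna):
# fetch_if_missing \
#   http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/ENSG_ID2Name.txt \
#   http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/gene_read_counts_table_all_final.tsv
```

<p>A file size check does not detect a truncated download, so comparing against a known checksum would be a stronger test where one is available.</p>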

<h3 id="back-up-files">Back-up files</h3>
<ul>
  <li>Created folder in <code class="language-plaintext highlighter-rouge">outdir/single_cell_rna</code> called <code class="language-plaintext highlighter-rouge">backup_files</code>. Ran through QA/QC assessment and celltyping modules and added <code class="language-plaintext highlighter-rouge">preprocessed_object.rds</code> Seurat object from there to backup_files.</li>
</ul>

<h2 id="installing-packages">Installing packages</h2>

<p>All packages are installed from CRAN, Bioconductor, or GitHub, except for CytoTRACE, which was downloaded to the <code class="language-plaintext highlighter-rouge">package_installation</code> folder and then installed using <code class="language-plaintext highlighter-rouge">devtools</code>.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Download CytoTRACE tar.gz file</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="s2">"https://cytotrace.stanford.edu/CytoTRACE_0.3.3.tar.gz"</span><span class="p">,</span><span class="w"> </span><span class="n">destfile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"package_installation/CytoTRACE_0.3.3.tar.gz"</span><span class="p">)</span><span class="w">

</span><span class="c1"># Installing package installers</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"devtools"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"BiocManager"</span><span class="p">)</span><span class="w">

</span><span class="c1"># Bulk RNA seq libraries</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"genefilter"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"dplyr"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"data.table"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"AnnotationDbi"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"org.Hs.eg.db"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"GO.db"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"gage"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"sva"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"gridExtra"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"edgeR"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"UpSetR"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"DESeq2"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"gtable"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"apeglm"</span><span class="p">)</span><span class="w">

</span><span class="c1"># Intro to R packages</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"tidyr"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"stringr"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"dplyr"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"tidyverse"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"MASS"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"ggpubr"</span><span class="p">)</span><span class="w">

</span><span class="c1"># Single-cell RNA seq libraries</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"sva"</span><span class="p">)</span><span class="w"> </span><span class="c1">#need this for cytotrace</span><span class="w">
</span><span class="n">devtools</span><span class="o">::</span><span class="n">install_local</span><span class="p">(</span><span class="s2">"package_installation/CytoTRACE_0.3.3.tar.gz"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"Seurat"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"dplyr"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"Matrix"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"hdf5r"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"bench"</span><span class="p">)</span><span class="w"> </span><span class="c1"># to mark time</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"viridis"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"R.utils"</span><span class="p">)</span><span class="w">
</span><span class="n">remotes</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"satijalab/seurat-wrappers"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"celldex"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"SingleR"</span><span class="p">)</span><span class="w">
</span><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"immunogenomics/presto"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"EnhancedVolcano"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"clusterProfiler"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"org.Mm.eg.db"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"msigdbr"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"BiocGenerics"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"DelayedArray"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"DelayedMatrixStats"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"limma"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"lme4"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"S4Vectors"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"SingleCellExperiment"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"SummarizedExperiment"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"batchelor"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"HDF5Array"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"terra"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"ggrastr"</span><span class="p">)</span><span class="w">
</span><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"cole-trapnell-lab/monocle3"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"beanplot"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"mixtools"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"pheatmap"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"zoo"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"squash"</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"showtext"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"biomaRt"</span><span class="p">)</span><span class="w">
</span><span class="n">BiocManager</span><span class="o">::</span><span class="n">install</span><span class="p">(</span><span class="s2">"scran"</span><span class="p">)</span><span class="w">
</span><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"diazlab/CONICS/CONICSmat"</span><span class="p">,</span><span class="w"> </span><span class="n">dep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"gprofiler2"</span><span class="p">)</span><span class="w">
</span><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="n">repo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ncborcherding/scRepertoire"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>]]></content><author><name>Zachary Skidmore</name></author><category term="Module-09-Appendix" /><summary type="html"><![CDATA[Posit setup for use in CRI 2024 and 2025 workshop This tutorial explains how Posit cloud RStudio was configured for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers that wish to conduct their hands on exercises on Posit RStudio. A Posit workspace was already created by the workshop organizers. We used Posit projects with 16GB RAM and 2 cores for the workshop with OS Ubuntu 20.04. Using these configurations, we created a template file that has all the raw data files uploaded along with the R packages needed for the workshop. From the student side, the intention is to make copies off this template so that they have an RStudio environment with the raw data files that has the packages pre-installed. Upload raw data Folders for uploading raw data were created using the RStudio terminal. Files were either uploaded from a local laptop/ storage1 location using the Upload feature in the bottom right pane of the RStudio window; or downloaded from genomedata.org using wget from the RStudio terminal. 
mkdir data mkdir outdir mkdir outdir_single_cell_rna mkdir package_installation cd data mkdir single_cell_rna mkdir bulk_rna Files in single_cell_rna CellRanger outputs for reps1,3,5 (uploaded from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/counts_gex/sample_filtered_feature_bc_matrix.h5.zip) BCR and TCR clonotypes (uploaded from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/clonotypes_b_posit.zip and /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/clonotypes_t_posit.zip) MSigDB M8: cell type signature gene sets (downloaded GMT file from MSigDB website to laptop and then uploaded to single_cell_rna folder) CONICSmat mm10 chr arms positions file (downloaded file from CONICSmat GitHub - chromosome_full_positions_mm10.txt to laptop and then uploaded to single_cell_rna folder) VarTrix file with barcodes and tumor calls (uploaded from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/Tumor_Calls_per_Variants_for_CRI_Updated_Barcodes.tsv) -&gt; might not need this so may remove. VarTrix output files (uploaded all matrices and the barcodes files from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/vartrix_outputs_for_CRI.zip - uploaded to a cancer_cell_id folder in data/single_cell_rna/) Mouse variants VCF file (uploaded file from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/exome/output_updated/final_basic_filtered_annotated.vcf) Posit requires all files to be zipped prior to uploading and automatically unzips the folder after the upload. After uploading the files, made a folder for the cellranger outputs, and moved the .h5 files there. 
Will also download inferCNV files using wget #organize cellranger outputs cd /cloud/project/data/single_cell_rna mkdir cellranger_outputs mv *.h5 cellranger_outputs #download inferCNV reference files and organize all reference files mkdir reference_files mv m8.all.v2023.2.Mm.symbols.gmt reference_files mv Tumor_Calls_per_Variants_for_CRI.tsv reference_files cd reference_files wget https://data.broadinstitute.org/Trinity/CTAT/cnv/mouse_gencode.GRCm38.p6.vM25.basic.annotation.by_gene_id.infercnv_positions wget https://data.broadinstitute.org/Trinity/CTAT/cnv/mouse_gencode.GRCm38.p6.vM25.basic.annotation.by_gene_name.infercnv_positions #organize vartrix files cd /cloud/project/data/single_cell_rna mkdir cancer_cell_id cd cancer_cell_id wget http://genomedata.org/cri-workshop/somatic_variants_exome/mcb6c-exome-somatic.variants.annotated.clean.tsv Files in bulk_rna Batch correction file (downloaded from genomedata - GSE48035_ILMN.Counts.SampleSubset.ProteinCodingGenes.tsv) DE analysis files (downloaded from genomedata - ENSG_ID2Name.txt and gene_read_counts_table_all_final.tsv) cd /cloud/project/data/bulk_rna wget http://genomedata.org/rnaseq-tutorial/batch_correction/GSE48035_ILMN.Counts.SampleSubset.ProteinCodingGenes.tsv wget http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/ENSG_ID2Name.txt wget http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/gene_read_counts_table_all_final.tsv Back-up files Created folder in outdir/single_cell_rna called backup_files. Ran through QA/QC assessment and celltyping modules and added preprocessed_object.rds Seurat object from there to backup_files. Installing packages All package installations are from CRAN or BioConductor or GitHub pages, except for CytoTRACE. That was downloaded to the package_installation folder and then installed using devtools. 
#Download CytoTRACE tar.gz file download.file("https://cytotrace.stanford.edu/CytoTRACE_0.3.3.tar.gz", destfile = "package_installation/CytoTRACE_0.3.3.tar.gz") # Installing package installers install.packages("devtools") install.packages("BiocManager") # Bulk RNA seq libraries BiocManager::install("genefilter") install.packages("dplyr") install.packages("ggplot2") install.packages("data.table") BiocManager::install("AnnotationDbi") BiocManager::install("org.Hs.eg.db") BiocManager::install("GO.db") BiocManager::install("gage") BiocManager::install("sva") install.packages("gridExtra") BiocManager::install("edgeR") install.packages("UpSetR") BiocManager::install("DESeq2") install.packages("gtable") BiocManager::install("apeglm") # Intro to R packages install.packages("tidyr") install.packages("stringr") install.packages("ggplot2") install.packages("dplyr") install.packages("tidyverse") install.packages("MASS") install.packages("ggpubr") # Single-cell RNA seq libraries BiocManager::install("sva") #need this for cytotrace devtools::install_local("package_installation/CytoTRACE_0.3.3.tar.gz") install.packages("Seurat") install.packages("ggplot2") install.packages("dplyr") install.packages("Matrix") install.packages("hdf5r") install.packages("bench") # to mark time install.packages("viridis") install.packages("R.utils") remotes::install_github("satijalab/seurat-wrappers") BiocManager::install("celldex") BiocManager::install("SingleR") devtools::install_github("immunogenomics/presto") BiocManager::install("EnhancedVolcano") BiocManager::install("clusterProfiler") BiocManager::install("org.Mm.eg.db") install.packages("msigdbr") BiocManager::install("BiocGenerics") BiocManager::install("DelayedArray") BiocManager::install("DelayedMatrixStats") BiocManager::install("limma") BiocManager::install("lme4") BiocManager::install("S4Vectors") BiocManager::install("SingleCellExperiment") BiocManager::install("SummarizedExperiment") BiocManager::install("batchelor") 
BiocManager::install("HDF5Array") BiocManager::install("terra") BiocManager::install("ggrastr") devtools::install_github("cole-trapnell-lab/monocle3") install.packages("beanplot") install.packages("mixtools") install.packages("pheatmap") install.packages("zoo") install.packages("squash") install.packages("showtext") BiocManager::install("biomaRt") BiocManager::install("scran") devtools::install_github("diazlab/CONICS/CONICSmat", dep = FALSE) install.packages("gprofiler2") devtools::install_github(repo = "ncborcherding/scRepertoire")]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://www.rnabio.org//assets/logos/DNA.jpg" /><media:content medium="image" url="http://www.rnabio.org//assets/logos/DNA.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">GCP Setup</title><link href="http://www.rnabio.org//module-09-appendix/0009/09/02/GCP_Setup/" rel="alternate" type="text/html" title="GCP Setup" /><published>0009-09-02T00:00:00+00:00</published><updated>0009-09-02T00:00:00+00:00</updated><id>http://www.rnabio.org//module-09-appendix/0009/09/02/GCP_Setup</id><content type="html" xml:base="http://www.rnabio.org//module-09-appendix/0009/09/02/GCP_Setup/"><![CDATA[<h1 id="under-development">UNDER DEVELOPMENT</h1>

<h2 id="google-cloud-platform-setup-for-use-in-workshop">Google Cloud Platform setup for use in workshop</h2>

<p>This tutorial explains how a Google Cloud instance can be configured from scratch for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers who wish to conduct their hands-on exercises on Google Cloud Platform (GCP).</p>

<h2 id="create-a-google-cloud-account">Create a Google Cloud account</h2>

<ol>
  <li>You will need a Google account (personal or institutional)</li>
  <li>Use the above email account to log into the Google Cloud Console: https://console.cloud.google.com/.
Note: Any GCP account needs to be linked to an actual person/credit card account or institutional billing account.</li>
  <li>Create a Google Cloud Project connected to a billing source</li>
  <li>Optional - Set up an IAM account. Details to be resolved…</li>
  <li>Request limit increases. You need to be able to spin up at least one instance for every student and TA/instructor. To find current limits and request increases in the console, go to: <code class="language-plaintext highlighter-rouge">IAM &amp; Admin</code> -&gt; <code class="language-plaintext highlighter-rouge">Quotas</code>.</li>
  <li>In the GCP console: Go to <code class="language-plaintext highlighter-rouge">Compute Engine</code> -&gt; <code class="language-plaintext highlighter-rouge">VM instances</code>.</li>
</ol>

<h2 id="start-with-existing-base-image">Start with existing base image</h2>

<ol>
  <li><code class="language-plaintext highlighter-rouge">Create Instance</code></li>
  <li>Give the Instance a Name (e.g. <code class="language-plaintext highlighter-rouge">rnabio-course-2023</code>)</li>
  <li>Select Machine Type (e.g. E2 Series: <code class="language-plaintext highlighter-rouge">e2-standard-2</code>)</li>
  <li>Change the Boot disk to Ubuntu -&gt; Ubuntu 20.04 LTS (x86/64). Change size to 250 GB.</li>
  <li>Under Firewall, select these options: <code class="language-plaintext highlighter-rouge">Allow HTTP traffic</code> and <code class="language-plaintext highlighter-rouge">Allow HTTPS traffic</code></li>
  <li>Hit the <code class="language-plaintext highlighter-rouge">Create</code> button</li>
</ol>
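
<p>The console steps above can also be scripted, which helps when one instance is needed per student. The following is a minimal sketch, not part of the original setup: it only prints one <code class="language-plaintext highlighter-rouge">gcloud compute instances create</code> command per instance so the commands can be reviewed before anything is run. The helper name <code class="language-plaintext highlighter-rouge">print_create_cmds</code>, the <code class="language-plaintext highlighter-rouge">rnabio-student-</code> name prefix, and the zone are assumptions; the machine type, Ubuntu 20.04 image, 250 GB boot disk, and HTTP/HTTPS tags mirror the choices listed above.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one instance-creation command per student.
# Nothing is executed here; review the output, then run the lines you want.
print_create_cmds() {
  local prefix="$1"   # instance name prefix (assumption)
  local count="$2"    # number of instances to create
  local zone="$3"     # compute zone (assumption, e.g. us-central1-a)
  local i=1
  while [ "$i" -le "$count" ]; do
    echo "gcloud compute instances create ${prefix}${i}" \
      "--zone=${zone}" \
      "--machine-type=e2-standard-2" \
      "--image-family=ubuntu-2004-lts" \
      "--image-project=ubuntu-os-cloud" \
      "--boot-disk-size=250GB" \
      "--tags=http-server,https-server"
    i=$((i + 1))
  done
}

# Example: print the commands for three instances.
print_create_cmds rnabio-student- 3 us-central1-a
```

<p>Printing rather than executing keeps the sketch safe to run anywhere, and the quota limits mentioned earlier still apply to however many of the printed commands are actually run.</p>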

<h2 id="install-the-google-cloud-sdk-gcloud-authenticate-your-user-and-login-to-your-vm">Install the Google Cloud SDK (gcloud), authenticate your user, and login to your VM</h2>

<ol>
  <li>Install the Google Cloud commandline interface following instructions here: https://cloud.google.com/sdk/docs/install</li>
  <li>Use the following command and follow instructions to authenticate your user: <code class="language-plaintext highlighter-rouge">gcloud auth login</code></li>
  <li>Set the project to the google billing project created above as follows: <code class="language-plaintext highlighter-rouge">gcloud config set project $project_name</code></li>
  <li>Check the authentication configuration as follows: <code class="language-plaintext highlighter-rouge">gcloud config list</code></li>
  <li>Log into the instance using the instance name chosen above, as follows:</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcloud compute ssh rnabio-course-2023
</code></pre></div></div>

<h2 id="set-up-the-ubuntu-user">Set up the ubuntu user:</h2>
<p>Logging into a Google VM running Ubuntu is a bit different from AWS. By default you will log in with your Google user name (or is it your username from the host machine you log in from?) instead of using the “ubuntu” user.</p>

<p>Set a password for the ubuntu user (and make a note of it somewhere safe). Then switch to the “ubuntu” user before proceeding with the rest of this setup. If you later need to, you should be able to reset the password for the ubuntu user by logging in as another user with sudo privileges.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>whoami
sudo passwd ubuntu
su ubuntu
cd ~

</code></pre></div></div>

<h2 id="perform-basic-linux-configuration">Perform basic linux configuration</h2>

<ul>
  <li>To allow installation of bioinformatics tools some basic dependencies must be installed first.</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get update
<span class="nb">sudo </span>apt-get upgrade
<span class="nb">sudo </span>apt-get <span class="nt">-y</span> <span class="nb">install </span>make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev libbz2-dev docker.io libpcre2-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libdbi-perl libdbd-mysql-perl libcurl4-openssl-dev
<span class="nb">sudo ln</span> <span class="nt">-s</span> /usr/include/jsoncpp/json/ /usr/include/json
<span class="nb">sudo </span>timedatectl set-timezone America/Chicago
</code></pre></div></div>

<ul>
  <li>Log out and log back in for the changes to take effect.</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exit
exit
gcloud compute ssh rnabio-course-2023
su ubuntu
cd ~

</code></pre></div></div>
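
<p>After logging back in, it can help to confirm that the main build tools from the apt-get step are actually on the PATH before compiling anything. A minimal sketch; the helper name <code class="language-plaintext highlighter-rouge">missing_commands</code> and the particular commands checked are illustrative choices, not part of the original setup.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each named command that is NOT found on the PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
    fi
  done
}

# Example: check a few tools that the apt-get step is expected to provide.
# No output means every command was found.
missing_commands make gcc g++ git cmake unzip python3 pip3 java docker
```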

<h2 id="add-ubuntu-user-to-docker-group">Add ubuntu user to docker group</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>usermod <span class="nt">-aG</span> docker ubuntu
</code></pre></div></div>

<p>Then exit the shell and log back into the instance.</p>
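
<p>The new group membership only applies to sessions started after the change, which is why the fresh login is needed. A minimal sketch for confirming it from the new session; <code class="language-plaintext highlighter-rouge">in_current_groups</code> is a hypothetical helper name.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the current session belongs to GROUP.
# `id -nG` with no user argument lists the groups of the running process.
in_current_groups() {
  local group="$1" g
  for g in $(id -nG 2>/dev/null); do
    if [ "$g" = "$group" ]; then
      return 0
    fi
  done
  return 1
}

# Example: run this as the ubuntu user after logging back in.
if in_current_groups docker; then
  echo "this session is in the docker group"
else
  echo "this session is not in the docker group yet"
fi
```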

<h2 id="install-any-desired-informatics-tools">Install any desired informatics tools</h2>

<ul>
  <li><strong>NOTE:</strong> R in particular is a slow install.</li>
  <li><strong>NOTE:</strong></li>
</ul>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">- All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.
</span></code></pre></div></div>

<ul>
  <li>Paths to pre-installed tools can be added to the .bashrc file. It may also be convenient to add <code class="language-plaintext highlighter-rouge">export RNA_HOME=~/workspace/rnaseq</code> to the .bashrc file. See <a href="https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc">https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc</a>.</li>
  <li><strong>NOTE:</strong> In some installations of R there is an executable called pager that clashes with the system pager. This causes man to fail. Check with a <code class="language-plaintext highlighter-rouge">man ls</code> and if the problem exists, add the following to .bashrc:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">MANPAGER</span><span class="o">=</span>less
</code></pre></div></div>

<h3 id="install-rna-seq-software">Install RNA-seq software</h3>

<ul>
  <li>These install instructions should be identical to those found on <a href="https://github.com/griffithlab/rnaseq_tutorial/wiki/Installation">https://github.com/griffithlab/rnaseq_tutorial/wiki/Installation</a> except that each tool is installed in <code class="language-plaintext highlighter-rouge">/home/ubuntu/bin/</code> and its install location is exported to the $PATH variable for easy access.</li>
</ul>
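
<p>Each install section below ends with an <code class="language-plaintext highlighter-rouge">export PATH=...</code> line. If those lines are copied into .bashrc and the file is sourced more than once, the same directory is added to the PATH repeatedly. A minimal sketch of an idempotent alternative; <code class="language-plaintext highlighter-rouge">path_prepend</code> is a hypothetical helper and is not known to exist in the tutorial's own .bashrc.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepend DIR to PATH unless it is already listed.
path_prepend() {
  local dir="$1"
  case ":${PATH}:" in
    *":${dir}:"*) ;;                     # already present, leave PATH alone
    *) export PATH="${dir}:${PATH}" ;;
  esac
}

# Example (directories taken from the install steps in this tutorial):
path_prepend /home/ubuntu/bin/samtools-1.16.1
path_prepend /home/ubuntu/bin/hisat2-2.2.1
```

<p>Wrapping the PATH in colons before matching means a directory is only treated as present when it appears as a whole entry, not as a prefix of a longer one.</p>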

<h4 id="create-directory-to-install-software-to-and-setup-path-variables">Create directory to install software to and setup path variables</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> ~/bin
<span class="nb">cd</span> ~/bin
<span class="nv">WORKSPACE</span><span class="o">=</span>/home/ubuntu/workspace
<span class="nv">HOME</span><span class="o">=</span>/home/ubuntu
</code></pre></div></div>

<h4 id="install-samtools">Install <a href="http://www.htslib.org/">SAMtools</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/samtools/samtools/releases/download/1.16.1/samtools-1.16.1.tar.bz2
bunzip2 samtools-1.16.1.tar.bz2
<span class="nb">tar</span> <span class="nt">-xvf</span> samtools-1.16.1.tar
<span class="nb">cd </span>samtools-1.16.1
make
./samtools
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/samtools-1.16.1:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-bam-readcount">Install <a href="https://github.com/genome/bam-readcount">bam-readcount</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
<span class="nb">export </span><span class="nv">SAMTOOLS_ROOT</span><span class="o">=</span>/home/ubuntu/bin/samtools-1.16.1
git clone https://github.com/genome/bam-readcount 
<span class="nb">cd </span>bam-readcount
<span class="nb">mkdir </span>build
<span class="nb">cd </span>build
cmake ..
make
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/bam-readcount/build/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-hisat2">Install <a href="https://daehwankimlab.github.io/hisat2/">HISAT2</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">uname</span> <span class="nt">-m</span>
<span class="nb">cd</span> ~/bin
curl <span class="nt">-s</span> https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download <span class="o">&gt;</span> hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
<span class="nb">cd </span>hisat2-2.2.1
./hisat2 <span class="nt">-h</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/hisat2-2.2.1:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-stringtie">Install <a href="https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual">StringTie</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.1.6.tar.gz
<span class="nb">tar</span> <span class="nt">-xzvf</span> stringtie-2.1.6.tar.gz
<span class="nb">cd </span>stringtie-2.1.6
make release
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/stringtie-2.1.6:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-gffcompare">Install <a href="http://ccb.jhu.edu/software/stringtie/gff.shtml#gffcompare">gffcompare</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz
<span class="nb">tar</span> <span class="nt">-xzvf</span> gffcompare-0.12.6.Linux_x86_64.tar.gz
<span class="nb">cd </span>gffcompare-0.12.6.Linux_x86_64/
./gffcompare
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-htseq-count">Install <a href="https://htseq.readthedocs.io/en/master/install.html">htseq-count</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>python3-htseq
</code></pre></div></div>

<h4 id="make-sure-that-openssl-is-on-correct-version">Make sure that OpenSSL is the correct version</h4>

<p>TopHat will not install if the version of OpenSSL is too old.</p>

<p>To get version:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openssl version
</code></pre></div></div>

<p>If the version is <code class="language-plaintext highlighter-rouge">OpenSSL 1.1.1f</code>, then it needs to be updated using the following steps.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz
<span class="nb">tar</span> <span class="nt">-zxf</span> openssl-1.1.1g.tar.gz <span class="o">&amp;&amp;</span> <span class="nb">cd </span>openssl-1.1.1g
./config
make
make <span class="nb">test
sudo mv</span> /usr/bin/openssl ~/tmp <span class="c">#in case install goes wrong</span>
<span class="nb">sudo </span>make <span class="nb">install
sudo ln</span> <span class="nt">-s</span> /usr/local/bin/openssl /usr/bin/openssl
<span class="nb">sudo </span>ldconfig
</code></pre></div></div>

<p>Again, from the terminal issue the command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openssl version
</code></pre></div></div>

<p>Your output should be as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OpenSSL 1.1.1g  21 Apr 2020
</code></pre></div></div>
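
<p>The version check can also be scripted. This is a minimal sketch that follows the rule stated above literally, treating only <code class="language-plaintext highlighter-rouge">1.1.1f</code> as needing the update; <code class="language-plaintext highlighter-rouge">openssl_version_of</code> and <code class="language-plaintext highlighter-rouge">openssl_needs_update</code> are hypothetical helper names.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the version field (second word) from a line
# printed by `openssl version`, e.g. "OpenSSL 1.1.1g  21 Apr 2020".
openssl_version_of() {
  echo "$1" | awk '{print $2}'
}

# Succeed (exit status 0) when the reported version is the one this tutorial
# says must be replaced.
openssl_needs_update() {
  [ "$(openssl_version_of "$1")" = "1.1.1f" ]
}

# Example: check the OpenSSL currently on the PATH, if there is one.
if command -v openssl >/dev/null 2>&1; then
  if openssl_needs_update "$(openssl version)"; then
    echo "OpenSSL 1.1.1f found: follow the update steps"
  else
    echo "OpenSSL version is $(openssl_version_of "$(openssl version)")"
  fi
fi
```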

<p>Then create a <code class="language-plaintext highlighter-rouge">~/.wgetrc</code> file using vim or nano and add the following line to it:
<code class="language-plaintext highlighter-rouge">ca_certificate=/etc/ssl/certs/ca-certificates.crt</code></p>
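
<p>The same setting can be added without opening an editor. A minimal sketch that appends the line only when it is not already present, so it is safe to run more than once; <code class="language-plaintext highlighter-rouge">add_wgetrc_ca</code> is a hypothetical helper name.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the ca_certificate setting to a wgetrc file
# (default: ~/.wgetrc) unless that exact line already exists.
add_wgetrc_ca() {
  local rc="${1:-$HOME/.wgetrc}"
  local line="ca_certificate=/etc/ssl/certs/ca-certificates.crt"
  touch "$rc"                          # create the file if it is missing
  if ! grep -qxF "$line" "$rc"; then   # -x whole line, -F literal match
    echo "$line" >> "$rc"
  fi
}

# Usage (left commented out so that sourcing this file changes nothing):
# add_wgetrc_ca
```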

<h4 id="install-tophat">Install <a href="https://ccb.jhu.edu/software/tophat/index.shtml">TopHat</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
<span class="nb">tar</span> <span class="nt">-zxvf</span> tophat-2.1.1.Linux_x86_64.tar.gz
<span class="nb">cd </span>tophat-2.1.1.Linux_x86_64/
./gtf_to_fasta
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-kallisto">Install <a href="https://pachterlab.github.io/kallisto/">kallisto</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
<span class="nb">tar</span> <span class="nt">-zxvf</span> kallisto_linux-v0.44.0.tar.gz
<span class="nb">cd </span>kallisto_linux-v0.44.0/
./kallisto
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/kallisto_linux-v0.44.0:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-fastqc">Install <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">FastQC</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
<span class="nb">cd </span>FastQC/
<span class="nb">chmod </span>755 fastqc
./fastqc <span class="nt">--help</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/FastQC:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="intall-a-particular-version-of-numpy-that-hopefully-works-with-all-the-dependencies-that-rely-on-it">Install a particular version of numpy that hopefully works with all the dependencies that rely on it</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
pip <span class="nb">install</span> <span class="nt">--force-reinstall</span> <span class="nt">-v</span> <span class="s2">"numpy==1.24.1"</span>

</code></pre></div></div>

<h4 id="install-multiqc">Install <a href="http://multiqc.info/">MultiQC</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/.local/bin:<span class="nv">$PATH</span>
pip3 <span class="nb">install </span>multiqc
multiqc <span class="nt">--help</span>

</code></pre></div></div>

<h4 id="install-picard">Install <a href="https://broadinstitute.github.io/picard/">Picard</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar <span class="nt">-O</span> picard.jar
java <span class="nt">-jar</span> ~/bin/picard.jar
</code></pre></div></div>

<h4 id="install-flexbar">Install <a href="https://github.com/seqan/flexbar">Flexbar</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>flexbar
</code></pre></div></div>

<h4 id="install-regtools">Install <a href="https://github.com/griffithlab/regtools#regtools">Regtools</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
git clone https://github.com/griffithlab/regtools
<span class="nb">cd </span>regtools/
<span class="nb">mkdir </span>build
<span class="nb">cd </span>build/
cmake ..
make
./regtools
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/regtools/build:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-rseqc">Install <a href="http://rseqc.sourceforge.net/">RSeQC</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip3 <span class="nb">install </span>RSeQC
~/.local/bin/read_GC.py
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/.local/bin/:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-bedops">Install <a href="https://bedops.readthedocs.io/en/latest/">bedops</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
<span class="nb">mkdir </span>bedops_linux_x86_64-v2.4.40
<span class="nb">cd </span>bedops_linux_x86_64-v2.4.40
wget <span class="nt">-c</span> https://github.com/bedops/bedops/releases/download/v2.4.40/bedops_linux_x86_64-v2.4.40.tar.bz2
<span class="nb">tar</span> <span class="nt">-jxvf</span> bedops_linux_x86_64-v2.4.40.tar.bz2
./bin/bedops
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/bedops_linux_x86_64-v2.4.40/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-gtftogenepred">Install <a href="https://bioconda.github.io/recipes/ucsc-gtftogenepred/README.html">gtfToGenePred</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
<span class="nb">mkdir </span>gtfToGenePred
<span class="nb">cd </span>gtfToGenePred
wget <span class="nt">-c</span> http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
<span class="nb">chmod </span>a+x gtfToGenePred
./gtfToGenePred
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/gtfToGenePred:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-genepredtobed">Install <a href="https://bioconda.github.io/recipes/ucsc-genepredtobed/README.html">genePredToBed</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
<span class="nb">mkdir </span>genePredToBed
<span class="nb">cd </span>genePredToBed
wget <span class="nt">-c</span> http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
<span class="nb">chmod </span>a+x genePredToBed
./genePredToBed
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/genePredToBed:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-cell-ranger">Install <a href="https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation">Cell Ranger</a></h4>

<ul>
  <li>Must register to get download link</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget <span class="sb">`</span>download_link<span class="sb">`</span>
<span class="nb">tar</span> <span class="nt">-xzvf</span> cellranger-7.1.0.tar.gz
<span class="nb">cd </span>cellranger-7.1.0
./bin/cellranger
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/cellranger-7.1.0:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-tabix">Install <a href="http://www.htslib.org/download/">TABIX</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nb">install </span>tabix
</code></pre></div></div>

<h4 id="install-bwa">Install <a href="http://bio-bwa.sourceforge.net/bwa.shtml">BWA</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
git clone https://github.com/lh3/bwa.git
<span class="nb">cd </span>bwa
make
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/bwa:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-bedtools">Install <a href="https://bedtools.readthedocs.io/en/latest/">bedtools</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz
<span class="nb">tar</span> <span class="nt">-zxvf</span> bedtools-2.30.0.tar.gz
<span class="nb">cd </span>bedtools2
make
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/bedtools2/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-bcftools">Install <a href="http://www.htslib.org/download/">BCFtools</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/samtools/bcftools/releases/download/1.16/bcftools-1.16.tar.bz2
bunzip2 bcftools-1.16.tar.bz2
<span class="nb">tar</span> <span class="nt">-xvf</span> bcftools-1.16.tar
<span class="nb">cd </span>bcftools-1.16
make
./bcftools
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/bcftools-1.16:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-htslib">Install <a href="http://www.htslib.org/download/">htslib</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/samtools/htslib/releases/download/1.16/htslib-1.16.tar.bz2
bunzip2 htslib-1.16.tar.bz2
<span class="nb">tar</span> <span class="nt">-xvf</span> htslib-1.16.tar
<span class="nb">cd </span>htslib-1.16
make
./htsfile
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/htslib-1.16:<span class="nv">$PATH</span>
</code></pre></div></div>
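
<p>The version string appears in the download URL, the archive name, the build directory, and the PATH entry, so it is easy for them to drift apart. The following is a minimal sketch of one way to derive all of them from a single variable. The helper names <code class="language-plaintext highlighter-rouge">htslib_url</code> and <code class="language-plaintext highlighter-rouge">htslib_dir</code> are hypothetical and not part of htslib.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helpers: build every htslib location from one version string.
HTSLIB_VERSION=1.16

# Print the release tarball URL for the given version.
htslib_url() {
  echo "https://github.com/samtools/htslib/releases/download/$1/htslib-$1.tar.bz2"
}

# Print the directory the tarball unpacks to under ~/bin for the ubuntu user.
htslib_dir() {
  echo "/home/ubuntu/bin/htslib-$1"
}

htslib_url "$HTSLIB_VERSION"
htslib_dir "$HTSLIB_VERSION"
</code></pre></div></div>

<p>With these in place, the download and PATH steps could be written as <code class="language-plaintext highlighter-rouge">wget "$(htslib_url "$HTSLIB_VERSION")"</code> and <code class="language-plaintext highlighter-rouge">export PATH="$(htslib_dir "$HTSLIB_VERSION"):$PATH"</code>, so that changing the version in one place updates both.</p>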

<h4 id="install-peddy">Install <a href="https://github.com/brentp/peddy">peddy</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
git clone https://github.com/brentp/peddy
<span class="nb">cd </span>peddy
pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt
pip <span class="nb">install</span> <span class="nt">--editable</span> <span class="nb">.</span>
python <span class="nt">-m</span> peddy <span class="nt">-h</span>
</code></pre></div></div>

<h4 id="install-slivar">Install <a href="https://github.com/brentp/slivar">slivar</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/brentp/slivar/releases/download/v0.2.7/slivar
<span class="nb">chmod</span> +x ./slivar
./slivar
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-strling">Install <a href="https://strling.readthedocs.io/en/latest/index.html">STRling</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.1/strling
<span class="nb">chmod</span> +x ./strling
./strling <span class="nt">-h</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<h3 id="install-freebayes">Install <a href="https://github.com/freebayes/freebayes">freebayes</a></h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>freebayes
</code></pre></div></div>

<h3 id="install-vcflib">Install <a href="https://github.com/vcflib/vcflib">vcflib</a></h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>libvcflib-tools libvcflib-dev
</code></pre></div></div>

<h3 id="install-anaconda">Install <a href="https://www.anaconda.com/">Anaconda</a></h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
bash Anaconda3-2022.10-Linux-x86_64.sh
</code></pre></div></div>

<p>Press Enter to review the license agreement. Then press and hold Enter to scroll.</p>

<p>Enter “yes” to agree to the license agreement.</p>

<p>Saved the installation to <code class="language-plaintext highlighter-rouge">/home/ubuntu/bin/anaconda3</code> and chose yes to initializing Anaconda3.</p>

<h3 id="install-vep">Install <a href="https://github.com/Ensembl/ensembl-vep">VEP</a></h3>

<p>This section describes the setup of VEP 108, used in this course for variant annotation. When running the VEP installer, respond to the prompts as follows:</p>

<ol>
  <li>Do you want to install any cache files (y/n)? n [ENTER] (if you answer y instead, select the number for homo_sapiens_vep_108_GRCh38.tar.gz) [ENTER]</li>
  <li>Do you want to install any FASTA files (y/n)? n [ENTER] (if you answer y instead, select the number for homo_sapiens) [ENTER]</li>
  <li>Do you want to install any plugins (y/n)? n [ENTER]</li>
</ol>

<p>The VEP cache and FASTA files are very large (~25 GB or more). You probably do NOT want to install these as part of an image, but it should be possible to rerun the installer and add them later as needed.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> ~/workspace
<span class="nb">cd</span> ~/bin
git clone https://github.com/Ensembl/ensembl-vep.git
<span class="nb">cd </span>ensembl-vep
perl INSTALL.pl <span class="nt">--CACHEDIR</span> ~/workspace/ensembl-vep/
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/ensembl-vep:<span class="nv">$PATH</span>
</code></pre></div></div>
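
<p>If the cache and FASTA files are needed later, the installer can be rerun non-interactively. The sketch below only prints the command (a dry run) so it can be reviewed before running it from the <code class="language-plaintext highlighter-rouge">ensembl-vep</code> directory. It assumes the installer options <code class="language-plaintext highlighter-rouge">--AUTO</code>, <code class="language-plaintext highlighter-rouge">--SPECIES</code> and <code class="language-plaintext highlighter-rouge">--ASSEMBLY</code> behave as described by <code class="language-plaintext highlighter-rouge">perl INSTALL.pl --help</code> for your VEP version; the function name <code class="language-plaintext highlighter-rouge">vep_cache_cmd</code> is hypothetical.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: print the installer command that fetches the
# cache (c) and FASTA (f) files for one species and assembly.
vep_cache_cmd() {
  species=$1
  assembly=$2
  cachedir=$3
  echo "perl INSTALL.pl --AUTO cf --SPECIES $species --ASSEMBLY $assembly --CACHEDIR $cachedir"
}

# Same cache directory as used in the install step above.
vep_cache_cmd homo_sapiens GRCh38 /home/ubuntu/workspace/ensembl-vep/
</code></pre></div></div>

<p>Printing the command first makes it easy to check the species, assembly and cache directory before starting a download of this size.</p>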

<h3 id="set-up-jupyter-to-render-in-web-brower">Set up Jupyter to render in web browser</h3>

<p>These steps follow this <a href="https://dataschool.com/data-modeling-101/running-jupyter-notebook-on-an-ec2-server/">website</a>.</p>

<p>First, we need to add Jupyter to the system’s path. You can check whether it is already on the path by running <code class="language-plaintext highlighter-rouge">which jupyter</code>; if no path is returned, you need to add it. To do so, add the following line to your .bashrc file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/anaconda3/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<p>Then you need to source the .bashrc for changes to take effect.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">source</span> ~/.bashrc
</code></pre></div></div>

<p>We then need to create our Jupyter configuration file. In order to create that file, you need to run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter notebook <span class="nt">--generate-config</span>
</code></pre></div></div>

<p>After creating your configuration file, you will need to generate a password for your Jupyter Notebook using ipython:</p>

<p>Enter the IPython command line:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ipython
</code></pre></div></div>

<p>Now follow these steps to generate your password:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">IPython.lib</span> <span class="kn">import</span> <span class="n">passwd</span>

<span class="n">passwd</span><span class="p">()</span>

<span class="nb">exit</span>
</code></pre></div></div>

<p>You will be prompted to enter and re-enter your password. IPython will then output a hash. COPY THIS AND SAVE IT FOR LATER; we will need it for our configuration file.</p>

<p>Next go into your jupyter config file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/.jupyter/

vim jupyter_notebook_config.py
</code></pre></div></div>

<p>Note: You may first need to run <code class="language-plaintext highlighter-rouge">exit</code> to leave IPython; otherwise the vim command may not be recognized by the terminal.</p>

<p>And add the following code:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>conf <span class="o">=</span> get_config<span class="o">()</span>

conf.NotebookApp.ip <span class="o">=</span> <span class="s1">'0.0.0.0'</span>
conf.NotebookApp.password <span class="o">=</span> u<span class="s1">'YOUR PASSWORD HASH'</span>
conf.NotebookApp.port <span class="o">=</span> 8888
<span class="c"># Note: the code above should be placed at the beginning of the file.</span>
</code></pre></div></div>

<p>We then need a directory to store your Jupyter Notebooks. To create it, run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/workspace
<span class="nb">mkdir </span>Jupyter_Notebooks
</code></pre></div></div>

<p>You can call this folder anything; for this example we call it <code class="language-plaintext highlighter-rouge">Jupyter_Notebooks</code>.</p>

<p>After the previous step, you should be ready to run your notebook and access it on your GCP instance. To start the notebook server, run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter notebook
</code></pre></div></div>

<p>From there you should be able to access your server by going to:</p>

<p><code class="language-plaintext highlighter-rouge">http://(your GCP public IP):8888/</code></p>

<p>Note that in order for this to work you need to have allowed external access to this machine over port 8888. In the GCP firewall settings this means adding a new firewall rule:</p>

<ul>
  <li>Name: <code class="language-plaintext highlighter-rouge">default-allow-jupyter</code></li>
  <li>Direction of traffic: <code class="language-plaintext highlighter-rouge">Ingress</code></li>
  <li>Action on match: <code class="language-plaintext highlighter-rouge">Allow</code></li>
  <li>Targets: <code class="language-plaintext highlighter-rouge">All instances in the network</code></li>
  <li>Source IP ranges: <code class="language-plaintext highlighter-rouge">0.0.0.0/0</code></li>
  <li>Specified protocols and ports: <code class="language-plaintext highlighter-rouge">tcp:8888</code></li>
</ul>
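
<p>The same rule can probably be created from the command line instead of the console. The sketch below only prints the <code class="language-plaintext highlighter-rouge">gcloud compute firewall-rules create</code> command (a dry run) built from the values in the list above. The function name <code class="language-plaintext highlighter-rouge">firewall_rule_cmd</code> is hypothetical, and the sketch assumes the rule should apply to the default network.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: print the gcloud command for an ingress rule
# that allows TCP traffic on one port from any source address.
firewall_rule_cmd() {
  name=$1
  port=$2
  echo "gcloud compute firewall-rules create $name --direction=INGRESS --action=ALLOW --rules=tcp:$port --source-ranges=0.0.0.0/0"
}

firewall_rule_cmd default-allow-jupyter 8888
</code></pre></div></div>

<p>A source range of <code class="language-plaintext highlighter-rouge">0.0.0.0/0</code> exposes the notebook server to any address, so a narrower range may be preferable outside of a short-lived course instance.</p>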

<h4 id="install-r">Install <a href="http://www.r-project.org/">R</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nt">-y</span> remove r-base-core
<span class="nb">sudo </span>apt-get <span class="nt">-y</span> remove r-base
<span class="nb">sudo </span>apt <span class="nb">install </span>dirmngr gnupg apt-transport-https ca-certificates software-properties-common
<span class="nb">sudo </span>apt-key adv <span class="nt">--keyserver</span> keyserver.ubuntu.com <span class="nt">--recv-keys</span> E298A3A825C0D65DFD57CBB651716619E084DAB9
<span class="nb">sudo </span>add-apt-repository <span class="s1">'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>r-base
R <span class="nt">--version</span>

<span class="c">#make R library location accessible</span>
<span class="nb">sudo chown</span> <span class="nt">-R</span> ubuntu:ubuntu /usr/local/lib/R/
<span class="nb">chmod</span> <span class="nt">-R</span> 775 /usr/local/lib/R

</code></pre></div></div>

<h4 id="r-libraries">R Libraries</h4>

<p>For this tutorial we require:</p>

<ul>
  <li><a href="https://cran.r-project.org/web/packages/devtools/index.html">devtools</a></li>
  <li><a href="https://cran.r-project.org/web/packages/dplyr/index.html">dplyr</a></li>
  <li><a href="http://cran.r-project.org/web/packages/gplots/index.html">gplots</a></li>
  <li><a href="https://ggplot2.tidyverse.org/">ggplot2</a></li>
  <li><a href="https://cran.r-project.org/web/packages/Seurat/index.html">Seurat</a></li>
  <li><a href="https://cran.r-project.org/web/packages/sctransform/index.html">sctransform</a></li>
  <li><a href="https://cran.r-project.org/web/packages/RColorBrewer/index.html">RColorBrewer</a></li>
  <li><a href="https://cran.r-project.org/package=ggthemes">ggthemes</a></li>
  <li><a href="https://cran.r-project.org/package=cowplot">cowplot</a></li>
  <li><a href="https://cran.r-project.org/web/packages/data.table/">data.table</a></li>
  <li><a href="https://cran.r-project.org/package=Rtsne">Rtsne</a></li>
  <li><a href="https://cran.r-project.org/web/packages/gridExtra/index.html">gridExtra</a></li>
  <li><a href="https://cran.r-project.org/web/packages/UpSetR/index.html">UpSetR</a></li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R
install.packages<span class="o">(</span>c<span class="o">(</span><span class="s2">"devtools"</span>,<span class="s2">"dplyr"</span>,<span class="s2">"gplots"</span>,<span class="s2">"ggplot2"</span>,<span class="s2">"Seurat"</span>,<span class="s2">"sctransform"</span>,<span class="s2">"RColorBrewer"</span>,<span class="s2">"ggthemes"</span>,<span class="s2">"cowplot"</span>,<span class="s2">"data.table"</span>,<span class="s2">"Rtsne"</span>,<span class="s2">"gridExtra"</span>,<span class="s2">"UpSetR"</span><span class="o">)</span>,repos<span class="o">=</span><span class="s2">"http://cran.us.r-project.org"</span><span class="o">)</span>
quit<span class="o">(</span><span class="nv">save</span><span class="o">=</span><span class="s2">"no"</span><span class="o">)</span>
</code></pre></div></div>

<h4 id="bioconductor-libraries"><a href="http://www.bioconductor.org/">Bioconductor</a> libraries</h4>

<p>For this tutorial we require:</p>

<ul>
  <li><a href="http://bioconductor.org/packages/release/bioc/html/genefilter.html">genefilter</a></li>
  <li><a href="http://bioconductor.org/packages/release/bioc/html/ballgown.html">ballgown</a></li>
  <li><a href="http://www.bioconductor.org/packages/release/bioc/html/edgeR.html">edgeR</a></li>
  <li><a href="http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html">GenomicRanges</a></li>
  <li><a href="https://www.bioconductor.org/packages/release/bioc/html/rhdf5.html">rhdf5</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/biomaRt.html">biomaRt</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/scran.html">scran</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/sva.html">sva</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/gage.html">gage</a></li>
  <li><a href="https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html">org.Hs.eg.db</a></li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R
<span class="c"># Install Bioconductor</span>
<span class="k">if</span> <span class="o">(!</span>requireNamespace<span class="o">(</span><span class="s2">"BiocManager"</span>, quietly <span class="o">=</span> TRUE<span class="o">))</span>
    install.packages<span class="o">(</span><span class="s2">"BiocManager"</span><span class="o">)</span>
BiocManager::install<span class="o">(</span>c<span class="o">(</span><span class="s2">"genefilter"</span>,<span class="s2">"ballgown"</span>,<span class="s2">"edgeR"</span>,<span class="s2">"GenomicRanges"</span>,<span class="s2">"rhdf5"</span>,<span class="s2">"biomaRt"</span>,<span class="s2">"scran"</span>,<span class="s2">"sva"</span>,<span class="s2">"gage"</span>,<span class="s2">"org.Hs.eg.db"</span><span class="o">))</span>
quit<span class="o">(</span><span class="nv">save</span><span class="o">=</span><span class="s2">"no"</span><span class="o">)</span>
</code></pre></div></div>

<h4 id="install-sleuth">Install <a href="https://pachterlab.github.io/sleuth/download">Sleuth</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R
install.packages<span class="o">(</span><span class="s2">"devtools"</span><span class="o">)</span>
devtools::install_github<span class="o">(</span><span class="s2">"pachterlab/sleuth"</span><span class="o">)</span>
quit<span class="o">(</span><span class="nv">save</span><span class="o">=</span><span class="s2">"no"</span><span class="o">)</span>
</code></pre></div></div>

<h3 id="path-setup">Path setup</h3>

<p>Add the following lines to the .bashrc using vim to ensure that all installed tools are in the ubuntu user’s path:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PATH=/home/ubuntu/bin/samtools-1.16.1:$PATH
PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH
PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH
PATH=/home/ubuntu/bin/stringtie-2.1.6:$PATH
PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH
PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH
PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH
PATH=/home/ubuntu/bin/FastQC:$PATH
PATH=/home/ubuntu/.local/bin:$PATH
PATH=/home/ubuntu/bin/regtools/build:$PATH
PATH=/home/ubuntu/bin/bedops_linux_x86_64-v2.4.40/bin:$PATH
PATH=/home/ubuntu/bin/gtfToGenePred:$PATH
PATH=/home/ubuntu/bin/genePredToBed:$PATH
PATH=/home/ubuntu/bin/cellranger-7.1.0:$PATH
PATH=/home/ubuntu/bin/bwa:$PATH
PATH=/home/ubuntu/bin/bedtools2/bin:$PATH
PATH=/home/ubuntu/bin/bcftools-1.16:$PATH
PATH=/home/ubuntu/bin/htslib-1.16:$PATH
PATH=/home/ubuntu/bin/ensembl-vep:$PATH

</code></pre></div></div>
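
<p>Because .bashrc is sourced by every new shell, and may be sourced again by hand, plain assignments like these can add the same directory to PATH repeatedly. A minimal sketch of an idempotent alternative follows; <code class="language-plaintext highlighter-rouge">path_prepend</code> is a hypothetical helper name, not a shell built-in.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: put a directory at the front of PATH unless it is already listed.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;            # already present, leave PATH unchanged
    *) PATH="$1:$PATH" ;;   # otherwise put the directory first
  esac
}

path_prepend /home/ubuntu/bin/bwa
path_prepend /home/ubuntu/bin/bwa   # second call is a no-op
export PATH
</code></pre></div></div>

<p>Wrapping PATH in colons before matching means a directory is only treated as present when it appears as a whole entry, not as a prefix of a longer one.</p>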

<p>For the 2021 version of the course, rather than exporting each tool’s individual path, I moved all of the subdirectories to ~/src and copied all of the binaries from there to ~/bin so that PATH is less complex.</p>
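
<p>Whichever layout is used, it can help to confirm that the tools actually resolve on PATH. The sketch below prints the name of every tool that cannot be found; <code class="language-plaintext highlighter-rouge">missing_tools</code> is a hypothetical helper, and the tool list is only a sample of the programs installed above.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    # command -v prints the resolved location, or nothing if the tool is absent
    if [ -z "$(command -v "$tool")" ]; then
      echo "$tool"
    fi
  done
}

missing_tools samtools hisat2 stringtie bwa bedtools bcftools
</code></pre></div></div>

<p>No output means every listed tool was found; any names printed point to a PATH entry that is missing or misspelled.</p>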

<h3 id="set-up-apache-web-server">Set up Apache web server</h3>

<p>We will start an apache2 service and serve the contents of the students’ home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by URL, etc. Note that when launching instances, a security group (firewall rule) that allows HTTP access via port 80 will have to be selected or modified.</p>

<ul>
  <li>Edit config to allow files to be served from outside /usr/share and /var/www</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>vim /etc/apache2/apache2.conf
</code></pre></div></div>

<ul>
  <li>Add the following content to apache2.conf</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;Directory /home/ubuntu/workspace/&gt;
       Options Indexes FollowSymLinks
       AllowOverride None
       Require all granted
&lt;/Directory&gt;
</code></pre></div></div>

<ul>
  <li>Edit vhost file</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>vim /etc/apache2/sites-available/000-default.conf
</code></pre></div></div>

<ul>
  <li>Change document root in 000-default.conf</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DocumentRoot /home/ubuntu/workspace
</code></pre></div></div>

<ul>
  <li>Restart apache</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>service apache2 restart
</code></pre></div></div>

<p>Test by going to your instance’s public IP address in your browser.</p>

<h3 id="create-a-public-google-cloud-image-using-the-gcp-console">Create a public Google cloud image using the GCP console</h3>

<ol>
  <li>Under Compute Engine -&gt; Virtual Machines -&gt; VM Instances. Stop the instance.</li>
  <li>Under Compute Engine -&gt; Storage -&gt; Images. Create Image.</li>
  <li>Provide a name for the image (e.g. <code class="language-plaintext highlighter-rouge">rnabio-course-2023-v1</code>).</li>
  <li>Select <code class="language-plaintext highlighter-rouge">Source</code> -&gt; <code class="language-plaintext highlighter-rouge">Disk</code></li>
  <li>Under <code class="language-plaintext highlighter-rouge">Source disk</code> -&gt; Choose the name of the stopped instance (e.g. rnabio-course-2023)</li>
  <li>Select <code class="language-plaintext highlighter-rouge">Location</code> -&gt; <code class="language-plaintext highlighter-rouge">Multi-regional</code></li>
  <li><code class="language-plaintext highlighter-rouge">Select location</code> -&gt; <code class="language-plaintext highlighter-rouge">us (multiple regions in the United States)</code></li>
  <li>Leave <code class="language-plaintext highlighter-rouge">Family</code> blank, but add a description.</li>
  <li><code class="language-plaintext highlighter-rouge">Encryption</code> -&gt; <code class="language-plaintext highlighter-rouge">Google-managed encryption key</code>.</li>
</ol>

<p>To make the image fully public execute the following Google SDK command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
gcloud compute images add-iam-policy-binding rnabio-course-2023-v2 <span class="nt">--member</span><span class="o">=</span><span class="s1">'allAuthenticatedUsers'</span> <span class="nt">--role</span><span class="o">=</span><span class="s1">'roles/compute.imageUser'</span>

</code></pre></div></div>

<p>To list the image from the command line:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcloud compute images list <span class="nt">--filter</span><span class="o">=</span><span class="s2">"name=rnabio-course-2023-v2"</span>
</code></pre></div></div>

<h3 id="current-public-google-images">Current Public Google Images</h3>

<ul>
  <li><code class="language-plaintext highlighter-rouge">rnabio-course-2023-v2</code></li>
</ul>

<h3 id="launch-student-instance-using-this-image">Launch student instance using this image</h3>
<p>To start a new VM with the public image above, one can use the GCP console as was done above to create a new VM with vanilla Ubuntu, except this time selecting the pre-configured image with all tools already installed.</p>

<p>We have been unable to get this to work using the Console. It seems listing custom public images is not working there… ?</p>

<p>From the command line you can launch an instance as follows (you should probably personalize the <code class="language-plaintext highlighter-rouge">malachi-course-2023</code> name used in two places of this command):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcloud compute instances create malachi-course-2023 <span class="nt">--zone</span><span class="o">=</span>us-central1-a <span class="nt">--machine-type</span><span class="o">=</span>e2-standard-4 <span class="nt">--network-interface</span><span class="o">=</span>network-tier<span class="o">=</span>PREMIUM,subnet<span class="o">=</span>default <span class="nt">--tags</span><span class="o">=</span>http-server,https-server <span class="nt">--create-disk</span><span class="o">=</span>auto-delete<span class="o">=</span><span class="nb">yes</span>,boot<span class="o">=</span><span class="nb">yes</span>,image<span class="o">=</span>projects/griffith-lab/global/images/rnabio-course-2023-v2,mode<span class="o">=</span>rw,size<span class="o">=</span>250,type<span class="o">=</span>pd-balanced,device-name<span class="o">=</span>malachi-course-2023 

</code></pre></div></div>
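
<p>Since the instance name appears twice in that command, one option is to hold it in a single variable. The sketch below only prints the resulting command (a dry run) so it can be checked before use. <code class="language-plaintext highlighter-rouge">INSTANCE_NAME</code>, its value and <code class="language-plaintext highlighter-rouge">launch_cmd</code> are hypothetical; every other option is copied from the command above.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical variable: choose your own instance name here.
INSTANCE_NAME=my-course-2023

# Hypothetical helper: print the launch command with the name filled in
# for both the instance and the boot disk device.
launch_cmd() {
  echo "gcloud compute instances create $1 --zone=us-central1-a --machine-type=e2-standard-4 --network-interface=network-tier=PREMIUM,subnet=default --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,image=projects/griffith-lab/global/images/rnabio-course-2023-v2,mode=rw,size=250,type=pd-balanced,device-name=$1"
}

launch_cmd "$INSTANCE_NAME"
</code></pre></div></div>

<p>The same variable can then be reused when logging in, for example <code class="language-plaintext highlighter-rouge">gcloud compute ssh "ubuntu@$INSTANCE_NAME"</code>.</p>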

<h2 id="initial-setup-upon-first-login">Initial setup upon first login</h2>

<p>You will want to do everything on this VM as the “ubuntu” user. First set the password for that user and then change to it.</p>

<p>Log in as follows:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcloud compute ssh ubuntu@malachi-course-2023

</code></pre></div></div>

<p>Test environment</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bwa mem
<span class="nb">env</span>
</code></pre></div></div>]]></content><author><name>Zachary Skidmore</name></author><category term="Module-09-Appendix" /><summary type="html"><![CDATA[UNDER DEVELOPMENT Google Cloud Platform setup for use in workshop This tutorial explains how a Google Cloud Instance can be configured from scratch for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers that wish to conduct their hands on exercises on Google GCP. Create a Google Cloud account You will need a Google account (personal or institutional) Use the above email account to log into the Google Cloud Console: https://console.cloud.google.com/. Note: Any GCP account needs to be linked to an actual person/credit card account or institutional billing account. Create a Google Cloud Project connected to a billing source Optional - Set up an IAM account. Details to be resolved… Request limit increases. You need to be able to spin up at least one instance for every student and TA/instructor. To find current limits and request increases in the console, go to: IAM &amp; Admin -&gt; Quotas. In the GCP console: Go to Compute Engine -&gt; VM instances. Start with existing base image Create Instance Give the Instance a Name (e.g. rnabio-course-2023) Select Machine Type (e.g. E2 Series: e2-standard-2) Change the Boot disk to Ubuntu -&gt; Ubuntu 20.04 LTS (x86/64). Change size to 250 GB. 
Under Firewall, select these options: Allow HTTP traffic and Allow HTTPS traffic Hit the Create button Install the Google Cloud SDK (gcloud), authenticate your user, and login to your VM Install the Google Cloud commandline interface following instructions here: https://cloud.google.com/sdk/docs/install Use the following command and follow instructions to authenticate your user: gcloud auth login Set the project to the google billing project created above as follows: gcloud config set project $project_name Check the authentication configuration as follows: gcloud config list Log into the instance using the instance name chosen above follows gcloud compute ssh rnabio-course-2023 Set up the ubuntu user: Logging into a Google VM of Ubuntu is a bitter different from AWS. By default you will login with your Google user name (or is it your username from the host machine you login from?) instead of using the “ubuntu” user. Set password for the ubuntu user (and make note of this somewhere safe). Then change users to the “ubuntu” user before proceeding with the rest of this setup. Note that later if you login as another sudo user, if you need to, you should be able to reset the password associated with the ubuntu user. whoami sudo passwd ubuntu su ubuntu cd ~ Perform basic linux configuration To allow installation of bioinformatics tools some basic dependencies must be installed first. 
sudo apt-get update sudo apt-get upgrade sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev libbz2-dev docker.io libpcre2-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libdbi-perl libdbd-mysql-perl libcurl4-openssl-dev sudo ln -s /usr/include/jsoncpp/json/ /usr/include/json sudo timedatectl set-timezone America/Chicago logout and log back in for changes to take effect. exit exit gcloud compute ssh rnabio-course-2023 su ubuntu cd ~ Add ubuntu user to docker group sudo usermod -aG docker ubuntu Then exit shell and log back into instance. Install any desired informatics tools NOTE: R in particular is a slow install. NOTE: - All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises. Paths to pre-installed tools can be added to the .bashrc file. It may also be convenient to add export RNA_HOME=~/workspace/rnaseq to the .bashrc file. See https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc. NOTE: In some installations of R there is an executable called pager that clashes with the system pager. This causes man to fail. Check with a man ls and if the problem exists, add the following to .bashrc: export MANPAGER=less Install RNA-seq software These install instructions should be identical to those found on https://github.com/griffithlab/rnaseq_tutorial/wiki/Installation except that each tool is installed in /home/ubuntu/bin/ and its install location is exported to the $PATH variable for easy access. 
Create directory to install software to and setup path variables mkdir ~/bin cd bin WORKSPACE=/home/ubuntu/workspace HOME=/home/ubuntu Install SAMtools cd ~/bin wget https://github.com/samtools/samtools/releases/download/1.16.1/samtools-1.16.1.tar.bz2 bunzip2 samtools-1.16.1.tar.bz2 tar -xvf samtools-1.16.1.tar cd samtools-1.16.1 make ./samtools export PATH=/home/ubuntu/bin/samtools-1.16.1:$PATH Install bam-readcount cd ~/bin export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.16.1 git clone https://github.com/genome/bam-readcount cd bam-readcount mkdir build cd build cmake .. make export PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH Install HISAT2 uname -m cd ~/bin curl -s https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download &gt; hisat2-2.2.1-Linux_x86_64.zip unzip hisat2-2.2.1-Linux_x86_64.zip cd hisat2-2.2.1 ./hisat2 -h export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH Install StringTie cd ~/bin wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.1.6.tar.gz tar -xzvf stringtie-2.1.6.tar.gz cd stringtie-2.1.6 make release export PATH=/home/ubuntu/bin/stringtie-2.1.6:$PATH Install gffcompare cd ~/bin wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz tar -xzvf gffcompare-0.12.6.Linux_x86_64.tar.gz cd gffcompare-0.12.6.Linux_x86_64/ ./gffcompare export PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH Install htseq-count sudo apt install python3-htseq Make sure that OpenSSL is on correct version TopHat will not install if the version of OpenSSL is too old. To get version: openssl version If version is OpenSSL 1.1.1f, then it needs to be updated using the following steps. 
cd ~/bin wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz tar -zxf openssl-1.1.1g.tar.gz &amp;&amp; cd openssl-1.1.1g ./config make make test sudo mv /usr/bin/openssl ~/tmp #in case install goes wrong sudo make install sudo ln -s /usr/local/bin/openssl /usr/bin/openssl sudo ldconfig Again, from the terminal issue the command: openssl version Your output should be as follows: OpenSSL 1.1.1g 21 Apr 2020 Then create ~/.wgetrc file and add to it ca_certificate=/etc/ssl/certs/ca-certificates.crt using vim or nano. Install TopHat cd ~/bin wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz cd tophat-2.1.1.Linux_x86_64/ ./gtf_to_fasta export PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH Install kallisto cd ~/bin wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz tar -zxvf kallisto_linux-v0.44.0.tar.gz cd kallisto_linux-v0.44.0/ ./kallisto export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH Install FastQC cd ~/bin wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip unzip fastqc_v0.11.9.zip cd FastQC/ chmod 755 fastqc ./fastqc --help export PATH=/home/ubuntu/bin/FastQC:$PATH Install a particular version of numpy that hopefully works with all the dependencies that rely on it cd ~/bin pip install --force-reinstall -v "numpy==1.24.1" Install MultiQC cd ~/bin export PATH=/home/ubuntu/.local/bin:$PATH pip3 install multiqc multiqc --help Install Picard cd ~/bin wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar -O picard.jar java -jar ~/bin/picard.jar Install Flexbar sudo apt install flexbar Install Regtools cd ~/bin git clone https://github.com/griffithlab/regtools cd regtools/ mkdir build cd build/ cmake .. 
make ./regtools export PATH=/home/ubuntu/bin/regtools/build:$PATH Install RSeQC pip3 install RSeQC ~/.local/bin/read_GC.py export PATH=/home/ubuntu/.local/bin/:$PATH Install bedops cd ~/bin mkdir bedops_linux_x86_64-v2.4.40 cd bedops_linux_x86_64-v2.4.40 wget -c https://github.com/bedops/bedops/releases/download/v2.4.40/bedops_linux_x86_64-v2.4.40.tar.bz2 tar -jxvf bedops_linux_x86_64-v2.4.40.tar.bz2 ./bin/bedops export PATH=/home/ubuntu/bin/bedops_linux_x86_64-v2.4.40/bin:$PATH Install gtfToGenePred cd ~/bin mkdir gtfToGenePred cd gtfToGenePred wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred chmod a+x gtfToGenePred ./gtfToGenePred export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH Install genePredToBed cd ~/bin mkdir genePredToBed cd genePredToBed wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed chmod a+x genePredToBed ./genePredToBed export PATH=/home/ubuntu/bin/genePredToBed:$PATH Install Cell Ranger Must register to get download link cd ~/bin wget `download_link` tar -xzvf cellranger-7.1.0.tar.gz cd cellranger-7.1.0 ./bin/cellranger export PATH=/home/ubuntu/bin/cellranger-7.1.0:$PATH Install TABIX sudo apt-get install tabix Install BWA cd ~/bin git clone https://github.com/lh3/bwa.git cd bwa make export PATH=/home/ubuntu/bin/bwa:$PATH Install bedtools cd ~/bin wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz tar -zxvf bedtools-2.30.0.tar.gz cd bedtools2 make export PATH=/home/ubuntu/bin/bedtools2/bin:$PATH Install BCFtools cd ~/bin wget https://github.com/samtools/bcftools/releases/download/1.16/bcftools-1.16.tar.bz2 bunzip2 bcftools-1.16.tar.bz2 tar -xvf bcftools-1.16.tar cd bcftools-1.16 make ./bcftools export PATH=/home/ubuntu/bin/bcftools-1.16:$PATH Install htslib cd ~/bin wget https://github.com/samtools/htslib/releases/download/1.16/htslib-1.16.tar.bz2 bunzip2 htslib-1.16.tar.bz2 tar -xvf htslib-1.16.tar cd htslib-1.16 make ./htsfile export 
PATH=/home/ubuntu/bin/htslib-1.16:$PATH Install peddy cd ~/bin git clone https://github.com/brentp/peddy cd peddy pip install -r requirements.txt pip install --editable . python -m peddy -h Install slivar cd ~/bin wget https://github.com/brentp/slivar/releases/download/v0.2.7/slivar chmod +x ./slivar ./slivar export PATH=/home/ubuntu/bin:$PATH Install STRling cd ~/bin wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.1/strling chmod +x ./strling ./strling -h export PATH=/home/ubuntu/bin:$PATH Install freebayes sudo apt install freebayes Install vcflib sudo apt install libvcflib-tools libvcflib-dev Install Anaconda cd ~/bin wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh bash Anaconda3-2022.10-Linux-x86_64.sh Press Enter to review the license agreement. Then press and hold Enter to scroll. Enter “yes” to agree to the license agreement. Saved the installation to /home/ubuntu/bin/anaconda3 and chose yes to initializing Anaconda3. Install [VEP] Describes dependencies for VEP 108, used in this course for variant annotation. When running the VEP installer follow the prompts specified: Do you want to install any cache files (y/n)? n [ENTER] (select number for homo_sapiens_vep_108_GRCh38.tar.gz) [ENTER] Do you want to install any FASTA files (y/n)? n [ENTER] (select number for homo_sapiens) [ENTER] Do you want to install any plugins (y/n)? n [ENTER] The VEP cache and FASTA files are very large (~25G or more). You probably do NOT want to install these as part of an image, but it should be possible to rerun this tool and install them later as needed. 
mkdir ~/workspace cd ~/bin git clone https://github.com/Ensembl/ensembl-vep.git cd ensembl-vep perl INSTALL.pl --CACHEDIR ~/workspace/ensembl-vep/ export PATH=/home/ubuntu/bin/ensembl-vep:$PATH Set up Jupyter to render in web brower Followed this website First, we need to add Jupyter to the system’s path (you can check if it is already on the path by running: which python, if no path is returned you need to add the path) To add Jupyter functionality to your terminal, add the following line of code to your .bashrc file: export PATH=/home/ubuntu/anaconda3/bin:$PATH Then you need to source the .bashrc for changes to take effect. source .bashrc We then need to create our Jupyter configuration file. In order to create that file, you need to run: jupyter notebook --generate-config After creating your configuration file, you will need to generate a password for your Jupyter Notebook using ipython: Enter the IPython command line: ipython Now follow these steps to generate your password: from IPython.lib import passwd passwd() exit You will be prompted to enter and re-enter your password. IPython will then generate a hash output, COPY THIS AND SAVE IT FOR LATER. We will need this for our configuration file. Next go into your jupyter config file: cd ~/.jupyter/ vim jupyter_notebook_config.py Note: You may need first to run exit in order to exit IPython otherwise the vim command may not be recognized by the terminal. And add the following code: conf = get_config() conf.NotebookApp.ip = '0.0.0.0' conf.NotebookApp.password = u'YOUR PASSWORD HASH' conf.NotebookApp.port = 8888 # Note: this code below should be put at the beginning of the document. We then need to create a directory for your notebooks. In order to make a folder to store all of your Jupyter Notebooks simply run: cd ~/workspace mkdir Jupyter_Notebooks You can call this folder anything, for this example we call it Notebooks After the previous step, you should be ready to run your notebook and access your EC2 server. 
To run your Notebook simply run the command: jupyter notebook From there you should be able to access your server by going to: https://(your GCP public IP):8888/ Note that in order for this to work you need to have allowed external access to this machine over port 8888. In the GCP firewall settings this means adding a new firewall rule: Name: default-allow-jupyter Direction of traffic: Ingress Action on match: Allow Targets: All instances in the network Source IP ranges: 0.0.0.0/0 Specified protocols and ports: tcp:8888 Install R sudo apt-get -y remove r-base-core sudo apt-get -y remove r-base sudo apt install dirmngr gnupg apt-transport-https ca-certificates software-properties-common sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' sudo apt install r-base R --version #make R library location accessible sudo chown -R ubuntu:ubuntu /usr/local/lib/R/ chmod -R 775 /usr/local/lib/R R Libraries For this tutorial we require: devtools dplyr gplots ggplot2 Seurat sctransform RColorBrewer ggthemes cowplot data.table Rtsne gridExtra UpSetR R install.packages(c("devtools","dplyr","gplots","ggplot2","Seurat","sctransform","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR"),repos="http://cran.us.r-project.org") quit(save="no") Bioconductor libraries For this tutorial we require: genefilter ballgown edgeR GenomicRanges rhdf5 biomaRt scran sva gage org.Hs.eg.db R # Install Bioconductor if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva","gage","org.Hs.eg.db")) quit(save="no") Install Sleuth R install.packages("devtools") devtools::install_github("pachterlab/sleuth") quit(save="no") Path setup Add the following lines to the .bashrc using vim to ensure that all tools installed are 
in the ubuntu user path: PATH=/home/ubuntu/bin/samtools-1.16.1:$PATH PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH PATH=/home/ubuntu/bin/stringtie-2.1.6:$PATH PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH PATH=/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH PATH=/home/ubuntu/bin/FastQC:$PATH PATH=/home/ubuntu/.local/bin:$PATH PATH=/home/ubuntu/bin/regtools/build:$PATH PATH=/home/ubuntu/bin/bedops_linux_x86_64-v2.4.40/bin:$PATH PATH=/home/ubuntu/bin/gtfToGenePred:$PATH PATH=/home/ubuntu/bin/genePredToBed:$PATH PATH=/home/ubuntu/bin/cellranger-7.1.0:$PATH PATH=/home/ubuntu/bin/bwa:$PATH PATH=/home/ubuntu/bin/bedtools2/bin:$PATH PATH=/home/ubuntu/bin/bcftools-1.16:$PATH PATH=/home/ubuntu/bin/htslib-1.16:$PATH PATH=/home/ubuntu/bin/ensembl-vep:$PATH For the 2021 version of the course, rather than exporting each tool’s individual path, I moved all of the subdirs to ~/src and copied all of the binaries from there to ~/bin so that PATH is less complex. Set up Apache web server We will start an apache2 service and serve the contents of the students’ home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by url, etc. Note that when launching instances a security group will have to be selected/modified that allows http access via port 80. Edit config to allow files to be served from outside /usr/share and /var/www sudo vim /etc/apache2/apache2.conf Add the following content to apache2.conf &lt;Directory /home/ubuntu/workspace/&gt; Options Indexes FollowSymLinks AllowOverride None Require all granted &lt;/Directory&gt; Edit vhost file sudo vim /etc/apache2/sites-available/000-default.conf Change document root in 000-default.conf DocumentRoot /home/ubuntu/workspace Restart apache sudo service apache2 restart Test by going to your instance’s public IP address in your browser. 
Create a public Google cloud image using the GCP console Under Compute Engine -&gt; Virtual Machines -&gt; VM Instances. Stop the instance. Under Compute Engine -&gt; Storage -&gt; Images. Create Image. Provide a name for the image (e.g. rnabio-course-2023-v1). Select Source -&gt; Disk Under `Source disk’ -&gt; Choose the name of the stopped instance (e.g. rnabio-course-2023) Select Location -&gt; Multi-regional Select location -&gt; us (multiple regions in the United States) Leave Family blank, but add a description. Encryption -&gt; Google-managed encryption key. To make the image fully public execute the following Google SDK command: gcloud compute images add-iam-policy-binding rnabio-course-2023-v2 --member='allAuthenticatedUsers' --role='roles/compute.imageUser' To list the image from the command line: gcloud compute images list --filter="name=rnabio-course-2023-v2" Current Public Google Images rnabio-course-2023-v2 Launch student instance using this image To start a new VM with the public image above one can use the GCP console as was done above to create a new VM with vanilla ubuntu, except this time selecting the pre-configured image with all tool installed already. We have been unable to get this to work using the Console. It seems listing custom public images is not working there… ? From the command line you can launch an instance as follows (you should probably personalize the malachi-course-2023 name used in two places of this command): gcloud compute instances create malachi-course-2023 --zone=us-central1-a --machine-type=e2-standard-4 --network-interface=network-tier=PREMIUM,subnet=default --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,image=projects/griffith-lab/global/images/rnabio-course-2023-v2,mode=rw,size=250,type=pd-balanced,device-name=malachi-course-2023 Initial setup upon first login You will want to do everything on this VM as the “ubuntu” user. First set the password for that user and then change to it. 
Login as follows gcloud compute ssh ubuntu@malachi-course-2023 Test environment bwa mem env]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://www.rnabio.org//assets/logos/DNA.jpg" /><media:content medium="image" url="http://www.rnabio.org//assets/logos/DNA.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">AWS Setup Before Course Start</title><link href="http://www.rnabio.org//module-09-appendix/0009/09/01/AWS_Setup/" rel="alternate" type="text/html" title="AWS Setup Before Course Start" /><published>0009-09-01T00:00:00+00:00</published><updated>0009-09-01T00:00:00+00:00</updated><id>http://www.rnabio.org//module-09-appendix/0009/09/01/AWS_Setup</id><content type="html" xml:base="http://www.rnabio.org//module-09-appendix/0009/09/01/AWS_Setup/"><![CDATA[<h2 id="preamble---amazon-awsami-setup-for-use-in-workshop">Preamble - Amazon AWS/AMI setup for use in workshop</h2>

<p>This tutorial explains how Amazon cloud instances were configured for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers who wish to conduct their hands-on exercises on Amazon AWS.</p>

<p>A helpful introduction to AWS can be found <a href="https://rnabio.org/module-00-setup/0000/06/01/Intro_to_AWS/">here</a>.</p>

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
  <li><a href="#create-aws-account">Create AWS account</a>
    <ol>
      <li><a href="#set-up-security-group-if-needed">Set up security group (if needed)</a></li>
    </ol>
  </li>
  <li><a href="#start-with-existing-community-ami">Start with existing community AMI</a></li>
  <li><a href="#perform-basic-linux-configuration">Perform basic linux configuration</a></li>
  <li><a href="#add-ubuntu-user-to-docker-group">Add ubuntu user to docker group</a></li>
  <li><a href="#set-up-additional-storage-for-workspace">Set up additional storage for workspace</a></li>
  <li><a href="#install-any-desired-informatics-tools">Install any desired informatics tools</a>
    <ol>
      <li><a href="#install-rna-seq-software">Install RNA-seq software</a>
        <ol>
          <li><a href="#create-directory-to-install-software-to-and-setup-path-variables">Create directory to install software to and setup path variables</a></li>
          <li><a href="#install-samtools">Install SAMtools</a></li>
          <li><a href="#install-bam-readcount">Install bam-readcount</a></li>
          <li><a href="#install-hisat2">Install HISAT2</a></li>
          <li><a href="#install-stringtie">Install StringTie</a></li>
          <li><a href="#install-gffcompare">Install gffcompare</a></li>
          <li><a href="#install-htseq">Install HTSeq</a></li>
          <li><a href="#make-sure-that-openssl-is-on-correct-version">Make sure that OpenSSL is on correct version</a></li>
          <li><a href="#install-tophat">Install TopHat</a></li>
          <li><a href="#install-kallisto">Install kallisto</a></li>
          <li><a href="#install-fastqc">Install FastQC</a></li>
          <li><a href="#install-fastp">Install Fastp</a></li>
          <li><a href="#install-multiqc">Install MultiQC</a></li>
          <li><a href="#install-picard">Install Picard</a></li>
          <li><a href="#install-flexbar">Install Flexbar</a></li>
          <li><a href="#install-regtools">Install Regtools</a></li>
          <li><a href="#install-rseqc">Install RSeQC</a></li>
          <li><a href="#install-bedops">Install bedops</a></li>
          <li><a href="#install-gtftogenepred">Install gtfToGenePred</a></li>
          <li><a href="#install-genepredtobed">Install genePredToBed</a></li>
          <li><a href="#install-how_are_we_stranded_here">Install how_are_we_stranded_here</a></li>
          <li><a href="#install-cell-ranger">Install Cell Ranger</a></li>
          <li><a href="#install-tabix">Install TABIX</a></li>
          <li><a href="#install-bwa">Install BWA</a></li>
          <li><a href="#install-bedtools">Install bedtools</a></li>
          <li><a href="#install-bcftools">Install BCFtools</a></li>
          <li><a href="#install-htslib">Install htslib</a></li>
          <li><a href="#install-peddy">Install peddy</a></li>
          <li><a href="#install-slivar">Install slivar</a></li>
          <li><a href="#install-strling">Install STRling</a></li>
        </ol>
      </li>
      <li><a href="#install-freebayes">Install freebayes</a></li>
      <li><a href="#install-vcflib">Install vcflib</a></li>
      <li><a href="#install-anaconda">Install Anaconda</a></li>
      <li><a href="#install-vep">Install VEP</a></li>
      <li><a href="#set-up-jupyter-to-render-in-web-brower">Set up Jupyter to render in web browser</a></li>
      <li><a href="#install-r">Install R</a>
        <ol>
          <li><a href="#r-libraries">R Libraries</a></li>
          <li><a href="#bioconductor-libraries">Bioconductor libraries</a></li>
          <li><a href="#install-sleuth">Install Sleuth</a></li>
        </ol>
      </li>
      <li><a href="#install-softwares-for-germline-analyses">Install software for germline analyses</a>
        <ol>
          <li><a href="#install-gatk">Install gatk</a></li>
          <li><a href="#install-minimap2">Install minimap2</a></li>
          <li><a href="#install-nanoplot">Install NanoPlot</a></li>
          <li><a href="#install-varscan">Install Varscan</a></li>
          <li><a href="#install-fasplit">Install faSplit</a></li>
        </ol>
      </li>
      <li><a href="#install-packages-for-single-cell-atac-seq-lab">Install packages for single-cell ATAC-seq lab</a>
        <ol>
          <li><a href="#install-atacseqqc">Install ATACseqQC</a></li>
        </ol>
      </li>
      <li><a href="#install-packages-for-single-cell-rnaseq-lab">Install packages for single-cell RNAseq lab</a></li>
      <li><a href="#install-packages-for-variant-annotation-and-python-visualization-lab">Install packages for Variant annotation and python visualization lab</a></li>
    </ol>
  </li>
  <li><a href="#path-setup">Path setup</a></li>
  <li><a href="#set-up-apache-web-server">Set up Apache web server</a></li>
  <li><a href="#save-a-public-ami">Save a public AMI</a></li>
  <li><a href="#current-public-amis">Current Public AMIs</a></li>
  <li><a href="#create-iam-account">Create IAM account</a></li>
  <li><a href="#launch-student-instance">Launch student instance</a></li>
  <li><a href="#set-up-a-dynamic-dns-service">Set up a dynamic DNS service</a></li>
  <li><a href="#host-necessary-files-for-the-course">Host necessary files for the course</a></li>
  <li><a href="#after-course-reminders">After course reminders</a></li>
</ol>

<h2 id="create-aws-account">Create AWS account</h2>

<ol>
  <li>Create a new Gmail account to use for the course</li>
  <li>Use the above email account to set up a new AWS/Amazon user account.
Note: Any AWS account needs to be linked to an actual person and credit card account.</li>
  <li>Optional - Set up an IAM account. Give this account full EC2 but no other permissions. This provides an account that can be shared with other instructors but does not have access to billing and other root account privileges.</li>
  <li>Request limit increase for limit types you will be using. You need to be able to spin up at least one instance of the desired type for every student and TA/instructor. See: <a href="http://aws.amazon.com/about-aws/whats-new/2014/06/19/amazon-ec2-service-limits-report-now-available/">http://aws.amazon.com/about-aws/whats-new/2014/06/19/amazon-ec2-service-limits-report-now-available/</a>. Note: You need to request an increase for each instance type and <em>region</em> you might use.</li>
  <li>Sign into AWS Management Console: <a href="http://aws.amazon.com/console/">http://aws.amazon.com/console/</a></li>
  <li>Go to EC2 services</li>
</ol>

<h3 id="set-up-security-group-if-needed">Set up security group (if needed)</h3>

<p>In general, if no new web server is needed, you may pick an existing security group. The security group used for 2025 was “SSH/HTTP/Jupyter/Rstudio - with outbound rule”. It has the following inbound and outbound rules, which allow connections to the Jupyter and RStudio servers:</p>

<table>
  <thead>
    <tr>
      <th>Rule type</th>
      <th>IP version</th>
      <th>Type</th>
      <th>Protocol</th>
      <th>Port range</th>
      <th>Source / Destination</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Inbound Rule</strong></td>
      <td>IPv4</td>
      <td>Custom TCP</td>
      <td>TCP</td>
      <td>8888</td>
      <td>0.0.0.0/0</td>
    </tr>
    <tr>
      <td> </td>
      <td>IPv4</td>
      <td>Custom TCP</td>
      <td>TCP</td>
      <td>8787</td>
      <td>0.0.0.0/0</td>
    </tr>
    <tr>
      <td> </td>
      <td>IPv4</td>
      <td>HTTP</td>
      <td>TCP</td>
      <td>80</td>
      <td>0.0.0.0/0</td>
    </tr>
    <tr>
      <td> </td>
      <td>IPv4</td>
      <td>HTTPS</td>
      <td>TCP</td>
      <td>443</td>
      <td>0.0.0.0/0</td>
    </tr>
    <tr>
      <td> </td>
      <td>IPv6</td>
      <td>HTTP</td>
      <td>TCP</td>
      <td>80</td>
      <td>::/0</td>
    </tr>
    <tr>
      <td> </td>
      <td>IPv4</td>
      <td>SSH</td>
      <td>TCP</td>
      <td>22</td>
      <td>0.0.0.0/0</td>
    </tr>
    <tr>
      <td><strong>Outbound Rule</strong></td>
      <td>IPv4</td>
      <td>All traffic</td>
      <td>All</td>
      <td>All</td>
      <td>0.0.0.0/0</td>
    </tr>
  </tbody>
</table>
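
<p>If you prefer to script these rules rather than click through the console, the IPv4 inbound rows of the table could be expressed with the AWS CLI roughly as follows. This is a minimal dry-run sketch that only prints the commands: <code class="language-plaintext highlighter-rouge">SG_ID</code> is a placeholder and <code class="language-plaintext highlighter-rouge">print_ingress_rules</code> is a hypothetical helper, neither of which comes from the course setup. The IPv6 rule for port 80 is left out, and a newly created security group normally already allows all outbound traffic.</p>

```shell
# Dry-run sketch: print (do not execute) the AWS CLI calls that would add the
# IPv4 inbound rules from the table above.
# SG_ID is a placeholder security group ID (assumption, not a course value).
SG_ID="sg-EXAMPLE"

# Hypothetical helper: one authorize call per inbound TCP port in the table.
print_ingress_rules() {
  for port in 22 80 443 8787 8888; do
    echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port $port --cidr 0.0.0.0/0"
  done
}

print_ingress_rules
```

<p>Review the printed commands, replace the placeholder ID, and run them one at a time once you are satisfied they match the table.</p>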

<h2 id="start-with-existing-community-ami">Start with existing community AMI</h2>

<ol>
  <li>Set up an Ubuntu instance
  1) Launch a fresh Ubuntu instance (Ubuntu Server 22.04 LTS at the time of writing). 
  2) Choose an instance type of <code class="language-plaintext highlighter-rouge">m6a.xlarge</code>. 
  3) Increase the root volume (e.g., 60GB, type: gp3) and add a second volume (e.g., 500GB, type: gp3). 
  4) Choose an appropriate security group (for the 2025 course, choose security group “SSH/HTTP/Jupyter/Rstudio - with outbound rule”). 
  5) If necessary, create a new key pair, name it, and save it locally somewhere safe. 
  6) Review and Launch. Select ‘View Instances’. Take note of the public IP address of the newly launched instance.</li>
  <li>Change permissions on downloaded key pair with <code class="language-plaintext highlighter-rouge">chmod 400 [instructor-key].pem</code></li>
  <li>Login to instance with ubuntu user:</li>
</ol>

<p><code class="language-plaintext highlighter-rouge">ssh -i [instructor-key].pem ubuntu@[public.ip.address]</code></p>
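
<p>The launch settings above could also be captured as a single AWS CLI call. The sketch below only prints such a call so it can be reviewed first. The AMI ID, key name, security group ID and device names are all placeholders chosen for illustration; only the instance type, volume sizes and volume type come from the steps above.</p>

```shell
# Dry-run sketch: print an `aws ec2 run-instances` call matching the settings above.
# AMI_ID, KEY_NAME and SG_ID are placeholders (assumptions), not course values.
AMI_ID="ami-EXAMPLE"
KEY_NAME="instructor-key"
SG_ID="sg-EXAMPLE"

# 60 GB root volume plus a second 500 GB volume, both gp3.
# The device names /dev/sda1 and /dev/sdb are assumptions.
BLOCK_DEVICES='[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":60,"VolumeType":"gp3"}},{"DeviceName":"/dev/sdb","Ebs":{"VolumeSize":500,"VolumeType":"gp3"}}]'

# Hypothetical helper that assembles the command line as text.
print_launch_command() {
  echo "aws ec2 run-instances --image-id $AMI_ID --instance-type m6a.xlarge --key-name $KEY_NAME --security-group-ids $SG_ID --block-device-mappings '$BLOCK_DEVICES'"
}

print_launch_command
```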

<p><strong>Note for TAs when setting up these instances</strong></p>

<p>Usually the instances are set up in the following sequence: 
1) Email instructors to request any packages/programs that need to be installed. 
2) Make a draft instance with the instructions above, then configure it and install everything. 
3) Make an AMI. 
4) Make an instructor instance by launching a new Ubuntu instance with the same specifications as above, but using the AMI. 
5) Ask instructors and TAs to test their code on the instructor instance, and send them the instructor key pem file. 
6) Modify the draft instance as needed. 
7) Once finalized, create an AMI. This AMI will be distributed to students for the course.</p>

<h2 id="perform-basic-linux-configuration">Perform basic linux configuration</h2>

<ul>
  <li>To allow installation of bioinformatics tools some basic dependencies must be installed first.</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get update
<span class="nb">sudo </span>apt-get upgrade
<span class="nb">sudo </span>apt-get <span class="nt">-y</span> <span class="nb">install </span>make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev libbz2-dev docker.io libpcre2-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libdbi-perl libdbd-mysql-perl libcairo2-dev
<span class="nb">sudo ln</span> <span class="nt">-s</span> /usr/include/jsoncpp/json/ /usr/include/json
<span class="nb">sudo </span>timedatectl set-timezone America/New_York
</code></pre></div></div>

<ul>
  <li>Log out and log back in for changes to take effect.</li>
</ul>
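
<p>After logging back in, it can be worth confirming that the main commands provided by these packages are actually on the path before building anything. A minimal sketch, assuming a hypothetical helper named <code class="language-plaintext highlighter-rouge">missing_tools</code> and a hand-picked subset of commands (not an exhaustive check of every package above):</p>

```shell
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of the course scripts.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>/dev/null; then
      echo "$tool"
    fi
  done
}

# A few commands provided by the packages installed above (illustrative subset).
missing_tools make gcc git cmake unzip python3 java docker
```

<p>An empty result means every listed command was found; any name that is printed points at a package that did not install cleanly.</p>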

<h2 id="add-ubuntu-user-to-docker-group">Add ubuntu user to docker group</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>usermod <span class="nt">-aG</span> docker ubuntu
</code></pre></div></div>

<p>Then exit shell and log back into instance.</p>
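
<p>Group membership only takes effect in a new login session, which is why the re-login is needed. A small sketch for checking whether the current session has picked up the group, using a hypothetical helper <code class="language-plaintext highlighter-rouge">has_group</code>:</p>

```shell
# Succeeds if the current login session already carries the named group.
# `id -nG` with no user argument reports the groups of the running process.
# has_group is a hypothetical helper, not part of the course scripts.
has_group() {
  id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

if has_group docker; then
  echo "docker group is active in this session"
else
  echo "docker group is not active yet; log out and back in"
fi
```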

<h2 id="set-up-additional-storage-for-workspace">Set up additional storage for workspace</h2>

<p>We first need to set up the additional storage volume that we added when we created the instance.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create mountpoint for additional storage volume</span>
<span class="nb">cd</span> /
<span class="nb">sudo mkdir </span>workspace

<span class="c"># Format and mount the additional storage volume</span>
<span class="nb">cd
sudo </span>mkfs <span class="nt">-t</span> ext4 /dev/nvme1n1
<span class="nb">sudo </span>mount /dev/nvme1n1 /workspace
</code></pre></div></div>

<p>To make the workspace volume mount persistent across reboots, we need to edit the /etc/fstab file. AWS provides instructions for how to do this <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html">here</a>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Make the additional storage mount persistent</span>
<span class="c"># See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html for guidance on setting up fstab records for AWS</span>

<span class="c"># get UUID from sudo lsblk -f</span>
<span class="nv">UUID</span><span class="o">=</span><span class="si">$(</span><span class="nb">sudo </span>lsblk <span class="nt">-f</span> | <span class="nb">grep </span>nvme1n1 | <span class="nb">awk</span> <span class="o">{</span><span class="s1">'print $4'</span><span class="o">}</span><span class="si">)</span>
<span class="c">#to double check, run 'echo $UUID' to see the UUID</span>

<span class="c">#then write that UUID to /etc/fstab (note: tee without -a replaces the whole file with the two records below)</span>
<span class="nb">echo</span> <span class="nt">-e</span> <span class="s2">"LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0</span><span class="se">\n</span><span class="s2">UUID=</span><span class="nv">$UUID</span><span class="s2"> /workspace ext4 defaults,nofail 0 2"</span> | <span class="nb">sudo tee</span> /etc/fstab
<span class="c">#'less /etc/fstab' , to see if the new line has been added</span>

<span class="c"># Change permissions on required drives</span>
<span class="nb">sudo chown</span> <span class="nt">-R</span> ubuntu:ubuntu /workspace

<span class="c"># Create symlink to the added volume in your home directory</span>
<span class="nb">cd</span> ~
<span class="nb">ln</span> <span class="nt">-s</span> /workspace workspace
</code></pre></div></div>
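
<p>The fstab record written above has a fixed shape: the filesystem UUID, the mount point, the filesystem type, the mount options, and the dump and fsck fields. As a minimal sketch, the record can be built by a small function so it is easy to inspect before touching the real file. The function name and the example UUID are made up for illustration.</p>

```shell
# Build the fstab record for the workspace volume from a filesystem UUID.
# workspace_fstab_entry is a hypothetical helper, not part of the course scripts.
workspace_fstab_entry() {
  echo "UUID=$1 /workspace ext4 defaults,nofail 0 2"
}

# Example with a made-up UUID (assumption); on the instance, pass the value of $UUID.
workspace_fstab_entry "0000aaaa-11bb-22cc-33dd-444455556666"

# To append the record instead of replacing the file, and then check that
# every fstab entry mounts without error:
#   workspace_fstab_entry "$UUID" | sudo tee -a /etc/fstab
#   sudo mount -a
```

<p>Appending with <code class="language-plaintext highlighter-rouge">tee -a</code> keeps whatever records the image already had, which may be preferable on images whose default fstab contains more than the root entry.</p>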

<h2 id="install-any-desired-informatics-tools">Install any desired informatics tools</h2>

<ul>
  <li><strong>NOTE:</strong> R in particular is a slow install.</li>
  <li><strong>NOTE:</strong></li>
</ul>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">- All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises.
</span></code></pre></div></div>

<ul>
  <li>Paths to pre-installed tools can be added to the .bashrc file.
    <ul>
      <li>A template .bashrc file: <a href="https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc">https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc</a></li>
      <li>For the draft instance we are setting up, it will be helpful to copy contents from this file directly into the .bashrc file: <a href="http://genomedata.org/rnaseq-tutorial/bashrc_copy">http://genomedata.org/rnaseq-tutorial/bashrc_copy</a>. Add additional tool paths on top of this.</li>
    </ul>
  </li>
  <li><strong>NOTE:</strong> In some installations of R (although not during the 2023 installation) there is an executable called pager that clashes with the system pager. This causes man to fail. Check with <code class="language-plaintext highlighter-rouge">man ls</code> and, if the problem exists, add the following to .bashrc:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">MANPAGER</span><span class="o">=</span>less
</code></pre></div></div>
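
<p>Most of the install steps in this guide end with an <code class="language-plaintext highlighter-rouge">export PATH=...</code> line that is meant to be added to .bashrc. The following is a minimal sketch of one way to do that without creating duplicate entries when a step is repeated. The function name <code class="language-plaintext highlighter-rouge">add_to_bashrc_path</code> and its optional second argument are hypothetical and not part of the original setup.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper (an assumption, not part of the original setup):
# append "export PATH=DIR:$PATH" to an rc file unless that exact line exists.
add_to_bashrc_path() {
    local dir="$1"
    local rc="${2:-$HOME/.bashrc}"          # defaults to the user's .bashrc
    local line="export PATH=${dir}:\$PATH"  # \$PATH keeps the variable unexpanded

    touch "$rc"                             # make sure the file exists
    if grep -qxF -- "$line" "$rc"; then
        echo "already present: $line"
    else
        # tee -a appends the line and echoes it as confirmation
        printf '%s\n' "$line" | tee -a "$rc"
    fi
}
</code></pre></div></div>

<p>For example, <code class="language-plaintext highlighter-rouge">add_to_bashrc_path /home/ubuntu/bin/samtools-1.18</code> would register the SAMtools directory used in this guide. Run <code class="language-plaintext highlighter-rouge">source ~/.bashrc</code> afterwards for the change to take effect in the current shell.</p>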

<h3 id="install-rna-seq-software">Install RNA-seq software</h3>

<ul>
  <li>These install instructions should be identical to those found on <a href="https://github.com/griffithlab/rnaseq_tutorial/wiki/Installation">https://github.com/griffithlab/rnaseq_tutorial/wiki/Installation</a> except that each tool is installed in <code class="language-plaintext highlighter-rouge">/home/ubuntu/bin/</code> and its install location is exported to the $PATH variable for easy access.</li>
</ul>

<h4 id="create-directory-to-install-software-to-and-setup-path-variables">Create directory to install software to and setup path variables</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> ~/bin
<span class="nb">cd</span> ~/bin
<span class="nv">WORKSPACE</span><span class="o">=</span>/home/ubuntu/workspace
<span class="nv">HOME</span><span class="o">=</span>/home/ubuntu
</code></pre></div></div>

<h4 id="install-samtools">Install <a href="http://www.htslib.org/">SAMtools</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2
bunzip2 samtools-1.18.tar.bz2
<span class="nb">tar</span> <span class="nt">-xvf</span> samtools-1.18.tar
<span class="nb">cd </span>samtools-1.18
make
./samtools
<span class="c">#add the following line to .bashrc</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/samtools-1.18:<span class="nv">$PATH</span>
<span class="nb">export </span><span class="nv">SAMTOOLS_ROOT</span><span class="o">=</span>/home/ubuntu/bin/samtools-1.18
</code></pre></div></div>

<h4 id="install-bam-readcount">Install <a href="https://github.com/genome/bam-readcount">bam-readcount</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
git clone https://github.com/genome/bam-readcount 
<span class="nb">cd </span>bam-readcount
<span class="nb">mkdir </span>build
<span class="nb">cd </span>build
cmake ..
make
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/bam-readcount/build/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-hisat2">Install <a href="https://daehwankimlab.github.io/hisat2/">HISAT2</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">uname</span> <span class="nt">-m</span>
<span class="nb">cd</span> ~/bin
curl <span class="nt">-s</span> https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download <span class="o">&gt;</span> hisat2-2.2.1-Linux_x86_64.zip
unzip hisat2-2.2.1-Linux_x86_64.zip
<span class="nb">cd </span>hisat2-2.2.1
./hisat2 <span class="nt">-h</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/hisat2-2.2.1:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-stringtie">Install <a href="https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual">StringTie</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.tar.gz
<span class="nb">tar</span> <span class="nt">-xzvf</span> stringtie-2.2.1.tar.gz
<span class="nb">cd </span>stringtie-2.2.1
make release
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/stringtie-2.2.1:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-gffcompare">Install <a href="http://ccb.jhu.edu/software/stringtie/gff.shtml#gffcompare">gffcompare</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz
<span class="nb">tar</span> <span class="nt">-xzvf</span> gffcompare-0.12.6.Linux_x86_64.tar.gz
<span class="nb">cd </span>gffcompare-0.12.6.Linux_x86_64/
./gffcompare
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-htseq">Install <a href="https://htseq.readthedocs.io/en/master/install.html">HTSeq</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>HTSeq

<span class="c"># to check version of HTSeq</span>
<span class="c"># pip show HTSeq</span>
</code></pre></div></div>

<h4 id="make-sure-that-openssl-is-on-correct-version">Make sure that OpenSSL is on correct version</h4>
<p>TopHat will not install if the version of OpenSSL is too old.</p>

<p>To get version:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openssl version
</code></pre></div></div>

<p>If the version is <code class="language-plaintext highlighter-rouge">OpenSSL 1.1.1f</code>, it needs to be updated using the following steps.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz
<span class="nb">tar</span> <span class="nt">-zxf</span> openssl-1.1.1g.tar.gz <span class="o">&amp;&amp;</span> <span class="nb">cd </span>openssl-1.1.1g
./config
make
make <span class="nb">test
sudo mv</span> /usr/bin/openssl ~/tmp <span class="c">#in case install goes wrong</span>
<span class="nb">sudo </span>make <span class="nb">install
sudo ln</span> <span class="nt">-s</span> /usr/local/bin/openssl /usr/bin/openssl
<span class="nb">sudo </span>ldconfig
</code></pre></div></div>

<p>Again, from the terminal issue the command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openssl version
</code></pre></div></div>

<p>Your output should be as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OpenSSL 1.1.1g  21 Apr 2020
</code></pre></div></div>

<p>Then create a <code class="language-plaintext highlighter-rouge">~/.wgetrc</code> file and, using vim or nano, add the following line to it:
<code class="language-plaintext highlighter-rouge">ca_certificate=/etc/ssl/certs/ca-certificates.crt</code></p>
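
<p>The OpenSSL version check above can also be scripted. The following is a minimal sketch, assuming GNU <code class="language-plaintext highlighter-rouge">sort -V</code> is available (as it is on Ubuntu) and treating 1.1.1g as the minimum acceptable version. The helper name <code class="language-plaintext highlighter-rouge">version_at_least</code> is hypothetical.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: succeeds when version $1 is the same as or newer than $2.
# Relies on GNU sort -V (version ordering), which is an assumption about the host.
version_at_least() {
    local lowest
    lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
    [ "$lowest" = "$2" ]
}

# The second field of `openssl version` is the version string, e.g. 1.1.1g
current="$(openssl version | awk '{print $2}')"
if version_at_least "$current" 1.1.1g; then
    echo "OpenSSL $current is new enough"
else
    echo "OpenSSL $current is older than 1.1.1g; follow the update steps above"
fi
</code></pre></div></div>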

<h4 id="install-tophat">Install <a href="https://ccb.jhu.edu/software/tophat/index.shtml">TopHat</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
<span class="nb">tar</span> <span class="nt">-zxvf</span> tophat-2.1.1.Linux_x86_64.tar.gz
<span class="nb">cd </span>tophat-2.1.1.Linux_x86_64
./gtf_to_fasta
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-kallisto">Install <a href="https://pachterlab.github.io/kallisto/">kallisto</a></h4>

<p>Note: A couple of arguments are only supported in legacy versions of kallisto (versions before 0.50.0), and how_are_we_stranded_here uses kallisto 0.44.x. The installation steps below are therefore for one of the legacy versions. If you run into problems, consider using a more recent version.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
<span class="nb">tar</span> <span class="nt">-zxvf</span> kallisto_linux-v0.44.0.tar.gz
<span class="nb">cd </span>kallisto_linux-v0.44.0
./kallisto
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/kallisto_linux-v0.44.0:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-fastqc">Install <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">FastQC</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> default-jre
<span class="nb">cd</span> ~/bin
<span class="nb">sudo </span>wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
<span class="nb">sudo </span>unzip fastqc_v0.12.1.zip
<span class="nb">sudo chmod </span>755 FastQC/fastqc
./FastQC/fastqc <span class="nt">--help</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/FastQC:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-fastp">Install <a href="https://github.com/OpenGene/fastp">Fastp</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /home/ubuntu/bin
wget http://opengene.org/fastp/fastp
<span class="nb">chmod </span>a+x ./fastp
./fastp

<span class="c"># Add fastp to bashrc</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-multiqc">Install <a href="http://multiqc.info/">MultiQC</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
pip3 <span class="nb">install </span>multiqc
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/.local/bin:<span class="nv">$PATH</span>
multiqc <span class="nt">--help</span>

</code></pre></div></div>

<h4 id="install-picard">Install <a href="https://broadinstitute.github.io/picard/">Picard</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar <span class="nt">-O</span> picard.jar
java <span class="nt">-jar</span> ~/bin/picard.jar

<span class="nb">export </span><span class="nv">PICARD</span><span class="o">=</span>/home/ubuntu/bin/picard.jar
</code></pre></div></div>

<h4 id="install-flexbar">Install <a href="https://github.com/seqan/flexbar">Flexbar</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>flexbar
</code></pre></div></div>

<h4 id="install-regtools">Install <a href="https://github.com/griffithlab/regtools#regtools">Regtools</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
git clone https://github.com/griffithlab/regtools
<span class="nb">cd </span>regtools/
<span class="nb">mkdir </span>build
<span class="nb">cd </span>build/
cmake ..
make
./regtools
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/regtools/build:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-rseqc">Install <a href="http://rseqc.sourceforge.net/">RSeQC</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip3 <span class="nb">install </span>RSeQC
~/.local/bin/read_GC.py
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>~/.local/bin/:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-bedops">Install <a href="https://bedops.readthedocs.io/en/latest/">bedops</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
<span class="nb">mkdir </span>bedops_linux_x86_64-v2.4.41
<span class="nb">cd </span>bedops_linux_x86_64-v2.4.41
wget <span class="nt">-c</span> https://github.com/bedops/bedops/releases/download/v2.4.41/bedops_linux_x86_64-v2.4.41.tar.bz2
<span class="nb">tar</span> <span class="nt">-jxvf</span> bedops_linux_x86_64-v2.4.41.tar.bz2
./bin/bedops

<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>~/bin/bedops_linux_x86_64-v2.4.41/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-gtftogenepred">Install <a href="https://bioconda.github.io/recipes/ucsc-gtftogenepred/README.html">gtfToGenePred</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
<span class="nb">mkdir </span>gtfToGenePred
<span class="nb">cd </span>gtfToGenePred
wget <span class="nt">-c</span> http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
<span class="nb">chmod </span>a+x gtfToGenePred
./gtfToGenePred
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/gtfToGenePred:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-genepredtobed">Install <a href="https://bioconda.github.io/recipes/ucsc-genepredtobed/README.html">genePredToBed</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
<span class="nb">mkdir </span>genePredtoBed
<span class="nb">cd </span>genePredtoBed
wget <span class="nt">-c</span> http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
<span class="nb">chmod </span>a+x genePredToBed
./genePredToBed
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/genePredtoBed:<span class="nv">$PATH</span> 
<span class="c">#note: the directory name has a lowercase 't' in 'genePredtoBed'</span>
<span class="c">#genePredToBed </span>
</code></pre></div></div>

<h4 id="install-how_are_we_stranded_here">Install <a href="https://github.com/betsig/how_are_we_stranded_here">how_are_we_stranded_here</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip3 <span class="nb">install </span>git+https://github.com/kcotto/how_are_we_stranded_here.git
check_strandedness
</code></pre></div></div>

<h4 id="install-cell-ranger">Install <a href="https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation">Cell Ranger</a></h4>

<ul>
  <li>Check if it has been updated</li>
  <li>Must register to get download link</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget <span class="nt">-O</span> cellranger-9.0.1.tar.gz <span class="s2">"https://cf.10xgenomics.com/releases/cell-exp/cellranger-9.0.1.tar.gz?Expires=1761476807&amp;Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&amp;Signature=EzblHLl1V4c8zY~f1gqgkJDfwXHdpXeUUsoqiipDGTTRc1cDXifB5CZdV2te0aGah4VoEUssh8ERdRpmLzMNZzSBdsjmX9t6FE3PwZ83c-cOKAVJK3-v7z8GhMH9HZMqVxEb3w6SwztkZmipGhCyG95yT2fv-sNQZssJmUEzx8Wnc4t69iS~u0uMrd4rj9EDkyw6CYCfkRGGoxCclUAyPqMBQGRbcCogVlSoVk6sc~UsD5vpXXoxlgoVRrThdEZ4DpUUtInLST8cvtS127nrWIJcJ1e1Jk8dSndpKvAHEwTGB~U21oyiOb8lZhXVsY7VCXKsvivRKaRXWj0kh8whOA__"</span>
<span class="nb">tar</span> <span class="nt">-xzvf</span> cellranger-9.0.1.tar.gz
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/cellranger-9.0.1:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-tabix">Install <a href="http://www.htslib.org/download/">TABIX</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nb">install </span>tabix
</code></pre></div></div>

<h4 id="install-bwa">Install <a href="http://bio-bwa.sourceforge.net/bwa.shtml">BWA</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
git clone https://github.com/lh3/bwa.git
<span class="nb">cd </span>bwa
make
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/bwa:<span class="nv">$PATH</span>
<span class="c">#bwa mem #to call bwa</span>
</code></pre></div></div>

<h4 id="install-bedtools">Install <a href="https://bedtools.readthedocs.io/en/latest/">bedtools</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools-2.31.0.tar.gz
<span class="nb">tar</span> <span class="nt">-zxvf</span> bedtools-2.31.0.tar.gz
<span class="nb">cd </span>bedtools2
make
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/bedtools2/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-bcftools">Install <a href="http://www.htslib.org/download/">BCFtools</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/samtools/bcftools/releases/download/1.18/bcftools-1.18.tar.bz2
bunzip2 bcftools-1.18.tar.bz2
<span class="nb">tar</span> <span class="nt">-xvf</span> bcftools-1.18.tar
<span class="nb">cd </span>bcftools-1.18
make
./bcftools
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/bcftools-1.18:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-htslib">Install <a href="http://www.htslib.org/download/">htslib</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/samtools/htslib/releases/download/1.18/htslib-1.18.tar.bz2
bunzip2 htslib-1.18.tar.bz2
<span class="nb">tar</span> <span class="nt">-xvf</span> htslib-1.18.tar
<span class="nb">cd </span>htslib-1.18
make
<span class="nb">sudo </span>make <span class="nb">install</span>
<span class="c">#htsfile --help</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/htslib-1.18:<span class="nv">$PATH</span>
</code></pre></div></div>

<h4 id="install-peddy">Install <a href="https://github.com/brentp/peddy">peddy</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
git clone https://github.com/brentp/peddy
<span class="nb">cd </span>peddy
pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt
pip <span class="nb">install</span> <span class="nt">--editable</span> <span class="nb">.</span>
</code></pre></div></div>

<h4 id="install-slivar">Install <a href="https://github.com/brentp/slivar">slivar</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/brentp/slivar/releases/download/v0.3.0/slivar
<span class="nb">chmod</span> +x ./slivar
</code></pre></div></div>

<h4 id="install-strling">Install <a href="https://strling.readthedocs.io/en/latest/index.html">STRling</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.2/strling
<span class="nb">chmod</span> +x ./strling
</code></pre></div></div>

<h3 id="install-freebayes">Install <a href="https://github.com/freebayes/freebayes">freebayes</a></h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>freebayes
</code></pre></div></div>

<h3 id="install-vcflib">Install <a href="https://github.com/vcflib/vcflib">vcflib</a></h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>libvcflib-tools libvcflib-dev
</code></pre></div></div>
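
<p>Once the tools above are installed and .bashrc has been sourced, a short loop can confirm that each executable resolves on the <code class="language-plaintext highlighter-rouge">PATH</code>. This is a minimal sketch: <code class="language-plaintext highlighter-rouge">check_tools</code> is a hypothetical helper, and the names passed to it are the executable names assumed from the install steps above, so adjust the list to match what was actually installed.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: report which executables are visible on the current PATH.
# The return status is the number of tools that could not be found.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if [ -n "$(command -v "$tool")" ]; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Executable names below are assumptions based on the tools installed in this guide
check_tools samtools hisat2 stringtie gffcompare kallisto fastp multiqc \
    regtools bedops bwa bedtools bcftools tabix freebayes \
    || echo "Some tools are missing; re-check the export lines in .bashrc"
</code></pre></div></div>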

<h3 id="install-anaconda">Install <a href="https://www.anaconda.com/">Anaconda</a></h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh 
bash Anaconda3-2023.09-0-Linux-x86_64.sh
</code></pre></div></div>

<p>Press Enter to review the license agreement. Then press and hold Enter to scroll.</p>

<p>Enter “yes” to agree to the license agreement.</p>

<p>Save the installation to <code class="language-plaintext highlighter-rouge">/home/ubuntu/bin/anaconda3</code> and choose yes when asked whether to initialize Anaconda3.</p>

<p>Add in bashrc:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/anaconda3/bin:<span class="nv">$PATH</span>
</code></pre></div></div>
<p>To see the location of the conda executable, run <code class="language-plaintext highlighter-rouge">which conda</code>.</p>

<h3 id="install-vep">Install <a href="https://github.com/Ensembl/ensembl-vep">VEP</a></h3>
<p>Note: Install VEP in the workspace because its cache files take up a lot of space (~25G).</p>

<p>This section describes the installation of VEP 110 and its dependencies, used in this course for variant annotation. When running the VEP installer, answer the prompts as follows:</p>

<ol>
  <li>Do you want to install any cache files (y/n)? n
(To install a cache file instead, choose ‘y’ [ENTER], then select the number for homo_sapiens_vep_110_GRCh38.tar.gz [ENTER].)</li>
  <li>Do you want to install any FASTA files (y/n)? y [ENTER] (select number for homo_sapiens) [ENTER]</li>
  <li>Do you want to install any plugins (y/n)? n [ENTER]</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/workspace
<span class="nb">sudo </span>git clone https://github.com/Ensembl/ensembl-vep.git
<span class="nb">cd </span>ensembl-vep
<span class="nb">sudo </span>perl <span class="nt">-MCPAN</span> <span class="nt">-e</span><span class="s1">'install "LWP::Simple"'</span>
<span class="nb">sudo </span>perl INSTALL.pl <span class="nt">--CACHEDIR</span> ~/workspace/ensembl-vep/
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/workspace/ensembl-vep:<span class="nv">$PATH</span>
<span class="c">#vep --help</span>
</code></pre></div></div>

<h3 id="set-up-jupyter-to-render-in-web-brower">Set up Jupyter to render in web browser</h3>

<p>These steps follow this <a href="https://dataschool.com/data-modeling-101/running-jupyter-notebook-on-an-ec2-server/">guide to running Jupyter on EC2</a> and this <a href="https://jupyter-server.readthedocs.io/en/latest/operators/migrate-from-nbserver.html">Jupyter Server migration guide</a>.
Note: The old Jupyter Notebook was split into jupyter-server and nbclassic. The EC2 setup steps from the first link have therefore been adapted, following the second link, to accommodate this migration.</p>

<p>First, we need to add Jupyter to the system’s path. You can check whether it is already there by running <code class="language-plaintext highlighter-rouge">which jupyter</code>; if no path is returned, you need to add it. To do so, add the following line to your .bashrc file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/anaconda3/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<p>Then you need to source the .bashrc for changes to take effect.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">source</span> ~/.bashrc
</code></pre></div></div>

<p>We then need to create our Jupyter configuration file. In order to create that file, you need to run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter server <span class="nt">--generate-config</span>
</code></pre></div></div>

<hr />

<p><strong>Optional:</strong> After creating your configuration file, you can generate a password for your Jupyter Notebook using IPython:</p>

<p>Enter the IPython command line:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ipython
</code></pre></div></div>

<p>Now follow these steps to generate your password:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">notebook.auth</span> <span class="kn">import</span> <span class="n">passwd</span>

<span class="n">passwd</span><span class="p">()</span>
</code></pre></div></div>

<p>You will be prompted to enter and re-enter your password. IPython will then generate a hash output, COPY THIS AND SAVE IT FOR LATER. We will need this for our configuration file.</p>

<p>Run <code class="language-plaintext highlighter-rouge">exit</code> in order to exit IPython.</p>

<hr />

<p>Next, go into your jupyter config file (<code class="language-plaintext highlighter-rouge">/home/ubuntu/.jupyter/jupyter_server_config.py</code>):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/.jupyter

vim jupyter_server_config.py
</code></pre></div></div>

<p>And add the following code at the beginning of the document:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c <span class="o">=</span> get_config<span class="o">()</span> <span class="c">#add this line if it's not already in jupyter_server_config.py</span>

c.ServerApp.ip <span class="o">=</span> <span class="s1">'0.0.0.0'</span>
<span class="c">#c.ServerApp.password = u'YOUR PASSWORD HASH' #uncomment this line if you decide to use a password</span>
c.ServerApp.port <span class="o">=</span> 8888
</code></pre></div></div>

<hr />
<p><strong>Optional:</strong> You can create a directory to store all of your Jupyter Notebooks by running:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>Notebooks
<span class="c"># You can call this folder anything, for this example we call it `Notebooks`.</span>
</code></pre></div></div>
<hr />

<p>After the previous step, you should be ready to run your notebook and access your EC2 server. To run your Notebook simply run the command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter nbclassic
<span class="c"># to open jupyter lab on the instance</span>
jupyter lab
</code></pre></div></div>

<p>From there you should be able to access your server by going to:</p>

<p><code class="language-plaintext highlighter-rouge">http://(your AWS public dns):8888/</code>
or
<code class="language-plaintext highlighter-rouge">http://(your AWS public dns):8888/(tree?token=... - in the message generated while running 'jupyter nbclassic')</code></p>

<p>(Note: if you run into problems accessing the server, double check whether you are using http or https. If you did not open the https port in the security group configuration step when creating the instance, you will not be able to access the server over https.)</p>

<h4 id="install-r">Install <a href="http://www.r-project.org/">R</a></h4>

<p>Follow the guide on the <a href="https://cran.r-project.org/">CRAN website</a> to install R version 4.4.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># update indices</span>
<span class="nb">sudo </span>apt update <span class="nt">-qq</span>
<span class="c"># install two helper packages we need</span>
<span class="nb">sudo </span>apt <span class="nb">install</span> <span class="nt">--no-install-recommends</span> software-properties-common dirmngr
<span class="c"># add the signing key (by Michael Rutter) for these repos</span>
<span class="c"># To verify key, run gpg --show-keys /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc </span>
<span class="c"># Fingerprint: E298A3A825C0D65DFD57CBB651716619E084DAB9</span>
wget <span class="nt">-qO-</span> https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | <span class="nb">sudo tee</span> <span class="nt">-a</span> /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
<span class="c"># add the R 4.x repo from CRAN -- $(lsb_release -cs) fills in the Ubuntu codename (e.g. 'focal'), so no manual adjustment is needed</span>
<span class="nb">sudo </span>add-apt-repository <span class="s2">"deb https://cloud.r-project.org/bin/linux/ubuntu </span><span class="si">$(</span>lsb_release <span class="nt">-cs</span><span class="si">)</span><span class="s2">-cran40/"</span>

<span class="c">#install R and its dependencies</span>
<span class="nb">sudo </span>apt <span class="nb">install</span> <span class="nt">--no-install-recommends</span> r-base
</code></pre></div></div>

<p>Note: linking the R-patched <code class="language-plaintext highlighter-rouge">bin</code> directory into your <code class="language-plaintext highlighter-rouge">PATH</code> may cause odd behavior, such as man pages or <code class="language-plaintext highlighter-rouge">git log</code> output failing to display. This can be avoided by linking the <code class="language-plaintext highlighter-rouge">R*</code> executables (<code class="language-plaintext highlighter-rouge">R</code>, <code class="language-plaintext highlighter-rouge">Rscript</code>, <code class="language-plaintext highlighter-rouge">RCmd</code>, etc.) directly into a <code class="language-plaintext highlighter-rouge">PATH</code> directory.</p>

<h4 id="r-libraries">R Libraries</h4>

<p>For this tutorial we require:</p>

<ul>
  <li><a href="https://cran.r-project.org/web/packages/devtools/index.html">devtools</a></li>
  <li><a href="https://cran.r-project.org/web/packages/dplyr/index.html">dplyr</a></li>
  <li><a href="http://cran.r-project.org/web/packages/gplots/index.html">gplots</a></li>
  <li><a href="https://ggplot2.tidyverse.org/">ggplot2</a></li>
  <li><a href="https://cran.r-project.org/web/packages/sctransform/index.html">sctransform</a></li>
  <li><a href="https://cran.r-project.org/web/packages/Seurat/index.html">Seurat</a></li>
  <li><a href="https://cran.r-project.org/web/packages/RColorBrewer/index.html">RColorBrewer</a></li>
  <li><a href="https://cran.r-project.org/package=ggthemes">ggthemes</a></li>
  <li><a href="https://cran.r-project.org/package=cowplot">cowplot</a></li>
  <li><a href="https://cran.r-project.org/web/packages/data.table/">data.table</a></li>
  <li><a href="https://cran.r-project.org/package=Rtsne">Rtsne</a></li>
  <li><a href="https://cran.r-project.org/web/packages/gridExtra/index.html">gridExtra</a></li>
  <li><a href="https://cran.r-project.org/web/packages/UpSetR/index.html">UpSetR</a></li>
  <li><a href="https://tidyverse.tidyverse.org/">tidyverse</a></li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R
install.packages<span class="o">(</span>c<span class="o">(</span><span class="s2">"devtools"</span>,<span class="s2">"dplyr"</span>,<span class="s2">"gplots"</span>,<span class="s2">"ggplot2"</span>,<span class="s2">"sctransform"</span>,<span class="s2">"Seurat"</span>,<span class="s2">"RColorBrewer"</span>,<span class="s2">"ggthemes"</span>,<span class="s2">"cowplot"</span>,<span class="s2">"data.table"</span>,<span class="s2">"Rtsne"</span>,<span class="s2">"gridExtra"</span>,<span class="s2">"UpSetR"</span>,<span class="s2">"tidyverse"</span><span class="o">)</span>,repos<span class="o">=</span><span class="s2">"http://cran.us.r-project.org"</span><span class="o">)</span>
quit<span class="o">(</span><span class="nv">save</span><span class="o">=</span><span class="s2">"no"</span><span class="o">)</span>
</code></pre></div></div>
<p>Note: If asked whether to install into a personal library, type ‘yes’.</p>
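<p>After a long <code class="language-plaintext highlighter-rouge">install.packages</code> run it can help to confirm that nothing failed silently. The following is a minimal sketch, not part of the original setup: the helper names <code class="language-plaintext highlighter-rouge">r_vector</code> and <code class="language-plaintext highlighter-rouge">missing_r_packages</code> are hypothetical, and it assumes <code class="language-plaintext highlighter-rouge">Rscript</code> is on the PATH.</p>

```shell
#!/usr/bin/env bash
# Build an R character vector literal, e.g. c("dplyr","Seurat"), from shell arguments.
r_vector() {
  local out="" pkg
  for pkg in "$@"; do
    out="${out:+$out,}\"$pkg\""
  done
  printf 'c(%s)\n' "$out"
}

# Print the names of the requested packages that R cannot find.
# Assumption: Rscript is on PATH; if it is not, the check is skipped.
missing_r_packages() {
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found; skipping check" >&2
    return 0
  fi
  Rscript -e "pkgs = $(r_vector "$@"); cat(setdiff(pkgs, rownames(installed.packages())), sep='\n')"
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">missing_r_packages devtools dplyr Seurat sctransform</code> prints nothing when all four packages are installed, and otherwise lists the ones to reinstall.</p>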

<h4 id="bioconductor-libraries"><a href="http://www.bioconductor.org/">Bioconductor</a> libraries</h4>

<p>For this tutorial we require:</p>

<ul>
  <li><a href="http://bioconductor.org/packages/release/bioc/html/genefilter.html">genefilter</a></li>
  <li><a href="http://bioconductor.org/packages/release/bioc/html/ballgown.html">ballgown</a></li>
  <li><a href="http://www.bioconductor.org/packages/release/bioc/html/edgeR.html">edgeR</a></li>
  <li><a href="http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html">GenomicRanges</a></li>
  <li><a href="https://www.bioconductor.org/packages/release/bioc/html/rhdf5.html">rhdf5</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/biomaRt.html">biomaRt</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/scran.html">scran</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/sva.html">sva</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/gage.html">gage</a></li>
  <li><a href="https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html">org.Hs.eg.db</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/DESeq2.html">DESeq2</a></li>
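  <li><a href="https://bioconductor.org/packages/release/bioc/html/apeglm.html">apeglm</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html">clusterProfiler</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/enrichplot.html">enrichplot</a></li>
  <li><a href="https://bioconductor.org/packages/release/bioc/html/pathview.html">pathview</a></li>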
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R
<span class="c"># Install Bioconductor</span>
<span class="k">if</span> <span class="o">(!</span>requireNamespace<span class="o">(</span><span class="s2">"BiocManager"</span>, quietly <span class="o">=</span> TRUE<span class="o">))</span>
    install.packages<span class="o">(</span><span class="s2">"BiocManager"</span><span class="o">)</span>
BiocManager::install<span class="o">(</span>c<span class="o">(</span><span class="s2">"genefilter"</span>,<span class="s2">"ballgown"</span>,<span class="s2">"edgeR"</span>,<span class="s2">"GenomicRanges"</span>,<span class="s2">"rhdf5"</span>,<span class="s2">"biomaRt"</span>,<span class="s2">"scran"</span>,<span class="s2">"sva"</span>,<span class="s2">"gage"</span>,<span class="s2">"org.Hs.eg.db"</span>,<span class="s2">"DESeq2"</span>,<span class="s2">"apeglm"</span>,<span class="s2">"clusterProfiler"</span>,<span class="s2">"enrichplot"</span>,<span class="s2">"pathview"</span><span class="o">))</span>
quit<span class="o">(</span><span class="nv">save</span><span class="o">=</span><span class="s2">"no"</span><span class="o">)</span>
</code></pre></div></div>

<h4 id="install-sleuth">Install <a href="https://pachterlab.github.io/sleuth/download">Sleuth</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R
install.packages<span class="o">(</span><span class="s2">"devtools"</span><span class="o">)</span>
devtools::install_github<span class="o">(</span><span class="s2">"pachterlab/sleuth"</span><span class="o">)</span>
quit<span class="o">(</span><span class="nv">save</span><span class="o">=</span><span class="s2">"no"</span><span class="o">)</span>
</code></pre></div></div>

<h3 id="install-softwares-for-germline-analyses">Install software for germline analyses</h3>
<ul>
  <li>gatk</li>
  <li>minimap2</li>
  <li>NanoPlot</li>
  <li>VarScan</li>
  <li>faSplit</li>
</ul>

<h4 id="install-gatk">Install <a href="https://gatk.broadinstitute.org/hc/en-us/articles/360036194592-Getting-started-with-GATK4">gatk</a></h4>
<p>(Note: for the cshl2023 version of the course, install GATK 4.2.1.0 rather than a newer release, since this version works with the Java version currently installed, Java 11.)</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
wget https://github.com/broadinstitute/gatk/releases/download/4.2.1.0/gatk-4.2.1.0.zip
unzip gatk-4.2.1.0.zip

<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/gatk-4.2.1.0:<span class="nv">$PATH</span> <span class="c">#add to .bashrc</span>
gatk <span class="nt">--help</span>
gatk <span class="nt">--list</span>
</code></pre></div></div>
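<p>Because the GATK release above was chosen to match Java 11, it may be worth checking which Java the instance actually runs before installing. This is a small sketch under that assumption; <code class="language-plaintext highlighter-rouge">java_major</code> is a hypothetical helper that parses the first line of <code class="language-plaintext highlighter-rouge">java -version</code> output.</p>

```shell
#!/usr/bin/env bash
# Extract the major Java version from the first line of `java -version` output.
# Handles the old "1.8.0_292" scheme (major = 8) and the newer "11.0.20" scheme.
java_major() {
  local ver
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    1.*) ver=${ver#1.} ;;
  esac
  printf '%s\n' "${ver%%[._-]*}"
}

# `java -version` writes to stderr, so redirect it before parsing.
if command -v java >/dev/null 2>&1; then
  major=$(java_major "$(java -version 2>&1 | head -n 1)")
  if [ "$major" != "11" ]; then
    echo "Note: found Java ${major:-unknown}; GATK 4.2.1.0 was selected here for Java 11"
  fi
fi
```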

<h4 id="install-minimap2">Install <a href="https://github.com/lh3/minimap2">minimap2</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
curl <span class="nt">-L</span> https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | <span class="nb">tar</span> <span class="nt">-jxvf</span> -
./minimap2-2.26_x64-linux/minimap2

<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/home/ubuntu/bin/minimap2-2.26_x64-linux:<span class="nv">$PATH</span> <span class="c">#add to .bashrc</span>
minimap2 <span class="nt">--help</span>

</code></pre></div></div>
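<p>Several steps in this guide say to add an <code class="language-plaintext highlighter-rouge">export PATH=...</code> line to .bashrc. If the setup is re-run, the same line can end up in the file many times. One possible way to avoid that is sketched below; <code class="language-plaintext highlighter-rouge">append_once</code> is a hypothetical helper, not something the course provides.</p>

```shell
#!/usr/bin/env bash
# Append a line to a startup file only if that exact line is not already present.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  # -x matches whole lines, -F treats the text literally (no regex surprises from $ or .)
  if ! grep -qxF -- "$line" "$file"; then
    printf '%s\n' "$line" >> "$file"
  fi
}
```

<p>Usage: <code class="language-plaintext highlighter-rouge">append_once 'export PATH=/home/ubuntu/bin/minimap2-2.26_x64-linux:$PATH' ~/.bashrc</code>. The single quotes keep <code class="language-plaintext highlighter-rouge">$PATH</code> unexpanded so that it is evaluated at login rather than at install time.</p>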

<h4 id="install-nanoplot">Install <a href="https://github.com/wdecoster/NanoPlot">NanoPlot</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>NanoPlot
<span class="c">#which NanoPlot</span>
<span class="c">#NanoPlot -h</span>
</code></pre></div></div>

<h4 id="install-varscan">Install <a href="https://github.com/dkoboldt/varscan">VarScan</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin
curl <span class="nt">-L</span> <span class="nt">-k</span> <span class="nt">-o</span> VarScan.v2.4.2.jar https://github.com/dkoboldt/varscan/releases/download/2.4.2/VarScan.v2.4.2.jar
java <span class="nt">-jar</span> ~/bin/VarScan.v2.4.2.jar

</code></pre></div></div>

<h4 id="install-fasplit">Install <a href="https://open.bioqueue.org/home/knowledge/showKnowledge/sig/ucsc-fasplit">faSplit</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>conda create <span class="nt">-n</span> fasplit_env bioconda::ucsc-fasplit
<span class="nb">source </span>activate fasplit_env
conda activate fasplit_env
<span class="c">#faSplit</span>
conda deactivate

</code></pre></div></div>
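<p><code class="language-plaintext highlighter-rouge">conda create</code> stops with a prompt or an error if the environment already exists, which matters when the setup is repeated on a draft instance. A hedged sketch of a guard follows; the function names are hypothetical, and it assumes the usual <code class="language-plaintext highlighter-rouge">conda env list</code> layout in which the environment name is the first column.</p>

```shell
#!/usr/bin/env bash
# Succeed if the named environment appears in `conda env list`-style text on stdin.
# Header lines start with '#'; on other lines the first field is the environment name.
env_listed() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit (found ? 0 : 1) }'
}

# Create the environment only when it is missing (assumes conda is on PATH).
ensure_fasplit_env() {
  if conda env list | env_listed fasplit_env; then
    echo "fasplit_env already exists"
  else
    conda create -y -n fasplit_env bioconda::ucsc-fasplit
  fi
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">ensure_fasplit_env</code> is then safe to repeat. The same pattern would apply to the other environments created in this guide.</p>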
<h3 id="install-packages-for-single-cell-atac-seq-lab">Install packages for single-cell ATAC-seq lab</h3>
<p>To prevent dependency conflicts, install the packages for this lab in a conda environment.</p>

<p>Packages:</p>
<ul>
  <li><a href="https://kzhang.org/SnapATAC2/">SnapATAC2</a></li>
  <li><a href="https://scanpy.readthedocs.io/en/stable/">scanpy</a></li>
  <li><a href="https://pypi.org/project/MACS2/">macs2</a></li>
  <li><a href="https://github.com/KrishnaswamyLab/MAGIC">MAGIC</a></li>
  <li><a href="https://deeptools.readthedocs.io/en/develop/index.html">deepTools</a></li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>conda create <span class="nt">--name</span> snapatac2_env <span class="nv">python</span><span class="o">=</span>3.11
<span class="nb">source </span>activate snapatac2_env
conda activate snapatac2_env

pip <span class="nb">install </span>snapatac2
<span class="c">#pip show snapatac2 </span>
pip <span class="nb">install </span>scanpy
pip <span class="nb">install </span>MACS2
pip <span class="nb">install</span> <span class="nt">--user</span> magic-impute
pip <span class="nb">install </span>deeptools 

conda deactivate

</code></pre></div></div>
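<p>A quick import test is one way to confirm that the pip installs landed in the environment that is actually active. The sketch below is illustrative only: <code class="language-plaintext highlighter-rouge">check_imports</code> and the <code class="language-plaintext highlighter-rouge">PYTHON_BIN</code> override are hypothetical names, and the module names passed to it must be the import names, which can differ from the pip package names.</p>

```shell
#!/usr/bin/env bash
# Try to import each named module with the given interpreter and report the result.
# PYTHON_BIN is a hypothetical override for illustration; it defaults to `python`.
check_imports() {
  local py="${PYTHON_BIN:-python}" mod status=0
  for mod in "$@"; do
    if "$py" -c "import $mod" >/dev/null 2>&1; then
      echo "ok: $mod"
    else
      echo "MISSING: $mod"
      status=1
    fi
  done
  return "$status"
}
```

<p>With the environment activated, <code class="language-plaintext highlighter-rouge">check_imports snapatac2 scanpy</code> returns a non-zero status if either module cannot be imported.</p>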
<p>To use a conda environment in Jupyter nbclassic, a few extra setup steps are needed:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Step 1: Activate the Conda Environment of interest:</span>
conda activate snapatac2_env
<span class="c">#Step 2: Install Ipykernel: </span>
conda <span class="nb">install </span>ipykernel
<span class="c">#Step 3: Create a Jupyter Kernel for the environment</span>
python <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span><span class="o">=</span>snapatac2_env_kernel
</code></pre></div></div>
<p>Then run Jupyter as usual:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter nbclassic
</code></pre></div></div>
<p>Access the server by entering the following address in your browser:
http://(your AWS public dns):8888/ 
You can either create a notebook with the desired environment kernel, or create a notebook with the default ipykernel and change the kernel from within the notebook.</p>
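<p>The public DNS name needed for that address can be looked up from inside the instance. The following is a sketch, assuming the instance metadata service is reachable and using its token-based (IMDSv2) flow; <code class="language-plaintext highlighter-rouge">jupyter_url</code> and <code class="language-plaintext highlighter-rouge">public_dns</code> are hypothetical helper names.</p>

```shell
#!/usr/bin/env bash
# Build the browser address for a Jupyter server on the given host (port defaults to 8888).
jupyter_url() {
  local host="$1" port="${2:-8888}"
  printf 'http://%s:%s/\n' "$host" "$port"
}

# Read this instance's public DNS name from the EC2 instance metadata service.
# Only works when run on an EC2 instance; it is defined here but not called.
public_dns() {
  local token
  token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  curl -s -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/public-hostname"
}
```

<p>On the instance, <code class="language-plaintext highlighter-rouge">jupyter_url "$(public_dns)"</code> prints the address to paste into the browser. Port 8888 must be open in the security group.</p>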

<h4 id="install-atacseqqc">Install <a href="https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html">ATACseqQC</a></h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R
<span class="c">#if (!require("BiocManager", quietly = TRUE))</span>
    <span class="c">#install.packages("BiocManager")</span>
BiocManager::install<span class="o">(</span><span class="s2">"ATACseqQC"</span><span class="o">)</span>
quit<span class="o">(</span><span class="nv">save</span><span class="o">=</span><span class="s2">"no"</span><span class="o">)</span>
</code></pre></div></div>

<h3 id="install-packages-for-single-cell-rnaseq-lab">Install packages for single-cell RNAseq lab</h3>
<p>To prevent dependency conflicts, install the packages for this lab in a conda environment.</p>

<p>Packages:</p>
<ul>
  <li><a href="https://scanpy.readthedocs.io/en/stable/installation.html">scanpy leiden</a></li>
  <li><a href="https://pypi.org/project/gtfparse/1.2.1/">gtfparse 1.2.1</a></li>
  <li><a href="https://github.com/swolock/scrublet">scrublet</a></li>
  <li><a href="https://github.com/alugowski/fast_matrix_market/tree/main/python">fast_matrix_market</a></li>
  <li><a href="https://github.com/lilab-bcb/harmony-pytorch">harmony-pytorch</a></li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>conda create <span class="nt">--name</span> scRNAseq_env <span class="nv">python</span><span class="o">=</span>3.11
<span class="nb">source </span>activate scRNAseq_env
conda activate scRNAseq_env
pip <span class="nb">install</span> <span class="s1">'scanpy[leiden]'</span>
pip <span class="nb">install </span><span class="nv">gtfparse</span><span class="o">==</span>1.2.0
pip <span class="nb">install </span>scrublet
pip <span class="nb">install </span>fast_matrix_market
pip <span class="nb">install </span>harmony-pytorch

conda <span class="nb">install </span>ipykernel
python <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span><span class="o">=</span>scRNAseq_env_kernel
conda deactivate
</code></pre></div></div>

<h3 id="install-packages-for-variant-annotation-and-python-visualization-lab">Install packages for variant annotation and Python visualization lab</h3>

<p>Packages:</p>
<ul>
  <li><a href="https://github.com/pyenv/pyenv">pyenv</a></li>
  <li><a href="https://virtualenv.pypa.io/en/latest/installation.html">virtualenv</a></li>
  <li><a href="https://pypi.org/project/beautifulsoup4/">beautifulsoup4</a></li>
  <li><a href="https://pypi.org/project/requests/">requests</a></li>
  <li><a href="https://github.com/KarchinLab/open-cravat/issues/98">PyVCF</a></li>
  <li><a href="https://pypi.org/project/vcfpy/">vcfpy</a></li>
  <li><a href="https://github.com/vcftools/vcftools">vcftools</a></li>
  <li><a href="https://github.com/pysam-developers/pysam">Pysam</a></li>
  <li><a href="https://pypi.org/project/civicpy/">civicpy</a></li>
  <li><a href="https://pandas.pydata.org/docs/getting_started/install.html">pandas</a></li>
  <li><a href="https://jqlang.github.io/jq/download/">jq</a></li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#pyenv</span>
<span class="c">#note: after installation, add pyenv configs in .bashrc (see below)</span>
<span class="nb">cd</span> ~/bin
curl https://pyenv.run | bash

<span class="c">#virtualenv</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>pipx
pipx <span class="nb">install </span>virtualenv

<span class="c">#beautifulsoup4</span>
pip <span class="nb">install </span>beautifulsoup4

<span class="c">#requests</span>
pip <span class="nb">install </span>requests

<span class="c">#vcfpy</span>
<span class="c">#installation note for vcfpy: https://github.com/KarchinLab/open-cravat/issues/98</span>
conda <span class="nb">install</span> <span class="nt">-c</span> bioconda open-cravat
pip <span class="nb">install </span>vcfpy

<span class="c">#vcftools</span>
<span class="c">#installation note: https://github.com/vcftools/vcftools/issues/188</span>
<span class="nb">cd</span> ~/bin
<span class="nb">sudo </span>apt-get <span class="nb">install </span>autoconf
git clone https://github.com/vcftools/vcftools.git
<span class="nb">cd </span>vcftools
./autogen.sh
./configure
make
<span class="nb">sudo </span>make <span class="nb">install</span>

<span class="c">#pysam</span>
<span class="c">#conda config --show channels #to see if 3 needed channels are already configured in the conda environment. if not, add:</span>
<span class="c">#conda config --add channels defaults</span>
<span class="c">#conda config --add channels conda-forge</span>
<span class="c">#conda config --add channels bioconda</span>
<span class="c">#conda install pysam</span>
pip <span class="nb">install </span>pysam
<span class="c">#installed at: ./bin/anaconda3/lib/python3.11/site-packages</span>

<span class="c">#civicpy</span>
pip <span class="nb">install </span>civicpy

<span class="c">#pandas</span>
pip <span class="nb">install </span>pandas

<span class="c">#jq</span>
<span class="nb">sudo </span>apt-get <span class="nb">install </span>jq

</code></pre></div></div>

<p>Add the pyenv configuration to .bashrc:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## pyenv configs</span>
<span class="nb">export </span><span class="nv">PYENV_ROOT</span><span class="o">=</span><span class="s2">"/home/ubuntu/bin/.pyenv"</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$PYENV_ROOT</span><span class="s2">/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>

<span class="k">if </span><span class="nb">command</span> <span class="nt">-v</span> pyenv 1&gt;/dev/null 2&gt;&amp;1<span class="p">;</span> <span class="k">then
  </span><span class="nb">eval</span> <span class="s2">"</span><span class="si">$(</span>pyenv init -<span class="si">)</span><span class="s2">"</span>
<span class="k">fi</span>
</code></pre></div></div>

<h3 id="path-setup">Path setup</h3>

<p>For the 2021 version of the course, rather than exporting each tool’s individual path, I moved all of the tool subdirectories to ~/src and copied all of the binaries from there to ~/bin so that PATH is less complex.</p>
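<p>The consolidation described here could be scripted roughly as follows. This is a sketch rather than the procedure that was actually used: <code class="language-plaintext highlighter-rouge">collect_binaries</code> is a hypothetical helper, the search depth of 3 is an arbitrary assumption, and files with the same name in different tools will overwrite one another.</p>

```shell
#!/usr/bin/env bash
# Copy every user-executable regular file found under a source tree into one directory.
collect_binaries() {
  local src="$1" dest="$2"
  mkdir -p "$dest"
  # -perm -u+x selects files whose owner-execute bit is set
  find "$src" -maxdepth 3 -type f -perm -u+x -exec cp {} "$dest"/ \;
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">collect_binaries ~/src ~/bin</code>, after which a single <code class="language-plaintext highlighter-rouge">export PATH=/home/ubuntu/bin:$PATH</code> line in .bashrc is enough. Tools that rely on files next to their executable would still need their own directory on the PATH.</p>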

<h3 id="set-up-apache-web-server">Set up Apache web server</h3>

<p>We will start an apache2 service and serve the contents of the students’ home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by URL, etc. Note that when launching instances, a security group that allows HTTP access via port 80 will have to be selected or modified.</p>

<ul>
  <li>Edit config to allow files to be served from outside /usr/share and /var/www</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>vim /etc/apache2/apache2.conf
</code></pre></div></div>
<p>Add the following content to apache2.conf</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;Directory /workspace&gt;
       Options Indexes FollowSymLinks
       AllowOverride None
       Require all granted
&lt;/Directory&gt;
</code></pre></div></div>

<ul>
  <li>Edit vhost file
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>vim /etc/apache2/sites-available/000-default.conf
</code></pre></div>    </div>
    <p>Change document root in 000-default.conf to ‘/workspace’</p>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DocumentRoot /workspace
</code></pre></div>    </div>
  </li>
  <li>Restart apache</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>service apache2 restart
</code></pre></div></div>

<p>To check that the server works, enter http://[public ip address of ec2 instance] in a browser of your choice. You should see the contents of /workspace.</p>
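<p>The same check can be made from the command line before opening a browser. The sketch below assumes the default Ubuntu vhost path used above; <code class="language-plaintext highlighter-rouge">document_root</code> is a hypothetical helper that simply reads the configured value back.</p>

```shell
#!/usr/bin/env bash
# Print the DocumentRoot value(s) configured in an Apache vhost file.
document_root() {
  awk '$1 == "DocumentRoot" { print $2 }' "$1"
}

# Only runs on a machine where the default vhost file exists.
vhost=/etc/apache2/sites-available/000-default.conf
if [ -f "$vhost" ]; then
  echo "DocumentRoot is: $(document_root "$vhost")"
fi
```

<p>If this prints <code class="language-plaintext highlighter-rouge">/workspace</code>, running <code class="language-plaintext highlighter-rouge">sudo apache2ctl configtest</code> followed by <code class="language-plaintext highlighter-rouge">curl -sI http://localhost/ | head -n 1</code> should confirm that the configuration parses and that the server responds.</p>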

<h3 id="save-a-public-ami">Save a public AMI</h3>

<p>Finally, save the instance as a new AMI by right clicking the instance and clicking on “Create Image”. Enter an appropriate name and description and then save. If desired, you may choose at this time to include the workspace snapshot in the AMI to avoid having to explicitly attach it later at launching of AMI instances. Change the permissions of the AMI to “public” if you would like it to be listed under the Community AMIs. Copy the AMI to any additional regions where you would like it to appear in Community AMI searches.</p>

<h3 id="current-public-amis">Current Public AMIs</h3>

<ul>
  <li>cshl-seqtec-2022 (ami-09b613ae9751a96b1; N. Virginia)</li>
  <li>cbw-rnabio-2023 (ami-09b3fd07d90812201; N. Virginia)</li>
  <li>cshl-seqtec-2023 (ami-05d41e9b8c7eee2df; N. Virginia)</li>
  <li>cshl-seqtec-2024 (ami-00029a06cacbe647c; N. Virginia)</li>
  <li>cshl-seqtec-2025 (also named cshl_2025_AMI_final; ami-027b72b97520101bd; N. Virginia)</li>
</ul>

<h3 id="create-iam-account">Create IAM account</h3>

<p>From AWS Console select Services -&gt; IAM. Go to Users, Create User, specify a user name, and Create. Download credentials to a safe location for later reference if needed. Select the new user and go to Security Credentials -&gt; Manage Password -&gt; ‘Assign a Custom Password’. Go to Groups -&gt; Create a New Group, specify a group name and Next. Attach a policy to the group. In this case we give all EC2 privileges but no other AWS privileges by specifying “AmazonEC2FullAccess”. Hit Next, review and then Create Group. Select the Group -&gt; Add Users to Group, select your new user to add it to the new group.</p>

<h3 id="launch-student-instance">Launch student instance</h3>

<ol>
  <li>Go to AWS console. Login. Select EC2.</li>
  <li>Launch Instance, search for “cshl-seqtec-2025” in Community AMIs and Select.</li>
  <li>Choose “m6a.xlarge” instance type.</li>
  <li>Select one instance to launch (e.g., one per student and instructor), and select “Protect against accidental termination”</li>
  <li>Make sure that you see two snapshots (e.g., the 60GB root volume (gp3) and 500GB EBS volume (gp3) you set up earlier). Tick the boxes for “Delete on termination” for both.</li>
  <li>Create a tag with Name=StudentName</li>
  <li>Choose the existing security group called “SSH/HTTP/Jupyter/Rstudio - with outbound rule”. Review and Launch.</li>
  <li>Choose an existing key pair (cshl_2025_student.pem)</li>
  <li>View instances and wait for them to finish initiating.</li>
  <li>Find your instance in console and select it, then hit connect to get your public.ip.address.</li>
  <li>Login to node <code class="language-plaintext highlighter-rouge">ssh -i cshl_2025_student.pem ubuntu@[public.ip.address]</code>.</li>
  <li>Optional - set up DNS redirects (see below)</li>
</ol>
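<p>When many student instances are needed, the console steps above can be approximated with the AWS CLI. The following dry-run sketch only prints the commands. The security group ID is a placeholder, the key pair name is assumed to match the .pem file name, and the volume “Delete on termination” settings from step 5 are not covered.</p>

```shell
#!/usr/bin/env bash
# Values taken from the steps above; SECURITY_GROUP_ID is a placeholder, not a real ID.
AMI_ID="ami-027b72b97520101bd"      # cshl-seqtec-2025 (N. Virginia, i.e. us-east-1)
INSTANCE_TYPE="m6a.xlarge"
KEY_NAME="cshl_2025_student"        # assumed to match cshl_2025_student.pem
SECURITY_GROUP_ID="sg-REPLACE_ME"

# Print (do not run) the AWS CLI call that would launch one tagged, protected instance.
launch_command() {
  local student="$1"
  printf 'aws ec2 run-instances --region us-east-1 --image-id %s --instance-type %s --key-name %s --security-group-ids %s --count 1 --disable-api-termination --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=%s}]"\n' \
    "$AMI_ID" "$INSTANCE_TYPE" "$KEY_NAME" "$SECURITY_GROUP_ID" "$student"
}

for s in student01 student02; do
  launch_command "$s"
done
```

<p>Reviewing the printed commands before running them keeps an accidental loop from launching billable instances.</p>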

<h3 id="set-up-a-dynamic-dns-service">Set up a dynamic DNS service</h3>

<p>Rather than handing out the IP address of each student instance to each student, you can set up DNS records that redirect from a more human-readable name to the IP address. After spinning up all student instances, use a service like <a href="http://dyn.com">http://dyn.com</a> (or <a href="http://entrydns.net">http://entrydns.net</a>, etc.) to create hostnames like rna01.dyndns.org, rna02.dyndns.org, etc. that point to the public IP address of each student instance.</p>
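<p>Whichever DNS service is used, it helps to prepare the hostname-to-address mapping in one place first. A minimal sketch, assuming the public IP addresses have been collected one per line; <code class="language-plaintext highlighter-rouge">hostname_table</code> is a hypothetical helper and the addresses shown are documentation-range examples, not real instances.</p>

```shell
#!/usr/bin/env bash
# Pair sequential hostnames (rna01, rna02, ...) with IP addresses read from stdin.
# Output is tab-separated: hostname, then IP address. Blank lines are skipped.
hostname_table() {
  local domain="${1:-dyndns.org}" n=0 ip
  while IFS= read -r ip; do
    [ -n "$ip" ] || continue
    n=$((n + 1))
    printf 'rna%02d.%s\t%s\n' "$n" "$domain" "$ip"
  done
}

printf '203.0.113.10\n203.0.113.11\n' | hostname_table
```

<p>The resulting table can be used to fill in the DNS records and doubles as the list handed out to students.</p>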

<h3 id="host-necessary-files-for-the-course">Host necessary files for the course</h3>

<p>Currently, all miscellaneous data files, annotations, etc. are hosted on an ftp server at the Genome Institute. In the future more data files could be pre-loaded onto the EBS snapshot.</p>

<ul>
  <li>Files copied to: /gscmnt/sata102/info/ftp-staging/pub/rnaseq/</li>
  <li>Appear here: <a href="http://genome.wustl.edu/pub/rnaseq/">http://genome.wustl.edu/pub/rnaseq/</a></li>
</ul>

<h3 id="after-course-reminders">After course reminders</h3>

<ul>
  <li>Delete the student IAM account created above; otherwise, students will continue to have EC2 privileges.</li>
  <li>Terminate all instances and clean up any unnecessary volumes, snapshots, etc.</li>
</ul>]]></content><author><name>Zachary Skidmore</name></author><category term="Module-09-Appendix" /><summary type="html"><![CDATA[Preamble - Amazon AWS/AMI setup for use in workshop This tutorial explains how Amazon cloud instances were configured for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers that wish to conduct their hands on exercises on Amazon AWS. A helpful introduction of AWS can be found here Table of Contents Create AWS account Set up security group (if needed) Start with existing community AMI Perform basic linux configuration Add ubuntu user to docker group Set up additional storage for workspace Install any desired informatics tools Install RNA-seq software Create directory to install software to and setup path variables Install SAMtools Install bam-readcount Install HISAT2 Install StringTie Install gffcompare Install HTSeq Make sure that OpenSSL is on correct version Install TopHat Install kallisto Install FastQC Install Fastp Install MultiQC Install Picard Install Flexbar Install Regtools Install RSeQC Install bedops Install gtfToGenePred Install genePredToBed Install how_are_we_stranded_here Install Cell Ranger Install TABIX Install BWA Install bedtools Install BCFtools Install htslib Install peddy Install slivar Install STRling Install freebayes Install vcflib Install Anaconda Install VEP Set up Jupyter to render in web brower Install R R Libraries Bioconductor libraries Install Sleuth Install softwares for germline analyses Install gatk Install minimap2 Install NanoPlot Install Varscan Install faSplit Install packages for single-cell ATAC-seq lab Install ATACseqQC Install packages for single-cell RNAseq lab Install packages for Variant annotation and python visualization lab Path setup Set up Apache web server Save a public AMI Current Public AMIs Create IAM account Launch student instance Set up a dynamic DNS service Host necessary files for the course After course 
reminders Create AWS account Create a new gmail account to use for the course Use the above email account to set up a new AWS/Amazon user account. Note: Any AWS account needs to be linked to an actual person and credit card account. Optional - Set up an IAM account. Give this account full EC2 but no other permissions. This provides an account that can be shared with other instructors but does not have access to billing and other root account privelages. Request limit increase for limit types you will be using. You need to be able to spin up at least one instance of the desired type for every student and TA/instructor. See: http://aws.amazon.com/about-aws/whats-new/2014/06/19/amazon-ec2-service-limits-report-now-available/. Note: You need to request an increase for each instance type and region you might use. Sign into AWS Management Console: http://aws.amazon.com/console/ Go to EC2 services Set up security group (if needed) In general if no new web server is needed, you may pick an existing security group. The security group used for 2025 was “SSH/HTTP/Jupyter/Rstudio - with outbound rule”. It has the following inbound and outbound rules that allows for connection to the Jupyter and R studio servers: Rule type IP version Type Protocol Port range Source / Destination Inbound Rule IPv4 Custom TCP TCP 8888 0.0.0.0/0   IPv4 Custom TCP TCP 8787 0.0.0.0/0   IPv4 HTTP TCP 80 0.0.0.0/0   IPv4 HTTPS TCP 443 0.0.0.0/0   IPv6 HTTP TCP 80 ::/0   IPv4 SSH TCP 22 0.0.0.0/0 Outbound Rule IPv4 All traffic All All 0.0.0.0/0 Start with existing community AMI Set up a Ubuntu instance 1) Launch a fresh Ubuntu Instance (Ubuntu Server 22.04 LTS at the time of writing this). 2) Choose an instance type of m6a.xlarge. 3) Increase root volume (e.g., 60GB)(type:gp3) and add a second volume (e.g., 500GB)(type:gp3). 4) Choose appropriate security group (for 2025 course, choose security group “SSH/HTTP/Jupyter/Rstudio - with outbound rule”). 
5) If necessary, create a new key pair, name and save it locally somewhere safe. 5) Review and Launch. Select ‘View Instances’. Take note of public IP address of newly launched instance. Change permissions on downloaded key pair with chmod 400 [instructor-key].pem Login to instance with ubuntu user: ssh -i [instructor-key].pem ubuntu@[public.ip.address] Note for TAs when setting up these instances Usually the instances are setup in the following sequence: 1) Email instructors to request any packages/programs that needs to be installed. 2) Make a draft instance with the instructions above, configure and install everything. 3) Make an AMI. 4) Make an instructor instance by launching a new Ubuntu instance with the same specifications above, but using the AMI. 5) Inform instructors and TAs to test their code on the instructor instance, send them the instructor key pem file. 6) Modify the draft instance as needed. 7) Once finalized, create an AMI. This AMI will be distributed to students for the course. Perform basic linux configuration To allow installation of bioinformatics tools some basic dependencies must be installed first. sudo apt-get update sudo apt-get upgrade sudo apt-get -y install make gcc zlib1g-dev libncurses5-dev libncursesw5-dev git cmake build-essential unzip python3-numpy python3-dev python3-pip python-is-python3 gfortran libreadline-dev default-jdk libx11-dev libxt-dev xorg-dev libxml2-dev apache2 csh ruby-full gnuplot cpanminus libssl-dev gcc g++ gsl-bin libgsl-dev apt-transport-https software-properties-common meson libvcflib-dev libjsoncpp-dev libtabixpp-dev libbz2-dev docker.io libpcre2-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libdbi-perl libdbd-mysql-perl libcairo2-dev sudo ln -s /usr/include/jsoncpp/json/ /usr/include/json sudo timedatectl set-timezone America/New_York logout and log back in for changes to take affect. 
Add ubuntu user to docker group sudo usermod -aG docker ubuntu Then exit shell and log back into instance. Set up additional storage for workspace We first need to setup the additional storage volume that we added when we created the instance. # Create mountpoint for additional storage volume cd / sudo mkdir workspace # Mount ephemeral storage cd sudo mkfs -t ext4 /dev/nvme1n1 sudo mount /dev/nvme1n1 /workspace In order to make the workspace volume persistent, we need to edit the etc/fstab file in order. AWS provides instructions for how to do this here. # Make ephemeral storage mounts persistent # See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html for guidance on setting up fstab records for AWS # get UUID from sudo lsblk -f UUID=$(sudo lsblk -f | grep nvme1n1 | awk {'print $4'}) #if want to double check, can do 'echo $UUID' to see the UUID. #then add that UUID to /etc/fstab echo -e "LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0\nUUID=$UUID /workspace ext4 defaults,nofail 0 2" | sudo tee /etc/fstab #'less /etc/fstab' , to see if the new line has been added # Change permissions on required drives sudo chown -R ubuntu:ubuntu /workspace # Create symlink to the added volume in your home directory cd ~ ln -s /workspace workspace Install any desired informatics tools NOTE: R in particular is a slow install. NOTE: - All tools should be installed locally (e.g., /home/ubuntu/bin/) in a different location from where students will install tools in their exercises. Paths to pre-installed tools can be added to the .bashrc file. A template .bashrc file: https://github.com/griffithlab/rnaseq_tutorial/blob/master/setup/.bashrc For the draft instance we are setting up, it will be helpful to copy contents from this file directly into the .bashrc file: http://genomedata.org/rnaseq-tutorial/bashrc_copy. Add additional tool paths on top of this. 
NOTE: (This didn’t happen during installation for the year 2023, but) In some installations of R there is an executable called pager that clashes with the system pager. This causes man to fail. Check with a man ls and if the problem exists, add the following to .bashrc: export MANPAGER=less Install RNA-seq software These install instructions should be identical to those found on https://github.com/griffithlab/rnaseq_tutorial/wiki/Installation except that each tool is installed in /home/ubuntu/bin/ and its install location is exported to the $PATH variable for easy access. Create directory to install software to and setup path variables mkdir ~/bin cd bin WORKSPACE=/home/ubuntu/workspace HOME=/home/ubuntu Install SAMtools ~/bin wget https://github.com/samtools/samtools/releases/download/1.18/samtools-1.18.tar.bz2 bunzip2 samtools-1.18.tar.bz2 tar -xvf samtools-1.18.tar cd samtools-1.18 make ./samtools #add the following line to .bashrc export PATH=/home/ubuntu/bin/samtools-1.18:$PATH export SAMTOOLS_ROOT=/home/ubuntu/bin/samtools-1.18 Install bam-readcount cd ~/bin git clone https://github.com/genome/bam-readcount cd bam-readcount mkdir build cd build cmake .. 
make export PATH=/home/ubuntu/bin/bam-readcount/build/bin:$PATH Install HISAT2 uname -m cd ~/bin curl -s https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download &gt; hisat2-2.2.1-Linux_x86_64.zip unzip hisat2-2.2.1-Linux_x86_64.zip cd hisat2-2.2.1 ./hisat2 -h export PATH=/home/ubuntu/bin/hisat2-2.2.1:$PATH Install StringTie cd ~/bin wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.tar.gz tar -xzvf stringtie-2.2.1.tar.gz cd stringtie-2.2.1 make release export PATH=/home/ubuntu/bin/stringtie-2.2.1:$PATH Install gffcompare cd ~/bin wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.12.6.Linux_x86_64.tar.gz tar -xzvf gffcompare-0.12.6.Linux_x86_64.tar.gz cd gffcompare-0.12.6.Linux_x86_64/ ./gffcompare export PATH=/home/ubuntu/bin/gffcompare-0.12.6.Linux_x86_64:$PATH Install HTSeq pip install HTSeq # to check version of HTSeq # pip show HTSeq Make sure that OpenSSL is on correct version TopHat will not install if the version of OpenSSL is too old. To get version: openssl version If version is OpenSSL 1.1.1f, then it needs to be updated using the following steps. cd ~/bin wget https://www.openssl.org/source/openssl-1.1.1g.tar.gz tar -zxf openssl-1.1.1g.tar.gz &amp;&amp; cd openssl-1.1.1g ./config make make test sudo mv /usr/bin/openssl ~/tmp #in case install goes wrong sudo make install sudo ln -s /usr/local/bin/openssl /usr/bin/openssl sudo ldconfig Again, from the terminal issue the command: openssl version Your output should be as follows: OpenSSL 1.1.1g 21 Apr 2020 Then create ~/.wgetrc file and add to it ca_certificate=/etc/ssl/certs/ca-certificates.crt using vim or nano. 
Install TopHat cd ~/bin wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz tar -zxvf tophat-2.1.1.Linux_x86_64.tar.gz cd tophat-2.1.1.Linux_x86_64 ./gtf_to_fasta export PATH=$/home/ubuntu/bin/tophat-2.1.1.Linux_x86_64:$PATH Install kallisto Note: There are a couple of arguments only supported in kallisto legacy versions (version before 0.50.0). Also how_are_we_stranded_here uses kallisto == 0.44.x. Thus, installation steps below if for 1 of the legacy versions. But if run into problem, consider using a more updated version. cd ~/bin wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz tar -zxvf kallisto_linux-v0.44.0.tar.gz cd kallisto_linux-v0.44.0 ./kallisto export PATH=/home/ubuntu/bin/kallisto_linux-v0.44.0:$PATH Install FastQC sudo apt-get install -y default-jre cd ~/bin sudo wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip sudo unzip fastqc_v0.12.1.zip sudo chmod 755 fastqc ./fastqc --help export PATH=/home/ubuntu/bin/FastQC:$PATH Install Fastp cd /home/ubuntu/bin wget http://opengene.org/fastp/fastp chmod a+x ./fastp ./fastp # Add fastp to bashrc export PATH=/home/ubuntu/bin/fastp:$PATH Install MultiQC cd ~/bin pip3 install multiqc export PATH=/home/ubuntu/.local/bin:$PATH multiqc --help Install Picard cd ~/bin wget https://github.com/broadinstitute/picard/releases/download/2.26.4/picard.jar -O picard.jar java -jar ~/bin/picard.jar export PICARD=/home/ubuntu/bin/picard.jar Install Flexbar sudo apt install flexbar Install Regtools cd ~/bin git clone https://github.com/griffithlab/regtools cd regtools/ mkdir build cd build/ cmake .. 
make ./regtools export PATH=/home/ubuntu/bin/regtools/build:$PATH Install RSeQC pip3 install RSeQC ~/.local/bin/read_GC.py export PATH=~/.local/bin/:$PATH Install bedops cd ~/bin mkdir bedops_linux_x86_64-v2.4.41 cd bedops_linux_x86_64-v2.4.41 wget -c https://github.com/bedops/bedops/releases/download/v2.4.41/bedops_linux_x86_64-v2.4.41.tar.bz2 tar -jxvf bedops_linux_x86_64-v2.4.41.tar.bz2 ./bin/bedops export PATH=~/bin/bedops_linux_x86_64-v2.4.41/bin:$PATH Install gtfToGenePred cd ~/bin mkdir gtfToGenePred cd gtfToGenePred wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred chmod a+x gtfToGenePred ./gtfToGenePred export PATH=/home/ubuntu/bin/gtfToGenePred:$PATH Install genePredToBed cd ~/bin mkdir genePredtoBed cd genePredtoBed wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed chmod a+x genePredToBed ./genePredToBed export PATH=/home/ubuntu/bin/genePredtoBed:$PATH #note: the path has lowercase 't' at in 'genePredtoBed' #genePredToBed Install how_are_we_stranded_here pip3 install git+https://github.com/kcotto/how_are_we_stranded_here.git check_strandedness Install Cell Ranger Check if it has been updated Must register to get download link cd ~/bin wget -O cellranger-9.0.1.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-9.0.1.tar.gz?Expires=1761476807&amp;Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&amp;Signature=EzblHLl1V4c8zY~f1gqgkJDfwXHdpXeUUsoqiipDGTTRc1cDXifB5CZdV2te0aGah4VoEUssh8ERdRpmLzMNZzSBdsjmX9t6FE3PwZ83c-cOKAVJK3-v7z8GhMH9HZMqVxEb3w6SwztkZmipGhCyG95yT2fv-sNQZssJmUEzx8Wnc4t69iS~u0uMrd4rj9EDkyw6CYCfkRGGoxCclUAyPqMBQGRbcCogVlSoVk6sc~UsD5vpXXoxlgoVRrThdEZ4DpUUtInLST8cvtS127nrWIJcJ1e1Jk8dSndpKvAHEwTGB~U21oyiOb8lZhXVsY7VCXKsvivRKaRXWj0kh8whOA__" tar -xzvf cellranger-9.0.1.tar.gz export PATH=/home/ubuntu/bin/cellranger-9.0.1:$PATH Install TABIX sudo apt-get install tabix Install BWA cd ~/bin git clone https://github.com/lh3/bwa.git cd bwa make export PATH=/home/ubuntu/bin/bwa:$PATH #bwa mem #to call bwa 
Install bedtools cd ~/bin wget https://github.com/arq5x/bedtools2/releases/download/v2.31.0/bedtools-2.31.0.tar.gz tar -zxvf bedtools-2.31.0.tar.gz cd bedtools2 make export PATH=/home/ubuntu/bin/bedtools2/bin:$PATH Install BCFtools cd ~/bin wget https://github.com/samtools/bcftools/releases/download/1.18/bcftools-1.18.tar.bz2 bunzip2 bcftools-1.18.tar.bz2 tar -xvf bcftools-1.18.tar cd bcftools-1.18 make ./bcftools export PATH=/home/ubuntu/bin/bcftools-1.18:$PATH Install htslib cd ~/bin wget https://github.com/samtools/htslib/releases/download/1.18/htslib-1.18.tar.bz2 bunzip2 htslib-1.18.tar.bz2 tar -xvf htslib-1.18.tar cd htslib-1.18 make sudo make install #htsfile --help export PATH=/home/ubuntu/bin/htslib-1.18:$PATH Install peddy cd ~/bin git clone https://github.com/brentp/peddy cd peddy pip install -r requirements.txt pip install --editable . Install slivar cd ~/bin wget https://github.com/brentp/slivar/releases/download/v0.3.0/slivar chmod +x ./slivar Install STRling cd ~/bin wget https://github.com/quinlan-lab/STRling/releases/download/v0.5.2/strling chmod +x ./strling Install freebayes sudo apt install freebayes Install vcflib sudo apt install libvcflib-tools libvcflib-dev Install Anaconda cd ~/bin wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh bash Anaconda3-2023.09-0-Linux-x86_64.sh Press Enter to review the license agreement. Then press and hold Enter to scroll. Enter “yes” to agree to the license agreement. Saved the installation to /home/ubuntu/bin/anaconda3 and chose yes to initializng Anaconda3. Add in bashrc: export PATH=/home/ubuntu/bin/anaconda3/bin:$PATH To see location of conda executable: which conda Install VEP Note: Install VEP in workspace because cache file for that takes a lot of space (~25G). Describes dependencies for VEP 110, used in this course for variant annotation. When running the VEP installer follow the prompts specified: Do you want to install any cache files (y/n)? 
n (In case want to install cache file, choose ‘y’ [ENTER] (select number for homo_sapiens_vep_110_GRCh38.tar.gz) [ENTER] ) Do you want to install any FASTA files (y/n)? y [ENTER] (select number for homo_sapiens) [ENTER] Do you want to install any plugins (y/n)? n [ENTER] cd ~/workspace sudo git clone https://github.com/Ensembl/ensembl-vep.git cd ensembl-vep sudo perl -MCPAN -e'install "LWP::Simple"' sudo perl INSTALL.pl --CACHEDIR ~/workspace/ensembl-vep/ export PATH=/home/ubuntu/workspace/ensembl-vep:$PATH #vep --help Set up Jupyter to render in web brower Followed this website and this website Note: The old jupyter notebook was split into jupyter-server and nbclassic. The steps to set up jupyter on ec2 in the first link therefore have been adapted based on suggestions in the second link to accommodate this migration. First, we need to add Jupyter to the system’s path (you can check if it is already on the path by running: which python, if no path is returned you need to add the path) To add Jupyter functionality to your terminal, add the following line of code to your .bashrc file: export PATH=/home/ubuntu/bin/anaconda3/bin:$PATH Then you need to source the .bashrc for changes to take effect. source .bashrc We then need to create our Jupyter configuration file. In order to create that file, you need to run: jupyter notebook --generate-config Optional: After creating your configuration file, you will need to generate a password for your Jupyter Notebook using ipython: Enter the IPython command line: ipython Now follow these steps to generate your password: from notebook.auth import passwd passwd() You will be prompted to enter and re-enter your password. IPython will then generate a hash output, COPY THIS AND SAVE IT FOR LATER. We will need this for our configuration file. Run exit in order to exit IPython. 
Next, go into your jupyter config file (/home/ubuntu/.jupyter/jupyter_server_config.py): cd .jupyter vim jupyter_notebook_config.py And add the following code at the beginning of the document: c = get_config() #add this line if it's not already in jupyter_notebook_config.py c.ServerApp.ip = '0.0.0.0' #c.ServerApp.password = u'YOUR PASSWORD HASH' #uncomment this line if decide to use password c.ServerApp.port = 8888 Optional: We then need to create a directory for your notebooks. In order to make a folder to store all of your Jupyter Notebooks simply run: mkdir Notebooks # You can call this folder anything, for this example we call it `Notebooks`. After the previous step, you should be ready to run your notebook and access your EC2 server. To run your Notebook simply run the command: jupyter nbclassic # to open jupyter lab on the instance jupyter lab From there you should be able to access your server by going to: http://(your AWS public dns):8888/ or http://(your AWS public dns):8888/(tree?token=... - in the message generated while running 'jupyter nbclassic') (Note: if ever run into problem accessing server, double check whether you are using http or https. If you didnt add https port in security group configuration step when create the instance, then you wouldn’t be able to access server with https.) 
Install R Follow this guide from cran website to install R ver 4.4 website # update indices sudo apt update -qq # install two helper packages we need sudo apt install --no-install-recommends software-properties-common dirmngr # add the signing key (by Michael Rutter) for these repos # To verify key, run gpg --show-keys /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc # Fingerprint: E298A3A825C0D65DFD57CBB651716619E084DAB9 wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc # add the R 4.0 repo from CRAN -- adjust 'focal' to 'groovy' or 'bionic' as needed sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/" #install R and its dependencies sudo apt install --no-install-recommends r-base Note, linking the R-patched bin directory into your PATH may cause weird things to happen, such as man pages or git log to not display. This can be circumvented by directly linking the R* executables (R, RScript, RCmd, etc.) into a PATH directory. R Libraries For this tutorial we require: devtools dplyr gplots ggplot2 sctransform Seurat RColorBrewer ggthemes cowplot data.table Rtsne gridExtra UpSetR tidyverse R install.packages(c("devtools","dplyr","gplots","ggplot2","sctransform","Seurat","RColorBrewer","ggthemes","cowplot","data.table","Rtsne","gridExtra","UpSetR","tidyverse"),repos="http://cran.us.r-project.org") quit(save="no") Note: if asked if want to install in personal library, type ‘yes’. 
Bioconductor libraries For this tutorial we require: genefilter ballgown edgeR GenomicRanges rhdf5 biomaRt scran sva gage org.Hs.eg.db DESeq2 apeglm R # Install Bioconductor if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(c("genefilter","ballgown","edgeR","GenomicRanges","rhdf5","biomaRt","scran","sva","gage","org.Hs.eg.db","DESeq2","apeglm","clusterProfiler","enrichplot","pathview")) quit(save="no") Install Sleuth R install.packages("devtools") devtools::install_github("pachterlab/sleuth") quit(save="no") Install softwares for germline analyses gatk minimap NanoPlot Varscan faSplit Install gatk (Note: in cshl2023 version of the course, install this gatk 4.2.1.0 instead of an more updated ver since this work with the current Java version - Java ver 11) cd ~/bin wget https://github.com/broadinstitute/gatk/releases/download/4.2.1.0/gatk-4.2.1.0.zip unzip gatk-4.2.1.0.zip export PATH=/home/ubuntu/bin/gatk-4.2.1.0:$PATH #add to .bashrc gatk --help gatk --list Install minimap2 cd ~/bin curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf - ./minimap2-2.26_x64-linux/minimap2 export PATH=/home/ubuntu/bin/minimap2-2.26_x64-linux:$PATH #add to .bashrc minimap2 --help Install NanoPlot pip install NanoPlot #which NanoPlot #NanoPlot -h Install Varscan cd ~/bin curl -L -k -o VarScan.v2.4.2.jar https://github.com/dkoboldt/varscan/releases/download/2.4.2/VarScan.v2.4.2.jar java -jar ~/bin/VarScan.v2.4.2.jar Install faSplit conda create -n fasplit_env bioconda::ucsc-fasplit source activate fasplit_env conda activate fasplit_env #faSplit conda deactivate Install packages for single-cell ATAC-seq lab To prevent dependencies conflicts, install packages for this lab in a conda environment. 
Packages: SnapATAC2 scanpy macs2 MAGIC deepTools conda create --name snapatac2_env python=3.11 source activate snapatac2_env conda activate snapatac2_env pip install snapatac2 #pip show snapatac2 pip install scanpy pip install MACS2 pip install --user magic-impute pip install deeptools conda deactivate To run virtual environment in jupyter nbclassic, there are a few extra set up steps: #Step 1: Activate the Conda Environment of interest: conda activate snapatac2_env #Step 2: Install Ipykernel: conda install ipykernel #Step 3: Create a Jupyter Kernel for the environment python -m ipykernel install --user --name=snapatac2_env_kernel ```bash Then run jupyter notebook as usual: ```bash jupyter nbclassic Access server by adding to the browser: http://(your AWS public dns):8888/ We can either create a notebook using desired environment kernel, or just create a notebook using the default ipykernel and change kernel within the notebook itself. Install ATACseqQC R #if (!require("BiocManager", quietly = TRUE)) #install.packages("BiocManager") BiocManager::install("ATACseqQC") quit(save="no") Install packages for single-cell RNAseq lab To prevent dependencies conflicts, install packages for this lab in a conda environment. 
Packages: scanpy leiden gtfparse 1.2.1 scrublet fast_matrix_market harmony-pytorch conda create --name scRNAseq_env python=3.11 source activate scRNAseq_env conda activate scRNAseq_env pip install 'scanpy[leiden]' pip install gtfparse==1.2.0 pip install scrublet pip install fast_matrix_market pip install harmony-pytorch conda install ipykernel python -m ipykernel install --user --name=scRNAseq_env_kernel conda deactivate Install packages for Variant annotation and python visualization lab Packages: pyenv virtualenv pyenv beautifulsoup4 requests PyVCF vcfpy vcftools Pysam civicpy pandas jq #pyenv #note: after installation, add pyenv configs in .bashrc (see below) cd ~/bin curl https://pyenv.run | bash #virtualenv sudo apt install pipx pipx install virtualenv #beautifulsoup4 pip install beautifulsoup4 #requests pip install requests #vcfpy #installation note for vcfpy: https://github.com/KarchinLab/open-cravat/issues/98 conda install -c bioconda open-cravat pip install vcfpy #vcftools #installation note: https://github.com/vcftools/vcftools/issues/188 cd ~/bin sudo apt-get install autoconf git clone https://github.com/vcftools/vcftools.git cd vcftools ./autogen.sh ./configure make sudo make install #pysam #conda config --show channels #to see if 3 needed channels are already configured in the conda environment. if not, add: #conda config --add channels defaults #conda config --add channels conda-forge #conda config --add channels bioconda #conda install pysam pip install pysam #installed at: ./bin/anaconda3/lib/python3.11/site-packages #civicpy pip install civicpy #pandas pip install pandas #jq sudo apt-get install jq add pyenv configs in .bashrc ## pyenv configs export PYENV_ROOT="/home/ubuntu/bin/.pyenv" export PATH="$PYENV_ROOT/bin:$PATH" if command -v pyenv 1&gt;/dev/null 2&gt;&amp;1; then eval "$(pyenv init -)" fi Path setup For 2021 version of the course, rather than exporting each tool’s individual path. 
I moved all of the subdirs to ~/src and cp all of the binaries from there to ~/bin so that PATH is less complex. Set up Apache web server We will start an apache2 service and serve the contents of the students home directories for convenience. This allows easy download of files to their local hard drives, direct loading in IGV by url, etc. Note that when launching instances a security group will have to be selected/modified that allows http access via port 80. Edit config to allow files to be served from outside /usr/share and /var/www sudo vim /etc/apache2/apache2.conf Add the following content to apache2.conf &lt;Directory /workspace&gt; Options Indexes FollowSymLinks AllowOverride None Require all granted &lt;/Directory&gt; Edit vhost file sudo vim /etc/apache2/sites-available/000-default.conf Change document root in 000-default.conf to ‘/workspace’ DocumentRoot /workspace Restart apache sudo service apache2 restart To check if the server works, type in browser of choice: http://[public ip address of ec2 instance]. You should see the content within /workspace . Save a public AMI Finally, save the instance as a new AMI by right clicking the instance and clicking on “Create Image”. Enter an appropriate name and description and then save. If desired, you may choose at this time to include the workspace snapshot in the AMI to avoid having to explicitly attach it later at launching of AMI instances. Change the permissions of the AMI to “public” if you would like it to be listed under the Community AMIs. Copy the AMI to any additional regions where you would like it to appear in Community AMI searches. Current Public AMIs cshl-seqtec-2022 (ami-09b613ae9751a96b1; N. Virginia) cbw-rnabio-2023 (ami-09b3fd07d90812201; N. Virginia) cshl-seqtec-2023 (ami-05d41e9b8c7eee2df; N. Virginia) cshl-seqtec-2024 (ami-00029a06cacbe647c; N. Virginia) cshl-seqtec-2025 (also named cshl_2025_AMI_final; ami-027b72b97520101bd; N. 
Virginia) Create IAM account From AWS Console select Services -&gt; IAM. Go to Users, Create User, specify a user name, and Create. Download credentials to a safe location for later reference if needed. Select the new user and go to Security Credentials -&gt; Manage Password -&gt; ‘Assign a Custom Password’. Go to Groups -&gt; Create a New Group, specify a group name and Next. Attach a policy to the group. In this case we give all EC2 privileges but no other AWS privileges by specifying “AmazonEC2FullAccess”. Hit Next, review and then Create Group. Select the Group -&gt; Add Users to Group, select your new user to add it to the new group. Launch student instance Go to AWS console. Login. Select EC2. Launch Instance, search for “cshl-seqtec-2025” in Community AMIs and Select. Choose “m6a.xlarge” instance type. Select one instance to launch (e.g., one per student and instructor), and select “Protect against accidental termination” Make sure that you see two snapshots (e.g., the 60GB root volume (gp3) and 500GB EBS volume (gp3) you set up earlier). Tick the boxes for “Delete on termination” for both. Create a tag with Name=StudentName Choose existing security group call “SSH/HTTP/Jupyter/Rstudio - with outbound rule”. Review and Launch. Choose an existing key pair (cshl_2025_student.pem) View instances and wait for them to finish initiating. Find your instance in console and select it, then hit connect to get your public.ip.address. Login to node ssh -i cshl_2025_student.pem ubuntu@[public.ip.address]. Optional - set up DNS redirects (see below) Set up a dynamic DNS service Rather than handing out ip addresses for each student instance to each student you can instead set up DNS records to redirect from a more human readable name to the IP address. After spinning up all student instances, use a service like http://dyn.com (or http://entrydns.net, etc.) to create hostnames like , , etc that point to each public IP address of student instances. 
Host necessary files for the course Currently, all miscellaneous data files, annotations, etc. are hosted on an ftp server at the Genome Institute. In the future more data files could be pre-loaded onto the EBS snapshot. Files copied to: /gscmnt/sata102/info/ftp-staging/pub/rnaseq/ Appear here: http://genome.wustl.edu/pub/rnaseq/ After course reminders Delete the student IAM account created above otherwise students will continue to have EC2 privileges. Terminate all instances and clean up any unnecessary volumes, snapshots, etc.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://www.rnabio.org//assets/logos/DNA.jpg" /><media:content medium="image" url="http://www.rnabio.org//assets/logos/DNA.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Integrated Assignment Answers</title><link href="http://www.rnabio.org//module-09-appendix/0009/08/01/Integrated_Assignment_Answers/" rel="alternate" type="text/html" title="Integrated Assignment Answers" /><published>0009-08-01T00:00:00+00:00</published><updated>0009-08-01T00:00:00+00:00</updated><id>http://www.rnabio.org//module-09-appendix/0009/08/01/Integrated_Assignment_Answers</id><content type="html" xml:base="http://www.rnabio.org//module-09-appendix/0009/08/01/Integrated_Assignment_Answers/"><![CDATA[<h1 id="integrated-assignment-answers">Integrated Assignment answers</h1>

<p><strong>Background:</strong> Cell lines are often used to study different experimental conditions and to study the function of specific genes by various perturbation approaches. One such type of study involves knocking down expression of a target of interest by shRNA and then using RNA-seq to measure the impact on gene expression. These experiments often include a control shRNA to account for any expression changes that may occur from just the introduction of these molecules. Differential expression analysis is performed by comparing biological replicates of shRNA knockdown vs shRNA control.</p>

<p><strong>Objectives:</strong> In this assignment, we will be using a subset of the <a href="https://www.ncbi.nlm.nih.gov/bioproject/PRJNA471072">GSE114360 dataset</a>, which consists of 6 RNA-seq datasets generated from a cell line (3 transfected with shRNA, and 3 controls). Our goal will be to determine differentially expressed genes.</p>

<p>Experimental information and other things to keep in mind:</p>

<ul>
  <li>The libraries are prepared as paired end.</li>
  <li>The samples are sequenced on an Illumina 4000.</li>
  <li>Each read is 150 bp long</li>
  <li>The dataset is located here: <a href="https://www.ncbi.nlm.nih.gov/bioproject/PRJNA471072">GSE114360</a></li>
  <li>3 samples transfected with target shRNA and 3 samples with control shRNA</li>
  <li>Libraries were prepared using standard Illumina protocols</li>
  <li>For this exercise we will be using a subset of the reads (first 1,000,000 reads from each pair).</li>
  <li>The files are named based on their SRR IDs, and obey the following key:
    <ul>
      <li>SRR7155055 = CBSLR knockdown sample 1 (T1 - aka transfected 1)</li>
      <li>SRR7155056 = CBSLR knockdown sample 2 (T2 - aka transfected 2)</li>
      <li>SRR7155057 = CBSLR knockdown sample 3 (T3 - aka transfected 3)</li>
      <li>SRR7155058 = control sample 1 (C1 - aka control 1)</li>
      <li>SRR7155059 = control sample 2 (C2 - aka control 2)</li>
      <li>SRR7155060 = control sample 3 (C3 - aka control 3)</li>
    </ul>
  </li>
</ul>

<p>Experimental descriptions from the study authors:</p>

<p>Experimental details from the <a href="https://pubmed.ncbi.nlm.nih.gov/35499052/">paper</a>:
“An RNA transcriptome-sequencing analysis was performed in shRNA-NC or shRNA-CBSLR-1 MKN45 cells cultured under hypoxic conditions for 24 h (Fig. 2A).”</p>

<p>Experimental details from the GEO submission:
“An RNA transcriptome sequencing analysis was performed in MKN45 cells that were transfected with tcons_00001221 shRNA or control shRNA.”</p>

<p>Note that according to <a href="https://www.genecards.org/cgi-bin/carddisp.pl?gene=CBSLR">GeneCards</a> and <a href="https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/55459">HGNC</a>, <em>CBSLR</em> and <em>tcons_00001221</em> refer to the same gene.</p>

<h2 id="part-0--obtaining-data-and-references">Part 0 : Obtaining Data and References</h2>

<p><strong>Goals:</strong></p>

<ul>
  <li>Obtain the files necessary for data processing</li>
  <li>Familiarize yourself with reference and annotation file format</li>
  <li>Familiarize yourself with sequence FASTQ format</li>
</ul>

<p>Create a working directory ~/workspace/rnaseq/integrated_assignment/ to store this exercise. Then create a unix environment variable named RNA_INT_DIR that stores this path for convenience in later commands.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">RNA_HOME</span><span class="o">=</span>~/workspace/rnaseq
<span class="nb">cd</span> <span class="nv">$RNA_HOME</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> ~/workspace/rnaseq/integrated_assignment/
<span class="nb">export </span><span class="nv">RNA_INT_DIR</span><span class="o">=</span>~/workspace/rnaseq/integrated_assignment
</code></pre></div></div>

<p>Obtain reference, annotation, adapter and data files and place them in the integrated assignment directory</p>

<p>Remember: when setting an environment variable, we do NOT need the $; however, every time we reference the variable, it needs to be preceded by a $.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="nv">$RNA_INT_DIR</span>
<span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>
wget http://genomedata.org/rnaseq-tutorial/Integrated_Assignment_RNA_Data.tar.gz
<span class="nb">tar</span> <span class="nt">-xvf</span> Integrated_Assignment_RNA_Data.tar.gz
</code></pre></div></div>
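
<p>As a minimal sketch of the reminder above about the $, the following sets a variable without a $, reads it back with a $, and shows why <code class="language-plaintext highlighter-rouge">export</code> matters for commands started from the shell. The variable <code class="language-plaintext highlighter-rouge">DEMO_NOTE</code> is a hypothetical name used only for illustration.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Set without a leading $; the path matches the one used in this exercise.
export RNA_INT_DIR="$HOME/workspace/rnaseq/integrated_assignment"
# Hypothetical variable that is set but NOT exported.
DEMO_NOTE="only visible in this shell"

# Reference with a leading $.
echo "$RNA_INT_DIR"

# A child process inherits exported variables only.
child_dir=$(bash -c 'echo "$RNA_INT_DIR"')
child_note=$(bash -c 'echo "$DEMO_NOTE"')
echo "child sees RNA_INT_DIR as: $child_dir"
echo "child sees DEMO_NOTE as: [$child_note]"
</code></pre></div></div>

<p>The last line prints empty brackets because <code class="language-plaintext highlighter-rouge">DEMO_NOTE</code> was never exported, so the child shell does not see it.</p>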

<p><strong>Q1.)</strong> How many items are there under the “reference” directory (counting all files in all sub-directories)? If these reference files were not provided for you, how would you obtain or create a reference genome fasta file? How about the GTF transcripts file from Ensembl?</p>

<p><strong>A1.)</strong> The answer is 10. Review these files so that you are familiar with them. If the reference fasta or gtf was not provided, you could obtain them from the Ensembl website under their downloads &gt; databases.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>/reference/
tree
find <span class="nb">.</span> <span class="nt">-type</span> f
find <span class="nb">.</span> <span class="nt">-type</span> f | <span class="nb">wc</span> <span class="nt">-l</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">.</code> tells the <code class="language-plaintext highlighter-rouge">find</code> command to look in the current directory and <code class="language-plaintext highlighter-rouge">-type f</code> restricts the search to files only. The <code class="language-plaintext highlighter-rouge">|</code> passes the output of the <code class="language-plaintext highlighter-rouge">find</code> command to <code class="language-plaintext highlighter-rouge">wc -l</code>, which counts the lines of that output.</p>
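
<p>To see why <code class="language-plaintext highlighter-rouge">find</code> is used here rather than a plain <code class="language-plaintext highlighter-rouge">ls</code>, the sketch below builds a small throwaway directory tree and counts it both ways. The directory layout and file names are invented for illustration and are not the contents of the real reference directory.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Build a throwaway directory tree (hypothetical file names).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/reference/index"
touch "$demo_dir/reference/genome.fa" \
      "$demo_dir/reference/index/genome.1.ht2" \
      "$demo_dir/reference/index/genome.2.ht2"

# ls counts top-level entries, so the sub-directory counts as one item.
entry_count=$(ls -1 "$demo_dir/reference" | wc -l)
# find descends into sub-directories and -type f keeps regular files only.
file_count=$(find "$demo_dir/reference" -type f | wc -l)

echo "top-level entries: $entry_count"
echo "files in all sub-directories: $file_count"
rm -r "$demo_dir"
</code></pre></div></div>

<p>Here the top-level listing reports 2 entries while the recursive file count is 3, which is the kind of difference the question is asking about.</p>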

<p><strong>Q2.)</strong> How many exons does the gene SOX4 have? Which PCA3 isoform has the most exons?</p>

<p><strong>A2.)</strong> SOX4 only has 1 exon, while the longest isoform of PCA3 (ENST00000645704) has 7 exons. Review the GTF file so that you are familiar with it. What downstream steps will we need this gtf file for?</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep</span> <span class="nt">-w</span> <span class="s2">"SOX4"</span> Homo_sapiens.GRCh38.92.gtf | less <span class="nt">-S</span>

<span class="nb">grep</span> <span class="nt">-w</span> <span class="s2">"PCA3"</span> Homo_sapiens.GRCh38.92.gtf | <span class="nb">grep</span> <span class="nt">-w</span> <span class="s2">"exon"</span> | <span class="nb">cut</span> <span class="nt">-f</span> 9 | <span class="nb">cut</span> <span class="nt">-d</span> <span class="s2">";"</span> <span class="nt">-f</span> 3 | <span class="nb">sort</span> | <span class="nb">uniq</span> <span class="nt">-c</span>

</code></pre></div></div>
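
<p>The second command above assumes that the transcript ID is the third <code class="language-plaintext highlighter-rouge">;</code>-separated attribute in column 9. The sketch below applies the same counting idea to a tiny synthetic GTF (all gene and transcript names are invented), and as one possible variation selects exon rows by the feature column with <code class="language-plaintext highlighter-rouge">awk</code> instead of a word match.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Synthetic 9-column GTF rows (tab-separated); all names are invented.
attrs_t1='gene_id "G1"; gene_version "1"; transcript_id "T1";'
attrs_t2='gene_id "G1"; gene_version "1"; transcript_id "T2";'
gtf_rows=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  9 demo transcript 100 900 . + . "$attrs_t1" \
  9 demo exon       100 200 . + . "$attrs_t1" \
  9 demo exon       300 400 . + . "$attrs_t1" \
  9 demo exon       100 200 . + . "$attrs_t2" \
  9 demo exon       300 400 . + . "$attrs_t2" \
  9 demo exon       800 900 . + . "$attrs_t2")

# Keep rows whose feature column is exactly "exon", then count per transcript.
exon_counts=$(printf '%s\n' "$gtf_rows" \
  | awk -F '\t' '$3 == "exon"' \
  | cut -f 9 | cut -d ';' -f 3 \
  | sort | uniq -c | sort -k1,1nr)
printf '%s\n' "$exon_counts"

# The first line is the transcript with the most exons.
top_transcript=$(printf '%s\n' "$exon_counts" | head -n 1 | awk '{print $1, $3}')
echo "$top_transcript"
</code></pre></div></div>

<p>Sorting the counts in descending order puts the isoform with the most exons on the first line, so the answer can be read directly rather than by scanning the list.</p>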

<p><strong>Q3.)</strong> How many samples do you see under the data directory?</p>

<p><strong>A3.)</strong> The answer is 6 samples. The number of files is 12 because the sequence data is paired (an R1 and R2 file for each sample). The files are named based on their SRA accession number.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>/data/
<span class="nb">ls</span> <span class="nt">-l</span>
<span class="nb">ls</span> <span class="nt">-1</span> | <span class="nb">wc</span> <span class="nt">-l</span>
</code></pre></div></div>
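
<p>Counting files gives twice the number of samples because each sample has an R1 and an R2 file. A minimal sketch of counting samples directly, using empty placeholder files in a temporary directory that follow the same naming pattern as the real data:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Create empty placeholder files named like the paired FASTQ files.
demo_dir=$(mktemp -d)
for srr in SRR7155055 SRR7155056 SRR7155057 SRR7155058 SRR7155059 SRR7155060; do
  touch "$demo_dir/${srr}_1.fastq.gz" "$demo_dir/${srr}_2.fastq.gz"
done

# Every file is counted once.
file_count=$(ls -1 "$demo_dir" | wc -l)
# Strip the read suffix so both mates collapse to one sample name.
sample_count=$(ls -1 "$demo_dir" | sed 's/_[12]\.fastq\.gz$//' | sort -u | wc -l)

echo "files: $file_count"
echo "samples: $sample_count"
rm -r "$demo_dir"
</code></pre></div></div>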

<p>NOTE: The fastq files you have copied above contain only the first 1,000,000 reads. Keep this in mind when you are combing through the results of the differential expression analysis.</p>

<h2 id="part-1--data-preprocessing">Part 1 : Data preprocessing</h2>

<p><strong>Goals:</strong></p>

<ul>
  <li>Run a quality check with <code class="language-plaintext highlighter-rouge">fastqc</code> before and after trimming</li>
  <li>Familiarize yourself with the options for <code class="language-plaintext highlighter-rouge">fastqc</code> to be able to redirect your output</li>
  <li>Perform adapter trimming and data cleanup on your data using <code class="language-plaintext highlighter-rouge">fastp</code></li>
  <li>Familiarize yourself with the output metrics from adapter trimming</li>
  <li>Examine <code class="language-plaintext highlighter-rouge">fastqc</code> and/or <code class="language-plaintext highlighter-rouge">multiqc</code> reports for the pre- and post-trimmed data</li>
</ul>

<p>Create a new folder that will house the outputs from FastQC. Use the <code class="language-plaintext highlighter-rouge">-h</code> option to review the available options, including how to redirect the output, then run FastQC on the data to assess its quality.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> qc/raw_fastqc
fastqc <span class="nv">$RNA_INT_DIR</span>/data/<span class="k">*</span>.fastq.gz <span class="nt">-o</span> qc/raw_fastqc/
<span class="nb">cd </span>qc/raw_fastqc
multiqc ./

</code></pre></div></div>

<p><strong>Q4.)</strong> What metrics, if any, have the samples failed? Are the errors related?</p>

<p><strong>A4.)</strong> The per base sequence content of the samples doesn’t show a flat distribution and is biased towards certain bases at the beginning of the reads. The reason for this bias could be non-random priming during cDNA synthesis, giving rise to non-random bases near the beginning/end of each fragment. The QC reports also flag the presence of adapters in the reads.</p>

<p>Now based on the output of the html summary, proceed to clean up the reads and rerun fastqc to see if an improvement can be made to the data. Make sure to create a directory to hold any processed reads you may create.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>
<span class="nb">mkdir </span>trimmed_reads

fastp <span class="nt">-i</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155055_1.fastq.gz <span class="nt">-I</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155055_2.fastq.gz <span class="nt">-o</span> trimmed_reads/SRR7155055_1.fastq.gz <span class="nt">-O</span> trimmed_reads/SRR7155055_2.fastq.gz <span class="nt">-l</span> 25 <span class="nt">--adapter_fasta</span> <span class="nv">$RNA_INT_DIR</span>/adapter/illumina_multiplex.fa <span class="nt">--trim_front1</span> 13 <span class="nt">--trim_front2</span> 13 <span class="nt">--json</span> trimmed_reads/SRR7155055.fastp.json <span class="nt">--html</span> trimmed_reads/SRR7155055.fastp.html 2&gt;trimmed_reads/SRR7155055.fastp.log
fastp <span class="nt">-i</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155056_1.fastq.gz <span class="nt">-I</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155056_2.fastq.gz <span class="nt">-o</span> trimmed_reads/SRR7155056_1.fastq.gz <span class="nt">-O</span> trimmed_reads/SRR7155056_2.fastq.gz <span class="nt">-l</span> 25 <span class="nt">--adapter_fasta</span> <span class="nv">$RNA_INT_DIR</span>/adapter/illumina_multiplex.fa <span class="nt">--trim_front1</span> 13 <span class="nt">--trim_front2</span> 13 <span class="nt">--json</span> trimmed_reads/SRR7155056.fastp.json <span class="nt">--html</span> trimmed_reads/SRR7155056.fastp.html 2&gt;trimmed_reads/SRR7155056.fastp.log
fastp <span class="nt">-i</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155057_1.fastq.gz <span class="nt">-I</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155057_2.fastq.gz <span class="nt">-o</span> trimmed_reads/SRR7155057_1.fastq.gz <span class="nt">-O</span> trimmed_reads/SRR7155057_2.fastq.gz <span class="nt">-l</span> 25 <span class="nt">--adapter_fasta</span> <span class="nv">$RNA_INT_DIR</span>/adapter/illumina_multiplex.fa <span class="nt">--trim_front1</span> 13 <span class="nt">--trim_front2</span> 13 <span class="nt">--json</span> trimmed_reads/SRR7155057.fastp.json <span class="nt">--html</span> trimmed_reads/SRR7155057.fastp.html 2&gt;trimmed_reads/SRR7155057.fastp.log
fastp <span class="nt">-i</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155058_1.fastq.gz <span class="nt">-I</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155058_2.fastq.gz <span class="nt">-o</span> trimmed_reads/SRR7155058_1.fastq.gz <span class="nt">-O</span> trimmed_reads/SRR7155058_2.fastq.gz <span class="nt">-l</span> 25 <span class="nt">--adapter_fasta</span> <span class="nv">$RNA_INT_DIR</span>/adapter/illumina_multiplex.fa <span class="nt">--trim_front1</span> 13 <span class="nt">--trim_front2</span> 13 <span class="nt">--json</span> trimmed_reads/SRR7155058.fastp.json <span class="nt">--html</span> trimmed_reads/SRR7155058.fastp.html 2&gt;trimmed_reads/SRR7155058.fastp.log
fastp <span class="nt">-i</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155059_1.fastq.gz <span class="nt">-I</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155059_2.fastq.gz <span class="nt">-o</span> trimmed_reads/SRR7155059_1.fastq.gz <span class="nt">-O</span> trimmed_reads/SRR7155059_2.fastq.gz <span class="nt">-l</span> 25 <span class="nt">--adapter_fasta</span> <span class="nv">$RNA_INT_DIR</span>/adapter/illumina_multiplex.fa <span class="nt">--trim_front1</span> 13 <span class="nt">--trim_front2</span> 13 <span class="nt">--json</span> trimmed_reads/SRR7155059.fastp.json <span class="nt">--html</span> trimmed_reads/SRR7155059.fastp.html 2&gt;trimmed_reads/SRR7155059.fastp.log
fastp <span class="nt">-i</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155060_1.fastq.gz <span class="nt">-I</span> <span class="nv">$RNA_INT_DIR</span>/data/SRR7155060_2.fastq.gz <span class="nt">-o</span> trimmed_reads/SRR7155060_1.fastq.gz <span class="nt">-O</span> trimmed_reads/SRR7155060_2.fastq.gz <span class="nt">-l</span> 25 <span class="nt">--adapter_fasta</span> <span class="nv">$RNA_INT_DIR</span>/adapter/illumina_multiplex.fa <span class="nt">--trim_front1</span> 13 <span class="nt">--trim_front2</span> 13 <span class="nt">--json</span> trimmed_reads/SRR7155060.fastp.json <span class="nt">--html</span> trimmed_reads/SRR7155060.fastp.html 2&gt;trimmed_reads/SRR7155060.fastp.log

</code></pre></div></div>

<p><strong>Q5.)</strong> What average percentage of reads remain after adapter trimming/cleanup with fastp? Why do reads get tossed out?</p>

<p><strong>A5.)</strong> At this point, we could look in the log files individually. Alternatively, we could utilize the command line with a command like the one below.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep</span> <span class="nt">-A</span> 1 Read1 trimmed_reads/<span class="k">*</span>.log
</code></pre></div></div>

<p>Doing this, we find that around 93-95% of reads survive after adapter trimming and cleanup with fastp. Reads are discarded when they fail one of the filters: they are shorter than our minimum read length of 25 after trimming, have poor sequence quality, or contain too many N’s.</p>
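
<p>As a hedged sketch, the survival percentage can also be computed instead of read by eye. The helper below (<code class="language-plaintext highlighter-rouge">survival_pct</code>, a hypothetical name) assumes each fastp log contains a “Read1 before filtering:” and a “Read1 after filtering:” header, each followed by a “total reads:” line, which is the layout the <code class="language-plaintext highlighter-rouge">grep -A 1 Read1</code> command relies on. Check that assumption against your own log files before trusting the numbers.</p>

```shell
# Sketch: percentage of Read1 reads that survive trimming, parsed from a fastp log.
# Assumption: the log has "Read1 before filtering:" / "Read1 after filtering:"
# headers, each followed by a "total reads: N" line.
survival_pct() {
  awk '
    /^Read1 before filtering:/ { state = "before"; next }
    /^Read1 after filtering:/  { state = "after";  next }
    /^total reads:/ && state == "before" { before = $3; state = "" }
    /^total reads:/ && state == "after"  { after  = $3; state = "" }
    END { if (before > 0) printf "%.1f\n", 100 * after / before }
  ' "$1"
}

# Print one percentage per log file (skipped silently if no logs are present).
for log in trimmed_reads/*.log; do
  [ -e "$log" ] || continue
  printf '%s\t%s\n' "$log" "$(survival_pct "$log")"
done
```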

<p><strong>Q6.)</strong> What sample has the largest number of reads after trimming?</p>

<p><strong>A6.)</strong> The control sample 2 (SRR7155060) has the most reads (1,907,336 individual reads).
An easy way to find the number of reads is to check the log file written during trimming. The “reads passed filter” row gives the reads (each read in a pair counted individually) that survive the trimming. We can also look at this from the command line.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep</span> <span class="s2">"passed"</span> trimmed_reads/<span class="k">*</span>.log
</code></pre></div></div>

<p>Alternatively, you can use the command ‘wc -l’, which counts the number of lines in a file. Because fastq files have 4 lines per read, the total number of lines must be divided by 4 to get the number of reads; the command itself only gives you the line count. (Note that because the data is compressed, we need to use zcat to decompress it and stream it to the wc command.)</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>zcat <span class="nv">$RNA_INT_DIR</span>/data/SRR7155059_1.fastq.gz | <span class="nb">wc</span> <span class="nt">-l</span>
zcat <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155059_1.fastq.gz | <span class="nb">wc</span> <span class="nt">-l</span>

</code></pre></div></div>
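
<p>The division by 4 can be folded into the same pipeline. This is a minimal sketch, assuming well-formed fastq files with exactly 4 lines per record; <code class="language-plaintext highlighter-rouge">count_reads</code> is a hypothetical helper name, and <code class="language-plaintext highlighter-rouge">gzip -dc</code> is used as a portable stand-in for zcat.</p>

```shell
# Sketch: count reads in a gzipped fastq file (assumes 4 lines per record).
count_reads() {
  # Decompress to stdout, then let awk divide the final line count by 4.
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Example usage (paths as used in this exercise):
#   count_reads $RNA_INT_DIR/data/SRR7155059_1.fastq.gz
#   count_reads $RNA_INT_DIR/trimmed_reads/SRR7155059_1.fastq.gz
```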

<p>We could also run <code class="language-plaintext highlighter-rouge">fastqc</code> and <code class="language-plaintext highlighter-rouge">multiqc</code> on the trimmed data and visualize the remaining reads that way.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> qc/trimmed_fastqc
fastqc <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/<span class="k">*</span>.fastq.gz <span class="nt">-o</span> qc/trimmed_fastqc/
<span class="nb">cd </span>qc/trimmed_fastqc
multiqc ./

</code></pre></div></div>

<h2 id="part-2-data-alignment">Part 2: Data alignment</h2>

<p><strong>Goals:</strong></p>

<ul>
  <li>Familiarize yourself with HISAT2 alignment options</li>
  <li>Perform alignments using <code class="language-plaintext highlighter-rouge">hisat2</code> and the trimmed version of the raw sequence data above</li>
  <li>Sort your alignments and convert into compressed bam format using <code class="language-plaintext highlighter-rouge">samtools sort</code></li>
  <li>Obtain alignment summary information using <code class="language-plaintext highlighter-rouge">samtools flagstat</code></li>
</ul>

<p>To create HISAT2 alignment commands for all six samples and run the alignments:</p>

<p>Create a directory to store the alignment results</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="nv">$RNA_INT_DIR</span>/alignments
<span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$RNA_INT_DIR</span>/alignments
<span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>/alignments
</code></pre></div></div>

<p>Run alignment commands for each sample</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hisat2 <span class="nt">-p</span> 8 <span class="nt">--rg-id</span><span class="o">=</span>T1 <span class="nt">--rg</span> SM:Transfected1 <span class="nt">--rg</span> LB:Transfected1_lib <span class="nt">--rg</span> PL:ILLUMINA <span class="nt">-x</span> <span class="nv">$RNA_INT_DIR</span>/reference/Homo_sapiens.GRCh38 <span class="nt">--dta</span> <span class="nt">--rna-strandness</span> RF <span class="nt">-1</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155055_1.fastq.gz <span class="nt">-2</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155055_2.fastq.gz <span class="nt">-S</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155055.sam
hisat2 <span class="nt">-p</span> 8 <span class="nt">--rg-id</span><span class="o">=</span>T2 <span class="nt">--rg</span> SM:Transfected2 <span class="nt">--rg</span> LB:Transfected2_lib <span class="nt">--rg</span> PL:ILLUMINA <span class="nt">-x</span> <span class="nv">$RNA_INT_DIR</span>/reference/Homo_sapiens.GRCh38 <span class="nt">--dta</span> <span class="nt">--rna-strandness</span> RF <span class="nt">-1</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155056_1.fastq.gz <span class="nt">-2</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155056_2.fastq.gz <span class="nt">-S</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155056.sam
hisat2 <span class="nt">-p</span> 8 <span class="nt">--rg-id</span><span class="o">=</span>T3 <span class="nt">--rg</span> SM:Transfected3 <span class="nt">--rg</span> LB:Transfected3_lib <span class="nt">--rg</span> PL:ILLUMINA <span class="nt">-x</span> <span class="nv">$RNA_INT_DIR</span>/reference/Homo_sapiens.GRCh38 <span class="nt">--dta</span> <span class="nt">--rna-strandness</span> RF <span class="nt">-1</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155057_1.fastq.gz <span class="nt">-2</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155057_2.fastq.gz <span class="nt">-S</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155057.sam
hisat2 <span class="nt">-p</span> 8 <span class="nt">--rg-id</span><span class="o">=</span>C1 <span class="nt">--rg</span> SM:Control1 <span class="nt">--rg</span> LB:Control1_lib <span class="nt">--rg</span> PL:ILLUMINA <span class="nt">-x</span> <span class="nv">$RNA_INT_DIR</span>/reference/Homo_sapiens.GRCh38 <span class="nt">--dta</span> <span class="nt">--rna-strandness</span> RF <span class="nt">-1</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155058_1.fastq.gz <span class="nt">-2</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155058_2.fastq.gz <span class="nt">-S</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155058.sam
hisat2 <span class="nt">-p</span> 8 <span class="nt">--rg-id</span><span class="o">=</span>C2 <span class="nt">--rg</span> SM:Control2 <span class="nt">--rg</span> LB:Control2_lib <span class="nt">--rg</span> PL:ILLUMINA <span class="nt">-x</span> <span class="nv">$RNA_INT_DIR</span>/reference/Homo_sapiens.GRCh38 <span class="nt">--dta</span> <span class="nt">--rna-strandness</span> RF <span class="nt">-1</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155059_1.fastq.gz <span class="nt">-2</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155059_2.fastq.gz <span class="nt">-S</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155059.sam
hisat2 <span class="nt">-p</span> 8 <span class="nt">--rg-id</span><span class="o">=</span>C3 <span class="nt">--rg</span> SM:Control3 <span class="nt">--rg</span> LB:Control3_lib <span class="nt">--rg</span> PL:ILLUMINA <span class="nt">-x</span> <span class="nv">$RNA_INT_DIR</span>/reference/Homo_sapiens.GRCh38 <span class="nt">--dta</span> <span class="nt">--rna-strandness</span> RF <span class="nt">-1</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155060_1.fastq.gz <span class="nt">-2</span> <span class="nv">$RNA_INT_DIR</span>/trimmed_reads/SRR7155060_2.fastq.gz <span class="nt">-S</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155060.sam

</code></pre></div></div>
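
<p>Because the six commands differ only in the accession and read-group labels, they could be generated from a small sample table. The sketch below only prints the commands so that they can be inspected before running; <code class="language-plaintext highlighter-rouge">make_hisat2_cmds</code> is a hypothetical helper, and the table is copied from the commands above.</p>

```shell
# Sketch: print (not run) one hisat2 command per sample.
# Table columns: accession, read-group ID, sample name (taken from the commands above).
make_hisat2_cmds() {
  while read -r srr rg_id sample; do
    # $RNA_INT_DIR is kept literal (single-quoted format) so the printed
    # commands match the ones in this exercise.
    printf 'hisat2 -p 8 --rg-id=%s --rg SM:%s --rg LB:%s_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/%s_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/%s_2.fastq.gz -S $RNA_INT_DIR/alignments/%s.sam\n' \
      "$rg_id" "$sample" "$sample" "$srr" "$srr" "$srr"
  done <<'EOF'
SRR7155055 T1 Transfected1
SRR7155056 T2 Transfected2
SRR7155057 T3 Transfected3
SRR7155058 C1 Control1
SRR7155059 C2 Control2
SRR7155060 C3 Control3
EOF
}

make_hisat2_cmds
```

<p>Once the printed commands look right, the output could be saved to a script and executed, which avoids copy-and-paste mistakes in the sample labels.</p>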

<p>Next, convert sam alignments to bam.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>/alignments
samtools <span class="nb">sort</span> -@ 8 <span class="nt">-o</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155055.bam <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155055.sam
samtools <span class="nb">sort</span> -@ 8 <span class="nt">-o</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155056.bam <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155056.sam
samtools <span class="nb">sort</span> -@ 8 <span class="nt">-o</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155057.bam <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155057.sam
samtools <span class="nb">sort</span> -@ 8 <span class="nt">-o</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155058.bam <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155058.sam
samtools <span class="nb">sort</span> -@ 8 <span class="nt">-o</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155059.bam <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155059.sam
samtools <span class="nb">sort</span> -@ 8 <span class="nt">-o</span> <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155060.bam <span class="nv">$RNA_INT_DIR</span>/alignments/SRR7155060.sam

</code></pre></div></div>

<p><strong>Q7.)</strong> How can we obtain summary statistics for each aligned file?</p>

<p><strong>A7.)</strong> There are many RNA-seq QC tools available that can provide you with detailed information about the quality of the aligned sample (e.g. FastQC and RSeQC). However, for a simple summary of aligned read counts you can use <code class="language-plaintext highlighter-rouge">samtools flagstat</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>/alignments
samtools flagstat SRR7155055.bam <span class="o">&gt;</span> SRR7155055.flagstat.txt
samtools flagstat SRR7155056.bam <span class="o">&gt;</span> SRR7155056.flagstat.txt
samtools flagstat SRR7155057.bam <span class="o">&gt;</span> SRR7155057.flagstat.txt
samtools flagstat SRR7155058.bam <span class="o">&gt;</span> SRR7155058.flagstat.txt
samtools flagstat SRR7155059.bam <span class="o">&gt;</span> SRR7155059.flagstat.txt
samtools flagstat SRR7155060.bam <span class="o">&gt;</span> SRR7155060.flagstat.txt

</code></pre></div></div>

<p>Pull out summaries of mapped reads from the flagstat files</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep</span> <span class="s2">"mapped ("</span> <span class="k">*</span>.flagstat.txt

</code></pre></div></div>
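
<p>For a more compact table, the mapped count and percentage could be pulled out of each flagstat file. This is a sketch under the assumption that the summary line has the form “N + M mapped (P% : N/A)”; <code class="language-plaintext highlighter-rouge">mapped_summary</code> is a hypothetical helper name. The pattern is anchored so that a “primary mapped” line, if your samtools version prints one, is not counted twice.</p>

```shell
# Sketch: print file name, mapped read count, and mapped percentage
# from one or more samtools flagstat outputs.
# Assumption: the line of interest looks like "1900 + 0 mapped (95.00% : N/A)".
mapped_summary() {
  awk '
    /^[0-9]+ \+ [0-9]+ mapped \(/ {
      pct = $5                 # e.g. "(95.00%"
      gsub(/[(%]/, "", pct)    # strip the parenthesis and percent sign
      print FILENAME "\t" $1 "\t" pct
    }
  ' "$@"
}

# Example usage, run from the alignments directory:
#   mapped_summary *.flagstat.txt
```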

<p><strong>Q8.)</strong> Approximately how much space is saved by converting the sam to a bam format?</p>

<p><strong>A8.)</strong> We get about a 5.5x compression by using the bam format instead of the sam format. This can be seen by adding the <code class="language-plaintext highlighter-rouge">-lh</code> option when listing the files in the alignments directory.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">ls</span> <span class="nt">-lh</span> <span class="nv">$RNA_INT_DIR</span>/alignments/
</code></pre></div></div>

<p>To specifically look at the sizes of the sam and bam files, we could use <code class="language-plaintext highlighter-rouge">du -h</code>, which shows us the disk space they are utilizing in human readable format.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">du</span> <span class="nt">-h</span> <span class="nv">$RNA_INT_DIR</span>/alignments/<span class="k">*</span>.sam
<span class="nb">du</span> <span class="nt">-h</span> <span class="nv">$RNA_INT_DIR</span>/alignments/<span class="k">*</span>.bam
</code></pre></div></div>
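
<p>The compression ratio itself is simply the sam size divided by the bam size. A minimal sketch of that arithmetic, using byte counts from <code class="language-plaintext highlighter-rouge">wc -c</code> (<code class="language-plaintext highlighter-rouge">size_ratio</code> is a hypothetical helper name):</p>

```shell
# Sketch: ratio of two file sizes in bytes (first file / second file).
size_ratio() {
  big=$(wc -c < "$1")
  small=$(wc -c < "$2")
  # Guard against division by zero for an empty second file.
  awk -v a="$big" -v b="$small" 'BEGIN { if (b > 0) printf "%.1f\n", a / b }'
}

# Example usage:
#   size_ratio $RNA_INT_DIR/alignments/SRR7155055.sam $RNA_INT_DIR/alignments/SRR7155055.bam
```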

<p>In order to make visualization easier, you should now merge each of your replicate sample bams into one combined BAM for each condition. Make sure to index these bams afterwards to be able to view them on IGV.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>/alignments
java <span class="nt">-Xmx2g</span> <span class="nt">-jar</span> <span class="nv">$PICARD</span> MergeSamFiles <span class="nv">OUTPUT</span><span class="o">=</span>transfected.bam <span class="nv">INPUT</span><span class="o">=</span>SRR7155055.bam <span class="nv">INPUT</span><span class="o">=</span>SRR7155056.bam <span class="nv">INPUT</span><span class="o">=</span>SRR7155057.bam
java <span class="nt">-Xmx2g</span> <span class="nt">-jar</span> <span class="nv">$PICARD</span> MergeSamFiles <span class="nv">OUTPUT</span><span class="o">=</span>control.bam <span class="nv">INPUT</span><span class="o">=</span>SRR7155058.bam <span class="nv">INPUT</span><span class="o">=</span>SRR7155059.bam <span class="nv">INPUT</span><span class="o">=</span>SRR7155060.bam
</code></pre></div></div>

<p>To visualize these merged bam files in IGV, we’ll need to index them. We can do so with the following commands.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>/alignments
samtools index <span class="nv">$RNA_INT_DIR</span>/alignments/control.bam
samtools index <span class="nv">$RNA_INT_DIR</span>/alignments/transfected.bam
</code></pre></div></div>

<p>Try viewing genes such as TP53 to get a sense of how the data is aligned. To do this:</p>
<ul>
  <li>Load up IGV</li>
  <li>Change the reference genome to “Human hg38” in the top-left drop-down menu</li>
  <li>Click on File &gt; Load from URL, and in the File URL field enter: “http://&lt;your IP&gt;/rnaseq/integrated_assignment/alignments/transfected.bam”. Repeat this step and enter “http://&lt;your IP&gt;/rnaseq/integrated_assignment/alignments/control.bam” to load the other bam.</li>
  <li>Right-click on the alignments track in the middle, and Group alignments by “Library”</li>
  <li>Jump to TP53 by typing it into the search bar above</li>
</ul>

<p><strong>Q9.)</strong> What portion of the gene do the reads seem to be piling up on? What would be different if we were viewing whole-genome sequencing data?</p>

<p><strong>A9.)</strong> The reads all pile up on the exonic regions of the gene since we’re dealing with RNA-Sequencing data. Not all exons have equal coverage, and this is due to different isoforms of the gene being sequenced. If the data was from a whole-genome experiment, we would ideally expect to see equal coverage across the whole gene length.</p>

<p>Right-click in the middle of the page, and click on “Expanded” to view the reads more easily.</p>

<p><strong>Q10.)</strong> What are the lines connecting the reads trying to convey?</p>

<p><strong>A10.)</strong> The lines show a connected read, where one part of the read maps to one exon while the other part maps to the next exon, skipping the intron in between. This is important in RNA-Sequencing alignment because aligners must be splice-aware in order to place such split reads correctly.</p>

<h2 id="part-3-expression-estimation">Part 3: Expression Estimation</h2>

<p><strong>Goals:</strong></p>

<ul>
  <li>Familiarize yourself with Stringtie options and how to run Stringtie in “reference-only” mode</li>
  <li>Create an expression results directory, run <code class="language-plaintext highlighter-rouge">stringtie</code> on all 6 samples, and store the results in appropriately named subdirectories in this results dir</li>
  <li>Obtain expression values for the gene SOX4</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>/
<span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$RNA_INT_DIR</span>/expression

stringtie <span class="nt">-p</span> 8 <span class="nt">-G</span> reference/Homo_sapiens.GRCh38.92.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/transfected1/transcripts.gtf <span class="nt">-A</span> expression/transfected1/gene_abundances.tsv alignments/SRR7155055.bam
stringtie <span class="nt">-p</span> 8 <span class="nt">-G</span> reference/Homo_sapiens.GRCh38.92.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/transfected2/transcripts.gtf <span class="nt">-A</span> expression/transfected2/gene_abundances.tsv alignments/SRR7155056.bam
stringtie <span class="nt">-p</span> 8 <span class="nt">-G</span> reference/Homo_sapiens.GRCh38.92.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/transfected3/transcripts.gtf <span class="nt">-A</span> expression/transfected3/gene_abundances.tsv alignments/SRR7155057.bam
stringtie <span class="nt">-p</span> 8 <span class="nt">-G</span> reference/Homo_sapiens.GRCh38.92.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/control1/transcripts.gtf <span class="nt">-A</span> expression/control1/gene_abundances.tsv alignments/SRR7155058.bam
stringtie <span class="nt">-p</span> 8 <span class="nt">-G</span> reference/Homo_sapiens.GRCh38.92.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/control2/transcripts.gtf <span class="nt">-A</span> expression/control2/gene_abundances.tsv alignments/SRR7155059.bam
stringtie <span class="nt">-p</span> 8 <span class="nt">-G</span> reference/Homo_sapiens.GRCh38.92.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/control3/transcripts.gtf <span class="nt">-A</span> expression/control3/gene_abundances.tsv alignments/SRR7155060.bam
</code></pre></div></div>

<p><strong>Q11.)</strong> How can you obtain the expression of the gene SOX4 across the transfected and control samples?</p>

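<p><strong>A11.)</strong> To look for the expression value of a specific gene, you can use the command ‘grep’ followed by the gene name and the path to the expression file. The command below searches each sample’s transcript GTF, keeps columns 1 and 9, and retains only the lines that report an FPKM value.</p>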

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">grep </span>SOX4 <span class="nv">$RNA_INT_DIR</span>/expression/<span class="k">*</span>/transcripts.gtf | <span class="nb">cut</span> <span class="nt">-f</span> 1,9 | <span class="nb">grep </span>FPKM
</code></pre></div></div>
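
<p>Gene-level values could also be read from the <code class="language-plaintext highlighter-rouge">gene_abundances.tsv</code> files written by the <code class="language-plaintext highlighter-rouge">-A</code> option. The sketch below assumes those files are tab-delimited with a header row containing “Gene Name”, “FPKM” and “TPM” columns; it looks the columns up by name, so verify the header of your own files first. <code class="language-plaintext highlighter-rouge">gene_expression</code> is a hypothetical helper name. Unlike a plain grep, it matches the gene name exactly, so similarly named genes are not returned.</p>

```shell
# Sketch: print file, gene name, FPKM and TPM for an exact gene-name match.
# Assumption: tab-delimited input with header columns "Gene Name", "FPKM", "TPM".
gene_expression() {
  gene="$1"; shift
  awk -F '\t' -v gene="$gene" '
    FNR == 1 {
      # Locate the columns by header name in each file.
      name_col = fpkm_col = tpm_col = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "Gene Name") name_col = i
        if ($i == "FPKM")      fpkm_col = i
        if ($i == "TPM")       tpm_col  = i
      }
      next
    }
    name_col && fpkm_col && tpm_col && $name_col == gene {
      print FILENAME "\t" $name_col "\t" $fpkm_col "\t" $tpm_col
    }
  ' "$@"
}

# Example usage:
#   gene_expression SOX4 $RNA_INT_DIR/expression/*/gene_abundances.tsv
```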

<h2 id="part-4-differential-expression-analysis">Part 4: Differential Expression Analysis</h2>

<p><strong>Goals:</strong></p>

<ul>
  <li>Perform differential analysis between the transfected and control samples</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$RNA_INT_DIR</span>/ballgown/
<span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>/ballgown/
</code></pre></div></div>

<p>Perform transfected vs. control comparison, using all samples, for known transcripts:</p>

<p>Adapt the R tutorial code that was used in the <a href="https://rnabio.org/module-03-expression/0003/03/01/Differential_Expression/">Differential Expression</a> section. Modify it to work on these data (which are also a 3x3 replicate comparison of two conditions).</p>

<p>First, start an R session:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R
</code></pre></div></div>

<p>Run the following R commands in your R session.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">
</span><span class="c1"># load the required libraries</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ballgown</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">genefilter</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">devtools</span><span class="p">)</span><span class="w">

</span><span class="c1"># Create phenotype data needed for ballgown analysis. Recall that:</span><span class="w">
</span><span class="c1"># "T1-T3" refers to "transfected" (CBSLR shRNA knockdown) replicates</span><span class="w">
</span><span class="c1"># "C1-C3" refers to "control" (shRNA control) replicates</span><span class="w">

</span><span class="n">ids</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"transfected1"</span><span class="p">,</span><span class="s2">"transfected2"</span><span class="p">,</span><span class="s2">"transfected3"</span><span class="p">,</span><span class="s2">"control1"</span><span class="p">,</span><span class="s2">"control2"</span><span class="p">,</span><span class="s2">"control3"</span><span class="p">)</span><span class="w">
</span><span class="n">type</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Transfected"</span><span class="p">,</span><span class="s2">"Transfected"</span><span class="p">,</span><span class="s2">"Transfected"</span><span class="p">,</span><span class="s2">"Control"</span><span class="p">,</span><span class="s2">"Control"</span><span class="p">,</span><span class="s2">"Control"</span><span class="p">)</span><span class="w">
</span><span class="n">results</span><span class="o">=</span><span class="s2">"/home/ubuntu/workspace/rnaseq/integrated_assignment/expression/"</span><span class="w">
</span><span class="n">path</span><span class="o">=</span><span class="n">paste</span><span class="p">(</span><span class="n">results</span><span class="p">,</span><span class="n">ids</span><span class="p">,</span><span class="n">sep</span><span class="o">=</span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="n">pheno_data</span><span class="o">=</span><span class="n">data.frame</span><span class="p">(</span><span class="n">ids</span><span class="p">,</span><span class="n">type</span><span class="p">,</span><span class="n">path</span><span class="p">)</span><span class="w">

</span><span class="n">pheno_data</span><span class="w">

</span><span class="c1"># Load ballgown data structure and save it to a variable "bg"</span><span class="w">
</span><span class="n">bg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ballgown</span><span class="p">(</span><span class="n">samples</span><span class="o">=</span><span class="n">as.vector</span><span class="p">(</span><span class="n">pheno_data</span><span class="o">$</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="n">pData</span><span class="o">=</span><span class="n">pheno_data</span><span class="p">)</span><span class="w">

</span><span class="c1"># Display a description of this object</span><span class="w">
</span><span class="n">bg</span><span class="w">

</span><span class="c1"># Load all attributes including gene name</span><span class="w">
</span><span class="n">bg_table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">texpr</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="s1">'all'</span><span class="p">)</span><span class="w">
</span><span class="n">bg_gene_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bg_table</span><span class="p">[,</span><span class="w"> </span><span class="m">9</span><span class="o">:</span><span class="m">10</span><span class="p">])</span><span class="w">
</span><span class="n">bg_transcript_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bg_table</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">6</span><span class="p">)])</span><span class="w">

</span><span class="c1"># Save the ballgown object to a file for later use</span><span class="w">
</span><span class="n">save</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="s1">'bg.rda'</span><span class="p">)</span><span class="w">

</span><span class="c1"># Perform differential expression (DE) analysis with no filtering, at both gene and transcript level</span><span class="w">
</span><span class="n">results_transcripts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stattest</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="n">feature</span><span class="o">=</span><span class="s2">"transcript"</span><span class="p">,</span><span class="w"> </span><span class="n">covariate</span><span class="o">=</span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="n">getFC</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">meas</span><span class="o">=</span><span class="s2">"FPKM"</span><span class="p">)</span><span class="w">
</span><span class="n">results_transcripts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">results_transcripts</span><span class="p">,</span><span class="w"> </span><span class="n">bg_transcript_names</span><span class="p">,</span><span class="w"> </span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="p">),</span><span class="w"> </span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"t_id"</span><span class="p">))</span><span class="w">

</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stattest</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="n">feature</span><span class="o">=</span><span class="s2">"gene"</span><span class="p">,</span><span class="w"> </span><span class="n">covariate</span><span class="o">=</span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="n">getFC</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">meas</span><span class="o">=</span><span class="s2">"FPKM"</span><span class="p">)</span><span class="w">
</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="w"> </span><span class="n">bg_gene_names</span><span class="p">,</span><span class="w"> </span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="p">),</span><span class="w"> </span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"gene_id"</span><span class="p">))</span><span class="w">

</span><span class="c1"># Save a tab delimited file for both the transcript and gene results</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">results_transcripts</span><span class="p">,</span><span class="w"> </span><span class="s2">"Transfected_vs_Control_transcript_results.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="w"> </span><span class="s2">"Transfected_vs_Control_gene_results.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Filter low-abundance genes. Here we remove all transcripts with a variance across the samples of less than one</span><span class="w">
</span><span class="n">bg_filt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subset</span><span class="w"> </span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="s2">"rowVars(texpr(bg)) &gt; 1"</span><span class="p">,</span><span class="w"> </span><span class="n">genomesubset</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Load all attributes including gene name</span><span class="w">
</span><span class="n">bg_filt_table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">texpr</span><span class="p">(</span><span class="n">bg_filt</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="s1">'all'</span><span class="p">)</span><span class="w">
</span><span class="n">bg_filt_gene_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bg_filt_table</span><span class="p">[,</span><span class="w"> </span><span class="m">9</span><span class="o">:</span><span class="m">10</span><span class="p">])</span><span class="w">
</span><span class="n">bg_filt_transcript_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bg_filt_table</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">6</span><span class="p">)])</span><span class="w">

</span><span class="c1"># Perform DE analysis now using the filtered data</span><span class="w">
</span><span class="n">results_transcripts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stattest</span><span class="p">(</span><span class="n">bg_filt</span><span class="p">,</span><span class="w"> </span><span class="n">feature</span><span class="o">=</span><span class="s2">"transcript"</span><span class="p">,</span><span class="w"> </span><span class="n">covariate</span><span class="o">=</span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="n">getFC</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">meas</span><span class="o">=</span><span class="s2">"FPKM"</span><span class="p">)</span><span class="w">
</span><span class="n">results_transcripts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">results_transcripts</span><span class="p">,</span><span class="w"> </span><span class="n">bg_filt_transcript_names</span><span class="p">,</span><span class="w"> </span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="p">),</span><span class="w"> </span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"t_id"</span><span class="p">))</span><span class="w">

</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stattest</span><span class="p">(</span><span class="n">bg_filt</span><span class="p">,</span><span class="w"> </span><span class="n">feature</span><span class="o">=</span><span class="s2">"gene"</span><span class="p">,</span><span class="w"> </span><span class="n">covariate</span><span class="o">=</span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="n">getFC</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">meas</span><span class="o">=</span><span class="s2">"FPKM"</span><span class="p">)</span><span class="w">
</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="w"> </span><span class="n">bg_filt_gene_names</span><span class="p">,</span><span class="w"> </span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="p">),</span><span class="w"> </span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"gene_id"</span><span class="p">))</span><span class="w">

</span><span class="c1"># Output the filtered list of genes and transcripts and save to tab delimited files</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">results_transcripts</span><span class="p">,</span><span class="w"> </span><span class="s2">"Transfected_vs_Control_transcript_results_filtered.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="w"> </span><span class="s2">"Transfected_vs_Control_gene_results_filtered.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Identify the significant genes with p-value &lt; 0.05</span><span class="w">
</span><span class="n">sig_transcripts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subset</span><span class="p">(</span><span class="n">results_transcripts</span><span class="p">,</span><span class="w"> </span><span class="n">results_transcripts</span><span class="o">$</span><span class="n">pval</span><span class="o">&lt;</span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">sig_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subset</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="w"> </span><span class="n">results_genes</span><span class="o">$</span><span class="n">pval</span><span class="o">&lt;</span><span class="m">0.05</span><span class="p">)</span><span class="w">

</span><span class="n">sig_transcripts_ordered</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sig_transcripts</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">sig_transcripts</span><span class="o">$</span><span class="n">pval</span><span class="p">),]</span><span class="w">
</span><span class="n">sig_genes_ordered</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sig_genes</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">sig_genes</span><span class="o">$</span><span class="n">pval</span><span class="p">),]</span><span class="w">

</span><span class="c1"># Output the significant gene results to a pair of tab delimited files</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">sig_transcripts_ordered</span><span class="p">,</span><span class="w"> </span><span class="s2">"Transfected_vs_Control_transcript_results_sig.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">sig_genes_ordered</span><span class="p">,</span><span class="w"> </span><span class="s2">"Transfected_vs_Control_gene_results_sig.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Exit the R session</span><span class="w">
</span><span class="n">quit</span><span class="p">(</span><span class="n">save</span><span class="o">=</span><span class="s2">"no"</span><span class="p">)</span><span class="w">

</span></code></pre></div></div>
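
<p>To see what the variance filter above does, here is a minimal sketch on a toy FPKM matrix. The matrix, its row names, and the variable names are hypothetical, and base R’s <code>apply()</code> with <code>var()</code> stands in for <code>rowVars()</code> so that the example runs without loading any packages.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy FPKM matrix (hypothetical values): 4 transcripts (rows) x 6 samples (columns)
fpkm = matrix(c( 0, 0,  0, 0, 0, 0,
                 5, 5,  5, 5, 5, 5,
                 1, 2,  3, 4, 5, 6,
                10, 0, 12, 1, 9, 2),
              nrow=4, byrow=TRUE,
              dimnames=list(c("t1","t2","t3","t4"), c("T1","T2","T3","C1","C2","C3")))

# Per-row variance across samples, computed with base R
row_variance = apply(fpkm, 1, var)

# Keep only the rows whose variance across samples is greater than one
fpkm_filt = fpkm[row_variance &gt; 1, , drop=FALSE]

# Show which transcripts survive the filter
rownames(fpkm_filt)
</code></pre></div></div>

<p>In this toy case the constant but well-expressed row <code>t2</code> is dropped along with the all-zero row <code>t1</code>. A variance filter removes features that do not change across samples, which is not quite the same thing as removing low-abundance features.</p>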

<p><strong>Q12.)</strong> Are there any significant differentially expressed genes? How many in total do you see? If we expected SOX4 to be differentially expressed, why don’t we see it in this case?</p>

<p><strong>A12.)</strong> Yes, there are about 523 significantly differentially expressed genes. Because we are using only a subset of the fully sequenced library for each sample, the SOX4 signal is not significant at the adjusted p-value level. You can try re-running the above exercise on your own using all the reads from each sample in the original data set, which will give you greater resolution of each gene’s expression and better estimates of its mean and variance.</p>
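
<p>The difference between a raw and an adjusted p-value cutoff can be illustrated with a small sketch. The p-values below are made up for illustration, and the Benjamini-Hochberg method in base R’s <code>p.adjust()</code> is used here as an assumed stand-in for the adjustment behind the <code>qval</code> column.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical raw p-values for 8 genes (not real results)
toy = data.frame(gene_name=paste0("gene", 1:8),
                 pval=c(0.001, 0.01, 0.02, 0.03, 0.04, 0.2, 0.5, 0.9))

# Adjust for multiple testing (Benjamini-Hochberg, an assumption for this sketch)
toy$qval = p.adjust(toy$pval, method="BH")

# Count genes passing each cutoff
n_raw = sum(toy$pval &lt; 0.05)
n_adj = sum(toy$qval &lt; 0.05)
n_raw
n_adj
</code></pre></div></div>

<p>Here five genes pass the raw cutoff but only two pass after adjustment. In the exercise itself, the analogous comparison would be <code>sum(results_genes$pval &lt; 0.05)</code> against <code>sum(results_genes$qval &lt; 0.05)</code>.</p>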

<h2 id="part-5-differential-expression-analysis-visualization">Part 5: Differential Expression Analysis Visualization</h2>

<p><strong>Q13.)</strong> What plots can you generate to help you visualize this gene expression profile?</p>

<p><strong>A13.)</strong> The CummeRbund package provides a wide variety of plots that can be used to visualize a gene’s expression profile or genes that are differentially expressed. Some of these plots include heatmaps, boxplots, and volcano plots. Alternatively, you can create custom plots using ggplot2 or base R plotting commands such as those provided in the supplementary tutorials. Start with something very simple, such as a scatter plot of transfected vs. control FPKM values.</p>

<p>Make sure we are in the directory with our DE results</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_INT_DIR</span>/ballgown/
</code></pre></div></div>

<p>Restart an R session:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">R</span><span class="w">
</span></code></pre></div></div>

<p>The following R commands create summary visualizations of the DE results from Ballgown</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">
</span><span class="c1">#Load libraries</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">gplots</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">GenomicRanges</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ballgown</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggrepel</span><span class="p">)</span><span class="w">

</span><span class="c1">#Import expression and differential expression results from the HISAT2/StringTie/Ballgown pipeline</span><span class="w">
</span><span class="n">load</span><span class="p">(</span><span class="s1">'bg.rda'</span><span class="p">)</span><span class="w">

</span><span class="c1"># View a summary of the ballgown object</span><span class="w">
</span><span class="n">bg</span><span class="w">

</span><span class="c1"># Load gene names for lookup later in the tutorial</span><span class="w">
</span><span class="n">bg_table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">texpr</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="s1">'all'</span><span class="p">)</span><span class="w">
</span><span class="n">bg_gene_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bg_table</span><span class="p">[,</span><span class="w"> </span><span class="m">9</span><span class="o">:</span><span class="m">10</span><span class="p">])</span><span class="w">

</span><span class="c1"># Pull the gene_expression data frame from the ballgown object</span><span class="w">
</span><span class="n">gene_expression</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">gexpr</span><span class="p">(</span><span class="n">bg</span><span class="p">))</span><span class="w">

</span><span class="c1">#Set min value to 1</span><span class="w">
</span><span class="n">min_nonzero</span><span class="o">=</span><span class="m">1</span><span class="w">

</span><span class="c1"># Set the columns for finding FPKM and create shorter names for figures</span><span class="w">
</span><span class="n">data_columns</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">)</span><span class="w">
</span><span class="n">short_names</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"T1"</span><span class="p">,</span><span class="s2">"T2"</span><span class="p">,</span><span class="s2">"T3"</span><span class="p">,</span><span class="s2">"C1"</span><span class="p">,</span><span class="s2">"C2"</span><span class="p">,</span><span class="s2">"C3"</span><span class="p">)</span><span class="w">

</span><span class="c1">#Calculate the FPKM sum for all 6 libraries</span><span class="w">
</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"sum"</span><span class="p">]</span><span class="o">=</span><span class="n">apply</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="n">data_columns</span><span class="p">],</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w">

</span><span class="c1">#Identify genes where the sum of FPKM across all samples is above some arbitrary threshold</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"sum"</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">

</span><span class="c1">#Calculate the correlation between all pairs of data</span><span class="w">
</span><span class="n">r</span><span class="o">=</span><span class="n">cor</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">data_columns</span><span class="p">],</span><span class="w"> </span><span class="n">use</span><span class="o">=</span><span class="s2">"pairwise.complete.obs"</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="o">=</span><span class="s2">"pearson"</span><span class="p">)</span><span class="w">

</span><span class="c1">#Print out these correlation values</span><span class="w">
</span><span class="n">r</span><span class="w">

</span><span class="c1"># Open a PDF file where we will save some plots. </span><span class="w">
</span><span class="c1"># We will save all figures and then view the PDF at the end</span><span class="w">
</span><span class="n">pdf</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="s2">"transfected_vs_control_figures.pdf"</span><span class="p">)</span><span class="w">

</span><span class="n">data_colors</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"tomato1"</span><span class="p">,</span><span class="s2">"tomato2"</span><span class="p">,</span><span class="s2">"tomato3"</span><span class="p">,</span><span class="s2">"royalblue1"</span><span class="p">,</span><span class="s2">"royalblue2"</span><span class="p">,</span><span class="s2">"royalblue3"</span><span class="p">)</span><span class="w">

</span><span class="c1">#Plot - Convert correlation to 'distance', and use 'multi-dimensional scaling' to display the relative differences between libraries</span><span class="w">
</span><span class="c1">#This step calculates 2-dimensional coordinates to plot points for each library</span><span class="w">
</span><span class="c1">#Libraries with similar expression patterns (highly correlated to each other) should group together</span><span class="w">

</span><span class="c1">#note that the x and y display limits will have to be adjusted for each dataset depending on the amount of variability</span><span class="w">
</span><span class="n">d</span><span class="o">=</span><span class="m">1</span><span class="o">-</span><span class="n">r</span><span class="w">
</span><span class="n">mds</span><span class="o">=</span><span class="n">cmdscale</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">eig</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">mds</span><span class="o">$</span><span class="n">points</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="o">=</span><span class="s2">"MDS distance plot (all non-zero genes)"</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">-0.01</span><span class="p">,</span><span class="m">0.01</span><span class="p">),</span><span class="w"> </span><span class="n">ylim</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">-0.01</span><span class="p">,</span><span class="m">0.01</span><span class="p">))</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">mds</span><span class="o">$</span><span class="n">points</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">mds</span><span class="o">$</span><span class="n">points</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"grey"</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="o">=</span><span class="m">16</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="n">mds</span><span class="o">$</span><span class="n">points</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">mds</span><span class="o">$</span><span class="n">points</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">short_names</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="n">data_colors</span><span class="p">)</span><span class="w">

</span><span class="c1"># Calculate the differential expression results including significance</span><span class="w">
</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stattest</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="n">feature</span><span class="o">=</span><span class="s2">"gene"</span><span class="p">,</span><span class="w"> </span><span class="n">covariate</span><span class="o">=</span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="n">getFC</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">meas</span><span class="o">=</span><span class="s2">"FPKM"</span><span class="p">)</span><span class="w">
</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="n">bg_gene_names</span><span class="p">,</span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="p">),</span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"gene_id"</span><span class="p">))</span><span class="w">

</span><span class="c1"># Plot - Display the mean expression values from Transfected and Control and mark those that are significantly differentially expressed</span><span class="w">

</span><span class="n">sig</span><span class="o">=</span><span class="n">which</span><span class="p">(</span><span class="n">results_genes</span><span class="o">$</span><span class="n">pval</span><span class="o">&lt;</span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">results_genes</span><span class="p">[,</span><span class="s2">"de"</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">log2</span><span class="p">(</span><span class="n">results_genes</span><span class="p">[,</span><span class="s2">"fc"</span><span class="p">])</span><span class="w">

</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"Transfected"</span><span class="p">]</span><span class="o">=</span><span class="n">apply</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">)],</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w">
</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"Control"</span><span class="p">]</span><span class="o">=</span><span class="n">apply</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="o">:</span><span class="m">6</span><span class="p">)],</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w">

</span><span class="n">x</span><span class="o">=</span><span class="n">log2</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"Transfected"</span><span class="p">]</span><span class="o">+</span><span class="n">min_nonzero</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="o">=</span><span class="n">log2</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"Control"</span><span class="p">]</span><span class="o">+</span><span class="n">min_nonzero</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="o">=</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">0.25</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="o">=</span><span class="s2">"Transfected FPKM (log2)"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="o">=</span><span class="s2">"Control FPKM (log2)"</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="o">=</span><span class="s2">"Transfected vs Control FPKMs"</span><span class="p">)</span><span class="w">
</span><span class="n">abline</span><span class="p">(</span><span class="n">a</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">xsig</span><span class="o">=</span><span class="n">x</span><span class="p">[</span><span class="n">sig</span><span class="p">]</span><span class="w">
</span><span class="n">ysig</span><span class="o">=</span><span class="n">y</span><span class="p">[</span><span class="n">sig</span><span class="p">]</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">xsig</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">ysig</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"magenta"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="o">=</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">0.5</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s2">"topleft"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Significant"</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"magenta"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="o">=</span><span class="m">16</span><span class="p">)</span><span class="w">

</span><span class="c1">#Get the gene symbols for the top N (according to corrected p-value) and display them on the plot</span><span class="w">
</span><span class="n">topn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sig</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">results_genes</span><span class="p">[</span><span class="n">sig</span><span class="p">,</span><span class="s2">"qval"</span><span class="p">])[</span><span class="m">1</span><span class="o">:</span><span class="m">25</span><span class="p">]]</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">topn</span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="p">[</span><span class="n">topn</span><span class="p">],</span><span class="w"> </span><span class="n">results_genes</span><span class="p">[</span><span class="n">topn</span><span class="p">,</span><span class="s2">"gene_name"</span><span class="p">],</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">0.75</span><span class="p">,</span><span class="w"> </span><span class="n">srt</span><span class="o">=</span><span class="m">45</span><span class="p">)</span><span class="w">

</span><span class="c1">#Plot - Volcano plot</span><span class="w">

</span><span class="c1"># set default for all genes to "no change"</span><span class="w">
</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"No"</span><span class="w">

</span><span class="c1"># if log2 fold change &gt; 0.6 and p-value &lt; 0.05, set as "Up"</span><span class="w">
</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">de</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0.6</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">results_genes</span><span class="o">$</span><span class="n">pval</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"Up"</span><span class="w">

</span><span class="c1"># if log2 fold change &lt; -0.6 and p-value &lt; 0.05, set as "Down"</span><span class="w">
</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">de</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">-0.6</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">results_genes</span><span class="o">$</span><span class="n">pval</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"Down"</span><span class="w">

</span><span class="n">results_genes</span><span class="o">$</span><span class="n">gene_label</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NA</span><span class="w">

</span><span class="c1"># write the gene names of those significantly upregulated/downregulated to a new column</span><span class="w">
</span><span class="n">results_genes</span><span class="o">$</span><span class="n">gene_label</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"No"</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">results_genes</span><span class="o">$</span><span class="n">gene_name</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"No"</span><span class="p">]</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">results_genes</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"No"</span><span class="p">,],</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">de</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=-</span><span class="n">log10</span><span class="p">(</span><span class="n">pval</span><span class="p">),</span><span class="w"> </span><span class="n">label</span><span class="o">=</span><span class="n">gene_label</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">diffexpressed</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"log2Foldchange"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Differentially expressed"</span><span class="p">,</span><span class="w"> </span><span class="n">values</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"red"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">theme_minimal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">geom_text_repel</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">geom_vline</span><span class="p">(</span><span class="n">xintercept</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">-0.6</span><span class="p">,</span><span class="w"> </span><span class="m">0.6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">geom_hline</span><span class="p">(</span><span class="n">yintercept</span><span class="o">=-</span><span class="n">log10</span><span class="p">(</span><span class="m">0.05</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">guides</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">guide_legend</span><span class="p">(</span><span class="n">override.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">5</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">results_genes</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"No"</span><span class="p">,],</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">de</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=-</span><span class="n">log10</span><span class="p">(</span><span class="n">pval</span><span class="p">)),</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">)</span><span class="w">


</span><span class="n">dev.off</span><span class="p">()</span><span class="w">

</span><span class="c1"># Exit the R session</span><span class="w">
</span><span class="n">quit</span><span class="p">(</span><span class="n">save</span><span class="o">=</span><span class="s2">"no"</span><span class="p">)</span><span class="w">

</span></code></pre></div></div>]]></content><author><name>Zachary Skidmore</name></author><category term="Module-09-Appendix" /><summary type="html"><![CDATA[Integrated Assignment answers Background: Cell lines are often used to study different experimental conditions and to study the function of specific genes by various perturbation approaches. One such type of study involves knocking down expression of a target of interest by shRNA and then using RNA-seq to measure the impact on gene expression. These experiments often include use of a control shRNA to account for any expression changes that may occur from just the introduction of these molecules. Differential expression is performed by comparing biological replicates of shRNA knockdown vs shRNA control. Objectives: In this assignment, we will be using a subset of the GSE114360 dataset, which consists of 6 RNA-seq datasets generated from a cell line (3 transfected with shRNA, and 3 controls). Our goal will be to determine differentially expressed genes. Experimental information and other things to keep in mind: The libraries are prepared as paired end. The samples are sequenced on an Illumina 4000. Each read is 150 bp long The dataset is located here: GSE114360 3 samples transfected with target shRNA and 3 samples with control shRNA Libraries were prepared using standard Illumina protocols For this exercise we will be using a subset of the reads (first 1,000,000 reads from each pair). 
The files are named based on their SRR IDs, and obey the following key: SRR7155055 = CBSLR knockdown sample 1 (T1 - aka transfected 1) SRR7155056 = CBSLR knockdown sample 2 (T2 - aka transfected 2) SRR7155057 = CBSLR knockdown sample 3 (T3 - aka transfected 3) SRR7155058 = control sample 1 (C1 - aka control 1) SRR7155059 = control sample 2 (C2 - aka control 2) SRR7155060 = control sample 3 (C3 - aka control 3) Experimental descriptions from the study authors: Experimental details from the paper: “An RNA transcriptome-sequencing analysis was performed in shRNA-NC or shRNA-CBSLR-1 MKN45 cells cultured under hypoxic conditions for 24 h (Fig. 2A).” Experimental details from the GEO submission: “An RNA transcriptome sequencing analysis was performed in MKN45 cells that were transfected with tcons_00001221 shRNA or control shRNA.” Note that according to GeneCards and HGNC, CBSLR and tcons_00001221 refer to the same gene. Part 0 : Obtaining Data and References Goals: Obtain the files necessary for data processing Familiarize yourself with reference and annotation file format Familiarize yourself with sequence FASTQ format Create a working directory ~/workspace/rnaseq/integrated_assignment/ to store this exercise. Then create a unix environment variable named RNA_INT_DIR that stores this path for convenience in later commands. export RNA_HOME=~/workspace/rnaseq cd $RNA_HOME mkdir -p ~/workspace/rnaseq/integrated_assignment/ export RNA_INT_DIR=~/workspace/rnaseq/integrated_assignment Obtain reference, annotation, adapter and data files and place them in the integrated assignment directory Remember: when initiating an environment variable, we do NOT need the $; however, every time we call the variable, it needs to be preceded by a $. echo $RNA_INT_DIR cd $RNA_INT_DIR wget http://genomedata.org/rnaseq-tutorial/Integrated_Assignment_RNA_Data.tar.gz tar -xvf Integrated_Assignment_RNA_Data.tar.gz Q1.) 
How many items are there under the “reference” directory (counting all files in all sub-directories)? What if this reference file was not provided for you - how would you obtain/create a reference genome fasta file. How about the GTF transcripts file from Ensembl? A1.) The answer is 10. Review these files so that you are familiar with them. If the reference fasta or gtf was not provided, you could obtain them from the Ensembl website under their downloads &gt; databases. cd $RNA_INT_DIR/reference/ tree find . -type f find . -type f | wc -l The . tells the find command to look in the current directory and -type f restricts the search to files only. The | uses the output from the find command and wc -l counts the lines of that output Q2.) How many exons does the gene SOX4 have? Which PCA3 isoform has the most exons? A2.) SOX4 only has 1 exon, while the longest isoform of PCA3 (ENST00000645704) has 7 exons. Review the GTF file so that you are familiar with it. What downstream steps will we need this gtf file for? grep -w "SOX4" Homo_sapiens.GRCh38.92.gtf | less -S grep -w "PCA3" Homo_sapiens.GRCh38.92.gtf | grep -w "exon" | cut -f 9 | cut -d ";" -f 3 | sort | uniq -c Q3.) How many samples do you see under the data directory? A3.) The answer is 6 samples. The number of files is 12 because the sequence data is paired (an R1 and R2 file for each sample). The files are named based on their SRA accession number. cd $RNA_INT_DIR/data/ ls -l ls -1 | wc -l NOTE: The fastq files you have copied above contain only the first 1,000,000 reads. Keep this in mind when you are combing through the results of the differential expression analysis. 
Part 1 : Data preprocessing Goals: Run a quality check with fastqc before and after trimming Familiarize yourself with the options for fastqc to be able to redirect your output Perform adapter trimming and data cleanup on your data using fastp Familiarize yourself with the output metrics from adapter trimming Examine fastqc and/or multiqc reports for the pre- and post-trimmed data Create a new folder that will house the outputs from FastQC. Use the -h option to review the available fastqc options, then run fastqc on the data to assess its quality. cd $RNA_INT_DIR mkdir -p qc/raw_fastqc fastqc $RNA_INT_DIR/data/*.fastq.gz -o qc/raw_fastqc/ cd qc/raw_fastqc multiqc ./ Q4.) What metrics, if any, have the samples failed? Are the errors related? A4.) The per-base sequence content of the samples does not show a flat distribution and is biased towards certain bases at the beginning of the reads. The reason for this bias could be non-random priming during cDNA synthesis giving rise to non-random bases near the beginning/end of each fragment. The QC reports also flag the presence of adapters in the reads. Now based on the output of the html summary, proceed to clean up the reads and rerun fastqc to see if an improvement can be made to the data. Make sure to create a directory to hold any processed reads you may create. 
cd $RNA_INT_DIR mkdir trimmed_reads fastp -i $RNA_INT_DIR/data/SRR7155055_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155055_2.fastq.gz -o trimmed_reads/SRR7155055_1.fastq.gz -O trimmed_reads/SRR7155055_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155055.fastp.json --html trimmed_reads/SRR7155055.fastp.html 2&gt;trimmed_reads/SRR7155055.fastp.log fastp -i $RNA_INT_DIR/data/SRR7155056_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155056_2.fastq.gz -o trimmed_reads/SRR7155056_1.fastq.gz -O trimmed_reads/SRR7155056_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155056.fastp.json --html trimmed_reads/SRR7155056.fastp.html 2&gt;trimmed_reads/SRR7155056.fastp.log fastp -i $RNA_INT_DIR/data/SRR7155057_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155057_2.fastq.gz -o trimmed_reads/SRR7155057_1.fastq.gz -O trimmed_reads/SRR7155057_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155057.fastp.json --html trimmed_reads/SRR7155057.fastp.html 2&gt;trimmed_reads/SRR7155057.fastp.log fastp -i $RNA_INT_DIR/data/SRR7155058_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155058_2.fastq.gz -o trimmed_reads/SRR7155058_1.fastq.gz -O trimmed_reads/SRR7155058_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155058.fastp.json --html trimmed_reads/SRR7155058.fastp.html 2&gt;trimmed_reads/SRR7155058.fastp.log fastp -i $RNA_INT_DIR/data/SRR7155059_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155059_2.fastq.gz -o trimmed_reads/SRR7155059_1.fastq.gz -O trimmed_reads/SRR7155059_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155059.fastp.json --html trimmed_reads/SRR7155059.fastp.html 
2&gt;trimmed_reads/SRR7155059.fastp.log fastp -i $RNA_INT_DIR/data/SRR7155060_1.fastq.gz -I $RNA_INT_DIR/data/SRR7155060_2.fastq.gz -o trimmed_reads/SRR7155060_1.fastq.gz -O trimmed_reads/SRR7155060_2.fastq.gz -l 25 --adapter_fasta $RNA_INT_DIR/adapter/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json trimmed_reads/SRR7155060.fastp.json --html trimmed_reads/SRR7155060.fastp.html 2&gt;trimmed_reads/SRR7155060.fastp.log Q5.) What average percentage of reads remain after adapter trimming/cleanup with fastp? Why do reads get tossed out? A5.) At this point, we could look in the log files individually. Alternatively, we could utilize the command line with a command like the one below. grep -A 1 Read1 trimmed_reads/*.log Doing this, we find that around 93-95% of reads survive after adapter trimming and cleanup with fastp. Reads are tossed out when they fall below our minimum read length threshold of 25 after trimming (too short), have poor sequence quality, or contain too many N’s. Q6.) What sample has the largest number of reads after trimming? A6.) The control sample 2 (SRR7155060) has the most reads (1,907,336 individual reads). An easy way to figure out the number of reads is to check the log file written by the trimming step. Looking at the “remaining reads” row, we see the reads (each read in a pair counted individually) that survive the trimming. We can also look at this from the command line. grep "passed" trimmed_reads/*.log Alternatively, you can make use of the command ‘wc’. This command counts the number of lines in a file. Since fastq files have 4 lines per read, the total number of lines must be divided by 4. 
Running this command only gives you the total number of lines in the fastq file (Note that because the data is compressed, we need to use zcat to unzip it and print it to the screen, before passing it on to the wc command): zcat $RNA_INT_DIR/data/SRR7155059_1.fastq.gz | wc -l zcat $RNA_INT_DIR/trimmed_reads/SRR7155059_1.fastq.gz | wc -l We could also run fastqc and multiqc on the trimmed data and visualize the remaining reads that way. cd $RNA_INT_DIR mkdir -p qc/trimmed_fastqc fastqc $RNA_INT_DIR/trimmed_reads/*.fastq.gz -o qc/trimmed_fastqc/ cd qc/trimmed_fastqc multiqc ./ Part 2: Data alignment Goals: Familiarize yourself with HISAT2 alignment options Perform alignments using hisat2 and the trimmed version of the raw sequence data above Sort your alignments and convert into compressed bam format using samtools sort Obtain alignment summary information using samtools flagstat To create HISAT2 alignment commands for all of the six samples and run alignments: Create a directory to store the alignment results echo $RNA_INT_DIR/alignments mkdir -p $RNA_INT_DIR/alignments cd $RNA_INT_DIR/alignments Run alignment commands for each sample hisat2 -p 8 --rg-id=T1 --rg SM:Transfected1 --rg LB:Transfected1_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155055_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155055_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155055.sam hisat2 -p 8 --rg-id=T2 --rg SM:Transfected2 --rg LB:Transfected2_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155056_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155056_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155056.sam hisat2 -p 8 --rg-id=T3 --rg SM:Transfected3 --rg LB:Transfected3_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155057_1.fastq.gz -2 
$RNA_INT_DIR/trimmed_reads/SRR7155057_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155057.sam hisat2 -p 8 --rg-id=C1 --rg SM:Control1 --rg LB:Control1_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155058_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155058_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155058.sam hisat2 -p 8 --rg-id=C2 --rg SM:Control2 --rg LB:Control2_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155059_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155059_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155059.sam hisat2 -p 8 --rg-id=C3 --rg SM:Control3 --rg LB:Control3_lib --rg PL:ILLUMINA -x $RNA_INT_DIR/reference/Homo_sapiens.GRCh38 --dta --rna-strandness RF -1 $RNA_INT_DIR/trimmed_reads/SRR7155060_1.fastq.gz -2 $RNA_INT_DIR/trimmed_reads/SRR7155060_2.fastq.gz -S $RNA_INT_DIR/alignments/SRR7155060.sam Next, convert sam alignments to bam. cd $RNA_INT_DIR/alignments samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155055.bam $RNA_INT_DIR/alignments/SRR7155055.sam samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155056.bam $RNA_INT_DIR/alignments/SRR7155056.sam samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155057.bam $RNA_INT_DIR/alignments/SRR7155057.sam samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155058.bam $RNA_INT_DIR/alignments/SRR7155058.sam samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155059.bam $RNA_INT_DIR/alignments/SRR7155059.sam samtools sort -@ 8 -o $RNA_INT_DIR/alignments/SRR7155060.bam $RNA_INT_DIR/alignments/SRR7155060.sam Q7.) How can we obtain summary statistics for each aligned file? A7.) There are many RNA-seq QC tools available that can provide you with detailed information about the quality of the aligned sample (e.g. FastQC and RSeQC). However, for a simple summary of aligned reads counts you can use samtools flagstat. 
cd $RNA_INT_DIR/alignments samtools flagstat SRR7155055.bam &gt; SRR7155055.flagstat.txt samtools flagstat SRR7155056.bam &gt; SRR7155056.flagstat.txt samtools flagstat SRR7155057.bam &gt; SRR7155057.flagstat.txt samtools flagstat SRR7155058.bam &gt; SRR7155058.flagstat.txt samtools flagstat SRR7155059.bam &gt; SRR7155059.flagstat.txt samtools flagstat SRR7155060.bam &gt; SRR7155060.flagstat.txt Pull out summaries of mapped reads from the flagstat files grep "mapped (" *.flagstat.txt Q8.) Approximately how much space is saved by converting the sam to a bam format? A8.) We get about a 5.5x compression by using the bam format instead of the sam format. This can be seen by adding the -lh option when listing the files in the alignments directory. ls -lh $RNA_INT_DIR/alignments/ To specifically look at the sizes of the sam and bam files, we could use du -h, which shows us the disk space they are utilizing in human readable format. du -h $RNA_INT_DIR/alignments/*.sam du -h $RNA_INT_DIR/alignments/*.bam In order to make visualization easier, you should now merge each of your replicate sample bams into one combined BAM for each condition. Make sure to index these bams afterwards to be able to view them on IGV. cd $RNA_INT_DIR/alignments java -Xmx2g -jar $PICARD MergeSamFiles OUTPUT=transfected.bam INPUT=SRR7155055.bam INPUT=SRR7155056.bam INPUT=SRR7155057.bam java -Xmx2g -jar $PICARD MergeSamFiles OUTPUT=control.bam INPUT=SRR7155058.bam INPUT=SRR7155059.bam INPUT=SRR7155060.bam To visualize these merged bam files in IGV, we’ll need to index them. We can do so with the following commands. cd $RNA_INT_DIR/alignments samtools index $RNA_INT_DIR/alignments/control.bam samtools index $RNA_INT_DIR/alignments/transfected.bam Try viewing genes such as TP53 to get a sense of how the data is aligned. 
To do this: Load up IGV Change the reference genome to “Human hg38” in the top-left category Click on File &gt; Load from URL, and in the File URL enter: “http:///rnaseq/integrated_assignment/alignments/transfected.bam". Repeat this step and enter "http:///rnaseq/integrated_assignment/alignments/control.bam" to load the other bam. Right-click on the alignments track in the middle, and Group alignments by “Library” Jump to TP53 by typing it into the search bar above Q9.) What portion of the gene do the reads seem to be piling up on? What would be different if we were viewing whole-genome sequencing data? A9.) The reads all pile up on the exonic regions of the gene since we’re dealing with RNA-Sequencing data. Not all exons have equal coverage, and this is due to different isoforms of the gene being sequenced. If the data was from a whole-genome experiment, we would ideally expect to see equal coverage across the whole gene length. Right-click in the middle of the page, and click on “Expanded” to view the reads more easily. Q10.) What are the lines connecting the reads trying to convey? A10.) The lines show a connected read, where one part of the read begins mapping to one exon, while the other part maps to the next exon. This is important in RNA-Sequencing alignment as aligners must be aware to take this partial alignment strategy into account. 
Part 3: Expression Estimation Goals: Familiarize yourself with Stringtie options and how to run Stringtie in “reference-only” mode Create an expression results directory, run stringtie on all 6 samples, and store the results in appropriately named subdirectories in this results dir Obtain expression values for the gene SOX4 cd $RNA_INT_DIR/ mkdir -p $RNA_INT_DIR/expression stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/transfected1/transcripts.gtf -A expression/transfected1/gene_abundances.tsv alignments/SRR7155055.bam stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/transfected2/transcripts.gtf -A expression/transfected2/gene_abundances.tsv alignments/SRR7155056.bam stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/transfected3/transcripts.gtf -A expression/transfected3/gene_abundances.tsv alignments/SRR7155057.bam stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/control1/transcripts.gtf -A expression/control1/gene_abundances.tsv alignments/SRR7155058.bam stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/control2/transcripts.gtf -A expression/control2/gene_abundances.tsv alignments/SRR7155059.bam stringtie -p 8 -G reference/Homo_sapiens.GRCh38.92.gtf -e -B -o expression/control3/transcripts.gtf -A expression/control3/gene_abundances.tsv alignments/SRR7155060.bam Q11.) How can you obtain the expression of the gene SOX4 across the transfected and control samples? A11.) To look for the expression value of a specific gene, you can use the command ‘grep’ followed by the gene name and the path to the expression file grep SOX4 $RNA_INT_DIR/expression/*/transcripts.gtf | cut -f 1,9 | grep FPKM Part 4: Differential Expression Analysis Goals: Perform differential analysis between the transfected and control samples mkdir -p $RNA_INT_DIR/ballgown/ cd $RNA_INT_DIR/ballgown/ Perform transfected vs. 
control comparison, using all samples, for known transcripts: Adapt the R tutorial code that was used in Differential Expression section. Modify it to work on these data (which are also a 3x3 replicate comparison of two conditions). First, start an R session: R Run the following R commands in your R session. # load the required libraries library(ballgown) library(genefilter) library(dplyr) library(devtools) # Create phenotype data needed for ballgown analysis. Recall that: # "T1-T3" refers to "transfected" (CBSLR shRNA knockdown) replicates # "C1-C3" refers to "control" (shRNA control) replicates ids=c("transfected1","transfected2","transfected3","control1","control2","control3") type=c("Tranfected","Tranfected","Tranfected","Control","Control","Control") results="/home/ubuntu/workspace/rnaseq/integrated_assignment/expression/" path=paste(results,ids,sep="") pheno_data=data.frame(ids,type,path) pheno_data # Load ballgown data structure and save it to a variable "bg" bg = ballgown(samples=as.vector(pheno_data$path), pData=pheno_data) # Display a description of this object bg # Load all attributes including gene name bg_table = texpr(bg, 'all') bg_gene_names = unique(bg_table[, 9:10]) bg_transcript_names = unique(bg_table[,c(1,6)]) # Save the ballgown object to a file for later use save(bg, file='bg.rda') # Perform differential expression (DE) analysis with no filtering, at both gene and transcript level results_transcripts = stattest(bg, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM") results_transcripts = merge(results_transcripts, bg_transcript_names, by.x=c("id"), by.y=c("t_id")) results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM") results_genes = merge(results_genes, bg_gene_names, by.x=c("id"), by.y=c("gene_id")) # Save a tab delimited file for both the transcript and gene results write.table(results_transcripts, "Transfected_vs_Control_transcript_results.tsv", sep="\t", quote=FALSE, row.names = FALSE) 
write.table(results_genes, "Transfected_vs_Control_gene_results.tsv", sep="\t", quote=FALSE, row.names = FALSE) # Filter low-abundance genes. Here we remove all transcripts with a variance across the samples of less than one bg_filt = subset (bg,"rowVars(texpr(bg)) &gt; 1", genomesubset=TRUE) # Load all attributes including gene name bg_filt_table = texpr(bg_filt , 'all') bg_filt_gene_names = unique(bg_filt_table[, 9:10]) bg_filt_transcript_names = unique(bg_filt_table[,c(1,6)]) # Perform DE analysis now using the filtered data results_transcripts = stattest(bg_filt, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM") results_transcripts = merge(results_transcripts, bg_filt_transcript_names, by.x=c("id"), by.y=c("t_id")) results_genes = stattest(bg_filt, feature="gene", covariate="type", getFC=TRUE, meas="FPKM") results_genes = merge(results_genes, bg_filt_gene_names, by.x=c("id"), by.y=c("gene_id")) # Output the filtered list of genes and transcripts and save to tab delimited files write.table(results_transcripts, "Transfected_vs_Control_transcript_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE) write.table(results_genes, "Transfected_vs_Control_gene_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE) # Identify the significant genes with p-value &lt; 0.05 sig_transcripts = subset(results_transcripts, results_transcripts$pval&lt;0.05) sig_genes = subset(results_genes, results_genes$pval&lt;0.05) sig_transcripts_ordered = sig_transcripts[order(sig_transcripts$pval),] sig_genes_ordered = sig_genes[order(sig_genes$pval),] # Output the significant gene results to a pair of tab delimited files write.table(sig_transcripts_ordered, "Transfected_vs_Control_transcript_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE) write.table(sig_genes_ordered, "Transfected_vs_Control_gene_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE) # Exit the R session quit(save="no") Q12.) 
Are there any significant differentially expressed genes? How many in total do you see? If we expected SOX4 to be differentially expressed, why don’t we see it in this case? A12.) Yes, there are about 523 significantly differentially expressed genes. Because we’re using only a subset of the fully sequenced library for each sample, the SOX4 signal is not significant at the adjusted p-value level. You can try re-running the above exercise on your own using all the reads from each sample in the original data set, which will give you greater resolution of the expression of each gene to build mean and variance estimates for each gene’s expression. Part 5: Differential Expression Analysis Visualization Q13.) What plots can you generate to help you visualize this gene expression profile? A13.) The CummeRbund package provides a wide variety of plots that can be used to visualize a gene’s expression profile or genes that are differentially expressed. Some of these plots include heatmaps, boxplots, and volcano plots. Alternatively, you can create custom plots using ggplot2 or base R plotting commands such as those provided in the supplementary tutorials. Start with something very simple such as a scatter plot of transfected vs. control FPKM values. 
Make sure we are in the directory with our DE results cd $RNA_INT_DIR/ballgown/ Restart an R session: R The following R commands create summary visualizations of the DE results from Ballgown #Load libraries library(ggplot2) library(gplots) library(GenomicRanges) library(ballgown) library(ggrepel) #Import expression and differential expression results from the HISAT2/StringTie/Ballgown pipeline load('bg.rda') # View a summary of the ballgown object bg # Load gene names for lookup later in the tutorial bg_table = texpr(bg, 'all') bg_gene_names = unique(bg_table[, 9:10]) # Pull the gene_expression data frame from the ballgown object gene_expression = as.data.frame(gexpr(bg)) #Set min value to 1 min_nonzero=1 # Set the columns for finding FPKM and create shorter names for figures data_columns=c(1:6) short_names=c("T1","T2","T3","C1","C2","C3") #Calculate the FPKM sum for all 6 libraries gene_expression[,"sum"]=apply(gene_expression[,data_columns], 1, sum) #Identify genes where the sum of FPKM across all samples is above some arbitrary threshold i = which(gene_expression[,"sum"] &gt; 5) #Calculate the correlation between all pairs of data r=cor(gene_expression[i,data_columns], use="pairwise.complete.obs", method="pearson") #Print out these correlation values r # Open a PDF file where we will save some plots. 
# We will save all figures and then view the PDF at the end pdf(file="transfected_vs_control_figures.pdf") data_colors=c("tomato1","tomato2","tomato3","royalblue1","royalblue2","royalblue3") #Plot - Convert correlation to 'distance', and use 'multi-dimensional scaling' to display the relative differences between libraries #This step calculates 2-dimensional coordinates to plot points for each library #Libraries with similar expression patterns (highly correlated to each other) should group together #note that the x and y display limits will have to be adjusted for each dataset depending on the amount of variability d=1-r mds=cmdscale(d, k=2, eig=TRUE) par(mfrow=c(1,1)) plot(mds$points, type="n", xlab="", ylab="", main="MDS distance plot (all non-zero genes)", xlim=c(-0.01,0.01), ylim=c(-0.01,0.01)) points(mds$points[,1], mds$points[,2], col="grey", cex=2, pch=16) text(mds$points[,1], mds$points[,2], short_names, col=data_colors) # Calculate the differential expression results including significance results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM") results_genes = merge(results_genes,bg_gene_names,by.x=c("id"),by.y=c("gene_id")) # Plot - Display the grand expression values from UHR and HBR and mark those that are significantly differentially expressed sig=which(results_genes$pval&lt;0.05) results_genes[,"de"] = log2(results_genes[,"fc"]) gene_expression[,"Transfected"]=apply(gene_expression[,c(1:3)], 1, mean) gene_expression[,"Control"]=apply(gene_expression[,c(4:6)], 1, mean) x=log2(gene_expression[,"Transfected"]+min_nonzero) y=log2(gene_expression[,"Control"]+min_nonzero) plot(x=x, y=y, pch=16, cex=0.25, xlab="Transfected FPKM (log2)", ylab="Control FPKM (log2)", main="Transfected vs Control FPKMs") abline(a=0, b=1) xsig=x[sig] ysig=y[sig] points(x=xsig, y=ysig, col="magenta", pch=16, cex=0.5) legend("topleft", "Significant", col="magenta", pch=16) #Get the gene symbols for the top N (according to corrected p-value) and 
display them on the plot # map positions within the significant subset back to rows of the full table topn = sig[order(results_genes[sig,"qval"])[1:25]] text(x[topn], y[topn], results_genes[topn,"gene_name"], col="black", cex=0.75, srt=45) #Plot - Volcano plot # set default for all genes to "no change" results_genes$diffexpressed &lt;- "No" # if log2Foldchange &gt; 0.6 and pvalue &lt; 0.05, set as "Up regulated" results_genes$diffexpressed[results_genes$de &gt; 0.6 &amp; results_genes$pval &lt; 0.05] &lt;- "Up" # if log2Foldchange &lt; -0.6 and pvalue &lt; 0.05, set as "Down regulated" results_genes$diffexpressed[results_genes$de &lt; -0.6 &amp; results_genes$pval &lt; 0.05] &lt;- "Down" results_genes$gene_label &lt;- NA # write the gene names of those significantly upregulated/downregulated to a new column results_genes$gene_label[results_genes$diffexpressed != "No"] &lt;- results_genes$gene_name[results_genes$diffexpressed != "No"] ggplot(data=results_genes[results_genes$diffexpressed != "No",], aes(x=de, y=-log10(pval), label=gene_label, color = diffexpressed)) + xlab("log2Foldchange") + scale_color_manual(name = "Differentially expressed", values=c("blue", "red")) + geom_point() + theme_minimal() + geom_text_repel() + geom_vline(xintercept=c(-0.6, 0.6), col="red") + geom_hline(yintercept=-log10(0.05), col="red") + guides(colour = guide_legend(override.aes = list(size=5))) + geom_point(data = results_genes[results_genes$diffexpressed == "No",], aes(x=de, y=-log10(pval)), colour = "black") dev.off() # Exit the R session quit(save="no")]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://www.rnabio.org//assets/logos/DNA.jpg" /><media:content medium="image" url="http://www.rnabio.org//assets/logos/DNA.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Team Assignment - ExpressionDE Answers</title><link href="http://www.rnabio.org//module-09-appendix/0009/07/01/Team_Assignment_ExpressionDE-Answers/" rel="alternate" 
type="text/html" title="Team Assignment - ExpressionDE Answers" /><published>0009-07-01T00:00:00+00:00</published><updated>0009-07-01T00:00:00+00:00</updated><id>http://www.rnabio.org//module-09-appendix/0009/07/01/Team_Assignment_ExpressionDE-Answers</id><content type="html" xml:base="http://www.rnabio.org//module-09-appendix/0009/07/01/Team_Assignment_ExpressionDE-Answers/"><![CDATA[<p>The solutions below are for team A. The other teams' solutions will be very similar, but each is based on that team's own chromosome dataset.</p>

<h4 id="estimate-expression-levels">Estimate expression levels</h4>
<p>Use StringTie to estimate gene and transcript abundance levels for each sample.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_HOME</span>/team_exercise
<span class="nb">mkdir </span>expression

stringtie <span class="nt">-p</span> 4 <span class="nt">-G</span> references/chr11_Homo_sapiens.GRCh38.95.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/KO_sample1/transcripts.gtf <span class="nt">-A</span> expression/KO_sample1/gene_abundances.tsv alignments/SRR10045016.bam
stringtie <span class="nt">-p</span> 4 <span class="nt">-G</span> references/chr11_Homo_sapiens.GRCh38.95.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/KO_sample2/transcripts.gtf <span class="nt">-A</span> expression/KO_sample2/gene_abundances.tsv alignments/SRR10045017.bam
stringtie <span class="nt">-p</span> 4 <span class="nt">-G</span> references/chr11_Homo_sapiens.GRCh38.95.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/KO_sample3/transcripts.gtf <span class="nt">-A</span> expression/KO_sample3/gene_abundances.tsv alignments/SRR10045018.bam

stringtie <span class="nt">-p</span> 4 <span class="nt">-G</span> references/chr11_Homo_sapiens.GRCh38.95.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/Rescue_sample1/transcripts.gtf <span class="nt">-A</span> expression/Rescue_sample1/gene_abundances.tsv alignments/SRR10045019.bam
stringtie <span class="nt">-p</span> 4 <span class="nt">-G</span> references/chr11_Homo_sapiens.GRCh38.95.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/Rescue_sample2/transcripts.gtf <span class="nt">-A</span> expression/Rescue_sample2/gene_abundances.tsv alignments/SRR10045020.bam
stringtie <span class="nt">-p</span> 4 <span class="nt">-G</span> references/chr11_Homo_sapiens.GRCh38.95.gtf <span class="nt">-e</span> <span class="nt">-B</span> <span class="nt">-o</span> expression/Rescue_sample3/transcripts.gtf <span class="nt">-A</span> expression/Rescue_sample3/gene_abundances.tsv alignments/SRR10045021.bam
</code></pre></div></div>
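
<p>The six commands above differ only in the sample name and the BAM accession, so they can also be written as a loop. The following is a minimal sketch rather than part of the original solution: the function name <code>run_stringtie_all</code> and the variable <code>STRINGTIE_BIN</code> are hypothetical names introduced here for illustration. The StringTie options and file paths are the same as in the commands above.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumption: STRINGTIE_BIN is a hypothetical variable introduced for this sketch.
# Setting it to "echo" previews the calls without running StringTie.
STRINGTIE_BIN="${STRINGTIE_BIN:-stringtie}"

# Hypothetical helper: run StringTie once per "sample_name:accession" pair
run_stringtie_all() {
  for pair in KO_sample1:SRR10045016 KO_sample2:SRR10045017 KO_sample3:SRR10045018 Rescue_sample1:SRR10045019 Rescue_sample2:SRR10045020 Rescue_sample3:SRR10045021; do
    name="${pair%%:*}"   # text before the colon: output directory name
    acc="${pair##*:}"    # text after the colon: accession used in the BAM file name
    "$STRINGTIE_BIN" -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o "expression/$name/transcripts.gtf" -A "expression/$name/gene_abundances.tsv" "alignments/$acc.bam"
  done
}

# Example, run from $RNA_HOME/team_exercise:
#   run_stringtie_all
</code></pre></div></div>

<p>Calling <code>STRINGTIE_BIN=echo run_stringtie_all</code> prints the six argument lists, which is a quick way to check the paths before starting the real run.</p>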

<p><strong>Q1.</strong> Based on your stringtie results, what are the top 5 genes with highest average expression levels across all knockout samples? What about in your rescue samples? (Hint: You can use R, command-line tools, or download files to your desktop for this analysis)</p>

<p><strong>A1.</strong> TO BE COMPLETED</p>
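
<p>One possible command-line approach to Q1 is sketched below. It is an illustration under stated assumptions, not the official answer: the helper name <code>top_genes</code> is hypothetical, and the sketch assumes the usual layout of the StringTie <code>-A</code> output (column 1 is the gene ID, column 2 the gene name, column 9 the TPM) and that each gene ID appears once per file. It averages TPM across the files it is given and prints the highest-ranked genes.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper (not part of the course material): print the N genes with
# the highest mean TPM across the given StringTie gene_abundances.tsv files.
# Assumed columns: 1 = Gene ID, 2 = Gene Name, 9 = TPM.
top_genes() {
  n="$1"; shift
  awk -F'\t' -v nfiles="$#" '
    FNR == 1 { next }                 # skip the header line of each file
    { sum[$1 "\t" $2] += $9 }         # accumulate TPM per gene ID and name
    END { for (g in sum) printf "%s\t%.2f\n", g, sum[g] / nfiles }
  ' "$@" | sort -t "$(printf '\t')" -k3,3nr | head -n "$n"
}

# Example, run from $RNA_HOME/team_exercise:
#   top_genes 5 expression/KO_sample*/gene_abundances.tsv
#   top_genes 5 expression/Rescue_sample*/gene_abundances.tsv
</code></pre></div></div>

<p>Each output line contains the gene ID, the gene name, and the mean TPM. To rank by FPKM instead, column 8 could be used in place of column 9, under the same layout assumption.</p>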

<h4 id="perform-differential-expression-analysis">Perform differential expression analysis</h4>
<p>Use Ballgown to identify differentially expressed genes between the KO and Rescue samples.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_HOME</span>/team_exercise
<span class="nb">mkdir </span>de
<span class="nb">cd </span>de
</code></pre></div></div>

<p>First, start an R session:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R
</code></pre></div></div>

<p>Run the following R commands in your R session.</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># load the required libraries</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ballgown</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">genefilter</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">devtools</span><span class="p">)</span><span class="w">

</span><span class="c1"># Create phenotype data needed for ballgown analysis.</span><span class="w">
</span><span class="n">ids</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"KO_sample1"</span><span class="p">,</span><span class="s2">"KO_sample2"</span><span class="p">,</span><span class="s2">"KO_sample3"</span><span class="p">,</span><span class="s2">"Rescue_sample1"</span><span class="p">,</span><span class="s2">"Rescue_sample2"</span><span class="p">,</span><span class="s2">"Rescue_sample3"</span><span class="p">)</span><span class="w">
</span><span class="n">type</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"KO"</span><span class="p">,</span><span class="s2">"KO"</span><span class="p">,</span><span class="s2">"KO"</span><span class="p">,</span><span class="s2">"Rescue"</span><span class="p">,</span><span class="s2">"Rescue"</span><span class="p">,</span><span class="s2">"Rescue"</span><span class="p">)</span><span class="w">
</span><span class="n">results</span><span class="o">=</span><span class="s2">"/home/ubuntu/workspace/rnaseq/team_exercise/expression/"</span><span class="w">
</span><span class="n">path</span><span class="o">=</span><span class="n">paste</span><span class="p">(</span><span class="n">results</span><span class="p">,</span><span class="n">ids</span><span class="p">,</span><span class="n">sep</span><span class="o">=</span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="n">pheno_data</span><span class="o">=</span><span class="n">data.frame</span><span class="p">(</span><span class="n">ids</span><span class="p">,</span><span class="n">type</span><span class="p">,</span><span class="n">path</span><span class="p">)</span><span class="w">

</span><span class="n">pheno_data</span><span class="w">

</span><span class="c1"># Load ballgown data structure and save it to a variable "bg"</span><span class="w">
</span><span class="n">bg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ballgown</span><span class="p">(</span><span class="n">samples</span><span class="o">=</span><span class="n">as.vector</span><span class="p">(</span><span class="n">pheno_data</span><span class="o">$</span><span class="n">path</span><span class="p">),</span><span class="w"> </span><span class="n">pData</span><span class="o">=</span><span class="n">pheno_data</span><span class="p">)</span><span class="w">

</span><span class="c1"># Display a description of this object</span><span class="w">
</span><span class="n">bg</span><span class="w">

</span><span class="c1"># Load all attributes including gene name</span><span class="w">
</span><span class="n">bg_table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">texpr</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="s1">'all'</span><span class="p">)</span><span class="w">
</span><span class="n">bg_gene_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bg_table</span><span class="p">[,</span><span class="w"> </span><span class="m">9</span><span class="o">:</span><span class="m">10</span><span class="p">])</span><span class="w">
</span><span class="n">bg_transcript_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bg_table</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">6</span><span class="p">)])</span><span class="w">

</span><span class="c1"># Save the ballgown object to a file for later use</span><span class="w">
</span><span class="n">save</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span class="o">=</span><span class="s1">'bg.rda'</span><span class="p">)</span><span class="w">

</span><span class="c1"># Perform differential expression (DE) analysis with no filtering</span><span class="w">
</span><span class="n">results_transcripts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stattest</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="n">feature</span><span class="o">=</span><span class="s2">"transcript"</span><span class="p">,</span><span class="w"> </span><span class="n">covariate</span><span class="o">=</span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="n">getFC</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">meas</span><span class="o">=</span><span class="s2">"FPKM"</span><span class="p">)</span><span class="w">
</span><span class="n">results_transcripts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">results_transcripts</span><span class="p">,</span><span class="w"> </span><span class="n">bg_transcript_names</span><span class="p">,</span><span class="w"> </span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="p">),</span><span class="w"> </span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"t_id"</span><span class="p">))</span><span class="w">

</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stattest</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="n">feature</span><span class="o">=</span><span class="s2">"gene"</span><span class="p">,</span><span class="w"> </span><span class="n">covariate</span><span class="o">=</span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="n">getFC</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">meas</span><span class="o">=</span><span class="s2">"FPKM"</span><span class="p">)</span><span class="w">
</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="w"> </span><span class="n">bg_gene_names</span><span class="p">,</span><span class="w"> </span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="p">),</span><span class="w"> </span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"gene_id"</span><span class="p">))</span><span class="w">

</span><span class="c1"># Save a tab delimited file for both the transcript and gene results</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">results_transcripts</span><span class="p">,</span><span class="w"> </span><span class="s2">"KO_vs_Rescue_transcript_results.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="w"> </span><span class="s2">"KO_vs_Rescue_gene_results.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Filter low-abundance genes. Here we remove all transcripts with a variance across the samples of less than one</span><span class="w">
</span><span class="n">bg_filt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subset</span><span class="w"> </span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="s2">"rowVars(texpr(bg)) &gt; 1"</span><span class="p">,</span><span class="w"> </span><span class="n">genomesubset</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Load all attributes including gene name</span><span class="w">
</span><span class="n">bg_filt_table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">texpr</span><span class="p">(</span><span class="n">bg_filt</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="s1">'all'</span><span class="p">)</span><span class="w">
</span><span class="n">bg_filt_gene_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bg_filt_table</span><span class="p">[,</span><span class="w"> </span><span class="m">9</span><span class="o">:</span><span class="m">10</span><span class="p">])</span><span class="w">
</span><span class="n">bg_filt_transcript_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bg_filt_table</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">6</span><span class="p">)])</span><span class="w">

</span><span class="c1"># Perform differential expression (DE) analysis on the filtered data, at both gene and transcript level</span><span class="w">
</span><span class="n">results_transcripts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stattest</span><span class="p">(</span><span class="n">bg_filt</span><span class="p">,</span><span class="w"> </span><span class="n">feature</span><span class="o">=</span><span class="s2">"transcript"</span><span class="p">,</span><span class="w"> </span><span class="n">covariate</span><span class="o">=</span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="n">getFC</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">meas</span><span class="o">=</span><span class="s2">"FPKM"</span><span class="p">)</span><span class="w">
</span><span class="n">results_transcripts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">results_transcripts</span><span class="p">,</span><span class="w"> </span><span class="n">bg_filt_transcript_names</span><span class="p">,</span><span class="w"> </span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="p">),</span><span class="w"> </span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"t_id"</span><span class="p">))</span><span class="w">

</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stattest</span><span class="p">(</span><span class="n">bg_filt</span><span class="p">,</span><span class="w"> </span><span class="n">feature</span><span class="o">=</span><span class="s2">"gene"</span><span class="p">,</span><span class="w"> </span><span class="n">covariate</span><span class="o">=</span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="n">getFC</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">meas</span><span class="o">=</span><span class="s2">"FPKM"</span><span class="p">)</span><span class="w">
</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="w"> </span><span class="n">bg_filt_gene_names</span><span class="p">,</span><span class="w"> </span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="p">),</span><span class="w"> </span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"gene_id"</span><span class="p">))</span><span class="w">

</span><span class="c1"># Output the filtered list of genes and transcripts and save to tab delimited files</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">results_transcripts</span><span class="p">,</span><span class="w"> </span><span class="s2">"KO_vs_Rescue_transcript_results_filtered.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="w"> </span><span class="s2">"KO_vs_Rescue_gene_results_filtered.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Identify the significant transcripts and genes with p-value &lt; 0.05</span><span class="w">
</span><span class="n">sig_transcripts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subset</span><span class="p">(</span><span class="n">results_transcripts</span><span class="p">,</span><span class="w"> </span><span class="n">results_transcripts</span><span class="o">$</span><span class="n">pval</span><span class="o">&lt;</span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">sig_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subset</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="w"> </span><span class="n">results_genes</span><span class="o">$</span><span class="n">pval</span><span class="o">&lt;</span><span class="m">0.05</span><span class="p">)</span><span class="w">

</span><span class="c1"># Sort the significant results by p-value, most significant first</span><span class="w">
</span><span class="n">sig_transcripts_ordered</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sig_transcripts</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">sig_transcripts</span><span class="o">$</span><span class="n">pval</span><span class="p">),]</span><span class="w">
</span><span class="n">sig_genes_ordered</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sig_genes</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">sig_genes</span><span class="o">$</span><span class="n">pval</span><span class="p">),]</span><span class="w">

</span><span class="c1"># Output the significant transcript and gene results to a pair of tab delimited files</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">sig_transcripts_ordered</span><span class="p">,</span><span class="w"> </span><span class="s2">"KO_vs_Rescue_transcript_results_sig.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">sig_genes_ordered</span><span class="p">,</span><span class="w"> </span><span class="s2">"KO_vs_Rescue_gene_results_sig.tsv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Exit the R session</span><span class="w">
</span><span class="n">quit</span><span class="p">(</span><span class="n">save</span><span class="o">=</span><span class="s2">"no"</span><span class="p">)</span><span class="w">

</span></code></pre></div></div>

<p><strong>Q2.</strong> How many significant differentially expressed genes do you observe?</p>

<p><strong>A2.</strong> TO BE COMPLETED</p>
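
<p>One way to obtain the number for Q2 is to count the data rows in the significant-gene file written by the R commands above. The sketch below assumes the file has a single header line, as produced by <code>write.table</code>, and the helper name <code>count_rows</code> is hypothetical. The actual count depends on each team's dataset, so no expected value is given here.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: count the rows of a tab delimited file, excluding its header line
count_rows() {
  tail -n +2 "$1" | wc -l | tr -d ' '
}

# Example, run from $RNA_HOME/team_exercise/de:
#   count_rows KO_vs_Rescue_gene_results_sig.tsv
#   count_rows KO_vs_Rescue_transcript_results_sig.tsv
</code></pre></div></div>

<p>Because the file was filtered on the unadjusted p-value (below 0.05), this count is not corrected for multiple testing. The <code>qval</code> column in the same file can be inspected if a corrected threshold is preferred.</p>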

<p><strong>Q3.</strong> By referring back to the supplementary tutorial in the DE Visualization Module, can you construct a volcano plot showing the significantly differentially expressed genes?</p>

<p><strong>A3.</strong> See below.</p>

<h4 id="perform-differential-expression-analysis-visualization">Perform differential expression analysis visualization</h4>

<p>Make sure we are in the directory with our DE results</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$RNA_HOME</span>/team_exercise/de
</code></pre></div></div>

<p>Restart an R session:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R
</code></pre></div></div>

<p>The following R commands create summary visualizations of the DE results from Ballgown</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">
</span><span class="c1">#Load libraries</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">gplots</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">GenomicRanges</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ballgown</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggrepel</span><span class="p">)</span><span class="w">

</span><span class="c1">#Import expression and differential expression results from the HISAT2/StringTie/Ballgown pipeline</span><span class="w">
</span><span class="n">load</span><span class="p">(</span><span class="s1">'bg.rda'</span><span class="p">)</span><span class="w">

</span><span class="c1"># View a summary of the ballgown object</span><span class="w">
</span><span class="n">bg</span><span class="w">

</span><span class="c1"># Load gene names for lookup later in the tutorial</span><span class="w">
</span><span class="n">bg_table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">texpr</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="s1">'all'</span><span class="p">)</span><span class="w">
</span><span class="n">bg_gene_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bg_table</span><span class="p">[,</span><span class="w"> </span><span class="m">9</span><span class="o">:</span><span class="m">10</span><span class="p">])</span><span class="w">

</span><span class="c1"># Pull the gene_expression data frame from the ballgown object</span><span class="w">
</span><span class="n">gene_expression</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">gexpr</span><span class="p">(</span><span class="n">bg</span><span class="p">))</span><span class="w">

</span><span class="c1"># Set a pseudocount of 1, to be added to FPKM values before log2 transformation (avoids log2 of zero)</span><span class="w">
</span><span class="n">min_nonzero</span><span class="o">=</span><span class="m">1</span><span class="w">

</span><span class="c1"># Set the columns for finding FPKM and create shorter names for figures</span><span class="w">
</span><span class="n">data_columns</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">)</span><span class="w">
</span><span class="n">short_names</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"KO1"</span><span class="p">,</span><span class="s2">"KO2"</span><span class="p">,</span><span class="s2">"KO3"</span><span class="p">,</span><span class="s2">"R1"</span><span class="p">,</span><span class="s2">"R2"</span><span class="p">,</span><span class="s2">"R3"</span><span class="p">)</span><span class="w">

</span><span class="c1">#Calculate the FPKM sum for all 6 libraries</span><span class="w">
</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"sum"</span><span class="p">]</span><span class="o">=</span><span class="n">apply</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="n">data_columns</span><span class="p">],</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w">

</span><span class="c1">#Identify genes where the sum of FPKM across all samples is above some arbitrary threshold</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"sum"</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">

</span><span class="c1">#Calculate the correlation between all pairs of data</span><span class="w">
</span><span class="n">r</span><span class="o">=</span><span class="n">cor</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">data_columns</span><span class="p">],</span><span class="w"> </span><span class="n">use</span><span class="o">=</span><span class="s2">"pairwise.complete.obs"</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="o">=</span><span class="s2">"pearson"</span><span class="p">)</span><span class="w">

</span><span class="c1">#Print out these correlation values</span><span class="w">
</span><span class="n">r</span><span class="w">

</span><span class="c1"># Open a PDF file where we will save some plots. </span><span class="w">
</span><span class="c1"># We will save all figures and then view the PDF at the end</span><span class="w">
</span><span class="n">pdf</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="s2">"KO_vs_rescue_figures.pdf"</span><span class="p">)</span><span class="w">

</span><span class="n">data_colors</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"tomato1"</span><span class="p">,</span><span class="s2">"tomato2"</span><span class="p">,</span><span class="s2">"tomato3"</span><span class="p">,</span><span class="s2">"royalblue1"</span><span class="p">,</span><span class="s2">"royalblue2"</span><span class="p">,</span><span class="s2">"royalblue3"</span><span class="p">)</span><span class="w">

</span><span class="c1">#Plot - Convert correlation to 'distance', and use 'multi-dimensional scaling' to display the relative differences between libraries</span><span class="w">
</span><span class="c1">#This step calculates 2-dimensional coordinates to plot points for each library</span><span class="w">
</span><span class="c1">#Libraries with similar expression patterns (highly correlated to each other) should group together</span><span class="w">

</span><span class="c1">#Note that the x and y display limits (xlim, ylim) will have to be adjusted for each dataset, depending on the amount of variability</span><span class="w">
</span><span class="n">d</span><span class="o">=</span><span class="m">1</span><span class="o">-</span><span class="n">r</span><span class="w">
</span><span class="n">mds</span><span class="o">=</span><span class="n">cmdscale</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">eig</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">mds</span><span class="o">$</span><span class="n">points</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"n"</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="o">=</span><span class="s2">"MDS distance plot (all non-zero genes)"</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">-0.01</span><span class="p">,</span><span class="m">0.01</span><span class="p">),</span><span class="w"> </span><span class="n">ylim</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">-0.01</span><span class="p">,</span><span class="m">0.01</span><span class="p">))</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">mds</span><span class="o">$</span><span class="n">points</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">mds</span><span class="o">$</span><span class="n">points</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"grey"</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="o">=</span><span class="m">16</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="n">mds</span><span class="o">$</span><span class="n">points</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">mds</span><span class="o">$</span><span class="n">points</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">short_names</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="n">data_colors</span><span class="p">)</span><span class="w">

</span><span class="c1"># Calculate the differential expression results including significance</span><span class="w">
</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stattest</span><span class="p">(</span><span class="n">bg</span><span class="p">,</span><span class="w"> </span><span class="n">feature</span><span class="o">=</span><span class="s2">"gene"</span><span class="p">,</span><span class="w"> </span><span class="n">covariate</span><span class="o">=</span><span class="s2">"type"</span><span class="p">,</span><span class="w"> </span><span class="n">getFC</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">meas</span><span class="o">=</span><span class="s2">"FPKM"</span><span class="p">)</span><span class="w">
</span><span class="n">results_genes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">results_genes</span><span class="p">,</span><span class="n">bg_gene_names</span><span class="p">,</span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"id"</span><span class="p">),</span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"gene_id"</span><span class="p">))</span><span class="w">

</span><span class="c1"># Plot - Display the mean expression values (FPKM) of the KO and Rescue conditions and mark the genes that are significantly differentially expressed</span><span class="w">

</span><span class="n">sig</span><span class="o">=</span><span class="n">which</span><span class="p">(</span><span class="n">results_genes</span><span class="o">$</span><span class="n">pval</span><span class="o">&lt;</span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">results_genes</span><span class="p">[,</span><span class="s2">"de"</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">log2</span><span class="p">(</span><span class="n">results_genes</span><span class="p">[,</span><span class="s2">"fc"</span><span class="p">])</span><span class="w">
</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"KO"</span><span class="p">]</span><span class="o">=</span><span class="n">apply</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">)],</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w">
</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"Rescue"</span><span class="p">]</span><span class="o">=</span><span class="n">apply</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="o">:</span><span class="m">6</span><span class="p">)],</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w">

</span><span class="n">x</span><span class="o">=</span><span class="n">log2</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"KO"</span><span class="p">]</span><span class="o">+</span><span class="n">min_nonzero</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="o">=</span><span class="n">log2</span><span class="p">(</span><span class="n">gene_expression</span><span class="p">[,</span><span class="s2">"Rescue"</span><span class="p">]</span><span class="o">+</span><span class="n">min_nonzero</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="o">=</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">0.25</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="o">=</span><span class="s2">"KO FPKM (log2)"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="o">=</span><span class="s2">"Rescue FPKM (log2)"</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="o">=</span><span class="s2">"Rescue vs KO FPKMs"</span><span class="p">)</span><span class="w">
</span><span class="n">abline</span><span class="p">(</span><span class="n">a</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">xsig</span><span class="o">=</span><span class="n">x</span><span class="p">[</span><span class="n">sig</span><span class="p">]</span><span class="w">
</span><span class="n">ysig</span><span class="o">=</span><span class="n">y</span><span class="p">[</span><span class="n">sig</span><span class="p">]</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">xsig</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">ysig</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"magenta"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="o">=</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">0.5</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s2">"topleft"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Significant"</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"magenta"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="o">=</span><span class="m">16</span><span class="p">)</span><span class="w">

</span><span class="c1">#Get the gene symbols for the top N (according to corrected p-value) and display them on the plot</span><span class="w">
</span><span class="c1">#order() returns positions within the 'sig' subset, so map them back to row indices through sig</span><span class="w">
</span><span class="n">topn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sig</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">results_genes</span><span class="p">[</span><span class="n">sig</span><span class="p">,</span><span class="s2">"qval"</span><span class="p">])[</span><span class="m">1</span><span class="o">:</span><span class="m">25</span><span class="p">]]</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">topn</span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="p">[</span><span class="n">topn</span><span class="p">],</span><span class="w"> </span><span class="n">results_genes</span><span class="p">[</span><span class="n">topn</span><span class="p">,</span><span class="s2">"gene_name"</span><span class="p">],</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">0.75</span><span class="p">,</span><span class="w"> </span><span class="n">srt</span><span class="o">=</span><span class="m">45</span><span class="p">)</span><span class="w">

</span><span class="c1">#Plot - Volcano plot</span><span class="w">

</span><span class="c1"># set default for all genes to "no change"</span><span class="w">
</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"No"</span><span class="w">

</span><span class="c1"># if log2Foldchange &gt; 0.6 and pvalue &lt; 0.05, set as "Up"</span><span class="w">
</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">de</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0.6</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">results_genes</span><span class="o">$</span><span class="n">pval</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"Up"</span><span class="w">

</span><span class="c1"># if log2Foldchange &lt; -0.6 and pvalue &lt; 0.05, set as "Down"</span><span class="w">
</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">de</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">-0.6</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">results_genes</span><span class="o">$</span><span class="n">pval</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"Down"</span><span class="w">

</span><span class="n">results_genes</span><span class="o">$</span><span class="n">gene_label</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NA</span><span class="w">

</span><span class="c1"># write the gene names of those significantly upregulated/downregulated to a new column</span><span class="w">
</span><span class="n">results_genes</span><span class="o">$</span><span class="n">gene_label</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"No"</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">results_genes</span><span class="o">$</span><span class="n">gene_name</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"No"</span><span class="p">]</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">results_genes</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"No"</span><span class="p">,],</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">de</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=-</span><span class="n">log10</span><span class="p">(</span><span class="n">pval</span><span class="p">),</span><span class="w"> </span><span class="n">label</span><span class="o">=</span><span class="n">gene_label</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">diffexpressed</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"log2Foldchange"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Differentially expressed"</span><span class="p">,</span><span class="w"> </span><span class="n">values</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"red"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">theme_minimal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">geom_text_repel</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">geom_vline</span><span class="p">(</span><span class="n">xintercept</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">-0.6</span><span class="p">,</span><span class="w"> </span><span class="m">0.6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">geom_hline</span><span class="p">(</span><span class="n">yintercept</span><span class="o">=-</span><span class="n">log10</span><span class="p">(</span><span class="m">0.05</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">guides</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">guide_legend</span><span class="p">(</span><span class="n">override.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">5</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
             </span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">results_genes</span><span class="p">[</span><span class="n">results_genes</span><span class="o">$</span><span class="n">diffexpressed</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"No"</span><span class="p">,],</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">de</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=-</span><span class="n">log10</span><span class="p">(</span><span class="n">pval</span><span class="p">)),</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">)</span><span class="w">


</span><span class="n">dev.off</span><span class="p">()</span><span class="w">

</span><span class="c1"># Exit the R session</span><span class="w">
</span><span class="n">quit</span><span class="p">(</span><span class="n">save</span><span class="o">=</span><span class="s2">"no"</span><span class="p">)</span><span class="w">

</span></code></pre></div></div>]]></content><author><name>Zachary Skidmore</name></author><category term="Module-09-Appendix" /><summary type="html"><![CDATA[The solutions below are for team A. Other team solutions will be very similar but each for their own unique chromosome dataset. Estimate expression levels Use stringtie to estimate gene/transcript abundance levels cd $RNA_HOME/team_exercise mkdir expression stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/KO_sample1/transcripts.gtf -A expression/KO_sample1/gene_abundances.tsv alignments/SRR10045016.bam stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/KO_sample2/transcripts.gtf -A expression/KO_sample2/gene_abundances.tsv alignments/SRR10045017.bam stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/KO_sample3/transcripts.gtf -A expression/KO_sample3/gene_abundances.tsv alignments/SRR10045018.bam stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/Rescue_sample1/transcripts.gtf -A expression/Rescue_sample1/gene_abundances.tsv alignments/SRR10045019.bam stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/Rescue_sample2/transcripts.gtf -A expression/Rescue_sample2/gene_abundances.tsv alignments/SRR10045020.bam stringtie -p 4 -G references/chr11_Homo_sapiens.GRCh38.95.gtf -e -B -o expression/Rescue_sample3/transcripts.gtf -A expression/Rescue_sample3/gene_abundances.tsv alignments/SRR10045021.bam Q1. Based on your stringtie results, what are the top 5 genes with highest average expression levels across all knockout samples? What about in your rescue samples? (Hint: You can use R, command-line tools, or download files to your desktop for this analysis) A1. 
TO BE COMPLETED Perform differential expression analysis Use ballgown to identify differentially expressed genes between KO and Rescue samples cd $RNA_HOME/team_exercise mkdir de cd de First, start an R session: R Run the following R commands in your R session. # load the required libraries library(ballgown) library(genefilter) library(dplyr) library(devtools) # Create phenotype data needed for ballgown analysis. ids=c("KO_sample1","KO_sample2","KO_sample3","Rescue_sample1","Rescue_sample2","Rescue_sample3") type=c("KO","KO","KO","Rescue","Rescue","Rescue") results="/home/ubuntu/workspace/rnaseq/team_exercise/expression/" path=paste(results,ids,sep="") pheno_data=data.frame(ids,type,path) pheno_data # Load ballgown data structure and save it to a variable "bg" bg = ballgown(samples=as.vector(pheno_data$path), pData=pheno_data) # Display a description of this object bg # Load all attributes including gene name bg_table = texpr(bg, 'all') bg_gene_names = unique(bg_table[, 9:10]) bg_transcript_names = unique(bg_table[,c(1,6)]) # Save the ballgown object to a file for later use save(bg, file='bg.rda') # Perform differential expression (DE) analysis with no filtering results_transcripts = stattest(bg, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM") results_transcripts = merge(results_transcripts, bg_transcript_names, by.x=c("id"), by.y=c("t_id")) results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM") results_genes = merge(results_genes, bg_gene_names, by.x=c("id"), by.y=c("gene_id")) # Save a tab delimited file for both the transcript and gene results write.table(results_transcripts, "KO_vs_Rescue_transcript_results.tsv", sep="\t", quote=FALSE, row.names = FALSE) write.table(results_genes, "KO_vs_Rescue_gene_results.tsv", sep="\t", quote=FALSE, row.names = FALSE) # Filter low-abundance genes. 
Here we remove all transcripts with a variance across the samples of less than one bg_filt = subset (bg,"rowVars(texpr(bg)) &gt; 1", genomesubset=TRUE) # Load all attributes including gene name bg_filt_table = texpr(bg_filt , 'all') bg_filt_gene_names = unique(bg_filt_table[, 9:10]) bg_filt_transcript_names = unique(bg_filt_table[,c(1,6)]) # Perform differential expression (DE) analysis with no filtering, at both gene and transcript level results_transcripts = stattest(bg_filt, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM") results_transcripts = merge(results_transcripts, bg_filt_transcript_names, by.x=c("id"), by.y=c("t_id")) results_genes = stattest(bg_filt, feature="gene", covariate="type", getFC=TRUE, meas="FPKM") results_genes = merge(results_genes, bg_filt_gene_names, by.x=c("id"), by.y=c("gene_id")) # Output the filtered list of genes and transcripts and save to tab delimited files write.table(results_transcripts, "KO_vs_Rescue_transcript_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE) write.table(results_genes, "KO_vs_Rescue_gene_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE) # Identify the significant genes with p-value &lt; 0.05 sig_transcripts = subset(results_transcripts, results_transcripts$pval&lt;0.05) sig_genes = subset(results_genes, results_genes$pval&lt;0.05) sig_transcripts_ordered = sig_transcripts[order(sig_transcripts$pval),] sig_genes_ordered = sig_genes[order(sig_genes$pval),] # Output the significant gene results to a pair of tab delimited files write.table(sig_transcripts_ordered, "KO_vs_Rescue_transcript_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE) write.table(sig_genes_ordered, "KO_vs_Rescue_gene_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE) # Exit the R session quit(save="no") Q2. How many significant differentially expressed genes do you observe? A2. TO BE COMPLETED Q3. 
By referring back to the supplementary tutorial in the DE Visualization Module, can you construct a volcano plot showcasing the significantly de genes? A3. See below. Perform differential expression analysis visualization Make sure we are in the directory with our DE results cd $RNA_HOME/team_exercise/de Restart an R session: R The following R commands create summary visualizations of the DE results from Ballgown #Load libraries library(ggplot2) library(gplots) library(GenomicRanges) library(ballgown) library(ggrepel) #Import expression and differential expression results from the HISAT2/StringTie/Ballgown pipeline load('bg.rda') # View a summary of the ballgown object bg # Load gene names for lookup later in the tutorial bg_table = texpr(bg, 'all') bg_gene_names = unique(bg_table[, 9:10]) # Pull the gene_expression data frame from the ballgown object gene_expression = as.data.frame(gexpr(bg)) #Set min value to 1 min_nonzero=1 # Set the columns for finding FPKM and create shorter names for figures data_columns=c(1:6) short_names=c("KO1","KO2","KO3","R1","R2","R3") #Calculate the FPKM sum for all 6 libraries gene_expression[,"sum"]=apply(gene_expression[,data_columns], 1, sum) #Identify genes where the sum of FPKM across all samples is above some arbitrary threshold i = which(gene_expression[,"sum"] &gt; 5) #Calculate the correlation between all pairs of data r=cor(gene_expression[i,data_columns], use="pairwise.complete.obs", method="pearson") #Print out these correlation values r # Open a PDF file where we will save some plots. 
# We will save all figures and then view the PDF at the end pdf(file="KO_vs_rescue_figures.pdf") data_colors=c("tomato1","tomato2","tomato3","royalblue1","royalblue2","royalblue3") #Plot - Convert correlation to 'distance', and use 'multi-dimensional scaling' to display the relative differences between libraries #This step calculates 2-dimensional coordinates to plot points for each library #Libraries with similar expression patterns (highly correlated to each other) should group together #note that the x and y display limits will have to be adjusted for each dataset depending on the amount of variability d=1-r mds=cmdscale(d, k=2, eig=TRUE) par(mfrow=c(1,1)) plot(mds$points, type="n", xlab="", ylab="", main="MDS distance plot (all non-zero genes)", xlim=c(-0.01,0.01), ylim=c(-0.01,0.01)) points(mds$points[,1], mds$points[,2], col="grey", cex=2, pch=16) text(mds$points[,1], mds$points[,2], short_names, col=data_colors) # Calculate the differential expression results including significance results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM") results_genes = merge(results_genes,bg_gene_names,by.x=c("id"),by.y=c("gene_id")) # Plot - Display the grand expression values from KO and Rescue conditions and mark those that are significantly differentially expressed sig=which(results_genes$pval&lt;0.05) results_genes[,"de"] = log2(results_genes[,"fc"]) gene_expression[,"KO"]=apply(gene_expression[,c(1:3)], 1, mean) gene_expression[,"Rescue"]=apply(gene_expression[,c(4:6)], 1, mean) x=log2(gene_expression[,"KO"]+min_nonzero) y=log2(gene_expression[,"Rescue"]+min_nonzero) plot(x=x, y=y, pch=16, cex=0.25, xlab="KO FPKM (log2)", ylab="Rescue FPKM (log2)", main="Rescue vs KO FPKMs") abline(a=0, b=1) xsig=x[sig] ysig=y[sig] points(x=xsig, y=ysig, col="magenta", pch=16, cex=0.5) legend("topleft", "Significant", col="magenta", pch=16) #Get the gene symbols for the top N (according to corrected p-value) and display them on the plot topn = 
order(abs(results_genes[sig,"fc"]), decreasing=TRUE)[1:25] topn = order(results_genes[sig,"qval"])[1:25] text(x[topn], y[topn], results_genes[topn,"gene_name"], col="black", cex=0.75, srt=45) #Plot - Volcano plot # set default for all genes to "no change" results_genes$diffexpressed &lt;- "No" # if log2Foldchange &gt; 2 and pvalue &lt; 0.05, set as "Up regulated" results_genes$diffexpressed[results_genes$de &gt; 0.6 &amp; results_genes$pval &lt; 0.05] &lt;- "Up" # if log2Foldchange &lt; -2 and pvalue &lt; 0.05, set as "Down regulated" results_genes$diffexpressed[results_genes$de &lt; -0.6 &amp; results_genes$pval &lt; 0.05] &lt;- "Down" results_genes$gene_label &lt;- NA # write the gene names of those significantly upregulated/downregulated to a new column results_genes$gene_label[results_genes$diffexpressed != "No"] &lt;- results_genes$gene_name[results_genes$diffexpressed != "No"] ggplot(data=results_genes[results_genes$diffexpressed != "No",], aes(x=de, y=-log10(pval), label=gene_label, color = diffexpressed)) + xlab("log2Foldchange") + scale_color_manual(name = "Differentially expressed", values=c("blue", "red")) + geom_point() + theme_minimal() + geom_text_repel() + geom_vline(xintercept=c(-0.6, 0.6), col="red") + geom_hline(yintercept=-log10(0.05), col="red") + guides(colour = guide_legend(override.aes = list(size=5))) + geom_point(data = results_genes[results_genes$diffexpressed == "No",], aes(x=de, y=-log10(pval)), colour = "black") dev.off() # Exit the R session quit(save="no")]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://www.rnabio.org//assets/logos/DNA.jpg" /><media:content medium="image" url="http://www.rnabio.org//assets/logos/DNA.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>