Bioinformatics Best Practices

Introduction

This best practices guide provides a basic overview of useful practices and tools for managing bioinformatics environments and analysis development.

Managing Your Analysis with Notebooks

Similar to the use of a laboratory notebook, taking notes about the procedures and analysis you performed is critical to reproducible science. There are a number of scientific computing notebooks available, but the most popular by far is the Jupyter Notebook.

Jupyter supports interactive data science and scientific computing across a large number of languages, although its most popular use is with Python, as the Jupyter Notebook is built upon the Python-based IPython Notebook.

Example notebooks

A live version of Jupyter is available to try online, and provides several example notebooks in a few different languages. You can also check out a real analysis of Guide to Pharmacology gene family data for incorporation into the Drug-Gene Interaction Database.

Versioning Code with Git and GitHub

Git is a distributed version control system that allows users to make changes to code while simultaneously documenting those changes and preserving a history, allowing code to be rolled back to a previous version quickly and safely. GitHub is a freemium, online repository hosting service. You may use GitHub to track projects, discuss issues, document applications, and review code. GitHub is one of the best ways to share your projects, and should be used from the very onset of a project. Some forethought should be given in creating and managing a repository, however, as GitHub is not a good place to share very large or sensitive data files. See the 10-minute introduction to using GitHub.
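
As a minimal sketch of the workflow described above, the commands below create a repository, record one change, and show the history. The directory name my_analysis, the file contents, the identity, and the commit message are placeholders chosen for illustration; they are not part of the original guide.

```shell
# Create a new repository in a hypothetical project directory
git init --quiet my_analysis

# Add a file to the project
echo "# My analysis" > my_analysis/README.md

# Stage the file and record it in the project history.
# The -c options set an identity for this one command only; normally you
# would set user.name and user.email once with "git config".
git -C my_analysis add README.md
git -C my_analysis -c user.name="Example User" -c user.email="user@example.com" \
    commit --quiet -m "Add README"

# Show the recorded history, one line per commit
git -C my_analysis log --oneline
```

Each later change follows the same add and commit cycle, which is what builds the history that can be rolled back.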

Managing Your Compute Environment

One of the most challenging aspects of bioinformatics workflows is reproducibility. In addition to documenting your analysis with a notebook, providing a copy of your compute environment limits variability in results and allows them to be reproduced in the future. Many options exist for this; some of the most common are presented below.

Amazon Elastic Compute Cloud (AWS EC2) is a useful service for creating entire virtual machines that can easily be copied and distributed. This option does require a paid account with Amazon, and the costs of storing the images and running instances may add up over time, especially if every analysis is stored in a separate image. Additionally, this option does not isolate the analysis environment from the system environment, potentially leading to changes in analysis output as system libraries are updated over time. The RNA-seq wiki makes heavy use of AWS as a distribution platform.

VirtualBox is a general-purpose full virtualizer that allows you to emulate a computer, complete with virtual disks, a virtual operating system, and any data and applications stored therein. It has the advantage of creating machines that are stored and run on local hardware (e.g. your personal workstation), but the extra overhead of running a virtual computer on top of a host operating system can considerably slow the performance of tools on the virtual machine, so it is best used for testing or demonstration purposes.

Docker packages apps and their dependencies into containers, which may be docked to a Docker engine running on a computer. Docker engines are available on all major operating systems, and allow software to remain infrastructure-independent while sharing a filespace and system resources with other docked containers. This is a much more efficient approach than guest virtual machines, and containers may be docked locally or on cloud-based infrastructure.
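
As a rough illustration of what packaging a tool with its dependencies looks like, the sketch below writes a small Dockerfile. The base image, the choice of BWA as the packaged tool, and the file contents are assumptions made for this example only.

```shell
# Write a minimal Dockerfile (contents are illustrative assumptions)
cat > Dockerfile <<'EOF'
# Start from a fixed base image so the system libraries do not drift
FROM ubuntu:20.04
# Install the tool and its dependencies inside the image
RUN apt-get update \
    && apt-get install -y --no-install-recommends bwa \
    && rm -rf /var/lib/apt/lists/*
# Default command when a container is started from this image
CMD ["bwa"]
EOF

# Show what was written
cat Dockerfile
```

With Docker installed, "docker build -t bwa-example ." would build an image from this file and "docker run --rm bwa-example" would start a container from it; the tag bwa-example is a hypothetical name.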

Conda is a language-agnostic package, dependency, and environment management system. It is included in Anaconda, a data-science-focused distribution built around Python and R packages for the analysis of large-scale scientific data. Bioinformaticians also commonly use Bioconda, which adds a channel to Conda with bioinformatics tools (such as the popular sequence alignment tool BWA).
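
One way to share a Conda environment is an environment file. The sketch below writes a minimal file that requests BWA from the Bioconda channel. The environment name alignment and the file name environment.yml are illustrative assumptions, and no version is pinned here, although pinning versions is what makes an environment reproducible over time.

```shell
# Write a minimal Conda environment file (names are illustrative)
cat > environment.yml <<'EOF'
name: alignment
channels:
  - conda-forge
  - bioconda
dependencies:
  - bwa
EOF

# Show what was written
cat environment.yml
```

With Conda installed, "conda env create --file environment.yml" would build the environment and "conda activate alignment" would switch to it.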

POSIT Setup

Posit setup for use in CRI 2024 workshop

This tutorial explains how Posit Cloud RStudio was configured for the course. This exercise is not to be completed by the students but is provided as a reference for future course developers who wish to conduct their hands-on exercises on Posit RStudio.

A Posit workspace had already been created by the workshop organizers. For the workshop we used Posit projects with 16 GB RAM and 2 cores, running Ubuntu 20.04. With this configuration, we created a template that has all the raw data files uploaded along with the R packages needed for the workshop. Students are intended to make copies of this template, so that each has an RStudio environment with the raw data files in place and the packages pre-installed.

Upload raw data

Folders for the raw data were created using the RStudio terminal. Files were either uploaded from a local laptop or a storage1 location using the Upload feature in the bottom-right pane of the RStudio window, or downloaded from genomedata.org using wget from the RStudio terminal.

mkdir data
mkdir outdir
mkdir outdir_single_cell_rna
mkdir package_installation

cd data
mkdir single_cell_rna
mkdir bulk_rna
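
The same layout can also be created in one step. This is a sketch that assumes it is run from the project root; mkdir -p creates parent folders as needed and does not fail if a folder already exists, so it is safe to re-run.

```shell
# Create the top-level folders and the data subfolders in one command.
# -p creates missing parents and is a no-op for folders that already exist.
mkdir -p data/single_cell_rna data/bulk_rna \
    outdir outdir_single_cell_rna package_installation

# List the resulting folders to confirm the layout
find data outdir outdir_single_cell_rna package_installation -type d | sort
```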

Files in single_cell_rna

CellRanger outputs for reps1,3,5 (uploaded from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/counts_gex/sample_filtered_feature_bc_matrix.h5.zip)
BCR and TCR clonotypes (uploaded from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/clonotypes_b_posit.zip and /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/CellRanger_v7_run/runs/cri_workshop_scrna_files/clonotypes_t_posit.zip)
MSigDB M8: cell type signature gene sets (downloaded GMT file from MSigDB website to laptop and then uploaded to single_cell_rna folder)
CONICSmat mm10 chr arms positions file (downloaded file from CONICSmat GitHub - chromosome_full_positions_mm10.txt to laptop and then uploaded to single_cell_rna folder)
VarTrix file with barcodes and tumor calls (uploaded from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/Tumor_Calls_per_Variants_for_CRI_Updated_Barcodes.tsv) -> might not need this so may remove.
VarTrix output files (uploaded all matrices and the barcodes files from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/scRNA/vartrix_outputs_for_CRI.zip - uploaded to a cancer_cell_id folder in data/single_cell_rna/)
Mouse variants VCF file (uploaded file from /storage1/fs1/mgriffit/Active/scrna_mcb6c/Mouse_Bladder_MCB6C_Arora/exome/output_updated/final_basic_filtered_annotated.vcf)

Posit requires all files to be zipped prior to uploading and automatically unzips the folder after the upload. After uploading the files, we made a folder for the CellRanger outputs and moved the .h5 files there. The inferCNV reference files were downloaded using wget.

#organize cellranger outputs
cd /cloud/project/data/single_cell_rna
mkdir cellranger_outputs
mv *.h5 cellranger_outputs

#download inferCNV reference files and organize all reference files
mkdir reference_files
mv m8.all.v2023.2.Mm.symbols.gmt reference_files
mv Tumor_Calls_per_Variants_for_CRI.tsv reference_files
cd reference_files
wget https://data.broadinstitute.org/Trinity/CTAT/cnv/mouse_gencode.GRCm38.p6.vM25.basic.annotation.by_gene_id.infercnv_positions
wget https://data.broadinstitute.org/Trinity/CTAT/cnv/mouse_gencode.GRCm38.p6.vM25.basic.annotation.by_gene_name.infercnv_positions

#organize vartrix files
cd /cloud/project/data/single_cell_rna
mkdir -p cancer_cell_id  # -p avoids an error if the folder already exists from the VarTrix upload
cd cancer_cell_id
wget http://genomedata.org/cri-workshop/somatic_variants_exome/mcb6c-exome-somatic.variants.annotated.clean.tsv

Files in bulk_rna

Batch correction file (downloaded from genomedata - GSE48035_ILMN.Counts.SampleSubset.ProteinCodingGenes.tsv)
DE analysis files (downloaded from genomedata - ENSG_ID2Name.txt and gene_read_counts_table_all_final.tsv)

cd /cloud/project/data/bulk_rna
wget http://genomedata.org/rnaseq-tutorial/batch_correction/GSE48035_ILMN.Counts.SampleSubset.ProteinCodingGenes.tsv
wget http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/ENSG_ID2Name.txt
wget http://genomedata.org/rnaseq-tutorial/results/cshl2022/rnaseq/gene_read_counts_table_all_final.tsv
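
A failed download can leave a missing or empty file behind, so it can be worth checking the expected files before the template is shared. The sketch below is an optional check that was not part of the original setup; the function name check_files is hypothetical, and the file names are those of the bulk_rna downloads above.

```shell
# check_files DIR FILE...: report each expected file as present or missing.
# A file counts as present only if it exists and is non-empty.
# Returns 0 if every file is present, 1 otherwise.
check_files() {
    dir=$1
    shift
    rc=0
    for f in "$@"; do
        if [ -s "$dir/$f" ]; then
            echo "ok       $f"
        else
            echo "MISSING  $f"
            rc=1
        fi
    done
    return "$rc"
}

# Example: check the three bulk RNA files, relative to the project root
check_files data/bulk_rna \
    GSE48035_ILMN.Counts.SampleSubset.ProteinCodingGenes.tsv \
    ENSG_ID2Name.txt \
    gene_read_counts_table_all_final.tsv \
    || echo "Some files are missing or empty and should be downloaded again."
```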

Back-up files

We created a folder called backup_files in outdir/single_cell_rna. We then ran through the QA/QC assessment and cell typing modules and added the resulting preprocessed_object.rds Seurat object to backup_files.

Installing packages

All packages were installed from CRAN, Bioconductor, or GitHub, except for CytoTRACE, which was downloaded to the package_installation folder and then installed using devtools.

#Download CytoTRACE tar.gz file
download.file("https://cytotrace.stanford.edu/CytoTRACE_0.3.3.tar.gz", destfile = "package_installation/CytoTRACE_0.3.3.tar.gz")

# Installing package installers
install.packages("devtools")
install.packages("BiocManager")

# Bulk RNA seq libraries
BiocManager::install("genefilter")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("data.table")
BiocManager::install("AnnotationDbi")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("GO.db")
BiocManager::install("gage")
BiocManager::install("sva")
install.packages("gridExtra")
BiocManager::install("edgeR")
install.packages("UpSetR")
BiocManager::install("DESeq2")
install.packages("gtable")
BiocManager::install("apeglm")

# Intro to R packages
install.packages("tidyr")
install.packages("stringr")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyverse")
install.packages("MASS")
install.packages("ggpubr")

# Single-cell RNA seq libraries
BiocManager::install("sva") #need this for cytotrace
devtools::install_local("package_installation/CytoTRACE_0.3.3.tar.gz")
install.packages("Seurat")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("Matrix")
install.packages("hdf5r")
install.packages("bench") # to mark time
install.packages("viridis")
install.packages("R.utils")
remotes::install_github("satijalab/seurat-wrappers")
BiocManager::install("celldex")
BiocManager::install("SingleR")
devtools::install_github("immunogenomics/presto")
BiocManager::install("EnhancedVolcano")
BiocManager::install("clusterProfiler")
BiocManager::install("org.Mm.eg.db")
install.packages("msigdbr")
BiocManager::install("scRepertoire")
BiocManager::install("BiocGenerics")
BiocManager::install("DelayedArray")
BiocManager::install("DelayedMatrixStats")
BiocManager::install("limma")
BiocManager::install("lme4")
BiocManager::install("S4Vectors")
BiocManager::install("SingleCellExperiment")
BiocManager::install("SummarizedExperiment")
BiocManager::install("batchelor")
BiocManager::install("HDF5Array")
BiocManager::install("terra")
BiocManager::install("ggrastr")
devtools::install_github("cole-trapnell-lab/monocle3")
install.packages("beanplot")
install.packages("mixtools")
install.packages("pheatmap")
install.packages("zoo")
install.packages("squash")
install.packages("showtext")
BiocManager::install("biomaRt")
BiocManager::install("scran")
devtools::install_github("diazlab/CONICS/CONICSmat", dependencies = FALSE)
install.packages("gprofiler2")
devtools::install_github(repo = "ncborcherding/scRepertoire")