Introduction to R | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Introduction to R

Author : Katie Campbell, UCLA


This session will cover the basics of R programming, from setting up your environment to basic data analysis. In this tutorial, you will:

  • Become comfortable navigating your filesystem and environment from the R console
  • Understand, read, and manipulate data structures
  • Change or summarize datasets for basic analysis

Introduction to R programming

R is a powerful programming language and software environment used for statistical computing, data analysis, and graphical representation of data. It is widely used among statisticians, data analysts, and researchers for data mining and statistical software development.

Prerequisite: Files for today’s session

Today’s session will utilize two input files, which we will use for practice. These can be downloaded into your instance with the following commands:

curl https://raw.githubusercontent.com/ParkerICI/MORRISON-1-public/refs/heads/main/RNASeq/RNA-CancerCell-MORRISON1-metadata.tsv > intro_r_metadata.tsv
curl https://raw.githubusercontent.com/ParkerICI/MORRISON-1-public/refs/heads/main/RNASeq/data/RNA-CancerCell-MORRISON1-combat_batch_corrected-logcpm-all_samples.tsv.zip > intro_r_dataset.tsv.zip

The downloaded intro_r_metadata.tsv file contains annotation of a set of RNAseq samples from patients with melanoma treated with immunotherapies. The intro_r_dataset.tsv.zip file contains batch effect-corrected gene expression values for all of the samples in this dataset.

Why use R?

  • Open Source: R is free to use and open-source.
  • Extensive Packages: Thousands of packages available for various statistical techniques.
  • Strong Community Support: Active community contributes to continuous improvement.
  • Cross-Platform: Works on Windows, macOS, and Linux.

Getting started

We will launch R in our instance and program directly within the terminal. This is equivalent to writing code in the “Console” panel of RStudio. Launch R using the command:

R

The terminal should print out the following information:

R version 4.4.2 (2024-10-31) -- "Pile of Leaves"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> 

The > character indicates that you are now within the R environment.

After the course, you can download R from CRAN onto your personal computer. Once R is installed, you can run it from your terminal or from an Integrated Development Environment (IDE) like RStudio, which lets you write code, debug, and analyze data within a single user interface.

Some of the commands in this tutorial have a comment character (#) in front of them. These are commands that you don’t have to run in today’s session, but can come back to on your own time. Commenting your code in scripts will also support the documentation and clarity of your scripts.


Working directory and packages

Working directory

The working directory is the folder where R reads and saves files by default.

You can check your working directory by:

getwd()

If you are interested in moving to a different directory (for example, where your files are being stored), you can change the working directory:

# setwd("/new/path/")

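You can replace "/new/path/" with your desired path. Note that directory paths are written differently on Mac and Windows: Windows paths use backslashes, which in R must either be doubled ("C:\\new\\path") or replaced with forward slashes ("C:/new/path").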

Packages

Packages extend R’s functionality by providing additional functions and datasets. A package must be installed before it can be loaded. Many of the packages used in this course will already be available in your instance.

# install.packages("tidyverse")

Once packages are installed, they can be loaded using library(package_name). This is usually the first thing that you run in your script or console to load all of the functions you will be using. Tidyverse is a collection of R packages designed for data science, including packages like ggplot2 for data visualization and dplyr for data manipulation.

library(tidyverse)

As you learn commands, you can include ? before the command in order to see a description of what the command does, the parameters and inputs, and the structure of the output. For example, you can see the documentation for list.files() with the following command:

?list.files()
# Type q to exit the help documentation

This shows that list.files() will list all of the files in a directory or folder as a character vector. You can change the directory to search in with the parameter path or specify a type of file with pattern. As you perform data analysis in R over the course of this workshop, this is a helpful way to explore what commands are doing and how you can use them.

Try this: What files are in your current working directory?

You can also check the files in the parent directory, subdirectories, or based upon file path patterns:

list.files(path = "..")
list.files(pattern = ".tsv")
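
Note that pattern is interpreted as a regular expression, so the . in ".tsv" matches any character, not just a literal dot. The sketch below uses a temporary directory and made-up file names (assumptions for illustration, not files from this course) to show the difference between a loose and an anchored pattern:

```r
# Create a scratch directory containing three hypothetical files
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.tsv", "b.tsv.zip", "notes_tsv.txt"))))

# "." matches any character, so all three file names match
loose <- list.files(path = demo_dir, pattern = ".tsv")
loose

# Escaping the dot and anchoring with $ keeps only names that end in ".tsv"
strict <- list.files(path = demo_dir, pattern = "\\.tsv$")
strict
```

The loose pattern is often good enough for a quick look, while the anchored version is safer inside scripts.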

Variables and the environment

In Python, = assigns the value on the right to the name of the variable on the left. In R, either <- or = can be used for this assignment, although <- is the more common convention.

age <- 32
first_name <- 'Katie'

Displaying values

print() can be used to show a value, but you can also just type the variable name.

print(age)
age
print(first_name)
first_name

The value 32 is shown without quotes, while “Katie” is shown with quotes, since 32 is a number (stored as a double by default) and “Katie” is a character string. Another variable type is the logical (a TRUE or FALSE value). Variable types can be determined using the command typeof().

typeof(age)
typeof(32)

typeof(first_name)

typeof(TRUE)
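
As a small illustration of how these types differ, the following sketch compares a few literal values. The L suffix for integers is standard R syntax but is not otherwise used in this tutorial:

```r
type_double    <- typeof(32)    # numbers are stored as doubles by default
type_integer   <- typeof(32L)   # the L suffix explicitly requests an integer
type_character <- typeof("32")  # quotes turn the value into a character string
type_logical   <- typeof(TRUE)

c(type_double, type_integer, type_character, type_logical)

# Integers and doubles both count as numeric
is.numeric(32L)
```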

Note that if we want to print out multiple variables, we can’t simply list them in print() the way that we did in Python.

print(first_name, "is", age, "years old") # Results in an error

Instead, we can use the paste() function to combine these values into a single string:

print(paste(first_name, "is", age, "years old"))

paste() will concatenate the values, separating them by spaces. Try this: What happens if you use the paste0() function instead?

As in Python, variables must be created before they are used.

print(last_name)

This results in the error: Error: object ‘last_name’ not found

The global environment

The environment is where R stores variables and functions you’ve created. In RStudio, all of your stored variables are visibly listed in the upper right panel. You can also list the objects of your environment with the command ls():

ls()

As we store more variables, this list grows:

last_name <- "Campbell"
ls()

We can also remove variables from our environment:

rm(age)
ls()

Or we can remove all objects and clear the environment entirely:

rm(list = ls())

Creating data structures

Values can be combined into one- and two-dimensional objects in R, including vectors, lists, matrices, and data frames.

  • Vectors are created using the c() function, and all of the elements must be of the same type.
  • Lists are created using the list() function and can contain elements of multiple types.
  • Matrices are created using the matrix() function and are two-dimensional arrays that contain one type.
  • Data frames are created using the data.frame() function and are two-dimensional. Each column is a vector of a single type, but different columns can have different types.

Vectors and lists

You can create a numeric vector of only numbers. We can use the function str() to further explore the dimensions and types in the object.

numbers <- c(1, 2, 3, 4, 5)
typeof(numbers)
str(numbers)

A numeric vector can also be generated by a sequence of numbers:

sequence <- 1:10
sequence

skipping_sequence <- seq(from = 1, to = 10, by = 2)
skipping_sequence

repeated_sequence <- rep(x = 1, times = 10)
repeated_sequence

A character vector only contains character strings:

names <- c("Alice", "Bob", "Charlie", "Daniel", "Edwin")
typeof(names)
str(names)

names_sequence <- rep(x = "Your Name", 10)
names_sequence

Note that the structure command str() shows not only the type of vector, but also the length of it.

If you try to create a vector of different types, R will automatically coerce the variable into the most general type. For example, if you try to create a vector of numbers and letters, it will coerce the entire vector into characters:

values <- c(1, 2, 3, 4, "Alice", "Bob")
values
str(values)

more_values <- c(numbers, names)
str(more_values)

Notice that all of the values print out with quotation marks, and str() shows that it is a character (chr) vector.
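
The same coercion applies to other combinations of types. The sketch below (with made-up values) shows logical values being coerced to numbers, numbers being coerced to strings, and what happens when you try to convert back:

```r
# Logical values become numbers when mixed with numbers (TRUE -> 1, FALSE -> 0)
logical_and_numeric <- c(TRUE, FALSE, 3)
typeof(logical_and_numeric)

# Numbers become strings when mixed with strings
numeric_and_character <- c(1, "Alice")
typeof(numeric_and_character)

# Converting back only works for strings that look like numbers;
# anything else becomes NA (R also raises a warning, suppressed here)
converted <- suppressWarnings(as.numeric(c("1", "2", "Alice")))
converted
```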

Lists

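Lists differ from vectors in that their elements do not all have to be the same type: a single list can hold numbers, character strings, vectors, and even other lists or data frames.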
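
As a minimal sketch, a list can hold elements of different types and lengths. The element names and values here are made up for illustration:

```r
# A list mixing a string, a number, a numeric vector, and a logical
person <- list(
  name = "Alice",
  age = 32,
  scores = c(90, 85, 77),
  has_dog = TRUE
)

str(person)      # shows the type and length of each element
length(person)   # number of elements in the list, not the total number of values
person$scores    # elements can be extracted by name
```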

Matrices

Matrices are generated using the matrix() function. You can create a matrix by feeding the command a specific set of values. Look at the parameters for matrix():

?matrix()

The default of matrix() is to create a matrix with one column. To organize these values into a specific size, you need to fill in the parameters nrow and ncol:

matrix(13:24) # Creates a 12x1 matrix of these values

# These are all the same
matrix(13:24, nrow = 3)
matrix(13:24, ncol = 4)
matrix(13:24, nrow = 3, ncol = 4)

# These fill in the values by row, rather than in the order of the columns
matrix(13:24, nrow = 3, byrow = TRUE)
matrix(13:24, ncol = 4, byrow = TRUE)
matrix(13:24, nrow = 3, ncol = 4, byrow = TRUE)
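
To make the effect of byrow easier to see, here is a smaller sketch using the values 1 to 6. The object names are chosen only for this example:

```r
by_col <- matrix(1:6, nrow = 2)                # filled down the columns (default)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled across the rows

by_col[1, ]  # first row when filling by column: 1 3 5
by_row[1, ]  # first row when filling by row: 1 2 3

dim(by_col)  # both matrices have 2 rows and 3 columns
```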

Data frames

Data frames are used for storing two-dimensional data, where each row represents an observation and each column represents a variable, stored as a vector of one type.

df <- data.frame(
  name = names,
  age = numbers,
  flower = c(rep("rose", 3), rep("petunia", 2)),
  favorite_color = c('red','blue','green','yellow','macaroni and cheese'),
  has_dog = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

You can view your data frame using head():

head(df)
head(df, 3)

Vectors can be extracted from the data frame using data_frame$column_name:

df$name
df$age

Some useful commands to understand your data structures, in general, include summarizing the rows, columns, and names of these dimensions:

str(df)
nrow(df) # Number of rows
ncol(df) # Number of columns
dim(df) # Dimensions of the data frame (row, column)
colnames(df)
names(df)

The summary() function summarizes each column of a data frame according to its type:

summary(df)

Adding, removing, and modifying columns

df$new_column <- "my new column"
df

df$new_column <- NULL
df

df$lower_name <- tolower(df$name)
df

df$has_dog <- ifelse(df$has_dog, "yes", "no")
df

df$has_dog <- ifelse(df$has_dog == "yes", TRUE, FALSE)
df

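In tidyverse, you can use the mutate() function to do the same thing. Note that mutate() returns a modified copy rather than changing df in place, so you have to assign the result back to the variable to keep the change: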

mutate(df, new_column = "my new column")
df

df <- mutate(df, new_column = "my new column")
df

# You can also make several of these changes in one call
mutate(df, new_column = "change it again", new_name = toupper(name))

Indexing

Unlike Python, R uses one-based indexing, so the index of the first element is 1, not 0.

Indexing one dimensional objects: Vectors, lists

For one-dimensional objects, like vectors or lists, we just use brackets to extract the specific index:

numbers[1]

df$name[3]

This is also how we overwrite these values:

df$name[3] <- "Frederick"
df
colnames(df) # outputs a vector of the column names
colnames(df)[2] <- "age_in_years"
colnames(df)

Lists are sometimes named or can be more complex, since not every element of the list has to be the same structure or type. This is where double brackets may come into play:

complicated_list <- list("first" = c(1,2,3,4), "second" = df, "third" = 1087.29)
str(complicated_list)
complicated_list
complicated_list[[1]]
complicated_list[['first']]
complicated_list$first
complicated_list[['first']][3]
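
The difference between single and double brackets is easy to miss: single brackets return a smaller list, while double brackets return the element itself. A short sketch with a made-up list:

```r
example_list <- list(first = c(1, 2, 3, 4), third = 1087.29)

single <- example_list["first"]    # still a list, containing one element
double <- example_list[["first"]]  # the numeric vector stored in that element

class(single)
class(double)

# Arithmetic works on the extracted vector, but not on the one-element list
sum(double)
```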

Indexing two-dimensional objects: Matrices, data frames

For two dimensional objects, you have to specify both the rows and columns that you’re interested in viewing, using [row, column]:

# This can be done with indices
my_matrix <- matrix(13:24, nrow = 3)
my_matrix
my_matrix[1,4]

# Or if the rows/columns are named
colnames(my_matrix) <- LETTERS[1:4]
row.names(my_matrix) <- paste0("row", 1:3)
my_matrix
my_matrix['row1','C']

# To get all of the rows or all of the columns you leave the field blank
# BUT you still need the comma
my_matrix['row3',]

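# Without the comma, the name is looked up among element names of the matrix,
# not among its row or column names, so these return NA
my_matrix['A']
my_matrix['row2']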
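
The reason is that a matrix is stored as one long vector, filled column by column. The sketch below rebuilds the same matrix under a different name (m, chosen for this example) to show what a single index does:

```r
m <- matrix(13:24, nrow = 3,
            dimnames = list(paste0("row", 1:3), LETTERS[1:4]))

# A single number indexes the underlying vector in column order,
# so element 5 is the value in row2, column B
fifth <- m[5]
fifth
m["row2", "B"]

# A single name does not match row or column names, so the result is NA
missing_value <- m["A"]
missing_value

# With the comma, the whole column is returned
m[, "A"]
```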

String Manipulation

The stringr package in R (one of the packages in tidyverse) simplifies string manipulation with easy-to-use functions that handle typical string operations.

Finding patterns

Finding specific sequences or motifs within biological sequences is a common task.

sequence <- "ATGCGTACGTTGACA"
motif <- "CGT"
str_locate(sequence, motif)
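
Note that str_locate() reports only the first match. If a motif may occur more than once, str_locate_all() (also from stringr) returns every match. A short sketch using the same sequence:

```r
library(stringr)

sequence <- "ATGCGTACGTTGACA"
motif <- "CGT"

# First occurrence only: a one-row matrix with start and end positions
first_hit <- str_locate(sequence, motif)
first_hit

# All occurrences: a list with one matrix per input string
all_hits <- str_locate_all(sequence, motif)[[1]]
all_hits

# Number of occurrences
nrow(all_hits)
```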

Replacing substrings

Modifying sequences by replacing specific nucleotides or amino acids.

dna_sequence <- "ATGCGTACGTTGACT"
rna_sequence <- str_replace_all(dna_sequence, "T", "U")
print(rna_sequence)

Substring extraction

Extracting parts of sequences, such as cutting out genes or regions of interest.

extracted_sequence <- str_sub(sequence, 3, 8)
print(extracted_sequence)

Length calculation

Determining the length of sequences.

sequence_length <- str_length(sequence)
print(sequence_length)

Case conversion

Converting uppercase to lowercase, or vice versa.

sequence_upper <- str_to_upper(sequence)
print(sequence_upper)

Splitting strings

Splitting sequences into arrays, useful for reading fasta files or analyzing codons.

codons <- str_sub(sequence, seq(1, str_length(sequence), by = 3), seq(3, str_length(sequence), by = 3))
print(codons)

Try this: What if our sequence length wasn’t a multiple of three?

Counting specific characters

Counting occurrences of specific nucleotides or amino acids.

guanine_count <- str_count(sequence, "G")
print(guanine_count)

Logical operations

Data analysis often requires filtering data using comparison operators:

  • == checks whether two values are equal. A common error is to use a single =, which instead assigns a value or sets an argument in a function call.
  • != checks if two values are not equivalent
  • >, <, >=, and <= check for greater/less than (or equal to) for comparing numbers

You can use these operators to subset vectors:

numbers<3
which(numbers<3)
numbers[numbers<3]
numbers[which(numbers<3)]

Conditional statements can be combined using logical operators:

  • & requires that both statements are true
  • | requires that at least one of the statements is true
  • ! negates a statement, so TRUE becomes FALSE and vice versa

numbers>1 & numbers<5
which(numbers>1 | numbers<5)
numbers[numbers>1 & numbers<5]
numbers[which(numbers>1 & numbers<5)]
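
The ! operator is not used in the commands above, so here is a brief sketch of negation and of | used for subsetting. The vector is redefined so that the example runs on its own:

```r
numbers <- c(1, 2, 3, 4, 5)

# Negate a condition: keep everything that is NOT less than 3
not_small <- numbers[!(numbers < 3)]
not_small

# Keep values at either extreme: less than 2 OR greater than 4
extremes <- numbers[numbers < 2 | numbers > 4]
extremes
```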

When we want to use these operations to filter data frames, we can select rows by applying a comparison to the vector stored in a column. Note that we need to include the comma after the row condition to indicate that we want all of the columns of the matching rows. Recall that we renamed the age column to age_in_years above:

df[df$age_in_years < 3, ]

Alternatively, we can also use the filter() command from the dplyr package:

filter(df, age_in_years < 3)

Reading in your own data

Instead of typing out all of the data elements, you can import data from various file formats. Some file types are easily read in by base R. Note that there are packages that may be helpful that are specific to certain input file types.

Reading tab- or comma-delimited files

It is important to know the format of the file you are trying to read. The extension tsv indicates that the values are tab-separated.

metadata <- read.table("intro_r_metadata.tsv")

What is wrong with this data when you evaluate it?

head(metadata)

You’ll notice that the column names are in the first row of data, and all of the columns are labeled as “V” and the number corresponding to the column. It helps to be specific about how R should read in the data:

metadata <- read.table("intro_r_metadata.tsv", sep = "\t", header = TRUE)
head(metadata)
str(metadata)
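
To see the effect of these arguments in isolation, the sketch below writes a tiny tab-separated file to a temporary location and reads it both ways. The column names and values are invented for this example and are not taken from the course metadata:

```r
# Write a small hypothetical TSV file: one header line and two data lines
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("sample.id\tsubject.age", "S1\t54", "S2\t61"), tsv_path)

# Without header = TRUE, the header line is treated as data
no_header <- read.table(tsv_path)
colnames(no_header)
nrow(no_header)

# With sep and header specified, names and types are read correctly
with_header <- read.table(tsv_path, sep = "\t", header = TRUE)
colnames(with_header)
str(with_header)
```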

Reading Excel Files

The readxl package can be used to read in Excel files:

# install.packages("readxl")
# library(readxl)
# excel_data <- read_excel("path/to/excel.xlsx", sheet = "Sheet1")

Exploring the data

You can get a glimpse of the data by only looking at the first few lines:

head(metadata)

You can quickly perform summary statistics on a dataset or a vector:

summary(metadata)
summary(metadata$subject.age)

Try this:

  • Filter data to the samples that are cutaneous, and save this as a new data frame.
  • What is the average age of the patients in this filtered dataset?

Factors

Currently, all of the character strings have been read in as characters. This is helpful, but oftentimes our data is more structured than that. For example, the bor column in metadata encodes clinical outcomes, and our brains think of these as an ordinal variable, not just a categorical one. That is, we think of the outcomes in order from best to worst: CR > PR > SD > PD (Complete response is better than partial response, which is better than stable disease, then progressive disease).

A factor encodes this information using levels. When we read in data, strings are read in as character vectors, but we can coerce them to become factors for our data summary:

summary(metadata$bor)
unique(metadata$bor)

metadata$bor <- factor(metadata$bor)
summary(metadata$bor)

metadata$bor <- factor(metadata$bor, levels = c("CR","PR","SD","PD"))
summary(metadata$bor)
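
One thing to watch for: any value that is not listed in levels becomes NA. The sketch below uses a small made-up vector, including a hypothetical "NE" label, to show this along with the effect of the level order on sorting:

```r
# Hypothetical response values; "NE" is a made-up label not listed in the levels
bor <- c("PD", "CR", "SD", "PR", "PD", "NE")
bor_factor <- factor(bor, levels = c("CR", "PR", "SD", "PD"))

levels(bor_factor)                  # levels in the order we specified
table(bor_factor, useNA = "ifany")  # counts per level, with "NE" counted as NA

# Sorting follows the level order rather than alphabetical order (NA is dropped)
sorted_bor <- as.character(sort(bor_factor))
sorted_bor
```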

Try this:

  • Filter metadata to the samples from patients that had complete response to therapy, and save this as a new data frame.
  • How many of these tumors came from cutaneous tumors?

Sorting data

Like numbers, factors have a defined order. Once we have set the levels of a factor in a particular order, we can also arrange the data in that order.

numbers
order(numbers)
order(-numbers)

arrange(metadata, bor)
arrange(metadata, subject.age)
arrange(metadata, -subject.age)


arrange(metadata, -bor) # GIVES A WARNING
arrange(metadata, desc(bor)) # Better for factors
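
It can be unclear at first that order() returns positions rather than sorted values. A minimal base R sketch, using made-up ages and a hypothetical patients data frame:

```r
ages <- c(45, 23, 67)

# order() returns the positions that would sort the vector
positions <- order(ages)
positions

# Using those positions as an index gives the sorted values
ascending <- ages[order(ages)]
descending <- ages[order(-ages)]

# The same idea sorts the rows of a data frame in base R
patients <- data.frame(id = c("p1", "p2", "p3"), age = ages)
sorted_patients <- patients[order(patients$age), ]
sorted_patients
```

This is roughly what arrange() does for you, with a more readable syntax.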

Complex data manipulation

Long and Wide Data Formats

Long and wide data formats are two common ways of structuring data, each with its own advantages and use cases.

Long Format

In the long format, also known as the “tidy” format, each observation is represented by a single row in the dataset. This format is characterized by having:

  • Multiple rows, each corresponding to a single observation or measurement.
  • One column for the variable being measured.
  • Additional columns to store metadata or grouping variables.

Advantages:

  • Facilitates easy analysis and manipulation, especially when using tools like Tidyverse packages in R.
  • Suitable for data that follow the “one observation per row” principle, such as time series or longitudinal data.

Wide Format

In the wide format, each subject (or other unit of interest) is represented by a single row, and repeated measurements are spread across multiple columns. This format is characterized by:

  • One row per subject.
  • A separate column for each measurement, such as each time point or sample.

Advantages:

  • Can be easier to understand for simple datasets with few variables.
  • May be more convenient for certain types of analysis or visualization.

Choosing Between Long and Wide Formats

The choice between long and wide formats depends on factors such as the nature of the data, the analysis tasks, and personal preference. Long format is often preferred for its flexibility and compatibility with modern data analysis tools, while wide format may be suitable for simpler datasets or specific analysis requirements.

Long to Wide

library(tidyr)

# Example long format data
long_data <- data.frame(
  Subject = c("A", "A", "B", "B"),
  Time = c(1, 2, 1, 2),
  Measurement = c(10, 15, 12, 18)
)

# Convert long format data to wide format
wide_data <- spread(long_data, key = Time, value = Measurement)

# View the wide format data
print(wide_data)

Wide to Long

library(tidyr)

# Example wide format data
wide_data <- data.frame(
  Subject = c("A", "B"),
  Time1 = c(10, 12),
  Time2 = c(15, 18)
)

# Convert wide format data to long format
long_data <- gather(wide_data, key = Time, value = Measurement, -Subject)

# View the long format data
print(long_data)
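
In recent versions of tidyr, spread() and gather() still work but have been superseded by pivot_wider() and pivot_longer(). As a sketch, the same two conversions could be written as follows. The names_prefix value and the back_to_long name are choices made for this example:

```r
library(tidyr)

long_data <- data.frame(
  Subject = c("A", "A", "B", "B"),
  Time = c(1, 2, 1, 2),
  Measurement = c(10, 15, 12, 18)
)

# Long to wide: one column per value of Time, named Time1 and Time2
wide_data <- pivot_wider(long_data, names_from = Time,
                         values_from = Measurement, names_prefix = "Time")
print(wide_data)

# Wide to long: gather every column except Subject back into two columns
back_to_long <- pivot_longer(wide_data, cols = -Subject,
                             names_to = "Time", values_to = "Measurement")
print(back_to_long)
```

Note that after the round trip, the Time column contains the strings "Time1" and "Time2" rather than the original numbers.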

Example: Gene expression data

Let’s work with a real dataset: intro_r_dataset.tsv.zip. This file is compressed, so we can’t read it in by the default read.table function. This is a circumstance where I use the data.table package and the fread function, which automatically recognizes the compression format to read in the file:

read.table("intro_r_dataset.tsv.zip") # DOESN'T WORK

# install.packages('data.table')
library(data.table)
data <- fread("intro_r_dataset.tsv.zip")

dim(data)
colnames(data)
str(data)

Try this: Create a long data frame, where the key is called “sample.id” and the value column contains the gene expression (“cpm”). Call this new dataset “data_long”.

Merging Data

Merging combines data from different sources based on common variables or keys, which is a common task when analyzing biological data. In Tidyverse, these operations are typically performed using the join functions from the dplyr package.

Types of Joins:

Inner Join (inner_join()):

An inner join combines rows from two datasets where there is a match based on a common key, retaining only the rows with matching keys from both datasets.

Left Join (left_join()):

A left join combines all rows from the first (left) dataset with matching rows from the second (right) dataset based on a common key. If there is no match in the second dataset, missing values are filled in.

Right Join (right_join()):

Similar to a left join, but it retains all rows from the second (right) dataset and fills in missing values for non-matching rows from the first (left) dataset.

Full Join (full_join()):

A full join combines all rows from both datasets, filling in missing values where there are no matches.

Semi-Join (semi_join()):

A semi-join returns only rows from the first dataset where there are matching rows in the second dataset, based on a common key.

Anti-Join (anti_join()):

An anti-join returns only rows from the first dataset that do not have matching rows in the second dataset, based on a common key.

Merge (merge()):

The merge() function is a base R function used to merge datasets based on common columns or keys. It performs similar operations to joins in dplyr, but with slightly different syntax and behavior.

Example:

library(dplyr)

# Example datasets
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(2, 3, 4), Score = c(85, 90, 95))

# Inner join
inner_merged <- inner_join(df1, df2, by = "ID")

# Left join
left_merged <- left_join(df1, df2, by = "ID")

# Right join
right_merged <- right_join(df1, df2, by = "ID")

# Full join
full_merged <- full_join(df1, df2, by = "ID")

# Semi-join
semi_merged <- semi_join(df1, df2, by = "ID")

# Anti-join
anti_merged <- anti_join(df1, df2, by = "ID")
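
A quick way to check that a join behaved as expected is to compare row counts. The sketch below repeats the two example data frames so that it runs on its own, then counts the rows returned by each join:

```r
library(dplyr)

df1 <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(2, 3, 4), Score = c(85, 90, 95))

# IDs 2 and 3 are shared; ID 1 is only in df1 and ID 4 is only in df2
join_sizes <- c(
  inner = nrow(inner_join(df1, df2, by = "ID")),
  left  = nrow(left_join(df1, df2, by = "ID")),
  right = nrow(right_join(df1, df2, by = "ID")),
  full  = nrow(full_join(df1, df2, by = "ID")),
  semi  = nrow(semi_join(df1, df2, by = "ID")),
  anti  = nrow(anti_join(df1, df2, by = "ID"))
)
join_sizes

# The left join keeps Alice (ID 1) and fills her Score with NA
left_join(df1, df2, by = "ID")
```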

Example: Merging our metadata with our gene expression data

Now let’s merge our data_long with metadata so that our gene expression data also contains our sample annotation.

left_join(data_long, metadata)


Chaining commands, group_by(), and summarise()

So far, we’ve used individual commands to accomplish several tasks, but sometimes we want to do multiple things in one line of code. The %>% is called a pipe and is used to chain commands together.

df
df %>% 
  mutate(age_in_days = age_in_years*365) %>% 
  filter(age_in_days < 500)
  
metadata %>%
  filter(sample.tumor.type == "cutaneous") %>%
  arrange(bor)

group_by() allows us to split a data frame into groups, and we can use summarise() to compute summary values within each of those groups:

metadata %>%
  group_by(sample.tumor.type, bor) %>% 
  summarise(total = n(), mean_age = mean(subject.age, na.rm = TRUE)) # n() performs a count
  
metadata %>%
  group_by(sample.tumor.type, bor) %>% count() # This is another way to quickly count groups
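
If you want to see exactly what these commands compute, it can help to try them on a table small enough to check by hand. The data frame below is invented for this sketch and does not come from the course metadata:

```r
library(dplyr)

# Hypothetical annotation with one missing age
toy_metadata <- data.frame(
  tumor_type = c("cutaneous", "cutaneous", "mucosal", "mucosal", "mucosal"),
  age = c(50, 60, 70, NA, 80)
)

toy_summary <- toy_metadata %>%
  group_by(tumor_type) %>%
  summarise(
    total = n(),                       # number of rows in each group
    mean_age = mean(age, na.rm = TRUE), # mean of the non-missing ages
    n_missing_age = sum(is.na(age))     # how many ages were missing
  )
toy_summary
```

Without na.rm = TRUE, the mean for the mucosal group would be NA because one age is missing.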

Advanced exercise: Create a wide data frame that summarizes the total number of responders (defined by bor) per tumor type (in each row).


Repeating tasks/functions

The group_by() functionality in tidyverse allows you to compute many metrics across individual groups, so you don’t have to create a filtered dataset over and over again. Base R also has a family of functions that make it easier to apply the same function repeatedly.

apply

The apply() function in R is a powerful tool for applying a function to the rows or columns of a matrix or data frame. It is particularly useful for performing operations across a dataset without needing to write explicit loops. The syntax for apply() is:

apply(X, MARGIN, FUN, ...)

# X: The array, matrix, or data frame on which you want to apply the function.
# MARGIN: A value that specifies whether to apply the function over rows (1), columns (2), or both (c(1, 2)).
# FUN: The function you want to apply to each row or column.

To calculate the sum of each row in a matrix:

# Create a matrix
my_matrix <- matrix(1:9, nrow=3)

# Apply sum function across rows
row_sums <- apply(my_matrix, 1, sum)
print(row_sums)

To find the mean of each column in a data frame:

# Create a data frame
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Apply mean function across columns
column_means <- apply(df, 2, mean)
print(column_means)
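
For common summaries, base R also provides dedicated functions such as rowSums() and colMeans(), and apply() accepts your own functions as well. A short sketch, where the matrix name and the range calculation are chosen for illustration:

```r
number_matrix <- matrix(1:9, nrow = 3)

# These two give the same row totals
sums_with_apply <- apply(number_matrix, 1, sum)
sums_with_rowSums <- rowSums(number_matrix)

# Column means without apply()
column_means <- colMeans(number_matrix)

# apply() with a custom function: the range (max minus min) of each column
column_ranges <- apply(number_matrix, 2, function(column) max(column) - min(column))

sums_with_apply
column_means
column_ranges
```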

sapply and lapply

  • lapply() returns a list, regardless of the output of each application of the function.
  • sapply() attempts to simplify the result into a vector or matrix if possible. If simplification is not possible, it returns a list similar to lapply().

Suppose you have a list of numerical vectors and you want to compute the sum of each vector. Here’s how you could use lapply():

# Define a list of vectors
num_list <- list(c(1, 2, 3), c(4, 5), c(6, 7, 8, 9))

# Use lapply to apply the sum function
list_sums <- lapply(num_list, sum)
print(list_sums)

Using the same list of numerical vectors, if you use sapply() to compute the sum, the function will try to simplify the output into a vector:

# Use sapply to apply the sum function
vector_sums <- sapply(num_list, sum)
print(vector_sums)

When to Use Each

  • lapply(): When you need the robustness of a list output, especially when dealing with heterogeneous data or when the function can return variable lengths or types.
  • sapply(): When you are working with homogeneous data and prefer a simplified output such as a vector or matrix, assuming the lengths and types are consistent across elements.
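
The simplification performed by sapply() depends on what the function returns, which the following sketch illustrates. It also shows vapply(), a base R relative that is not covered above and that requires you to state the expected output type:

```r
num_list <- list(c(1, 2, 3), c(4, 5), c(6, 7, 8, 9))

# Each call returns two values (min and max), so sapply builds a matrix
ranges <- sapply(num_list, range)
ranges

# Each call returns a different number of values, so sapply falls back to a list
filtered <- sapply(num_list, function(x) x[x > 2])
filtered

# vapply fails with an error if a result is not a single number
sums <- vapply(num_list, sum, numeric(1))
sums
```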

Advanced: Using enframe() and unnest() to create a data frame

Another option, if you enjoy using tidyverse and group_by(), is to convert a list of objects into a data structure amenable to this:

data1 <- data.frame("subject" = paste0("subject", sample(1:10, 5, replace = FALSE)),
  "age" = sample(1:100, 5))
data2 <- data.frame("subject" = paste0("subject", sample(11:20, 5, replace = FALSE)),
  "age" = sample(1:100, 5))
data3 <- data.frame("subject" = paste0("subject", sample(21:30, 5, replace = FALSE)),
  "age" = sample(1:100, 5))

list_of_data <- list("data1" = data1, "data2" = data2, "data3" = data3)

enframe(list_of_data)

enframe(list_of_data) %>% unnest(value)

enframe() will convert a list of objects into a tibble (a type of data frame), and each list element is encoded in the value column.

unnest(value) will unnest the data frames from the value column into individual rows.

This is very helpful if you ever have many files that are the same type, and you’re trying to create one master data frame.

# my_files <- list.files(path = "/path/to/data", pattern = "expression.tsv", full.names = TRUE)
# merged_data <- lapply(my_files, fread) %>% enframe() %>% unnest(value)

Additional exercises

Try these exercises, using the metadata and data_long objects that you read into R:

  • What sample had the highest expression of B2M?
  • Which cohort had the highest average expression of B2M?
  • Which gene had the highest average expression across all patients?
  • Which response group (bor) had the highest average expression of NLRC5?
  • What is the median expression of HLA-A within each tumor type/response group?

Save R objects for future use

Throughout the course, we will complete one stage of analysis and save objects from the environment to individual files for future use. If you want to export all objects in the environment, you can use save.image("path/to/file.RData") and restore them later with load("path/to/file.RData"). Alternatively, individual objects can be saved with saveRDS() and read back in later with readRDS():

saveRDS(data, file = "testdata.rds")

# Next time you open R, you can reload the object with:
data <- readRDS("testdata.rds")
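
Objects written with saveRDS() are read back with readRDS(), which returns the object so that you can assign it to any name. A minimal round-trip sketch, using a temporary file and a made-up data frame:

```r
# A small hypothetical object to save
scores <- data.frame(subject = c("A", "B"), score = c(10, 12))

# Save to a temporary .rds file and read it back under a new name
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, file = rds_path)
restored_scores <- readRDS(rds_path)

# The restored object matches the original
identical(scores, restored_scores)
```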

Closing R

When you want to quit R in your terminal, you can type in the following command:

q()

The console will ask you:

Save workspace image? [y/n/c]: 

Answering “yes” and saving your workspace image will create a hidden file called .RData in your current working directory. This file stores all of the data objects in your global environment. If you open R again from the same directory, the file is loaded automatically, recreating your environment. This is helpful for continuing analysis from where you previously left off.

Answering “no” will not save your current environment, and you will need to rerun all of the code that you previously ran. Oftentimes, all of your code will be stored in an R script and will reproduce your prior analysis.


Additional Resources

Unix

Adapted by : Jason Walker, McDonnell Genome Institute
Additional adaptation by : Alex Wagner, McDonnell Genome Institute
Original author : Keith Bradnam, UC Davis Genome Center
Version 1.04 — 2016-11-11


Introduction

This ‘bootcamp’ is intended to provide the reader with a basic overview of essential Unix/Linux commands that will allow them to navigate a file system and move, copy, and edit files. It will also introduce a brief overview of some ‘power’ commands in Unix. It was originally developed as part of a Bioinformatics Core Workshop taught at UC Davis (Using the Linux Command-Line for Analysis of High Throughput Sequence Data).

Why Unix?

The Unix operating system has been around since 1969. Back then there was no such thing as a graphical user interface. You typed everything. It may seem archaic to use a keyboard to issue commands today, but it is much easier to automate keyboard tasks than mouse tasks. There are several variants of Unix (including Linux and OSX), though the differences do not matter much for most basic functions.

Increasingly, the raw output of biological research exists as in silico data, usually in the form of large text files. Unix is particularly suited to working with such files and has several powerful (and flexible) commands that can process your data for you. The real strength of learning Unix is that most of these commands can be combined in an almost unlimited fashion. So, if you can learn just five Unix commands, you will be able to do a lot more than just five things.

Typeset Conventions

Command-line examples that you are meant to type into a terminal window will be shown indented in a constant-width font, e.g.

    ls -lrh

Sometimes the accompanying text will include a reference to a Unix command. Any such text will also be in a constant-width, boxed font. E.g.

Type the pwd command again.

From time to time this documentation will contain web links to pages that will help you find out more about certain Unix commands. Usually, the first mention of a command or function will be a hyperlink to Wikipedia. Important or critical points will be styled like so:

This is an important point!


Assumptions

The lessons from this point onwards will assume very little apart from the following:

  1. You have access to a Unix/Linux system
  2. You know how to launch a terminal program on that system
  3. You have a home directory where you can create/edit new files

In the following documentation, we will also assume that the logged-in user has the username ‘ubuntu’ and that the home directory is located at /home/ubuntu.

1. The Terminal

A terminal is the common name for the program that does two main things. It allows you to type input to the computer (i.e. run programs, move/view files etc.) and it allows you to see output from those programs. All Unix machines will have a terminal program available.

Open the terminal application. You should now see a window containing a command prompt, similar to the following (screenshot: unixterm).

There will be many situations where it will be useful to have multiple terminals open and it will be a matter of preference as to whether you want to have multiple windows, or one window with multiple tabs (there are typically keyboard shortcuts for switching between windows, or moving between tabs).


2. Your First Unix Commands

It’s important to note that you will always be inside a single directory when using the terminal. The default behavior is that when you open a new terminal you start in your own home directory (containing files and directories that only you can modify). To see what files and directories are in our home directory, we need to use the ls command. This command lists the contents of a directory. If we run the ls command we should see something like (though it will reflect the files on your system):

    ubuntu@:~$ ls ~
    R  bin  tools  workspace

    ubuntu@:~$ cd ~

    ubuntu@:~$ mkdir workspace

    ubuntu@:~$ ls ~

There are four things that you should note here:

  1. You will probably see different output from what is shown here; it depends on your computer setup. Don’t worry about that for now.
  2. The ubuntu@:~$ text that you see is the Unix command prompt. In this case, it contains a user name (‘ubuntu’) and the name of the current directory (‘~’, more on that later). Note that the command prompt might not look the same on different Unix systems. In this case, the $ sign marks the end of the prompt.
  3. The output of the ls command lists four things. In this case, they are all directories, but they could also be files. We’ll learn how to tell them apart later on. These directories were created as part of a specific course that used this bootcamp material. You will therefore probably see something very different on your own computer.
  4. After the ls command finishes it produces a new command prompt, ready for you to type your next command.

The ls command is used to list the contents of any directory, not necessarily the one that you are currently in. Try the following:

    ubuntu@:~$ ls ~/workspace/

    ubuntu@:~$ ls /etc/


3. The Unix Tree

Looking at directories from within a Unix terminal can often seem confusing. But bear in mind that these directories are exactly the same type of folders that you can see if you use any graphical file browser. From the root level (/) there are usually a dozen or so directories. You can treat the root directory like any other, e.g. you can list its contents:

    ubuntu@:~$ ls /
    bin   dev  home        lib    lost+found  mnt  proc  run   snap  sys  usr  vmlinuz
    boot  etc  initrd.img  lib64  media       opt  root  sbin  srv   tmp  var  workspace

You might notice some of these names appearing in different colors. Many Unix systems will display files and directories differently by default. Other colors may be used for special types of files. When you log in to a computer you are working with your files in your home directory, and this is often inside a directory called ‘users’ or ‘home’.


4. Finding Out Where You Are

There may be many hundreds of directories on any Unix machine, so how do you know which one you are in? The command pwd will Print the Working Directory and that’s pretty much all this command does:

    ubuntu@:~$ pwd
    /home/ubuntu

When you log in to a Unix computer, you are typically placed into your home directory. In this example, after we log in, we are placed in a directory called ‘ubuntu’ which itself is a subdirectory of another directory called ‘home’. Conversely, ‘home’ is the parent directory of ‘ubuntu’. The first forward slash that appears in a list of directory names always refers to the top level directory of the file system (known as the root directory). The remaining forward slash (between ‘home’ and ‘ubuntu’) delimits the various parts of the directory hierarchy. If you ever get ‘lost’ in Unix, remember the pwd command.

As you learn Unix you will frequently type commands that don’t seem to work. Most of the time this will be because you are in the wrong directory, so it’s a really good habit to get used to running the pwd command a lot.


5. Making New Directories

If we want to make a new directory (e.g. to store some work related data), we can use the mkdir command:

    ubuntu@:~$ mkdir workspace/Learning_unix
    ubuntu@:~$ ls workspace
    data  Learning_unix  lib  lost+found

6. Changing Directories and Command Options

We are in the home directory on the computer but we want to work in the new Learning_unix directory. To change directories in Unix, we use the cd command:

    cd workspace/Learning_unix
    ubuntu@:~/workspace/Learning_unix$ 

Notice that — on this system — the command prompt has expanded to include our current directory. This doesn’t happen by default on all Unix systems, but you should know that you can configure what information appears as part of the command prompt.

Let’s make two new subdirectories and navigate into them:

    ubuntu@:~/workspace/Learning_unix$ mkdir Outer_directory
    ubuntu@:~/workspace/Learning_unix$ cd Outer_directory
    ubuntu@:~/workspace/Learning_unix/Outer_directory$ 

    ubuntu@:~/workspace/Learning_unix/Outer_directory$ mkdir Inner_directory
    ubuntu@:~/workspace/Learning_unix/Outer_directory$ cd Inner_directory/
    ubuntu@:~/workspace/Learning_unix/Outer_directory/Inner_directory$ 

Now our command prompt is getting quite long, but it reveals that we are four levels beneath the home directory. We created the two directories in separate steps, but it is possible to use the mkdir command in a way that does this all in one step.

Like most Unix commands, mkdir supports command-line options which let you alter its behavior and functionality. Command-line options are — as the name suggests — optional arguments that are placed after the command name. They often take the form of single letters (following a dash). If we had used the -p option of the mkdir command we could have done this in one step. E.g.

    mkdir -p Outer_directory/Inner_directory

Note the spaces on either side of the -p !

Sometimes options are entire words, usually preceded by a double-dash. For example, using the --parents option:

    mkdir --parents Outer_directory/Inner_directory

is identical to using the -p option.
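
To see what -p changes in practice, here is a minimal sketch. It runs inside a throwaway directory created with mktemp -d (an assumption made for this illustration, so that nothing in your workspace is touched); the directory names are the ones used above.

```shell
# Work in a throwaway directory (created here purely for illustration).
demo=$(mktemp -d)
cd "$demo"

# Without -p, mkdir refuses to create a directory whose parent is missing.
if mkdir Outer_directory/Inner_directory 2> /dev/null; then
    plain_result="created"
else
    plain_result="failed"
fi
echo "plain mkdir: $plain_result"

# With -p, any missing parent directories are created along the way.
mkdir -p Outer_directory/Inner_directory

# Running it again is harmless: -p does not complain if the directory exists.
mkdir -p Outer_directory/Inner_directory

# List what was created, including subdirectories.
ls -R Outer_directory
```

The plain mkdir fails because Outer_directory does not exist yet, while the -p form creates both levels. Note that the long --parents spelling comes from the GNU version of mkdir found on Linux; other Unix variants (macOS, for example) may only accept -p.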


7. Getting Help

Many programs will provide information about the command being called by passing a -h or --help option.

    ubuntu@:~$ mkdir --help
    Usage: mkdir [OPTION]... DIRECTORY...
    Create the DIRECTORY(ies), if they do not already exist.

    Mandatory arguments to long options are mandatory for short options too.
      -m, --mode=MODE   set file mode (as in chmod), not a=rwx - umask
      -p, --parents     no error if existing, make parent directories as needed
      -v, --verbose     print a message for each created directory
      -Z                   set SELinux security context of each created directory
                             to the default type
          --context[=CTX]  like -Z, or if CTX is specified then set the SELinux
                             or SMACK security context to CTX
          --help     display this help and exit
          --version  output version information and exit

    GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
    Full documentation at: <http://www.gnu.org/software/coreutils/mkdir>
    or available locally via: info '(coreutils) mkdir invocation'

In addition, many commands will have man (manual) pages, which are often more detailed descriptions of the program and its usage.

    man ls
    man cd
    man man # yes even the man command has a manual page

When you are using the man command, press space to scroll down a page, b to go back a page, or q to quit. You can also use the up and down arrows to scroll a line at a time. The man command is actually using another Unix program, a text viewer called less, which we’ll come to later on.


8. The Root Directory

Let’s change directory to the root directory, and then into our home directory:

    ubuntu@:~/workspace/Learning_unix/Outer_directory/Inner_directory$ cd /
    ubuntu@:/$ cd home
    ubuntu@:/home$ cd ubuntu
    ubuntu@:~$ 

In this case, we may as well have just changed directory in one go:

    cd /home/ubuntu

The leading / is incredibly important. The following two commands are very different:

    cd /home/ubuntu/
    cd home/ubuntu/

The first command says go to the ubuntu directory that is beneath the home directory that is at the top level (the root) of the file system. There can only be one /home/ubuntu directory on any Unix system.

The second command says go to the ubuntu directory that is beneath the home directory that is located wherever I am right now. There can potentially be many home/ubuntu directories on a Unix system (though this is unlikely).

Learn and understand the difference between these two commands.
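
The difference is easy to demonstrate safely. The following sketch is only an illustration: it uses /tmp (assumed to exist, as it does on virtually all Unix systems) instead of /home/ubuntu, and a throwaway sandbox directory that contains its own, unrelated tmp subdirectory.

```shell
# Build a sandbox that contains its own directory named "tmp".
demo=$(mktemp -d)
mkdir "$demo/tmp"

# Relative path (no leading slash): looked up beneath the current directory.
cd "$demo"
cd tmp
relative_result=$(pwd)
echo "cd tmp  took us to: $relative_result"

# Absolute path (leading slash): looked up from the root, wherever we are.
cd "$demo"
cd /tmp
absolute_result=$(pwd)
echo "cd /tmp took us to: $absolute_result"
```

Both commands were run from the same starting directory, yet they end up in two different places, and only the one with the leading / would give the same result from any starting point.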


9. Climbing The Tree

Frequently, you will find that you want to go ‘upwards’ one level in the directory tree. Two dots .. are used in Unix to refer to the parent directory of wherever you are. Every directory has a parent except the root level of the computer. Let’s go into the Learning_unix directory and then navigate up three levels:

    ubuntu@:~$ cd workspace/Learning_unix/
    ubuntu@:~/workspace/Learning_unix$ cd ..
    ubuntu@:~/workspace$ cd ..
    ubuntu@:~$ cd ..
    ubuntu@:/home$ 

What if you wanted to navigate up two levels in the file system in one go? It’s very simple, just use two sets of the .. operator, separated by a forward slash:

    cd ../..

10. Absolute and Relative Paths

Using cd .. allows us to change directory relative to where we are now. You can also always change to a directory based on its absolute location. E.g. if you are working in the /home/ubuntu/workspace/Learning_unix directory and you then want to change to the /tmp directory, then you could do either of the following:

    $ cd ../../../../tmp

or…

    $ cd /tmp

They both achieve the same thing, but the 2nd example requires that you know about the full path from the root level of the computer to your directory of interest (the ‘path’ is an important concept in Unix). Sometimes it is quicker to change directories using the relative path, and other times it will be quicker to use the absolute path.


11. Finding Your Way Back Home

Remember that the command prompt shows you the name of the directory that you are currently in, and that when you are in your home directory it shows you a tilde character (~) instead? This is because Unix uses the tilde character as a short-hand way of specifying a home directory.

See what happens when you try the following commands (use the pwd command after each one to confirm the results if necessary):

    cd /
    cd ~
    cd
    cd -

Hopefully, you should find that cd and cd ~ do the same thing, i.e. they take you back to your home directory (from wherever you were). The last command, cd -, takes you back to whichever directory you were in before your most recent cd. You will frequently want to jump straight back to your home directory, and typing cd is a very quick way to get there.

You can also use the ~ as a quick way of navigating into subdirectories of your home directory when your current directory is somewhere else. I.e. the quickest way of navigating from the root directory to your Learning_unix directory is as follows:

    ubuntu@:~$ cd /
    ubuntu@:/$ cd ~/workspace/Learning_unix
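
As a small illustration of these shortcuts, the sketch below hops between two directories with cd - and prints what the shell turns ~ into. The directories first and second are hypothetical and live in a throwaway location created with mktemp -d.

```shell
# Two scratch directories to hop between (names are for illustration only).
demo=$(mktemp -d)
mkdir "$demo/first" "$demo/second"

cd "$demo/first"
cd "$demo/second"

# "cd -" returns to the previous directory. It also prints the path,
# which we discard here by sending it to /dev/null.
cd - > /dev/null
after_dash=$(pwd)
echo "after one cd -:  $after_dash"

# Using it again takes us back to where we just came from.
cd - > /dev/null
after_second_dash=$(pwd)
echo "after two cd -:  $after_second_dash"

# The shell replaces ~ with the path of your home directory before the
# command runs, so ~ can start any path.
echo ~
echo ~/workspace
```

Because ~ is expanded by the shell itself, it works with any command that takes a path, not just cd.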

12. Making The ls Command More Useful

The .. operator that we saw earlier can also be used with the ls command, e.g. you can list directories that are ‘above’ you:

    ubuntu@:~/workspace/Learning_unix$ cd ~/workspace/Learning_unix/Outer_directory/
    ubuntu@:~/workspace/Learning_unix/Outer_directory$ ls ../../
    command_line_course  Learning_unix  linux_bootcamp

Time to learn another useful command-line option. If you add the -l option to the ls command it will give you a longer output compared to the default:

    ubuntu@:~/workspace/Learning_unix$ ls -l /home
    total 4
    drwxr-xr-x 12 ubuntu ubuntu 4096 Nov 12 01:45 ubuntu

For each file or directory we now see more information (including file ownership and modification times). The ‘d’ at the start of each line indicates that these are directories. There are many, many different options for the ls command. Try out the following (against any directory of your choice) to see how the output changes.

    ls -l
    ls -R
    ls -l -t -r
    ls -lh

Note that the last example combines multiple options but only uses one dash. This is a very common way of specifying multiple command-line options.
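
If you want to convince yourself that separate and combined single-letter options are equivalent, a quick check along these lines should do it. This is a sketch run in a throwaway directory; the file names are just examples.

```shell
# Scratch directory with a couple of empty files to list.
demo=$(mktemp -d)
cd "$demo"
touch one_fish.txt two_fish.txt

# The same three options, written separately and then combined.
separate=$(ls -l -t -r)
combined=$(ls -ltr)

if [ "$separate" = "$combined" ]; then
    echo "ls -l -t -r and ls -ltr produce the same listing"
else
    echo "the listings differ"
fi
```

The order of the letters after the dash does not matter for these options either, so ls -rtl is another spelling of the same command.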


13. Removing Directories

We now have a few (empty) directories that we should remove. To do this, use the rmdir command. It will only remove empty directories, so it is quite safe to use. If you want to know more about this command (or any Unix command), then remember that you can just look at its man page.

    ubuntu@:~$ cd ~/workspace/Learning_unix/Outer_directory/
    ubuntu@:~/workspace/Learning_unix/Outer_directory$ rmdir Inner_directory/
    ubuntu@:~/workspace/Learning_unix/Outer_directory$ cd ..
    ubuntu@:~/workspace/Learning_unix$ rmdir Outer_directory/
    ubuntu@:~/workspace/Learning_unix$ ls
    ubuntu@:~/workspace/Learning_unix$ 

EXERCISE: Recreate the directories you just removed with a single command. Use the --help option if you need to! Now remove them again, but this time in a single command.

Note: you have to be outside a directory before you can remove it with rmdir.


14. Using Tab Completion

Saving keystrokes may not seem important, but the longer that you spend typing in a terminal window, the happier you will be if you can reduce the time you spend at the keyboard, especially as prolonged typing is not good for your body. So the best Unix tip to learn early on is that you can tab complete the names of files and programs on most Unix systems. Type enough letters to uniquely identify the name of a file, directory or program and press tab. Unix will do the rest. E.g. if you type ‘tou’ and then press tab, Unix should autocomplete the word to ‘touch’ (this is a command which we will learn more about in a minute). In this case, tab completion will occur because there are no other Unix commands that start with ‘tou’. If pressing tab doesn’t do anything, then you have not typed enough unique characters. In this case pressing tab twice will show you all possible completions. This trick can save you a LOT of typing!

Navigate to your workspace directory (~/workspace), and then use the cd command to change to the Learning_unix directory. Use tab completion to complete the directory name. If there are no other directories starting with ‘L’ in your workspace directory, then you should only need to type ‘cd’ + ‘L’ + ‘tab’.

Tab completion will make your life easier and make you more productive!

Another great time-saver is that Unix stores a list of all the commands that you have typed in each login session. You can access this list by using the history command or more simply by using the up and down arrows to access anything from your history. So if you type a long command but make a mistake, press the up arrow and then you can use the left and right arrows to move the cursor in order to make a change.


15. Creating Empty Files With The Touch Command

The following sections will deal with Unix commands that help us to work with files, i.e. copy files to/from places, move files, rename files, remove files, and most importantly, look at files. First, we need to have some files to play with. The Unix command touch will let us create a new, empty file. The touch command does other things too, but for now we just want a couple of files to work with.

    ubuntu@:~$ cd workspace/Learning_unix/
    ubuntu@:~/workspace/Learning_unix$ touch red_fish.txt
    ubuntu@:~/workspace/Learning_unix$ touch blue_fish.txt
    ubuntu@:~/workspace/Learning_unix$ ls
    blue_fish.txt  red_fish.txt

touch also accepts multiple files as arguments.

    ubuntu@:~/workspace/Learning_unix$ touch one_fish.txt two_fish.txt
    ubuntu@:~/workspace/Learning_unix$ ls
    blue_fish.txt  one_fish.txt  red_fish.txt  two_fish.txt

16. Moving Files

Now, let’s assume that we want to move these files to a new directory (‘colors’). We will do this using the Unix mv (move) command. Remember to use tab completion:

    ubuntu@:~/workspace/Learning_unix$ mkdir colors
    ubuntu@:~/workspace/Learning_unix$ mv red_fish.txt colors/
    ubuntu@:~/workspace/Learning_unix$ mv blue_fish.txt colors/
    ubuntu@:~/workspace/Learning_unix$ ls
    colors  one_fish.txt  two_fish.txt
    ubuntu@:~/workspace/Learning_unix$ ls colors/
    blue_fish.txt  red_fish.txt

EXERCISE: Make a new directory called ‘counts’, and move one_fish.txt and two_fish.txt into it. Can you move the two files with a single mv command?

For the mv command, we always have to specify a source file (or directory) that we want to move, and then specify a target location. If we had wanted to, we could have moved both color files in one go by typing any of the following commands:

    mv *.txt colors/
    mv *t colors/
    mv *fi* colors/

The asterisk * acts as a wild-card character, essentially meaning ‘match anything’. The first example matches every file whose name ends in ‘.txt’. The second example works because the files ‘red_fish.txt’ and ‘blue_fish.txt’ end with the letter ‘t’. However, note that these commands would also place ‘one_fish.txt’ and ‘two_fish.txt’ in the colors/ directory, because they both also end in ‘t’. Likewise, the third example works because those two files contain the letters ‘fi’ in their names, but ‘fi’ also exists in ‘one_fish.txt’ and ‘two_fish.txt’. Using wild-card characters can save you a lot of typing, but it also requires caution. Make sure you know which files the wild-card will catch if you’re using it to delete files permanently!

The ‘?’ character is also a wild-card but for only a single character.
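
One cautious habit is to preview what a wild-card matches before handing it to mv or rm. The sketch below does this with echo, a command covered later in this bootcamp that simply prints its arguments. It assumes a throwaway directory, and notes.md is a hypothetical extra file added to show a non-match.

```shell
# Scratch directory containing the example files plus one that should not match.
demo=$(mktemp -d)
cd "$demo"
touch red_fish.txt blue_fish.txt one_fish.txt two_fish.txt notes.md

# The shell expands each pattern into matching file names before echo runs,
# so echo shows exactly what mv or rm would receive.
star_txt=$(echo *.txt)
star_fi=$(echo *fi*)
three_chars=$(echo ???_fish.txt)

echo "*.txt        matches: $star_txt"
echo "*fi*         matches: $star_fi"
echo "???_fish.txt matches: $three_chars"
```

Here ???_fish.txt matches one_fish.txt, red_fish.txt and two_fish.txt but not blue_fish.txt, because each ? stands for exactly one character and ‘blue’ has four.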


17. Renaming Files

In the earlier example, the destination for the mv command was a directory name (colors). So we moved a file from its source location to a target location, but note that the target could have also been a (different) file name, rather than a directory. E.g. let’s make a new file and move it whilst renaming it at the same time:

    ubuntu@:~/workspace/Learning_unix$ touch rags
    ubuntu@:~/workspace/Learning_unix$ ls
    colors  counts  rags
    ubuntu@:~/workspace/Learning_unix$ mv rags counts/riches
    ubuntu@:~/workspace/Learning_unix$ ls counts/
    one_fish.txt  riches  two_fish.txt

In this example we create a new file (‘rags’) and move it to a new location and in the process change the name (to ‘riches’). So mv can rename a file as well as move it. The logical extension of this is using mv to rename a file without moving it (mv is the standard tool for this, since renaming a file is simply moving it to a new name in the same directory):

    ubuntu@:~/workspace/Learning_unix$ mv counts/riches counts/rags

18. Moving Directories

It is important to understand that as long as you have specified a ‘source’ and a ‘target’ location when you are moving a file, then it doesn’t matter what your current directory is. You can move or copy things within the same directory or between different directories regardless of whether you are in any of those directories. Moving directories is just like moving files:

    ubuntu@:~/workspace/Learning_unix$ mkdir fish
    ubuntu@:~/workspace/Learning_unix$ mv counts fish
    ubuntu@:~/workspace/Learning_unix$ ls -R .
    .:
    colors  fish

    ./colors:
    blue_fish.txt  red_fish.txt

    ./fish:
    counts

    ./fish/counts:
    one_fish.txt  rags  two_fish.txt

This step moves the counts directory inside the fish directory.

EXERCISE: Try creating a ‘net’ directory inside ‘Learning_unix’ and then cd to your home directory. Can you move fish inside net without using cd?


19. Removing Files

You’ve seen how to remove a directory with the rmdir command, but rmdir won’t remove directories if they contain any files. So how can we remove the files we have created (inside Learning_unix)? In order to do this, we will have to use the rm (remove) command.

Please read the next section VERY carefully. Misuse of the rm command can lead to needless death & destruction

Potentially, rm is a very dangerous command; if you delete something with rm, you will not get it back! It is possible to delete everything in your home directory (all directories and subdirectories) with rm, which is why it is such a dangerous command.

Let me repeat that last part again. It is possible to delete EVERY file you have ever created with the rm command. Are you scared yet? You should be. Luckily there is a way of making rm a little bit safer. We can use it with the -i command-line option which will ask for confirmation before deleting anything (remember to use tab-completion):

    ubuntu@:~/workspace/Learning_unix$ cd net/fish/counts
    ubuntu@:~/workspace/Learning_unix/net/fish/counts$ ls
    one_fish.txt  rags  two_fish.txt
    ubuntu@:~/workspace/Learning_unix/net/fish/counts$ rm -i one_fish.txt  rags  two_fish.txt
    rm: remove regular empty file 'one_fish.txt'? y
    rm: remove regular empty file 'rags'? y
    rm: remove regular empty file 'two_fish.txt'? y
    ubuntu@:~/workspace/Learning_unix/net/fish/counts$ ls

We could have simplified this step by using a wild-card (e.g. rm -i * would match all three files, whereas rm -i *.txt would leave ‘rags’ behind) or we could have made things more complex by removing each file with a separate rm command. Let’s finish cleaning up:

    ubuntu@:~/workspace/Learning_unix/net/fish/counts$ cd ~/workspace/Learning_unix/
    ubuntu@:~/workspace/Learning_unix$ rmdir -p net/fish/counts/
    ubuntu@:~/workspace/Learning_unix$ rm -ir colors/
    rm: descend into directory 'colors/'? y
    rm: remove regular empty file 'colors/red_fish.txt'? y
    rm: remove regular empty file 'colors/blue_fish.txt'? y
    rm: remove directory 'colors/'? y

20. Copying Files

Copying files with the cp (copy) command is very similar to moving them. Remember to always specify a source and a target location. Let’s create a new file and make a copy of it:

    ubuntu@:~/workspace/Learning_unix$ touch file1
    ubuntu@:~/workspace/Learning_unix$ cp file1 file2
    ubuntu@:~/workspace/Learning_unix$ ls
    file1  file2

What if we wanted to copy files from a different directory to our current directory? Let’s put a file in our home directory (specified by ~ remember?) and copy it to the current directory (Learning_unix):

    ubuntu@:~/workspace/Learning_unix$ touch ~/file3
    ubuntu@:~/workspace/Learning_unix$ ls ~
    R  bin  file3  tools  workspace
    ubuntu@:~/workspace/Learning_unix$ cp ~/file3 .
    ubuntu@:~/workspace/Learning_unix$ ls
    file1  file2  file3

This last step introduces another new concept. In Unix, the current directory can be represented by a . (dot) character. You will mostly use this only for copying files to the current directory that you are in. Compare the following:

    ls
    ls .
    ls ./

In this case, using the dot is somewhat pointless because ls will already list the contents of the current directory by default. Also note how the trailing slash is optional. You can use rm to remove the temporary files.


21. Copying Directories

The cp command also allows us (with the use of a command-line option) to copy entire directories. Use cp --help to see how the -R or -r options let you copy a directory recursively.
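
A minimal sketch of a recursive copy is shown below. The directory names (originals, nested, backup) are hypothetical and everything happens in a throwaway directory, so treat it as an illustration rather than the answer to the exercise that follows.

```shell
# Scratch directory with a small directory tree to copy.
demo=$(mktemp -d)
cd "$demo"
mkdir -p originals/nested
touch originals/file1 originals/nested/file2

# Without -r, cp will not copy a directory and reports an error instead.
if cp originals backup 2> /dev/null; then
    plain_cp="copied"
else
    plain_cp="refused"
fi
echo "cp without -r: $plain_cp"

# With -r, the directory and everything beneath it is copied.
cp -r originals backup

# The source is left untouched; the copy has the same structure.
ls -R backup
```

Unlike mv, the source directory still exists after cp -r, so you end up with two independent copies of every file.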

EXERCISE: Create a new directory called ‘filing_cabinet’ and move the files into it with mv. Make a second copy of this directory called ‘trash_can’. Were all of the files from ‘filing_cabinet’ also in ‘trash_can’? Remove both directories and their contents with a single command.


22. Viewing Files With Less

So far we have covered listing the contents of directories and moving/copying/deleting either files and/or directories. Now we will quickly cover how you can look at files. The less command lets you view (but not edit) text files. We will use the echo command to put some text in a file and then view it:

    ubuntu@:~/workspace/Learning_unix$ echo "Call me Ishmael."
    Call me Ishmael.
    ubuntu@:~/workspace/Learning_unix$ echo "Call me Ishmael." > opening_lines.txt
    ubuntu@:~/workspace/Learning_unix$ ls
    opening_lines.txt
    ubuntu@:~/workspace/Learning_unix$ less opening_lines.txt

On its own, echo isn’t a very exciting Unix command. It just echoes text back to the screen. But we can redirect that text into an output file by using the > symbol. This allows for something called file redirection.

Be careful when using file redirection (>): it will overwrite any existing file of the same name!
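
The sketch below shows the overwrite happening, and one possible safeguard: the shell's noclobber setting, which is not part of the original bootcamp material but is a standard shell option. The file name notes.txt is hypothetical and the example runs in a throwaway directory.

```shell
demo=$(mktemp -d)
cd "$demo"

echo "first draft" > notes.txt
echo "second draft" > notes.txt      # replaces the earlier contents without warning
after_overwrite=$(cat notes.txt)
echo "after two > redirections the file says: $after_overwrite"

# With noclobber switched on, the shell refuses to overwrite an existing file with >.
set -o noclobber
if { echo "third draft" > notes.txt; } 2> /dev/null; then
    attempt="overwritten"
else
    attempt="refused"
fi
set +o noclobber                     # switch the protection off again

final=$(cat notes.txt)
echo "with noclobber the overwrite was $attempt; the file still says: $final"
```

The setting only lasts for the current shell session unless you add it to a startup file, and it protects against > redirection only, not against commands such as rm or mv.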

When you are using less, you can bring up a page of help commands by pressing h, scroll forward a page by pressing space, or go forward or backwards one line at a time by pressing j or k. To exit less, press q (for quit). The less program also does about a million other useful things (including text searching).


23. Viewing Files With Cat

Let’s add another line to the file:

    ubuntu@:~/workspace/Learning_unix$ echo "The primroses were over." >> opening_lines.txt
    ubuntu@:~/workspace/Learning_unix$ cat opening_lines.txt
    Call me Ishmael.
    The primroses were over.

Notice that we use >> and not just >. This operator will append to a file. If we only used >, we would end up overwriting the file. The cat command displays the contents of the file (or files) and then returns you to the command line. Unlike less you have no control on how you view that text (or what you do with it). It is a very simple, but sometimes useful, command. You can use cat to quickly combine multiple files or, if you wanted to, make a copy of an existing file:

    cat opening_lines.txt > file_copy.txt
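
Combining files works the same way: give cat several file names and redirect the result. This is a small sketch with hypothetical file names (moby.txt, watership.txt, combined.txt) in a throwaway directory.

```shell
demo=$(mktemp -d)
cd "$demo"

# Two one-line files to start from.
echo "Call me Ishmael." > moby.txt
echo "The primroses were over." > watership.txt

# cat prints the files in the order given; > collects the result in a new file.
cat moby.txt watership.txt > combined.txt

# >> adds to the end of an existing file instead of replacing it.
echo "It was a pleasure to burn." >> combined.txt

cat combined.txt
```

Avoid using one of the input files as the output of the same command (for example cat a.txt b.txt > a.txt), because the shell empties the output file before cat gets a chance to read it.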

24. Counting Characters In A File

    ubuntu@:~/workspace/Learning_unix$ ls
    opening_lines.txt

    ubuntu@:~/workspace/Learning_unix$ ls -l
    total 4
    -rw-rw-r-- 1 ubuntu ubuntu 42 Jun 15 04:13 opening_lines.txt

    ubuntu@:~/workspace/Learning_unix$ wc opening_lines.txt
    2  7 42 opening_lines.txt

    ubuntu@:~/workspace/Learning_unix$ wc -l opening_lines.txt
    2 opening_lines.txt

The ls -l option shows us a long listing, which includes the size of the file in bytes (in this case ‘42’). Another way of finding this out is by using Unix’s wc command (word count). By default this tells you how many lines, words, and characters are in a specified file (or files), but you can use command-line options to give you just one of those statistics (in this case we count lines with wc -l).
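
Each of the three statistics has its own option. The sketch below recreates the two-line file in a throwaway directory and asks for each count separately; reading the file with < (an addition to what is shown above) makes wc print only the number, without the file name.

```shell
demo=$(mktemp -d)
cd "$demo"

# Recreate the two-line example file.
echo "Call me Ishmael." > opening_lines.txt
echo "The primroses were over." >> opening_lines.txt

# -l counts lines, -w counts words, -c counts bytes.
lines=$(wc -l < opening_lines.txt)
words=$(wc -w < opening_lines.txt)
bytes=$(wc -c < opening_lines.txt)

echo "lines: $lines"
echo "words: $words"
echo "bytes: $bytes"
```

For this file the counts are 2 lines, 7 words and 42 bytes, matching the default wc output shown above. Some systems pad the numbers with leading spaces.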


25. Editing Small Text Files With Nano

Nano is a lightweight editor installed on most Unix systems. There are many more powerful editors (such as ‘emacs’ and ‘vi’), but these have steep learning curves. Nano is very simple. You can edit (or create) files by typing:

    nano opening_lines.txt

You should see the contents of the file appear inside the nano editor in your terminal (screenshot: unixterm2).

The bottom of the nano window shows you a list of simple commands which are all accessible by typing ‘Control’ plus a letter. E.g. Control + X exits the program.


26. The $PATH Environment Variable

One other use of the echo command is for displaying the contents of something known as environment variables. These contain user-specific or system-wide values that either reflect simple pieces of information (your username), or lists of useful locations on the file system. Some examples:

    ubuntu@:~/workspace/Learning_unix$ echo $USER
    ubuntu
    ubuntu@:~/workspace/Learning_unix$ echo $HOME
    /home/ubuntu
    ubuntu@:~/workspace/Learning_unix$ echo $PATH
    /home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/tools/perl5/bin:/home/ubuntu/tools/bin:/home/ubuntu/workspace/data/anaconda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/ubuntu/tools/bowtie-1.1.2:/home/ubuntu/tools/bowtie2-2.2.9:/home/ubuntu/tools/trinityrnaseq-2.2.0:/home/ubuntu/tools/hisat2-2.0.4:/home/ubuntu/tools/sambamba_v0.6.4:/home/ubuntu/tools/stringtie-1.3.0.Linux_x86_64:/home/ubuntu/tools/gffcompare-0.9.8.Linux_x86_64:/home/ubuntu/tools/RSEM-1.2.31:/home/ubuntu/tools/cufflinks-2.2.1.Linux_x86_64:/home/ubuntu/tools/bedtools2/bin:/home/ubuntu/tools/MUMmer3.23:/home/ubuntu/tools/allpathslg-52488/bin:/home/ubuntu/tools/bin/Sniffles/bin/sniffles-core-1.0.0:/home/ubuntu/tools/ensembl-tools-release-86/scripts/variant_effect_predictor:/home/ubuntu/tools/VAAST_2.2.0/bin:/home/ubuntu/tools/speedseq/bin:/home/ubuntu/tools/hall_misc

The last one shows the content of the $PATH environment variable, which holds a colon-separated list of directories that are expected to contain programs that you can run. This includes all of the Unix commands that you have seen so far: each one is an executable file that lives in one of these directories (e.g. ls is just a special type of file in the /bin directory).

Knowing how to change your $PATH to include custom directories can be necessary sometimes (e.g. if you install some new bioinformatics software in a non-standard location).
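
As a rough sketch of what such a change looks like, the example below creates a tiny stand-in program in a throwaway directory and then adds that directory to $PATH. The names my_tools and hello_tool are hypothetical, and printf and chmod (used to create the file and make it executable) are not covered in this bootcamp.

```shell
# Throwaway location standing in for a non-standard install directory.
demo=$(mktemp -d)
mkdir -p "$demo/my_tools/bin"

# Create a tiny stand-in program and mark it as executable.
printf '#!/bin/sh\necho "hello from my_tools"\n' > "$demo/my_tools/bin/hello_tool"
chmod +x "$demo/my_tools/bin/hello_tool"

# Before the change, the shell cannot find the program by name.
if command -v hello_tool > /dev/null 2>&1; then
    before="found"
else
    before="not found"
fi
echo "before: $before"

# Put the new directory at the front of $PATH (for this shell session only).
PATH="$demo/my_tools/bin:$PATH"
export PATH

# Now the shell can locate the program and run it by name.
command -v hello_tool
hello_tool
```

A change made this way disappears when you close the terminal. To make it permanent you would typically add the PATH= line to a shell startup file such as ~/.bashrc.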


27. Matching Lines In Files With Grep

Use nano to add the following lines to opening_lines.txt:

    Now is the winter of our discontent.
    All children, except one, grow up.
    The Galactic Empire was dying.
    In a hole in the ground there lived a hobbit.
    It was a pleasure to burn.
    It was a bright, cold day in April, and the clocks were striking thirteen.
    It was love at first sight.
    I am an invisible man.
    It was the day my grandmother exploded.
    When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
    Marley was dead, to begin with.

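You will often want to search files to find lines that match a certain pattern. The Unix command grep does this (and much more). The following examples show how you can use grep and its command-line options to show lines that match a pattern, show lines that do not match (-v), ignore case (-i), match whole words only (-w), and use pattern characters such as . and [] to allow for alternatives: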

 grep was opening_lines.txt
 The Galactic Empire was dying.
 It was a pleasure to burn.
 It was a bright, cold day in April, and the clocks were striking thirteen.
 It was love at first sight.
 It was the day my grandmother exploded.
 When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
 Marley was dead, to begin with.

 grep -v was opening_lines.txt
 Call me Ishmael.
 The primroses were over.
 Now is the winter of our discontent.
 All children, except one, grow up.
 In a hole in the ground there lived a hobbit.
 I am an invisible man.

 grep all opening_lines.txt
 Call me Ishmael.

 grep -i all opening_lines.txt
 Call me Ishmael.
 All children, except one, grow up.

 grep in opening_lines.txt
 Now is the winter of our discontent.
 The Galactic Empire was dying.
 In a hole in the ground there lived a hobbit.
 It was a bright, cold day in April, and the clocks were striking thirteen.
 I am an invisible man.
 Marley was dead, to begin with.

 grep -w in opening_lines.txt
 In a hole in the ground there lived a hobbit.
 It was a bright, cold day in April, and the clocks were striking thirteen.

 grep -w o.. opening_lines.txt
 Now is the winter of our discontent.
 All children, except one, grow up.

 grep [aeiou]t opening_lines.txt
 In a hole in the ground there lived a hobbit.
 It was love at first sight.
 It was the day my grandmother exploded.
 When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
 Marley was dead, to begin with.

 grep -w -i [aeiou]t opening_lines.txt
 It was a pleasure to burn.
 It was a bright, cold day in April, and the clocks were striking thirteen.
 It was love at first sight.
 It was the day my grandmother exploded.
 When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
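
The patterns above are typed without quotes, which works here but leaves characters such as [ ] open to interpretation by the shell before grep ever sees them. A cautious habit is to wrap patterns in single quotes. The sketch below rebuilds opening_lines.txt in a temporary directory (the mktemp location is an assumption made so that the example is self-contained) and also uses grep's -c option to count matching lines instead of printing them.

```shell
# Work in a throwaway directory so nothing in your workspace is overwritten.
cd "$(mktemp -d)"

# Recreate the practice file used in this section.
cat > opening_lines.txt <<'EOF'
Call me Ishmael.
The primroses were over.
Now is the winter of our discontent.
All children, except one, grow up.
The Galactic Empire was dying.
In a hole in the ground there lived a hobbit.
It was a pleasure to burn.
It was a bright, cold day in April, and the clocks were striking thirteen.
It was love at first sight.
I am an invisible man.
It was the day my grandmother exploded.
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
Marley was dead, to begin with.
EOF

# Quoting stops the shell from treating [aeiou]t as a filename wildcard.
grep '[aeiou]t' opening_lines.txt

# -c reports the number of matching lines rather than the lines themselves.
vowel_t_count=$(grep -c '[aeiou]t' opening_lines.txt)
was_count=$(grep -c 'was' opening_lines.txt)
echo "$vowel_t_count lines match [aeiou]t; $was_count lines match was"
```

With the 13-line file shown here, the two counts are 5 and 7, which agrees with the number of lines printed by the corresponding examples above.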

28. Combining Unix Commands With Pipes

One of the most powerful features of Unix is that you can send the output from one command or program to any other command (as long as the second command accepts input of some sort). We do this by using what is known as a pipe, which is written with the ‘|’ character (its position varies from keyboard to keyboard). Think of the pipe as simply connecting two Unix programs. Here’s an example which introduces some new Unix commands:

 ubuntu@:~/workspace/Learning_unix$ grep was opening_lines.txt | wc -c
 316

 ubuntu@:~/workspace/Learning_unix$ grep was opening_lines.txt | sort | head -n 3 | wc -c
 130

The first use of grep searches the specified file for lines matching ‘was’ and sends the lines that match through a pipe to the wc program. We use the -c option to count only the characters in the matching lines (316). Strictly speaking, -c counts bytes, which is the same number for plain ASCII text like this.

The second example first sends the output of grep to the Unix sort command, which by default sorts its input lines alphanumerically. The sorted output is sent to the head command, which by default shows the first 10 lines of its input. We use the -n option of this command to show only 3 lines. These 3 lines are then sent to the wc command as before.

Whenever making a long pipe, test each step as you build it!
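
One way to follow that advice is to run the pipeline several times, adding one command each time and checking the output before continuing. The sketch below does this for the second example. It recreates opening_lines.txt in a temporary directory (an assumption made so that the example is self-contained); in your own session you would simply run the four steps in Learning_unix.

```shell
# Work in a throwaway directory so nothing in your workspace is overwritten.
cd "$(mktemp -d)"

# Recreate the practice file used in this section.
cat > opening_lines.txt <<'EOF'
Call me Ishmael.
The primroses were over.
Now is the winter of our discontent.
All children, except one, grow up.
The Galactic Empire was dying.
In a hole in the ground there lived a hobbit.
It was a pleasure to burn.
It was a bright, cold day in April, and the clocks were striking thirteen.
It was love at first sight.
I am an invisible man.
It was the day my grandmother exploded.
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
Marley was dead, to begin with.
EOF

# Step 1: check that grep alone selects the lines you expect.
grep was opening_lines.txt

# Step 2: add sort and confirm the order looks right.
grep was opening_lines.txt | sort

# Step 3: add head and confirm that only three lines remain.
grep was opening_lines.txt | sort | head -n 3

# Step 4: only now add the final count, saving it in a variable.
char_count=$(grep was opening_lines.txt | sort | head -n 3 | wc -c)
echo "$char_count"
```

If an intermediate step shows something unexpected, you know the problem lies in the command you added last. The final number should be 130, as in the example above.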


Shell Scripting

Learnshell Tutorial


Advanced File Editing With Vim

Interactive Tutorial

Cheat Sheet

Another Cheat Sheet


Other Useful Commands

scp for transferring files over SSH

wget for downloading files from URL

watch for monitoring

xargs for complex inputs

ps for process status

top for running processes

kill for stopping processes

Miscellaneous Unix Power Commands

The following examples introduce some other Unix commands, and show how they could be used to work on a fictional file called file.txt. Remember, you can always learn more about these Unix commands from their respective man pages with the man command. These are not all real-world cases, but rather show the diversity of Unix command-line tools:

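 # Show the first 10 of the last 20 lines (the penultimate 10 lines of the file)
 tail -n 20 file.txt | head
 # Show lines that begin with ATG (^ anchors the pattern to the start of a line)
 grep "^ATG" file.txt
 # Extract the 3rd tab-delimited column and show each distinct value once
 cut -f 3 file.txt | sort -u
 # Count the lines that contain 'bat' or 'cat' (-c counts matching lines)
 grep -c '[bc]at' file.txt
 # Convert lower-case letters to upper-case
 cat file.txt | tr 'a-z' 'A-Z'
 # Replace the first 'Chr1' on each line with 'Chromosome 1' and write the result to file2.txt
 cat file.txt | sed 's/Chr1/Chromosome 1/' > file2.txt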
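
To see a few of these commands produce concrete output, the sketch below builds a small tab-delimited file.txt in a temporary directory. Its contents (gene names, scores, and chromosome labels) are invented purely for illustration and do not come from any real dataset.

```shell
# Work in a throwaway directory so nothing in your workspace is overwritten.
cd "$(mktemp -d)"

# Three tab-separated columns: name, score, chromosome (made-up values).
printf 'geneA\t10\tChr1\n' >  file.txt
printf 'geneB\t25\tChr2\n' >> file.txt
printf 'geneC\t7\tChr1\n'  >> file.txt

# Distinct values found in column 3.
unique_chroms=$(cut -f 3 file.txt | sort -u)
echo "$unique_chroms"

# Upper-case version of the whole file (reading with < instead of cat).
tr 'a-z' 'A-Z' < file.txt

# Replace Chr1 with "Chromosome 1" and save the result to a new file.
sed 's/Chr1/Chromosome 1/' file.txt > file2.txt
cat file2.txt
```

Reading the file with < or passing its name to sed is equivalent to piping it in with cat, and simply avoids starting an extra program.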