Introduction to bioinformatics for RNA sequence analysis
Author: Katie Campbell, UCLA
This session will cover the basics of R programming, from setting up your environment to basic data analysis. In this tutorial, you will set up your R environment, work with core data structures, and perform basic data analysis and visualization.
R is a powerful programming language and software environment used for statistical computing, data analysis, and graphical representation of data. It is widely used among statisticians, data analysts, and researchers for data mining and statistical software development.
Today’s session will use two input files for practice. Start by creating a directory for the R session on your instance and download the files with the following commands:
mkdir -p ~/workspace/intro_to_R
cd ~/workspace/intro_to_R
curl https://raw.githubusercontent.com/ParkerICI/MORRISON-1-public/refs/heads/main/RNASeq/RNA-CancerCell-MORRISON1-metadata.tsv > intro_r_metadata.tsv
curl https://raw.githubusercontent.com/ParkerICI/MORRISON-1-public/refs/heads/main/RNASeq/data/RNA-CancerCell-MORRISON1-combat_batch_corrected-logcpm-all_samples.tsv.zip > intro_r_dataset.tsv.zip
The downloaded intro_r_metadata.tsv file contains annotation of a set of RNAseq samples from patients with melanoma treated with immunotherapies. The intro_r_dataset.tsv.zip file contains batch effect-corrected gene expression values for all of the samples in this dataset.
We will launch R on our instance and program directly within the terminal. This is equivalent to writing code in the “Console” panel of RStudio. Launch R using the command:
R
The terminal should print out the following information:
R version 4.5.1 (2025-06-13) -- "Great Square Root"
Copyright (C) 2025 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
The > character indicates that you are now within the R environment.
After the course, you can download R from CRAN onto your personal computer. Once R is installed, you can run it from your terminal or use an Integrated Development Environment (IDE) like RStudio, which provides a user interface that makes it easier to code, debug, and analyze data.
Some of the commands in this tutorial have a comment character (#) in front of them. These are commands that you don’t have to run in today’s session, but can come back to on your own time. Commenting your code in scripts also improves their documentation and clarity.
The working directory is the folder where R reads and saves files by default.
You can check your working directory by:
getwd()
If you are interested in moving to a different directory (for example, where your files are being stored), you can change the working directory:
# setwd("/new/path/")
You can replace "/new/path" with your desired path (e.g., "~/workspace/intro_to_R"). Note that directory paths differ subtly between Mac and Windows.
Packages extend R’s functionality by providing additional functions and datasets. First, they need to be installed before they can be loaded. Many of the packages used in this course will already be available in your instance.
# install.packages("tidyverse")
Once packages are installed, they can be loaded using library(package_name). This is usually the first thing that you run in your script or console to load all of the functions you will be using. Tidyverse is a collection of R packages designed for data science, including packages like ggplot2 for data visualization and dplyr for data manipulation.
library(tidyverse)
As you learn commands, you can include ? before the command in order to see a description of what the command does, the parameters and inputs, and the structure of the output. For example, you can see the documentation for list.files() with the following command:
?list.files()
# Type q to exit the help documentation
This shows that list.files() will list all of the files in a directory or folder as a character vector. You can change the directory to search in with the parameter path or specify a type of file with pattern. As you perform data analysis in R over the course of this workshop, this is a helpful way to explore what commands are doing and how you can use them.
Try this: What files are in your current working directory?
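With no arguments, list.files() defaults to the current working directory:
list.files()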
You can also check the files in the parent directory, subdirectories, or based upon file path patterns:
list.files(path = "..")
list.files(pattern = ".tsv") # pattern is interpreted as a regular expression
Note: You can interrupt/kill a running R command using Ctrl-C.
In R, <- or = can be used to assign the value on the right to the variable name on the left. In Python, only = can be used for assignment.
age <- 29
first_name <- 'Katie'
print() can be used to show a value, but you can also just type the variable name.
29
print(29)
print(age)
age
print(first_name)
first_name
The value 29 is shown without quotes, while “Katie” is shown with quotes: 29 is a number (stored as a double by default; append L, as in 29L, for an integer), and “Katie” is a character string. Another variable type is the logical (TRUE or FALSE). Variable types can be determined using typeof().
typeof(age)
typeof(29)
typeof(first_name)
typeof(TRUE)
What if we want to use print() to concatenate multiple variables or character strings into a statement?
print(first_name, "is", age, "years old")
This doesn’t work. Instead, we can use the paste() function to build the statement:
print(paste(first_name, "is", age, "years old"))
paste() will concatenate the values, separating them by spaces.
Try this: What happens if you use the paste0() function instead?
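A quick sketch of the difference: paste0() concatenates with no separator, so any spaces must be included manually:
print(paste0(first_name, " is ", age, " years old"))
paste0("sample", 1:3) # produces "sample1" "sample2" "sample3"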
We can only use variables that have been created:
print(last_name)
This results in the error: Error: object ‘last_name’ not found
The environment is where R stores variables and functions you’ve created. In RStudio, all of your stored variables are visibly listed in the upper right panel. You can also list the objects of your environment with the command ls():
ls()
As we store more variables, this list grows:
last_name <- "Campbell"
ls()
We can also remove variables from our environment:
rm(age)
ls()
Or we can remove all objects and clear the environment entirely:
rm(list = ls())
Values can be combined into one- and two-dimensional data structures in R, including vectors, lists, matrices, and data frames.
- Vectors are created with the c() function, and all of the elements must be of the same type.
- Lists are created with the list() function and can contain elements of multiple types.
- Matrices are created with the matrix() function and are two-dimensional arrays that contain one type.
- Data frames are created with the data.frame() function, are two-dimensional, and each column contains a vector of one type.
You can create a numeric vector of only numbers. We can use the function str() to further explore the dimensions and types in the object.
numbers <- c(1, 2, 3, 4, 5)
typeof(numbers)
str(numbers)
A numeric vector can also be generated by a sequence of numbers:
sequence <- 1:10
sequence
skipping_sequence <- seq(from = 1, to = 10, by = 2)
skipping_sequence
repeated_sequence <- rep(x = 1, times = 10)
repeated_sequence
A character vector only contains character strings:
names <- c("Alice", "Bob", "Charlie", "Daniel", "Edwin")
typeof(names)
str(names)
names_sequence <- rep(x = "Your Name", 10)
names_sequence
Note that the structure command str() shows not only the type of vector, but also the length of it.
If you try to create a vector of different types, R will automatically coerce the variable into the most general type. For example, if you try to create a vector of numbers and letters, it will coerce the entire vector into characters:
values <- c(1, 2, 3, 4, "Alice", "Bob")
values
str(values)
more_values <- c(numbers, names)
str(more_values)
Notice that all of the values print out with quotation marks, and str() shows that it is a character (chr) vector.
Lists are unique from vectors. Unlike vectors, lists are designed to store different types of elements (or even different data structures) within a single object. A list is created using the list() function and can contain a numeric vector, a character string, a data frame, another list, or any object type.
my_list <- list(data.frame(name = names, grade = sample(LETTERS[1:3], 10, replace = TRUE), subject = c(rep("Math", 5), rep("Science", 5))),
1:10,
"Katie")
my_list
str(my_list)
Matrices are generated using the matrix() function. You can create a matrix by feeding the command a specific set of values. Look at the parameters for matrix():
?matrix()
The default of matrix() is to create a matrix with one column. To organize these values into a specific size, you need to fill in the parameters nrow and ncol:
matrix(13:24) # Creates a 12x1 matrix of these values
# These are all the same
matrix(13:24, nrow = 3)
matrix(13:24, ncol = 4)
matrix(13:24, nrow = 3, ncol = 4)
# These fill in the values by row, rather than in the order of the columns
matrix(13:24, nrow = 3, byrow = TRUE)
matrix(13:24, ncol = 4, byrow = TRUE)
matrix(13:24, nrow = 3, ncol = 4, byrow = TRUE)
Data frames are used for storing two-dimensional data structures, where each row represents a series of observations. Each column represents a variable, which is a vector of one type.
df <- data.frame(
name = names,
age = numbers,
flower = c(rep("rose", 3), rep("petunia", 2)),
favorite_color = c('red','blue','green','yellow','macaroni and cheese'),
has_dog = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)
You can view your data frame using head():
head(df)
head(df, 3)
Vectors can be extracted from the data frame using data_frame$column_name:
df$name
df$age
Some useful commands to understand your data structures, in general, include summarizing the rows, columns, and names of these dimensions:
str(df)
nrow(df) # Number of rows
ncol(df) # Number of columns
dim(df) # Dimensions of the data frame (row, column)
colnames(df)
names(df)
Each column of a data frame is a vector with its own type, and summary() summarizes each column according to that type:
summary(df)
You can also add, modify, and remove columns:
df$new_column <- "my new column" # Add a column (the single value is recycled)
df
df$new_column <- NULL # Remove the column
df
df$lower_name <- tolower(df$name) # Create a column from an existing one
df
df$has_dog <- ifelse(df$has_dog, "yes", "no") # Convert logicals to "yes"/"no"
df
df$has_dog <- ifelse(df$has_dog == "yes", TRUE, FALSE) # ...and back to logicals
df
In tidyverse, you can use the mutate() function to do the same thing. Note that mutate() returns a modified copy, so you have to overwrite the variable to keep the changes:
library(tidyverse)
mutate(df, new_column = "my new column")
df
df <- mutate(df, new_column = "my new column")
df
# You can also make several changes in a single mutate() call
mutate(df, new_column = "change it again", new_name = toupper(name))
R uses one-based indexing, so the index of the first element is 1, not 0. This is different from the command line and Python, which are zero-based (the first element is at index 0).
For one-dimensional objects, like vectors or lists, we just use brackets to extract the specific index:
numbers[1]
df$name[3]
This is also how we overwrite these values:
df
df$name[3] <- "Frederick"
df
colnames(df) # outputs a vector of the column names
colnames(df)[2] <- "age_in_years"
colnames(df)
List elements may be called using double brackets or by naming them, which can be helpful for organizing their structures:
complicated_list <- list("first" = c(1,2,3,4), "second" = df, "third" = 1087.29)
str(complicated_list)
complicated_list
complicated_list[[1]]
complicated_list[['first']]
complicated_list$first
complicated_list[['first']][3]
For two-dimensional objects, you have to specify both the rows and columns that you’re interested in viewing, using [row, column]:
# This can be done with indices
my_matrix <- matrix(13:24, nrow = 3)
my_matrix
my_matrix[1,4]
# Or if the rows/columns are named
colnames(my_matrix) <- LETTERS[1:4]
row.names(my_matrix) <- paste0("row", 1:3)
my_matrix
my_matrix['row1','C']
# To get all of the rows or all of the columns you leave the field blank
# BUT you still need the comma
my_matrix['row3',]
# Without the comma, the matrix is treated as a plain vector, which has no
# element names here, so both of these return NA
my_matrix['A']
my_matrix['row2']
The stringr package (one of the packages in tidyverse) simplifies common string-manipulation tasks with easy-to-use functions. These operations come up frequently when working with biological sequences.
Finding specific sequences or motifs within biological sequences is a common task.
sequence <- "ATGCGTACGTTGACA"
motif <- "CGT"
str_locate(sequence, motif)
Modifying sequences by replacing specific nucleotides or amino acids.
dna_sequence <- "ATGCGTACGTTGACT"
rna_sequence <- str_replace_all(dna_sequence, "T", "U")
print(rna_sequence)
Extracting parts of sequences, such as cutting out genes or regions of interest.
extracted_sequence <- str_sub(sequence, 3, 8)
print(extracted_sequence)
Determining the length of sequences.
sequence_length <- str_length(sequence)
print(sequence_length)
Converting uppercase to lowercase, or vice versa.
sequence_upper <- str_to_upper(sequence)
print(sequence_upper)
Splitting sequences into arrays, useful for reading fasta files or analyzing codons.
codons <- str_sub(sequence, seq(1, str_length(sequence), by = 3), seq(3, str_length(sequence), by = 3))
print(codons)
Try this: What if our sequence length wasn’t a multiple of three?
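One way to explore this is sketched below, using a hypothetical 16-base sequence; clamping the end positions with pmin() keeps the start and end vectors the same length:
odd_sequence <- "ATGCGTACGTTGACTA" # 16 bases, not a multiple of three
starts <- seq(1, str_length(odd_sequence), by = 3)
ends <- pmin(starts + 2, str_length(odd_sequence)) # clamp the final end position
str_sub(odd_sequence, starts, ends) # the last "codon" is the incomplete "A"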
Counting occurrences of specific nucleotides or amino acids.
guanine_count <- str_count(sequence, "G")
print(guanine_count)
Data analysis will often require filtering data using comparison operators, which return a vector of TRUE and FALSE values:
- == checks whether two values are equivalent. A common error is to use a single =, which is used to set variables in functions.
- != checks whether two values are not equivalent.
- >, <, >=, and <= check for greater/less than (or equal to) when comparing numbers.
You can use these operators to subset vectors:
numbers<3
which(numbers<3)
numbers[numbers<3]
numbers[which(numbers<3)]
Conditional statements can be combined using logical operators:
- & requires that both statements are true
- | requires that at least one of the statements is true
- ! negates a statement (requires that the opposite is true)
numbers>1 & numbers<5
which(numbers>1 | numbers<5)
numbers[numbers>1 & numbers<5]
numbers[which(numbers>1 & numbers<5)]
When we want to use these operations to filter data frames, we can specify the rows from the data frame by filtering the vectors stored in the columns. Note that we need to include the comma to indicate that we want all of the rows of this subsetted data frame:
df[df$age_in_years < 30, ]
Alternatively, we can also use the filter() command from the dplyr package:
filter(df, age_in_years < 30)
Instead of typing out all of the data elements, you can import data from various file formats. Some file types are easily read in by base R. Note that there are packages that may be helpful that are specific to certain input file types.
It is important to know the format of the file you are trying to read. The extension tsv indicates that the values are tab-separated.
metadata <- read.table("intro_r_metadata.tsv")
What is wrong with this data when you evaluate it?
head(metadata)
You’ll notice that the column names are in the first row of data, and all of the columns are labeled as “V” and the number corresponding to the column. It helps to be specific about how R should read in the data:
metadata <- read.table("intro_r_metadata.tsv", sep = "\t", header = TRUE)
head(metadata)
str(metadata)
The readxl package can be used to read in Excel files:
# install.packages("readxl")
# library(readxl)
# excel_data <- read_excel("path/to/excel.xlsx", sheet = "Sheet1")
You can get a glimpse of the data by only looking at the first few lines:
head(metadata)
You can quickly perform summary statistics on a dataset or a vector:
summary(metadata)
summary(metadata$subject.age)
Try this:
Currently, all of the character strings have been read in as characters. This is helpful, but oftentimes our data is more structured than that. For example, the bor column in metadata encodes clinical outcomes, and our brains think of these as an ordinal variable, not just a categorical one. That is, we think of the outcomes in order from best to worst: CR > PR > SD > PD (Complete response is better than partial response, which is better than stable disease, then progressive disease).
A factor encodes this information using levels. When we read in data, strings are read in as character vectors, but we can coerce them to become factors for our data summary:
summary(metadata$bor)
unique(metadata$bor)
metadata$bor <- factor(metadata$bor)
summary(metadata$bor)
metadata$bor <- factor(metadata$bor, levels = c("CR","PR","SD","PD"))
summary(metadata$bor)
Try this:
Like numbers, factors have a logical order to them. So when we enforce the data to have this particular order, we can also arrange the data in that order.
numbers
order(numbers)
order(-numbers)
arrange(metadata, bor)
arrange(metadata, subject.age)
arrange(metadata, -subject.age)
arrange(metadata, -bor) # GIVES A WARNING
arrange(metadata, desc(bor)) # Better for factors
Long and wide data formats are two common ways of structuring data, each with its own advantages and use cases.
In the long format, also known as the “tidy” format, each observation is represented by a single row in the dataset. This format is characterized by having:
- One column that identifies the variable (the key)
- One column that contains the values
- Additional columns that identify each observation (e.g., subject and timepoint)
Advantages:
- A flexible, consistent structure that works well with tidyverse functions like group_by(), filter(), and ggplot2
- Easy to add new variables or observations without restructuring the data
In the wide format, each observation is represented by a single row, but with multiple columns corresponding to different variables. This format is characterized by:
- One row per subject or observational unit
- A separate column for each variable or timepoint
Advantages:
- Compact and easy for humans to read
- Convenient for some statistical functions and spreadsheet-style review
The choice between long and wide formats depends on factors such as the nature of the data, the analysis tasks, and personal preference. Long format is often preferred for its flexibility and compatibility with modern data analysis tools, while wide format may be suitable for simpler datasets or specific analysis requirements.
library(tidyr)
# Example long format data
long_data <- data.frame(
Subject = c("A", "A", "B", "B"),
Time = c(1, 2, 1, 2),
Measurement = c(10, 15, 12, 18)
)
# Convert long format data to wide format
wide_data <- spread(long_data, key = Time, value = Measurement)
# View the wide format data
print(wide_data)
library(tidyr)
# Example wide format data
wide_data <- data.frame(
Subject = c("A", "B"),
Time1 = c(10, 12),
Time2 = c(15, 18)
)
# Convert wide format data to long format
long_data <- gather(wide_data, key = Time, value = Measurement, -Subject)
# View the long format data
print(long_data)
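Note that spread() and gather() still work but have been superseded in tidyr; the modern equivalents are pivot_wider() and pivot_longer(). A sketch of the same conversions:
# Long to wide with pivot_wider()
wide_data2 <- pivot_wider(long_data, names_from = Time, values_from = Measurement)
# Wide to long with pivot_longer()
long_data2 <- pivot_longer(wide_data, cols = -Subject, names_to = "Time", values_to = "Measurement")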
Let’s work with a real dataset: intro_r_dataset.tsv.zip. This file is compressed, so we can’t read it in with the default read.table() function. This is a circumstance where I use the data.table package and its fread() function, which automatically recognizes the compression format:
# read.table("intro_r_dataset.tsv.zip") # DOESN'T WORK
# install.packages('data.table')
data <- data.table::fread("intro_r_dataset.tsv.zip")
dim(data)
colnames(data)
str(data)
Try this: Create a long data frame, where the key is called “sample.id” and the value column contains the gene expression (“cpm”). Call this new dataset “data_long”.
Merging allows you to combine data from different sources, which is a frequent task in biological data analysis. Joins are operations that combine multiple datasets based on shared variables or keys. In tidyverse, these operations are typically performed using functions from the dplyr package. The main options are:
- Inner join (inner_join()): combines rows from two datasets where there is a match based on a common key, retaining only the rows with matching keys from both datasets.
- Left join (left_join()): combines all rows from the first (left) dataset with matching rows from the second (right) dataset based on a common key. If there is no match in the second dataset, missing values are filled in.
- Right join (right_join()): similar to a left join, but it retains all rows from the second (right) dataset and fills in missing values for non-matching rows from the first (left) dataset.
- Full join (full_join()): combines all rows from both datasets, filling in missing values where there are no matches.
- Semi-join (semi_join()): returns only rows from the first dataset where there are matching rows in the second dataset, based on a common key.
- Anti-join (anti_join()): returns only rows from the first dataset that do not have matching rows in the second dataset, based on a common key.
- Merging (merge()): the merge() function is a base R function used to merge datasets based on common columns or keys. It performs the same operations as the dplyr joins, but with slightly different syntax and behavior.
library(dplyr)
# Example datasets
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(2, 3, 4), Score = c(85, 90, 95))
# Inner join
inner_merged <- inner_join(df1, df2, by = "ID")
# Left join
left_merged <- left_join(df1, df2, by = "ID")
# Right join
right_merged <- right_join(df1, df2, by = "ID")
# Full join
full_merged <- full_join(df1, df2, by = "ID")
# Semi-join
semi_merged <- semi_join(df1, df2, by = "ID")
# Anti-join
anti_merged <- anti_join(df1, df2, by = "ID")
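For comparison, a sketch of the equivalent operations using base R’s merge():
# Inner join (the default behavior)
inner_merged_base <- merge(df1, df2, by = "ID")
# Left join
left_merged_base <- merge(df1, df2, by = "ID", all.x = TRUE)
# Full join
full_merged_base <- merge(df1, df2, by = "ID", all = TRUE)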
Now let’s merge our data_long with metadata so that our gene expression data also contains our sample annotation.
data_long <- gather(data, key = sample.id, value = cpm, -gene.hgnc.symbol)
merged_data <- left_join(data_long, metadata) # with no "by" argument, the join uses all shared column names
head(merged_data)
So far, we’ve used individual commands to accomplish several tasks, but sometimes we want to do multiple things in one line of code. The %>% is called a pipe and is used to chain commands together.
Piping will take the output from the prior step and use it as the first argument of the next step.
df
df %>%
mutate(age_in_days = age_in_years*365) %>%
filter(age_in_days < 500)
metadata %>%
filter(sample.tumor.type == "cutaneous") %>%
arrange(bor)
group_by() allows us to apply functions to grouped data, and we can use summarise() to compute values within those individual groups:
metadata %>%
group_by(sample.tumor.type, bor) %>%
summarise(total = n(), mean_age = mean(subject.age, na.rm = TRUE)) # n() performs a count
metadata %>%
group_by(sample.tumor.type, bor) %>% count() # This is another way to quickly count groups
Advanced exercise: Create a wide data frame that summarizes the total number of responders (defined by bor) per tumor type (in each row).
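One possible approach is sketched below (assuming the tidyverse is loaded; it combines count() with pivot_wider()):
metadata %>%
  count(sample.tumor.type, bor) %>%
  pivot_wider(names_from = bor, values_from = n, values_fill = 0)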
The group_by() functionality in tidyverse allows you to compute many metrics across individual groups, so you don’t have to create a filtered dataset over and over again. Base R also has a family of functions that make it easy to apply the same function repeatedly.
apply
The apply() function in R is a powerful tool for applying a function to the rows or columns of a matrix or data frame. It is particularly useful for performing operations across a dataset without needing to write explicit loops. The syntax for apply() is:
apply(X, MARGIN, FUN, ...)
# X: the matrix (or array) on which you want to apply the function
# MARGIN: specifies whether to apply the function over rows (1), columns (2), or both (c(1, 2))
# FUN: the function you want to apply to each row or column
To calculate the sum of each row in a matrix:
# Create a matrix
my_matrix <- matrix(1:9, nrow=3)
# Apply sum function across rows
row_sums <- apply(my_matrix, 1, sum)
print(row_sums)
To find the mean of each column in a data frame:
# Create a data frame
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
# Apply mean function across columns
column_means <- apply(df, 2, mean)
print(column_means)
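For these two common cases, base R also provides dedicated shortcuts:
rowSums(my_matrix) # equivalent to apply(my_matrix, 1, sum)
colMeans(df) # equivalent to apply(df, 2, mean)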
sapply and lapply
- lapply() returns a list, regardless of the output of each application of the function.
- sapply() attempts to simplify the result into a vector or matrix if possible. If simplification is not possible, it returns a list, like lapply().
Suppose you have a list of numerical vectors and you want to compute the sum of each vector. Here’s how you could use lapply():
# Define a list of vectors
num_list <- list(c(1, 2, 3), c(4, 5), c(6, 7, 8, 9))
# Use lapply to apply the sum function
list_sums <- lapply(num_list, sum)
print(list_sums)
Using the same list of numerical vectors, if you use sapply() to compute the sum, the function will try to simplify the output into a vector:
# Use sapply to apply the sum function
vector_sums <- sapply(num_list, sum)
print(vector_sums)
When to Use Each
- lapply(): when you need the robustness of a list output, especially when dealing with heterogeneous data or when the function can return variable lengths or types.
- sapply(): when you are working with homogeneous data and prefer a simplified output such as a vector or matrix, assuming the lengths and types are consistent across elements.
Another option, if you enjoy using tidyverse and group_by(), is to convert a list of objects into a data structure amenable to this:
data1 <- data.frame("subject" = paste0("subject", sample(1:10, 5, replace = FALSE)),
"age" = sample(1:100, 5))
data2 <- data.frame("subject" = paste0("subject", sample(11:20, 5, replace = FALSE)),
"age" = sample(1:100, 5))
data3 <- data.frame("subject" = paste0("subject", sample(21:30, 5, replace = FALSE)),
"age" = sample(1:100, 5))
list_of_data <- list("data1" = data1, "data2" = data2, "data3" = data3)
enframe(list_of_data)
enframe(list_of_data) %>% unnest(value)
enframe() will convert a list of objects into a tibble (a type of data frame), and each list element is encoded in the value column.
unnest(value) will unnest the data frames from the value column into individual rows.
This is very helpful if you ever have many files that are the same type, and you’re trying to create one master data frame.
# my_files <- list.files(path = "/path/to/data", pattern = "expression.tsv")
# merged_data <- lapply(my_files, fread) %>% enframe() %>% unnest(value)
Try these exercises, using the metadata and data_long objects that you read into R:
- Which cohort had the highest average expression of B2M?
- Which response group (bor) had the highest average expression of NLRC5?
If you want to export all objects in the environment, you can use save.image("path/to/file.RData"). Alternatively, individual objects can be saved and then loaded later using the following commands:
save(data, file = "testdata.RData")
# Next time you open R, you can reload the object with:
load("testdata.RData")
When you want to quit R in your terminal, you can type in the following command:
q()
The console will ask you:
Save workspace image? [y/n/c]:
Answering “y” (yes) will create a hidden file called .RData in your current working directory, storing all of the objects in your global environment. The next time you open R in that directory, R will automatically reload the stored objects, recreating your environment. This is helpful for continuing an analysis from where you previously left off.
Answering “n” (no) will not save your current environment, and you will need to rerun all of the code that you previously ran. Oftentimes, all of your code will be stored in an R script that can reproduce your prior analysis.
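For example, a saved script can be re-run in a fresh session with source() (the file name here is hypothetical):
# source("my_analysis.R")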
The core idea behind ggplot2 is the concept of a “grammar of graphics”. This concept provides a systematic way to describe and build graphical presentations such as charts and plots. The grammar itself is a set of independent components that can be composed in many different ways, including elements like:
- Aesthetics (aes): defines how data are mapped to color, size, shape, and other visual properties.
- Geometries (geom): the geometric objects in a plot, such as lines, points, and bars.
Here’s how you generally use ggplot2 to create a plot:
- Start with ggplot(): set up the data and, optionally, define default mappings between variables and their aesthetics.
- Add geometry layers: geom_point() for scatter plots, geom_line() for line graphs, and so on.
library(ggplot2)
# Sample data
df <- data.frame(
x = rnorm(100),
y = rnorm(100),
group = factor(rep(1:2, each = 50))
)
# Creating a scatter plot
ggplot(df, aes(x = x, y = y, color = group)) +
geom_point() +
theme_minimal() +
labs(title = "Scatter Plot Example", x = "X Axis", y = "Y Axis")
Let’s simulate some TCR clonotype data. We will create a dataset where each TCR has a randomly generated number of cells associated with it, representing the clone size. After generating the data, we’ll use the hist() function from base R to plot a histogram of the clone sizes.
library(dplyr)
# Step 1: Simulate data
set.seed(123) # Set seed for reproducibility
num_clonotypes <- 100 # Specify the number of different clonotypes
# Create a data frame with random cell counts for each clonotype
tcr_data <- tibble(
clonotype = paste("TCR", seq_len(num_clonotypes), sep=""),
cell_count = sample(1:1000, num_clonotypes, replace=TRUE) # Random cell counts between 1 and 1000
)
# Step 2: Create a histogram of clone sizes
hist(tcr_data$cell_count,
breaks=20, # You can adjust the number of breaks to change bin sizes
col="skyblue",
main="Histogram of TCR Clone Sizes",
xlab="Clone Size (Number of Cells)",
ylab="Frequency")
The same task can be performed using ggplot2:
# Step 1: Simulate data
set.seed(123) # Set seed for reproducibility
num_clonotypes <- 100 # Specify the number of different clonotypes
# Create a data frame with random cell counts for each clonotype
tcr_data <- tibble(
clonotype = paste("TCR", seq_len(num_clonotypes), sep=""),
cell_count = sample(1:1000, num_clonotypes, replace=TRUE) # Random cell counts between 1 and 1000
)
# Step 2: Create a histogram using ggplot2
ggplot(tcr_data, aes(x = cell_count)) +
geom_histogram(bins = 20, fill = "skyblue", color = "black") +
theme_minimal() +
labs(
title = "Histogram of TCR Clone Sizes",
x = "Clone Size (Number of Cells)",
y = "Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5) # Center the plot title
)
Here is simulated gene expression data for key CD8 T cell genes.
library(tidyr)
# Define genes and number of cells
genes <- c("GZMB", "GZMA", "GNLY", "PRF1", "TOX", "ENTPD1", "LAG3", "HAVCR2", "TIGIT", "CXCL13", "IL7R", "SELL", "LEF1", "TCF7")
num_cells <- 20
# Parameters for negative binomial
size <- 2 # Dispersion parameter
mu_pre <- 20 # Mean for pre-treatment
mu_post <- 30 # Mean for post-treatment
# Simulate gene expression data
set.seed(42)
pre_treatment <- sapply(rep(mu_pre, length(genes)), function(mu) rnbinom(num_cells, size, mu = mu))
post_treatment <- sapply(rep(mu_post, length(genes)), function(mu) rnbinom(num_cells, size, mu = mu))
# Ensure data frames have proper column names
colnames(pre_treatment) <- genes
colnames(post_treatment) <- genes
# Format as data frame
pre_data <- as_tibble(pre_treatment) %>%
mutate(Treatment = "Pre") %>%
pivot_longer(cols = -Treatment, names_to = "Gene", values_to = "Expression")
post_data <- as_tibble(post_treatment) %>%
mutate(Treatment = "Post") %>%
pivot_longer(cols = -Treatment, names_to = "Gene", values_to = "Expression")
# Combine the datasets
combined_data <- bind_rows(pre_data, post_data)
# Printing to see the combined data (optional)
print(combined_data)
Now let’s use this data to build a boxplot of TOX expression pre and post treatment.
# Filter data for the TOX gene
tox_data <- combined_data %>%
filter(Gene == "TOX")
# Plot
ggplot(tox_data, aes(x=Treatment, y=Expression, fill=Treatment)) +
geom_boxplot() +
labs(title="Expression of TOX pre and post treatment", x="Treatment Condition", y="Expression Level") +
theme_minimal() +
scale_fill_brewer(palette="Pastel1") # Enhance aesthetics with color
The same plot can be made as a violin plot:
# Filter data for the TOX gene
tox_data <- combined_data %>%
filter(Gene == "TOX")
# Create the violin plot
ggplot(tox_data, aes(x=Treatment, y=Expression, fill=Treatment)) +
geom_violin(trim=FALSE) + # Trim set to FALSE to show the full range of data
labs(title="Expression of TOX pre and post treatment", x="Treatment Condition", y="Expression Level") +
theme_minimal() +
scale_fill_brewer(palette="Pastel1") +
geom_boxplot(width=0.1, fill="white") # Overlay boxplot to show median and quartiles
A t-test could be used to compare the means of two groups, for example, the level of a specific immune marker in patients with and without a certain mutation.
# Randomly generated sample data: Immune marker levels in two patient groups
group1 <- rnorm(30, mean = 5, sd = 1.5) # Patients with a mutation
group2 <- rnorm(30, mean = 4.5, sd = 1.2) # Patients without the mutation
# Perform a t-test
test <- t.test(group1, group2)
# Print the result
print(test)
Assume you’ve identified a TCR clonotype and quantified the number of cells expressing this clonotype at two timepoints:
- Pre-treatment: X number of cells
- Post-treatment: Y number of cells
You also need the total number of cells sequenced at each timepoint to complete the contingency table for the Fisher’s Exact Test. Let’s say:
- N_pre: total number of cells sequenced pre-treatment
- N_post: total number of cells sequenced post-treatment
X <- 20
Y <- 300
N_pre <- 2450
N_post <- 3001
# Number of cells with the specific clonotype at each timepoint
cells_with_clone <- c(X, Y)
# Number of cells without the clonotype (total cells minus cells with the clonotype)
cells_without_clone <- c(N_pre - X, N_post - Y)
# Create the contingency table
data <- matrix(c(cells_with_clone, cells_without_clone), ncol = 2,
dimnames = list(c("With Clone", "Without Clone"),
c("Pre-Treatment", "Post-Treatment")))
# Perform Fisher's Exact Test
test <- fisher.test(data)
# Print the result
print(test)
fisher.test(data) calculates whether the proportions of cells with the clonotype are significantly different between the two timepoints.
To wrap up, try these exercises using the iris dataset:
# Load the iris dataset (saved as iris in base R)
iris <- iris
# Create a scatter plot of sepal length vs sepal width colored by species where the point size is determined from petal length.
# HINT: Look at size arguments for aes.
# Add a legend to your scatter plot for the size of the points.
# HINT: Look into the guides function and use guide_legend().
# Create a histogram of the petal length by species with 20 bins and add transparency to the plot.
# HINT: Look into the arguments for geom_histogram
# Create a density plot of the petal length colored by species.
# HINT: Look up geom_density.
# Combine the density and histogram plots
# HINT: The histogram needs to be normalized to match the density plot; look into after_stat(density) (formerly "..density..")
# Perform a t-test on sepal length between the species setosa and versicolor
# HINT: Pull out the necessary values using subset
# Perform linear regression on the sepal length vs petal length and summarize the results.
# HINT: Look into the lm function
# Plot the regression line over a scatter plot of sepal vs petal length.
# HINT: Look into geom_smooth()
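For reference, here is a sketch of one possible solution to the first exercise:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, size = Petal.Length)) +
  geom_point(alpha = 0.7) +
  labs(x = "Sepal Length", y = "Sepal Width") +
  theme_minimal()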