Welcome to the blog

Posts

My thoughts and ideas

TCR/BCR Repertoire Analysis | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

TCR/BCR Repertoire Analysis

In this module, we will use our scRNA seurat object to explore immune receptor T and B cell diversity. To recognize both antigens and tumor neoantigens, T cells generate diverse receptor sequences through somatic V(D)J recombination. B cells perform V(D)J recombination during developement and then, after antigen encounter, use somatic hypermutation to further diversify their receptors.

We will be using scRepertoire to explore TCR and BCR data. This R package provides several convenient processing and visualization functions that are easy to understand and use. We will then add the clonal information for both B and T cells back onto our Seurat object to be used in further analysis.

#BiocManager::install("scRepertoire")

library("scRepertoire")
library("dplyr")
library("Seurat")

Understanding our TCR data

In the output from 10x Genomics Cell Ranger vdj pipeline, the main file that we use to explore our TCR/BCR repertoire is filtered_contig_annotations.csv. filtered_contig_annotations.csv provides high-level annotations of each high-confidence contig from cell-associated barcodes. Each contig aims to represent the complete variable region (and often part of the constant region) of a single TCR or BCR chain from a single cell. More about the outputs…

Lets read in a filtered contigs file and explore what this file contains. This files contains columns such as high confidence, full length, and productive denoting how likely a contig is a true TCR region. Let’s look at one contig.

setwd("/cloud/project/")
Rep1_ICB_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep1_ICB-t-filtered_contig_annotations.csv")

colnames(Rep1_ICB_t)
head(Rep1_ICB_t, 1)

If we look at one cell barcode, we expect to see see 2 contigs: an alpha chain and a beta chain.

Rep1_ICB_t[Rep1_ICB_t$barcode == "AAACCTGAGCGGATCA-1", 1:10]
             barcode is_cell                   contig_id high_confidence length chain  v_gene d_gene  j_gene c_gene
1 AAACCTGAGCGGATCA-1    true AAACCTGAGCGGATCA-1_contig_1            true    552   TRA TRAV3-3         TRAJ37   TRAC
2 AAACCTGAGCGGATCA-1    true AAACCTGAGCGGATCA-1_contig_2            true    514   TRB  TRBV14  TRBD1 TRBJ1-1  TRBC1

But sometimes we see more than two. You may observe two alpha chains and one beta chain or vice versa. In theory, it is possible for a cell to express two alpha chains and two beta chains, although normally one allele is supressed/expressed for each. What are other explanations for seeing multiple alpha/beta chains assigned to a single cell?

Rep1_ICB_t[Rep1_ICB_t$barcode == "AAACCTGTCAGTCCCT-1", 1:10]
              barcode is_cell                   contig_id high_confidence length chain        v_gene d_gene  j_gene c_gene
8  AAACCTGTCAGTCCCT-1    true AAACCTGTCAGTCCCT-1_contig_1            true    552   TRB        TRBV16  TRBD1 TRBJ1-4  TRBC1
9  AAACCTGTCAGTCCCT-1    true AAACCTGTCAGTCCCT-1_contig_2            true    603   TRA TRAV14D-3-DV8         TRAJ22   TRAC
10 AAACCTGTCAGTCCCT-1    true AAACCTGTCAGTCCCT-1_contig_3            true    558   TRA       TRAV4-3         TRAJ34   TRAC

Take a look at some more barcodes. How often to we have a barcode that just has an alpha and beta chain? How often do we see barcodes with 3+ chains? You can summarize this with the following command.

table(table(Rep1_ICB_t[, "barcode"]))

The column raw_clonotype_id contains clonotype IDs which represent TCRs. There should be only one unique clonotype ID per cell.

Rep1_ICB_t[Rep1_ICB_t$barcode == "AAACCTGTCAGTCCCT-1", "raw_clonotype_id"]
[1] "clonotype1609" "clonotype1609" "clonotype1609"

Using this column we can count the unique cells per clonotype. This would show us if there was expansion of a single TCR (an possible immune response). First we have to get our data into the correct format. We want to count how many times we see raw_clonotype_id, accounting for the fact that there are duplicate barcodes depending on how many chains are making up the clonotype.

Rep1_ICB_clonotype_counts <- Rep1_ICB_t %>%
  distinct(barcode, raw_clonotype_id) %>%  # First get unique barcode-clonotype pairs
  group_by(raw_clonotype_id) %>%
  summarize(cell_count = n()) %>%
  arrange(desc(cell_count))  # Sort by count for better visualization

head(Rep1_ICB_clonotype_counts)
# A tibble: 6 × 2
  raw_clonotype_id cell_count
  <chr>                 <int>
1 clonotype1                8
2 clonotype2                7
3 clonotype3                7
4 clonotype4                5
5 clonotype5                5
6 clonotype6                5

Cellranger will always assign clonotype1 to the clonotype with the most expansion – the greatest number of cells with that TCR. 8 cells is not a lot of expansion…

We can plot all our clonotypes to see the general distribution of TCRs per cells. We see a majority of cells only have one clonotype assoicated with them.

ggplot(Rep1_ICB_clonotype_counts, aes(x = reorder(raw_clonotype_id, -cell_count), y = cell_count)) +
  geom_bar(stat = "identity") +
  labs(x = "Clonotype ID", 
       y = "Number of Unique Cells",
       title = "Number of Unique Cells per Clonotype") +
  theme(axis.text.x=element_blank())

Number of Unique TCR Clones per Sample

Our barplot is nice but we might want to produce more complicated graphs to better visualize our repertoire. This is where scRepertoire comes into play.

scRepertoire for TCR analysis

scRepertoire is one of the first packages that allows the combining of single cell RNA and immune profiling. We will use it to perform more complex exploration of the TCR diversity within our data.

Loading in data

First lets load in our data for all our replicates.

Rep1_ICB_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep1_ICB-t-filtered_contig_annotations.csv")
Rep1_ICBdT_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep1_ICBdT-t-filtered_contig_annotations.csv")
Rep3_ICB_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep3_ICB-t-filtered_contig_annotations.csv")
Rep3_ICBdT_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep3_ICBdT-t-filtered_contig_annotations.csv")
Rep5_ICB_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep5_ICB-t-filtered_contig_annotations.csv")
Rep5_ICBdT_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep5_ICBdT-t-filtered_contig_annotations.csv")

# create a list of all samples' TCR data 
TCR.contigs <- list(Rep1_ICB_t, Rep1_ICBdT_t, Rep3_ICB_t, Rep3_ICBdT_t, Rep5_ICB_t, Rep5_ICBdT_t)

Combining contigs into clones

Use scRepertoire’s combineTCR function to create an object with all samples’ TCR data. Note the additional filtering parameters: removeNA will remove any chain with NA values, removeMulti will remove barcodes with greater than 2 chains, and filterMulti will allow for the selection of the 2 corresponding chains with the highest expression for a single barcode.

We will set filterMulti to TRUE to avoid the confusing cases where there are multiple chains.

sample_names <- c("Rep1_ICB", "Rep1_ICBdT", "Rep3_ICB", 
                 "Rep3_ICBdT", "Rep5_ICB", "Rep5_ICBdT")

combined.TCR <- combineTCR(TCR.contigs, 
                           samples = sample_names,
                           removeNA = FALSE, 
                           removeMulti = FALSE, 
                           filterMulti = TRUE)

# view the object
View(combined.TCR[[1]])

So far we have not considered that we have filtered out some cells in our seurat object. Let’s filter our scRepertoire object to the same cells in our seurat object.

# Read in your Seurat object
# https://genomedata.org/cri-workshop/preprocessed_object.rds
rep135 <- readRDS("outdir_single_cell_rna/preprocessed_object.rds")

combined.TCR_filter <- combined.TCR # create a variable to hold our filtered object

index <- 1 # set an index for accessing our combined.TCR object
for (sample in sample_names) {

  print(paste0("Sample Name: ", sample))

  screp_barcodes <- combined.TCR[[index]]$barcode
  print(paste0("Number of barcodes in scRep obj: ", length(unique(screp_barcodes))))

  seurat_barcodes <- rownames(rep135@meta.data[rep135@meta.data$orig.ident == sample, ])
  print(paste0("Number of barcodes in seurat obj: ", length(seurat_barcodes)))
  
  TCR_clonotype_barcodes_filtered <- subset(combined.TCR[[index]], screp_barcodes %in% seurat_barcodes)
  
  print(paste0("Number of rows in scRep obj after filtering: ", nrow(TCR_clonotype_barcodes_filtered)))
  
  combined.TCR_filter[[index]] <- TCR_clonotype_barcodes_filtered

  index <- index + 1
}
[1] "Sample Name: Rep1_ICB"
[1] "Number of barcodes in scRep obj: 2586"
[1] "Number of barcodes in seurat obj: 2986"
[1] "Number of rows in scRep obj after filtering: 2397"
[1] "Sample Name: Rep1_ICBdT"
[1] "Number of barcodes in scRep obj: 959"
[1] "Number of barcodes in seurat obj: 3106"
[1] "Number of rows in scRep obj after filtering: 912"
[1] "Sample Name: Rep3_ICB"
[1] "Number of barcodes in scRep obj: 3607"
[1] "Number of barcodes in seurat obj: 5680"
[1] "Number of rows in scRep obj after filtering: 3506"
[1] "Sample Name: Rep3_ICBdT"
[1] "Number of barcodes in scRep obj: 3004"
[1] "Number of barcodes in seurat obj: 5072"
[1] "Number of rows in scRep obj after filtering: 2945"
[1] "Sample Name: Rep5_ICB"
[1] "Number of barcodes in scRep obj: 248"
[1] "Number of barcodes in seurat obj: 1794"
[1] "Number of rows in scRep obj after filtering: 241"
[1] "Sample Name: Rep5_ICBdT"
[1] "Number of barcodes in scRep obj: 2686"
[1] "Number of barcodes in seurat obj: 4547"
[1] "Number of rows in scRep obj after filtering: 2596"

We can compare the number of clonotypes scRepertoire calls compred to cellranger.

nrow(Rep1_ICB_clonotype_counts)
nrow(combined.TCR[[1]])
nrow(combined.TCR_filter[[1]])
> nrow(Rep1_ICB_clonotype_counts)
[1] 2444
> nrow(combined.TCR[[1]])
[1] 2586
> nrow(combined.TCR_filter[[1]])
[1] 2397

Visualize the Number of Clones

scRepertoire defines clones as TCRs with shared/trackable complementarity-determining region 3 (CDR3) sequences. You can define clones using the amino acid sequence (aa), nucleotide (nt), or the V(D)JC genes (genes). The latter genes would be a more permissive definition of “clones”, as multiple amino acid or nucleotide sequences can result from the same gene combination. You can also use a combination of the V(D)JC genes and nucleotide sequence (strict). scRepertoire also allows for the users to select both or individual chains to examine.

clonalQuant(combined.TCR_filter, 
            cloneCall="aa", 
            chain = "both", 
            scale = FALSE)

For the TCR analysis, Freshour et al. 2023. excluded samples containing less than 1,500 TCR+ cells after filtering. We found that the majority of samples with less than 1,500 cells did not appear to have had their naive T cell populations captured during sequencing, which could lead to a skewed appearance of clonal expansion among the T cell populations that were captured

Number of Unique TCR Clones per Sample

# OR view the relative percent of unique clones scaled by the total size of the clonal repertoire
clonalQuant(combined.TCR_filter, 
            cloneCall="aa", 
            chain = "both", 
            scale = TRUE)

# OR export the counts as a table to view the absolute amount
clonalQuant(combined.TCR_filter, 
            cloneCall="aa", 
            chain = "both", 
            scale = TRUE,
            exportTable = TRUE)

Here we see that contigs is our unique clonotypes and total is the number of cells that have a TCR assocaited with them.

  contigs     values total   scaled
1    2339   Rep1_ICB  2397 97.58031
2     429 Rep1_ICBdT   912 47.03947
3    3282   Rep3_ICB  3506 93.61095
4    2687 Rep3_ICBdT  2945 91.23939
5     164   Rep5_ICB   241 68.04979
6    2260 Rep5_ICBdT  2596 87.05701

Try changing the cloneCall argument to see how that changes the number of clones

Visualizing the proportion of the repertoire taken up by a specific clone

We are interested in understanding if there is a T cell response in our data. If there was a T cell response, we might see a single clonotype in a large portion of cells. Earlier we used a barplot to see how many cells each clonotype was seen in. Now we will use screpertoire’s clonalProportion function to produce stacked barplots visualizing how much space each clone is taking up.

We are going to split up our clones in the same way as Freshour et al. 2023, remembering that 1 here means clonotype1 which is assigned to our top clonotype.

clonalProportion(combined.TCR_filter, cloneCall = "aa", clonalSplit = c(1, 10, 25, 100, 500, 1000, 1e+05)) 

The porportion of the repertoire that each group of clonotypes takes up

clonalProportion(combined.TCR_filter, cloneCall = "aa", clonalSplit = c(1, 10, 25, 100, 500, 1000, 1e+05), exportTable=TRUE) 
           [1:1] [2:10] [11:25] [26:100] [101:500] [501:1000] [1001:1e+05]
Rep1_ICB       5     30      32       91       400        500         1339
Rep1_ICBdT   103    244     117      119       329          0            0
Rep3_ICB      18     85      59      162       400        500         2282
Rep3_ICBdT    28     93      75      162       400        500         1687
Rep5_ICB       7     43      36       91        64          0            0
Rep5_ICBdT    31    127      85      193       400        500         1260

We see the highest amount of expansion in Rep1_ICBdT and Rep5_ICB. However, from our counts barplots, we know that these samples have very few number of TCRs called, making this data untrustworthy. For the other samples, we see very little expansion of a single clonotype.

Understanding our BCR data

Now lets perform a similar analysis for our BCRs, keeping in mind the difference between BCRs and TCRs. BCRs can be hard to group because of somatic hypermutation which introduces multiple mutations during B cell maturation.

Rep1_ICB_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep1_ICB-b-filtered_contig_annotations.csv")
Rep1_ICBdT_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep1_ICBdT-b-filtered_contig_annotations.csv")
Rep3_ICB_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep3_ICB-b-filtered_contig_annotations.csv")
Rep3_ICBdT_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep3_ICBdT-b-filtered_contig_annotations.csv")
Rep5_ICB_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep5_ICB-b-filtered_contig_annotations.csv")
Rep5_ICBdT_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep5_ICBdT-b-filtered_contig_annotations.csv")

colnames(Rep1_ICB_t) # TCR column names
colnames(Rep1_ICB_b) # BCR column names
View(Rep1_ICB_b)

Notice that our column names are the exact same for TCRs and BCRs but when you view the BCR data we see that of course the gene names have changed and the chains are either IGH (heavy chain) or IGK/IGL (light chain).

All cells should either have a IGH/IGK combination or IGH/IGL combination.

Let’s use our R skills to again count the unique cells per clonotype.

Rep1_ICB_clonotype_counts <- Rep1_ICB_b %>%
  distinct(barcode, raw_clonotype_id) %>%  # First get unique barcode-clonotype pairs
  group_by(raw_clonotype_id) %>%
  summarize(cell_count = n()) %>%
  arrange(desc(cell_count))  # Sort by count for better visualization

head(Rep1_ICB_clonotype_counts)
# A tibble: 6 × 2
  raw_clonotype_id cell_count
  <chr>                 <int>
1 clonotype1              137
2 clonotype2                4
3 clonotype3                3
4 clonotype10               2
5 clonotype11               2
6 clonotype12               2
ggplot(Rep1_ICB_clonotype_counts, aes(x = reorder(raw_clonotype_id, -cell_count), y = cell_count)) +
  geom_bar(stat = "identity") +
  labs(x = "Clonotype ID", 
       y = "Number of Unique Cells",
       title = "Number of Unique Cells per Clonotype") +
  theme(axis.text.x=element_blank())

Number of Unique BCR Clones per Sample

When we plot it looks like there might be some evidence of expansion. However, lets use scRepertoire to look more in depth.

scRepertoire for BCR analysis

Unlike combineTCR, combineBCR produces a column called CTstrict of an index of nucleotide sequence and the corresponding V gene. This index automatically calculates the Levenshtein distance between sequences with the same V gene and will index sequences using a normalized Levenshtein distance with the same ID. After which, clone clusters are called using the components function. Clones that are clustered across multiple sequences will then be labeled with “Cluster” in the CTstrict header. We use the threshold parameter to set the normalized edit distance to consider, where the higher the number the more similarity of sequence will be used for clustering.

BCR.contigs <- list(Rep1_ICB_b, Rep1_ICBdT_b, Rep3_ICB_b, Rep3_ICBdT_b, Rep5_ICB_b, Rep5_ICBdT_b)

sample_names <- c("Rep1_ICB", "Rep1_ICBdT", "Rep3_ICB", "Rep3_ICBdT", "Rep5_ICB", "Rep5_ICBdT")

combined.BCR_85 <- combineBCR(BCR.contigs, 
                              samples = sample_names, 
                              threshold = 0.85) # more unique clonotypes

combined.BCR_05 <- combineBCR(BCR.contigs, 
                              samples = sample_names, 
                              threshold = 0.05) # less unique clonotypes

clonalQuant(combined.BCR_85, 
            cloneCall="strict", 
            chain = "both", 
            scale = FALSE,
            exportTable = TRUE) 
  
clonalQuant(combined.BCR_05, 
              cloneCall="strict", 
              chain = "both", 
              scale = FALSE,
              exportTable = TRUE)
  contigs     values total
1     262   Rep1_ICB   414
2      30 Rep1_ICBdT    30
3     499   Rep3_ICB   687
4    1376 Rep3_ICBdT  1537
5      10   Rep5_ICB    11
6     990 Rep5_ICBdT  1374

  contigs     values total
1     248   Rep1_ICB   414
2      30 Rep1_ICBdT    30
3     462   Rep3_ICB   687
4    1064 Rep3_ICBdT  1537
5       9   Rep5_ICB    11
6     830 Rep5_ICBdT  1374

When we lower the threshold there are less unique clonotypes. What the best threshold is depends on what data you are looking at. We will use the default of 0.85.

We will also filter the scRepertoire BCR object to the same barcodes as those in the single cell object.

combined.BCR_filter <- combined.BCR_85 # create a variable to hold our filtered object

index <- 1 # set an index for accessing our combined.BCR object
for (sample in sample_names) {
  
  print(paste0("Sample Name: ", sample))
  
  screp_barcodes <- combined.BCR_85[[index]]$barcode
  print(paste0("Number of barcodes in scRep obj: ", length(unique(screp_barcodes))))
  
  seurat_barcodes <- rownames(rep135@meta.data[rep135@meta.data$orig.ident == sample, ])
  print(paste0("Number of barcodes in seurat obj: ", length(seurat_barcodes)))
  
  BCR_clonotype_barcodes_filtered <- subset(combined.BCR_85[[index]], screp_barcodes %in% seurat_barcodes)
  
  print(paste0("Number of rows in scRep obj after filtering: ", nrow(BCR_clonotype_barcodes_filtered)))
  
  combined.BCR_filter[[index]] <- BCR_clonotype_barcodes_filtered
  
  index <- index + 1
}
[1] "Sample Name: Rep1_ICB"
[1] "Number of barcodes in scRep obj: 414"
[1] "Number of barcodes in seurat obj: 2986"
[1] "Number of rows in scRep obj after filtering: 256"
[1] "Sample Name: Rep1_ICBdT"
[1] "Number of barcodes in scRep obj: 30"
[1] "Number of barcodes in seurat obj: 3106"
[1] "Number of rows in scRep obj after filtering: 26"
[1] "Sample Name: Rep3_ICB"
[1] "Number of barcodes in scRep obj: 687"
[1] "Number of barcodes in seurat obj: 5680"
[1] "Number of rows in scRep obj after filtering: 654"
[1] "Sample Name: Rep3_ICBdT"
[1] "Number of barcodes in scRep obj: 1537"
[1] "Number of barcodes in seurat obj: 5072"
[1] "Number of rows in scRep obj after filtering: 1400"
[1] "Sample Name: Rep5_ICB"
[1] "Number of barcodes in scRep obj: 11"
[1] "Number of barcodes in seurat obj: 1794"
[1] "Number of rows in scRep obj after filtering: 6"
[1] "Sample Name: Rep5_ICBdT"
[1] "Number of barcodes in scRep obj: 1374"
[1] "Number of barcodes in seurat obj: 4547"
[1] "Number of rows in scRep obj after filtering: 906"

There are way fewer BCRs called compared to TCRs.

Visualize the Number of Clones

clonalQuant(combined.BCR_filter, 
            cloneCall="strict", 
            chain = "both", 
            scale = FALSE)

Number of Unique BCR Clones per Sample

clonalQuant(combined.BCR_filter, 
            cloneCall="aa", 
            chain = "both", 
            scale = TRUE,
            exportTable = TRUE)

# Note that our threshold will only change the number called in our strict category
clonalQuant(combined.BCR_filter, 
            cloneCall="strict", 
            chain = "both", 
            scale = TRUE,
            exportTable = TRUE)
  contigs     values total    scaled
1     253   Rep1_ICB   256  98.82812
2      26 Rep1_ICBdT    26 100.00000
3     564   Rep3_ICB   654  86.23853
4    1333 Rep3_ICBdT  1400  95.21429
5       6   Rep5_ICB     6 100.00000
6     793 Rep5_ICBdT   906  87.52759

  contigs     values total    scaled
1     248   Rep1_ICB   256  96.87500
2      26 Rep1_ICBdT    26 100.00000
3     485   Rep3_ICB   654  74.15902
4    1273 Rep3_ICBdT  1400  90.92857
5       6   Rep5_ICB     6 100.00000
6     699 Rep5_ICBdT   906  77.15232

We can also look at what clones are expanded but we have such a small number of BCRs that most of this data is probably untrustworthy.

clonalProportion(combined.BCR_filter, cloneCall = "strict", clonalSplit = c(1, 10, 25, 100, 500, 1000, 1e+05)) 

The porportion of the repertoire that each group of clonotypes takes up

Adding the BCR and TCR Data to your Seurat object

We will now add the BCR and TCR information which has been handled by scRepertoire so far to our Seurat object.

Use the combineExpression function to add the TCR data to your Seurat object. The columns CTgene, CTnt, CTaa, CTstrict, clonalProportion, clonalFrequency, and cloneSize data will be added to your Seurat object’s metadata. Notice you can also decide the bins for grouping based on proportion or frequency of the TCRs (or BCRs).

Here we group by frequency:

rep135 <- combineExpression(combined.TCR, rep135, 
                         cloneCall="aa", proportion = FALSE,
                         group.by = "sample",
                         cloneSize=c(Single=1, Small=5, Medium=20, Large=100, Hyperexpanded=500))

colnames(rep135@meta.data)

Since we want to add both BCR and TCR information we have to rename the columns to make it clear since scRepertoire uses the same column names for BCR and TCR data.

columns_to_modify <- c("CTgene", "CTnt", "CTaa", "CTstrict", "clonalProportion", "clonalFrequency", "cloneSize")
names(rep135@meta.data)[names(rep135@meta.data) %in% columns_to_modify] <- paste0(columns_to_modify, "_TCR")

colnames(rep135@meta.data) # make sure the column names are changed 

We can now plot the cells with TCRs and compare that to our cell type annotations. Indeed, we see a majority of TCRs in the same clusters which are called T cells!

DimPlot(rep135, group.by = c("cloneSize_TCR", "immgen_singler_main"))

Let’s repeat the same steps with the BCR data.

rep135 <- combineExpression(combined.BCR_85, rep135, 
                              cloneCall="aa", proportion = FALSE,
                              group.by = "sample",
                              cloneSize=c(Single=1, Small=5, Medium=20, Large=100, Hyperexpanded=500))

names(rep135@meta.data)[names(rep135@meta.data) %in% columns_to_modify] <- paste0(columns_to_modify, "_BCR")
colnames(rep135@meta.data) # make sure the column names are changed 

Now let’s visualize the TCR, BCR, and cell type annotations with a UMAP.

DimPlot(rep135, group.by = c("cloneSize_TCR", "cloneSize_BCR", "immgen_singler_main"))

BCR vs TCR vs celltype

Exercise: Create a custom bar graph with BCR and TCR information

So now we have this data, but how do we use it? How could we use the information our Seurat object already holds to analysis our BCR/TCR information further?

Say we want to create a bar plot that shows the number of clones by each cell type. So we want the x-axis to be the number of clones and the y-axis to be counts of BCR or TCRs.

Let’s start by deciding what information we need to make this graph.

colnames(rep135@meta.data)

Create a dataframe that contains just that information, mainly for ease. Here we are going to grab the cell type labels and the amino acid sequences for the CDR3s.

rep135_TCR_clones <- rep135@meta.data[, c("CTaa_TCR", "immgen_singler_main")]

head(rep135_TCR_clones)

The dataframe should look something like this, where we have some cells with a TCR and their cell type label.

                                                   CTaa_TCR immgen_singler_main
Rep1_ICBdT_AAACCTGAGCCAACAG-1  CALGAVSAGNKLTF_CASRGGAYAEQFF                 NKT
Rep1_ICBdT_AAACCTGAGCCTTGAT-1                          <NA>             B cells
Rep1_ICBdT_AAACCTGAGTACCGGA-1                          <NA>         Fibroblasts
Rep1_ICBdT_AAACCTGCACGGCCAT-1                          <NA>            NK cells
Rep1_ICBdT_AAACCTGCACGGTAAG-1 CATDGGTGSNRLTF_CASSYGQGDSDYTF             T cells
Rep1_ICBdT_AAACCTGCATGCCACG-1                          <NA>         Fibroblasts

First, we should use groupby to group our data into the categories that we want to plot.

rep135_TCR_clones %>%
  group_by(immgen_singler_main)
# A tibble: 23,185 × 2
# Groups:   immgen_singler_main [18]
   CTaa_TCR                       immgen_singler_main
   <chr>                          <chr>              
 1 CALGAVSAGNKLTF_CASRGGAYAEQFF   NKT                
 2 NA                             B cells            
 3 NA                             Fibroblasts        
 4 NA                             NK cells           
 5 CATDGGTGSNRLTF_CASSYGQGDSDYTF  T cells            
 6 NA                             Fibroblasts        
 7 CAARLGMSNYNVLYF_CASSQTGGDERLFF T cells            
 8 NA                             Neutrophils        
 9 NA                             Fibroblasts        
10 CALGAVSAGNKLTF_CASRGGAYAEQFF   NKT                
# ℹ 23,175 more rows
# ℹ Use `print(n = ...)` to see more rows

Then we want to use the summarise function to create a count of how many TCRs we see:

rep135_TCR_clones %>%
    group_by(immgen_singler_main) %>%
    summarise(Count = n())
# A tibble: 18 × 2
   immgen_singler_main Count
   <chr>               <int>
 1 B cells              3253
 2 B cells, pro            3
 3 Basophils              37
 4 DC                    295
 5 Endothelial cells      71
 6 Epithelial cells     1238
 7 Fibroblasts           589
 8 ILC                   763
 9 Macrophages           459
10 Mast cells             11
11 Monocytes             633
12 NK cells              565
13 NKT                  2249
14 Neutrophils            92
15 Stem cells              2
16 Stromal cells          18
17 T cells             12714
18 Tgd                   193

This looks good! Let’s plot it with a barplot:

TCR_celltypes_summary <- rep135_TCR_clones %>%
    group_by(immgen_singler_main) %>%
    summarise(Count = n())

ggplot(TCR_celltypes_summary, aes(x = immgen_singler_main, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Cell Type", y = "Number of TCR Clones", title = "Number of Clones by Cell Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # adjust the x-axis labels so that they are not overlapping each other

The number of TCRs in each celltypes (incorrect)

Hmmmmmmmmm, this looks a little suspicious. There are a lot of B cells with TCRs… maybe our summarise function was incorrect?

Let’s check the counts when we sum just our cell types column.

table(rep135@meta.data$immgen_singler_main)

Unfortunately, those are the same numbers as our summary dataframe above. We want to be counting how many TCRs are seen per cell type not how many of each cell type we have. What do you think could be the problem?

          B cells      B cells, pro         Basophils                DC Endothelial cells  Epithelial cells       Fibroblasts               ILC       Macrophages 
             3253                 3                37               295                71              1238               589               763               459 
       Mast cells         Monocytes       Neutrophils          NK cells               NKT        Stem cells     Stromal cells           T cells               Tgd 
               11               633                92               565              2249                 2                18             12714               193 
# A tibble: 23,185 × 2
# Groups:   immgen_singler_main [18]
   CTaa_TCR                       immgen_singler_main
   <chr>                          <chr>              
 1 CALGAVSAGNKLTF_CASRGGAYAEQFF   NKT                
 2 NA                             B cells            
 3 NA                             Fibroblasts        
 4 NA                             NK cells           
 5 CATDGGTGSNRLTF_CASSYGQGDSDYTF  T cells            
 6 NA                             Fibroblasts        
 7 CAARLGMSNYNVLYF_CASSQTGGDERLFF T cells            
 8 NA                             Neutrophils        
 9 NA                             Fibroblasts        
10 CALGAVSAGNKLTF_CASRGGAYAEQFF   NKT                
# ℹ 23,175 more rows
# ℹ Use `print(n = ...)` to see more rows

It seems that those pesky NAs are being counted during our summarise command which we don’t want. If a cell has no TCR it should not be counted. Throwing in a na.omit() should do the trick!

rep135_TCR_clones %>%
    group_by(immgen_singler_main) %>% na.omit() 
# A tibble: 12,597 × 2
# Groups:   immgen_singler_main [15]
   CTaa_TCR                                  immgen_singler_main
   <chr>                                     <chr>              
 1 CALGAVSAGNKLTF_CASRGGAYAEQFF              NKT                
 2 CATDGGTGSNRLTF_CASSYGQGDSDYTF             T cells            
 3 CAARLGMSNYNVLYF_CASSQTGGDERLFF            T cells            
 4 CALGAVSAGNKLTF_CASRGGAYAEQFF              NKT                
 5 CAARLGMSNYNVLYF_CASSQTGGDERLFF            DC                 
 6 CAMREGSNNRIFF;CALSGANNNNAPRF_CASSYRGFDYTF T cells            
 7 CAAHSNYQLIW_CASSPGTGGYEQYF                NKT                
 8 CAVKNNRIFF_CASGDARGVEQYF                  ILC                
 9 CAASEGGNYKPTF_CASSRDRYAEQFF               NKT                
10 CARTNTGYQNFYF_CASSPHNSPLYF                Monocytes          
# ℹ 12,587 more rows
# ℹ Use `print(n = ...)` to see more rows

That looks much better. So all together we have a command that looks like this to get the data into the correct format for the barplot.

# Count occurrences of each cell type
cell_type_counts_TCR <- rep135_TCR_clones %>%
  group_by(immgen_singler_main) %>% na.omit() %>%
  summarise(Count = n())

Finally, we get to plot something!

ggplot(cell_type_counts_TCR, aes(x = immgen_singler_main, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Cell Type", y = "Number of TCR Clones", title = "Number of Clones by Cell Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

And it looks great!

The number of TCRs in each celltypes (Correct)

Now we can do the same thing with the BCR data.

rep135_BCR_clones <- rep135@meta.data[, c("CTaa_BCR", "immgen_singler_main")]

# Count occurrences of each cell type
cell_type_counts_BCR <- rep135_BCR_clones %>%
  group_by(immgen_singler_main) %>% na.omit() %>%
  summarise(Count = n())

# Plot
ggplot(cell_type_counts_BCR, aes(x = immgen_singler_main, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Cell Type", y = "Number of BCR Clones", title = "Number of Clones by Cell Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The number of BCRs in each celltypes (Correct)

Bonus Exercise: Stacked bar plot of cell types by counts grouped by TCR clones

What if we wanted to see not just how many clones there are, but also the number of unique clones, and how many cells we see each unique clone in? Let’s create a stacked barplot that shows the number of clones for each cell type, but grouped by unique clones.

We can start by using the table function again.

rep135_TCR_clones <- rep135@meta.data[, c("CTaa_BCR", "immgen_singler_main", "cloneSize_TCR")]

table(rep135_TCR_clones$cloneSize_TCR)
Hyperexpanded (100 < X <= 500)          Large (20 < X <= 100)           Medium (5 < X <= 20)             Small (1 < X <= 5) 
                           103                            315                            531                            917 
           Single (0 < X <= 1)               None ( < X <= 0) 
                         10731                              0 

We see that most clones are only seen once, but if we look carefully there are some clones which are seen in mutiple cells. So we are on the right track. But we want these counts grouped by cell type. Can we run table with two parameters? (You might think this is silly but I genuinely couldn’t remember when I tried this the first time)

# Count occurrences of each cloneSize_TCR
cloneSize_immgen_table <- table(rep135_TCR_clones$cloneSize_TCR, rep135_TCR_clones$immgen_singler_main)

head(cloneSize_immgen_table)
                                
                                 B cells B cells, pro Basophils   DC Endothelial cells Epithelial cells Fibroblasts  ILC Macrophages
  Hyperexpanded (100 < X <= 500)       0            0         1    2                 0                0           4    0           0
  Large (20 < X <= 100)                0            0         0    1                 0                0           1    4           0
  Medium (5 < X <= 20)                 3            0         0    3                 0                0           4    6           1
  Small (1 < X <= 5)                   9            0         0    2                 1                0           1   10           2
  Single (0 < X <= 1)                 68            0         0   27                 1                3           3  125           6
  None ( < X <= 0)                     0            0         0    0                 0                0           0    0           0
                                
                                 Mast cells Monocytes Neutrophils NK cells  NKT Stem cells Stromal cells T cells  Tgd
  Hyperexpanded (100 < X <= 500)          0         1           0        1   37          0             0      57    0
  Large (20 < X <= 100)                   0         1           0        0  116          0             0     191    1
  Medium (5 < X <= 20)                    0         4           0        0  181          0             0     328    1
  Small (1 < X <= 5)                      0         2           0        1  281          0             0     604    4
  Single (0 < X <= 1)                     1        30           3       11  653          0             0    9712   88
  None ( < X <= 0)                        0         0           0        0    0          0             0       0    0

Well no way, that is exactly the summary we want. We still have a problem where we have a little too much information to be graphed in a way that is useful. How about we get rid of any row where there are 0 clones seen.

cloneSize_immgen_table <- cloneSize_immgen_table[-c(6), ]

head(cloneSize_immgen_table)
                                 B cells B cells, pro Basophils   DC Endothelial cells Epithelial cells Fibroblasts  ILC Macrophages
  Hyperexpanded (100 < X <= 500)       0            0         1    2                 0                0           4    0           0
  Large (20 < X <= 100)                0            0         0    1                 0                0           1    4           0
  Medium (5 < X <= 20)                 3            0         0    3                 0                0           4    6           1
  Small (1 < X <= 5)                   9            0         0    2                 1                0           1   10           2
  Single (0 < X <= 1)                 68            0         0   27                 1                3           3  125           6
                                
                                 Mast cells Monocytes Neutrophils NK cells  NKT Stem cells Stromal cells T cells  Tgd
  Hyperexpanded (100 < X <= 500)          0         1           0        1   37          0             0      57    0
  Large (20 < X <= 100)                   0         1           0        0  116          0             0     191    1
  Medium (5 < X <= 20)                    0         4           0        0  181          0             0     328    1
  Small (1 < X <= 5)                      0         2           0        1  281          0             0     604    4
  Single (0 < X <= 1)                     1        30           3       11  653          0             0    9712   88

This is a little bit more manageable. Let’s make this a dataframe and clean it up to get ready to plot.

# Convert the filtered contingency table to a dataframe
cloneSize_immgen_df <- as.data.frame(cloneSize_immgen_table)

# Rename columns
colnames(cloneSize_immgen_df) <- c("cloneSize", "immgen_singler_main", "Count")
# Plot stacked bar plot
ggplot(cloneSize_immgen_df, aes(x = immgen_singler_main, y = Count, fill = cloneSize)) +
  geom_bar(stat = "identity") +
  labs(x = "immgen_singler_main", y = "Count", title = "Stacked Bar Plot of cloneSize by immgen_singler_main") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The different size categories of clones by cell type

Data wrangling and visualization take a lot of finagling to get just right, especially as your dataset grows and your information becomes more complex. It usually takes some time to figure out how to visualize your data and then playing with that visualization to get it just right.

Resources

There are many, many more tools that can be used for TCR/BCR analysis. This github page summerizes a few!

scRepertoire

CellRanger VDJ Annotations

Differentiation/Trajectory analysis | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Differentiation/Trajectory analysis

Trajectory Analysis Using CytoTRACE

Single-cell sequencing gives us a snapshot of what a population of cells is doing. This means we should see many different cells in many different phases of a cell’s lifecycle. We use trajectory analysis to place our cells on a continuous ‘timeline’ based on expression data. The timeline does not have to mean that the cells are ordered from oldest to youngest (although many analysis uses trajectory to quantify developmental time). Generally, tools will create this timeline by finding paths through cellular space that minimize the transcriptional changes between neighboring cells. So for every cell, an algorithm asks the question: what cell or cells is/are most similar to the current cell we are looking at? Unlike clustering, which aims to group cells by what type they are, trajectory analysis aims to order the continuous changes associated with the cell processes.

The metric we use for assigning positions is called pseudotime. Pseudotime is an abstract unit of progress through a dynamic process. When we base our trajectory analysis on the transcriptomic profile of a cell, less mature cells are assigned smaller pseudotimes, and more mature cells are assigned larger pseudotimes.

Performing trajectory analysis on epithelial cells

Earlier we confirmed that our epithelial cell population corresponded to our tumor population. Then, through differential expression analysis, we saw that the epithelial cells form two distinct clusters that we identified as luminal and basal cells. We can further confirm this conclusion by using trajectory analysis to assign pseudotime values to the epithelial cells. We expect see that basal cells are less differentiated than luminal cells.

epcam Clusters

Subsetting epithelial cells

First, let’s load our libraries and our preprocessed object.

library(Seurat)
library(CytoTRACE)
library(monocle3)
library(ggplot2)


rep135 <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

Let’s see what clusters our epithelial cells are located in.

DimPlot(rep135, group.by = 'immgen_singler_main', label = TRUE) + 
  DimPlot(rep135, group.by = 'seurat_clusters_res0.8', label = TRUE) 

We can specifically highlight these cells for further clarity.

Epithelial_cells = rep135$immgen_singler_main =="Epithelial cells"
highlighted_cells <- WhichCells(rep135, expression = immgen_singler_main =="Epithelial cells")
DimPlot(rep135, reduction = 'umap', group.by = 'orig.ident', cells.highlight = highlighted_cells)

We already know that these two clusters can be separated into basal and luminal cells. We can see the distinction between these two types using markers. For example lets plot basal cell markers:

FeaturePlot(object = rep135, features = c("Cd44", "Krt14", "Krt5", "Krt16", "Krt6a"))

To more comprehensively see where the basal and luminal cells are, we can create a cell type score by averaging the expression of a group of genes and ploting the score.

cell_type_Basal_marker_gene_list <- list(c("Cd44", "Krt14", "Krt5", "Krt16", "Krt6a"))
rep135 <- AddModuleScore(object = rep135, features = cell_type_Basal_marker_gene_list, name = "cell_type_Basal_score") 
FeaturePlot(object = rep135, features = "cell_type_Basal_score1")

cell_type_Luminal_marker_gene_list <- list(c("Cd24a", "Erbb2", "Erbb3", "Foxa1", "Gata3", "Gpx2", "Krt18", "Krt19", "Krt7", "Krt8", "Upk1a"))
rep135 <- AddModuleScore(object = rep135, features = cell_type_Luminal_marker_gene_list, name = "cell_type_Luminal_score")
FeaturePlot(object = rep135, features = "cell_type_Luminal_score1")

For ease and clarity, let’s subset our Seurat object to just the epithelial cell clusters.

### Subsetting dataset epithelial
rep135 <- SetIdent(rep135, value = 'seurat_clusters_res0.8')
rep135_epithelial <- subset(rep135, idents = c('9', '12')) # 1750

#confirm that we have subset the object as expected visually using a UMAP
DimPlot(rep135, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
  DimPlot(rep135_epithelial, group.by = 'seurat_clusters_res0.8', label = TRUE)

#confirm that we have subset the object as expected by looking at the individual cell counts
table(rep135$seurat_clusters_res0.8)
table(rep135_epithelial$seurat_clusters_res0.8)

Run CytoTRACE

Now let’s run CytoTRACE. CytoTRACE (Cellular (Cyto) Trajectory Reconstruction Analysis using gene counts and Expression) is a computational method that predicts the differentiation state of cells from single-cell RNA-sequencing data. CytoTRACE uses gene count signatures (GCS), or the correlation between gene count and gene expression levels to capture differentiation states.

# create a dataframe of expression values to pass to Cytotrce
rep135_epithelial_expression <- data.frame(GetAssayData(object = rep135_epithelial, layer = "data"))

rep135_epithelial_cytotrace_scores <- CytoTRACE(rep135_epithelial_expression, ncores = 1)

We then have to get the Cytotrace data into the correct format.

rep135_epithelial_cytotrace_transposed <- as.data.frame(rep135_epithelial_cytotrace_scores$CytoTRACE) 
names(rep135_epithelial_cytotrace_transposed) <- "cytotrace_scores"

head(rep135_epithelial_cytotrace_transposed)

# fix the barcode formatting 
rownames(rep135_epithelial_cytotrace_transposed) <- sub("\\.", "-", rownames(rep135_epithelial_cytotrace_transposed))
rownames(rep135_epithelial_cytotrace_transposed)

Add the data to the object

rep135_epithelial <- AddMetaData(rep135_epithelial, rep135_epithelial_cytotrace_transposed %>% select("cytotrace_scores"))

rep135_epithelial[['differentiation_scores']] <- 1 - rep135_epithelial[['cytotrace_scores']] # Let's also reverse out CytoTRACE scores so that high means more differentiated and low means less differentiated

FeaturePlot(object = rep135_epithelial, features = c("cell_type_Basal_score1", "cell_type_Luminal_score1", "cytotrace_scores", "differentiation_scores"))

Cytotrace scores of epitheial cells

Subsetting to just luminal cells

The most important part of trajectory analysis is to make sure you have some biological reasoning to back up the pseudotime values. The best practice for pseudotime means using it to support a biological pattern that has already been observed by some other method.

Let’s separate our luminal cells from the basal cells and perform trajectory analysis.

## Subset to just luminal cells
DimPlot(rep135_epithelial) # cluster 10 is our luminal cells
rep135_luminal <- subset(rep135_epithelial, idents = c('9')) # 863 cells

We can also use the CytoTRACE R package to calculate our CytoTRACE scores.

# Creating a dataframe to pass to CytoTRACE
rep135_luminal_expression <- data.frame(GetAssayData(object = rep135_luminal, layer = "data"))

rep135_luminal_cytotrace_scores <- CytoTRACE(rep135_luminal_expression, ncores = 1)

rep135_luminal_cytotrace_transposed <- as.data.frame(rep135_luminal_cytotrace_scores$CytoTRACE) 
names(rep135_luminal_cytotrace_transposed) <- "cytotrace_scores"

head(rep135_luminal_cytotrace_transposed)

# fix the barcode formatting 
rownames(rep135_luminal_cytotrace_transposed) <- sub("\\.", "-", rownames(rep135_luminal_cytotrace_transposed))
rownames(rep135_luminal_cytotrace_transposed)

We then can add those CytoTRACE scores to our luminal cell object and visualize them.

rep135_luminal <- AddMetaData(rep135_luminal, rep135_luminal_cytotrace_transposed)

rep135_luminal[['cytotrace_scores_luminal']] <- 1 - rep135_luminal[['cytotrace_scores']]
rep135_luminal[['differentiation_scores_luminal']] <- 1 - rep135_luminal[['differentiation_scores']]

# compare all epithelial cells CytoTRACE scores to the luminal-only CytoTRACE
(FeaturePlot(object = rep135_luminal, features = c("cytotrace_scores_luminal")) +
    ggtitle("Luminal Cells CytoTRACE Scores")) +
  (FeaturePlot(object = rep135_luminal, features = c("differentiation_scores_luminal")) +
      ggtitle("Luminal Cells Differentiation Scores"))

Cytotrace scores of luminal cells

CytoTRACE will force all given cells onto the same scale meaning that there has to be cells at both the low and high ends of differentiation. The CytoTRACE scores could be a reflection of cell cycling genes. We can check by using a feature plot to compare the S-phase genes, G2/M-phase genes, and differentiation scores.

# View the S-phase genes, G2/M-phase genes, and the Phase to see if that explains the differentiation score
FeaturePlot(object = rep135_luminal, features = c("S.Score", "G2M.Score")) + 
   DimPlot(rep135_luminal, group.by = "Phase")
 
FeaturePlot(object = rep135_luminal, features = c("S.Score", "G2M.Score", "differentiation_scores"))

We still don’t see a clear pattern, which illustrates the challenge and the danger of pseudotime.

Exercise: Create a subset of the Seurat object. You could explore the differences between the T-Cells populations, stem cells vs epithelial cells, or choose your own subset. Then run CytoTRACE on the subsetted dataset either using the webtool or the R package. Do the pseudotime scores make sense? Are there biological factors that support the CytoTRACE calculated trajectory?

Note: CytoTRACE will crash in the posit environment if you give it too many cells, so if there are several cell populations that you want to compare you can use the subset function to downsample your cell types. Make sure your Idents are set to the category you would like to subset too!

merged_subset <- subset(x = merged, downsample = 100)

Comparing CytoTRACE to Monocle3

Now that we have convinced ourselves that we somewhat trust the results of CytoTRACE, we can try the algorithms on the entire dataset and compare it to another trajectory analysis method. Another popular method used is Monocle3. Monocle3 is an analysis toolkit for scRNA and has many of the functions that Seurat has. We can use Monocle3’s trajectory algorithm but since it uses its own unique data structure, we will have to convert our subsetted object to a cell data set object. Luckily, there are tools that make that conversion relatively easy.

You can also refer to the full Monocle3 trajectory tutorial.

Before we start just running our data through the algorithm and seeing what we get, we should consider what we expect to get. Let’s remember what cell types we have in our dataset and where they are on our UMAP. What pseudotime scores do you expect to be assigned to the clusters?

DimPlot(rep135, group.by = 'immgen_singler_main', label = TRUE)

Overview of haematopoiesis

Haematopoiesis and red blood cells

Running Monocle3

Let’s now run Monocle3, again we have to convert our Seurat object to a Monocle ‘Cell Data Set’. We will use a package made for this specific purpose.

rep135_cds <- SeuratWrappers::as.cell_data_set(rep135)

Then we run the Monocle function cluster_cells. This function will redo unsupervised clusters and calculate partitions which are groups of cells that Monocle3 puts on separate trajectories. We don’t need the cells to be clustered since we already did that in Seurat but Monocle requires that partitions be calculated for its trajectory functions.

rep135_cds <- cluster_cells(rep135_cds)

Use the Monolce3 plotting functions to visualize partitions. Then we will execute the function learn_graph which will build the trajectory. We will set the use_partition parameter to FALSE so that we learn a trajectory across all clusters. Later you can come back and try setting it to TRUE and see what happens!

plot_cells(rep135_cds, show_trajectory_graph = FALSE, color_cells_by = "partition")

# Monocle will create a trajectory for each partition, but we want all our clusters
# to be on the same trajectory so we will set `use_partition` to FALSE when 
# we learn_graph

rep135_cds <- learn_graph(rep135_cds, use_partition = FALSE) # graph learned across all partitions

Monocle3 requires you to choose a starting point or root for the calculated trajectories. Running the function order_cells will open a pop-up window where you can interactively choose where you want your roots to be.

rep135_cds <- order_cells(rep135_cds)
# Pick a root or multiple roots

Plot the pseudotime:

plot_cells(rep135_cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1)

When we choose a root around where the stem cells are located we see that the fibroblast and epithelial cells end with the higher pseudotime scores.

Monocle Pseudotime with Stem Cells as root

But if we choose are roots to be stem cells, fibroblasts, and epithelial cells, we see that monocle changes the pseudotime orderings accordingly.

Monocle Pseudotime with multiple roots

Loading in the CytoTRACE scores

We need a significant amount of computational power to run CytoTRACE on all cells so we have run CytoTRACE on a computing cluster and have saved the results to be loaded in and added to our seurat object. Of course, we have to make sure the data is formatted correctly.

# Cytotrace for all cells

# read in the cytotrace scores
rep135_cytotrace <- read.table("data/single_cell_rna/reference_files/rep135_cytotrace_scores.tsv", sep="\t") 
head(rep135_cytotrace)


rep135_cytotrace_transposed <- t(rep135_cytotrace)
colnames(rep135_cytotrace_transposed) <- "CytoTRACE"
head(rep135_cytotrace_transposed)

rownames(rep135_cytotrace_transposed) <- sub("\\.", "-", rownames(rep135_cytotrace_transposed))
rownames(rep135_cytotrace_transposed)

Now we can add the scores to our seurat object and also create an inverse CytoTRACE score for clarity. So our differentiation score will be 0 for least differentiated (smallest pseudotime) and 1 being most differentiated (biggest pseudotime).

# Add CytoTRACE scores matching on the cell barcodes
rep135 <- AddMetaData(rep135, rep135_cytotrace_transposed)
rep135[["differentiation_score"]] <- 1 - rep135[["CytoTRACE"]]

Finally, let’s compare the pseudotime values to our cell types. Do we get the results we want to get? Does Monocle and CytoTRACE agree with each other? What happens if you choose different roots for the Monocle pseudotime?

plot_cells(rep135_cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1) + 
  (FeaturePlot(rep135, features = 'differentiation_score')) +
DimPlot(rep135, group.by = 'immgen_singler_main', label = TRUE)

Monocle pseudotime compared with CytoTRACE pseudtime

Exercise: Using Trajectory to Analyze Monocyte Differentiation

For a final exercise, we can apply the same steps as above to analyze another group of cells: macrophages and monocytes.

highlight = rep135$immgen_singler_main =="Macrophages"
highlighted_cells <- WhichCells(rep135, expression = immgen_singler_main =="Macrophages")
# Plot the UMAP
DimPlot(rep135, reduction = 'umap', group.by = 'orig.ident', cells.highlight = highlighted_cells)


highlight = rep135$immgen_singler_main =="Monocytes"
highlighted_cells <- WhichCells(rep135, expression = immgen_singler_main =="Monocytes")
# Plot the UMAP
DimPlot(rep135, reduction = 'umap', group.by = 'orig.ident', cells.highlight = highlighted_cells)


# grab all cells that are macrophages and monocytes
Idents(rep135) <- "immgen_singler_main" 
rep135_macro_mono_cells <- subset(rep135, idents = c("Macrophages", "Monocytes"), invert = FALSE) # 1092

DimPlot(rep135_macro_mono_cells, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
  DimPlot(rep135_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE) 

# grab all cells that are macrophages and monocytes, we can subset by clusters 6 and 14 which seem to contain 
Idents(rep135) <- "seurat_clusters_res0.8" 
rep135_macro_mono_cells <- subset(rep135, idents = c(6, 14), invert = FALSE) # 1350
DimPlot(rep135_macro_mono_cells, group.by = 'seurat_clusters', label = TRUE) + 
  DimPlot(rep135_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE) 

DimPlot(rep135_macro_mono_cells, group.by = 'immgen_singler_fine', label = TRUE)
# create a data frame with the counts from our subsetted obect
rep135_macro_mono_cells_expression <- data.frame(GetAssayData(rep135_macro_mono_cells, layer = "data")) 

# pass that dataframe to the CytoTRACE function
rep135_macro_mono_cells_cytotrace_scores <- CytoTRACE(rep135_macro_mono_cells_expression, ncores = 4)

# Create a dataframe out of the CytoTRACE scores
rep135_macro_mono_cells_cytotrace_scores_df <- as.data.frame(rep135_macro_mono_cells_cytotrace_scores$CytoTRACE)

# Make the rownames of the cytotraace scores function the cell barcodes and rename the CytoTRACE scores column approproately
rownames(rep135_macro_mono_cells_cytotrace_scores_df) <- sub("\\.", "-", rownames(rep135_macro_mono_cells_cytotrace_scores_df))

Now we will incorporate our CytoTRACE scores into our Seurat object.

# Add CytoTRACE scores matching on the cell barcodes
rep135_macro_mono_cells <- AddMetaData(rep135_macro_mono_cells, rep135_macro_mono_cells_cytotrace_scores_df)

CytoTRACE assignes the least differentiated cells a score of 1 and the most differentiated cells a score of 0, which is sometimes not inutive. So lets create a inverse CytoTRACE score which we will call our differentiation score.

rep135_macro_mono_cells[['differentiation_scores']] <- 1 - rep135_macro_mono_cells[['CytoTRACE']]

# Plot the results
DimPlot(rep135_macro_mono_cells, group.by = 'seurat_clusters', label = TRUE) + 
  DimPlot(rep135_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE) +
  FeaturePlot(rep135_macro_mono_cells, features = 'differentiation_scores')

Running Monocole

Let’s compare our CytoTRACE scores to Monocle3’s trajectory calculations.

cds <- SeuratWrappers::as.cell_data_set(rep135_macro_mono_cells)

cds <- cluster_cells(cds)

# View our clusters
plot_cells(cds, show_trajectory_graph = FALSE, color_cells_by = "partition")

cds <- learn_graph(cds, use_partition = FALSE) # graph learned across all partitions

cds <- order_cells(cds) # choose your root(s)
plot_cells(cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1)
plot_cells(cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1) + 
  (FeaturePlot(rep135_macro_mono_cells, features = 'CytoTRACE') + scale_color_viridis(option = 'magma', discrete = FALSE)) +
  (FeaturePlot(rep135_macro_mono_cells, features = 'differentiation_score') + scale_color_viridis(option = 'magma', discrete = FALSE)) +
  DimPlot(rep135_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE)

Macrophages differentiation

Selecting a root in Monocle3

Monocle3 proposes a method to choose a root more precisely in its trajecotry tutorial. It requires you to do the preprocessing steps specific to Monocle3.

rep135_cds <- SeuratWrappers::as.cell_data_set(rep135)

# Monocle preprocessing steps
rep135_cds <- preprocess_cds(rep135_cds, num_dim = 50, method = 'PCA')
# rep135_cds <- align_cds(rep135_cds, alignment_group = "orig.ident") # removes batch effect by fitting its a linear model to the cells
rep135_cds <- reduce_dimension(rep135_cds, preprocess_method = 'PCA', reduction_method="UMAP") # calculate UMAPs

rep135_cds <- cluster_cells(rep135_cds)

plot_cells(rep135_cds, show_trajectory_graph = FALSE, color_cells_by = "immgen_singler_main")

# monocle will create a trajectory for each partition, but we want all our clusters
# to be on the same trajectory so we will set `use_partition` to FALSE when 
# we learn_graph

rep135_cds <- learn_graph(rep135_cds, use_partition = FALSE) # graph learned across all partitions

# rep135_cds <- order_cells(rep135_cds)
# Pick a root or multiple roots -- stem cell? or stem cell, fibroblasts, epithelial cells

# a helper function to identify the root principal points:
cell_ids <- which(colData(rep135_cds)[, "seurat_clusters_res0.8"] == 5)

closest_vertex <-
  rep135_cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex

closest_vertex <- as.matrix(closest_vertex[colnames(rep135_cds), ])

root_pr_nodes <-
    igraph::V(principal_graph(rep135_cds)[["UMAP"]])$name[as.numeric(names
                                                              (which.max(table(closest_vertex[cell_ids,]))))]
root_pr_nodes

rep135_cds <- order_cells(rep135_cds, root_pr_nodes=root_pr_nodes)

plot_cells(rep135_cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1)

Running Cytotrace Online

First, we have to export the counts.

rep135_mtx <- GetAssayData(rep135_epithelial, layer = "counts")
write.csv(rep135_mtx, "outdir/rep135_epithelial_counts.csv")

Then export the file from posit. In the file window select rep135_epithelial_counts.csv. Then go to More -> Export… and click Download.

Now we go to https://cytotrace.stanford.edu/. We will navigate to the Run CytoTRACE tab on the left menu bar and upload our downloaded csv in the Upload gene expression table. We will not worry about uploading any other files as of now but if we had a larger dataset we could provide cell type and batch information for our cells.

When uploaded click Run CytoTRACE. This may take a few minutes.

Uploading to Cytotrace

We can then spend some time exploring CytoTRACE scores. For CytoTRACE, warmer colors mean less differentiation, and cooler colors mean more differentiated. We can use the Gene radio button to plot the expression of different marker genes for Basal and Luminal cells.

Plotting Krt5 compared to CytoTRACE scores

Plotting Cd24a compared to CytoTRACE scores

Once we have verified that the CytoTRACE scores are assigned in a way that corresponds with our biological knowledge we can click the Download CytoTRACE results button at the top left of the page.

Import CytoTRACE scores

We now want to add our CytoTRACE scores to our Seurat object. So we navigate to the Upload button and select our exported CytoTRACE scores from where the file was downloaded. We then read it into R and add the scores to our Seurat object as another column in the meta.data.


cytotrace_scores <- read.table("outdir/CytoTRACE_results.txt", sep="\t") # read in the cytotrace scores

rownames(cytotrace_scores) <- sub("\\.", "-", rownames(cytotrace_scores)) # the barcodes export with a `.` instead of a '-' at the end of the barcode so we have to remedy that before joining the cytotrace scores unto our seurat object

# Add CytoTRACE scores matching on the cell barcodes
rep135_epithelial <- AddMetaData(rep135_epithelial, cytotrace_scores %>% select("CytoTRACE"))

Now we can plot our basal cell markers, our luminal cell markers, and the CytoTRACE scores together to compare. Since it is a little unintuitive that less differentiated scores are closer to 1 we will also create a differentiation_score which will be an inverse of our CytoTRACE scores so that smaller scores mean less differentiated and larger scores mean more differentiated.

rep135_epithelial[['differentiation_scores']] <- 1 - rep135_epithelial[['CytoTRACE']] # Let's also reverse out CytoTRACE scores so that high means more differentiated and low means less differentiated

FeaturePlot(object = rep135_epithelial, features = c("cell_type_Basal_score1", "cell_type_Luminal_score1", "cytotrace_scores", "differentiation_scores"))

Further Resources

R Tutorial

Python Tutorial

Sanbomics

Does Monocole Use Clusters to Calculate Pseudotime

Using_Monocle_For_Pseudotime_Trajectory