Welcome to the blog

Posts

My thoughts and ideas

Gene set enrichment and pathway analysis | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Gene set enrichment and pathway analysis


After carrying out differential expression analysis, and getting a list of interesting genes, a common next step is enrichment or pathway analyses. Broadly, enrichment analyses can be divided into two types- overrepresentation analysis and gene set enrichment analysis (GSEA).

Overrepresentation analysis takes a list of significantly differentially expressed (DE) genes and determines if these genes are all known to be differentially regulated in a certain pathway or geneset. It is primarily useful if we have a set of genes that are highly differentially expressed and we want to determine what process(es) they may be involved in. Mathematically, it calculates a p-value using a hypergeometric distribution to determine if a gene set (from a database) is significantly over-represented in our DE genes. A couple key points about overrepresentation analysis are that firstly, we get to determine the list of genes that are used as inputs. So, we can set a p-value and log2FC threshold that would in turn determine the gene list. Secondly, since the overrepresentation analysis does not use information about the foldchange values (only a list of genes) it is not directional. So if an overrepresentation analysis gives us a pathway or geneset as being significantly enriched, we are not getting any information about whether the genes in our list are responsible for activating or suppressing the pathway- we can only conclude that our genes are involved in that pathway in some way.

GSEA addresses the second point above by using a list of genes and their corresponding fold change values as inputs to the analysis. Another difference between GSEA and overrepresentation analysis is that in GSEA, we use all the genes as inputs without applying any filters based on log2FC or p-values. GSEA is useful in determining incremental changes at the gene expression level that may come together to have an impact on a specific pathway. GSEA ranks genes based on their ‘enrichment scores’ (ES), which measures the degree to which a set of genes is over-represented at the top or bottom of a list of genes that are ordered based on their log2FC values.

Another crucial part of any enrichment analysis is the database. The main pitfall to avoid is choosing multiple or broad databases as this can result in many spurious results. Therefore, when possible, it is better to choose a reference database based on its biological relevance.


There are various tools available for enrichment analysis, here we chose to use a tool called clusterProfiler. It allows us to perform both overrepresentation and GSEA analyses, is widely used by the field, and has quite a few helpful tutorials/resources available.

We will also use a web tool Enrichr for some of our analysis.

We will start by investigating the Epcam positive clusters we identified in the Differential Expression section. Let’s load in the R libraries we will need and read in the DE file we generated previously. Recall that we generated this file using the FindMarkers function in Seurat, and had ident.1 as cluster 9 and ident.2 as cluster 12. Therefore, we are looking at cluster 9 with respect to cluster 12, that is, positive log2FC values correspond to genes upregulated in cluster 9 or downregulated in cluster 12 and vice versa for negative log2FC values.

#load R libraries
library("Seurat")
library("ggplot2")
library("cowplot")
library("dplyr")
library("clusterProfiler")
library("org.Mm.eg.db")
library("msigdbr")
library("DOSE")
library("stringr")
library("enrichplot")

#read in the epithelial DE file with the first column (genes) as rownames or index
de_gsea_df <- read.csv('outdir_single_cell_rna/epithelial_de_gsea.tsv', sep = '\t', row.names = 1)

head(de_gsea_df)
#open this file in Rstudio and get a sense for the distribution of foldchange values and see if their p values are significant
#alternatively try making a histogram of log2FC values using ggplot and the geom_histogram() function. 
ggplot(de_gsea_df, aes(avg_log2FC)) + geom_histogram()
#You can also go one step further and impose a p-value cutoff (say 0.01) and plot the distribution. 

Overrepresentation analysis

You may notice that we have quite a few genes with fairly large fold change values- while fold change values do not impact the overrepresentation analysis, they can inform the thresholds we use for picking the genes. Since we know that we have quite a few genes with foldchanges greater/lower than +/- 2, we can use that as our cutoff. We will also impose an adj p-value cutoff of 0.01. Thus, for the overrepresentation analysis, we will begin by filtering de_gsea_df based on the log2FC and p-value, and then get the list of genes for our analysis.

#filter de_gsea_df by subsetting it to only include genes that are significantly DE (pval<0.01) and their absolute log2FC is > 2.
#The abs(de_gsea_df$avg_log2FC) ensures that we keep both the up and downregulated genes
overrep_df <- de_gsea_df[de_gsea_df$p_val_adj < 0.01 & abs(de_gsea_df$avg_log2FC) > 2,] 
overrep_gene_list <- rownames(overrep_df)

Next, we will set up our reference. By default clusterProfiler allows us to use the msigdb reference. While that works, here we will show how you can download a mouse specific celltype signature reference geneset from msigdb and use that for your analysis. We will use the M8 geneset from the msigdb mouse collections. We clicked on the Gene Symbols link on the right to download the dataset and uploaded that to your workspace. These files are in a gmt (gene matrix transposed) format, and can be read-in using an in-built R function, read.gmt. And once we have the reference data loaded, we will use the enricher function in the clusterProfiler library for the overrepresentation analysis. The inputs to the function include the DE gene list, the reference database, the statistical method for p-value adjustment, and finally a pvalue cutoff threshold. This generates an overrepresentation R object that can be input into visualization functions like barplot() and dotplot() to make some typical pathway analysis figures. We can also use the webtool, Enrichr, for a quick analysis against multiple databases. For this part, we will save the genelist we’re using for the overrepresentation analysis to a TSV file.

#read in the tabula muris gmt file
msigdb_m8 <- read.gmt('/cloud/project/data/single_cell_rna/reference_files/m8.all.v2023.2.Mm.symbols.gmt')
#click on the dataframe in RStudio to see how it's formatted- we have 2 columns, #the first with the genesets, and the other with genes that are in that geneset.
#try to determine how many different pathways are in this database
overrep_msigdb_m8 <- enricher(gene = overrep_gene_list, TERM2GENE = msigdb_m8, pAdjustMethod = "BH", pvalueCutoff = 0.05)

#visualize data using the barplot and dotplot functions
barplot(overrep_msigdb_m8, showCategory = 10)
dotplot(overrep_msigdb_m8, showCategory = 10)

#save overrep_gene_list to a tsv file (overrep_gene_list is our list of genes and 
#file is the name we want the file to have when it's saved. 
#The remaining arguments are optional- row.names=FALSE stops R from adding numbers (effectively an S.No column), 
#col.names gives our single column TSV a column name, 
#and quote=FALSE ensures the genes don't have quotes around them which is the default way R saves string values to a TSV)
write.table(x = overrep_gene_list, file = 'outdir_single_cell_rna/epithelial_overrep_gene_list.tsv', row.names = FALSE, col.names = 'overrep_genes', quote=FALSE)

Tumor cells Overrep plots

For the Enrichr webtool based analysis, we’ll open that TSV file in our Rstudio session, copy the genes, and paste them directly into the textbox on the right. The webtool should load multiple barplots with different enriched pathways. Feel free to click around and explore here. To compare the results against the results we generated in R, navigate to the Cell Types tab on the top and look for Tabula Muris.

An important component to a ‘good’ overrepresentation analysis is using one’s expertise about the biology in conjunction with the pathways identified to generate hypotheses. It is unlikely that every pathway in the plots above is meaningful, however knowledge of bladder cancer (for this dataset) tells us that basal and luminal bladder cancers share similar expression profiles to basal and luminal breast cancers ([ref])(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5078592/). So, the overrepresentation analysis showing genesets like ‘Tabula Muris senis mammary gland basal cell ageing’ and ‘Tabula muris senis mammary gland luminal epithelial cell of mammary gland ageing’ could suggest that the difference in unsupervised clusters 9 and 12 is coming from the basal and luminal cells. To investigate this further, we can compile a list of basal and luminal markers from the literature, generate a combined score for those genes using Seurat’s AddModuleScore function and determine if the clusters are split up as basal and luminal. For now we’ll use the same markers defined in this dataset’s original manuscript.

#define lists of marker genes
basal_markers <- c('Cd44', 'Krt14', 'Krt5', 'Krt16', 'Krt6a')
luminal_markers <- c('Cd24a', 'Erbb2', 'Erbb3', 'Foxa1', 'Gata3', 'Gpx2', 'Krt18', 'Krt19', 'Krt7', 'Krt8', 'Upk1a')

#read in the seurat object if it isn't loaded in your R session
merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

#use AddModuleScore to calculate a single score that summarizes the gene expression for each list of markers
merged <- AddModuleScore(merged, features=list(basal_markers), name='basal_markers_score')
merged <- AddModuleScore(merged, features=list(luminal_markers), name='luminal_markers_score')

#visualize these scores using FeaturePlot and VlnPlots
FeaturePlot(merged, features=c('basal_markers_score1', 'luminal_markers_score1'))
VlnPlot(merged, features=c('basal_markers_score1', 'luminal_markers_score1'), group.by = 'seurat_clusters_res0.8', pt.size=0)

Interesting! This analysis could lead us to conclude that cluster 12 is composed of basal epithelial cells, while cluster 9 is composed of luminal epithelial cells. Next, let’s see if we can use GSEA to determine if there are certain biological processes that are distinct between these clusters?

GSEA Analysis

For GSEA, we need to start by creating a named vector where the values are the log fold change values and the names are the gene’s names. Recall that GSEA analysis relies on identifying any incremental gene expression changes (not just those that are statistically significant), so we will use our original unfiltered dataframe to get these values. This will be used as input to the gseGO function in the clusterProfiler library, which uses gene ontology for GSEA analysis. The other parameters for the function include OrgDb = org.Mm.eg.db, the organism database from where all the pathways’ genesets will be determined; ont = "ALL", specifies the subontologies, with possible options being BP (Biological Process), MF (Molecular Function), CC (Cellular Compartment), or ALL; keyType = "SYMBOL" tells gseGO that the genes in our named vector are gene symbols as opposed to Entrez IDs, or Ensembl IDs; and pAdjustMethod="BH" and pvalueCutoff=0.05 specify the p-value adjustment statistical method to use and the corresponding cutoff.

#read in the epithelial de df we generated previously
de_gsea_df <- read.csv('outdir_single_cell_rna/epithelial_de_gsea.tsv', sep = '\t', row.names = 1)
#if you can't find the file in your session, we have uploaded a version for you
#de_gsea_df <- read.csv('/cloud/project/data/single_cell_rna/backup_files/epithelial_de_gsea.tsv', sep = '\t')

#get all the foldchange values from the original dataframe
gene_list <- de_gsea_df$avg_log2FC
#set names for this vector to gene names
names(gene_list) <- rownames(de_gsea_df)
#sort list in descending order of log2FC values as that is required by the gseGO function
gene_list = sort(gene_list, decreasing = TRUE)

#now we can run the gseGO function
gse <- gseGO(geneList=gene_list, 
             OrgDb = org.Mm.eg.db,
             ont ="ALL",              
             keyType = "SYMBOL", 
             pAdjustMethod = "BH",             
             pvalueCutoff = 0.05,
             nPermSimple=100000)
#explore the gse object by opening it in RStudio. 
#It basically has a record of all the parameters and inputs used for the function, 
#along with a results dataframe.
#we can pull this result dataframe out to view it in more detail
gse_result <- gse@result

Looking at the dataframe, you might notice that there are around 900 rows, and similar to the overrepresentation analysis, it is unlikely that all of them are truly meaningful. You can skim through all these results to determine which ones might be biologically meaningful, but in the interest of time, here we will subset the gse object to any pathways that have the word epithelial in them, and use those for plotting. In order to subset the object, we will determine the indices of the rows which have these values using R’s which and grepl functions, and then subset the results dataframe in the gse object to those rows. All of these genesets end up having negative enrichment scores (or are downregulated in our putative luminal cell cluster), so we will add one index that has a positive enrichment score to aid in plotting. For plotting, we will use the dotplot, cnetplot, and heatplot functions. Note- we do not suggest this strategy of grabbing pathways of interest in this manner- we are only using it as an illustrative example here.

#start by grabbing the indices we'll need to subset the `gse` object
#epithelial indices from gse object using which and grepl
epithelial_indices <- which(grepl("epithelial", gse@result$Description))
#index for the most positive enrichment score using which.max()
positive_index <- which.max(gse@result$enrichmentScore)
#concatenate these indices to get the list of indices that will be used to subset the gsea object
subset_indices <- c(positive_index,epithelial_indices)

#now create a new gse_epithelial object and subset the gse object to these indices
gse_epithelial <- gse
gse_epithelial@result <- gse_epithelial@result[subset_indices,]

#plot!
#dotplot - splitting by 'sign' and facet_grid together allow us to separate activated and suppressed pathways
dotplot(gse_epithelial, showCategory=20, split=".sign") + facet_grid(.~.sign) 

#heatplot - allows us to see the genes that are being considered 
#for each of the pathways/genesets and their corresponding fold change
heatplot(gse_epithelial, foldChange=gene_list)

#cnetplot - allows us to see the genes along with the various 
#pathways/genesets and how they related to each other
cnetplot(gse_epithelial, foldChange=gene_list)

Tumor cells GSEA plots

Based on these results, we could conclude that cluster 9 (putative luminal cells) have lower expression of quite a few pathways related to epithelial cell proliferation compared to cluster 12 (putative basal cells).

Differential expression analysis | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Differential expression analysis


In this section we will use the previously generated Seurat object that has gone through the various preprocessing steps, clustering, and celltyping, and use it for gene expression and differential expression analyses. We will carry out two sets of differential expression analyses in this module. Firstly, since we know that the tumor cells should be epithelial cells, we will begin by trying to identify epithelial cells in our data using expression of Epcam as a marker. Subsequently, we will carry out a DE analysis within the Epcam-positive population(s). Secondly, we will compare the T cell populations of mice treated with ICB therapy against mice with their T cells depleted that underwent ICB therapy to determine differences in T cell phenotypes. We will also briefly explore pseudobulk differential expression analysis. Pseudobulk DE analysis may be more robust in situations where one is comparing conditions and has multiple replicates in their experiment.


Read-in the saved seurat object from the previous step if it is not already loaded in your current R session.

library(Seurat)
library(dplyr)
library(EnhancedVolcano)
library(presto)
merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

#alternatively we have a seurat object from the last step saved here if you need it.
merged <- readRDS('/cloud/project/data/single_cell_rna/backup_files/preprocessed_object.rds')

Gene expression analysis for epithelial cells

Use Seurat’s FeaturePlot function to color each cell by its Epcam expression on a UMAP. FeaturePlot requires at least 2 arguments- the seurat object, and the ‘feature’ you want to plot (where a ‘feature’ can be a gene, PC scores, any of the metadata columns, etc.). To customize the FeaturePlot, please refer to Seurat’s documentation here

FeaturePlot(merged, features = 'Epcam')

While there are some Epcam positive cells scattered on the UMAP, there appear to be 2 clusters of cells in the UMAP that we may be able to pull apart as potentially being malignant cells.

There are a few different ways to go about identifying what those clusters are. We can start by trying to use the DimPlot plotting function from before along with the FeaturePlot function. Separating plots by the + symbol allows us to plot multiple plots side-by-side.

DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
FeaturePlot(merged, features = 'Epcam') + 
DimPlot(merged, group.by = 'immgen_singler_main')

EPCAM cluster singler feature and dimplot

While the plots generated by the above commands make it pretty clear that the clusters of interest are clusters 9 and 12, sometimes it is trickier to determine which cluster we are interested in solely from the UMAP as the clusters may be overlapping. In this case, a violin plot VlnPlot may be more helpful. Similar to FeaturePlot, VlnPlot also takes the Seurat object and features as input. It also requires a group.by argument that determines the categorical variable that will be used to group the cells.
To learn more about customizing a Violin plot, please refer to the Seurat documentation

VlnPlot(merged, group.by = 'seurat_clusters_res0.8', features = 'Epcam')

EPCAM cluster singler vlnplot

Great! Looks like we can confirm that clusters 9 and 12 have the highest expression of Epcam. However, it is interesting that they are split into 2 clusters. This is a good place to use differential expression analysis to determine how these clusters differ from each other.

Differential expression for epithelial cells

We can begin by restricting the Seurat object to the cells we are interested in. We will do so using Seurat’s subset function, that allows us to create an object that is filtered to any values of interest in the metadata column. However, we first need to set the default identity of the Seurat object to the metadata column we want to use for the ‘subset’, and we can do so using the SetIdent function. After subsetting the object, we can plot the original object and subsetted object side-by-side to ensure the subsetting happened as expected. We can also count the number of cells of each type to confirm the same.

#set ident to seurat clusters metadata column and subset object to Epcam positive clusters
merged <- SetIdent(merged, value = 'seurat_clusters_res0.8')
merged_epithelial <- subset(merged, idents = c('9', '12'))

#confirm that we have subset the object as expected visually using a UMAP
DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
DimPlot(merged_epithelial, group.by = 'seurat_clusters_res0.8', label = TRUE)

#confirm that we have subset the object as expected by looking at the individual cell counts
table(merged$seurat_clusters_res0.8)
table(merged_epithelial$seurat_clusters_res0.8)

Epithelial subset object dimplot

Now we will use Seurat’s FindMarkers function to carry out a differential expression analysis between both groups. FindMarkers also requires that we use SetIdent to change the default ‘Ident’ to the metadata column we want to use for our comparison. More information about FindMarkers is available here.

Note that here we use FindMarkers to compare clusters 9 and 12. The default syntax of FindMarkers requires that we provide each group of cells as ident.1 and ident.2. The output of FindMarkers is a table with each gene that is differentially expressed and its corresponding log2FC. The direction of the log2FC is of ident.1 with respect to ident.2. Therefore, genes upregulated in ident.1 have positive log2FC, while those downregulated in ident.1 have negative log2FC. The min.pct=0.25 argument tests genes that are expressed in 25% of cells in either of the ident.1 or ident.2 groups. This can help reduce false positives as the genes must be expressed in a greater proportion of the cells compared to the default value of 1%. The logfc.threshold=0.1 parameter ensures our results only include genes that have a fold change of less than -0.1 or more than 0.1. Increasing the min.pct and logfc.threshold parameters can also result in the function running faster by reducing the number of genes being tested.

#carry out DE analysis between both groups
merged_epithelial <- SetIdent(merged_epithelial, value = "seurat_clusters_res0.8")
epithelial_de <- FindMarkers(merged_epithelial, ident.1 = "9", ident.2 = "12", min.pct=0.25, logfc.threshold=0.1) #how cluster 9 changes wrt cluster 12

On opening epithelial_de in your RStudio session, you’ll see that it is a dataframe with the genes as rownames, and the following columns- p_val, avg_log2FC, pct.1, pct.2, p_val_adj. The p-values are dependent on the test used while running FindMarkers, and the adjusted p-value is based on the bonferroni correction test. pct.1 and pct.2 are the percentages of cells where the gene is detected in the ident.1 and ident.2 groups respectively.

Next we can subset this dataframe to only include DE genes that have a significant p-value, and then further subset the ‘significant DE genes only’ dataframe to the top 20 genes with the highest absolute log2FC. Looking at the absolute log2FC allows us to capture both, upregulated and downregulated genes.

#restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
epithelial_de_sig <- epithelial_de[epithelial_de$p_val_adj < 0.001,] 

#get the top 20 genes by fold change
epithelial_de_sig_top20 <- epithelial_de_sig %>%
  top_n(n = 20, wt = abs(avg_log2FC)) 

epithelial_de_sig_top20 is a dataframe that is restricted to the top20 most differentially expressed genes by log2FC.

There are a few ways we can visualize the differentially expressed genes. We’ll start with the Violin and Feature plots from before. We can also visualize DEs using a DotPlot that allows us to capture both the average expression of a gene and the % of cells expressing it. In addition to these in-built Seurat functions, we can also generate a volcano plot using the EnhancedVolcano package. For the volcano plot, we can use the unfiltered DE results as the function colors and labels genes based on parameters (pCutoff, FCcutoff) we specify.

#get list of top 20 DE genes for ease
epithelial_de_sig_top20_genes <- rownames(epithelial_de_sig_top20)

#plot all 20 genes in violin plots
VlnPlot(merged_epithelial, features = epithelial_de_sig_top20_genes, 
  group.by = 'seurat_clusters_res0.8', ncol = 5, pt.size = 0)

#plot all 20 genes in UMAP plots
FeaturePlot(merged_epithelial, features = epithelial_de_sig_top20_genes, ncol = 5)

#plot all 20 genes in a DotPlot
DotPlot(merged_epithelial, features = epithelial_de_sig_top20_genes, 
  group.by = 'seurat_clusters_res0.8') + RotatedAxis()

#plot all differentially expressed genes in a volcano plot
EnhancedVolcano(epithelial_de,
  lab = rownames(epithelial_de),
  x = 'avg_log2FC',
  y = 'p_val_adj',
  title = 'Cluster9 wrt Cluster 12',
  pCutoff = 0.05,
  FCcutoff = 0.5,
  pointSize = 3.0,
  labSize = 5.0,
  colAlpha = 0.3)

To find out how we can figure out what these genes mean, stay tuned! The next module on pathway analysis will help shed some light on that. For now, let’s create a TSV file containing our DE results for use later on. We will need to rerun FindMarkers with slightly different parameters for this- we will change the logfc.threshold parameter to 0, as one of the pathway analysis tools requires all genes to be included in the analysis (more on that later).

#rerun FindMarkers
epithelial_de_gsea <- FindMarkers(merged_epithelial, ident.1 = "9", 
  ident.2 = "12", min.pct=0.25, logfc.threshold=0)
#save this table as a TSV file (first move index to first column)
epithelial_de_gsea <- tibble::rownames_to_column(epithelial_de_gsea, var = "gene")
write.table(x = epithelial_de_gsea, file = 'outdir_single_cell_rna/epithelial_de_gsea.tsv', sep='\t', row.names = FALSE)

Classic single-cell based differential expression analysis is notorious for having a high number of false positives. This has been attributed to a few reasons such as each cell being considered an independent observation resulting in inflated p-values and it not taking replicates into account. Psuedobulk based differential expression analyses can be incorporated to at least in part get around these (Junttila S, Smolander J, et al. (2022); Squair, J.W. et al. (2021)). Depending on the type of experiment, you’re unlikely to always have replicates, but here we do, therefore, we will carry out a pseudobulk DE analysis and compare the results against our conventional single-cell DE analysis. The first step is to generate a pseudobulk Seurat object using Seurat’s in-built function AggregateExpression, more information on the function is available here and Seurat’s pseudobulk DE vignette is here. The function makes groups based on the metadata columns we specify, sums the counts for each gene within the group, and then log normalizes and scales the data. The resulting object is structured like our single-cell Seurat object with different layers for the raw counts (counts), log normalized (data) and scaled (scale.data) data. However, instead of the rows being our cell barcodes, they are a combination of the grouping categories we used as an input. Here, we will use the sample (given by orig.ident) as our replicates and the differential expression analysis will use DESeq2 to compare clusters 9 and 12. Specifying DESeq2 in the test.use argument will allow us to use Seurat’s implementation of DESeq2. There are quite a few other tests that can be used, we primarily chose DESeq2 here because it was in Seurat’s DE vignette linked previously, but we encourage you to look into the various other tests as well.

# Aggregate counts for each sample and cluster combination
pb_epithelial <- AggregateExpression(merged_epithelial, assays = 'RNA', 
                                         return.seurat = T, group.by = c('orig.ident', 'seurat_clusters_res0.8'))
# See the first few rows of the log normalized data layer
print(head(pb_epithelial@assays$RNA$data))

#Change ident to seurat clusters
Idents(pb_epithelial) <- "seurat_clusters_res0.8"

# Use FindMarkers with DESeq2 as the test to compare cluster 9 wrt cluster 12
pb_epithelial_de <- FindMarkers(object = pb_epithelial, test.use = "DESeq2",
                              ident.1 = "9", ident.2 = "12")


# The DESeq2 analysis results in NAs in the pvalue columns for some cases 
pb_epithelial_de <- na.omit(pb_epithelial_de)

# Restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
pb_epithelial_de_sig <- pb_epithelial_de[pb_epithelial_de$p_val_adj < 0.01,] 


# Compare significantly differentially expressed genes
pb_and_sc_genes <- intersect(rownames(epithelial_de_sig), rownames(pb_epithelial_de_sig))
only_sc_genes <- setdiff(rownames(epithelial_de_sig), rownames(pb_epithelial_de_sig))
only_pb_genes <- setdiff(rownames(pb_epithelial_de_sig), rownames(epithelial_de_sig))

print(paste0("Genes differentially expressed in both single-cell and pseudobulk: ", length(pb_and_sc_genes)))
print(paste0("Genes differentially expressed in single-cell but not pseudobulk: ", length(only_sc_genes)))
print(paste0("Genes differentially expressed in pseudobulk but not single-cell: ", length(only_pb_genes)))

Genes that were detected by single-cell and pseudobulk (~2800) are the most likely to be true positives. Far fewer genes are detected by pseudobulk alone (~600) compared to single-cell alone (~3000). The pseudobulk genes may also be false positives though, as by pseudobulking we lose information about the percent of cells expressing a gene. Therefore, some of these genes may be differentially expressed because they are only expressed in a few cells. Some tools, such as decoupler allow users to ensure their pseudobulked objects have a minimum number of cells prior to pseudobulking to get around this. It is also unlikely that all ~3000 genes detected by single-cell alone are false positives- therefore, it’s difficult to advocate for only one approach, but based on the downstream application one approach may be better than the other. (If the goal is to come up with a shorter list of genes to follow up on in the wet lab, then the consensus DE gene list may be appropriate. But if the goal is more exploratory, the single cell genes can be used, but you’ll likely want to use violin plots and feature plots to make sure the genes are indeed differentially expressed before you report them).

Differential expression for T cells

For the T cell focused analysis, we will ask how T cells from mice treated with ICB compare against T cells from mice with (some of) their T cells depleted treated with ICB (ICBdT). We will start by subsetting our merged object to only have T cells, by combining the various T cell annotations from celltyping section. We’ll start by seeing all the possible celltypes we have, and picking the ones that are related to T cells. Next, we will SetIdent to the celltype metadata column, and subset to the celltypes that correspond to T cells. Finally, we’ll doublecheck that the subsetting happened as we expected it to.

#check all the annotated celltypes
unique(merged$immgen_singler_main)

#pick the ones that are related to T cells
t_celltypes_names <- c('T cells', 'NKT', 'Tgd')
merged <- SetIdent(merged, value = 'immgen_singler_main')
merged_tcells <- subset(merged, idents = t_celltypes_names)

#confirm that we have subset the object as expected visually using a UMAP
DimPlot(merged, group.by = 'immgen_singler_main') + 
DimPlot(merged_tcells, group.by = 'immgen_singler_main')

#confirm that we have subset the object as expected by looking at the individual cell counts
table(merged$immgen_singler_main)
table(merged_tcells$immgen_singler_main)

Tcells subset object dimplot

Now we want to compare T cells from mice treated with ICB vs ICBdT. First, we need to distinguish the ICB and ICBdT cells from each other. Start by clicking on the object in RStudio and expand meta.data to get a snapshot of the columns and the what kind of data they hold.

We can leverage the orig.ident column again as it has information about the condition and replicates. For the purposes of this DE analysis, we want a meta.data column that combines the replicates of each condition. That is, we want to combine the replicates of each condition together into a single category (ICB vs ICBdT).

#we'll start by checking the possible names each replicate has.
unique(merged_tcells$orig.ident)

#there are 6 possible values, 3 replicates for the ICB treatment condition, and 3 for the ICBdT condition
#so we can combine "Rep1_ICB", "Rep3_ICB", "Rep5_ICB" to ICB, and "Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT" to ICBdT. 
#first initialize a metadata column for experimental_condition
merged_tcells@meta.data$experimental_condition <- NA

#Now we can take all cells that are in each replicate-condition, 
#and assign them to the appropriate condition
merged_tcells@meta.data$experimental_condition[merged_tcells@meta.data$orig.ident %in% c("Rep1_ICB", "Rep3_ICB", "Rep5_ICB")] <- "ICB"
merged_tcells@meta.data$experimental_condition[merged_tcells@meta.data$orig.ident %in% c("Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT")] <- "ICBdT"

#double check that the new column we generated makes sense 
#(each replicate should correspond to its experimental condition)
table(merged_tcells@meta.data$orig.ident, merged_tcells@meta.data$experimental_condition)

With the experimental conditions now defined, we can compare the T cells from both groups. We’ll start by using FindMarkers using similar parameters to last time, and see how ICBdT changes with respect to ICB. Next, restrict the dataframe to significant genes only, and then look at the top 5 most upregulated and downregulated DE genes by log2FC.

#carry out DE analysis between both groups
merged_tcells <- SetIdent(merged_tcells, value = "experimental_condition")
tcells_de <- FindMarkers(merged_tcells, ident.1 = "ICBdT", ident.2 = "ICB", min.pct=0.25)

#restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
tcells_de_sig <- tcells_de[tcells_de$p_val_adj < 0.001,]

#find the top 5 most downregulated genes
tcells_de_sig %>%
  top_n(n = 5, wt = -avg_log2FC)
#find the top 5 most upregulated genes
tcells_de_sig %>%
  top_n(n = 5, wt = avg_log2FC)

The most downregulated gene in the ICBdT condition based on foldchange is Cd4. That makes sense, because the monoclonal antibody (GK1.5) used for T cell depletion specifically targets CD4 T cells! You can read more about the antibody here.