---
title: "BIOS_analysis-workflow"
author: "Ruqian Lyu, Christina Azodi, Jeffrey Pullin"
date: "11/4/2020"
bibliography: ../bios.bib
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, dpi = 100)
```
## Introduction
This workshop provides a quick (~2 hour), hands-on introduction to current computational analysis approaches for studying single-cell RNA sequencing data. We will discuss the data formats used frequently in single-cell analysis, typical steps in quality control and processing of single-cell data, and some of the biological questions these data can be used to answer.
Given we only have 2 hours, we will start with the data already processed into a count matrix, which contains the number of sequencing reads mapping to each gene for each cell. The steps to generate such a count matrix depend on the type of single-cell sequencing technology used and the experimental design. For a more detailed introduction to these methods we recommend the long-form single-cell RNA-seq analysis workshop from the BioCellGen group ([available here]( and the analysis of single-cell RNA-seq data course put together by folks at the Sanger Institute [available here]( Briefly, a count matrix is generated from raw sequencing data using the following steps:
1. If multiple sample libraries were pooled together for sequencing (typically to reduce cost), sequences are separated (i.e. demultiplexed) by sample using barcodes that were added to the reads during library preparation.
1. Quality control on the raw sequencing reads (e.g. using [FastQC](
1. Trimming the reads to remove sequencing adapters and low quality sequences from the ends of each read (e.g. using [Trim Galore!](
1. Aligning QCed and trimmed reads to the genome (e.g. using [STARsolo](, [Kallisto-BUStools](, or [CellRanger](
1. Quality control of mapping results.
1. Quantify the expression level of each gene for each cell (e.g. using bulk tools or single-cell specific tools from STAR, Kallisto-BUStools, or CellRanger).
Starting with the expression counts matrix, this workshop will cover the following topics:
1. The SingleCellExperiment object
1. Empty droplet identification
1. Cell-level quality control
1. Normalisation
1. Dimension Reduction and visualisaion
1. Clustering
1. Marker gene/cell annotation
## Required R Packages
The list of R packages we will be using today is below. We will discuss these packages and which functions we use in more detail as we go through the workshop.
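A sketch of the loading code, inferred from the packages used in the chunks below (the exact list in the original workshop may differ slightly):

```{r load_packages, message=FALSE, eval=FALSE}
## Packages used in this workshop; this list is inferred from the code
## chunks below and may differ slightly from the original workshop setup.
library(BiocFileCache)         # cached download of the raw data
library(DropletUtils)          # read10xCounts, barcodeRanks, emptyDrops
library(SingleCellExperiment)  # the central data structure
library(scater)                # QC metrics, plotColData, plotTSNE/plotUMAP
library(scran)                 # normalisation, variance modelling, clustering
library(EnsDb.Hsapiens.v86)    # gene annotation database
library(schex)                 # hexbin plots
library(igraph)                # graph-based community detection
```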
This material is adapted from [@Amezquita2019-yn].
## Download Dataset
We will use the peripheral blood mononuclear cell (PBMC) dataset from 10X
Genomics, which consists of different kinds of lymphocytes (T cells, B cells,
NK cells) and monocytes. These PBMCs are from a healthy donor and are expected
to have relatively small amounts of RNA (~1 pg RNA/cell) [@Zheng2017-hd]. This
means the dataset is relatively small and easy to manage.
We can download the gene count matrix for this dataset from 10X Genomics and
unpack the zipped file directly in R. We also use the `BiocFileCache` function
from the [`BiocFileCache` package](,
which helps manage local copies of remote resource files.
```{r download_data, eval=FALSE}
# Creates an "sqlite" database file at the path "raw_data" that stores
# information about the local copy and the remote resource, so that when the
# remote data source is updated the local copy can be refreshed as well.
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("",
untar(raw.path, exdir=file.path("untar_path", "pbmc4k"))
```
## The `SingleCellExperiment` object
The SingleCellExperiment class (from the `SingleCellExperiment` package) serves as
the common format for data exchange across 70+ single-cell-related Bioconductor
packages. This class implements a data structure that stores all aspects of our
single-cell data (i.e. gene-by-cell expression data, per-cell metadata, and
per-gene annotations) and allows you to manipulate them in a synchronized
manner [@Amezquita2019-yn]. A visual description of the `SingleCellExperiment` structure
is shown below.
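As a quick sketch of how these synchronized slots are accessed (illustrative only, assuming an SCE object `sce` like the one we build below):

```{r sce_accessors, eval=FALSE}
## The main accessors of a SingleCellExperiment object `sce`
counts(sce)             # gene-by-cell count matrix (an "assay")
logcounts(sce)          # log-normalised assay, once computed
colData(sce)            # per-cell metadata (one row per cell)
rowData(sce)            # per-gene annotation (one row per gene)
reducedDim(sce, "PCA")  # low-dimensional representations, once computed
sce[, 1:10]             # subsetting keeps all slots in sync
```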
## Load Dataset into R
We next use the `read10xCounts` function from the `DropletUtils`
package to load the count matrix. This function automatically identifies the
count matrix file from CellRanger's output and reads it in as a
`SingleCellExperiment` object.
```{r sce_pbmc}
fname <- file.path("untar_path", "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
```
With the data loaded, we can inspect the SCE object to see what information it contains.
```{r print_sce_pbmc}
sce.pbmc
```
- How many cells are in the SCE object? How many genes?
## Removing empty droplets from unfiltered datasets
Droplet-based scRNA-seq methods like 10x increase the throughput of single-cell
transcriptomics studies. However, one drawback to droplet-based sequencing is
that some of the droplets will not contain a cell and will need to be filtered
out computationally. This would be an easy task if empty droplets were truly
empty, but in reality these droplets contain “ambient” RNA (free transcripts
present in the cell suspension solution that are either actively secreted by
cells or released upon cell lysis). These ambient RNAs serve as the base material
for reverse transcription and scRNA-seq library preparation, resulting in
non-zero counts for cell-less droplets/barcodes.
We can generate a knee plot to visualize the library sizes (i.e. total counts)
for each barcode in the dataset. In this plot, high quality barcodes are located
on the left and the "knee" represents the inflection point, where to the right
barcodes have relatively low counts and may represent empty droplets.
```{r knee_plot}
bcrank <- barcodeRanks(counts(sce.pbmc))
# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
xlab="Rank", ylab="Total UMI count", cex.lab=1.2)
abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2)
abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)
legend("bottomleft", legend=c("Inflection", "Knee"),
col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
# Total UMI count for each barcode in the PBMC dataset,
# plotted against its rank (in decreasing order of total counts).
# The inferred locations of the inflection and knee points are also shown.
```
Filtering just by library size may result in eliminating barcodes that contained cells with naturally low expression levels. Luckily, more accurate methods have been developed to filter out cell-less barcodes from droplet-based data. Here we use the `emptyDrops` method from the `DropletUtils` package, which estimates the profile of the ambient RNA pool and then tests each barcode for deviations from this profile [@Lun2019-tg].
```{r cell_calling}
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
unfiltered <- sce.pbmc
```
- How many cells are in the SCE object after filtering out empty barcodes?
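As a hint (not part of the original code), one way to check is to count the remaining barcodes and summarise the `emptyDrops` results:

```{r cell_calling_check, eval=FALSE}
## Number of barcodes retained after emptyDrops filtering
ncol(sce.pbmc)
## Summary of which barcodes passed the FDR threshold
summary(e.out$FDR <= 0.001)
```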
## Annotate genes
Let's first add row names to the SCE, using gene symbols as the
row names. We have to ensure that the row name for each gene is unique, using
the `uniquifyFeatureNames` function from `scater`. It finds duplicated
symbols (some Ensembl IDs share the same gene symbol) and adds the Ensembl ID
as a prefix for them so that the row names are unique.
The PBMC SCE object has the Ensembl ID and the gene symbol for each gene in the
rowData. We can add additional annotation information (e.g. chromosome origin)
using the annotation database package `EnsDb.Hsapiens.v86`. Then, to get
additional annotations of these genes we can use the IDs as keys to retrieve
the information from the database using the `mapIds` function.
```{r feature_annot}
rownames(sce.pbmc) <- uniquifyFeatureNames( ID = rowData(sce.pbmc)$ID,
names = rowData(sce.pbmc)$Symbol)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID,
column="SEQNAME", keytype="GENEID")
```
## Cell-level Quality Control
### Calculate cell-level QC metrics
With empty barcodes removed, we can now focus on filtering out low-quality cells
(i.e. damaged or dying cells) which might interfere with our downstream
analysis. It's common practice to filter cells based on low/high total counts,
and we may also want to remove cells with few genes expressed. However, for
heterogeneous cell populations, like PBMCs, this may remove cell types with
naturally low RNA content. So here we will be more relaxed in our filtering and
only remove cells with a large fraction of read counts from mitochondrial genes.
This metric is a proxy for cell damage, where a high proportion of mitochondrial
reads suggests that cytoplasmic mRNA leaked out through a broken membrane,
leaving only the mitochondrial mRNA still in place.
We first calculate QC metrics with the `perCellQCMetrics` function from `scater`
and then filter out the cells with outlying mitochondrial gene proportions.
```{r cell_filter}
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
colData(sce.pbmc) <- cbind(colData(sce.pbmc), stats)
sce.pbmc <- sce.pbmc[,!high.mito]
```
As a sanity check, we can look at the library size (i.e. total counts per cell)
for cells that will remain after filtering and those that will be removed for
having high mitochondrial mRNA proportions. As expected, the cells that will be
filtered out (orange) tend to have lower library sizes, but not all cells with
low library sizes are removed.
If a large proportion of cells would be removed, it suggests that the QC
metrics we are using for filtering might be correlated with some biological
state rather than with technical quality.
```{r plot_coldata}
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- high.mito
plotColData(unfiltered, y="sum", colour_by="discard") +
scale_y_log10() + ggtitle("Total count")
```
In another view, we plot the `subsets_Mito_percent` against `sum`:
```{r plot_sum_mito}
plotColData(unfiltered, x="sum", y="subsets_Mito_percent", colour_by="discard")
```
The aim is to confirm that there are no cells with both large total counts and
large mitochondrial counts, to ensure that we are not inadvertently removing
high-quality cells that happen to be highly metabolically active (e.g., hepatocytes).
Points in the top-right corner might potentially correspond to metabolically
active, undamaged cells.
### Hands-on exercises
1. Try identifying outlier cells based on low/high total counts.

```{r, include=F,eval=F}
high.count <- isOutlier(stats$total, type="higher")
```

1. Why would it be useful to filter out cells with higher total counts?
1. Use `plotColData` to plot other useful cell-level QC metrics
(e.g. number of detected features, `subsets_Mito_detected`) and colour by
whether cells would be discarded.
Hint: you may find the following code helpful.
```{r hint}
## Arrange the three QC plots together (gridExtra is assumed here; the
## trailing commas in the original suggest the plots were passed to one call).
gridExtra::grid.arrange(
    plotColData(unfiltered, y="sum", colour_by="discard") +
        scale_y_log10() + ggtitle("Total count"),
    plotColData(unfiltered, y="detected", colour_by="discard") +
        scale_y_log10() + ggtitle("Detected features"),
    plotColData(unfiltered, y="subsets_Mito_percent",
        colour_by="discard") + ggtitle("Mito percent"),
    ncol=2
)
```
## Normalisation
Normalisation aims to remove systematic or technical differences (for example,
from systematic differences in sequencing coverage across libraries or
cells) among cells and samples so they do not interfere with comparisons of the
expression profiles.
We are going to use a normalisation method named Normalisation by Deconvolution.
It is a scaling-based procedure that involves dividing all counts for each
cell by a cell-specific scaling factor, often called a “size factor”. The size
factor for each cell represents the estimate of the relative bias in that cell,
so division of its counts by its size factor should remove that bias.
Adapted from bulk RNA-seq analysis, we assume that the majority of genes
between two cells are non-DE (not differentially expressed), and the
differences among these non-DE genes are taken to represent the bias, which is
used to compute an appropriate size factor for its removal.
Given the low counts that are characteristic of single-cell data, we first pool
counts from multiple cells together and estimate a pool-based size factor. The
size factor for the pool can be written as a linear equation of the size
factors for the cells within the pool. Repeating this process for multiple
pools yields a linear system that can be solved to obtain the size factors for
the individual cells, hence "deconvolved".
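Schematically (notation ours, not taken from the original paper): if pool $P$ contains cells $i \in P$, the pool-based size factor is, up to a scaling constant, a sum of the unknown cell-specific factors,

$$
s_P \approx \sum_{i \in P} s_i,
$$

so each pool contributes one linear equation in the $s_i$, and many overlapping pools give a linear system that can be solved for the individual cells.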
In addition, to avoid pooling cells that are too distinct, we first do a quick
clustering on the cells, and the normalisation (cell-pooling and deconvolution)
is done for each cell cluster separately [@Lun2016-ta].
After calculating the size factors, we apply a log transformation to the
counts. This transformation is of the form:

$$
\log\left(\frac{y_{ij}}{s_i} + 1\right)
$$

where $y_{ij}$ is the observed count for gene $j$ in cell $i$ and $s_i$ is the
calculated size factor for cell $i$. The division by the size factor removes the
scaling bias, as described above, while the $\log$ transformation helps to
stabilise the variance of the counts, making downstream visualisation and
methods such as PCA more effective. The addition of 1 (sometimes called a
pseudocount) is needed because $\log(0)$ is $-\infty$, a value not useful for
data analysis.
```{r size_factor_normalisation}
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
## normalise using size factor and log-transform them
sce.pbmc <- logNormCounts(sce.pbmc)
```
```{r plot_size_factors}
plot(librarySizeFactors(sce.pbmc), sizeFactors(sce.pbmc), pch=16,
     xlab="Library size factors", ylab="Deconvolution factors", log="xy")
abline(a=0, b=1)
```
### Alternative normalisation method
Recently, more sophisticated normalisation methods have been developed, for
example `sctransform` [@Hafemeister2019-zh]. The linked page contains
instructions for using `sctransform` for normalisation and feature selection.
You can come back to this and try it out later.
--- sctransform ---
1. [sctransform normalisation approach](BIOS_sctransform-normalisation.html)
======================================= Break ================================
## Variance modeling and feature selection
In this step, we would like to select the features/genes that represent the
biological differences among cells while removing noise by other genes. We could
simply select the genes with larger variance across the cell population based on
the assumption that the most variable genes represent the biological differences
rather than just technical noise.
The simplest approach to finding such genes is to compute the variance of the
log-normalised expression values for each gene across all cells. However, the
variance of a gene in a single-cell RNA-seq dataset is related to its mean: the
log-transformation does not achieve perfect variance stabilisation, which means
that the variance of a gene is often driven more by its abundance than by its
biological heterogeneity.
To deal with this effect, we fit a trend to the variance with respect to the gene's
abundance across all genes to model the mean-variance relationship:
```{r model_variance}
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
```
The trend is fitted on simulated Poisson count-derived variances, under the
assumptions that the technical component is Poisson-distributed and that most
genes do not exhibit large biological variability.
```{r plot_modelled_variance}
plot(dec.pbmc$mean, dec.pbmc$total, pch=16, cex=0.5,
xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(dec.pbmc)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
```
Under this assumption, our trend represents an estimate of the technical noise
as a function of abundance. The total variance can then be decomposed into
technical and biological components.
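For instance, we can inspect the decomposition directly: `modelGeneVarByPoisson` returns per-gene `tech` and `bio` columns alongside the mean and total variance (a quick sketch):

```{r inspect_decomposition, eval=FALSE}
## Genes ordered by their estimated biological variance component
dec.ordered <- dec.pbmc[order(dec.pbmc$bio, decreasing=TRUE), ]
head(dec.ordered[, c("mean", "total", "tech", "bio")])
```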
We can then get the list of highly variable genes after variance decomposition:
```{r top_hvgs}
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)
```
Regarding the number of genes to keep: if we believe that no more than 5% of
genes are differentially expressed across the cells, we can set `prop` to 5%.
For a more heterogeneous cell population, we would raise that number. Here we
are using the top 10%.
Note that this choice is fairly arbitrary. It is recommended to simply pick a
value, go ahead with the downstream analysis, and revisit the choice later if
needed.
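One way to get a feel for this choice is to compare how many genes different settings keep (a sketch, not part of the original workflow):

```{r compare_prop, eval=FALSE}
## Number of HVGs retained under different `prop` settings
length(getTopHVGs(dec.pbmc, prop=0.05))
length(getTopHVGs(dec.pbmc, prop=0.1))
length(getTopHVGs(dec.pbmc, prop=0.2))
```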
## Dimensionality reduction and visualisation
Now that we have selected the highly variable genes, we can use them to reveal
cell population structure in a constructed lower-dimensional space. Working
with this lower-dimensional representation allows for easier exploration and
reduces the computational burden of downstream analyses, like clustering. It
also helps to reduce noise by averaging information across multiple genes to
reveal the structure of the data more precisely.
We first do a PCA (Principal Components Analysis) transformation. By definition,
the top PCs capture the dominant factors of heterogeneity in the data set. Thus,
we can perform dimensionality reduction by only keeping the top PCs for downstream
analyses. The `denoisePCA` function will automatically choose the number of PCs
to keep, using the information returned by the variance decomposition step to
remove principal components corresponding to technical noise.
```{r pca}
## This function performs a principal components analysis to eliminate random
## technical noise in the data.
sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc)
```
t-SNE (t-stochastic neighbour embedding) is a popular dimensionality reduction
method for visualising single-cell datasets, with functions to run it available
in `scater`. It attempts to find a low-dimensional representation of the data
that preserves the distances between each point and its neighbours in the
high-dimensional space.
In addition, UMAP (uniform manifold approximation and projection), which is
conceptually similar to t-SNE, is gaining popularity. Both are ways to
"visualise" the structure in the dataset and generate nice plots, but they
should not be regarded as the real clustering structure.
We actually instruct the function to perform the t-SNE calculations on the top
PCs to use the data compaction and noise removal provided by the PCA.
```{r tsneUMAP}
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")
```
Visualising column data or plotting gene expression in the t-SNE plot:
```{r tsne_plot}
plotTSNE(sce.pbmc, colour_by="detected")
plotTSNE(sce.pbmc, colour_by="CD14")
```
### Hexbin plot for overcoming overplotting
Notice that the 2D plots of single cells can sometimes get really "crowded",
especially when the number of cells being plotted is large, and we run into an
over-plotting issue where cells are too close and overlay/block each other.
To deal with that, we can use hexbin plots [@Freytag2020-us]. The
`schex` package provides strategies for binning cells into hexagon cells.
Plotting summarised information of all cells in their respective hexagon cells
presents the information without obstruction.
We use `plot_hexbin_feature` to plot the feature expression of single cells in
bivariate hexagon cells.
```{r hexgonplot}
pbmc_hex <- schex::make_hexbin(sce.pbmc, 50, dimension_reduction = "TSNE")
## Expression of a feature (here CD14, as plotted above) in hexagon cells
schex::plot_hexbin_feature(pbmc_hex, feature = "CD14", type = "logcounts",
                           action = "mean")
plot_hexbin_meta(pbmc_hex, col = "total", action = "mean") +
    theme(legend.position = "none")
```
### Hands-on practice
1. Run UMAP and plot it.
1. Plot the hexbin plot for the UMAP dimensions.

```{r umap_practice}
sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")
plotUMAP(sce.pbmc, colour_by = "CD14")
pbmc_hex <- schex::make_hexbin(sce.pbmc, 50, dimension_reduction = "UMAP")
schex::plot_hexbin_feature(pbmc_hex, feature = "CD14", type = "logcounts",
                           action = "mean")
```
## Clustering
One of the most promising applications of scRNA-seq is de novo discovery and
annotation of cell types based on transcription profiles. We are going to use
an unsupervised clustering method; that is, we identify groups of cells based
on the similarities of their transcriptomes without any prior knowledge of the
labels.
In scRNA-seq analyses, shared nearest-neighbour (SNN) graph-based clustering is
often applied. Each cell is represented in the graph as a node, and the first
step is to find the K-nearest neighbours (KNN) of each cell by comparing their
pairwise Euclidean distances. After that, two nodes are connected if they have
at least one shared nearest neighbour, and the edges are weighted by how many
neighbours the two nodes share.
![shared nearest neighbour](
Once the SNN graph is built, we can then use different algorithms to detect
communities/clusters in the graph. Here we use the method called 'walktrap' from
the igraph package which tries to detect densely connected communities in a graph
via random walks.
```{r buildGraph}
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sce.pbmc) <- factor(clust)
```
### Questions
1. Choose different values for `k` and observe how many clusters you get.
### Clustree
`clustree` can be used to visualise different clustering results in a tree
structure. It helps to show how cells move between clusters as the clustering
resolution changes.
Note that it looks very much like a hierarchical clustering result, but it is
not.
```{r clustree, include=TRUE, eval=TRUE, fig.width=12}
g <- buildSNNGraph(sce.pbmc, k=5, use.dimred = 'PCA')
k5 <- igraph::cluster_walktrap(g)$membership
g <- buildSNNGraph(sce.pbmc, k=8, use.dimred = 'PCA')
k8 <- igraph::cluster_walktrap(g)$membership
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA')
k10 <- igraph::cluster_walktrap(g)$membership
## Combine the clusterings (columns named k<value>) and visualise with clustree
clusterings <- data.frame(k5 = k5, k8 = k8, k10 = k10)
clustree::clustree(clusterings, prefix = "k")
```