Commit 7d69d6aa authored by Jeffrey Pullin's avatar Jeffrey Pullin
Browse files

Jeffrey edits take 2

Filled in:

* Doublet detection
* Gene filtering
* Exploratory data analysis
* Matrix factorization and latent spaces

I removed the Autoencoders section as we don't have a good example and
because of time constraints.
parent 4bcd6066
......@@ -40,7 +40,7 @@ The list of R packages we will be using today is below. We will discuss these pa
This material is adapted from [@Amezquita2019-yn].
```{r}
```{r packages}
suppressPackageStartupMessages({
library(scRNAseq)
library(scater)
......@@ -136,7 +136,6 @@ on the left and the "knee" represents the inflection point, where to the right
barcodes have relatively low counts and may represent empty droplets.
```{r knee_plot}
bcrank <- barcodeRanks(counts(sce.pbmc))
# Only showing unique points for plotting speed.
......@@ -152,7 +151,6 @@ legend("bottomleft", legend=c("Inflection", "Knee"),
# Total UMI count for each barcode in the PBMC dataset,
# plotted against its rank (in decreasing order of total counts).
# The inferred locations of the inflection and knee points are also shown.
```
Filtering just by library size may result in eliminating barcodes that contained cells with naturally low expression levels. Luckily, more accurate methods have been developed to filter out cell-less barcodes from droplet-based data. Here we use the `emptyDrops` method from the `DropletUtils` package, which estimates the profile of the ambient RNA pool and then tests each barcode for deviations from this profile [@Lun2019-tg].
......@@ -222,10 +220,6 @@ leaving only the mitochondrial mRNAs still in place.
We first calculate QC metrics by `perCellQCMetrics` function from `scater` and
then filter out the cells with outlying mitochondrial gene proportions.
```{r cell_filter}
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
......@@ -233,7 +227,6 @@ high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
colData(sce.pbmc) <- cbind(colData(sce.pbmc), stats)
sce.pbmc <- sce.pbmc[,!high.mito]
```
As sanity check, we can look at the library size (i.e. total counts per cell)
......@@ -252,18 +245,12 @@ unfiltered$discard <- high.mito
plotColData(unfiltered, y="sum", colour_by="discard") +
scale_y_log10() + ggtitle("Total count")
```
In another view, we plot the `subsets_Mito_percent` against `sum`:
```{r plot_sum_mito}
plotColData(unfiltered, x="sum", y="subsets_Mito_percent", colour_by="discard")
```
The aim is to confirm that there are no cells with both large total counts and
large mitochondrial counts, to ensure that we are not inadvertently removing
......@@ -304,10 +291,44 @@ gridExtra::grid.arrange(
### Doublet detection
Another form of QC is detecting doublets: droplets which contain multiple cells. Methods to detect doublets use the idea that doublets will contain co-expressed pairs of genes that we would not normally expect to be co-expressed. Here we will use the scds package to detect doublets.
```{r doublet_detection}
library(scds)
umi <- cxds(sce.pbmc)
umi <- bcds(sce.pbmc)
umi <- cxds_bcds_hybrid(sce.pbmc)
CD <- colData(sce.pbmc)
head(cbind(CD$cxds_score, CD$bcds_score, CD$hybrid_score))
```
### Gene QC
In addition to removing cells with poor quality, it is usually a good idea to exclude genes where we suspect that technical artefacts may have skewed the results. Moreover, inspection of the gene expression profiles may provide insights about how the experimental procedures could be improved.
It is often instructive to consider the number of reads consumed by the top 50 expressed genes.
```{r plot_highest_genes}
plotHighestExprs(sce.pbmc, exprs_values = "counts")
```
It is typically a good idea to remove genes whose expression level is considered “undetectable”. We define a gene as detectable if at least two cells contain more than 1 transcript from the gene. If we were considering read counts rather than UMI counts a reasonable threshold is to require at least five reads in at least two cells. However, in both cases the threshold strongly depends on the sequencing depth. It is important to keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor quality cells (note `colData(umi)$use` filter applied to the umi dataset).
```{r filter_genes}
keep_feature <- nexprs(
pbmc.sce[,colData(pbmc.sce)$use],
byrow = TRUE,
detection_limit = 1
) >= 2
rowData(pbmc.sce)$use <- keep_feature
```
### Data visualisation and exploratory data analysis
After having performed QC it is very important to visually explore the data to checkfor batch effects or other artifacts. To do this we can use a variety of visualisations. We will describe these in the [Latent spaces] section below.
## Normalisation, confounders and batch correction
### Normalisation theory
......@@ -358,7 +379,6 @@ stabilize the variance of the counts making downstream visualisation and methods
such as PCA more effective. The addition of 1 (sometimes called a pseudocount)
is needed as $\log(0)$ is $-\infty$ - a value not useful for data analysis.
```{r size_factor_normalisation}
library(scran)
set.seed(1000)
......@@ -448,7 +468,6 @@ It is recommended to simply pick a value and go ahead with the downstream analys
but with the intention of testing other parameters if needed.
## Latent spaces
### Dimensionality reduction and visualisation
......@@ -467,7 +486,6 @@ analyses. The `denoisePCA` function will automatically choose the number of PCs
to keep, using the information returned by the variance decomposition step to
remove principal components corresponding to technical noise.
```{r pca}
set.seed(10000)
## This function performs a principal components analysis to eliminate random
......@@ -484,7 +502,6 @@ sce.pbmc
reducedDims(sce.pbmc)$PCA[1:5,]
```
tSNE (t-stochastic neighbor embedding) is a popular dimensionality reduction
methods for visualising single-cell dataset with functions to run them available
in `scater`. They attempt to find a low-dimensional representation of the data
......@@ -530,7 +547,6 @@ information without obstructions.
We use the `plot_hexbin_feature` to plot the feature expression of single cells in
bivariate hexagonal cells.
```{r hexgonplot}
pbmc_hex <- schex::make_hexbin(sce.pbmc, 50, dimension_reduction = "TSNE")
......@@ -563,7 +579,7 @@ plot_hexbin_feature(pbmc_hex, feature = "CD14",
### Matrix factorisation and factor analysis
### Autoencoders
Another method for understanding single cell data through 'reduction' is matrix factorization and factor analysis. The key concept of factor analysis is that the original data are associated with some underlying unobserved variables: the latent factor. Hopefully. these latent factor are biologically meaningful, for example in single cell data a factor could correspond to a specific regulatory process. Research into factor analysis and matrix factorization is still ongoing but one leading method is [Slalom](https://bioconductor.org/packages/release/bioc/html/slalom.html) which aims to make the latent factors more interpretable by enforcing sparsity constraints.
## Clustering and cell annotation
......@@ -580,7 +596,6 @@ their pair-wise Euclidean distances. After that, the nodes are connected if they
have at least one shared nearest neighbors, and weights are also added to the
edges by how many shared neighbours they have.
![shared nearest neighbour](https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/figures/SNN.jpg)
Once the SNN graph is built, we can then use different algorithms to detect
......@@ -599,7 +614,6 @@ table(factor(clust))
--- Choose different values for `k` and observe how many clusters you get ---
### Clustree
The `clustree` package can be used to visualise the different clustering results in a
......@@ -608,7 +622,6 @@ changes.
Note, it looks very much like a hierarchical clustering result but it is not.
```{r hint,clustree,include=T,eval=T,fig.width=12}
g <- buildSNNGraph(sce.pbmc, k = 3, use.dimred = 'PCA')
k3 <- igraph::cluster_walktrap(g)$membership
......@@ -691,13 +704,13 @@ what cells types each cluster has ---
| Markers | Cell Type |
|:------|:-----|
| IL7R | CD4 T cells |
| CD14, LYZ | CD14+ Monocytes |
| CD14, LYZ | CD14+ Monocytes |
| MS4A1 | B cells |
| CD8A | CD8 T cells |
| FCGR3A, MS4A7 | FCGR3A+ Monocytes |
| GNLY , NKG7|NK cells |
| PPBP |Megakaryocytes|
| FCER1A, CST3 |Dendritic Cells|
| FCGR3A, MS4A7 | FCGR3A+ Monocytes |
| GNLY, NKG7 |NK cells |
| PPBP | Megakaryocytes|
| FCER1A, CST3 | Dendritic Cells|
```{r,include=F,eval=F}
pbmc_hex <- schex::make_hexbin(sce.pbmc, 50, dimension_reduction = "TSNE")
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment