Commit 104919f1 authored by Davis McCarthy's avatar Davis McCarthy
Browse files

Tidying up code and text in the workshop Rmd

parent e904f4f4
......@@ -291,18 +291,32 @@ gridExtra::grid.arrange(
### Doublet detection
Another form of QC is detecting doublets: droplets which contain multiple cells. Methods to detect doublets use the idea that doublets will contain co-expressed pairs of genes that we would not normally expect to be co-expressed. Here we will use the scds package to detect doublets.
Another form of QC is detecting doublets: droplets which contain multiple cells. Methods to detect doublets use the idea that doublets will contain co-expressed pairs of genes that we would not normally expect to be co-expressed. Here we will use the `scds` package to detect doublets.
The `scds` contains two separate methods `cxds()`, a co-expression based approach, and `bcds()`, a binary classification approach to discriminate artificial doublets from original data. The package also implements a hybrid method, `cxds_bcds_hybrid()`. For details, please see the `scds` package documentation and the [scds paper](https://academic.oup.com/bioinformatics/article/36/4/1150/5566507). Here, we demonstrate the use of the `cxds_bcds_hybrid()` approach.
```{r doublet_detection}
library(scds)
umi <- cxds(sce.pbmc)
umi <- bcds(sce.pbmc)
umi <- cxds_bcds_hybrid(sce.pbmc)
sce.pbmc <- cxds_bcds_hybrid(sce.pbmc)
sce.pbmc$doublet_outlier <- isOutlier(sce.pbmc$hybrid_score, nmads = 5)
hist(sce.pbmc$hybrid_score)
abline(v = min(sce.pbmc$hybrid_score[sce.pbmc$doublet_outlier]), col = "firebrick",
lty = 2)
table(sce.pbmc$doublet_outlier)
gridExtra::grid.arrange(
plotColData(sce.pbmc, x = "total", y = "hybrid_score", colour_by = "detected"),
plotColData(sce.pbmc, x = "total", y = "hybrid_score", colour_by = "doublet_outlier"),
ncol = 2)
CD <- colData(sce.pbmc)
head(cbind(CD$cxds_score, CD$bcds_score, CD$hybrid_score))
```
In this example, `scds's` hybrid method finds 264 cells (barcodes) to be likely doublets if we define as outliers any cells with a `hybrid_score` more than 5 median absolute deviations away from the median. Even with a highly automated outlier detection method, there are choices for the analyst to make that will affect the final doublet detection results.
Please note that `scds` is by no means the only good or useful package available for expression based doublet detection for scRNA-seq data. It is also worth explicitly stating that this is a very challenging problem, and perfect accuracy in doublet detection is an unreasonable expectation
### Gene QC
In addition to removing cells with poor quality, it is usually a good idea to exclude genes where we suspect that technical artefacts may have skewed the results. Moreover, inspection of the gene expression profiles may provide insights about how the experimental procedures could be improved.
......@@ -310,19 +324,21 @@ In addition to removing cells with poor quality, it is usually a good idea to ex
It is often instructive to consider the number of reads consumed by the top 50 expressed genes.
```{r plot_highest_genes}
plotHighestExprs(sce.pbmc, exprs_values = "counts")
plotHighestExprs(sce.pbmc[,sample(seq_len(ncol(sce.pbmc)), 200)], exprs_values = "counts")
```
It is typically a good idea to remove genes whose expression level is considered “undetectable”. We define a gene as detectable if at least two cells contain more than 1 transcript from the gene. If we were considering read counts rather than UMI counts a reasonable threshold is to require at least five reads in at least two cells. However, in both cases the threshold strongly depends on the sequencing depth. It is important to keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor quality cells (note `colData(umi)$use` filter applied to the umi dataset).
[In this example we look just at expression in a random sample of 200 genes, for computational speed. If you have more time available you might like to replicate the plot using all cells from the `sce.pbmc` object.]
It is typically a good idea to remove genes whose expression level is considered undetectable. We define a gene as detectable if at least two cells contain more than 1 transcript from the gene. If we were considering read counts rather than UMI counts a reasonable threshold is to require at least five reads in at least two cells. However, in both cases the threshold strongly depends on the sequencing depth. It is important to keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor quality cells (note `umi$use` filter applied to the umi dataset).
```{r filter_genes}
keep_feature <- nexprs(
pbmc.sce[,colData(pbmc.sce)$use],
keep_feature <- scater::nexprs(
sce.pbmc,
byrow = TRUE,
detection_limit = 1
) >= 2
rowData(pbmc.sce)$use <- keep_feature
rowData(sce.pbmc)$use <- keep_feature
```
### Data visualisation and exploratory data analysis
......@@ -545,19 +561,24 @@ Plotting summarized information of all cells in their respective hexagonal cells
information without obstructions.
We use the `plot_hexbin_feature` to plot the feature expression of single cells in
bivariate hexagonal cells.
bivariate hexagonal cells. These plots are extremely useful both for QC and biological investigations of the data (see plots below for doublet outlier status, mean number of total counts per cell and median expression of CD14).
```{r hexgonplot}
pbmc_hex <- schex::make_hexbin(sce.pbmc, 50, dimension_reduction = "TSNE")
pbmc_hex <- schex::make_hexbin(sce.pbmc, 100, dimension_reduction = "TSNE")
pbmc_hex$doublet_outlier <- as.numeric(pbmc_hex$doublet_outlier)
plot_hexbin_meta(pbmc_hex, col = "doublet_outlier", action = "prop_0") +
ggtitle("Proportion of doublet outliers per hexbin")
plot_hexbin_meta(pbmc_hex, col = "total", action = "mean") +
ggtitle("Mean total number of counts per cell")
plot_hexbin_feature(pbmc_hex,feature = "CD14",
type = "logcounts",
action = "median")
plot_hexbin_meta(pbmc_hex, col = "total", action = "mean") +
theme(legend.position = "none")
```
### Hands-on practise
### Hands-on practice
---run UMAP and plot----
......@@ -575,6 +596,11 @@ pbmc_hex <- schex::make_hexbin(sce.pbmc, 50, dimension_reduction = "UMAP")
plot_hexbin_feature(pbmc_hex, feature = "CD14",
type = "logcounts",
action = "median")
plot_hexbin_feature(pbmc_hex, feature = "CD69",
type = "logcounts",
action = "median")
```
### Matrix factorisation and factor analysis
......@@ -657,10 +683,7 @@ Perhaps add a section here on MRtree (Peng et al, NAR, 2021)
## Marker dection/cluster annotation
To interpret our clustering results, we identify the genes that drive separation
between clusters. These so-called 'marker genes' allow us to assign biological meaning to
each cluster based by comparing the expression profiles of the marker genes to the
expression profiles of previously identified cell types. We demonstrate this approach
in the section ["Manual" annotation].
between clusters. These so-called 'marker genes' allow us to assign biological meaning to each cluster based by comparing the expression profiles of the marker genes to the expression profiles of previously identified cell types. We demonstrate this approach in the section ["Manual" annotation].
An alternative way is to use some automatic annotation approach, see the
["Automatic" annotation] section.
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment