# Total UMI count for each barcode in the PBMC dataset,
# plotted against its rank (in decreasing order of total counts).
# The inferred locations of the inflection and knee points are also shown.
```
Filtering by library size alone may eliminate barcodes that contain cells with naturally low expression levels. Luckily, more accurate methods have been developed to filter out cell-less barcodes from droplet-based data. Here we use the `emptyDrops` method from the `DropletUtils` package, which estimates the profile of the ambient RNA pool and then tests each barcode for deviations from this profile [@Lun2019-tg].
...
...
@@ -222,10 +220,6 @@ leaving only the mitochondrial mRNAs still in place.
We first calculate QC metrics using the `perCellQCMetrics` function from `scater` and
then filter out the cells with outlying mitochondrial gene proportions.
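
To make "outlying" concrete, here is a minimal base R sketch of one common rule: flag cells whose mitochondrial percentage lies more than a chosen number of median absolute deviations (MADs) above the median. The vector `mito_pct` and the threshold `n_mads` are hypothetical values chosen for illustration, and the rule is an assumption rather than necessarily the one applied in this document.

```r
# Hypothetical mitochondrial percentages for eight cells
mito_pct <- c(2.1, 3.0, 2.7, 3.4, 2.9, 3.1, 2.5, 25.0)

# Assumed threshold: 3 MADs above the median
n_mads <- 3
upper_limit <- median(mito_pct) + n_mads * mad(mito_pct)

# TRUE for cells with an unusually high mitochondrial percentage
high_mito <- mito_pct > upper_limit
which(high_mito)
```

Only the upper tail is tested here, since an unusually low mitochondrial proportion is not normally a sign of a damaged cell.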
The aim is to confirm that there are no cells with both large total counts and
large mitochondrial counts, to ensure that we are not inadvertently removing
...
...
@@ -304,10 +291,44 @@ gridExtra::grid.arrange(
### Doublet detection
Another form of QC is detecting doublets: droplets that contain multiple cells. Methods to detect doublets use the idea that doublets will contain co-expressed pairs of genes that we would not normally expect to be co-expressed. Here we will use the `scds` package to detect doublets.
In addition to removing cells with poor quality, it is usually a good idea to exclude genes where we suspect that technical artefacts may have skewed the results. Moreover, inspection of the gene expression profiles may provide insights about how the experimental procedures could be improved.
It is often instructive to consider the number of reads consumed by the top 50 expressed genes.
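
As a rough illustration of this diagnostic, the base R sketch below computes the share of all counts taken up by the most highly expressed genes. The matrix `toy_counts` and the helper `top_gene_fraction` are hypothetical names made up for this example, and `n = 2` stands in for the top 50 used on real data.

```r
# Hypothetical toy count matrix: 5 genes (rows) x 4 cells (columns)
toy_counts <- matrix(
    c(90, 80, 85, 95,
       5,  4,  6,  5,
       3,  2,  4,  1,
       1,  0,  2,  0,
       0,  1,  0,  1),
    nrow = 5, byrow = TRUE,
    dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4))
)

# Fraction of all counts consumed by the n most highly expressed genes
top_gene_fraction <- function(counts, n = 50) {
    gene_totals <- sort(rowSums(counts), decreasing = TRUE)
    sum(head(gene_totals, n)) / sum(counts)
}

frac_top2 <- top_gene_fraction(toy_counts, n = 2)
frac_top2
```

A value close to 1 means that a handful of genes dominate the library, which can indicate a technical problem worth investigating.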
It is typically a good idea to remove genes whose expression level is considered “undetectable”. We define a gene as detectable if at least two cells contain more than 1 transcript from the gene. If we were considering read counts rather than UMI counts, a reasonable threshold would be to require at least five reads in at least two cells. However, in both cases the threshold strongly depends on the sequencing depth. It is important to keep in mind that genes must be filtered after cell filtering, since some genes may only be detected in poor quality cells (note the `colData(pbmc.sce)$use` filter applied to the `pbmc.sce` dataset).
```{r filter_genes}
# Keep genes with more than 1 UMI in at least two QC-passing cells
keep_feature <- nexprs(
    pbmc.sce[, colData(pbmc.sce)$use],
    byrow = TRUE,
    detection_limit = 1
) >= 2
rowData(pbmc.sce)$use <- keep_feature
```
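
To see what this detectability rule computes, the following base R sketch applies the same logic to a small matrix. The objects `toy_umi` and `cell_use` are hypothetical stand-ins for the count matrix and the cell-level QC flag. The sketch assumes a gene counts as expressed in a cell when its count is strictly greater than the detection limit.

```r
# Hypothetical toy UMI count matrix: 4 genes (rows) x 5 cells (columns)
toy_umi <- matrix(
    c(0, 2, 3, 0, 5,   # gene_a: > 1 UMI in three cells
      1, 1, 1, 1, 1,   # gene_b: never more than 1 UMI
      0, 0, 4, 0, 0,   # gene_c: > 1 UMI in one cell only
      2, 0, 0, 2, 0),  # gene_d: > 1 UMI in exactly two cells
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("gene_", letters[1:4]), paste0("cell", 1:5))
)

# Assumed flag marking cells that passed cell-level QC (cell5 is dropped)
cell_use <- c(TRUE, TRUE, TRUE, TRUE, FALSE)

# For each gene, count the retained cells with more than 1 UMI
n_cells_detected <- rowSums(toy_umi[, cell_use, drop = FALSE] > 1)

# A gene is detectable if that count is at least two
keep_gene <- n_cells_detected >= 2
keep_gene
```

Restricting to the retained cells first matters: `gene_a` would pass on three cells if the discarded cell were included, but it passes here on the two remaining ones only.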
### Data visualisation and exploratory data analysis
After having performed QC it is very important to visually explore the data to check for batch effects or other artefacts. To do this we can use a variety of visualisations. We will describe these in the [Latent spaces] section below.
## Normalisation, confounders and batch correction
### Normalisation theory
...
...
@@ -358,7 +379,6 @@ stabilize the variance of the counts making downstream visualisation and methods
such as PCA more effective. The addition of 1 (sometimes called a pseudocount)
is needed as $\log(0)$ is $-\infty$, a value not useful for data analysis.
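
The effect of the pseudocount can be checked directly in base R. The vector `scaled_counts` below is a hypothetical set of size-factor-scaled counts, and base 2 is an assumption made for this illustration.

```r
# Hypothetical size-factor-scaled counts for one gene across five cells
scaled_counts <- c(0, 1, 3, 15, 255)

# Without a pseudocount, the zero count maps to -Inf
raw_log <- log2(scaled_counts)

# Adding 1 before taking the log keeps zero counts at zero
log_counts <- log2(scaled_counts + 1)
log_counts
```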
```{r size_factor_normalisation}
library(scran)
set.seed(1000)
...
...
@@ -448,7 +468,6 @@ It is recommended to simply pick a value and go ahead with the downstream analys