In this workshop, we will introduce the data infrastructure for scRNA-seq analysis
in R and practice a workflow for scRNA-seq analysis: from pre-processing and quality
control to dimensionality reduction and clustering. We will then demonstrate the
use of marker genes for cell type annotation and an automatic approach for matching
query cells to a labelled reference atlas.
## Setup requirements
We use the Bioconductor single-cell ecosystem for this workshop. Thus, participants will need a recent version of R (version 4.0+) and a set of specific packages that we use.
The code snippet below will install the necessary packages for you in R (i.e. run the following code at the R prompt in an R or RStudio session). The first line installs the `BiocManager` package, which is the recommended tool for installing Bioconductor packages. The next (long) line then installs the packages we need, from both Bioconductor and CRAN (and any dependencies).
```{r package-installation, eval=FALSE}
install.packages("BiocManager")
BiocManager::install(c("scRNAseq",
                       "scater",
                       "scran",
                       "clustree",
                       "BiocSingular",
                       "Rtsne",
                       "BiocFileCache",
                       "DropletUtils",
                       "EnsDb.Hsapiens.v86",
                       "schex",
                       "celldex",
                       "SingleR",
                       "gridExtra"))
```
If you have trouble with the package installation, please ask a colleague for help.
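To confirm that the installation worked, a quick check along the following lines can help. This is a minimal sketch using only base R; the variable names (`workshop_pkgs`, `missing_pkgs`, and so on) are made up for illustration, and the package list simply repeats the one in the installation snippet above.

```r
# Packages listed in the installation snippet above
workshop_pkgs <- c("scRNAseq", "scater", "scran", "clustree", "BiocSingular",
                   "Rtsne", "BiocFileCache", "DropletUtils",
                   "EnsDb.Hsapiens.v86", "schex", "celldex", "SingleR",
                   "gridExtra")

# The workshop assumes R 4.0 or newer
r_ok <- getRversion() >= "4.0.0"
if (!r_ok) {
  message("R ", getRversion(), " is older than 4.0; please update R.")
}

# system.file() returns "" when a package cannot be found,
# so this checks for installation without loading anything
installed <- vapply(workshop_pkgs,
                    function(pkg) nzchar(system.file(package = pkg)),
                    logical(1))
missing_pkgs <- workshop_pkgs[!installed]

if (length(missing_pkgs) == 0) {
  message("All workshop packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

If anything is reported as missing, passing `missing_pkgs` to `BiocManager::install()` should retry just those packages.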
Once you have installed the packages, you're ready to go with the workshop. So let's...
## Get crackin'
Follow this link for the [single-cell RNA-seq analysis workflow focusing on cell type identification](sahmri_analysis-workflow.html).
**What's not covered?** Single-cell data analysis is a huge topic - way bigger than could be crammed into a one-day workshop. So to squeeze the topics we will cover into one day we have had to leave out huge and important areas of single-cell data analysis including: trajectory (pseudotime) analysis, RNA velocity, sophisticated treatment of comparison of cells across samples/conditions, and any tools not easily available for use in R. There's a wonderful world of single-cell analysis out there that we're only scratching the surface of here - we hope it provides a foundation for you to explore further!
## Required R Packages
The list of R packages we will be using today is below. We will discuss these packages and which functions we use in more detail as we go through the workshop.
This material is adapted from [@Amezquita2019-yn].
This material is adapted from the outstanding [Bioconductor guide to single-cell analysis](https://osca.bioconductor.org).
cells) and monocytes. These PBMCs are from one healthy donor and are expected to
have relatively small amounts of RNA (~1 pg RNA/cell) ([Zheng et al., 2017](https://www.nature.com/articles/ncomms14049)). This
means the dataset is relatively small and easy to manage, but doesn't have the complexity of datasets that contain cells from multiple donors and/or experimental conditions.
Another method for understanding single-cell data through 'reduction' is matrix factorization and factor analysis. The key concept of factor analysis is that the original data are associated with some underlying unobserved variables: the latent factors. Ideally, these latent factors are biologically meaningful; for example, in single-cell data a factor could correspond to a specific regulatory process. Research into factor analysis and matrix factorization is still ongoing, but one leading method is [Slalom](https://bioconductor.org/packages/release/bioc/html/slalom.html), which aims to make the latent factors more interpretable by enforcing sparsity constraints.
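To make the idea of latent factors concrete, here is a minimal sketch on simulated data using a truncated SVD from base R, which is one of the simplest matrix factorizations. It is not Slalom and it does not enforce sparsity; the simulated matrix, the two "factors" and all variable names are assumptions made up for this illustration.

```r
set.seed(1)
n_cells <- 200
n_genes <- 50
n_factors <- 2

# Hidden factor activity for each cell (unobserved in real data)
Z <- matrix(rnorm(n_cells * n_factors), n_cells, n_factors)

# Sparse gene loadings: each factor drives its own block of 10 genes
W <- matrix(0, n_factors, n_genes)
W[1, 1:10] <- 2
W[2, 11:20] <- 2

# Observed data = factor activity x loadings + noise
X <- Z %*% W + matrix(rnorm(n_cells * n_genes, sd = 0.5), n_cells, n_genes)

# Factorize the centred matrix, keeping the first two components
Xc <- scale(X, center = TRUE, scale = FALSE)
fit <- svd(Xc, nu = n_factors, nv = n_factors)

scores <- fit$u %*% diag(fit$d[1:n_factors])  # cells x factors
loadings <- fit$v                             # genes x factors

# Share of total variance captured by the two components
var_explained <- sum(fit$d[1:n_factors]^2) / sum(fit$d^2)

# Genes with the largest absolute loading on the first component
top_genes <- order(abs(loadings[, 1]), decreasing = TRUE)[1:10]
```

In this toy setting the high-loading genes fall within the 20 genes that were simulated to respond to a factor. Because the two simulated factors are equally strong, the SVD components may be mixtures of them, which is part of the motivation for methods that add interpretability constraints.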
## Clustering and cell annotation
### Autoencoders
The [scVI paper](https://www.nature.com/articles/s41592-018-0229-2) can be credited with introducing and raising the profile of variational autoencoders - a relatively recent method from machine learning - for single-cell analysis. Variational autoencoders also use the idea of a latent space to understand structure in a dataset. Unlike more "traditional" matrix factorisation approaches, however, they use neural networks to compress information and can take advantage of gradient descent and GPU acceleration to scale to very large datasets. Since the publication of scVI, at least 40 methods have been proposed with twists on autoencoders; many of them provide excellent utility for data analysis, but a comprehensive survey is beyond the scope of what we can provide here. The [scvi-tools webpage](https://scvi-tools.org/) is now a great resource for related probabilistic methods for single-cell data analysis available in both Python and R that can interact with Bioconductor and Seurat workflows. We don't have the space to explore them here, but they are very much worth exploring.
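For intuition only, the core idea of compressing data into a latent space by gradient descent can be sketched with a tiny linear autoencoder in base R. This is emphatically not scVI or a variational autoencoder: there is no neural network non-linearity, no probabilistic model and no GPU. The simulated data, the learning rate and all variable names are assumptions chosen for this illustration.

```r
set.seed(42)
n_cells <- 200
n_genes <- 30
n_latent <- 2

# Simulated expression-like matrix driven by two hidden factors, then centred
hidden <- matrix(rnorm(n_cells * n_latent), n_cells, n_latent)
gene_weights <- matrix(rnorm(n_latent * n_genes), n_latent, n_genes)
X <- hidden %*% gene_weights +
  matrix(rnorm(n_cells * n_genes, sd = 0.3), n_cells, n_genes)
X <- scale(X, center = TRUE, scale = FALSE)

# Encoder (genes -> latent) and decoder (latent -> genes), small random start
W_enc <- matrix(rnorm(n_genes * n_latent, sd = 0.1), n_genes, n_latent)
W_dec <- matrix(rnorm(n_latent * n_genes, sd = 0.1), n_latent, n_genes)

learning_rate <- 0.05
n_iter <- 1000
loss_history <- numeric(n_iter)

for (i in seq_len(n_iter)) {
  Z <- X %*% W_enc        # encode: cells x latent
  X_hat <- Z %*% W_dec    # decode: cells x genes
  resid <- X_hat - X
  loss_history[i] <- mean(resid^2)  # mean squared reconstruction error

  # Gradients of the mean squared error with respect to each weight matrix
  grad_dec <- (2 / length(X)) * t(Z) %*% resid
  grad_enc <- (2 / length(X)) * t(X) %*% (resid %*% t(W_dec))

  # Plain gradient descent step
  W_enc <- W_enc - learning_rate * grad_enc
  W_dec <- W_dec - learning_rate * grad_dec
}

# Two-dimensional latent representation of each cell
latent <- X %*% W_enc
```

Plotting `loss_history` shows the reconstruction error falling as training proceeds. Tools such as scVI replace the linear maps with neural networks and add a probabilistic model of count data on top of this basic encode-decode structure.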