Commit d13d36fd authored by Davis McCarthy

Updating workshop content

parent 9ec26366
......@@ -14,8 +14,37 @@ knitr::opts_chunk$set(echo = TRUE)
In this workshop, we will introduce the data infrastructure for scRNA-seq analysis
in R and practice a workflow of scRNAseq analysis; from pre-processing, quality
control to dimensionality reduction and clustering. We will then demonstrate the
usage of marker genes for cell type annotation and an automatic approach for matching
query cells to a reference atlas with labels.
control to dimensionality reduction and clustering, marker gene detection and cell type annotation - with plenty more along the way!
## Setup requirements
We use the Bioconductor single-cell ecosystem for this workshop. Thus, participants will need a recent version of R (version 4.0+) and a set of specific packages that we use.
The code snippet below installs the necessary packages (run it at the R prompt in an R or RStudio session). The first line installs the `BiocManager` package, which is the recommended tool for installing Bioconductor packages. The second (long) call then installs the required Bioconductor and CRAN packages, along with any dependencies.
```{r package-installation, eval=FALSE}
install.packages("BiocManager")
BiocManager::install(c("scRNAseq",
"scater",
"scran",
"clustree",
"BiocSingular",
"Rtsne",
"BiocFileCache",
"DropletUtils",
"EnsDb.Hsapiens.v86",
"schex",
"celldex",
"SingleR",
"gridExtra"))
```
If you have trouble with the package installation, please ask a colleague for help.
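One way to confirm that the installation worked is to check which of the packages can be loaded. The sketch below is not part of the original workshop material; `workshop_pkgs`, `is_installed` and `missing_pkgs` are names chosen here for illustration, and the package list is copied from the installation snippet above.

```r
# Packages listed in the installation snippet above.
workshop_pkgs <- c("scRNAseq", "scater", "scran", "clustree", "BiocSingular",
                   "Rtsne", "BiocFileCache", "DropletUtils",
                   "EnsDb.Hsapiens.v86", "schex", "celldex", "SingleR",
                   "gridExtra")

# requireNamespace() returns FALSE (rather than throwing an error)
# when a package is not installed.
is_installed <- vapply(workshop_pkgs, requireNamespace, logical(1),
                       quietly = TRUE)
missing_pkgs <- workshop_pkgs[!is_installed]

if (length(missing_pkgs) == 0) {
  message("All workshop packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

Any names reported as missing can be passed to `BiocManager::install()` again.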
Once you have installed the packages, you're ready to go with the workshop. So let's...
## Get crackin'
Follow this link for the [single-cell RNA-seq analysis workflow focusing on cell type identification](sahmri_analysis-workflow.html).
......@@ -12,33 +12,41 @@ knitr::opts_chunk$set(echo = TRUE, dpi = 100, warning = FALSE)
## Introduction
This workshop provides a quick (~2 hour), hands-on introduction to current computational analysis approaches for studying single-cell RNA sequencing data. We will discuss the data formats used frequently in single-cell analysis, typical steps in quality control and processing of single-cell data, and some of the biological questions these data can be used to answer.
This workshop provides a quick (~5 hour), hands-on introduction to current computational analysis approaches for studying single-cell RNA sequencing data. We will discuss the data formats used frequently in single-cell analysis, typical steps in quality control and processing of single-cell data, some of the biological questions these data can be used to answer, and the data analysis steps used to answer them.
Given we only have 2 hours, we will start with the data already processed into a count matrix, which contains the number of sequencing reads mapping to each gene for each cell. The steps to generate such a count matrix depend on the type of single-cell sequencing technology used and the experimental design. For a more detailed introduction to these methods we recommend the long-form single-cell RNA seq analysis workshop from the BioCellGen group ([available here](https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/)) and the analysis of single-cell RNA-seq data course put together by folks at the Sanger Institute [available here](https://scrnaseq-course.cog.sanger.ac.uk/website/processing-raw-scrna-seq-data.html). Briefly, a count matrix is generated from raw sequencing data using the following steps:
Given we only have a few hours, we will start with the data already processed into a count matrix, which contains the number of unique molecular identifiers (UMIs) mapping to each gene for each cell. The steps to generate such a count matrix depend on the type of single-cell sequencing technology used and the experimental design. For a more detailed introduction to these methods we recommend the long-form single-cell RNA seq analysis workshop from the SVI BioCellGen group ([available here](https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/)) and the analysis of single-cell RNA-seq data course put together by folks at the Sanger Institute ([available here](https://scrnaseq-course.cog.sanger.ac.uk/website/processing-raw-scrna-seq-data.html)).
Briefly, a count matrix is generated from raw sequencing data using the following steps:
1. If multiple sample libraries were pooled together for sequencing (typically to reduce cost), sequences are separated (i.e. demultiplexed) by sample using barcodes that were added to the reads during library preparation.
1. Quality control on the raw sequencing reads (e.g. using [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
1. Trimming the reads to remove sequencing adapters and low quality sequences from the ends of each read (e.g. using [Trim Galore!](https://github.com/FelixKrueger/TrimGalore)).
1. Aligning QCed and trimmed reads to the genome (e.g. using [STARsolo](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md), [Kallisto-BUStools](https://www.kallistobus.tools/about), or [CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_ov)).
1. Quality control of mapping results.
1. Quantify the expression level of each gene for each cell (e.g. using bulk tools or single-cell specific tools from STAR, Kallisto-BUStools, or CellRanger).
1. Quantify the expression level of each gene for each cell (e.g. using bulk tools or single-cell specific tools from STAR, Kallisto-BUStools, Salmon-alevin, or CellRanger).
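To make the end product of these steps concrete, here is a toy count matrix built in base R. It is an illustrative sketch only: the gene and cell names are made up, and the counts are simulated rather than taken from any real dataset.

```r
# Toy UMI count matrix: rows are genes, columns are cells (barcodes).
# Gene and cell names are invented for illustration.
set.seed(1)
counts <- matrix(rpois(4 * 5, lambda = 2), nrow = 4, ncol = 5,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))
counts

# Two simple per-cell summaries that are commonly inspected during
# quality control: total UMIs per cell and number of genes detected.
library_size   <- colSums(counts)
genes_detected <- colSums(counts > 0)
```

Real count matrices have the same layout but are far larger and mostly zeros, which is why they are usually stored in a sparse format.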
Starting with the expression counts matrix, this workshop will cover the following topics:
1. The SingleCellExperiment object
1. Empty droplet identification
1. Cell-level quality control
1. Normalisation
1. Dimension reduction and visualisation
1. Gene annotation
1. Quality control and exploratory data analysis
1. Normalisation, confounders and batch correction
1. Variance modeling (feature selection)
1. Latent spaces and visualisation
1. Clustering
1. Marker gene/cell annotation
1. Marker gene detection and cell annotation
1. Differential expression analysis
We will work through these topics together in this hands-on workshop.
**What's not covered?** Single-cell data analysis is a huge topic - way bigger than could be crammed into a one-day workshop. So to squeeze the topics we will cover into one day we have had to leave out huge and important areas of single-cell data analysis including: trajectory (pseudotime) analysis, RNA velocity, sophisticated treatment of comparison of cells across samples/conditions, and any tools not easily available for use in R. There's a wonderful world of single-cell analysis out there that we're only scratching the surface of here - we hope it provides a foundation for you to explore further!
## Required R Packages
The list of R packages we will be using today is below. We will discuss these packages and which functions we use in more detail as we go through the workshop.
This material is adapted from [@Amezquita2019-yn].
This material is adapted from the outstanding [Bioconductor guide to single-cell analysis](https://osca.bioconductor.org).
```{r packages}
suppressPackageStartupMessages({
......@@ -62,9 +70,9 @@ suppressPackageStartupMessages({
We will use the peripheral blood mononuclear cell (PBMC) dataset from 10X
Genomics, which consists of different kinds of lymphocytes (T cells, B cells, NK
cells) and monocytes. These PBMCs are from a healthy donor and are expected to
have relatively small amounts of RNA (~1 pg RNA/cell) [@Zheng2017-hd]. This
means the dataset is relatively small and easy to manage.
cells) and monocytes. These PBMCs are from one healthy donor and are expected to
have relatively small amounts of RNA (~1 pg RNA/cell) ([Zheng et al, 2017](https://www.nature.com/articles/ncomms14049)). This
means the dataset is relatively small and easy to manage, but doesn't have the complexity of datasets that contain cells from multiple donors and/or experimental conditions.
We can download the gene count matrix for this dataset from 10X Genomics and
unpack the zipped file directly in R. We also use `BiocFileCache` function
......@@ -963,7 +971,11 @@ plot_hexbin_meta(pbmc_hex, col = "hybrid_score", action = "mean") +
Another method for understanding single-cell data through 'reduction' is matrix factorization and factor analysis. The key concept of factor analysis is that the original data are associated with some underlying unobserved variables: the latent factors. Hopefully, these latent factors are biologically meaningful; for example, in single-cell data a factor could correspond to a specific regulatory process. Research into factor analysis and matrix factorization is still ongoing, but one leading method is [Slalom](https://bioconductor.org/packages/release/bioc/html/slalom.html), which aims to make the latent factors more interpretable by enforcing sparsity constraints.
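As a minimal illustration of the matrix factorization idea, the sketch below simulates data driven by two hidden factors and recovers a low-rank approximation with a truncated singular value decomposition in base R. This is plain SVD with no sparsity constraints, so it is not the Slalom model; all object names and the simulated data are assumptions made for the example.

```r
set.seed(42)
n_genes <- 50
n_cells <- 30
n_factors <- 2

# Simulate expression driven by two hidden factors plus a little noise.
loadings <- matrix(rnorm(n_genes * n_factors), n_genes, n_factors)  # genes x factors
scores   <- matrix(rnorm(n_factors * n_cells), n_factors, n_cells)  # factors x cells
noise    <- matrix(rnorm(n_genes * n_cells, sd = 0.1), n_genes, n_cells)
expr     <- loadings %*% scores + noise

# Factorise with SVD and keep the first k components:
# expr is approximated by U_k %*% D_k %*% t(V_k).
decomp <- svd(expr)
k <- n_factors
expr_rank_k <- decomp$u[, 1:k] %*% diag(decomp$d[1:k]) %*% t(decomp$v[, 1:k])

# Proportion of the total sum of squares captured by the first k components.
var_explained <- sum(decomp$d[1:k]^2) / sum(decomp$d^2)
var_explained
```

Because the simulated data really do have two underlying factors, the first two components capture almost all of the variation. Real data are rarely this clean.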
## Clustering and cell annotation
### Autoencoders
The [scVI paper](https://www.nature.com/articles/s41592-018-0229-2) can be credited with introducing and raising the profile of variational autoencoders - a relatively recent method from machine learning - for single-cell analysis. Variational autoencoders also use the idea of a latent space to understand structure in a dataset. Unlike more "traditional" matrix factorisation approaches, however, they use neural networks to compress information and can take advantage of gradient descent and GPU acceleration to scale to very large datasets. Since the publication of scVI, at least 40 methods have been proposed with twists on autoencoders; many of them provide excellent utility for data analysis, but a comprehensive survey is beyond the scope of what we can provide here. The [scvi-tools webpage](https://scvi-tools.org/) is now a great resource for related probabilistic methods for single-cell data analysis available in both Python and R that can interact with Bioconductor and Seurat workflows. We don't have the space to explore them here, but they are very much worth exploring.
## Clustering
One of the most promising applications of scRNA-seq is *de novo* discovery and
annotation of cell-types based on transcription profiles. We are going to use
......@@ -1026,7 +1038,7 @@ scRNA-seq. It is challenging to find the best resolution, and you usually need t
experiment with different parameters multiple times to find the best result.
## Plot Clusters
### Plot Clusters
Plot clustering result in a t-SNE plot
......@@ -1034,11 +1046,11 @@ Plot clustering result in a t-SNE plot
plotTSNE(sce.pbmc, colour_by = "label")
```
## Merging clusters
### Merging clusters
At this point, merging clusters to refine computationally-derived clusters into optimally biologically-relevant clusters is more art than science. In many settings it would be desirable to understand the hierarchical relationships between cell types and the annotations of cells to those cell types. At this point in time, we are not aware of computational methods that merge clusters in the way we might want. Readers interested in the topic, however, might be interested in following up with the `MRtree` method [(Peng et al, *NAR*, 2021)](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab481/6298623).
## Marker dection/cluster annotation
## Marker detection and cluster annotation
To interpret our clustering results, we identify the genes that drive separation
between clusters. These so-called 'marker genes' allow us to assign biological meaning to each cluster by comparing the expression profiles of the marker genes to the expression profiles of previously identified cell types. We demonstrate this approach in the section ["Manual" annotation].
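The idea behind marker detection can be sketched on toy data with base R: compare each gene's expression between clusters and rank genes by the strength of the difference. The example below is a simplified illustration with simulated values and made-up names, not the procedure used in the workshop; in practice a dedicated function such as `findMarkers()` from `scran` handles multiple clusters and testing details far more carefully.

```r
set.seed(7)
# Toy log-expression matrix: 6 genes x 20 cells, two clusters of 10 cells.
logexpr <- matrix(rnorm(6 * 20, mean = 1, sd = 0.3), nrow = 6,
                  dimnames = list(paste0("gene", 1:6), paste0("cell", 1:20)))
cluster <- rep(c("A", "B"), each = 10)

# Make gene1 a marker of cluster A by raising its expression there.
logexpr["gene1", cluster == "A"] <- logexpr["gene1", cluster == "A"] + 3

# Per gene: difference in mean log-expression (A minus B)
# and a Welch t-test p-value comparing the two clusters.
lfc  <- rowMeans(logexpr[, cluster == "A"]) - rowMeans(logexpr[, cluster == "B"])
pval <- apply(logexpr, 1, function(x) {
  t.test(x[cluster == "A"], x[cluster == "B"])$p.value
})

markers <- data.frame(gene = rownames(logexpr), lfc = lfc, p.value = pval)
markers[order(markers$p.value), ]
```

The gene with the engineered shift comes out on top, which is the pattern we look for when matching clusters to known cell types.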
......@@ -1184,7 +1196,7 @@ DE analysis possible using single cell data are:
Each of these different possible analyses requires a different methodology.
The first type of analysis is the same as finding marker genes (described
above) and any marker gene methods which use differential expression testing
can be used. A variety of methods are avaliable for the second type of
can be used. A variety of methods are available for the second type of
analysis. Recent benchmarking [@Soneson2018-hy] has suggested that [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html),
a differential expression method developed for bulk RNA-seq data, performs well. Other options are [MAST](https://bioconductor.org/packages/release/bioc/html/MAST.html),
a differential expression method developed for scRNA-seq data or [limma-voom](https://bioconductor.org/packages/release/bioc/html/limma.html),
......@@ -1194,13 +1206,11 @@ each sample are not independent. Currently, the best methods for this analysis u
a pseudobulk approach - aggregating cells in each sample and then using methods
designed for bulk data. This strategy is implemented in the [muscat](https://www.bioconductor.org/packages/release/bioc/html/muscat.html) package.
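The aggregation step of a pseudobulk approach can be sketched in base R as follows. This is an illustration on simulated counts with made-up sample labels, and it does not use the `muscat` interface; in a real analysis the aggregation would typically be done per cluster and per sample before passing the result to a bulk method.

```r
set.seed(3)
# Toy counts: 3 genes x 8 cells, with cells drawn from 4 samples (2 cells each).
counts <- matrix(rpois(3 * 8, lambda = 5), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:8)))
sample_id <- rep(c("s1", "s2", "s3", "s4"), each = 2)

# rowsum() sums rows by group, so transpose to put cells in rows,
# sum the cells within each sample, then transpose back to genes x samples.
pseudobulk <- t(rowsum(t(counts), group = sample_id))
pseudobulk
```

Each column of `pseudobulk` now behaves like one bulk RNA-seq sample, which is what allows bulk tools to be applied while respecting the fact that cells from the same sample are not independent.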
In addition, other types of analysis, similar to DE analysis, are possible in scRNA-seq
data. Firstly, we can test for differences in the number of cells in each cluster
between conditions. This is called differential abundance or differential composition
testing. Secondly, the large number of cells in single cell data allow testing of differences
in distribution - not just mean as in traditional DE analysis between samples. For both
of these analyses method development is ongoing and systematic benchmarking of available
methods does not yet exist.
In addition, other types of analysis, similar to DE analysis, are possible in scRNA-seq data. Firstly, we can test for differences in the number of cells in each cluster between conditions. This is called differential abundance or differential composition testing. If you are interested in differential abundance testing, take a look at the [Milo](https://www.biorxiv.org/content/10.1101/2020.11.23.393769v1) method implemented in the [miloR](http://bioconductor.org/packages/release/bioc/html/miloR.html) R/Bioconductor package.
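The starting point for cluster-based differential abundance testing is simply a table of cell counts per cluster and sample. The sketch below builds such a table from hypothetical annotations in base R. It shows only the tabulation, not a statistical test, and it is not how Milo works internally (Milo tests neighbourhoods on a cell graph rather than discrete clusters).

```r
# Hypothetical per-cell annotations: cluster label and sample of origin.
cells <- data.frame(
  cluster = c("T", "T", "B", "T", "B", "B", "T", "B", "B", "B"),
  sample  = c("s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s4", "s4"),
  stringsAsFactors = FALSE
)
# Hypothetical condition for each sample.
condition <- c(s1 = "control", s2 = "control", s3 = "treated", s4 = "treated")

# Number of cells per cluster (rows) and sample (columns),
# then the proportion of each cluster within each sample.
abundance   <- table(cells$cluster, cells$sample)
proportions <- prop.table(abundance, margin = 2)

# Mean proportion of B cells per condition, as a first descriptive look.
tapply(proportions["B", ], condition[colnames(proportions)], mean)
```

A proper analysis would model the counts in `abundance` with replicate samples per condition rather than comparing mean proportions by eye.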
Secondly, the large number of cells in single-cell data allows testing of differences in distribution - not just mean as in traditional DE analysis between samples. Several methods have been proposed in this area and there are fascinating results where single-cell expression variance is studied as an interesting phenotype [in its own right](https://science.sciencemag.org/content/355/6332/1433). In the overall picture of scRNA-seq data analysis, however, such differential distribution testing remains relatively niche, and those interested in such approaches are best served by talking to a statistical bioinformatics colleague about tailoring an approach specifically to their question(s) of interest.
For both of these analyses, method development is ongoing and systematic benchmarking of available methods does not yet exist, making it difficult to make general recommendations about tools to use. If interested, please consult your friendly neighbourhood bioinformatician!
### DE in a real dataset
......@@ -1210,9 +1220,7 @@ Let us take some time now to discuss DE queries that may arise for different dat
## Further exploration
The scRNAseq package provides convenient access to
a list of publicly available data sets in the form of `SingleCellExperiment`
objects. You can choose one of them to practice these steps we introduced above.
The `scRNAseq` package provides convenient access to a list of publicly available data sets in the form of `SingleCellExperiment` objects. You can choose one of them to practise the steps introduced above.
More resources:
......