Commit 01a23270 authored by Jeffrey Pullin's avatar Jeffrey Pullin
Browse files

Edits from Jeffrey

parent f1e61d56
......@@ -283,10 +283,10 @@ high.count <- isOutlier(stats$total, type="higher")
table(high.count)
```
---2. Why would be usefule to filter out cells with higher total counts---
---2. Why would be useful to filter out cells with higher total counts---
---3. Use `plotColData` to plot other useful cell-level QC metrics
(e.g. number of detected features, subsets_Mito_detected) and colour by whether cells would be discarded.---
(e.g. number of detected features, `subsets_Mito_detected`) and colour by whether cells would be discarded.---
Hint: you may find the following code helpful.
......@@ -324,13 +324,13 @@ expression profiles.
We are going to use a normalisation method named Normalisation by Deconvolution.
It is a scaling-based procedure that involves dividing all counts for each
cell by a cell-specific scaling factor, often called a “size factor”.The size
cell by a cell-specific scaling factor, often called a “size factor”. The size
factor for each cell represents the estimate of the relative bias in that cell,
so division of its counts by its size factor should remove that bias.
Adapted from bulk RNAseq analysis, we assume the majority of genes between
two cells are non-DE (not differential expressed) and the difference among the
nonDE genes are used to represent bias that is used to compute an appropriate
non-DE genes are used to represent bias that is used to compute an appropriate
size factor for its removal.
Given the characteristics of low-counts in single cell dataset, we first pool
......@@ -443,9 +443,9 @@ differentially expressed across all the cells, we can set the prop to be 5%. In
a more heterogeneous cell population, we would raise that number. Here we are
using top 10%.
Note that this number selection is actually quite arbitrary. It is recommended
to simply pick a value and go ahead with the downstream analysis but with the
intention of testing other parameters if needed.
Note that the selection of the number of genes to keep is actually quite arbitrary.
It is recommended to simply pick a value and go ahead with the downstream analysis
but with the intention of testing other parameters if needed.
......@@ -453,7 +453,7 @@ intention of testing other parameters if needed.
### Dimensionality reduction and visualisation
Now we have selected the highly variable genes, we can use these genes's
Now we have selected the highly variable genes, we can use these genes'
information to reveal cell population structures in a constructed lower dimensional
space. Working with this lower dimensional representation allows for easier
exploration and reduces the computational burden of downstream analyses,
......@@ -485,7 +485,7 @@ reducedDims(sce.pbmc)$PCA[1:5,]
```
tSNE (t-stochastic neighbor embedding ) is a popular dimensionality reduction
tSNE (t-stochastic neighbor embedding) is a popular dimensionality reduction
methods for visualising single-cell dataset with functions to run them available
in `scater`. They attempt to find a low-dimensional representation of the data
that preserves the distances between each point and its neighbours in the
......@@ -510,7 +510,7 @@ reducedDims(sce.pbmc)
Visualising column data or plotting gene expression in tSNE plot:
```{r tse_plot}
```{r tsne_plot}
plotTSNE(sce.pbmc, colour_by = "detected")
plotTSNE(sce.pbmc, colour_by = "CD14")
......@@ -523,12 +523,12 @@ Notice that in the 2D plots of single cells, sometimes it can get really
runs into a over-plotting issue where cells are two close overlaying/blocking
To deal with that, we can use the hexbin plots [@Freytag2020-us].
`schex` package provides binning strategies of cells into hexagon cells. Plotting
summarized information of all cells in their respective hexagon cells presents
The `schex` package strategies for binning the observed cells into hexagonal cells.
Plotting summarized information of all cells in their respective hexagonal cells presents
information without obstructions.
We use the `plot_hexbin_feature` to plot of feature expression of single cells in
bivariate hexagon cells.
We use the `plot_hexbin_feature` to plot the feature expression of single cells in
bivariate hexagonal cells.
```{r hexgonplot}
......@@ -567,13 +567,13 @@ plot_hexbin_feature(pbmc_hex, feature = "CD14",
## Clustering and cell annotation
One of the most promising applications of scRNA-seq is de novo discovery and
One of the most promising applications of scRNA-seq is *de novo* discovery and
annotation of cell-types based on transcription profiles. We are going to use
unsupervised clustering method, that is, we need to identify groups of cells
based on the similarities of the transcriptomes without any prior knowledge of
the labels.
In scRNAseq analyses, shared K-nearest neighbour graph-based clustering is
In scRNA-seq analyses, shared K-nearest neighbour graph-based clustering is
often applied. Each cell is represented in the graph as a node, and the first
step is to find all the KNN (K-neareast neighbours) for each cell by comparing
their pair-wise Euclidean distances. After that, the nodes are connected if they
......@@ -602,7 +602,7 @@ table(factor(clust))
### Clustree
`clustree` can be used to visualise the different clustering results in a
The `clustree` package can be used to visualise the different clustering results in a
tree structure. It helps to visualise how cells move as the clustering resolution
changes.
......@@ -625,8 +625,8 @@ clustree(multipleKResults, prefix = "k", prop_filter = 0.05, layout = "tree")
```
Varying clustering resolutions is an important part of data exploration for
scRNAseq. It is challenging to find the best resolution, and you usually need to
experiment with different parameters multiple times and try find the proper result.
scRNA-seq. It is challenging to find the best resolution, and you usually need to
experiment with different parameters multiple times and to find the best result.
## Plot Clusters
......@@ -641,42 +641,44 @@ plotTSNE(sce.pbmc, colour_by = "label")
Perhaps add a section here on MRtree (Peng et al, NAR, 2021)
## Marker dection/cluster annotation
To interpret our clustering results, we identify the genes that drive separation
between clusters. These marker genes allow us to assign biological meaning to
each cluster based on their functional annotation. We demonstrate this approach
in the ["Manual" annotation].
between clusters. These so-called 'marker genes' allow us to assign biological meaning to
each cluster based by comparing the expression profiles of the marker genes to the
expression profiles of previously identified cell types. We demonstrate this approach
in the section ["Manual" annotation].
An alternative way is to use some automatic annotation approach, see the
[Automatic annotation ]
["Automatic" annotation] section.
### "Manual" annotation
`findMarkers` is done via pair-wise comparisions of all pari-wise combinations
of clusters. Each cluster is compared to each of the remaining clusters and
up-regulated genes are identified using t-test. The returned results for one
cluster (versus any of the remaining clusters) are combined into a single list
and the combined.P.values are calculated. One can specifty the `pval.type` which
determines how the DEGs are combined: `any`, return genes that are determined
as up-regulated in any of the comparisons; `all`, return genes that are determined
as up-regulated in `all` of the comparisons,`some` is a compromise between `any`
and `all`.
```{r, findMarkers}
To identify the marker genes for each cluster we use the `findMarkers()` function
from the scran package. The `findMarkers()` function uses pair-wise differential
expression testing to identify marker genes. This means that, for each gene, it
tests for differences in expression between the cluster of interest and each
of the other clusters in the dataset seperately. By default, a t-test is used, but
a binomial test or the Wilcoxon rank sum test are also implemented. In the code below
we specify that we only want to idenfity genes which are `"up"` regulated in the
cluster of interest. The returned results for one cluster (versus all of the remaining
clusters) are combined into a single list and combined p-values are calculated.
One can specifty the `pval.type` which determines how the DEGs are combined: `any`,
return genes that are determined as up-regulated in any of the comparisons; `all`,
return genes that are determined as up-regulated in `all` of the comparisons, `some`
is a compromise between `any` and `all`.
```{r findMarkers}
markers <- findMarkers(sce.pbmc, pval.type = "some", direction = "up")
```
We examine the markers for cluster 7 and 14 in more detail.
```{r,plotMarkers}
```{r plotMarkers}
marker.set <- markers[["7"]]
as.data.frame(marker.set[1:10,1:3])
marker.set <- markers[["14"]]
as.data.frame(marker.set[1:10,1:3])
```
### Question
......@@ -730,15 +732,14 @@ gridExtra::grid.arrange(
)
```
```{r}
plotExpression(sce.pbmc, features = c("CD14"), x = "label", colour_by = "label")
```
### Automatic annotation
### "Automatic" annotation
The intuition of automatic annotation is by comparing current dataset with
previously annotation reference samples and find the cell type of the most
Intuitively, automatic annotation works by comparing the current dataset with
previously annotated reference samples and finding the cell type of the most
similar/correlated cells. `SingleR` method [@Aran2019-pw] provides one such option
for cell type annotation. This method assigns labels to cells based on the
reference samples with the highest Spearman rank correlations, using only the
......@@ -772,13 +773,40 @@ colData(sce.pbmc) <- cbind(colData(sce.pbmc),pred)
plotTSNE(sce.pbmc, colour_by = "labels")
```
## Differential expression (DE) analysis
### Introduction to DE analysis
### DE in a real dataset
Unlike analysing bulk RNA-seq data, where only one type of DE analysis is possible,
many types of DE analysis can be performed on single cell data. Some different types
DE analysis possible using single cell data are:
1. Comparing one cluster to all other clusters in a dataset
1. Comparing a specific pair of clusters (or groups of clusters) within a single dataset
1. Comparing the same cell type across samples in different conditions
Each of these different possible analyses require different methodologies.
The first type of analysis is the same as finding marker genes (described
above) and any marker gene methods which use differential expression testing
can be used. A variety of methods are avaliable for the second type of
analysis. Recent benchmarking [@Soneson2018-hy] has suggested that [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html),
a differential expression method developed for bulk RNA-seq data performs well. Other options are [MAST](https://bioconductor.org/packages/release/bioc/html/MAST.html),
a differential expression method developed for scRNA-seq data or [limma-voom](https://bioconductor.org/packages/release/bioc/html/limma.html),
another method developed for bulk RNA-seq data. Finally, for the third type of
analysis is it critical to account for the fact that measurements of cells within
each sample are not independent. Currently, the best methods for this analysis use
a pseudobulk approach - aggregating cells in each sample and then using methods
designed for bulk data. This strategy is implemented in the [muscat](https://www.bioconductor.org/packages/release/bioc/html/muscat.html) package.
In addition, other types of analysis, similar to DE analysis, are possible in scRNA-seq
data. Firstly, we can test for differences in the number of cells in each cluster
between conditions. This is called differential abundance or differential composition
testing. Secondly, the large number of cells in single cell data allow testing of differences
in distribution - not just mean as in traditional DE analysis between samples. For both
of these analyses method development is ongoing and systematic benchmarking of available
methods does not yet exist.
### DE in a real dataset
## Further exploration
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment