diff --git a/analysis/sahmri_analysis-workflow.Rmd b/analysis/sahmri_analysis-workflow.Rmd index 4737ef1833d56e57bf252aacae83ce5e4460d5c5..b7b60b3aefa9cb43174c036e88f6cc7f9a14396e 100644 --- a/analysis/sahmri_analysis-workflow.Rmd +++ b/analysis/sahmri_analysis-workflow.Rmd @@ -283,10 +283,10 @@ high.count <- isOutlier(stats$total, type="higher") table(high.count) ``` ----2. Why would be usefule to filter out cells with higher total counts--- +---2. Why would be useful to filter out cells with higher total counts--- ---3. Use `plotColData` to plot other useful cell-level QC metrics -(e.g. number of detected features, subsets_Mito_detected) and colour by whether cells would be discarded.--- +(e.g. number of detected features, `subsets_Mito_detected`) and colour by whether cells would be discarded.--- Hint: you may find the following code helpful. @@ -324,13 +324,13 @@ expression profiles. We are going to use a normalisation method named Normalisation by Deconvolution. It is a scaling-based procedure that involves dividing all counts for each -cell by a cell-specific scaling factor, often called a “size factor”.The size +cell by a cell-specific scaling factor, often called a “size factor”. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. Adapted from bulk RNAseq analysis, we assume the majority of genes between two cells are non-DE (not differential expressed) and the difference among the -nonDE genes are used to represent bias that is used to compute an appropriate +non-DE genes are used to represent bias that is used to compute an appropriate size factor for its removal. Given the characteristics of low-counts in single cell dataset, we first pool @@ -443,9 +443,9 @@ differentially expressed across all the cells, we can set the prop to be 5%. In a more heterogeneous cell population, we would raise that number. Here we are using top 10%. -Note that this number selection is actually quite arbitrary. It is recommended -to simply pick a value and go ahead with the downstream analysis but with the -intention of testing other parameters if needed. +Note that the selection of the number of genes to keep is actually quite arbitrary. +It is recommended to simply pick a value and go ahead with the downstream analysis +but with the intention of testing other parameters if needed. @@ -453,7 +453,7 @@ intention of testing other parameters if needed. ### Dimensionality reduction and visualisation -Now we have selected the highly variable genes, we can use these genes's +Now we have selected the highly variable genes, we can use these genes' information to reveal cell population structures in a constructed lower dimensional space. Working with this lower dimensional representation allows for easier exploration and reduces the computational burden of downstream analyses, @@ -485,7 +485,7 @@ reducedDims(sce.pbmc)$PCA[1:5,] ``` -tSNE (t-stochastic neighbor embedding ) is a popular dimensionality reduction +tSNE (t-stochastic neighbor embedding) is a popular dimensionality reduction methods for visualising single-cell dataset with functions to run them available in `scater`. They attempt to find a low-dimensional representation of the data that preserves the distances between each point and its neighbours in the @@ -510,7 +510,7 @@ reducedDims(sce.pbmc) Visualising column data or plotting gene expression in tSNE plot: -```{r tse_plot} +```{r tsne_plot} plotTSNE(sce.pbmc, colour_by = "detected") plotTSNE(sce.pbmc, colour_by = "CD14") @@ -523,12 +523,12 @@ Notice that in the 2D plots of single cells, sometimes it can get really runs into a over-plotting issue where cells are two close overlaying/blocking To deal with that, we can use the hexbin plots [@Freytag2020-us]. -`schex` package provides binning strategies of cells into hexagon cells. Plotting -summarized information of all cells in their respective hexagon cells presents +The `schex` package strategies for binning the observed cells into hexagonal cells. +Plotting summarized information of all cells in their respective hexagonal cells presents information without obstructions. -We use the `plot_hexbin_feature` to plot of feature expression of single cells in -bivariate hexagon cells. +We use the `plot_hexbin_feature` to plot the feature expression of single cells in +bivariate hexagonal cells. ```{r hexgonplot} @@ -567,13 +567,13 @@ plot_hexbin_feature(pbmc_hex, feature = "CD14", ## Clustering and cell annotation -One of the most promising applications of scRNA-seq is de novo discovery and +One of the most promising applications of scRNA-seq is *de novo* discovery and annotation of cell-types based on transcription profiles. We are going to use unsupervised clustering method, that is, we need to identify groups of cells based on the similarities of the transcriptomes without any prior knowledge of the labels. -In scRNAseq analyses, shared K-nearest neighbour graph-based clustering is +In scRNA-seq analyses, shared K-nearest neighbour graph-based clustering is often applied. Each cell is represented in the graph as a node, and the first step is to find all the KNN (K-neareast neighbours) for each cell by comparing their pair-wise Euclidean distances. After that, the nodes are connected if they @@ -602,7 +602,7 @@ table(factor(clust)) ### Clustree -`clustree` can be used to visualise the different clustering results in a +The `clustree` package can be used to visualise the different clustering results in a tree structure. It helps to visualise how cells move as the clustering resolution changes. @@ -625,8 +625,8 @@ clustree(multipleKResults, prefix = "k", prop_filter = 0.05, layout = "tree") ``` Varying clustering resolutions is an important part of data exploration for -scRNAseq. It is challenging to find the best resolution, and you usually need to -experiment with different parameters multiple times and try find the proper result. +scRNA-seq. It is challenging to find the best resolution, and you usually need to +experiment with different parameters multiple times and to find the best result. ## Plot Clusters @@ -641,42 +641,44 @@ plotTSNE(sce.pbmc, colour_by = "label") Perhaps add a section here on MRtree (Peng et al, NAR, 2021) - ## Marker dection/cluster annotation To interpret our clustering results, we identify the genes that drive separation -between clusters. These marker genes allow us to assign biological meaning to -each cluster based on their functional annotation. We demonstrate this approach -in the ["Manual" annotation]. +between clusters. These so-called 'marker genes' allow us to assign biological meaning to +each cluster based by comparing the expression profiles of the marker genes to the +expression profiles of previously identified cell types. We demonstrate this approach +in the section ["Manual" annotation]. An alternative way is to use some automatic annotation approach, see the -[Automatic annotation ] +["Automatic" annotation] section. ### "Manual" annotation -`findMarkers` is done via pair-wise comparisions of all pari-wise combinations -of clusters. Each cluster is compared to each of the remaining clusters and -up-regulated genes are identified using t-test. The returned results for one -cluster (versus any of the remaining clusters) are combined into a single list -and the combined.P.values are calculated. One can specifty the `pval.type` which -determines how the DEGs are combined: `any`, return genes that are determined -as up-regulated in any of the comparisons; `all`, return genes that are determined -as up-regulated in `all` of the comparisons,`some` is a compromise between `any` -and `all`. - - -```{r, findMarkers} +To identify the marker genes for each cluster we use the `findMarkers()` function +from the scran package. The `findMarkers()` function uses pair-wise differential +expression testing to identify marker genes. This means that, for each gene, it +tests for differences in expression between the cluster of interest and each +of the other clusters in the dataset seperately. By default, a t-test is used, but +a binomial test or the Wilcoxon rank sum test are also implemented. In the code below +we specify that we only want to idenfity genes which are `"up"` regulated in the +cluster of interest. The returned results for one cluster (versus all of the remaining +clusters) are combined into a single list and combined p-values are calculated. +One can specifty the `pval.type` which determines how the DEGs are combined: `any`, +return genes that are determined as up-regulated in any of the comparisons; `all`, +return genes that are determined as up-regulated in `all` of the comparisons, `some` +is a compromise between `any` and `all`. + +```{r findMarkers} markers <- findMarkers(sce.pbmc, pval.type = "some", direction = "up") ``` We examine the markers for cluster 7 and 14 in more detail. -```{r,plotMarkers} +```{r plotMarkers} marker.set <- markers[["7"]] as.data.frame(marker.set[1:10,1:3]) marker.set <- markers[["14"]] as.data.frame(marker.set[1:10,1:3]) - ``` ### Question @@ -730,15 +732,14 @@ gridExtra::grid.arrange( ) ``` - ```{r} plotExpression(sce.pbmc, features = c("CD14"), x = "label", colour_by = "label") ``` -### Automatic annotation +### "Automatic" annotation -The intuition of automatic annotation is by comparing current dataset with -previously annotation reference samples and find the cell type of the most +Intuitively, automatic annotation works by comparing the current dataset with +previously annotated reference samples and finding the cell type of the most similar/correlated cells. `SingleR` method [@Aran2019-pw] provides one such option for cell type annotation. This method assigns labels to cells based on the reference samples with the highest Spearman rank correlations, using only the @@ -772,13 +773,40 @@ colData(sce.pbmc) <- cbind(colData(sce.pbmc),pred) plotTSNE(sce.pbmc, colour_by = "labels") ``` - ## Differential expression (DE) analysis ### Introduction to DE analysis -### DE in a real dataset +Unlike analysing bulk RNA-seq data, where only one type of DE analysis is possible, +many types of DE analysis can be performed on single cell data. Some different types +DE analysis possible using single cell data are: + +1. Comparing one cluster to all other clusters in a dataset +1. Comparing a specific pair of clusters (or groups of clusters) within a single dataset +1. Comparing the same cell type across samples in different conditions + +Each of these different possible analyses require different methodologies. +The first type of analysis is the same as finding marker genes (described +above) and any marker gene methods which use differential expression testing +can be used. A variety of methods are avaliable for the second type of +analysis. Recent benchmarking [@Soneson2018-hy] has suggested that [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html), +a differential expression method developed for bulk RNA-seq data performs well. Other options are [MAST](https://bioconductor.org/packages/release/bioc/html/MAST.html), +a differential expression method developed for scRNA-seq data or [limma-voom](https://bioconductor.org/packages/release/bioc/html/limma.html), +another method developed for bulk RNA-seq data. Finally, for the third type of +analysis is it critical to account for the fact that measurements of cells within +each sample are not independent. Currently, the best methods for this analysis use +a pseudobulk approach - aggregating cells in each sample and then using methods +designed for bulk data. This strategy is implemented in the [muscat](https://www.bioconductor.org/packages/release/bioc/html/muscat.html) package. + +In addition, other types of analysis, similar to DE analysis, are possible in scRNA-seq +data. Firstly, we can test for differences in the number of cells in each cluster +between conditions. This is called differential abundance or differential composition +testing. Secondly, the large number of cells in single cell data allow testing of differences +in distribution - not just mean as in traditional DE analysis between samples. For both +of these analyses method development is ongoing and systematic benchmarking of available +methods does not yet exist. +### DE in a real dataset ## Further exploration