Commit c925d654 authored by Davis McCarthy's avatar Davis McCarthy
Browse files

Bug fix in DE chapter; adding data integration chapter content

parent 2ed62f5b
Pipeline #958 passed with stage
in 5 seconds
......@@ -3,10 +3,76 @@ output: html_document
```{r setup, echo=FALSE}
knitr::opts_chunk$set(out.width='90%', fig.align = 'center', eval=FALSE)
knitr::opts_chunk$set(out.width='90%', fig.align = 'center', eval=TRUE)
knitr::opts_knit$set(root.dir = normalizePath(".."))
# Integrating single-cell 'omics datasets
This is a very big topic, too big to cover in depth in this workshop!
However, we have already seen a few approaches to integrating single-cell
RNA-seq data in various ways. This chapter provides the opportunity to discuss
these ideas further and develop something of a taxonomy of data integration aims
and approaches for single-cell 'omics data.
"Data integration" of single-cell 'omics data may cover any or all of the
- Batch correction within an experiment/study;
- Using a reference ("atlas") to inform analysis of a generated dataset;
- Combining data from the same biological system but across different
- Making use of multiple 'omics "views": from the same cells or not;
- ... and many more possibilities!
The [12 Grand Challenges in Single-Cell Data
Science]( preprint provides a more systematic
way of thinking about the different types of single-cell 'omics data integration
that we may want to achieve. The figure below lays out several distinct types of approach.
```{r, echo=FALSE, merged_seurat.width = '90%', fig.cap="Reproduction of Figure 6 from Laehnemann et al (2019). Approaches for integrating single-cell measurement datasets across measurement types, samples and experiments. Approach 0: Clustering of cells from one sample from one experiment, no data integration is needed. Approach 1: Cell populations / clusters from multiple samples but the same measurement type need to be linked. Approach 2: For cell populations / clusters across multiple experiments, stable reference systems like cell atlases are needed (compare Figure 1). Approach 3: Whenever multiple measurement types can be obtained from the same cell, they are automatically linked. However, this setup highlights the problem of data sparsity of all available measurement types and the dependency of measurement types that needs to be accounted for. Approach 4: When multiple measurement types cannot be obtained from the same cell, a solution is to obtain them from cells of the same cell population. However, this combines the problems of Approach 1 with those of Approach 3. Approach 5: One possibility for easing data integration across measurement types from separate cells would be to have a stable reference (cell atlas) across multiple measurement types. Effectively, this combines the problems of Approaches 2, 3 and 4."}
This table provides some more details and examples:
```{r, echo=FALSE, merged_seurat.width = '90%', fig.cap="Reproduction of Table 2 from Laehnemann et al (2019)."}
As we can see here, there are many different approaches to integrating data, and
the approaches depend on the data types we have and what we want to achieve.
Some types of data integration are already eminently feasible; others require
much more methods and software development before they are achievable.
Ultimately, it all comes back to our __biological questions__. What questions we
want to answer will drive the data we generate and the approaches we might
sensibly take to integrate that data.
_A final thought:_ in some (many?) what we might call __data synthesis__ might be
preferable to __data integration__. That is, we might not need or want to
combine disparate data sets and data types into one holistic (and likely very
challenging) analysis. Rather, we might instead analyse different data
sets/types separately and __synthesise__ what we learn from each of them to
answer biological questions of interest. Kind of how science is supposed to
## Further reading
We recommend that you read section 6.1 of the "Grand Challenges" paper for more
detailed discussion of the current status of data integration for single-cell
'omics data and open problems that remain:
- Laehnemann,D. et al. (2019) 12 Grand challenges in single-cell data science _PeerJ Preprints_. [link](
......@@ -204,7 +204,7 @@ experessed genes. Here, we will keep genes with non-zero expression in at least
The first steps for this analysis then involve
```{r edger-plot, fig.cap="Biological coefficient of variation plot for edgeR.", message=FALSE}
```{r edger-bcvplot, fig.cap="Biological coefficient of variation plot for edgeR.", message=FALSE}
keep_gene <- (rowSums(counts_mat > 0) > 29.5 & rowMeans(counts_mat) > 0.2)
dge <- DGEList(
......@@ -223,7 +223,7 @@ plotBCV(dge)
Next we fit a negative binomial quasi-likelihood model for differential
```{r edger-plot, fig.cap="ROC curve for edgeR.", message=FALSE}
```{r edger-rocplot, fig.cap="ROC curve for edgeR.", message=FALSE}
fit <- glmQLFit(dge, design)
res <- glmQLFTest(fit)
pVals <- res$table[,4]
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment