From c925d654d4e6e0b46b05b0809ae472a990cfa138 Mon Sep 17 00:00:00 2001
From: Davis McCarthy <davismcc@gmail.com>
Date: Thu, 3 Oct 2019 00:04:36 +1000
Subject: [PATCH] Bug fix in DE chapter; adding data integration chapter
 content

---
 course_files/data-integration.Rmd | 68 ++++++++++++++++++++++++++++++-
 course_files/de-real.Rmd          |  4 +-
 2 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/course_files/data-integration.Rmd b/course_files/data-integration.Rmd
index 9085c44..844d557 100644
--- a/course_files/data-integration.Rmd
+++ b/course_files/data-integration.Rmd
@@ -3,10 +3,76 @@ output: html_document
 ---
 
 ```{r setup, echo=FALSE}
-knitr::opts_chunk$set(out.width='90%', fig.align = 'center', eval=FALSE)
+knitr::opts_chunk$set(out.width='90%', fig.align = 'center', eval=TRUE)
 knitr::opts_knit$set(root.dir = normalizePath(".."))
 ```
 
 # Integrating single-cell 'omics datasets
 
+This is a very big topic, too big to cover in depth in this workshop! 
+
+However, we have already seen a few approaches to integrating single-cell
+RNA-seq data. This chapter provides the opportunity to discuss
+these ideas further and develop something of a taxonomy of data integration aims
+and approaches for single-cell 'omics data.
+
+"Data integration" of single-cell 'omics data may cover any or all of the
+following:
+
+- Batch correction within an experiment/study;
+- Using a reference ("atlas") to inform analysis of a generated dataset;
+- Combining data from the same biological system but across different
+  studies/labs/platforms;
+- Making use of multiple 'omics "views", whether measured in the same cells or not;
+- ... and many more possibilities!
+
+The [12 Grand Challenges in Single-Cell Data
+Science](https://peerj.com/preprints/27885/) preprint provides a more systematic
+way of thinking about the different types of single-cell 'omics data integration
+that we may want to achieve. The figure below lays out several distinct types of approach.
+
+
+```{r, echo=FALSE, out.width = '90%', fig.cap="Reproduction of Figure 6 from Laehnemann et al. (2019). Approaches for integrating single-cell measurement datasets across measurement types, samples and experiments. Approach 0: Clustering of cells from one sample from one experiment, no data integration is needed. Approach 1: Cell populations / clusters from multiple samples but the same measurement type need to be linked. Approach 2: For cell populations / clusters across multiple experiments, stable reference systems like cell atlases are needed (compare Figure 1). Approach 3: Whenever multiple measurement types can be obtained from the same cell, they are automatically linked. However, this setup highlights the problem of data sparsity of all available measurement types and the dependency of measurement types that needs to be accounted for. Approach 4: When multiple measurement types cannot be obtained from the same cell, a solution is to obtain them from cells of the same cell population. However, this combines the problems of Approach 1 with those of Approach 3. Approach 5: One possibility for easing data integration across measurement types from separate cells would be to have a stable reference (cell atlas) across multiple measurement types. Effectively, this combines the problems of Approaches 2, 3 and 4."}
+knitr::include_graphics("figures/data-integration-fig1.png")
+```
+
+This table provides some more details and examples:
+
+```{r, echo=FALSE, out.width = '90%', fig.cap="Reproduction of Table 2 from Laehnemann et al. (2019)."}
+knitr::include_graphics("figures/data-integration-tab1.png")
+```
+
+As we can see here, there are many different approaches to integrating data, and
+the approaches depend on the data types we have and what we want to achieve.
+Some types of data integration are already eminently feasible; others require
+considerably more method and software development before they are achievable.
+Ultimately, it all comes back to our __biological questions__. The questions we
+want to answer will drive the data we generate and the approaches we might
+sensibly take to integrate that data.
+
+_A final thought:_ in some (many?) cases, what we might call __data synthesis__ might be
+preferable to __data integration__. That is, we might not need or want to
+combine disparate data sets and data types into one holistic (and likely very
+challenging) analysis. Rather, we might instead analyse different data
+sets/types separately and __synthesise__ what we learn from each of them to
+answer biological questions of interest. That is kind of how science is
+supposed to work!
+
+
+## Further reading
+
+We recommend that you read section 6.1 of the "Grand Challenges" paper for more
+detailed discussion of the current status of data integration for single-cell
+'omics data and open problems that remain:
+
+- Laehnemann, D. et al. (2019) 12 Grand Challenges in Single-Cell Data Science. _PeerJ Preprints_. [link](https://peerj.com/preprints/27885/)
+
+
+
+
+
+
+
+
+
 
diff --git a/course_files/de-real.Rmd b/course_files/de-real.Rmd
index ff961c8..3a07166 100644
--- a/course_files/de-real.Rmd
+++ b/course_files/de-real.Rmd
@@ -204,7 +204,7 @@ experessed genes. Here, we will keep genes with non-zero expression in at least
 
 The first steps for this analysis then involve 
 
-```{r edger-plot, fig.cap="Biological coefficient of variation plot for edgeR.", message=FALSE}
+```{r edger-bcvplot, fig.cap="Biological coefficient of variation plot for edgeR.", message=FALSE}
 keep_gene <- (rowSums(counts_mat > 0) > 29.5 & rowMeans(counts_mat) > 0.2)
 table(keep_gene)
 dge <- DGEList(
@@ -223,7 +223,7 @@ plotBCV(dge)
 Next we fit a negative binomial quasi-likelihood model for differential
 expression.
 
-```{r edger-plot, fig.cap="ROC curve for edgeR.", message=FALSE}
+```{r edger-rocplot, fig.cap="ROC curve for edgeR.", message=FALSE}
 fit <- glmQLFit(dge, design)
 res <- glmQLFTest(fit)
 pVals <- res$table[,4]
-- 
GitLab