diff --git a/README.md b/README.md
index 785d36f1438ae7e13af34f5d40e726daf323a362..b0e8119323bdce07ed95d5be02d564cb5e5a320e 100644
--- a/README.md
+++ b/README.md
@@ -282,6 +282,14 @@ p│  └── FACS_metadata.csv
 With the files in these locations, everything is set up to run the code as presented in the RMarkdown files in the workshop.
 
+##### Case study dataset
+
+The dataset used in the case study is from O’Koren et al.; you can download
+all the relevant files via [this link](https://www.svi.edu.au/MIG_2019_scRNAseq-workshop/case_study_data.tar.gz).
+
+It includes the raw FASTQ files and the processed count matrix data, and is 2.1 GB in size.
+If you would like to start with the count matrix data, please follow the instructions in the [RMarkdown](https://gitlab.svi.edu.au/biocellgen-public/mig_2019_scrnaseq-workshop/blob/master/case_study_data/case_study.Rmd)
+to download the processed count matrix data from GEO.
 
 ### RStudio
diff --git a/course_files/figures/FA.png b/course_files/figures/FA.png
index ebbfebffbc374e73c134c2d70734b7a3ddb4970b..5e17a7274d3236635e3d34e780dce290b25c49a0 100644
Binary files a/course_files/figures/FA.png and b/course_files/figures/FA.png differ
diff --git a/course_files/figures/FA_matrix.png b/course_files/figures/FA_matrix.png
new file mode 100644
index 0000000000000000000000000000000000000000..c2d2b68fcf2024990556f5cf864772bbef309eea
Binary files /dev/null and b/course_files/figures/FA_matrix.png differ
diff --git a/course_files/figures/slab_spike.png b/course_files/figures/slab_spike.png
new file mode 100644
index 0000000000000000000000000000000000000000..d88d5916719cbcf546e3a1828f7f22513ede3d22
Binary files /dev/null and b/course_files/figures/slab_spike.png differ
diff --git a/course_files/figures/slalom_anno.png b/course_files/figures/slalom_anno.png
new file mode 100644
index 0000000000000000000000000000000000000000..77a37b361609bab3e0a76603f7c9f61611f4d34d
Binary files /dev/null and b/course_files/figures/slalom_anno.png differ
diff --git a/course_files/latent-spaces.Rmd b/course_files/latent-spaces.Rmd
index c3861628c837088b9ec0d0f46d887abd5fa0b7cd..ec109be889a87de271063bbc1d821e53f6304847 100644
--- a/course_files/latent-spaces.Rmd
+++ b/course_files/latent-spaces.Rmd
@@ -3,7 +3,7 @@ output: html_document
 ---
 
 ```{r setup, echo=FALSE}
-knitr::opts_chunk$set(fig.align = "center")
+knitr::opts_chunk$set(fig.align = "center", eval = TRUE)
 knitr::opts_knit$set(root.dir = normalizePath(".."))
 ```
 
@@ -433,22 +433,66 @@ ggplot(dt, aes(x=PHATE1, y=PHATE2, color=clust)) +
 
 ## Matrix factorization and factor analysis
 
-Factor Analysis is similar to PCA in that,
-they both aim to obtain a new set of distinct summary variables,
-which are fewer in number than the original number of variables.
-
-The key concept of factor analysis is that the original, observed variables are
+__The key concept of factor analysis__: The original, observed variables are
 correlated because they are all associated with some unobservable variables,
-called latent factors.
+the __latent factors__.
+
+It looks similar to PCA, but instead of dimensionality reduction, factor analysis
+focuses on studying the latent factors.
 
-The variance of a variable can be splitted into two parts: \
+The variance of an observed variable can be split into two parts: \
 - Common variance: the part of variance that is explained by latent factors; \
-- Unique variance: the part that is specific to only one variable, usually considered as an error component or residual.
+- Unique variance: the part that is specific to only one variable, usually considered as an error component or __residual__.
+
+The __factor loadings__ or weights indicate how much each latent factor affects the observed features.
 
-<center> {width=80%} </center>
+<center> {width=60%} </center>
 
 ### [Slalom](https://bioconductor.org/packages/release/bioc/html/slalom.html): Interpretable latent spaces
 
+Highlights of Slalom:
+
+- It incorporates prior information to help the model estimation;
+
+- It learns whatever is not provided by prior knowledge during model training;
+
+- It enforces sparsity in the weight matrix.
+
+#### Methodology
+
+__Matrix expression of factor analysis:__
+
+<center>{width=80%} </center>
+
+__How prior knowledge affects the model:__
+
+<center> </center>
+
+- $I_{g, k}$: (observed) Indicator of whether a gene $g$ is annotated to a given pathway or factor $k$;\
+- $z_{g, k}$: (latent) Indicator of whether factor $k$ has a regulatory effect on gene $g$;\
+- $w_{g, k}$: (estimated) Weight of factor $k$ on gene $g$.
+
+__grey arrow__:
+$$ P(I_{g, k}\vert z_{g, k}) = \begin{cases}
+\text{Bernoulli}(p_1), & \text{if } z_{g, k} = 1\\
+\text{Bernoulli}(p_2), & \text{if } z_{g, k} = 0\\
+\end{cases}$$
+
+__green arrow__:
+$$ P(w_{g, k}\vert z_{g, k}) = \begin{cases}
+N(w_{g, k} \vert 0, 1/\alpha), & \text{if } z_{g, k} = 1\\
+\delta_0(w_{g, k}), & \text{if } z_{g, k} = 0\\
+\end{cases}$$
+
+<center></center>
+
+We only look at the part of the __likelihood__ that is relevant to this part of the model:
+$\prod_{g} \prod_{k}P(I_{g, k}, w_{g, k}, z_{g, k})$, \
+where $P(I_{g, k}, w_{g, k}, z_{g, k}) = P(I_{g, k}, w_{g, k}\vert z_{g, k})P(z_{g,k})
+= P( I_{g, k}\vert z_{g, k})P( w_{g, k}\vert z_{g, k})P(z_{g,k})$ (the second equality uses the conditional independence of $I_{g, k}$ and $w_{g, k}$ given $z_{g, k}$).
+Since we have no prior information about $z_{g,k}$, it is assumed to follow a Bernoulli(1/2) distribution.
+
+#### Example
 First, get a geneset in a `GeneSetCollection` object.
 ```{r}
 gmtfile <- system.file("extdata", "reactome_subset.gmt", package = "slalom")