Commit 252f2e8b authored by Davis McCarthy
Merge branch 'master' of gitlab.svi.edu.au:biocellgen-public/mig_2019_scrnaseq-workshop

parents d9ad8cd3 4b7ce491
Pipeline #933 passed with stage
in 4 seconds
......@@ -282,6 +282,14 @@ p│   └── FACS_metadata.csv
With the files in these locations, everything is set up to run the code as
presented in the RMarkdown files in the workshop.
##### Case study dataset
The dataset used in the case study is from O’Koren et al. You can download
all the relevant files via [this link](https://www.svi.edu.au/MIG_2019_scRNAseq-workshop/case_study_data.tar.gz).
The archive includes the raw FASTQ files and the processed count matrix data, and is 2.1GB in size.
If you would like to start with the count matrix data, please follow the instructions in the [RMarkdown](https://gitlab.svi.edu.au/biocellgen-public/mig_2019_scrnaseq-workshop/blob/master/case_study_data/case_study.Rmd)
to download the processed count matrix data from GEO.
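The download and extraction step could be scripted from R roughly as follows. This is only a sketch: the helper name `fetch_case_study` and the destination folder `case_study_data` are hypothetical choices, not part of the workshop material, and the layout of the extracted archive is not assumed.

```r
# URL taken from the link above
case_study_url <- "https://www.svi.edu.au/MIG_2019_scRNAseq-workshop/case_study_data.tar.gz"

# Hypothetical helper: download the archive once, then unpack it.
fetch_case_study <- function(url = case_study_url, dest_dir = "case_study_data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  tarball <- file.path(dest_dir, basename(url))

  # The default download timeout (60 s) is too short for a ~2.1GB file,
  # so raise it temporarily and restore the old value on exit.
  old_opts <- options(timeout = max(3600, getOption("timeout")))
  on.exit(options(old_opts), add = TRUE)

  if (!file.exists(tarball)) {
    download.file(url, destfile = tarball, mode = "wb")
  }
  untar(tarball, exdir = dest_dir)
  invisible(tarball)
}

# Usage (not run here because of the download size):
# fetch_case_study()
```

Keeping the tarball next to the extracted files means that a second call skips the download and only repeats the extraction.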
### RStudio
......
course_files/figures/FA.png (image replaced; sizes shown: 457 KB and 299 KB)
......@@ -3,7 +3,7 @@ output: html_document
---
```{r setup, echo=FALSE}
knitr::opts_chunk$set(fig.align = "center")
knitr::opts_chunk$set(fig.align = "center", eval = TRUE)
knitr::opts_knit$set(root.dir = normalizePath(".."))
```
......@@ -433,22 +433,66 @@ ggplot(dt, aes(x=PHATE1, y=PHATE2, color=clust)) +
## Matrix factorization and factor analysis
Factor analysis is similar to PCA in that
both aim to obtain a new set of distinct summary variables
that are fewer in number than the original variables.
The key concept of factor analysis is that the original, observed variables are
__The key concept of factor analysis__: The original, observed variables are
correlated because they are all associated with some unobservable variables,
called latent factors.
the __latent factors__.
It looks similar to PCA, but instead of dimensionality reduction, factor analysis
focuses on studying the latent factors.
The variance of a variable can be split into two parts: \
The variance of an observed variable can be split into two parts: \
- Common variance: the part of variance that is explained by latent factors; \
- Unique variance: the part that is specific to only one variable, usually considered as an error component or residual.
- Unique variance: the part that is specific to only one variable, usually considered as an error component or __residual__.
The __factor loadings__ or weights indicate how much each latent factor is affecting the observed features.
<center> ![](figures/FA.png){width=80%} </center>
<center> ![](figures/FA.png){width=60%} </center>
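The split into common and unique variance can be checked numerically. The following is a minimal simulation sketch, not part of the workshop code: the loadings, unique variances and sample size are made-up values, and the latent factors are assumed to be independent with unit variance.

```r
set.seed(1)
n_obs     <- 5000  # hypothetical number of observations
n_factors <- 2     # hypothetical number of latent factors

# Hypothetical factor loadings: rows = observed variables, columns = factors
loadings <- matrix(c(0.9, 0.0,
                     0.7, 0.3,
                     0.0, 0.8,
                     0.2, 0.6),
                   ncol = n_factors, byrow = TRUE)
unique_var <- c(0.2, 0.4, 0.3, 0.5)  # hypothetical unique variance per variable

# Independent standard-normal latent factors (n_obs x n_factors)
factors <- matrix(rnorm(n_obs * n_factors), ncol = n_factors)
# Residual noise, one column per observed variable (n_obs x 4)
noise <- sapply(unique_var, function(v) rnorm(n_obs, sd = sqrt(v)))

# Observed data = latent factors weighted by the loadings, plus residuals
observed <- factors %*% t(loadings) + noise

# Common variance of each variable is the sum of its squared loadings
common_var <- rowSums(loadings^2)
decomposition <- data.frame(
  common   = common_var,
  unique   = unique_var,
  expected = common_var + unique_var,
  observed = apply(observed, 2, var)
)
print(round(decomposition, 2))
```

Under these assumptions the `observed` column should be close to `expected`, with the gap shrinking as `n_obs` grows.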
### [Slalom](https://bioconductor.org/packages/release/bioc/html/slalom.html): Interpretable latent spaces
Highlights of Slalom:
- It incorporates prior information to help the model estimation;
- It learns whatever is not provided by prior knowledge during model training;
- It enforces sparsity in the weight matrix.
#### Methodology
__Matrix expression of factor analysis:__
<center>![](figures/FA_matrix.png){width=80%} </center>
__How prior knowledge affects the model:__
<center>![](figures/slalom_anno.png) </center>
- $I_{g, k}$: (observed) Indicator of whether a gene $g$ is annotated to a given pathway or factor $k$;\
- $z_{g, k}$: (latent) Indicator of whether factor $k$ has a regulatory effect on gene $g$;\
- $w_{g, k}$: (estimated) weights.
__grey arrow__:
$$ P(I_{g, k}\vert z_{g, k}) = \begin{cases}
\text{Bernoulli}(p_1), \text{if } z_{g, k} = 1\\
\text{Bernoulli}(p_2), \text{if } z_{g, k} = 0\\
\end{cases}$$
__green arrow__:
$$ P(w_{g, k}\vert z_{g, k}) = \begin{cases}
N(w_{g, k} \mid 0, 1/\alpha), \text{ if } z_{g, k} = 1\\
\delta_0(w_{g, k}), \text{ if } z_{g, k} = 0\\
\end{cases}$$
<center>![](figures/slab_spike.png)</center>
We only look at the part of the __likelihood__ that involves these three variables:
$\prod_{g} \prod_{k}P(I_{g, k}, w_{g, k}, z_{g, k})$, \
where $P(I_{g, k}, w_{g, k}, z_{g, k}) = P(I_{g, k}, w_{g, k}| z_{g, k})P(z_{g,k})
= P( I_{g, k}| z_{g, k})P( w_{g, k}| z_{g, k})P(z_{g,k})$.
The last step uses the fact that $I_{g, k}$ and $w_{g, k}$ are conditionally independent given $z_{g, k}$.
Since we do not know anything about $z_{g,k}$ a priori, it is given a Bernoulli(1/2) prior.
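To make this generative structure concrete, here is a small simulation sketch. It is not Slalom's implementation: the values of `p1`, `p2` and `alpha` are arbitrary, the slab is assumed to be a zero-mean normal with variance $1/\alpha$, and all variable names are made up for illustration.

```r
set.seed(1)
n_pairs <- 10000  # hypothetical number of (gene, factor) pairs
p1      <- 0.9    # assumed P(I = 1 | z = 1)
p2      <- 0.1    # assumed P(I = 1 | z = 0)
alpha   <- 4      # assumed slab precision, so slab variance is 1 / alpha

# Latent indicator with the Bernoulli(1/2) prior
z <- rbinom(n_pairs, size = 1, prob = 0.5)

# Observed annotation indicator I, drawn conditionally on z
annotated <- rbinom(n_pairs, size = 1, prob = ifelse(z == 1, p1, p2))

# Spike-and-slab weights: normal "slab" if z = 1, point mass at 0 if z = 0
w <- ifelse(z == 1, rnorm(n_pairs, mean = 0, sd = sqrt(1 / alpha)), 0)

# Bayes' rule with the Bernoulli(1/2) prior: P(z = 1 | I = 1)
posterior_z1_given_I1 <- (p1 * 0.5) / (p1 * 0.5 + p2 * 0.5)

# Empirical counterpart from the simulated draws
empirical_z1_given_I1 <- mean(z[annotated == 1])

print(c(theoretical = posterior_z1_given_I1, empirical = empirical_z1_given_I1))
```

With informative annotations (`p1` much larger than `p2`), an annotated gene is very likely to have a non-zero weight, while the `z = 0` case forces the weight to exactly zero, which is what produces sparsity in the weight matrix.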
#### Example
First, get a geneset in a `GeneSetCollection` object.
```{r}
gmtfile <- system.file("extdata", "reactome_subset.gmt", package = "slalom")
......