Commit 4b7ce491 authored by Puxue Qiao

slalom methodology

parent 2bf1d423
Pipeline #932 passed with stage in 5 seconds
course_files/figures/FA.png (binary image changed; 457 KB before, 299 KB after)
@@ -3,7 +3,7 @@ output: html_document
---
```{r setup, echo=FALSE}
knitr::opts_chunk$set(fig.align = "center", eval = TRUE)
knitr::opts_knit$set(root.dir = normalizePath(".."))
```
@@ -433,22 +433,66 @@ ggplot(dt, aes(x=PHATE1, y=PHATE2, color=clust)) +
## Matrix factorization and factor analysis
Factor analysis is similar to PCA in that
both aim to obtain a new set of distinct summary variables
that are fewer in number than the original variables.
__The key concept of factor analysis__: The original, observed variables are
correlated because they are all associated with some unobservable variables,
the __latent factors__.
It looks similar to PCA, but instead of reducing dimensionality, factor analysis
focuses on studying the latent factors.
The variance of an observed variable can be split into two parts: \
- Common variance: the part of the variance that is explained by the latent factors; \
- Unique variance: the part that is specific to only one variable, usually considered as an error component or __residual__.
The __factor loadings__ or weights indicate how strongly each latent factor affects each observed feature.
<center> ![](figures/FA.png){width=60%} </center>
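To make the split between common and unique variance concrete, here is a minimal sketch using `factanal()` from base R's `stats` package. The simulated data, the two hypothetical factors `f1` and `f2`, and the chosen loadings are assumptions made for illustration only; they are not part of the course data.

```{r}
set.seed(1)
n <- 500

# Two hypothetical latent factors (illustrative only)
f1 <- rnorm(n)
f2 <- rnorm(n)

# Six observed variables: x1-x3 depend on f1, x4-x6 depend on f2,
# and each has its own independent noise (the unique part)
obs <- cbind(
  x1 = 0.9 * f1 + rnorm(n, sd = 0.4),
  x2 = 0.8 * f1 + rnorm(n, sd = 0.5),
  x3 = 0.7 * f1 + rnorm(n, sd = 0.6),
  x4 = 0.9 * f2 + rnorm(n, sd = 0.4),
  x5 = 0.8 * f2 + rnorm(n, sd = 0.5),
  x6 = 0.7 * f2 + rnorm(n, sd = 0.6)
)

# Maximum-likelihood factor analysis with two factors
fit <- factanal(obs, factors = 2)

# Common variance (communality): sum of squared loadings per variable
communality <- rowSums(unclass(fit$loadings)^2)
# Unique variance as estimated by factanal()
uniqueness <- fit$uniquenesses

round(cbind(communality, uniqueness, total = communality + uniqueness), 3)
```

Because `factanal()` works on the correlation matrix, the two parts should add up to approximately 1 for every variable, and the loadings should show which variables are driven by which factor.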
### [Slalom](https://bioconductor.org/packages/release/bioc/html/slalom.html): Interpretable latent spaces
Highlights of Slalom:
- It incorporates prior information to help model estimation;
- It learns whatever is not provided by prior knowledge during model training;
- It enforces sparsity in the weight matrix.
#### Methodology
__Matrix expression of factor analysis:__
<center>![](figures/FA_matrix.png){width=80%} </center>
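For reference, a generic factor analysis model can be written as follows. The symbols $Y$, $X$, $E$, $N$, $G$ and $K$ are notation assumed here for illustration and may differ from the labels in the figure; only the weight $w_{g, k}$ of factor $k$ on gene $g$ matches the notation of this section.

$$ Y = X W^\top + E, \qquad
Y \in \mathbb{R}^{N \times G},\;
X \in \mathbb{R}^{N \times K},\;
W \in \mathbb{R}^{G \times K} $$

Element-wise, the expression of gene $g$ in cell $n$ is $y_{n, g} = \sum_{k} x_{n, k}\, w_{g, k} + \epsilon_{n, g}$, where $x_{n, k}$ is the state of factor $k$ in cell $n$ and $\epsilon_{n, g}$ is the residual.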
__How prior knowledge affects the model:__
<center>![](figures/slalom_anno.png) </center>
- $I_{g, k}$: (observed) Indicator of whether a gene $g$ is annotated to a given pathway or factor $k$;\
- $z_{g, k}$: (latent) Indicator of whether factor $k$ has a regulatory effect on gene $g$;\
- $w_{g, k}$: (estimated) weights.
__grey arrow__:
$$ P(I_{g, k}\vert z_{g, k}) = \begin{cases}
\text{Bernoulli}(p_1), \text{if } z_{g, k} = 1\\
\text{Bernoulli}(p_2), \text{if } z_{g, k} = 0\\
\end{cases}$$
__green arrow__:
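$$ P(w_{g, k}\vert z_{g, k}) = \begin{cases}
N(w_{g, k} \vert 0, 1/\alpha), \text{ if } z_{g, k} = 1\\
\delta_0(w_{g, k}), \text{ if } z_{g, k} = 0\\
\end{cases}$$
That is, the weight follows a zero-mean Gaussian "slab" with precision $\alpha$ when the factor is active, and a point mass ("spike") at zero otherwise.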
<center>![](figures/slab_spike.png)</center>
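The two conditional distributions can be simulated for a single factor to see what they imply. This is only a sketch: it assumes the slab is a zero-mean Gaussian with variance $1/\alpha$, and the values of `alpha`, `p1` and `p2` are hypothetical choices, not Slalom defaults.

```{r}
set.seed(42)
n_genes <- 1000
alpha <- 4     # assumed slab precision, so sd = 1 / sqrt(alpha)
p1 <- 0.99     # assumed P(I = 1 | z = 1)
p2 <- 0.01     # assumed P(I = 1 | z = 0)

# z ~ Bernoulli(1/2): does the factor regulate the gene?
z <- rbinom(n_genes, size = 1, prob = 0.5)

# w | z: Gaussian slab if z = 1, point mass at zero (spike) if z = 0
w <- ifelse(z == 1, rnorm(n_genes, mean = 0, sd = 1 / sqrt(alpha)), 0)

# I | z: observed annotation indicator
I <- rbinom(n_genes, size = 1, prob = ifelse(z == 1, p1, p2))

# Cross-tabulate annotation against the latent indicator
table(annotated = I, active = z)
# Fraction of weights that are exactly zero, i.e. the sparsity
mean(w == 0)
```

With `p1` close to 1 and `p2` close to 0, the annotation mostly agrees with the latent indicator, and roughly half of the weights are exactly zero, which is how the prior enforces sparsity in the weight matrix.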
We only look at the part of the __likelihood__ that involves these variables:
$\prod_{g} \prod_{k}P(I_{g, k}, w_{g, k}, z_{g, k})$, \
where $P(I_{g, k}, w_{g, k}, z_{g, k}) = P(I_{g, k}, w_{g, k}\vert z_{g, k})P(z_{g,k})
= P( I_{g, k}\vert z_{g, k})P( w_{g, k}\vert z_{g, k})P(z_{g,k})$.
The last step holds because $I_{g, k}$ and $w_{g, k}$ are conditionally independent given $z_{g, k}$.
Since nothing is known about $z_{g,k}$ a priori, it is assigned a Bernoulli(1/2) prior.
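As a worked consequence of this factorization, and ignoring the weight term for the moment (a simplification assumed here for illustration), Bayes' rule with the Bernoulli(1/2) prior gives the probability that factor $k$ regulates a gene that is annotated to it:

$$ P(z_{g, k} = 1 \vert I_{g, k} = 1)
= \frac{p_1 \cdot \tfrac{1}{2}}{p_1 \cdot \tfrac{1}{2} + p_2 \cdot \tfrac{1}{2}}
= \frac{p_1}{p_1 + p_2} $$

For example, with the hypothetical values $p_1 = 0.99$ and $p_2 = 0.01$, an annotated gene has probability 0.99 of being regulated by the factor before the expression data are taken into account.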
#### Example
First, load the gene sets into a `GeneSetCollection` object.
```{r}
gmtfile <- system.file("extdata", "reactome_subset.gmt", package = "slalom")
......