Skip to content
Snippets Groups Projects
Commit aeb023f8 authored by Jeffrey Pullin's avatar Jeffrey Pullin
Browse files

Delete old `simulation.Rmd`

parent 27e64466
No related branches found
No related tags found
No related merge requests found
---
title: "Simulation"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r libraries, message = FALSE, warning = FALSE}
library(splatter)
library(scran)
library(scater)
library(pheatmap)
```
## Aim
This document aims to outline the simulation framework used in this project
## Analysis
We will use {splatter} as the basic simulation framework for the project.
We will use `method = "groups"` as the basic setup for the simulations.
The relevant parameters for the groups method are:
* `nGroups`: number of groups (not set directly)
* `group.prob`: probability of cell being in each group
In {splatter} groups are created by simulating a random number of DE genes in each group. The relevant parameters for the DE method are:
* `de.prob`: probability that a gene is DE in any group (default: 0.1)
* `de.downProb`: probability that a DE gene is down-regulated (default: 0.5)
* `de.facLoc`: location (meanlog) of the DE factor log-normal distribution
```{r sim-splatter}
n_groups <- 5
# Can't set nCells directly but batchCells = nCells if 1 batch.
params <- newSplatParams(group.prob = rep(1/n_groups, n_groups),
de.facLoc = 2,
batchCells = 1000)
splatter_sim <- splatSimulate(params, method = "groups", verbose = FALSE)
splatter_sim
```
After we do the simulation we need to do several things to the simulated SingleCellExperiment object:
1. Extract the indices of the DE genes in each group. These are the genes with FacLoc not equal to 1.
```{r extract-de-genes}
# From code/simulation.R
# Should this just return the gene names?
extract_de_inds <- function(sce) {
stopifnot(is(sce, "SingleCellExperiment"))
n_groups <- length(unique(colData(sce)$Group))
col_names <- paste0("DEFacGroup", 1:n_groups)
data <- rowData(sce)[, col_names]
out <- lapply(data, function(x) which(x != 1))
names(out) <- paste0("group_", 1:n_groups)
out
}
de_inds <- extract_de_inds(splatter_sim)
str(de_inds)
```
Next, we need to process the object into the form that can be used by marker gene methods. Specifically we need to:
* Filter genes? (not at the moment)
* Normalize the data (using just log-counts for now)
* Make the colLabels the group id
NB: quickClusters no longer warns when nCells = 1000
```{r process-sim}
quick_clusters <- quickCluster(splatter_sim)
# Gives message:
# assuming UMI data when setting 'min.mean'
splatter_sim <- computeSumFactors(splatter_sim, clusters = quick_clusters)
splatter_sim <- logNormCounts(splatter_sim)
colLabels(splatter_sim) <- colData(splatter_sim)$Group
```
We perform EDA to check that the simulation produces reasonable data
NB: 1000 cells/10000 genes is computationally manageable
```{r eda}
dec_splatter_sim <- modelGeneVarByPoisson(splatter_sim)
splatter_sim <- denoisePCA(splatter_sim, technical = dec_splatter_sim)
plotPCA(splatter_sim, colour_by = "Group")
splatter_sim <- runTSNE(splatter_sim, dimred = "PCA")
plotTSNE(splatter_sim, colour_by = "Group")
```
PCA, tSNE show grouping but the tSNA appears weaker.
When de.FacLoc is increased to 2 then the clusters are very well seperated.
With the transformed data we can run the marker gene selection methods. For now we run scran only.
```{r run-scran}
scran_mgs <- findMarkers(splatter_sim, pval.type = "all")
```
With the output we can then calculate various summaries of quality of the calculated marker genes.
Initially we will focus on the markers for group 1 only.
One important question is how to choose the top marker genes from the {scran} output.
```{r mg-metrics}
scran_group_1_mgs <- scran_mgs[[1]]
# This selects the the top 6 genes in each pairwise comparison.
# scran_group_1_mgs <- scran_group_1_mgs[scran_group_1_mgs$Top <= 2, ]
# Just select the top 30
scran_group_1_mgs <- scran_group_1_mgs[1:30, ]
scran_group_1_nums <- readr::parse_number(rownames(scran_group_1_mgs))
length(scran_group_1_nums)
length(intersect(scran_group_1_nums, de_inds$group_1))
```
Really poor performance even with the the simplest possible simulation... The clusters are clear and number of cells is large so this is unexpected...
Even when the number of MGs selected is small many selected are not real marker genes
Let's try to understand why the performance is so bad...
```{r mg-eda}
logFCs <- getMarkerEffects(scran_group_1_mgs)
pheatmap(logFCs, breaks = seq(-5, 5, length.out=101))
```
Need to test the different scran options
`pval.type = "all"` gives much better performance in simple simulation
## Seurat
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment