Delete old `simulation.Rmd`

aeb023f8 · Jeffrey Pullin · 27e64466 · 27e64466
Commit aeb023f8 authored 4 years ago by Jeffrey Pullin
--- a/analysis/simulation.Rmd
+++ b/analysis/simulation.Rmd
---
-title: "Simulation"
-output: html_document
---
-
-```{r setup, include=FALSE}
-knitr::opts_chunk$set(echo = TRUE)
-```
-
-```{r libraries, message = FALSE, warning = FALSE}
-library(splatter)
-library(scran)
-library(scater)
-library(pheatmap)
-```
-
-## Aim 
-
-This document aims to outline the simulation framework used in this project
-
-## Analysis
-
-We will use {splatter} as the basic simulation framework for the project. 
-
-We will use `method = "groups"` as the basic setup for the simulations.
-
-The relevant parameters for the groups method are:
-
-* `nGroups`: number of groups (not set directly)
-* `group.prob`: probability of cell being in each group
-
-In {splatter} groups are created by simulating a random number of DE genes in each group. The relevant parameters for the DE method are:
-
-* `de.prob`: probability that a gene is DE in any group (default: 0.1)
-* `de.downProb`: probability that a DE gene is down-regulated (default: 0.5)
-* `de.facLoc`: location (meanlog) of the DE factor log-normal distribution
-
-```{r sim-splatter}
-n_groups <- 5
-
-# Can't set nCells directly but batchCells = nCells if 1 batch.
-params <- newSplatParams(group.prob = rep(1/n_groups, n_groups), 
-                         de.facLoc = 2, 
-                         batchCells = 1000)
-splatter_sim <- splatSimulate(params, method = "groups", verbose = FALSE)
-splatter_sim
-```
-
-After we do the simulation we need to do several things to the simulated SingleCellExperiment object:
-
-1. Extract the indices of the DE genes in each group. These are the genes with FacLoc not equal to 1. 
-
-```{r extract-de-genes}
-# From code/simulation.R
-# Should this just return the gene names?
-extract_de_inds <- function(sce) {
-  stopifnot(is(sce, "SingleCellExperiment"))
-
-  n_groups <- length(unique(colData(sce)$Group))
-  col_names <- paste0("DEFacGroup", 1:n_groups)
-  data <- rowData(sce)[, col_names]
-
-  out <- lapply(data, function(x) which(x != 1))
-
-  names(out) <- paste0("group_", 1:n_groups)
-  out
-}
-
-de_inds <- extract_de_inds(splatter_sim)
-str(de_inds)
-```
-
-Next, we need to process the object into the form that can be used by marker gene methods. Specifically we need to: 
-
-* Filter genes? (not at the moment)
-* Normalize the data (using just log-counts for now)
-* Make the colLabels the group id
-
-NB: quickClusters no longer warns when nCells = 1000
-
-```{r process-sim}
-quick_clusters <- quickCluster(splatter_sim)
-# Gives message: 
-# assuming UMI data when setting 'min.mean'
-splatter_sim <- computeSumFactors(splatter_sim, clusters = quick_clusters)
-splatter_sim <- logNormCounts(splatter_sim)
-
-colLabels(splatter_sim) <- colData(splatter_sim)$Group
-```
-
-We perform EDA to check that the simulation produces reasonable data
-
-NB: 1000 cells/10000 genes is computationally manageable
-
-```{r eda}
-dec_splatter_sim <- modelGeneVarByPoisson(splatter_sim)
-splatter_sim <- denoisePCA(splatter_sim, technical = dec_splatter_sim)
-
-plotPCA(splatter_sim, colour_by = "Group")
-
-splatter_sim <- runTSNE(splatter_sim, dimred = "PCA")
-plotTSNE(splatter_sim, colour_by = "Group")
-```
-
-PCA, tSNE show grouping but the tSNA appears weaker.
-
-When de.FacLoc is increased to 2 then the clusters are very well seperated. 
-
-With the transformed data we can run the marker gene selection methods. For now we run scran only. 
-
-```{r run-scran}
-scran_mgs <- findMarkers(splatter_sim, pval.type = "all")
-```
-
-With the output we can then calculate various summaries of quality of the calculated marker genes.
-
-Initially we will focus on the markers for group 1 only.
-
-One important question is how to choose the top marker genes from the {scran} output.
-
-```{r mg-metrics}
-scran_group_1_mgs <- scran_mgs[[1]]
-
-# This selects the the top 6 genes in each pairwise comparison.
-# scran_group_1_mgs <- scran_group_1_mgs[scran_group_1_mgs$Top <= 2, ]
-
-# Just select the top 30
-scran_group_1_mgs <- scran_group_1_mgs[1:30, ]
-
-scran_group_1_nums <- readr::parse_number(rownames(scran_group_1_mgs))
-
-length(scran_group_1_nums)
-length(intersect(scran_group_1_nums, de_inds$group_1))
-```
-
-Really poor performance even with the the simplest possible simulation... The clusters are clear and number of cells is large so this is unexpected...
-
-Even when the number of MGs selected is small many selected are not real marker genes 
-
-Let's try to understand why the performance is so bad...
-
-```{r mg-eda}
-logFCs <- getMarkerEffects(scran_group_1_mgs)
-pheatmap(logFCs, breaks = seq(-5, 5, length.out=101))
-```
-
-Need to test the different scran options
-
-`pval.type = "all"` gives much better performance in simple simulation 
-
-## Seurat 
-
-