Last updated: 2021-02-10

Checks: 7 passed, 0 failed

Knit directory: mage_2020_marker-gene-benchmarking/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20190102) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 1ad9d6d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .snakemake/
    Ignored:    config/
    Ignored:    data/sim_data/
    Ignored:    logs/
    Ignored:    results/

Unstaged changes:
    Deleted:    analysis/method-concordance.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/simulation.Rmd) and HTML (public/simulation.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author          Date        Message
Rmd   e1ef749  Jeffrey Pullin  2020-12-08  Update 8/12/2020

library(splatter)
library(scran)
library(scater)
library(pheatmap)

Aim

This document outlines the simulation framework used in this project.

Analysis

We will use {splatter} as the basic simulation framework for the project.

We will use method = "groups" as the basic setup for the simulations.

The relevant parameters for the groups method are:

  • nGroups: number of groups (not set directly; determined by the length of group.prob)
  • group.prob: probability of cell being in each group

In {splatter} groups are created by simulating a random number of DE genes in each group. The relevant parameters for the DE method are:

  • de.prob: probability that a gene is DE in any group (default: 0.1)
  • de.downProb: probability that a DE gene is down-regulated (default: 0.5)
  • de.facLoc: location (meanlog) of the DE factor log-normal distribution
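
To make the roles of these parameters concrete, the following is a minimal base-R sketch of how per-gene DE factors of this kind could be drawn. It is an illustration written for this page, not {splatter}'s implementation. The function name sketch_de_factors, the fac_scale argument (the sdlog of the log-normal, which is not listed above) and the default values for fac_loc and fac_scale are assumptions, and down-regulation is modelled here simply by inverting a factor so that it falls below 1.

```r
# Illustrative sketch only (not splatter's code): draw one DE factor per gene.
# Non-DE genes get a factor of exactly 1, so DE genes can later be identified
# as those whose factor differs from 1.
sketch_de_factors <- function(n_genes, de_prob = 0.1, de_down_prob = 0.5,
                              fac_loc = 0.1, fac_scale = 0.4) {
  # Each gene is DE with probability de_prob
  is_de <- rbinom(n_genes, size = 1, prob = de_prob) == 1
  n_de <- sum(is_de)

  # Log-normal magnitudes, folded so that every magnitude is >= 1
  magnitude <- rlnorm(n_de, meanlog = fac_loc, sdlog = fac_scale)
  magnitude <- pmax(magnitude, 1 / magnitude)

  # A fraction de_down_prob of the DE genes is down-regulated (factor < 1)
  is_down <- rbinom(n_de, size = 1, prob = de_down_prob) == 1

  factors <- rep(1, n_genes)
  factors[is_de] <- ifelse(is_down, 1 / magnitude, magnitude)
  factors
}

set.seed(1)
facs <- sketch_de_factors(10000, fac_loc = 2)
mean(facs != 1)            # proportion of DE genes, close to de_prob
mean(facs[facs != 1] < 1)  # proportion of DE genes that are down-regulated
```

In this sketch a larger fac_loc shifts the factors further from 1, which is the intuition behind raising de.facLoc to obtain more clearly separated groups.
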
n_groups <- 5

# Can't set nCells directly but batchCells = nCells if 1 batch.
params <- newSplatParams(group.prob = rep(1/n_groups, n_groups), 
                         de.facLoc = 2, 
                         batchCells = 1000)
splatter_sim <- splatSimulate(params, method = "groups", verbose = FALSE)
splatter_sim
class: SingleCellExperiment 
dim: 10000 1000 
metadata(1): Params
assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
rowData names(9): Gene BaseGeneMean ... DEFacGroup4 DEFacGroup5
colnames(1000): Cell1 Cell2 ... Cell999 Cell1000
colData names(4): Cell Batch Group ExpLibSize
reducedDimNames(0):
altExpNames(0):

After we do the simulation we need to do several things to the simulated SingleCellExperiment object:

  1. Extract the indices of the DE genes in each group. These are the genes whose DE factor (the DEFacGroup columns of rowData) is not equal to 1.
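# From code/simulation.R
# Should this just return the gene names?
# Returns a named list with, for each group, the row indices of its DE genes.
extract_de_inds <- function(sce) {
  stopifnot(is(sce, "SingleCellExperiment"))

  # One DEFacGroup<k> column per simulated group
  n_groups <- length(unique(colData(sce)$Group))
  col_names <- paste0("DEFacGroup", seq_len(n_groups))
  data <- rowData(sce)[, col_names, drop = FALSE]

  # A DE factor of exactly 1 means the gene is not DE in that group
  out <- lapply(data, function(x) which(x != 1))

  names(out) <- paste0("group_", seq_len(n_groups))
  out
}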

de_inds <- extract_de_inds(splatter_sim)
str(de_inds)
List of 5
 $ group_1: int [1:911] 7 9 19 31 36 42 56 57 65 101 ...
 $ group_2: int [1:1018] 22 26 29 33 34 36 43 48 50 54 ...
 $ group_3: int [1:1000] 5 16 32 41 43 61 65 71 72 85 ...
 $ group_4: int [1:1023] 8 10 24 37 43 48 57 66 90 95 ...
 $ group_5: int [1:981] 3 9 18 25 26 37 50 71 72 122 ...
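
The extraction logic can be checked on a small hand-made example without running a simulation. The sketch below uses a plain data.frame in place of rowData(sce). The toy values and the helper name de_genes_from_factors are made up for illustration. It returns gene names rather than indices, which is one possible answer to the question raised in the code comment above.

```r
# Toy stand-in for rowData(sce): 6 genes, 2 groups. A factor of 1 means the
# gene is not DE in that group.
toy_row_data <- data.frame(
  Gene = paste0("Gene", 1:6),
  DEFacGroup1 = c(1, 2.5, 1, 0.4, 1, 1),
  DEFacGroup2 = c(1, 1, 3.1, 1, 1, 0.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the DE genes for each DEFacGroup column.
de_genes_from_factors <- function(row_data) {
  fac_cols <- grep("^DEFacGroup", colnames(row_data), value = TRUE)
  out <- lapply(fac_cols, function(col) row_data$Gene[row_data[[col]] != 1])
  names(out) <- sub("^DEFacGroup", "group_", fac_cols)
  out
}

toy_de <- de_genes_from_factors(toy_row_data)
toy_de$group_1  # "Gene2" "Gene4"
toy_de$group_2  # "Gene3" "Gene6"
```
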

Next, we need to process the object into the form that can be used by marker gene methods. Specifically we need to:

  • Filter genes? (not at the moment)
  • Normalize the data (using just log-counts for now)
  • Make the colLabels the group id

NB: quickCluster() no longer warns when nCells = 1000.

quick_clusters <- quickCluster(splatter_sim)
# Gives message: 
# assuming UMI data when setting 'min.mean'
splatter_sim <- computeSumFactors(splatter_sim, clusters = quick_clusters)
Warning in .guess_min_mean(x, min.mean = min.mean, BPPARAM = BPPARAM): assuming
UMI data when setting 'min.mean'
splatter_sim <- logNormCounts(splatter_sim)

colLabels(splatter_sim) <- colData(splatter_sim)$Group
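
For reference, the log-count transformation used here amounts to dividing each cell's counts by its size factor, adding a pseudo-count of 1 and taking log2. The sketch below reproduces that arithmetic in base R on a toy matrix. It uses simple library-size factors scaled to have mean 1 rather than the deconvolution factors from computeSumFactors(), and all object names and values are illustrative.

```r
# Toy count matrix: 3 genes (rows) x 4 cells (columns), filled column by column
counts <- matrix(c(0, 2, 4,
                   1, 1, 2,
                   3, 0, 9,
                   0, 0, 4),
                 nrow = 3,
                 dimnames = list(paste0("Gene", 1:3), paste0("Cell", 1:4)))

# Library-size factors, scaled so that their mean is 1
lib_sizes <- colSums(counts)
size_factors <- lib_sizes / mean(lib_sizes)

# Divide each column by its size factor, then log2-transform with pseudo-count 1
norm_counts <- sweep(counts, 2, size_factors, FUN = "/")
logcounts <- log2(norm_counts + 1)
round(logcounts, 2)
```
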

We perform EDA to check that the simulation produces reasonable data.

NB: 1000 cells/10000 genes is computationally manageable

dec_splatter_sim <- modelGeneVarByPoisson(splatter_sim)
splatter_sim <- denoisePCA(splatter_sim, technical = dec_splatter_sim)

plotPCA(splatter_sim, colour_by = "Group")

splatter_sim <- runTSNE(splatter_sim, dimred = "PCA")
plotTSNE(splatter_sim, colour_by = "Group")

Both PCA and t-SNE show the grouping, but it appears weaker in the t-SNE plot.

When de.facLoc is increased to 2, the clusters are very well separated.

With the transformed data we can run the marker gene selection methods. For now we run scran only.

scran_mgs <- findMarkers(splatter_sim, pval.type = "all")

With the output we can then calculate various summaries of quality of the calculated marker genes.

Initially we will focus on the markers for group 1 only.

One important question is how to choose the top marker genes from the {scran} output.
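
Two simple options are to take a fixed number of top-ranked genes or to keep every gene below a significance threshold. The sketch below contrasts them on a made-up results table with an FDR column sorted from most to least significant. The values, the cut-off of 0.05 and the object names are assumptions for illustration, not output from this analysis.

```r
# Made-up marker table, already sorted from most to least significant
toy_markers <- data.frame(
  FDR = c(1e-20, 1e-12, 1e-6, 0.003, 0.04, 0.2, 0.7),
  row.names = paste0("Gene", c(12, 7, 301, 45, 9, 88, 150))
)

# Option 1: a fixed number of top-ranked genes
top_n_genes <- rownames(head(toy_markers, n = 3))

# Option 2: all genes passing an FDR threshold
sig_genes <- rownames(toy_markers)[toy_markers$FDR < 0.05]

top_n_genes  # "Gene12" "Gene7" "Gene301"
sig_genes    # "Gene12" "Gene7" "Gene301" "Gene45" "Gene9"
```

A fixed number makes methods directly comparable on equally sized gene sets, whereas a threshold lets the number of selected genes vary with the strength of the signal.
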

scran_group_1_mgs <- scran_mgs[[1]]

# This would select the top 2 genes from each pairwise comparison.
# NB: the Top column is only reported when pval.type = "any".
# scran_group_1_mgs <- scran_group_1_mgs[scran_group_1_mgs$Top <= 2, ]

# Just select the top 30
scran_group_1_mgs <- scran_group_1_mgs[1:30, ]

scran_group_1_nums <- readr::parse_number(rownames(scran_group_1_mgs))

length(scran_group_1_nums)
[1] 30
length(intersect(scran_group_1_nums, de_inds$group_1))
[1] 30
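
The overlap computed above can be turned into simple summaries such as precision (the fraction of selected genes that are truly DE) and recall (the fraction of truly DE genes that were selected). The helper below is a minimal sketch. The name marker_summary and the toy inputs are hypothetical, and gene numbers are parsed with base R instead of readr::parse_number().

```r
# Hypothetical helper: compare selected marker genes with the true DE indices.
marker_summary <- function(selected_names, true_inds) {
  # "Gene123" -> 123
  selected_inds <- as.integer(sub("^Gene", "", selected_names))
  n_true_pos <- length(intersect(selected_inds, true_inds))
  c(
    n_selected = length(selected_inds),
    n_true_pos = n_true_pos,
    precision = n_true_pos / length(selected_inds),
    recall = n_true_pos / length(true_inds)
  )
}

# Toy example: 4 selected genes, 3 of which are among 6 true DE genes
toy_summary <- marker_summary(c("Gene7", "Gene9", "Gene20", "Gene31"),
                              true_inds = c(7, 9, 19, 31, 36, 42))
toy_summary
```

For the results on this page, 30 of the 30 selected genes are in de_inds$group_1, which corresponds to a precision of 1. Recall is necessarily low because only 30 of the 911 DE genes for group 1 were selected.
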

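NB: in the run shown above (pval.type = "all"), all 30 selected genes are true DE genes for group 1. The notes below appear to date from an earlier run with different {scran} settings and are kept for reference.

Really poor performance even with the simplest possible simulation… The clusters are clear and the number of cells is large, so this is unexpected…

Even when the number of MGs selected is small, many of those selected are not real marker genes.

Let’s try to understand why the performance is so bad…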

logFCs <- getMarkerEffects(scran_group_1_mgs)
pheatmap(logFCs, breaks = seq(-5, 5, length.out=101))

Need to test the different scran options

pval.type = "all" gives much better performance in the simple simulation.

Seurat


devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.0.3 (2020-10-10)
 os       Red Hat Enterprise Linux    
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_AU.UTF-8                 
 ctype    en_AU.UTF-8                 
 tz       Australia/Melbourne         
 date     2021-02-10                  

─ Packages ───────────────────────────────────────────────────────────────────
 package              * version  date       lib source        
 assertthat             0.2.1    2019-03-21 [1] CRAN (R 4.0.3)
 backports              1.2.0    2020-11-02 [1] CRAN (R 4.0.3)
 beachmat               2.6.2    2020-11-24 [1] Bioconductor  
 beeswarm               0.2.3    2016-04-25 [1] CRAN (R 4.0.3)
 Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
 BiocGenerics         * 0.36.0   2020-10-27 [1] Bioconductor  
 BiocNeighbors          1.8.1    2020-11-11 [1] Bioconductor  
 BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
 BiocSingular           1.6.0    2020-10-27 [1] Bioconductor  
 bitops                 1.0-6    2013-08-17 [1] CRAN (R 4.0.3)
 bluster                1.0.0    2020-10-27 [1] Bioconductor  
 callr                  3.5.1    2020-10-13 [1] CRAN (R 4.0.3)
 checkmate              2.0.0    2020-02-06 [1] CRAN (R 4.0.3)
 cli                    2.2.0    2020-11-20 [1] CRAN (R 4.0.3)
 colorspace             2.0-0    2020-11-11 [1] CRAN (R 4.0.3)
 cowplot                1.1.0    2020-09-08 [1] CRAN (R 4.0.3)
 crayon                 1.3.4    2017-09-16 [1] CRAN (R 4.0.3)
 DelayedArray           0.16.0   2020-10-27 [1] Bioconductor  
 DelayedMatrixStats     1.12.1   2020-11-24 [1] Bioconductor  
 desc                   1.2.0    2018-05-01 [1] CRAN (R 4.0.3)
 devtools               2.3.2    2020-09-18 [1] CRAN (R 4.0.3)
 digest                 0.6.27   2020-10-24 [1] CRAN (R 4.0.3)
 dplyr                  1.0.2    2020-08-18 [1] CRAN (R 4.0.3)
 dqrng                  0.2.1    2019-05-17 [1] CRAN (R 4.0.3)
 edgeR                  3.32.0   2020-10-27 [1] Bioconductor  
 ellipsis               0.3.1    2020-05-15 [1] CRAN (R 4.0.3)
 evaluate               0.14     2019-05-28 [1] CRAN (R 4.0.3)
 fansi                  0.4.1    2020-01-08 [1] CRAN (R 4.0.3)
 farver                 2.0.3    2020-01-16 [1] CRAN (R 4.0.3)
 fs                     1.5.0    2020-07-31 [1] CRAN (R 4.0.3)
 generics               0.1.0    2020-10-31 [1] CRAN (R 4.0.3)
 GenomeInfoDb         * 1.26.1   2020-11-20 [1] Bioconductor  
 GenomeInfoDbData       1.2.4    2020-12-07 [1] Bioconductor  
 GenomicRanges        * 1.42.0   2020-10-27 [1] Bioconductor  
 ggbeeswarm             0.6.0    2017-08-07 [1] CRAN (R 4.0.3)
 ggplot2              * 3.3.2    2020-06-19 [1] CRAN (R 4.0.3)
 git2r                  0.28.0   2021-01-10 [1] CRAN (R 4.0.3)
 glue                   1.4.2    2020-08-27 [1] CRAN (R 4.0.3)
 gridExtra              2.3      2017-09-09 [1] CRAN (R 4.0.3)
 gtable                 0.3.0    2019-03-25 [1] CRAN (R 4.0.3)
 hms                    0.5.3    2020-01-08 [1] CRAN (R 4.0.3)
 htmltools              0.5.0    2020-06-16 [1] CRAN (R 4.0.3)
 httpuv                 1.5.4    2020-06-06 [1] CRAN (R 4.0.3)
 igraph                 1.2.6    2020-10-06 [1] CRAN (R 4.0.3)
 IRanges              * 2.24.0   2020-10-27 [1] Bioconductor  
 irlba                  2.3.3    2019-02-05 [1] CRAN (R 4.0.3)
 knitr                  1.30     2020-09-22 [1] CRAN (R 4.0.3)
 labeling               0.4.2    2020-10-20 [1] CRAN (R 4.0.3)
 later                  1.1.0.1  2020-06-05 [1] CRAN (R 4.0.3)
 lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.3)
 lifecycle              0.2.0    2020-03-06 [1] CRAN (R 4.0.3)
 limma                  3.46.0   2020-10-27 [1] Bioconductor  
 locfit                 1.5-9.4  2020-03-25 [1] CRAN (R 4.0.3)
 magrittr               2.0.1    2020-11-17 [1] CRAN (R 4.0.3)
 Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.3)
 MatrixGenerics       * 1.2.0    2020-10-27 [1] Bioconductor  
 matrixStats          * 0.57.0   2020-09-25 [1] CRAN (R 4.0.3)
 memoise                1.1.0    2017-04-21 [1] CRAN (R 4.0.3)
 munsell                0.5.0    2018-06-12 [1] CRAN (R 4.0.3)
 pheatmap             * 1.0.12   2019-01-04 [1] CRAN (R 4.0.3)
 pillar                 1.4.7    2020-11-20 [1] CRAN (R 4.0.3)
 pkgbuild               1.1.0    2020-07-13 [1] CRAN (R 4.0.3)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.0.3)
 pkgload                1.1.0    2020-05-29 [1] CRAN (R 4.0.3)
 prettyunits            1.1.1    2020-01-24 [1] CRAN (R 4.0.3)
 processx               3.4.5    2020-11-30 [1] CRAN (R 4.0.3)
 promises               1.1.1    2020-06-09 [1] CRAN (R 4.0.3)
 ps                     1.5.0    2020-12-05 [1] CRAN (R 4.0.3)
 purrr                  0.3.4    2020-04-17 [1] CRAN (R 4.0.3)
 R6                     2.5.0    2020-10-28 [1] CRAN (R 4.0.3)
 RColorBrewer           1.1-2    2014-12-07 [1] CRAN (R 4.0.3)
 Rcpp                   1.0.5    2020-07-06 [1] CRAN (R 4.0.3)
 RCurl                  1.98-1.2 2020-04-18 [1] CRAN (R 4.0.3)
 readr                  1.4.0    2020-10-05 [1] CRAN (R 4.0.3)
 remotes                2.2.0    2020-07-21 [1] CRAN (R 4.0.3)
 rlang                  0.4.9    2020-11-26 [1] CRAN (R 4.0.3)
 rmarkdown              2.5      2020-10-21 [1] CRAN (R 4.0.3)
 rprojroot              2.0.2    2020-11-15 [1] CRAN (R 4.0.3)
 rstudioapi             0.13     2020-11-12 [1] CRAN (R 4.0.3)
 rsvd                   1.0.3    2020-02-17 [1] CRAN (R 4.0.3)
 Rtsne                  0.15     2018-11-10 [1] CRAN (R 4.0.3)
 S4Vectors            * 0.28.0   2020-10-27 [1] Bioconductor  
 scales                 1.1.1    2020-05-11 [1] CRAN (R 4.0.3)
 scater               * 1.18.3   2020-11-08 [1] Bioconductor  
 scran                * 1.18.1   2020-11-05 [1] Bioconductor  
 scuttle                1.0.3    2020-11-23 [1] Bioconductor  
 sessioninfo            1.1.1    2018-11-05 [1] CRAN (R 4.0.3)
 SingleCellExperiment * 1.12.0   2020-10-27 [1] Bioconductor  
 sparseMatrixStats      1.2.0    2020-10-27 [1] Bioconductor  
 splatter             * 1.14.1   2020-12-01 [1] Bioconductor  
 statmod                1.4.35   2020-10-19 [1] CRAN (R 4.0.3)
 stringi                1.5.3    2020-09-09 [1] CRAN (R 4.0.3)
 stringr                1.4.0    2019-02-10 [1] CRAN (R 4.0.3)
 SummarizedExperiment * 1.20.0   2020-10-27 [1] Bioconductor  
 testthat               3.0.0    2020-10-31 [1] CRAN (R 4.0.3)
 tibble                 3.0.4    2020-10-12 [1] CRAN (R 4.0.3)
 tidyselect             1.1.0    2020-05-11 [1] CRAN (R 4.0.3)
 usethis                2.0.0    2020-12-10 [1] CRAN (R 4.0.3)
 vctrs                  0.3.5    2020-11-17 [1] CRAN (R 4.0.3)
 vipor                  0.4.5    2017-03-22 [1] CRAN (R 4.0.3)
 viridis                0.5.1    2018-03-29 [1] CRAN (R 4.0.3)
 viridisLite            0.3.0    2018-02-01 [1] CRAN (R 4.0.3)
 whisker                0.4      2019-08-28 [1] CRAN (R 4.0.3)
 withr                  2.3.0    2020-09-22 [1] CRAN (R 4.0.3)
 workflowr              1.6.2    2020-04-30 [1] CRAN (R 4.0.3)
 xfun                   0.19     2020-10-30 [1] CRAN (R 4.0.3)
 XVector                0.30.0   2020-10-27 [1] Bioconductor  
 yaml                   2.2.1    2020-02-01 [1] CRAN (R 4.0.3)
 zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  

[1] /mnt/mcfiles/jpullin/R/x86_64-pc-linux-gnu-library/4.0
[2] /opt/R/4.0.3/lib/R/library