Last updated: 2021-02-10
Knit directory: mage_2020_marker-gene-benchmarking/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20190102) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 1ad9d6d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .RData
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .snakemake/
Ignored: config/
Ignored: data/sim_data/
Ignored: logs/
Ignored: results/
Unstaged changes:
Deleted: analysis/method-concordance.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/simulation.Rmd) and HTML (public/simulation.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | e1ef749 | Jeffrey Pullin | 2020-12-08 | Update 8/12/2020 |
This document outlines the simulation framework used in this project.
We will use {splatter} as the basic simulation framework for the project.
We will use method = "groups" as the basic setup for the simulations.
The relevant parameters for the groups method are:
- nGroups: number of groups (not set directly)
- group.prob: probability of cell being in each group

In {splatter} groups are created by simulating a random number of DE genes in each group. The relevant parameters for the DE method are:
- de.prob: probability that a gene is DE in any group (default: 0.1)
- de.downProb: probability that a DE gene is down-regulated (default: 0.5)
- de.facLoc: location (meanlog) of the DE factor log-normal distribution

n_groups <- 5
# Can't set nCells directly but batchCells = nCells if 1 batch.
params <- newSplatParams(group.prob = rep(1/n_groups, n_groups),
                         de.facLoc = 2,
                         batchCells = 1000)
splatter_sim <- splatSimulate(params, method = "groups", verbose = FALSE)
splatter_sim
class: SingleCellExperiment
dim: 10000 1000
metadata(1): Params
assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
rowData names(9): Gene BaseGeneMean ... DEFacGroup4 DEFacGroup5
colnames(1000): Cell1 Cell2 ... Cell999 Cell1000
colData names(4): Cell Batch Group ExpLibSize
reducedDimNames(0):
altExpNames(0):
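To make the DE parameters listed above more concrete, the following base R sketch draws DE factors for a single group in the way the parameter descriptions suggest. It is an illustration under stated assumptions, not {splatter}'s actual code: the helper simulate_de_factors() is hypothetical, and de.facScale (the sdlog of the log-normal distribution, which is not discussed above) is set to 0.4 purely for illustration.

```r
# Sketch of how group-specific DE factors could be drawn (base R only).
# Assumption: this follows the parameter descriptions above; it is not
# splatter's implementation.
set.seed(1)

n_genes <- 10000
de_prob <- 0.1       # de.prob: probability that a gene is DE in a group
de_down_prob <- 0.5  # de.downProb: probability a DE gene is down-regulated
de_fac_loc <- 2      # de.facLoc: meanlog of the log-normal DE factor
de_fac_scale <- 0.4  # de.facScale: sdlog (illustrative value, an assumption)

simulate_de_factors <- function(n_genes, de_prob, de_down_prob,
                                de_fac_loc, de_fac_scale) {
  # Non-DE genes keep a factor of exactly 1.
  factors <- rep(1, n_genes)
  is_de <- rbinom(n_genes, 1, de_prob) == 1
  n_de <- sum(is_de)
  # Draw the magnitude of each DE factor from a log-normal distribution.
  magnitude <- rlnorm(n_de, meanlog = de_fac_loc, sdlog = de_fac_scale)
  # Down-regulated genes get the reciprocal of the drawn magnitude.
  is_down <- rbinom(n_de, 1, de_down_prob) == 1
  factors[is_de] <- ifelse(is_down, 1 / magnitude, magnitude)
  factors
}

toy_factors <- simulate_de_factors(n_genes, de_prob, de_down_prob,
                                   de_fac_loc, de_fac_scale)

# Number of DE genes in this toy group; expected to be near n_genes * de_prob.
sum(toy_factors != 1)
```

Under these assumptions, de.prob = 0.1 with 10000 genes gives roughly 1000 DE genes per group, and a larger de.facLoc pushes the factors further from 1, which should make the groups easier to separate.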
After we do the simulation we need to do several things to the simulated SingleCellExperiment object:
# From code/simulation.R
# Should this just return the gene names?
extract_de_inds <- function(sce) {
  stopifnot(is(sce, "SingleCellExperiment"))
  # There is one DEFacGroup column in rowData for each simulated group.
  n_groups <- length(unique(colData(sce)$Group))
  col_names <- paste0("DEFacGroup", seq_len(n_groups))
  data <- rowData(sce)[, col_names]
  # A DE factor different from 1 marks a gene as DE in that group.
  out <- lapply(data, function(x) which(x != 1))
  names(out) <- paste0("group_", seq_len(n_groups))
  out
}
de_inds <- extract_de_inds(splatter_sim)
str(de_inds)
List of 5
$ group_1: int [1:911] 7 9 19 31 36 42 56 57 65 101 ...
$ group_2: int [1:1018] 22 26 29 33 34 36 43 48 50 54 ...
$ group_3: int [1:1000] 5 16 32 41 43 61 65 71 72 85 ...
$ group_4: int [1:1023] 8 10 24 37 43 48 57 66 90 95 ...
$ group_5: int [1:981] 3 9 18 25 26 37 50 71 72 122 ...
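Because marker genes are often expected to be up-regulated in their group, it may also be useful to split the DE genes by direction. The sketch below shows one way this could be done on a toy vector of DE factors. The helper split_de_direction() and the toy values are hypothetical and are not part of code/simulation.R.

```r
# Toy stand-in for one DEFacGroup column: one DE factor per gene.
# (Hypothetical values, chosen only to illustrate the logic.)
de_fac_group_1 <- c(Gene1 = 1, Gene2 = 3.2, Gene3 = 0.4, Gene4 = 1, Gene5 = 7.5)

split_de_direction <- function(de_fac) {
  list(
    up = which(de_fac > 1),    # factor above 1: up-regulated in the group
    down = which(de_fac < 1)   # factor below 1: down-regulated in the group
  )
}

de_direction <- split_de_direction(de_fac_group_1)

# which() keeps the names of a named vector, so both the indices and the
# gene names are available from the same result.
de_direction$up
names(de_direction$down)
```

Keeping the names alongside the indices is one possible answer to the question of whether gene names should be returned instead of indices.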
Next, we need to process the object into the form that can be used by marker gene methods. Specifically we need to:
NB: quickCluster no longer warns when nCells = 1000
quick_clusters <- quickCluster(splatter_sim)
# Gives message:
# assuming UMI data when setting 'min.mean'
splatter_sim <- computeSumFactors(splatter_sim, clusters = quick_clusters)
Warning in .guess_min_mean(x, min.mean = min.mean, BPPARAM = BPPARAM): assuming
UMI data when setting 'min.mean'
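As a rough illustration of what size factors are used for, the base R sketch below normalises a toy count matrix with simple library-size factors and then log-transforms it. This is a deliberately simplified stand-in: computeSumFactors uses a pooling-based deconvolution approach, which is not reproduced here, and all object names in the sketch are hypothetical.

```r
set.seed(1)

# Toy count matrix: 5 genes (rows) by 4 cells (columns).
toy_counts <- matrix(rpois(20, lambda = 10), nrow = 5,
                     dimnames = list(paste0("Gene", 1:5), paste0("Cell", 1:4)))

# Library-size factors, scaled so that they have mean 1 across cells.
lib_sizes <- colSums(toy_counts)
size_factors <- lib_sizes / mean(lib_sizes)

# Divide each cell (column) by its size factor, then log-transform with a
# pseudo-count of 1.
norm_counts <- sweep(toy_counts, 2, size_factors, "/")
toy_logcounts <- log2(norm_counts + 1)

# After normalisation every cell has the same total count.
colSums(norm_counts)
```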
We perform EDA to check that the simulation produces reasonable data.
NB: 1000 cells/10000 genes is computationally manageable
dec_splatter_sim <- modelGeneVarByPoisson(splatter_sim)
splatter_sim <- denoisePCA(splatter_sim, technical = dec_splatter_sim)
plotPCA(splatter_sim, colour_by = "Group")
PCA and tSNE both show grouping, but the grouping in the tSNE appears weaker.
When de.facLoc is increased to 2 the clusters are very well separated.
With the transformed data we can run the marker gene selection methods. For now we run scran only.
With the output we can then calculate various summaries of quality of the calculated marker genes.
Initially we will focus on the markers for group 1 only.
One important question is how to choose the top marker genes from the {scran} output.
scran_group_1_mgs <- scran_mgs[[1]]
# This selects the top 2 genes in each pairwise comparison.
# scran_group_1_mgs <- scran_group_1_mgs[scran_group_1_mgs$Top <= 2, ]
# Just select the top 30
scran_group_1_mgs <- scran_group_1_mgs[1:30, ]
scran_group_1_nums <- readr::parse_number(rownames(scran_group_1_mgs))
length(scran_group_1_nums)
[1] 30
[1] 30
Really poor performance even with the simplest possible simulation… The clusters are clear and the number of cells is large, so this is unexpected…
Even when the number of MGs selected is small, many of those selected are not real marker genes.
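One way to quantify this is to compute precision and recall for the top k selected genes against the simulated truth. The sketch below uses hypothetical toy indices and hypothetical helper names (precision_at_k(), recall_at_k()); it is not necessarily the summary used to reach the conclusions above.

```r
# Hypothetical inputs: indices of the truly DE genes for one group, and the
# indices of the genes ranked by a marker gene method (best first).
true_de <- c(7, 9, 19, 31, 36)
ranked_selected <- c(9, 4, 7, 50, 36, 12)

# Fraction of the top k selected genes that are truly DE.
precision_at_k <- function(ranked, truth, k) {
  top_k <- head(ranked, k)
  sum(top_k %in% truth) / length(top_k)
}

# Fraction of the truly DE genes that appear in the top k selected genes.
recall_at_k <- function(ranked, truth, k) {
  top_k <- head(ranked, k)
  sum(top_k %in% truth) / length(truth)
}

precision_at_k(ranked_selected, true_de, k = 3)
recall_at_k(ranked_selected, true_de, k = 3)
```

In this analysis the corresponding inputs would be scran_group_1_nums as the ranked selection and de_inds$group_1 as the truth.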
Let’s try to understand why the performance is so bad…
Need to test the different scran options
pval.type = "all" gives much better performance in the simple simulation.
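A possible explanation, not verified here, lies in how the pairwise p-values are combined. As we understand the {scran} documentation, pval.type = "any" combines the pairwise p-values for a group with Simes' method, so a gene that differs from just one other group can rank highly, while pval.type = "all" uses the largest pairwise p-value, so a gene must differ from every other group. The base R sketch below illustrates the difference on hypothetical p-values; the simes() helper is written for this sketch and is not {scran}'s code.

```r
# Hypothetical pairwise p-values for group 1 versus groups 2 to 5.
# Rows are genes, columns are pairwise comparisons.
pairwise_p <- rbind(
  marker_for_group_1 = c(1e-8, 1e-6, 1e-7, 1e-5),  # differs from every group
  marker_for_group_2 = c(1e-9, 0.60, 0.45, 0.80)   # differs from group 2 only
)
colnames(pairwise_p) <- paste0("vs_group_", 2:5)

# Simes' method: the minimum over i of n * p_(i) / i, where p_(i) is the
# i-th smallest p-value.
simes <- function(p) {
  n <- length(p)
  min(n * sort(p) / seq_len(n))
}

combined_any <- apply(pairwise_p, 1, simes)  # analogue of pval.type = "any"
combined_all <- apply(pairwise_p, 1, max)    # analogue of pval.type = "all"

combined_any
combined_all
```

In this toy example the gene that only distinguishes group 1 from group 2 gets the smaller combined p-value under the "any" analogue, but a large one under the "all" analogue. In the simulation such a gene would have a DE factor of 1 in group 1, so it would count as a false positive for group 1.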
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.0.3 (2020-10-10)
os Red Hat Enterprise Linux
system x86_64, linux-gnu
ui X11
language (EN)
collate en_AU.UTF-8
ctype en_AU.UTF-8
tz Australia/Melbourne
date 2021-02-10
─ Packages ───────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.3)
backports 1.2.0 2020-11-02 [1] CRAN (R 4.0.3)
beachmat 2.6.2 2020-11-24 [1] Bioconductor
beeswarm 0.2.3 2016-04-25 [1] CRAN (R 4.0.3)
Biobase * 2.50.0 2020-10-27 [1] Bioconductor
BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
BiocNeighbors 1.8.1 2020-11-11 [1] Bioconductor
BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
BiocSingular 1.6.0 2020-10-27 [1] Bioconductor
bitops 1.0-6 2013-08-17 [1] CRAN (R 4.0.3)
bluster 1.0.0 2020-10-27 [1] Bioconductor
callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.3)
checkmate 2.0.0 2020-02-06 [1] CRAN (R 4.0.3)
cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.3)
colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.3)
cowplot 1.1.0 2020-09-08 [1] CRAN (R 4.0.3)
crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.3)
DelayedArray 0.16.0 2020-10-27 [1] Bioconductor
DelayedMatrixStats 1.12.1 2020-11-24 [1] Bioconductor
desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.3)
devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.3)
digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
dplyr 1.0.2 2020-08-18 [1] CRAN (R 4.0.3)
dqrng 0.2.1 2019-05-17 [1] CRAN (R 4.0.3)
edgeR 3.32.0 2020-10-27 [1] Bioconductor
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.3)
evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.3)
fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.3)
farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.3)
fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.3)
generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
GenomeInfoDb * 1.26.1 2020-11-20 [1] Bioconductor
GenomeInfoDbData 1.2.4 2020-12-07 [1] Bioconductor
GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 4.0.3)
ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.3)
git2r 0.28.0 2021-01-10 [1] CRAN (R 4.0.3)
glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.3)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.0.3)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.3)
hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.3)
htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.3)
httpuv 1.5.4 2020-06-06 [1] CRAN (R 4.0.3)
igraph 1.2.6 2020-10-06 [1] CRAN (R 4.0.3)
IRanges * 2.24.0 2020-10-27 [1] Bioconductor
irlba 2.3.3 2019-02-05 [1] CRAN (R 4.0.3)
knitr 1.30 2020-09-22 [1] CRAN (R 4.0.3)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.3)
later 1.1.0.1 2020-06-05 [1] CRAN (R 4.0.3)
lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.3)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.3)
limma 3.46.0 2020-10-27 [1] Bioconductor
locfit 1.5-9.4 2020-03-25 [1] CRAN (R 4.0.3)
magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.3)
MatrixGenerics * 1.2.0 2020-10-27 [1] Bioconductor
matrixStats * 0.57.0 2020-09-25 [1] CRAN (R 4.0.3)
memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.3)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.3)
pheatmap * 1.0.12 2019-01-04 [1] CRAN (R 4.0.3)
pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3)
pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.3)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.3)
pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.3)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.3)
processx 3.4.5 2020-11-30 [1] CRAN (R 4.0.3)
promises 1.1.1 2020-06-09 [1] CRAN (R 4.0.3)
ps 1.5.0 2020-12-05 [1] CRAN (R 4.0.3)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.3)
R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.0.3)
Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.3)
RCurl 1.98-1.2 2020-04-18 [1] CRAN (R 4.0.3)
readr 1.4.0 2020-10-05 [1] CRAN (R 4.0.3)
remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.3)
rlang 0.4.9 2020-11-26 [1] CRAN (R 4.0.3)
rmarkdown 2.5 2020-10-21 [1] CRAN (R 4.0.3)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
rsvd 1.0.3 2020-02-17 [1] CRAN (R 4.0.3)
Rtsne 0.15 2018-11-10 [1] CRAN (R 4.0.3)
S4Vectors * 0.28.0 2020-10-27 [1] Bioconductor
scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.3)
scater * 1.18.3 2020-11-08 [1] Bioconductor
scran * 1.18.1 2020-11-05 [1] Bioconductor
scuttle 1.0.3 2020-11-23 [1] Bioconductor
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.3)
SingleCellExperiment * 1.12.0 2020-10-27 [1] Bioconductor
sparseMatrixStats 1.2.0 2020-10-27 [1] Bioconductor
splatter * 1.14.1 2020-12-01 [1] Bioconductor
statmod 1.4.35 2020-10-19 [1] CRAN (R 4.0.3)
stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.3)
stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.3)
SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
testthat 3.0.0 2020-10-31 [1] CRAN (R 4.0.3)
tibble 3.0.4 2020-10-12 [1] CRAN (R 4.0.3)
tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.3)
usethis 2.0.0 2020-12-10 [1] CRAN (R 4.0.3)
vctrs 0.3.5 2020-11-17 [1] CRAN (R 4.0.3)
vipor 0.4.5 2017-03-22 [1] CRAN (R 4.0.3)
viridis 0.5.1 2018-03-29 [1] CRAN (R 4.0.3)
viridisLite 0.3.0 2018-02-01 [1] CRAN (R 4.0.3)
whisker 0.4 2019-08-28 [1] CRAN (R 4.0.3)
withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.3)
workflowr 1.6.2 2020-04-30 [1] CRAN (R 4.0.3)
xfun 0.19 2020-10-30 [1] CRAN (R 4.0.3)
XVector 0.30.0 2020-10-27 [1] Bioconductor
yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.3)
zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
[1] /mnt/mcfiles/jpullin/R/x86_64-pc-linux-gnu-library/4.0
[2] /opt/R/4.0.3/lib/R/library