Last updated: 2021-02-17
Checks: 7 0
Knit directory: mage_2020_marker-gene-benchmarking/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20190102)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version a3d68ac. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .RData
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .snakemake/
Ignored: config/
Ignored: data/sim_data/
Ignored: logs/
Ignored: results/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/marker-genes-types.Rmd
) and HTML (public/marker-genes-types.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | a3d68ac | Jeffrey Pullin | 2021-02-17 | Update after discussion with Davis |
html | 3ecba5d | Jeffrey Pullin | 2021-02-16 | Build site. |
Rmd | 94d2ecf | Jeffrey Pullin | 2021-02-16 | Improve wording in marker gene type analysis |
html | ca82ce0 | Jeffrey Pullin | 2021-02-16 | Build site. |
Rmd | 63bce66 | Jeffrey Pullin | 2021-02-16 | Add new marker gene type analysis and update FDR + TPR |
Rmd | c947596 | Jeffrey Pullin | 2021-02-16 | Rename file to relfect focus on types of marker genes |
Explore different types of ‘marker genes’ in scRNA-seq data (currently using Splatter simulations)
For now we will simulate all data using the Splatter package, meaning that the types of marker genes will be defined, at least to some extent, with respect to how the Splatter simulations work.
Throughout this file we will work with one of the ‘standard’ Splatter simulations.
The simplest type of marker genes are what I am (for now) calling unqiue marker genes. These are genes which are up or down regulated in the cluster of interest and expressed at normal/background level in all other groups. In splatter we can identify these genes by looking for genes which have DE parameter not equal to one in the group of interest and equal to one in all other groups.
We can get these unique marker genes (UMGs) in a SingleCellExperiment object output from splatter with the find_unique_marker_genes()
function. This will return a list of marker genes in each group. Each element of the list is data.frame
with columns gene
and fc
holding the gene id and splatter parameter representing a fold change. (TODO: fc
is really a fold change?)
One critical feature of unique marker genes is that their magnitude/importance can be defined simply by a single number allowing ranking. For example we can find the top 3 most up-regulated genes and top 3 most down-regulated genes for group 1 with the following code
umgs_group_1 <- as_tibble(umgs[[1]])
top_3_ur <- umgs_group_1 %>%
arrange(desc(fc)) %>%
slice(1:3) %>%
pull(gene)
top_3_dr <- umgs_group_1 %>%
arrange(fc) %>%
slice(1:3) %>%
pull(gene)
We can then visual these genes on a tSNE plot (TODO: I would like to use {schex} for these plots but haven’t been able to get it to install)
sim <- runTSNE(sim)
plots <- lapply(c(top_3_ur, top_3_dr), function(x) {
plotTSNE(sim, colour_by = x)
})
We can see that for the second and third plots the up-regulated genes stand out substantially from the others. This effect is less pronounced for the first plot despite it having the highest splatter parameter value. More work is needed to understand why this is.
Version | Author | Date |
---|---|---|
ca82ce0 | Jeffrey Pullin | 2021-02-16 |
The impact of down-regulation is less visually pronounced in the plots especially 1 and 2 though it can be easily seen in Gene9111.
Version | Author | Date |
---|---|---|
ca82ce0 | Jeffrey Pullin | 2021-02-16 |
(Name to be workshopped.)
Next we can consider marker genes which are semi-unique (suggestions for a better name appreciated) that is genes which are different (up-regulated say) in the cluster of interest and in only a few other clusters.
The precise number of other clusters will depend on the number of total clusters. If, for example, there are 40 clusters genes up-regulated in 4-5 clusters may still be useful marker genes. In the simulations done to date there are only 5 total clusters meaning that semi-unique clusters can really only be different in one other cluster.
One way semi-unique clusters could occur is if several sub-types of a broad cell type were present in the data.
We can find these genes using the find_semi_unique_marker_genes()
function:
Similarly to find_unique_marker_genes()
this returns a list of data.frame
s, which in addition to the columns gene
and fc
have columns other_group
and other_fc
which contain the index of the ‘other’ group with change and the splatter parameter of the change in the other group respectively.
Note that it is difficult to rank the magnitude/importance of semi-unique marker genes.
We can find and visualize the top 3 most up-regulated marker genes in group 1. Here we simply define the magnitude of the gene by the size of the splatter parameter in the group of interest.
sumgs_group_1 <- as_tibble(sumgs[[1]])
top_3_sumgs <- sumgs_group_1 %>%
arrange(desc(fc), desc(fc)) %>%
slice(1:3) %>%
pull(gene)
sumgs_plots <- lapply(top_3_sumgs, function(x) {
plotTSNE(sim, colour_by = x)
})
wrap_plots(sumgs_plots)
Version | Author | Date |
---|---|---|
ca82ce0 | Jeffrey Pullin | 2021-02-16 |
We can again see that sometimes even genes with large splatter parameter don’t appear clearly on the plot.
I (currently at least) believe that the above types of marker genes are the only ones that are useful of our purposes.
It is of course possible to generalize the semi-unique marker genes to have more ‘other’ groups, but I don’t think that genes which are different in a majority of clusters really qualify as marker genes.
The results here are based on simulations and my understanding of how marker genes are usually defined. A natural next step would be to look at the expression patterns of known marker genes in real data.
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.0.3 (2020-10-10)
os Red Hat Enterprise Linux
system x86_64, linux-gnu
ui X11
language (EN)
collate en_AU.UTF-8
ctype en_AU.UTF-8
tz Australia/Melbourne
date 2021-02-17
─ Packages ───────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.3)
beachmat 2.6.2 2020-11-24 [1] Bioconductor
beeswarm 0.2.3 2016-04-25 [1] CRAN (R 4.0.3)
Biobase * 2.50.0 2020-10-27 [1] Bioconductor
BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
BiocNeighbors 1.8.1 2020-11-11 [1] Bioconductor
BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
BiocSingular 1.6.0 2020-10-27 [1] Bioconductor
bitops 1.0-6 2013-08-17 [1] CRAN (R 4.0.3)
callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.3)
cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.3)
colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.3)
cowplot 1.1.0 2020-09-08 [1] CRAN (R 4.0.3)
crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.3)
DelayedArray 0.16.0 2020-10-27 [1] Bioconductor
DelayedMatrixStats 1.12.1 2020-11-24 [1] Bioconductor
desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.3)
devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.3)
digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.3)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.3)
evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.3)
fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.3)
farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.3)
fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.3)
generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
GenomeInfoDb * 1.26.1 2020-11-20 [1] Bioconductor
GenomeInfoDbData 1.2.4 2020-12-07 [1] Bioconductor
GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 4.0.3)
ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.3)
git2r 0.28.0 2021-01-10 [1] CRAN (R 4.0.3)
glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.3)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.0.3)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.3)
here 1.0.0 2020-11-15 [1] CRAN (R 4.0.3)
htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.3)
httpuv 1.5.4 2020-06-06 [1] CRAN (R 4.0.3)
IRanges * 2.24.0 2020-10-27 [1] Bioconductor
irlba 2.3.3 2019-02-05 [1] CRAN (R 4.0.3)
knitr 1.30 2020-09-22 [1] CRAN (R 4.0.3)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.3)
later 1.1.0.1 2020-06-05 [1] CRAN (R 4.0.3)
lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.3)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.3)
magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.3)
MatrixGenerics * 1.2.0 2020-10-27 [1] Bioconductor
matrixStats * 0.57.0 2020-09-25 [1] CRAN (R 4.0.3)
memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.3)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.3)
patchwork * 1.1.0 2020-11-09 [1] CRAN (R 4.0.3)
pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3)
pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.3)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.3)
pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.3)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.3)
processx 3.4.5 2020-11-30 [1] CRAN (R 4.0.3)
promises 1.1.1 2020-06-09 [1] CRAN (R 4.0.3)
ps 1.5.0 2020-12-05 [1] CRAN (R 4.0.3)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.3)
R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.3)
RCurl 1.98-1.2 2020-04-18 [1] CRAN (R 4.0.3)
remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.3)
rlang 0.4.9 2020-11-26 [1] CRAN (R 4.0.3)
rmarkdown 2.5 2020-10-21 [1] CRAN (R 4.0.3)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
rsvd 1.0.3 2020-02-17 [1] CRAN (R 4.0.3)
Rtsne 0.15 2018-11-10 [1] CRAN (R 4.0.3)
S4Vectors * 0.28.0 2020-10-27 [1] Bioconductor
scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.3)
scater * 1.18.3 2020-11-08 [1] Bioconductor
scuttle 1.0.3 2020-11-23 [1] Bioconductor
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.3)
SingleCellExperiment * 1.12.0 2020-10-27 [1] Bioconductor
sparseMatrixStats 1.2.0 2020-10-27 [1] Bioconductor
stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.3)
stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.3)
SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
testthat 3.0.0 2020-10-31 [1] CRAN (R 4.0.3)
tibble * 3.0.4 2020-10-12 [1] CRAN (R 4.0.3)
tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.3)
usethis 2.0.0 2020-12-10 [1] CRAN (R 4.0.3)
vctrs 0.3.5 2020-11-17 [1] CRAN (R 4.0.3)
vipor 0.4.5 2017-03-22 [1] CRAN (R 4.0.3)
viridis 0.5.1 2018-03-29 [1] CRAN (R 4.0.3)
viridisLite 0.3.0 2018-02-01 [1] CRAN (R 4.0.3)
whisker 0.4 2019-08-28 [1] CRAN (R 4.0.3)
withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.3)
workflowr 1.6.2 2020-04-30 [1] CRAN (R 4.0.3)
xfun 0.19 2020-10-30 [1] CRAN (R 4.0.3)
XVector 0.30.0 2020-10-27 [1] Bioconductor
yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.3)
zlibbioc 1.36.0 2020-10-27 [1] Bioconductor
[1] /mnt/mcfiles/jpullin/R/x86_64-pc-linux-gnu-library/4.0
[2] /opt/R/4.0.3/lib/R/library