Last updated: 2021-02-17

Checks: 7 0

Knit directory: mage_2020_marker-gene-benchmarking/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20190102) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version a3d68ac. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .snakemake/
    Ignored:    config/
    Ignored:    data/sim_data/
    Ignored:    logs/
    Ignored:    results/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/marker-genes-types.Rmd) and HTML (public/marker-genes-types.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd a3d68ac Jeffrey Pullin 2021-02-17 Update after discussion with Davis
html 3ecba5d Jeffrey Pullin 2021-02-16 Build site.
Rmd 94d2ecf Jeffrey Pullin 2021-02-16 Improve wording in marker gene type analysis
html ca82ce0 Jeffrey Pullin 2021-02-16 Build site.
Rmd 63bce66 Jeffrey Pullin 2021-02-16 Add new marker gene type analysis and update FDR + TPR
Rmd c947596 Jeffrey Pullin 2021-02-16 Rename file to relfect focus on types of marker genes

library(SingleCellExperiment)
library(tibble)
library(dplyr)
library(scater)
library(patchwork)
source(here::here("code", "find-marker-genes.R"))

Aim

Explore different types of ‘marker genes’ in scRNA-seq data (currently using Splatter simulations)

Analysis

For now we will simulate all data using the Splatter package, meaning that the types of marker genes will be defined, at least to some extent, with respect to how the Splatter simulations work.

Throughout this file we will work with one of the ‘standard’ Splatter simulations.

sim <- readRDS(here::here("data", "sim_data", "sim_1.rds"))

‘Unique’ marker genes

The simplest type of marker genes are what I am (for now) calling unqiue marker genes. These are genes which are up or down regulated in the cluster of interest and expressed at normal/background level in all other groups. In splatter we can identify these genes by looking for genes which have DE parameter not equal to one in the group of interest and equal to one in all other groups.

We can get these unique marker genes (UMGs) in a SingleCellExperiment object output from splatter with the find_unique_marker_genes() function. This will return a list of marker genes in each group. Each element of the list is data.frame with columns gene and fc holding the gene id and splatter parameter representing a fold change. (TODO: fc is really a fold change?)

umgs <- find_unique_marker_genes(sim)

One critical feature of unique marker genes is that their magnitude/importance can be defined simply by a single number allowing ranking. For example we can find the top 3 most up-regulated genes and top 3 most down-regulated genes for group 1 with the following code

umgs_group_1 <- as_tibble(umgs[[1]])

top_3_ur <- umgs_group_1 %>% 
  arrange(desc(fc)) %>% 
  slice(1:3) %>% 
  pull(gene)

top_3_dr <- umgs_group_1 %>% 
  arrange(fc) %>% 
  slice(1:3) %>% 
  pull(gene)

We can then visual these genes on a tSNE plot (TODO: I would like to use {schex} for these plots but haven’t been able to get it to install)

sim <- runTSNE(sim)

plots <- lapply(c(top_3_ur, top_3_dr), function(x) {
  plotTSNE(sim, colour_by = x)
})

We can see that for the second and third plots the up-regulated genes stand out substantially from the others. This effect is less pronounced for the first plot despite it having the highest splatter parameter value. More work is needed to understand why this is.

wrap_plots(plots[1:3])

Version Author Date
ca82ce0 Jeffrey Pullin 2021-02-16

The impact of down-regulation is less visually pronounced in the plots especially 1 and 2 though it can be easily seen in Gene9111.

wrap_plots(plots[4:6])

Version Author Date
ca82ce0 Jeffrey Pullin 2021-02-16

‘Semi-unique’ marker genes

(Name to be workshopped.)

Next we can consider marker genes which are semi-unique (suggestions for a better name appreciated) that is genes which are different (up-regulated say) in the cluster of interest and in only a few other clusters.

The precise number of other clusters will depend on the number of total clusters. If, for example, there are 40 clusters genes up-regulated in 4-5 clusters may still be useful marker genes. In the simulations done to date there are only 5 total clusters meaning that semi-unique clusters can really only be different in one other cluster.

One way semi-unique clusters could occur is if several sub-types of a broad cell type were present in the data.

We can find these genes using the find_semi_unique_marker_genes() function:

sumgs <- find_semi_unique_marker_genes(sim)

Similarly to find_unique_marker_genes() this returns a list of data.frames, which in addition to the columns gene and fc have columns other_group and other_fc which contain the index of the ‘other’ group with change and the splatter parameter of the change in the other group respectively.

Note that it is difficult to rank the magnitude/importance of semi-unique marker genes.

We can find and visualize the top 3 most up-regulated marker genes in group 1. Here we simply define the magnitude of the gene by the size of the splatter parameter in the group of interest.

sumgs_group_1 <- as_tibble(sumgs[[1]]) 

top_3_sumgs <- sumgs_group_1 %>% 
  arrange(desc(fc), desc(fc)) %>% 
  slice(1:3) %>% 
  pull(gene)

sumgs_plots <- lapply(top_3_sumgs, function(x) {
  plotTSNE(sim, colour_by = x)
})

wrap_plots(sumgs_plots)

Version Author Date
ca82ce0 Jeffrey Pullin 2021-02-16

We can again see that sometimes even genes with large splatter parameter don’t appear clearly on the plot.

Other marker gene types

I (currently at least) believe that the above types of marker genes are the only ones that are useful of our purposes.

It is of course possible to generalize the semi-unique marker genes to have more ‘other’ groups, but I don’t think that genes which are different in a majority of clusters really qualify as marker genes.

Limitation: no analysis of real data

The results here are based on simulations and my understanding of how marker genes are usually defined. A natural next step would be to look at the expression patterns of known marker genes in real data.


devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.0.3 (2020-10-10)
 os       Red Hat Enterprise Linux    
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_AU.UTF-8                 
 ctype    en_AU.UTF-8                 
 tz       Australia/Melbourne         
 date     2021-02-17                  

─ Packages ───────────────────────────────────────────────────────────────────
 package              * version  date       lib source        
 assertthat             0.2.1    2019-03-21 [1] CRAN (R 4.0.3)
 beachmat               2.6.2    2020-11-24 [1] Bioconductor  
 beeswarm               0.2.3    2016-04-25 [1] CRAN (R 4.0.3)
 Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
 BiocGenerics         * 0.36.0   2020-10-27 [1] Bioconductor  
 BiocNeighbors          1.8.1    2020-11-11 [1] Bioconductor  
 BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
 BiocSingular           1.6.0    2020-10-27 [1] Bioconductor  
 bitops                 1.0-6    2013-08-17 [1] CRAN (R 4.0.3)
 callr                  3.5.1    2020-10-13 [1] CRAN (R 4.0.3)
 cli                    2.2.0    2020-11-20 [1] CRAN (R 4.0.3)
 colorspace             2.0-0    2020-11-11 [1] CRAN (R 4.0.3)
 cowplot                1.1.0    2020-09-08 [1] CRAN (R 4.0.3)
 crayon                 1.3.4    2017-09-16 [1] CRAN (R 4.0.3)
 DelayedArray           0.16.0   2020-10-27 [1] Bioconductor  
 DelayedMatrixStats     1.12.1   2020-11-24 [1] Bioconductor  
 desc                   1.2.0    2018-05-01 [1] CRAN (R 4.0.3)
 devtools               2.3.2    2020-09-18 [1] CRAN (R 4.0.3)
 digest                 0.6.27   2020-10-24 [1] CRAN (R 4.0.3)
 dplyr                * 1.0.2    2020-08-18 [1] CRAN (R 4.0.3)
 ellipsis               0.3.1    2020-05-15 [1] CRAN (R 4.0.3)
 evaluate               0.14     2019-05-28 [1] CRAN (R 4.0.3)
 fansi                  0.4.1    2020-01-08 [1] CRAN (R 4.0.3)
 farver                 2.0.3    2020-01-16 [1] CRAN (R 4.0.3)
 fs                     1.5.0    2020-07-31 [1] CRAN (R 4.0.3)
 generics               0.1.0    2020-10-31 [1] CRAN (R 4.0.3)
 GenomeInfoDb         * 1.26.1   2020-11-20 [1] Bioconductor  
 GenomeInfoDbData       1.2.4    2020-12-07 [1] Bioconductor  
 GenomicRanges        * 1.42.0   2020-10-27 [1] Bioconductor  
 ggbeeswarm             0.6.0    2017-08-07 [1] CRAN (R 4.0.3)
 ggplot2              * 3.3.2    2020-06-19 [1] CRAN (R 4.0.3)
 git2r                  0.28.0   2021-01-10 [1] CRAN (R 4.0.3)
 glue                   1.4.2    2020-08-27 [1] CRAN (R 4.0.3)
 gridExtra              2.3      2017-09-09 [1] CRAN (R 4.0.3)
 gtable                 0.3.0    2019-03-25 [1] CRAN (R 4.0.3)
 here                   1.0.0    2020-11-15 [1] CRAN (R 4.0.3)
 htmltools              0.5.0    2020-06-16 [1] CRAN (R 4.0.3)
 httpuv                 1.5.4    2020-06-06 [1] CRAN (R 4.0.3)
 IRanges              * 2.24.0   2020-10-27 [1] Bioconductor  
 irlba                  2.3.3    2019-02-05 [1] CRAN (R 4.0.3)
 knitr                  1.30     2020-09-22 [1] CRAN (R 4.0.3)
 labeling               0.4.2    2020-10-20 [1] CRAN (R 4.0.3)
 later                  1.1.0.1  2020-06-05 [1] CRAN (R 4.0.3)
 lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.3)
 lifecycle              0.2.0    2020-03-06 [1] CRAN (R 4.0.3)
 magrittr               2.0.1    2020-11-17 [1] CRAN (R 4.0.3)
 Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.3)
 MatrixGenerics       * 1.2.0    2020-10-27 [1] Bioconductor  
 matrixStats          * 0.57.0   2020-09-25 [1] CRAN (R 4.0.3)
 memoise                1.1.0    2017-04-21 [1] CRAN (R 4.0.3)
 munsell                0.5.0    2018-06-12 [1] CRAN (R 4.0.3)
 patchwork            * 1.1.0    2020-11-09 [1] CRAN (R 4.0.3)
 pillar                 1.4.7    2020-11-20 [1] CRAN (R 4.0.3)
 pkgbuild               1.1.0    2020-07-13 [1] CRAN (R 4.0.3)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.0.3)
 pkgload                1.1.0    2020-05-29 [1] CRAN (R 4.0.3)
 prettyunits            1.1.1    2020-01-24 [1] CRAN (R 4.0.3)
 processx               3.4.5    2020-11-30 [1] CRAN (R 4.0.3)
 promises               1.1.1    2020-06-09 [1] CRAN (R 4.0.3)
 ps                     1.5.0    2020-12-05 [1] CRAN (R 4.0.3)
 purrr                  0.3.4    2020-04-17 [1] CRAN (R 4.0.3)
 R6                     2.5.0    2020-10-28 [1] CRAN (R 4.0.3)
 Rcpp                   1.0.5    2020-07-06 [1] CRAN (R 4.0.3)
 RCurl                  1.98-1.2 2020-04-18 [1] CRAN (R 4.0.3)
 remotes                2.2.0    2020-07-21 [1] CRAN (R 4.0.3)
 rlang                  0.4.9    2020-11-26 [1] CRAN (R 4.0.3)
 rmarkdown              2.5      2020-10-21 [1] CRAN (R 4.0.3)
 rprojroot              2.0.2    2020-11-15 [1] CRAN (R 4.0.3)
 rstudioapi             0.13     2020-11-12 [1] CRAN (R 4.0.3)
 rsvd                   1.0.3    2020-02-17 [1] CRAN (R 4.0.3)
 Rtsne                  0.15     2018-11-10 [1] CRAN (R 4.0.3)
 S4Vectors            * 0.28.0   2020-10-27 [1] Bioconductor  
 scales                 1.1.1    2020-05-11 [1] CRAN (R 4.0.3)
 scater               * 1.18.3   2020-11-08 [1] Bioconductor  
 scuttle                1.0.3    2020-11-23 [1] Bioconductor  
 sessioninfo            1.1.1    2018-11-05 [1] CRAN (R 4.0.3)
 SingleCellExperiment * 1.12.0   2020-10-27 [1] Bioconductor  
 sparseMatrixStats      1.2.0    2020-10-27 [1] Bioconductor  
 stringi                1.5.3    2020-09-09 [1] CRAN (R 4.0.3)
 stringr                1.4.0    2019-02-10 [1] CRAN (R 4.0.3)
 SummarizedExperiment * 1.20.0   2020-10-27 [1] Bioconductor  
 testthat               3.0.0    2020-10-31 [1] CRAN (R 4.0.3)
 tibble               * 3.0.4    2020-10-12 [1] CRAN (R 4.0.3)
 tidyselect             1.1.0    2020-05-11 [1] CRAN (R 4.0.3)
 usethis                2.0.0    2020-12-10 [1] CRAN (R 4.0.3)
 vctrs                  0.3.5    2020-11-17 [1] CRAN (R 4.0.3)
 vipor                  0.4.5    2017-03-22 [1] CRAN (R 4.0.3)
 viridis                0.5.1    2018-03-29 [1] CRAN (R 4.0.3)
 viridisLite            0.3.0    2018-02-01 [1] CRAN (R 4.0.3)
 whisker                0.4      2019-08-28 [1] CRAN (R 4.0.3)
 withr                  2.3.0    2020-09-22 [1] CRAN (R 4.0.3)
 workflowr              1.6.2    2020-04-30 [1] CRAN (R 4.0.3)
 xfun                   0.19     2020-10-30 [1] CRAN (R 4.0.3)
 XVector                0.30.0   2020-10-27 [1] Bioconductor  
 yaml                   2.2.1    2020-02-01 [1] CRAN (R 4.0.3)
 zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  

[1] /mnt/mcfiles/jpullin/R/x86_64-pc-linux-gnu-library/4.0
[2] /opt/R/4.0.3/lib/R/library