Skip to content
Snippets Groups Projects
splatter.Rmd 22.2 KiB
Newer Older
  • Learn to ignore specific revisions
  • ---
    title: "Introduction to Splatter"
    author: "Luke Zappia"
    date: "`r Sys.Date()`"
    output:
        BiocStyle::html_document:
            toc: true
    vignette: >
      %\VignetteIndexEntry{An introduction to the Splatter package}
      %\VignetteEngine{knitr::rmarkdown}
      \usepackage[utf8]{inputenc}
    ---
    
    ```{r knitr-options, echo = FALSE, message = FALSE, warning = FALSE}
    # To render an HTML version that works nicely with github and web pages, do:
    # rmarkdown::render("vignettes/splatter.Rmd", "all")
    knitr::opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5,
                          dev = 'png')
    ```
    
    ![Splatter logo](splatter-logo-small.png)
    
    Welcome to Splatter! Splatter is an R package for the simple simulation of
    single-cell RNA sequencing data. This vignette gives an overview and
    introduction to Splatter's functionality.
    
    # Installation
    
    Splatter can be installed from Bioconductor:
    
    ```{r install, eval = FALSE}
    source("https://bioconductor.org/biocLite.R")
    biocLite("splatter")
    ```
    
    To install the most recent development version from Github use:
    
    ```{r install-github, eval = FALSE}
    biocLite("Oshlack/splatter", dependencies = TRUE, 
             build_vignettes = TRUE)
    ```
    
    # Quickstart
    
    Assuming you already have a matrix of count data similar to that you wish to
    simulate there are two simple steps to creating a simulated data set with
    Splatter. Here is an example using the example dataset in the `scater` package:
    
    ```{r quickstart}
    # Load package
    library(splatter)
    
    # Load example data
    data("sc_example_counts")
    # Estimate parameters from example data
    params <- splatEstimate(sc_example_counts)
    # Simulate data using estimated parameters
    sim <- splatSimulate(params, dropout.present = FALSE)
    ```
    
    These steps will be explained in detail in the following sections but briefly
    the first step takes a dataset and estimates simulation parameters from it and
    the second step takes those parameters and simulates a new dataset.
    
    # The Splat simulation
    
    Before we look at how we estimate parameters let's first look at how Splatter
    simulates data and what those parameters are. We use the term 'Splat' to refer
    to the Splatter's own simulation and differentiate it from the package itself.
    The core of the Splat model is a gamma-Poisson distribution used to generate a
    gene by cell matrix of counts. Mean expression levels for each gene are
    simulated from a [gamma distribution][gamma] and the Biological Coefficient of
    Variation is used to enforce a mean-variance trend before counts are simulated
    from a [Poisson distribution][poisson]. Splat also allows you to simulate
    expression outlier genes (genes with mean expression outside the gamma
    distribution) and dropout (random knock out of counts based on mean expression).
    Each cell is given an expected library size (simulated from a log-normal
    distribution) that makes it easier to match to a given dataset.
    
    Splat can also simulate differential expression between groups of different
    types of cells or differentiation paths between different cells types where
    expression changes in a continuous way. These are described further in the
    [simulating counts] section.
    
    ## Parameters
    
    The parameters required for the Splat simulation are briefly described here:
    
    * **Global parameters**
        * `nGenes` - The number of genes to simulate.
        * `nCells` - The number of cells to simulate.
        * `seed` - Seed to use for generating random numbers.
    
    * **Batch parameters**
        * `nBatches` - The number of batches to simulate.
        * `batchCells` - The number of cells in each batch.
        * `batch.facLoc` - Location (meanlog) parameter for the batch effects factor
          log-normal distribution.
        * `batch.facScale` - Scale (sdlog) parameter for the batch effects factor
          log-normal distribution.
    
    * **Mean parameters**
        * `mean.shape` - Shape parameter for the mean gamma distribution.
        * `mean.rate` - Rate parameter for the mean gamma distribution.
    * **Library size parameters**
        * `lib.loc` - Location (meanlog) parameter for the library size log-normal
          distribution.
        * `lib.scale` - Scale (sdlog) parameter for the library size log-normal
          distribution.
    * **Expression outlier parameters**
        * `out.prob` - Probability that a gene is an expression outlier.
        * `out.facLoc` - Location (meanlog) parameter for the expression outlier
          factor log-normal distribution.
        * `out.facScale` - Scale (sdlog) parameter for the expression outlier factor
          log-normal distribution.
    
    * **Group parameters**
        * `nGroups` - The number of groups or paths to simulate.
        * `group.prob` - The probabilities that cells come from particular groups.
    
    * **Differential expression parameters**
        * `de.prob` - Probability that a gene is differentially expressed in each
          group or path.
        * `de.loProb` - Probability that a differentially expressed gene is
          down-regulated.
        * `de.facLoc` - Location (meanlog) parameter for the differential expression
          factor log-normal distribution.
        * `de.facScale` - Scale (sdlog) parameter for the differential expression
          factor log-normal distribution.
    * **Biological Coefficient of Variation parameters**
        * `bcv.common` - Underlying common dispersion across all genes.
        * `bcv.df` - Degrees of Freedom for the BCV inverse chi-squared
          distribution.
    * **Dropout parameters**
        * `dropout.present` - Logical. Whether to simulate dropout.
        * `dropout.mid` - Midpoint parameter for the dropout logistic function.
        * `dropout.shape` - Shape parameter for the dropout logistic function.
    * **Differentiation path parameters**
        * `path.from` - Vector giving the originating point of each path.
        * `path.length` - Vector giving the number of steps to simulate along each
          path.
        * `path.skew` - Vector giving the skew of each path.
        * `path.nonlinearProb` - Probability that a gene changes expression in a
          non-linear way along the differentiation path.
        * `path.sigmaFac` - Sigma factor for non-linear gene paths.
    
    While this may look like a lot of parameters Splatter attempts to make it easy
    for the user, both by providing sensible defaults and making it easy to estimate
    many of the parameters from real data. For more details on the parameters see
    `?SplatParams`.
    
    # The `SplatParams` object
    
    All the parameters for the Splat simulation are stored in a `SplatParams`
    object. Let's create a new one and see what it looks like.
    
    ```{r SplatParams}
    params <- newSplatParams()
    params
    ```
    
    As well as telling us what type of object we have ("A `Params` object of class
    `SplatParams`") and showing us the values of the parameter this output gives us
    some extra information. We can see which parameters can be estimated by the
    `splatEstimate` function (those in parentheses), which can't be estimated
    (those in brackets) and which have been changed from their default values (those
    in ALL CAPS). 
    
    ## Getting and setting
    
    If we want to look at a particular parameter, for example the number of genes to
    simulate, we can extract it using the `getParam` function:
    
    ```{r getParam}
    getParam(params, "nGenes")
    ```
    
    Alternatively, to give a parameter a new value we can use the `setParam`
    function:
    
    ```{r setParam}
    params <- setParam(params, "nGenes", 5000)
    getParam(params, "nGenes")
    ```
    
    If we want to extract multiple parameters (as a list) or set multiple parameters
    we can use the `getParams` or `setParams` functions:
    
    ```{r getParams-setParams}
    # Set multiple parameters at once (using a list)
    params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
    # Extract multiple parameters as a list
    getParams(params, c("nGenes", "mean.rate", "mean.shape"))
    # Set multiple parameters at once (using additional arguments)
    params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
    params
    ```
    
    The parameters with have changed are now shown in ALL CAPS to indicate that they
    been changed form the default.
    
    We can also set parameters directly when we call `newSplatParams`:
    
    ```{r newSplatParams-set}
    params <- newSplatParams(lib.loc = 12, lib.scale = 0.6)
    getParams(params, c("lib.loc", "lib.scale"))
    ```
    
    # Estimating parameters
    
    Splat allows you to estimate many of it's parameters from a data set containing
    counts using the `splatEstimate` function.
    
    ```{r splatEstimate}
    # Check that sc_example counts is an integer matrix
    class(sc_example_counts)
    typeof(sc_example_counts)
    # Check the dimensions, each row is a gene, each column is a cell
    dim(sc_example_counts)
    # Show the first few entries
    sc_example_counts[1:5, 1:5]
    
    params <- splatEstimate(sc_example_counts)
    ```
    
    Here we estimated parameters from a counts matrix but `splatEstimate` can also
    
    Luke Zappia's avatar
    Luke Zappia committed
    take a `SingleCellExperiment` object. The estimation process has the following
    steps:
    
    
    1. Mean parameters are estimated by fitting a gamma distribution to the mean
       expression levels.
    2. Library size parameters are estimated by fitting a log-normal distribution to
       the library sizes.
    3. Expression outlier parameters are estimated by determining the number of
       outliers and fitting a log-normal distribution to their difference from the
       median.
    4. BCV parameters are estimated using the `estimateDisp` function from the
       `edgeR` package.
    5. Dropout parameters are estimated by checking if dropout is present and
       fitting a logistic function to the relationship between mean expression and
       proportion of zeros.
    
    For more details of the estimation procedures see `?splatEstimate`.
    
    # Simulating counts
    
    Once we have a set of parameters we are happy with we can use `splatSimulate`
    to simulate counts. If we want to make small adjustments to the parameters we
    can provide them as additional arguments, alternatively if we don't supply any
    parameters the defaults will be used:
    
    ```{r splatSimulate}
    sim <- splatSimulate(params, nGenes = 1000, dropout.present = FALSE)
    sim
    ```
    
    
    Luke Zappia's avatar
    Luke Zappia committed
    Looking at the output of `splatSimulate` we can see that `sim` is
    `SingleCellExperiment` object with `r nrow(sim)` features (genes) and
    `r ncol(sim)` samples (cells). The main part of this object is a features
    by samples matrix containing the simulated counts (accessed using `counts`),
    although it can also hold other expression measures such as FPKM or TPM.
    Additionaly a `SingleCellExperiment` contains phenotype information about
    each cell (accessed using `colData`) and feature information about each gene
    (accessed using `rowData`). Splatter uses these slots, as well as `assays`, to
    store information about the intermediate values of the simulation.
    
    ```{r SCE}
    
    # Access the counts
    counts(sim)[1:5, 1:5]
    # Information about genes
    
    Luke Zappia's avatar
    Luke Zappia committed
    head(rowData(sim))
    
    Luke Zappia's avatar
    Luke Zappia committed
    head(colData(sim))
    
    Luke Zappia's avatar
    Luke Zappia committed
    names(assays(sim))
    
    Luke Zappia's avatar
    Luke Zappia committed
    assays(sim)$CellMeans[1:5, 1:5]
    
    Luke Zappia's avatar
    Luke Zappia committed
    An additional (big) advantage of outputting a `SingleCellExperiment` is that we
    get immediate access to other analysis packages, such as the plotting functions
    in `scater`. For example we can make a PCA plot:
    
    Luke Zappia's avatar
    Luke Zappia committed
    # Use scater to calculate logcounts
    sim <- normalise(sim)
    # Plot PCA
    plotPCA(sim)
    
    ```
    
    (**NOTE:** Your values and plots may look different as the simulation is random
    and produces different results each time it is run.)
    
    
    Luke Zappia's avatar
    Luke Zappia committed
    For more details about the `SingleCellExperiment` object refer to the [vignette]
    [SCE-vignette]. For information about what you can do with `scater` refer to the
    
    `scater` documentation and [vignette][scater-vignette].
    
    The `splatSimulate` function outputs the following additional information about
    the simulation:
    
    * **Cell information (`pData`)**
        * `Cell` - Unique cell identifier.
        * `Group` - The group or path the cell belongs to.
        * `ExpLibSize` - The expected library size for that cell.
        * `Step` (paths only) - How far along the path each cell is.
    * **Gene information (`fData`)**
        * `Gene` - Unique gene identifier.
        * `BaseGeneMean` - The base expression level for that gene.
        * `OutlierFactor` - Expression outlier factor for that gene (1 is not an
          outlier).
        * `GeneMean` - Expression level after applying outlier factors.
        * `DEFac[Group]` - The differential expression factor for each gene
          in a particular group (1 is not differentially expressed).
        * `GeneMean[Group]` - Expression level of a gene in a particular group after
          applying differential expression factors.
    * **Gene by cell information (`assayData`)**
        * `BaseCellMeans` - The expression of genes in each cell adjusted for
          expected library size.
        * `BCV` - The Biological Coefficient of Variation for each gene in
          each cell.
        * `CellMeans` - The expression level of genes in each cell adjusted
          for BCV.
        * `TrueCounts` - The simulated counts before dropout.
        * `Dropout` - Logical matrix showing which counts have been dropped in which
          cells.
    
    Values that have been added by Splatter are named using `UpperCamelCase` to
    
    Luke Zappia's avatar
    Luke Zappia committed
    separate them from the `underscore_naming` used by `scater` and other packages.
    For more information on the simulation see `?splatSimulate`.
    
    
    ## Simulating groups
    
    So far we have only simulated a single population of cells but often we are
    interested in investigating a mixed population of cells and looking to see what
    cell types are present or what differences there are between them. Splatter is
    able to simulate these situations by changing the `method` argument Here we are
    
    going to simulate two groups, by specifying the `group.prob` parameter and 
    setting the `method` parameter to `"groups"`:
    
    
    (**NOTE:** We have also set the `verbose` argument to `FALSE` to stop Splatter
    printing progress messages.)
    
    ```{r groups}
    
    sim.groups <- splatSimulate(group.prob = c(0.5, 0.5), method = "groups",
    
    Luke Zappia's avatar
    Luke Zappia committed
    sim.groups <- normalise(sim.groups)
    plotPCA(sim.groups, colour_by = "Group")
    
    As we have set both the group probabilites to 0.5 we should get approximately
    equal numbers of cells in each group (around 50 in this case). If we wanted
    uneven groups we could set `group.prob` to any set of probabilites that sum to
    1.
    
    
    ## Simulating paths
    
    The other situation that is often of interest is a differentiation process where
    one cell type is changing into another. Splatter approximates this process by
    simulating a series of steps between two groups and randomly assigning each
    cell to a step. We can create this kind of simulation using the `"paths"`
    method. 
    
    ```{r paths}
    sim.paths <- splatSimulate(method = "paths", verbose = FALSE)
    
    Luke Zappia's avatar
    Luke Zappia committed
    sim.paths <- normalise(sim.paths)
    plotPCA(sim.paths, colour_by = "Step")
    
    ```
    
    Here the colours represent the "step" of each cell or how far along the 
    differentiation path it is. We can see that the cells with dark colours are more
    similar to the originating cell type and the light coloured cells are closer
    to the final, differentiated, cell type. By setting additional parameters it is
    possible to simulate more complex process (for example multiple mature cell
    types from a single progenitor).
    
    
    ## Batch effects
    
    Another factor that is important in the analysis of any sequencing experiment
    are batch effects, technical variation that is common to a set of samples
    processed at the same time. We apply batch effects by telling Splatter how many
    cells are in each batch:
    
    ```{r batches}
    sim.batches <- splatSimulate(batchCells = c(50, 50), verbose = FALSE)
    
    Luke Zappia's avatar
    Luke Zappia committed
    sim.batches <- normalise(sim.batches)
    plotPCA(sim.batches, colour_by = "Batch")
    
    ```
    
    This looks at lot like when we simulated groups and that is because the process
    is very similar. The difference is that batch effects are applied to all genes,
    not just those that are differentially expressed, and the effects are usually
    smaller. By combining groups and batches we can simulate both unwanted variation
    that we aren't interested in (batch) and the wanted variation we are looking for
    (group):
    
    ```{r batch-groups}
    sim.groups <- splatSimulate(batchCells = c(50, 50), group.prob = c(0.5, 0.5),
                                method = "groups", verbose = FALSE)
    
    Luke Zappia's avatar
    Luke Zappia committed
    sim.groups <- normalise(sim.groups)
    
    Luke Zappia's avatar
    Luke Zappia committed
    plotPCA(sim.groups, shape_by = "Batch", colour_by = "Group",
            exprs_values = "counts")
    
    ```
    
    Here we see that the effects of the group (first component) are stronger than
    the batch effects (second component) but by adjusting the parameters we could
    made the batch effects dominate.
    
    
    ## Convenience functions
    
    Each of the Splatter simulation methods has it's own convenience function.
    To simulate a single population use `splatSimulateSingle()` (equivalent to
    `splatSimulate(method = "single")`), to simulate grops use
    `splatSimulateGroups()` (equivalent to `splatSimulate(method = "groups")`) or to
    simulate paths use `splatSimulatePaths()` (equivalent to
    `splatSimulate(method = "paths")`).
    
    # Other simulations
    
    
    As well as it's own Splat simulation method the Splatter package contains
    
    implementations of other single-cell RNA-seq simulations that have been
    published or wrappers around simulations included in other packages. To see all
    the available simulations run the `listSims()` function:
    
    ```{r listSims}
    listSims()
    ```
    
    (or more conveniently for the vignette as a table)
    
    ```{r listSims-table}
    knitr::kable(listSims(print = FALSE))
    ```
    
    Each simulation has it's own prefix which gives the name of the functions
    associated with that simulation. For example the prefix for the simple
    simulation is `simple` so it would store it's parameters in a `SimpleParams`
    object that can be created using `newSimpleParams()` or estimated from real
    data using `simpleEstimate()`. To simulate data using that simulation you
    
    Luke Zappia's avatar
    Luke Zappia committed
    would use `simpleSimulate()`. Each simulation returns a `SingleCellExperiment`
    object with intermediate values similar to that returned by `splatSimulate()`.
    For more detailed information on each simulation see the appropriate help page
    (eg. `?simpleSimulate` for information on how the simple simulation works or `?
    lun2Estimate` for details of how the Lun 2 simulation estimates parameters) or
    refer to the appropriate paper or package.
    
    # Other expression values
    
    Splatter is designed to simulate count data but some analysis methods expect
    
    Luke Zappia's avatar
    Luke Zappia committed
    other expression values, particularly length-normalised values such as TPM
    or FPKM. The `scater` package has functions for adding these values to a
    `SingleCellExperiment` object but they require a length for each gene. The
    `addGeneLengths` function can be used to simulate these lengths:
    
    
    ```{r lengths}
    sim <- simpleSimulate(verbose = FALSE)
    sim <- addGeneLengths(sim)
    
    Luke Zappia's avatar
    Luke Zappia committed
    head(rowData(sim))
    
    ```
    
    We can then use `scater` to calculate TPM:
    
    ```{r TPM}
    
    Luke Zappia's avatar
    Luke Zappia committed
    tpm(sim) <- calculateTPM(sim, rowData(sim)$Length)
    
    tpm(sim)[1:5, 1:5]
    ```
    
    The default method used by `addGeneLengths` to simulate lengths is to generate
    values from a log-normal distribution which are then rounded to give an integer
    
    length. The parameters for this distribution are based on human protein coding 
    genes but can be adjusted if needed (for example for other species). 
    Alternatively lengths can be sampled from a provided vector (see 
    `?addGeneLengths` for details and an example).
    
    # Comparing simulations and real data
    
    One thing you might like to do after simulating data is to compare it to a real
    dataset, or compare simulations with different parameters or models. Splatter
    
    Luke Zappia's avatar
    Luke Zappia committed
    provides a function `compareSCEs` that aims to make these comparisons easier. As
    the name suggests this function takes a list of `SingleCellExperiment` objects,
    combines the datasets and produces some plots comparing them. Let's make two
    small simulations and see how they compare.
    
    sim1 <- splatSimulate(nGenes = 1000, batchCells = 20, verbose = FALSE)
    
    sim2 <- simpleSimulate(nGenes = 1000, nCells = 20, verbose = FALSE)
    
    Luke Zappia's avatar
    Luke Zappia committed
    comparison <- compareSCEs(list(Splat = sim1, Simple = sim2))
    
    
    names(comparison)
    names(comparison$Plots)
    ```
    
    The returned list has three items. The first two are the combined datasets by
    gene (`FeatureData`) and by cell (`PhenoData`) and the third contains some
    comparison plots (produced using `ggplot2`), for example a plot of the
    distribution of means:
    
    ```{r comparison-means}
    comparison$Plots$Means
    ```
    
    These are only a few of the plots you might want to consider but it should be
    easy to make more using the returned data. For example, we could plot the
    number of expressed genes against the library size:
    
    ```{r comparison-libsize-features}
    library("ggplot2")
    ggplot(comparison$PhenoData,
           aes(x = total_counts, y = total_features, colour = Dataset)) +
        geom_point()
    ```
    
    
    ## Comparing differences
    
    
    Luke Zappia's avatar
    Luke Zappia committed
    Sometimes instead of visually comparing datasets it may be more interesting
    to look at the differences between them. We can do this using the
    `diffSCEs` function. Similar to `compareSCEs` this function takes a list of
    `SingleCellExperiment` objects but now we also specify one to be a reference.
    A series of similar plots are returned but instead of showing the overall
    distributions they demonstrate differences from the reference.
    
    
    ```{r difference}
    
    Luke Zappia's avatar
    Luke Zappia committed
    difference <- diffSCEs(list(Splat = sim1, Simple = sim2), ref = "Simple")
    
    difference$Plots$Means
    ```
    
    We also get a series of Quantile-Quantile plot that can be used to compare
    distributions.
    
    ```{r difference-qq}
    difference$QQPlots$Means
    ```
    
    
    ## Making panels
    
    Each of these comparisons makes several plots which can be a lot to look at. To
    make this easier, or to produce figures for publications, you can make use of
    the functions `makeCompPanel`, `makeDiffPanel` and `makeOverallPanel`.
    
    These functions combine the plots into a single panel using the `cowplot`
    package. The panels can be quite large and hard to view (for example in
    RStudio's plot viewer) so it can be better to output the panels and view them
    separately. Luckily `cowplot` provides a convenient function for saving the
    images. Here are some suggested parameters for outputting each of the panels:
    
    ```{r save-panels, eval = FALSE}
    # This code is just an example and is not run
    panel <- makeCompPanel(comparison)
    cowplot::save_plot("comp_panel.png", panel, nrow = 4, ncol = 3)
    
    panel <- makeDiffPanel(difference)
    cowplot::save_plot("diff_panel.png", panel, nrow = 3, ncol = 5)
    
    panel <- makeOverallPanel(comparison, difference)
    cowplot::save_plot("overall_panel.png", panel, ncol = 4, nrow = 7)
    ```
    
    
    Luke Zappia's avatar
    Luke Zappia committed
    # Citing Splatter
    
    If you use Splatter in your work please cite our paper:
    
    ```{r citation}
    citation("splatter")
    ```
    
    
    # Session information {-}
    
    ```{r sessionInfo}
    sessionInfo()
    ```
    
    [gamma]: https://en.wikipedia.org/wiki/Gamma_distribution
    [poisson]: https://en.wikipedia.org/wiki/Poisson_distribution
    
    [scater-vignette]: https://bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/vignette.html
    
    Luke Zappia's avatar
    Luke Zappia committed
    [SCE-vignette]: https://bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html