Skip to content
Snippets Groups Projects
splatter.Rmd 19.7 KiB
Newer Older
Luke Zappia's avatar
Luke Zappia committed
---
title: "Introduction to Splatter"
author: "Luke Zappia"
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document:
        toc: true
vignette: >
  %\VignetteIndexEntry{An introduction to the Splatter package}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r knitr-options, echo = FALSE, message = FALSE, warning = FALSE}
# To render an HTML version that works nicely with github and web pages, do:
# rmarkdown::render("vignettes/splatter.Rmd", "all")
knitr::opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5,
                      dev = 'png')
```

![Splatter logo](splatter-logo-small.png)
Luke Zappia's avatar
Luke Zappia committed

Luke Zappia's avatar
Luke Zappia committed
Welcome to Splatter! Splatter is an R package for the simple simulation of
Luke Zappia's avatar
Luke Zappia committed
single-cell RNA sequencing data. This vignette gives an overview and
introduction to Splatter's functionality.
Luke Zappia's avatar
Luke Zappia committed

# Installation

Splatter can be installed from Bioconductor:

```{r install, eval = FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite("splatter")
```

To install the most recent development version from Github use:

```{r install-github, eval = FALSE}
biocLite("Oshlack/splatter", dependencies = TRUE, 
         build_vignettes = TRUE)
```

Luke Zappia's avatar
Luke Zappia committed
# Quickstart

Assuming you already have a matrix of count data similar to that you wish to
simulate there are two simple steps to creating a simulated data set with
Splatter. Here is an example using the example dataset in the `scater` package:

```{r quickstart}
Luke Zappia's avatar
Luke Zappia committed
# Load package
Luke Zappia's avatar
Luke Zappia committed
library(splatter)

Luke Zappia's avatar
Luke Zappia committed
# Load example data
Luke Zappia's avatar
Luke Zappia committed
data("sc_example_counts")
Luke Zappia's avatar
Luke Zappia committed
# Estimate parameters from example data
Luke Zappia's avatar
Luke Zappia committed
params <- splatEstimate(sc_example_counts)
Luke Zappia's avatar
Luke Zappia committed
# Simulate data using estimated parameters
Luke Zappia's avatar
Luke Zappia committed
sim <- splatSimulate(params, dropout.present = FALSE)
```

These steps will be explained in detail in the following sections but briefly
the first step takes a dataset and estimates simulation parameters from it and
the second step takes those parameters and simulates a new dataset.

Luke Zappia's avatar
Luke Zappia committed
# The Splat simulation
Luke Zappia's avatar
Luke Zappia committed

Before we look at how we estimate parameters let's first look at how Splatter
Luke Zappia's avatar
Luke Zappia committed
simulates data and what those parameters are. We use the term 'Splat' to refer
to the Splatter's own simulation and differentiate it from the package itself.
The core of the Splat model is a gamma-Poisson distribution used to generate a
gene by cell matrix of counts. Mean expression levels for each gene are
simulated from a [gamma distribution][gamma] and the Biological Coefficient of
Variation is used to enforce a mean-variance trend before counts are simulated
from a [Poisson distribution][poisson]. Splat also allows you to simulate
Luke Zappia's avatar
Luke Zappia committed
expression outlier genes (genes with mean expression outside the gamma
distribution) and dropout (random knock out of counts based on mean expression).
Each cell is given an expected library size (simulated from a log-normal
distribution) that makes it easier to match to a given dataset.

Luke Zappia's avatar
Luke Zappia committed
Splat can also simulate differential expression between groups of different
Luke Zappia's avatar
Luke Zappia committed
types of cells or differentiation paths between different cells types where
expression changes in a continuous way. These are described further in the
[simulating counts] section.
Luke Zappia's avatar
Luke Zappia committed

## Parameters

Luke Zappia's avatar
Luke Zappia committed
The parameters required for the Splat simulation are briefly described here:
Luke Zappia's avatar
Luke Zappia committed

* **Global parameters**
    * `nGenes` - The number of genes to simulate.
    * `nCells` - The number of cells to simulate.
    * `nGroups` - The number of groups or paths to simulate.
    * `groupCells` - The number of cells in each group/path.
    * `seed` - Seed to use for generating random numbers.
* **Mean parameters**
    * `mean.shape` - Shape parameter for the mean gamma distribution.
    * `mean.rate` - Rate parameter for the mean gamma distribution.
* **Library size parameters**
    * `lib.loc` - Location (meanlog) parameter for the library size log-normal
      distribution.
    * `lib.scale` - Scale (sdlog) parameter for the library size log-normal
      distribution.
* **Expression outlier parameters**
    * `out.prob` - Probability that a gene is an expression outlier.
    * `out.facLoc` - Location (meanlog) parameter for the expression outlier
      factor log-normal distribution.
    * `out.facScale` - Scale (sdlog) parameter for the expression outlier factor
      log-normal distribution.
* **Differential expression parameters**
    * `de.prob` - Probability that a gene is differentially expressed in each
      group or path.
    * `de.loProb` - Probability that a differentially expressed gene is
      down-regulated.
    * `de.facLoc` - Location (meanlog) parameter for the differential expression
      factor log-normal distribution.
    * `de.facScale` - Scale (sdlog) parameter for the differential expression
      factor log-normal distribution.
* **Biological Coefficient of Variation parameters**
    * `bcv.common` - Underlying common dispersion across all genes.
    * `bcv.df` - Degrees of Freedom for the BCV inverse chi-squared
      distribution.
* **Dropout parameters**
    * `dropout.present` - Logical. Whether to simulate dropout.
    * `dropout.mid` - Midpoint parameter for the dropout logistic function.
    * `dropout.shape` - Shape parameter for the dropout logistic function.
* **Differentiation path parameters**
    * `path.from` - Vector giving the originating point of each path.
    * `path.length` - Vector giving the number of steps to simulate along each
      path.
    * `path.skew` - Vector giving the skew of each path.
    * `path.nonlinearProb` - Probability that a gene changes expression in a
      non-linear way along the differentiation path.
    * `path.sigmaFac` - Sigma factor for non-linear gene paths.

While this may look like a lot of parameters Splatter attempts to make it easy
for the user, both by providing sensible defaults and making it easy to estimate
many of the parameters from real data. For more details on the parameters see
`?SplatParams`.

# The `SplatParams` object

Luke Zappia's avatar
Luke Zappia committed
All the parameters for the Splat simulation are stored in a `SplatParams`
Luke Zappia's avatar
Luke Zappia committed
object. Let's create a new one and see what it looks like.

```{r SplatParams}
params <- newSplatParams()
params
```

As well as telling us what type of object we have ("A `Params` object of class
`SplatParams`") and showing us the values of the parameter this output gives us
some extra information. We can see which parameters can be estimated by the
`splatEstimate` function (those in parentheses), which can't be estimated
(those in brackets) and which have been changed from their default values (those
in ALL CAPS). 

## Getting and setting

If we want to look at a particular parameter, for example the number of genes to
simulate, we can extract it using the `getParam` function:

```{r getParam}
getParam(params, "nGenes")
```

Luke Zappia's avatar
Luke Zappia committed
Alternatively, to give a parameter a new value we can use the `setParam`
Luke Zappia's avatar
Luke Zappia committed
function:

```{r setParam}
params <- setParam(params, "nGenes", 5000)
Luke Zappia's avatar
Luke Zappia committed
getParam(params, "nGenes")
Luke Zappia's avatar
Luke Zappia committed
```

If we want to extract multiple parameters (as a list) or set multiple parameters
we can use the `getParams` or `setParams` functions:

```{r getParams-setParams}
# Set multiple parameters at once (using a list)
params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
Luke Zappia's avatar
Luke Zappia committed
# Extract multiple parameters as a list
getParams(params, c("nGenes", "mean.rate", "mean.shape"))
Luke Zappia's avatar
Luke Zappia committed
# Set multiple parameters at once (using additional arguments)
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
```

Luke Zappia's avatar
Luke Zappia committed
The parameters with have changed are now shown in ALL CAPS to indicate that they
been changed form the default.

Luke Zappia's avatar
Luke Zappia committed
We can also set parameters directly when we call `newSplatParams`:

```{r newSplatParams-set}
params <- newSplatParams(lib.loc = 12, lib.scale = 0.6)
Luke Zappia's avatar
Luke Zappia committed
getParams(params, c("lib.loc", "lib.scale"))
Luke Zappia's avatar
Luke Zappia committed
```

# Estimating parameters

Luke Zappia's avatar
Luke Zappia committed
Splat allows you to estimate many of it's parameters from a data set containing
counts using the `splatEstimate` function.
Luke Zappia's avatar
Luke Zappia committed

```{r splatEstimate}
# Check that sc_example counts is an integer matrix
class(sc_example_counts)
typeof(sc_example_counts)
# Check the dimensions, each row is a gene, each column is a cell
dim(sc_example_counts)
# Show the first few entries
sc_example_counts[1:5, 1:5]

params <- splatEstimate(sc_example_counts)
```

Here we estimated parameters from a counts matrix but `splatEstimate` can also
take an `SCESet` object from the `scater` package. The estimation process has
the following steps:

1. Mean parameters are estimated by fitting a gamma distribution to the mean
   expression levels.
2. Library size parameters are estimated by fitting a log-normal distribution to
   the library sizes.
3. Expression outlier parameters are estimated by determining the number of
   outliers and fitting a log-normal distribution to their difference from the
   median.
4. BCV parameters are estimated using the `estimateDisp` function from the
   `edgeR` package.
Luke Zappia's avatar
Luke Zappia committed
5. Dropout parameters are estimated by checking if dropout is present and
   fitting a logistic function to the relationship between mean expression and
Luke Zappia's avatar
Luke Zappia committed
   proportion of zeros.
   
For more details of the estimation procedures see `?splatEstimate`.

Luke Zappia's avatar
Luke Zappia committed
# Simulating counts

Luke Zappia's avatar
Luke Zappia committed
Once we have a set of parameters we are happy with we can use `splatSimulate`
to simulate counts. If we want to make small adjustments to the parameters we
can provide them as additional arguments, alternatively if we don't supply any
parameters the defaults will be used:

```{r splatSimulate}
sim <- splatSimulate(params, nGenes = 1000, dropout.present = FALSE)
sim
```

Looking at the output of `splatSimulate` we can see that `sim` is an `SCESet`
object with `r nrow(sim)` features (genes) and `r ncol(sim)` samples
(cells). The main part of this object is a features by samples matrix containing
the simulated counts (accessed using `counts`), although it can also hold other
expression measures such as FPKM or TPM. Additionaly an `SCESet` contains
phenotype information about each cell (accessed using `pData`) and feature
Luke Zappia's avatar
Luke Zappia committed
information about each gene (accessed using `fData`). Splatter uses these slots
to store information about the intermediate values of the simulation.

```{r SCESet}
# Access the counts
counts(sim)[1:5, 1:5]
Luke Zappia's avatar
Luke Zappia committed
# Information about genes
head(fData(sim))
# Information about cells
head(pData(sim))
# Gene by cell matrices
names(assayData(sim))
# Example of cell means matrix
get_exprs(sim, "CellMeans")[1:5, 1:5]
Luke Zappia's avatar
Luke Zappia committed
```

An additional (big) advantage of outputting an `SCESet` is that we get immediate
access to all of the `scater` functions. For example we can make a PCA plot:

```{r pca}
plotPCA(sim)
```

Luke Zappia's avatar
Luke Zappia committed
(**NOTE:** Your values and plots may look different as the simulation is random
and produces different results each time it is run.)

Luke Zappia's avatar
Luke Zappia committed
For more details of the `SCESet` and what you can do with `scater` refer to the
`scater` documentation and [vignette][scater-vignette].

The `splatSimulate` function outputs the following additional information about
the simulation:

* **Cell information (`pData`)**
    * `Cell` - Unique cell identifier.
    * `Group` - The group or path the cell belongs to.
Luke Zappia's avatar
Luke Zappia committed
    * `ExpLibSize` - The expected library size for that cell.
Luke Zappia's avatar
Luke Zappia committed
    * `Step` (paths only) - How far along the path each cell is.
* **Gene information (`fData`)**
    * `Gene` - Unique gene identifier.
    * `BaseGeneMean` - The base expression level for that gene.
Luke Zappia's avatar
Luke Zappia committed
    * `OutlierFactor` - Expression outlier factor for that gene (1 is not an
      outlier).
Luke Zappia's avatar
Luke Zappia committed
    * `GeneMean` - Expression level after applying outlier factors.
    * `DEFac[Group]` - The differential expression factor for each gene
Luke Zappia's avatar
Luke Zappia committed
      in a particular group (1 is not differentially expressed).
Luke Zappia's avatar
Luke Zappia committed
    * `GeneMean[Group]` - Expression level of a gene in a particular group after
      applying differential expression factors.
Luke Zappia's avatar
Luke Zappia committed
* **Gene by cell information (`assayData`)**
Luke Zappia's avatar
Luke Zappia committed
    * `BaseCellMeans` - The expression of genes in each cell adjusted for
      expected library size.
    * `BCV` - The Biological Coefficient of Variation for each gene in
      each cell.
    * `CellMeans` - The expression level of genes in each cell adjusted
      for BCV.
    * `TrueCounts` - The simulated counts before dropout.
    * `Dropout` - Logical matrix showing which counts have been dropped in which
      cells.

Values that have been added by Splatter are named using `UpperCamelCase` to
separate them from the `underscore_naming` used by `scater`. For more
information on the simulation see `?splatSimulate`.
Luke Zappia's avatar
Luke Zappia committed

## Simulating groups

Luke Zappia's avatar
Luke Zappia committed
So far we have only simulated a single population of cells but often we are
Luke Zappia's avatar
Luke Zappia committed
interested in investigating a mixed population of cells and looking to see what
cell types are present or what differences there are between them. Splatter is
able to simulate these situations by changing the `method` argument Here we are
going to simulate two groups, each with 50 cells, by specifying the `groupCells`
parameter and setting the `method` parameter to `"groups"`:

(**NOTE:** We have also set the `verbose` argument to `FALSE` to stop Splatter
printing progress messages.)
Luke Zappia's avatar
Luke Zappia committed

```{r groups}
sim.groups <- splatSimulate(groupCells = c(100, 100), method = "groups",
                            verbose = FALSE)
plotPCA(sim.groups, colour_by = "Group")
```

Luke Zappia's avatar
Luke Zappia committed
## Simulating paths

Luke Zappia's avatar
Luke Zappia committed
The other situation that is often of interest is a differentiation process where
Luke Zappia's avatar
Luke Zappia committed
one cell type is changing into another. Splatter approximates this process by
simulating a series of steps between two groups and randomly assigning each
cell to a step. We can create this kind of simulation using the `"paths"`
method. 
Luke Zappia's avatar
Luke Zappia committed

```{r paths}
sim.paths <- splatSimulate(method = "paths", verbose = FALSE)
plotPCA(sim.paths, colour_by = "Step")
```

Luke Zappia's avatar
Luke Zappia committed
Here the colours represent the "step" of each cell or how far along the 
differentiation path it is. We can see that the cells with dark colours are more
similar to the originating cell type and the light coloured cells are closer
to the final, differentiated, cell type. By setting additional parameters it is
possible to simulate more complex process (for example multiple mature cell
types from a single progenitor).

Luke Zappia's avatar
Luke Zappia committed
## Convenience functions

Luke Zappia's avatar
Luke Zappia committed
Each of the Splatter simulation methods has it's own convenience function.
To simulate a single population use `splatSimulateSingle()` (equivalent to
`splatSimulate(method = "single")`), to simulate grops use
`splatSimulateGroups()` (equivalent to `splatSimulate(method = "groups")`) or to
simulate paths use `splatSimulatePaths()` (equivalent to
`splatSimulate(method = "paths")`).

Luke Zappia's avatar
Luke Zappia committed
# Other simulations

Luke Zappia's avatar
Luke Zappia committed
As well as it's own simulation method the Splatter package contains
implementations of other single-cell RNA-seq simulations that have been
published or wrappers around simulations included in other packages. To see all
Luke Zappia's avatar
Luke Zappia committed
the available simulations run the `listSims()` function:
Luke Zappia's avatar
Luke Zappia committed

```{r listSims}
listSims()
```

Luke Zappia's avatar
Luke Zappia committed
(or more conveniently for the vignette as a table)
Luke Zappia's avatar
Luke Zappia committed
```{r listSims-table}
Luke Zappia's avatar
Luke Zappia committed
knitr::kable(listSims(print = FALSE))
```

Each simulation has it's own prefix which gives the name of the functions
associated with that simulation. For example the prefix for the simple
simulation is `simple` so it would store it's parameters in a `SimpleParams`
object that can be created using `newSimpleParams()` or estimated from real
Luke Zappia's avatar
Luke Zappia committed
data using `simpleEstimate()`. To simulate data using that simulation you
Luke Zappia's avatar
Luke Zappia committed
would use `simpleSimulate()`. Each simulation returns an `SCESet` object with
intermediate values similar to that returned by `splatSimulate()`. For more
detailed information on each simulation see the appropriate help page (eg.
`?simpleSimulate` for information on how the simple simulation works or
Luke Zappia's avatar
Luke Zappia committed
`?lun2Estimate` for details of how the Lun 2 simulation estimates
Luke Zappia's avatar
Luke Zappia committed
parameters) or refer to the appropriate paper or package.

# Other expression values

Splatter is designed to simulate count data but some analysis methods expect
other expression values, particularly length-normalised values such as TPM or
FPKM. The `scater` package has functions for adding these values to an `SCESet`
object but they require a length for each gene. The `addGeneLengths` can be
used to simulate these lengths:

```{r lengths}
sim <- simpleSimulate(verbose = FALSE)
sim <- addGeneLengths(sim)
head(fData(sim))
```

We can then use `scater` to calculate TPM:

```{r TPM}
tpm(sim) <- calculateTPM(sim, fData(sim)$Length)
tpm(sim)[1:5, 1:5]
```

The default method used by `addGeneLengths` to simulate lengths is to generate
values from a log-normal distribution which are then rounded to give an integer
length. The parameters for this distribution are based on human coding genes
but can be adjusted if needed (for example for other species). Alternatively
lengths can be sampled from a provided vector (see `?addGeneLengths` for details
and an example).

Luke Zappia's avatar
Luke Zappia committed
# Comparing simulations and real data

One thing you might like to do after simulating data is to compare it to a real
dataset, or compare simulations with different parameters or models. Splatter
provides a function `compareSCESets` that aims to make these comparisons easier.
As the name suggests this function takes a list of `SCESet` objects, combines
the datasets and produces some plots comparing them. Let's make two small
simulations and see how they compare.

```{r comparison}
sim1 <- splatSimulate(nGenes = 1000, groupCells = 20, verbose = FALSE)
sim2 <- simpleSimulate(nGenes = 1000, nCells = 20, verbose = FALSE)
comparison <- compareSCESets(list(Splat = sim1, Simple = sim2))

names(comparison)
names(comparison$Plots)
```

The returned list has three items. The first two are the combined datasets by
gene (`FeatureData`) and by cell (`PhenoData`) and the third contains some
comparison plots (produced using `ggplot2`), for example a plot of the
distribution of means:

```{r comparison-means}
comparison$Plots$Means
```

These are only a few of the plots you might want to consider but it should be
easy to make more using the returned data. For example, we could plot the
number of expressed genes against the library size:

```{r comparison-libsize-features}
library("ggplot2")
ggplot(comparison$PhenoData,
       aes(x = total_counts, y = total_features, colour = Dataset)) +
    geom_point()
```

## Comparing differences

Sometimes instead of visually comparing datasets it may be more interesting to
look at the differences between them. We can do this using the `diffSCESets`
function. Similar to `compareSCESets` this function takes a list of `SCESet`
objects but now we also specify one to be a reference. A series of similar plots
are returned but instead of showing the overall distributions they demonstrate
differences from the reference.

```{r difference}
difference <- diffSCESets(list(Splat = sim1, Simple = sim2), ref = "Simple")
difference$Plots$Means
```

We also get a series of Quantile-Quantile plot that can be used to compare
distributions.

```{r difference-qq}
difference$QQPlots$Means
```

## Making panels

Each of these comparisons makes several plots which can be a lot to look at. To
make this easier, or to produce figures for publications, you can make use of
the functions `makeCompPanel`, `makeDiffPanel` and `makeOverallPanel`.

These functions combine the plots into a single panel using the `cowplot`
package. The panels can be quite large and hard to view (for example in
RStudio's plot viewer) so it can be better to output the panels and view them
separately. Luckily `cowplot` provides a convenient function for saving the
images. Here are some suggested parameters for outputting each of the panels:

```{r save-panels, eval = FALSE}
# This code is just an example and is not run
panel <- makeCompPanel(comparison)
cowplot::save_plot("comp_panel.png", panel, nrow = 4, ncol = 3)

panel <- makeDiffPanel(difference)
cowplot::save_plot("diff_panel.png", panel, nrow = 3, ncol = 5)

panel <- makeOverallPanel(comparison, difference)
cowplot::save_plot("overall_panel.png", panel, ncol = 4, nrow = 7)
```

Luke Zappia's avatar
Luke Zappia committed
# Session information {-}

Luke Zappia's avatar
Luke Zappia committed
```{r sessionInfo}
sessionInfo()
```

Luke Zappia's avatar
Luke Zappia committed
[gamma]: https://en.wikipedia.org/wiki/Gamma_distribution
[poisson]: https://en.wikipedia.org/wiki/Poisson_distribution
[scater-vignette]: https://bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/vignette.html