-
Luke Zappia authoredLuke Zappia authored
title: "Introduction to Splatter"
author: "Luke Zappia"
date: "`r Sys.Date()`"
output:
BiocStyle::html_document:
toc: true
vignette: >
%\VignetteIndexEntry{An introduction to the Splatter package}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
# To render an HTML version that works nicely with github and web pages, do:
# rmarkdown::render("vignettes/splatter.Rmd", "all")
knitr::opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5,
dev = 'png')
Welcome to Splatter! Splatter is an R package for the simple simulation of single-cell RNA sequencing data. This vignette gives an overview and introduction to Splatter's functionality.
Installation
Splatter can be installed from Bioconductor:
source("https://bioconductor.org/biocLite.R")
biocLite("splatter")
To install the most recent development version from Github use:
biocLite("Oshlack/splatter", dependencies = TRUE,
build_vignettes = TRUE)
Quickstart
Assuming you already have a matrix of count data similar to that you wish to
simulate there are two simple steps to creating a simulated data set with
Splatter. Here is an example using the example dataset in the scater
package:
# Load package
library(splatter)
# Load example data
library(scater)
data("sc_example_counts")
# Estimate parameters from example data
params <- splatEstimate(sc_example_counts)
# Simulate data using estimated parameters
sim <- splatSimulate(params, dropout.present = FALSE)
These steps will be explained in detail in the following sections but briefly the first step takes a dataset and estimates simulation parameters from it and the second step takes those parameters and simulates a new dataset.
The Splat simulation
Before we look at how we estimate parameters let's first look at how Splatter simulates data and what those parameters are. We use the term 'Splat' to refer to the Splatter's own simulation and differentiate it from the package itself. The core of the Splat model is a gamma-Poisson distribution used to generate a gene by cell matrix of counts. Mean expression levels for each gene are simulated from a gamma distribution and the Biological Coefficient of Variation is used to enforce a mean-variance trend before counts are simulated from a Poisson distribution. Splat also allows you to simulate expression outlier genes (genes with mean expression outside the gamma distribution) and dropout (random knock out of counts based on mean expression). Each cell is given an expected library size (simulated from a log-normal distribution) that makes it easier to match to a given dataset.
Splat can also simulate differential expression between groups of different types of cells or differentiation paths between different cells types where expression changes in a continuous way. These are described further in the [simulating counts] section.
Parameters
The parameters required for the Splat simulation are briefly described here:
-
Global parameters
-
nGenes
- The number of genes to simulate. -
nCells
- The number of cells to simulate. -
seed
- Seed to use for generating random numbers.
-
-
Batch parameters
-
nBatches
- The number of batches to simulate. -
batchCells
- The number of cells in each batch. -
batch.facLoc
- Location (meanlog) parameter for the batch effects factor log-normal distribution. -
batch.facScale
- Scale (sdlog) parameter for the batch effects factor log-normal distribution.
-
-
Mean parameters
-
mean.shape
- Shape parameter for the mean gamma distribution. -
mean.rate
- Rate parameter for the mean gamma distribution.
-
-
Library size parameters
-
lib.loc
- Location (meanlog) parameter for the library size log-normal distribution. -
lib.scale
- Scale (sdlog) parameter for the library size log-normal distribution.
-
-
Expression outlier parameters
-
out.prob
- Probability that a gene is an expression outlier. -
out.facLoc
- Location (meanlog) parameter for the expression outlier factor log-normal distribution. -
out.facScale
- Scale (sdlog) parameter for the expression outlier factor log-normal distribution.
-
-
Group parameters
-
nGroups
- The number of groups or paths to simulate. -
group.prob
- The probabilities that cells come from particular groups.
-
-
Differential expression parameters
-
de.prob
- Probability that a gene is differentially expressed in each group or path. -
de.loProb
- Probability that a differentially expressed gene is down-regulated. -
de.facLoc
- Location (meanlog) parameter for the differential expression factor log-normal distribution. -
de.facScale
- Scale (sdlog) parameter for the differential expression factor log-normal distribution.
-
-
Biological Coefficient of Variation parameters
-
bcv.common
- Underlying common dispersion across all genes. -
bcv.df
- Degrees of Freedom for the BCV inverse chi-squared distribution.
-
-
Dropout parameters
-
dropout.present
- Logical. Whether to simulate dropout. -
dropout.mid
- Midpoint parameter for the dropout logistic function. -
dropout.shape
- Shape parameter for the dropout logistic function.
-
-
Differentiation path parameters
-
path.from
- Vector giving the originating point of each path. -
path.length
- Vector giving the number of steps to simulate along each path. -
path.skew
- Vector giving the skew of each path. -
path.nonlinearProb
- Probability that a gene changes expression in a non-linear way along the differentiation path. -
path.sigmaFac
- Sigma factor for non-linear gene paths.
-
While this may look like a lot of parameters Splatter attempts to make it easy
for the user, both by providing sensible defaults and making it easy to estimate
many of the parameters from real data. For more details on the parameters see
?SplatParams
.
SplatParams
object
The All the parameters for the Splat simulation are stored in a SplatParams
object. Let's create a new one and see what it looks like.
params <- newSplatParams()
params
As well as telling us what type of object we have ("A Params
object of class
SplatParams
") and showing us the values of the parameter this output gives us
some extra information. We can see which parameters can be estimated by the
splatEstimate
function (those in parentheses), which can't be estimated
(those in brackets) and which have been changed from their default values (those
in ALL CAPS).
Getting and setting
If we want to look at a particular parameter, for example the number of genes to
simulate, we can extract it using the getParam
function:
getParam(params, "nGenes")
Alternatively, to give a parameter a new value we can use the setParam
function:
params <- setParam(params, "nGenes", 5000)
getParam(params, "nGenes")
If we want to extract multiple parameters (as a list) or set multiple parameters
we can use the getParams
or setParams
functions:
# Set multiple parameters at once (using a list)
params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
# Extract multiple parameters as a list
getParams(params, c("nGenes", "mean.rate", "mean.shape"))
# Set multiple parameters at once (using additional arguments)
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
The parameters with have changed are now shown in ALL CAPS to indicate that they been changed form the default.
We can also set parameters directly when we call newSplatParams
:
params <- newSplatParams(lib.loc = 12, lib.scale = 0.6)
getParams(params, c("lib.loc", "lib.scale"))
Estimating parameters
Splat allows you to estimate many of it's parameters from a data set containing
counts using the splatEstimate
function.
# Check that sc_example counts is an integer matrix
class(sc_example_counts)
typeof(sc_example_counts)
# Check the dimensions, each row is a gene, each column is a cell
dim(sc_example_counts)
# Show the first few entries
sc_example_counts[1:5, 1:5]
params <- splatEstimate(sc_example_counts)
Here we estimated parameters from a counts matrix but splatEstimate
can also
take a SingleCellExperiment
object. The estimation process has the following
steps:
- Mean parameters are estimated by fitting a gamma distribution to the mean expression levels.
- Library size parameters are estimated by fitting a log-normal distribution to the library sizes.
- Expression outlier parameters are estimated by determining the number of outliers and fitting a log-normal distribution to their difference from the median.
- BCV parameters are estimated using the
estimateDisp
function from theedgeR
package. - Dropout parameters are estimated by checking if dropout is present and fitting a logistic function to the relationship between mean expression and proportion of zeros.
For more details of the estimation procedures see ?splatEstimate
.
Simulating counts
Once we have a set of parameters we are happy with we can use splatSimulate
to simulate counts. If we want to make small adjustments to the parameters we
can provide them as additional arguments, alternatively if we don't supply any
parameters the defaults will be used:
sim <- splatSimulate(params, nGenes = 1000, dropout.present = FALSE)
sim
Looking at the output of splatSimulate
we can see that sim
is
SingleCellExperiment
object with r nrow(sim)
features (genes) and
r ncol(sim)
samples (cells). The main part of this object is a features
by samples matrix containing the simulated counts (accessed using counts
),
although it can also hold other expression measures such as FPKM or TPM.
Additionaly a SingleCellExperiment
contains phenotype information about
each cell (accessed using colData
) and feature information about each gene
(accessed using rowData
). Splatter uses these slots, as well as assays
, to
store information about the intermediate values of the simulation.
# Access the counts
counts(sim)[1:5, 1:5]
# Information about genes
head(rowData(sim))
# Information about cells
head(colData(sim))
# Gene by cell matrices
names(assays(sim))
# Example of cell means matrix
assays(sim)$CellMeans[1:5, 1:5]
An additional (big) advantage of outputting a SingleCellExperiment
is that we
get immediate access to other analysis packages, such as the plotting functions
in scater
. For example we can make a PCA plot:
# Use scater to calculate logcounts
sim <- normalise(sim)
# Plot PCA
plotPCA(sim)
(NOTE: Your values and plots may look different as the simulation is random and produces different results each time it is run.)
For more details about the SingleCellExperiment
object refer to the [vignette]
SCE-vignette. For information about what you can do with scater
refer to the
scater
documentation and vignette.
The splatSimulate
function outputs the following additional information about
the simulation:
-
Cell information (
pData
)-
Cell
- Unique cell identifier. -
Group
- The group or path the cell belongs to. -
ExpLibSize
- The expected library size for that cell. -
Step
(paths only) - How far along the path each cell is.
-
-
Gene information (
fData
)-
Gene
- Unique gene identifier. -
BaseGeneMean
- The base expression level for that gene. -
OutlierFactor
- Expression outlier factor for that gene (1 is not an outlier). -
GeneMean
- Expression level after applying outlier factors. -
DEFac[Group]
- The differential expression factor for each gene in a particular group (1 is not differentially expressed). -
GeneMean[Group]
- Expression level of a gene in a particular group after applying differential expression factors.
-
-
Gene by cell information (
assayData
)-
BaseCellMeans
- The expression of genes in each cell adjusted for expected library size. -
BCV
- The Biological Coefficient of Variation for each gene in each cell. -
CellMeans
- The expression level of genes in each cell adjusted for BCV. -
TrueCounts
- The simulated counts before dropout. -
Dropout
- Logical matrix showing which counts have been dropped in which cells.
-
Values that have been added by Splatter are named using UpperCamelCase
to
separate them from the underscore_naming
used by scater
and other packages.
For more information on the simulation see ?splatSimulate
.
Simulating groups
So far we have only simulated a single population of cells but often we are
interested in investigating a mixed population of cells and looking to see what
cell types are present or what differences there are between them. Splatter is
able to simulate these situations by changing the method
argument Here we are
going to simulate two groups, by specifying the group.prob
parameter and
setting the method
parameter to "groups"
:
(NOTE: We have also set the verbose
argument to FALSE
to stop Splatter
printing progress messages.)
sim.groups <- splatSimulate(group.prob = c(0.5, 0.5), method = "groups",
verbose = FALSE)
sim.groups <- normalise(sim.groups)
plotPCA(sim.groups, colour_by = "Group")
As we have set both the group probabilites to 0.5 we should get approximately
equal numbers of cells in each group (around 50 in this case). If we wanted
uneven groups we could set group.prob
to any set of probabilites that sum to
1.
Simulating paths
The other situation that is often of interest is a differentiation process where
one cell type is changing into another. Splatter approximates this process by
simulating a series of steps between two groups and randomly assigning each
cell to a step. We can create this kind of simulation using the "paths"
method.
sim.paths <- splatSimulate(method = "paths", verbose = FALSE)
sim.paths <- normalise(sim.paths)
plotPCA(sim.paths, colour_by = "Step")
Here the colours represent the "step" of each cell or how far along the differentiation path it is. We can see that the cells with dark colours are more similar to the originating cell type and the light coloured cells are closer to the final, differentiated, cell type. By setting additional parameters it is possible to simulate more complex process (for example multiple mature cell types from a single progenitor).
Batch effects
Another factor that is important in the analysis of any sequencing experiment are batch effects, technical variation that is common to a set of samples processed at the same time. We apply batch effects by telling Splatter how many cells are in each batch:
sim.batches <- splatSimulate(batchCells = c(50, 50), verbose = FALSE)
sim.batches <- normalise(sim.batches)
plotPCA(sim.batches, colour_by = "Batch")
This looks at lot like when we simulated groups and that is because the process is very similar. The difference is that batch effects are applied to all genes, not just those that are differentially expressed, and the effects are usually smaller. By combining groups and batches we can simulate both unwanted variation that we aren't interested in (batch) and the wanted variation we are looking for (group):
sim.groups <- splatSimulate(batchCells = c(50, 50), group.prob = c(0.5, 0.5),
method = "groups", verbose = FALSE)
sim.groups <- normalise(sim.groups)
plotPCA(sim.groups, shape_by = "Batch", colour_by = "Group",
exprs_values = "counts")
Here we see that the effects of the group (first component) are stronger than the batch effects (second component) but by adjusting the parameters we could made the batch effects dominate.
Convenience functions
Each of the Splatter simulation methods has it's own convenience function.
To simulate a single population use splatSimulateSingle()
(equivalent to
splatSimulate(method = "single")
), to simulate grops use
splatSimulateGroups()
(equivalent to splatSimulate(method = "groups")
) or to
simulate paths use splatSimulatePaths()
(equivalent to
splatSimulate(method = "paths")
).