---
title: "Introduction to Splatter"
author: "Luke Zappia"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{An introduction to the Splatter package}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
```r
# To render an HTML version that works nicely with github and web pages, do:
# rmarkdown::render("vignettes/splatter.Rmd", "all")
knitr::opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5,
                      dev = 'png')
```
Welcome to Splatter! Splatter is an R package for the simple simulation of single-cell RNA sequencing data. This vignette gives an overview and introduction to Splatter's functionality.
Installation
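Splatter can be installed from Bioconductor:

```r
# biocLite() has been retired; BiocManager is the current Bioconductor installer
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("splatter")
```

To install the most recent development version from GitHub use:

```r
BiocManager::install("Oshlack/splatter", dependencies = TRUE,
                     build_vignettes = TRUE)
```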
Quickstart
Assuming you already have a matrix of count data similar to the data you wish
to simulate, there are two simple steps to creating a simulated dataset with
Splatter. Here is an example using the example dataset in the scater
package:
```r
# Load package
library(splatter)

# Load example data
library(scater)
data("sc_example_counts")

# Estimate parameters from example data
params <- splatEstimate(sc_example_counts)

# Simulate data using estimated parameters
sim <- splatSimulate(params)
```
These steps will be explained in detail in the following sections. Briefly, the first step takes a dataset and estimates simulation parameters from it, and the second step takes those parameters and simulates a new dataset.
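As a rough illustration of the estimate-then-simulate idea, the following base R sketch fits a gamma distribution to gene means by the method of moments and then draws new means from the fit. This is a toy under stated assumptions: `toy_counts`, the method-of-moments fit, and all numeric values are hypothetical, and this is not how `splatEstimate()` works internally.

```r
# Toy illustration only: not Splatter's estimation procedure
set.seed(42)

# Hypothetical "real" count matrix: 200 genes x 30 cells
toy_counts <- matrix(rnbinom(200 * 30, mu = 5, size = 2), nrow = 200)

# Step 1: estimate parameters (gamma shape and rate of the gene means)
# using the method of moments: shape = m^2 / v, rate = m / v
gene_means <- rowMeans(toy_counts)
m <- mean(gene_means)
v <- var(gene_means)
est <- list(shape = m^2 / v, rate = m / v)

# Step 2: simulate new gene means from the estimated distribution
sim_means <- rgamma(length(gene_means), shape = est$shape, rate = est$rate)

# Compare the observed and simulated means
summary(gene_means)
summary(sim_means)
```

The point of the sketch is the separation of the two steps: the estimated values are stored on their own, so they can be inspected or changed before anything is simulated.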
The Splat simulation
Before we look at how we estimate parameters, let's first look at how Splatter simulates data and what those parameters are. We use the term 'Splat' to refer to Splatter's own simulation and to differentiate it from the package itself. The core of the Splat model is a gamma-Poisson distribution used to generate a gene by cell matrix of counts. Mean expression levels for each gene are simulated from a gamma distribution, and the Biological Coefficient of Variation (BCV) is used to enforce a mean-variance trend before counts are simulated from a Poisson distribution. Splat also allows you to simulate expression outlier genes (genes with mean expression outside the gamma distribution) and dropout (random knock out of counts based on mean expression). Each cell is given an expected library size (simulated from a log-normal distribution), which makes it easier to match a given dataset.
Splat can also simulate differential expression between groups of different types of cells, or differentiation paths between different cell types where expression changes in a continuous way. These are described further in the [simulating counts] section.
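To make the description above concrete, here is a minimal base R sketch of a gamma-Poisson simulation. It is a simplified illustration under stated assumptions, not Splatter's implementation: all variable names and numeric values are hypothetical, a single fixed BCV is used for every gene, and outliers and dropout are left out.

```r
# Simplified gamma-Poisson sketch: not Splatter's internal code
set.seed(1)

n_genes <- 100
n_cells <- 20

# Gene means drawn from a gamma distribution
gene_means <- rgamma(n_genes, shape = 0.6, rate = 0.3)

# Expected library size for each cell, drawn from a log-normal distribution
lib_sizes <- rlnorm(n_cells, meanlog = 11, sdlog = 0.2)

# Scale gene proportions by each cell's library size to get a
# genes x cells matrix of expected counts
gene_props <- gene_means / sum(gene_means)
cell_means <- outer(gene_props, lib_sizes)

# Add biological variability: a gamma draw with shape 1 / bcv^2 whose
# mean equals the expected count (a fixed BCV is assumed here)
bcv <- 0.1
true_means <- matrix(
    rgamma(n_genes * n_cells, shape = 1 / bcv^2, scale = cell_means * bcv^2),
    nrow = n_genes, ncol = n_cells
)

# Counts are Poisson draws around the adjusted means
counts <- matrix(
    rpois(n_genes * n_cells, lambda = true_means),
    nrow = n_genes, ncol = n_cells
)
dim(counts)
```

Drawing the Poisson rate from a gamma distribution is what makes the resulting counts overdispersed relative to a plain Poisson model.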
Parameters
The parameters required for the Splat simulation are briefly described here:
- Global parameters
    - `nGenes` - The number of genes to simulate.
    - `nCells` - The number of cells to simulate.
    - `seed` - Seed to use for generating random numbers.
- Batch parameters
    - `nBatches` - The number of batches to simulate.
    - `batchCells` - The number of cells in each batch.
    - `batch.facLoc` - Location (meanlog) parameter for the batch effects
      factor log-normal distribution.
    - `batch.facScale` - Scale (sdlog) parameter for the batch effects factor
      log-normal distribution.
- Mean parameters
    - `mean.shape` - Shape parameter for the mean gamma distribution.
    - `mean.rate` - Rate parameter for the mean gamma distribution.
- Library size parameters
    - `lib.loc` - Location (meanlog) parameter for the library size log-normal
      distribution, or mean for the normal distribution.
    - `lib.scale` - Scale (sdlog) parameter for the library size log-normal
      distribution, or sd for the normal distribution.
    - `lib.norm` - Whether to use a normal distribution instead of the usual
      log-normal distribution.
- Expression outlier parameters
    - `out.prob` - Probability that a gene is an expression outlier.
    - `out.facLoc` - Location (meanlog) parameter for the expression outlier
      factor log-normal distribution.
    - `out.facScale` - Scale (sdlog) parameter for the expression outlier
      factor log-normal distribution.
- Group parameters
    - `nGroups` - The number of groups or paths to simulate.
    - `group.prob` - The probabilities that cells come from particular groups.
- Differential expression parameters
    - `de.prob` - Probability that a gene is differentially expressed in each
      group or path.
    - `de.loProb` - Probability that a differentially expressed gene is
      down-regulated.
    - `de.facLoc` - Location (meanlog) parameter for the differential
      expression factor log-normal distribution.
    - `de.facScale` - Scale (sdlog) parameter for the differential expression
      factor log-normal distribution.
- Biological Coefficient of Variation parameters
    - `bcv.common` - Underlying common dispersion across all genes.
    - `bcv.df` - Degrees of Freedom for the BCV inverse chi-squared
      distribution.
- Dropout parameters
    - `dropout.type` - Type of dropout to simulate.
    - `dropout.mid` - Midpoint parameter for the dropout logistic function.
    - `dropout.shape` - Shape parameter for the dropout logistic function.
- Differentiation path parameters
    - `path.from` - Vector giving the originating point of each path.
    - `path.length` - Vector giving the number of steps to simulate along each
      path.
    - `path.skew` - Vector giving the skew of each path.
    - `path.nonlinearProb` - Probability that a gene changes expression in a
      non-linear way along the differentiation path.
    - `path.sigmaFac` - Sigma factor for non-linear gene paths.
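
Several of the parameters listed above are location and scale values for log-normal factors, and the dropout parameters describe a logistic function. The base R sketch below shows one way such parameters could be turned into multiplicative factors and dropout probabilities. The helper names `lnorm_factors()` and `dropout_prob()`, their arguments, and the numeric values are hypothetical and chosen for illustration; this is a simplified sketch, not Splatter's internal code.

```r
# Illustrative helpers only: simplified, not Splatter's internal code
set.seed(7)

# Multiplicative factors: each gene is selected with probability `prob`,
# selected genes get a log-normal factor, and a fraction `lo_prob` of
# the selected genes are down-regulated (factor below 1)
lnorm_factors <- function(n, prob, lo_prob, fac_loc, fac_scale) {
    selected <- rbinom(n, 1, prob) == 1
    facs <- rlnorm(sum(selected), meanlog = fac_loc, sdlog = fac_scale)
    # Force every drawn factor to be >= 1, then invert the down-regulated ones
    facs <- pmax(facs, 1 / facs)
    down <- rbinom(length(facs), 1, lo_prob) == 1
    facs[down] <- 1 / facs[down]
    # Unselected genes keep a factor of 1 (no change)
    factors <- rep(1, n)
    factors[selected] <- facs
    factors
}

# Dropout probability as a logistic function of log mean expression
dropout_prob <- function(means, mid, shape) {
    1 / (1 + exp(-shape * (log(means) - mid)))
}

de_facs <- lnorm_factors(1000, prob = 0.1, lo_prob = 0.5,
                         fac_loc = 0.1, fac_scale = 0.4)
table(de_facs == 1)

# With a negative shape, genes with lower means are dropped more often
drop_p <- dropout_prob(c(0.1, 1, 10, 100), mid = 0, shape = -1)
drop_p
```

Multiplying gene means by factors like these leaves most genes unchanged and shifts the selected genes up or down, which is the general idea behind the outlier, batch, and differential expression factor parameters.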
While this may look like a lot of parameters, Splatter attempts to make it easy
for the user, both by providing sensible defaults and by making it easy to
estimate many of the parameters from real data. For more details on the
parameters see `?SplatParams`.
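
For example, a parameters object holding the defaults can be created and inspected as sketched below. This assumes the `newSplatParams()`, `getParam()`, and `setParam()` functions exported by the installed version of Splatter; the value 5000 is arbitrary.

```r
library(splatter)

# Create a parameters object holding the default values
params <- newSplatParams()

# Look up a single parameter, then override it
getParam(params, "nGenes")
params <- setParam(params, "nGenes", 5000)
getParam(params, "nGenes")
```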