# Single Cell 'Omics: Analysis of single-cell methylation data

The final session of the course will cover pre-processing and basic analysis of
single-cell bisulfite sequencing data. We will assay two cell types, probably 16
cells in total.

## Goals
We will process raw single-cell methylation data and conduct analyses to address the following two main goals:
1. Methylation profiles define cell type (i.e. cells of different types will separate in, e.g., a PCA).
2. Context specificity of methylation variance. E.g. in mouse ES cells, CpG islands (CGIs) are homogeneous (and low in methylation), repeat elements are homogeneously high, and active enhancer elements are heterogeneous. This is interesting because enhancer elements are cell-type specific, so variation in their methylation levels implies plasticity in cell identity, which could be important for lineage formation.

## First step

Clone or download this repository so that you have the necessary code, data and materials to hand.

If you're familiar with `git`:
```
git clone https://gitlab.svi.edu.au/biocellgen-public/embo-singlecellomics-methylation_2019-05_heidelberg.git
```

If not, you can download a zip file of the repository by clicking the green "Clone or download" button above.

## Outline:

We have two 1.5 hour sessions to work on single-cell methylation. Broadly, we will spend the first session on processing the raw sequence files to get summarized, annotated methylation results for genomic features of interest. In the second session we will analyze and plot these results to fulfill the goals above.

1. We will use `Bismark` for alignments and methylation calling. For details, see this [protocol paper](http://www.nature.com/nprot/journal/v12/n3/full/nprot.2016.187.html).
2. QC (see also the protocol paper)
    1. Negative controls should not align.
    1. Bisulfite conversion efficiency (assessed using CHH methylation from the Bismark reports) should be >95%.
    1. Mapping efficiency (from the Bismark reports) should be >10% (30-40% is normal here, but it may end up lower in these practicals).
    1. Number of CpG sites covered (we often require 1M unique positions, but this depends on sequencing depth, so an alternative is to exclude outlier cells).
3. Preprocessing and annotation
    *  Quantify methylation over regions of interest (promoters, gene bodies, enhancers, repeats, CpG islands).
        1. Mean methylation rate (each covered position counts once, i.e. do not give extra weight to positions with >1 read).
        2. Also record the coverage (the number of CpG sites covered in that cell at that locus) so that each cell can be weighted in downstream analyses.
4. Analysis
    1. Mean methylation by feature / cell type
    1. Variation by feature / cell type
    1. Dimension reduction    
    1. Clustering
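
Steps 2 and 3 of the outline can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the course pipeline: the dictionary keys, the function names and the input format (one `(position, is_methylated)` pair per read) are assumptions made for this example. It also assumes that "each covered position counts once" means averaging per-position rates with equal weight.

```python
from collections import defaultdict


def passes_qc(cell, min_cpg_sites=1_000_000):
    """Apply the QC thresholds from the outline to one cell.

    `cell` is a dict with hypothetical keys; the fractions would be
    read from the Bismark reports in practice.
    """
    return (
        cell["conversion_efficiency"] > 0.95
        and cell["mapping_efficiency"] > 0.10
        and cell["cpg_sites_covered"] >= min_cpg_sites
    )


def summarise_region(calls, start, end):
    """Return (mean methylation rate, coverage) for one region in one cell.

    `calls` holds one (position, is_methylated) pair per read. Reads at
    the same position are first collapsed to a per-position rate, so a
    position covered by many reads carries no extra weight.
    """
    reads_by_pos = defaultdict(list)
    for pos, is_methylated in calls:
        if start <= pos <= end:
            reads_by_pos[pos].append(is_methylated)
    coverage = len(reads_by_pos)  # number of distinct CpG sites covered
    if coverage == 0:
        return None, 0
    site_rates = [sum(v) / len(v) for v in reads_by_pos.values()]
    return sum(site_rates) / coverage, coverage


# Three methylated reads at one site, one unmethylated read at another.
calls = [(100, True), (100, True), (100, True), (150, False)]
rate, coverage = summarise_region(calls, start=90, end=200)
print(rate, coverage)
```

Here a read-weighted mean would give 0.75, whereas counting each position once gives 0.5 over 2 covered sites. The `min_cpg_sites` default reflects the 1M figure mentioned above and should be adjusted to the sequencing depth.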

We will manage the data processing and analysis workflow using [Snakemake](http://snakemake.readthedocs.io/en/stable/). We will analyze our results in RStudio, using an [R Markdown](http://rmarkdown.rstudio.com) Notebook (see the `notebooks` folder in this repository for an example).

## Data and references

The aim will be for you to analyze the data you generate during the course in Heidelberg.

In case that data is unavailable for any reason, and so that an alternative dataset is processed and ready for analysis, we also have access to a small dataset from Stephen Clark and colleagues at the Babraham Institute, Cambridge. This dataset consists of 15 cells from mouse embryos.

### Reference files

### Raw data

1. Raw `fastq` files are available at this [link](https://www.dropbox.com/sh/1wy3gw7fpil73dd/AADIOGvbsYNdt45KnaHahmqqa?dl=0) (6GB; password required, which will be shared on the course Slack channel). Download these files and save them to `data/fastq` only if you want to work from raw `fastq` files (substantial computation needed) and have a high-bandwidth connection.
1. Raw `fastq` files for a "test" dataset (sampling 500,000 reads from each of
the above `fastq` files), smaller in size so a little more convenient, are
available at this [link](https://www.dropbox.com/sh/s0dmlgg0cmxak9y/AAAC4NK_Bz2rSN7kYJfJcloRa?dl=0)
(210MB; password required).

### Intermediate results files

1. Merged `Bismark` files are available at this [link](https://www.dropbox.com/sh/b3v55pdkkimo13s/AAA4gH-6uCxMqFSbFM72rwLna?dl=0) (76MB; password required). Download and copy these to `data/bismark/merged`.
1. Summarized, annotated methylation results that we will use for analysis are
available in the results folder of this repository (we will generate these
ourselves during the course). A precomputed version of this file,
`results/all.tsv.gz`, is available at this [link](https://www.dropbox.com/s/dq7x4ohu5zxw5n9/all.tsv.gz?dl=0) (3.5MB)
in case you wish to use it for the second part of the analysis.


## Software requirements:
* `R` >=3.5.0 with packages:
    * From CRAN: `tidyverse`, `data.table`, `docopt`, `ggthemes`, `ggforce`
    * From Bioconductor: `scater`, `scran`, `GenomicRanges`, `iSEE`, `SC3`, `pcaMethods`
* `RStudio`
* `Python` >=3.6 with packages: `snakemake`
* [`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), which requires [Cutadapt](https://github.com/marcelm/cutadapt/)
* [`Bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
* [`Bismark`](https://www.bioinformatics.babraham.ac.uk/projects/bismark/)
* [`FastQC`](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
* [`MultiQC`](http://multiqc.info/)
* [`MethylQA`](http://methylqa.sourceforge.net/index.php)
* [`Singularity`](https://www.sylabs.io/docs/)
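
A quick way to see which of the command-line tools above are already installed is to look for them on the `PATH`. The sketch below is only a convenience check, not part of the course workflow. The executable names in `TOOLS` are assumptions based on how these tools are usually installed and may differ on your system. R packages are not checked.

```python
import shutil
import sys

# Executable names are assumptions; adjust them to match your installation.
TOOLS = ["snakemake", "trim_galore", "cutadapt", "bowtie2", "bismark",
         "fastqc", "multiqc", "methylQA", "singularity", "R"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 6):
        print("Python >=3.6 is required")
    for name in missing_tools(TOOLS):
        print("not found on PATH: {}".format(name))
```

If the script prints nothing, every listed executable was found. It does not check tool versions.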

## Notes on building images with Singularity

On multi-user systems, building images with Singularity can be sensitive to the
`umask` setting, which determines the default permissions for files and
directories created by the user. On many such systems the default `umask` is
007 or 077, which gives more restrictive permissions for newly created files.
We have found that building Singularity images with these `umask` values can
lead to errors when trying to run the images, and that it is necessary to
change `umask` to 002 before building. Once an image has been built this way,
users with `umask` settings of 007 or 077 (and otherwise correct permissions)
should be able to run it.
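
To see what these `umask` values do in practice, the following sketch creates an empty file under each setting and reports the resulting permission bits. It only illustrates the effect of `umask` on new files on a POSIX system; it does not build or run a Singularity image, and the helper name `mode_under_umask` is made up for this example.

```python
import os
import stat
import tempfile


def mode_under_umask(mask):
    """Create a file under the given umask and return its permission bits."""
    old = os.umask(mask)  # os.umask returns the previous mask
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "probe")
            # Request rw for everyone; the umask removes bits from this.
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
            os.close(fd)
            return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old)  # always restore the caller's umask


for mask in (0o077, 0o007, 0o002):
    print("umask {:03o} -> file mode {:03o}".format(mask, mode_under_umask(mask)))
```

With 077 the file is readable only by its owner, with 007 by owner and group, and with 002 by everyone. That difference is a plausible reason why an image built under a restrictive `umask` cannot be run by other users.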


## Acknowledgements

Many thanks to Stephen Clark and Ricard Argelaguet for help and advice. Stephen
advised on the course aims and structure and directed generation of raw data.
Ricard provided advice on analysis and provided data processing scripts and
processed datasets for use.