Commit 42f4e382 authored by Davis McCarthy's avatar Davis McCarthy
Browse files

Minor bugfix in analysis notebook and tweak to Snakefile

parent 9d437a5e
......@@ -32,9 +32,6 @@ else:
fastqc_html_reports = expand('reports/fastqc/{sample}_fastqc.html', sample = SAMPLES_LONG)
print(SAMPLES_LONG)
print(SAMPLES_MERGE)
rule all:
input:
'results/all.tsv.gz',
......@@ -53,7 +50,7 @@ rule fastqc_reports:
singularity:
"docker://dmccarthy/embo-singlecellomics-methylation_2019-05_heidelberg:0.1"
shell:
'/Users/davis/src/FastQC.app/Contents/MacOS/fastqc -o {params.output_dir} {input}'
'fastqc -o {params.output_dir} {input}'
rule trim_fastq:
......
---
title: "Single-cell methylation analysis notebook"
output:
output:
html_notebook:
toc: true
toc_float: true
......@@ -13,9 +13,9 @@ date: "`r Sys.Date()`"
# Introduction to notebooks
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code.
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code.
Executing a chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*.
Executing a chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*.
Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Cmd+Option+I*.
......@@ -57,14 +57,14 @@ head(df)
# Clustering of cells by methylation profile
We can take just the methylation rates for the active enhancer features and
We can take just the methylation rates for the active enhancer features and
use probabilistic PCA (which can handle missing values; highly relevant here) to
view cells in a lower-dimensional space. Here we only use enhancers with less
view cells in a lower-dimensional space. Here we only use enhancers with less
than 50% missing values
```{r gather-enhancers}
mat_enhancers <- filter(df, anno == "active_enhancers") %>%
mat_enhancers <- filter(df, anno == "active_enhancers") %>%
select(sample, rate, id) %>% spread(key = id, value = rate)
prop_na <- apply(mat_enhancers, 2, function(x) mean(is.na(x)))
mat_enhancers <- mat_enhancers[, prop_na < 0.5]
......@@ -75,21 +75,21 @@ pca_enhancers <- pca(mat_enhancers, method = "ppca", scale = "uv")
df_pca_enhancers <- merge(scores(pca_enhancers), df_meta, by = 0)
ggplot(df_pca_enhancers, aes(PC1, PC2, shape = cell_type, color = cell_type)) +
geom_point(size = 3, alpha = 0.8) +
scale_color_manual(values = wes_palette("GrandBudapest")) +
scale_color_manual(values = wes_palette("GrandBudapest1")) +
xlab(paste("PC1", round(pca_enhancers@R2[1] * 100), "% of variance")) +
ylab(paste("PC2", round(pca_enhancers@R2[2] * 100), "% of variance")) +
ggtitle("Prob. PCA using active enhancer features") +
theme_classic()
```
Here, the first principal component separates the cells by cell type (E4.5 and
Here, the first principal component separates the cells by cell type (E4.5 and
E6.5).
**EXERCISE:** Repeat this with other classes of features. Do you see the same
**EXERCISE:** Repeat this with other classes of features. Do you see the same
results?
```{r pca-cgi}
mat_cgi <- filter(df, anno == "CGI") %>%
mat_cgi <- filter(df, anno == "CGI") %>%
select(sample, rate, id) %>% spread(key = id, value = rate)
prop_na <- apply(mat_cgi, 2, function(x) mean(is.na(x)))
mat_cgi <- mat_cgi[, prop_na < 0.5]
......@@ -97,7 +97,7 @@ pca_cgi <- pca(mat_cgi, method = "ppca", scale = "uv")
df_pca_cgi <- merge(scores(pca_cgi), df_meta, by = 0)
ggplot(df_pca_cgi, aes(PC1, PC2, shape = cell_type, color = cell_type)) +
geom_point(size = 3, alpha = 0.8) +
scale_color_manual(values = wes_palette("GrandBudapest")) +
scale_color_manual(values = wes_palette("GrandBudapest1")) +
xlab(paste("PC1", round(pca_enhancers@R2[1] * 100), "% of variance")) +
ylab(paste("PC2", round(pca_enhancers@R2[2] * 100), "% of variance")) +
ggtitle("Prob. PCA using CGI features") +
......@@ -105,7 +105,7 @@ ggplot(df_pca_cgi, aes(PC1, PC2, shape = cell_type, color = cell_type)) +
```
```{r pca-promoters}
mat_promoters <- filter(df, anno == "promoters") %>%
mat_promoters <- filter(df, anno == "promoters") %>%
select(sample, rate, id) %>% spread(key = id, value = rate)
prop_na <- apply(mat_promoters, 2, function(x) mean(is.na(x)))
mat_promoters <- mat_promoters[, prop_na < 0.5]
......@@ -113,7 +113,7 @@ pca_promoters <- pca(mat_promoters, method = "ppca", scale = "uv")
df_pca_promoters <- merge(scores(pca_promoters), df_meta, by = 0)
ggplot(df_pca_promoters, aes(PC1, PC2, shape = cell_type, color = cell_type)) +
geom_point(size = 3, alpha = 0.8) +
scale_color_manual(values = wes_palette("GrandBudapest")) +
scale_color_manual(values = wes_palette("GrandBudapest1")) +
xlab(paste("PC1", round(pca_enhancers@R2[1] * 100), "% of variance")) +
ylab(paste("PC2", round(pca_enhancers@R2[2] * 100), "% of variance")) +
ggtitle("Prob. PCA using promoter features") +
......@@ -125,48 +125,48 @@ ggplot(df_pca_promoters, aes(PC1, PC2, shape = cell_type, color = cell_type)) +
## Mean and variance of methylation rate
Functionality from the `dplyr` package makes it easy to compute the mean and
Functionality from the `dplyr` package makes it easy to compute the mean and
variance of the methylation rate across annotation categories.
### Mean and variance of methylation rate across all cells
```{r meanvar-rate}
df_mean_var <- df %>% group_by(id, anno) %>%
df_mean_var <- df %>% group_by(id, anno) %>%
summarise(mean_methyl_rate = mean(rate), var_methyl_rate = var(rate),
sd_methyl_rate = sd(rate))
```
Plotting the distributions of mean methylation rate across genomic features
reveals that promoters and CpG islands are generally very lowly methylated,
whereas repetetive elements (IAP) are frequently highly methylated. Enhancer
Plotting the distributions of mean methylation rate across genomic features
reveals that promoters and CpG islands are generally very lowly methylated,
whereas repetetive elements (IAP) are frequently highly methylated. Enhancer
regions commonly have moderate (25-50%) methylation.
```{r plot-mean-rate}
ggplot(df_mean_var, aes(x = anno, y = mean_methyl_rate, colour = anno)) +
geom_violin(scale = "width") +
geom_boxplot() +
geom_quasirandom(size = 0.5, alpha = 0.5) +
scale_color_manual(values = wes_palette("Darjeeling")) +
geom_quasirandom(size = 0.5, alpha = 0.5) +
scale_color_manual(values = wes_palette("Darjeeling1")) +
xlab("Genomic feature") + ylab("Mean methylation rate") +
ggtitle("Mean methylation rate by genomic feature") +
coord_flip() + theme_bw()
```
We can also examine the variance of methylation rate across classes of genomic
feature. Plotting the distributions of the standard deviation of methylation
rate by type of feature reveals that promoters and CpG islands typically have
We can also examine the variance of methylation rate across classes of genomic
feature. Plotting the distributions of the standard deviation of methylation
rate by type of feature reveals that promoters and CpG islands typically have
low variance of methylation rate, whereas methylation rate genebody and active
enhancers is highly variable. IAP elements have widely varying standard
deviations: some elements have very low variance in methylation rate, while
enhancers is highly variable. IAP elements have widely varying standard
deviations: some elements have very low variance in methylation rate, while
others have very high variance.
```{r plot-var-rate}
ggplot(df_mean_var, aes(x = anno, y = sd_methyl_rate, colour = anno)) +
geom_violin(scale = "width") +
geom_quasirandom(size = 0.5, alpha = 0.5) +
scale_color_manual(values = wes_palette("Darjeeling")) +
geom_quasirandom(size = 0.5, alpha = 0.5) +
scale_color_manual(values = wes_palette("Darjeeling1")) +
xlab("Genomic feature") + ylab("Standard deviation of methylation rate") +
ggtitle("Standard deviation of methylation rate by genomic feature") +
coord_flip() + theme_bw()
......@@ -175,21 +175,21 @@ ggplot(df_mean_var, aes(x = anno, y = sd_methyl_rate, colour = anno)) +
### Mean and variance of methylation rate by cell type
```{r meanvar-rate-cell}
df_mean_var_by_celltype <- df %>% group_by(id, anno, cell_type) %>%
df_mean_var_by_celltype <- df %>% group_by(id, anno, cell_type) %>%
summarise(mean_methyl_rate = mean(rate), var_methyl_rate = var(rate),
sd_methyl_rate = sd(rate))
```
When we compute the mean methylation rate separately for the two cell types
(E4.5 and E6.5), we see stark differences in the distributions of mean
When we compute the mean methylation rate separately for the two cell types
(E4.5 and E6.5), we see stark differences in the distributions of mean
methylation rate across genomic features. Particularly, active enhancers go from
almost completely unmethylated in E4.5 cells to highly methylated (most active
enhancers greater than 50% methylated) in E6.5 cells. Similarly, gene bodies are
enhancers greater than 50% methylated) in E6.5 cells. Similarly, gene bodies are
generally much more methylated in E6.5 cells than E4.5 cells.
In both cell types CpG islands are generally very lowly methylated,
whereas repetetive elements (IAP) are consistently highly methylated in E6.5
In both cell types CpG islands are generally very lowly methylated,
whereas repetetive elements (IAP) are consistently highly methylated in E6.5
cells, but show high variation in mean methylation rate in E4.5 cells. Promoters
are generally lowly methylated, but show slightly more methylation in E6.5 cells.
......@@ -197,38 +197,38 @@ are generally lowly methylated, but show slightly more methylation in E6.5 cells
```{r plot-mean-rate-cell, fig.width=5}
ggplot(df_mean_var_by_celltype, aes(x = anno, y = mean_methyl_rate, colour = anno)) +
geom_violin(scale = "width") +
geom_quasirandom(size = 0.5, alpha = 0.5) +
geom_quasirandom(size = 0.5, alpha = 0.5) +
facet_wrap(~cell_type, ncol = 2) +
scale_color_manual(values = wes_palette("Darjeeling")) +
scale_color_manual(values = wes_palette("Darjeeling1")) +
xlab("Genomic feature") + ylab("Mean methylation rate") +
ggtitle("Mean methylation rate for genomic features by cell type") +
coord_flip() + theme_bw()
```
Examination of the variance of methylation rate across classes of genomic
Examination of the variance of methylation rate across classes of genomic
feature split by cell type shows similar patterns to the mean methylation rate.
Genebody and active enhancer methylation variance is higher in E6.5 cells than
E4.5 cells. IAP features are generally have less variable methylation in E6.5
cells; in E4.5 cells, IAP elements have widely varying standard
deviations: some elements have very low variance in methylation rate, while
others have very high variance. Promoters have slightly higher methylation
E4.5 cells. IAP features are generally have less variable methylation in E6.5
cells; in E4.5 cells, IAP elements have widely varying standard
deviations: some elements have very low variance in methylation rate, while
others have very high variance. Promoters have slightly higher methylation
variance in E6.5 cells. In both cell types, CpG islands typically have very low
variance of methylation rate.
variance of methylation rate.
```{r plot-var-rate-cell, fig.width=5}
ggplot(df_mean_var_by_celltype, aes(x = anno, y = sd_methyl_rate, colour = anno)) +
geom_violin(scale = "width") +
geom_quasirandom(size = 0.5, alpha = 0.5) +
geom_quasirandom(size = 0.5, alpha = 0.5) +
facet_wrap(~cell_type, ncol = 2) +
scale_color_manual(values = wes_palette("Darjeeling")) +
scale_color_manual(values = wes_palette("Darjeeling1")) +
xlab("Genomic feature") + ylab("Standard deviation of methylation rate") +
ggtitle("Standard deviation of methylation rate by genomic feature") +
coord_flip() + theme_bw()
```
These analyses demonstrate the context specificity of methylation variance. In
mouse ES cells, CGIs are homogenous (and low in methylation), repeat elements
These analyses demonstrate the context specificity of methylation variance. In
mouse ES cells, CGIs are homogenous (and low in methylation), repeat elements
are homogenously high in E6.5 cells and active enhancer elements are
heterogeneous. This is interesting because the enhancer elements are cell type
specific and thus some variation in the methylation levels here implies
specific and thus some variation in the methylation levels here implies
plasticity in cell identity which could be important for lineage formation.
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment