Commit d73fd375 authored by Davis McCarthy's avatar Davis McCarthy
Browse files

Initial commit

parents
Pipeline #58 failed with stages
in 9 seconds
FROM nfcore/base
LABEL authors="davismcc@gmail.com" \
maintainer="Davis McCarthy <davismcc@gmail.com>" \
description="Docker image containing all requirements for AAAA_2019_Project-Template"
RUN apt-get update && \
apt-get -y upgrade && \
apt-get install -y --no-install-recommends \
build-essential \
curl \
git \
libbz2-dev \
zlib1g-dev \
&& rm -rf /var/lib/apt/lists/*
COPY environment.yml /
RUN conda env create -f /environment.yml python=3.6 && conda clean -a
ENV PATH /opt/conda/envs/fibroblast-clonality/bin:$PATH
Copyright (C) 2019, Davis McCarthy
This work is copyrighted under a Creative Commons Attribution-ShareAlike 4.0
International license. For details, see https://creativecommons.org/licenses/by-sa/4.0/legalcode.
# Project AAAA_2019_Project-Template
## Project overview
[Write a concise overview of the project]
## Contributors
[List contributors to the project, affiliations, and if appropriate contact details]
## Project setup
### Define project name and identifier
To use this template project for a new project, first copy the directory or clone the repository to a new location.
The convention for project names is that they are identified by a randomly-generated four-letter code (in the format `[A-Z][AEIOU][A-Z][A-Z][A-Z]`), the year in which the project commenced and a project name (title case with words separated by hyphens). These elements are separated by underscores.
Thus, this template project has a valid identifier ("project ID"): `AAAA_2019_Project-Template`.
R code to (reproducibly) generate a four-letter project code:
```
set.seed(as.numeric(Sys.Date()))
c(sample(LETTERS, 1), sample(c("A","E","I","O","U"), 1), sample(LETTERS, 2))
```
### Update template for new project
To setup the new project, the project ID should be changed in the following files:
* `cluster.json` (a file that defines defaults for running Snakemake on a cluster)
* `Dockerfile` (a file that defines a Docker container with necessary software installed)
* `environment.yml` (a file that defines the conda environment to use)
* `analysis/_site.yml` (a YAML file with parameters for constructing the website with workflowr)
* `envs/myenvs.yml` (a conda environment definition file)
* `org/project_management.org` (an org-mode file for managing the project)
Specify the way to cite the project in the `CITATION` file. This will likely need updating over the course of the project.
Change the seed for random number generation for the project in `_workflowr.yml` to allow reproducibility of anlayses that use "random" numbers. A sensible choice is the date that the project is setup in YYYYMMDD format (i.e. a numeric value).
Check that the `LICENSE` file is appropriate. Unless there is good reason to prefer something different, we strongly prefer open, permissive licenses to make our work as widely accessible and useable as possible.
The `Snakefile` file that defines the Snakemake workflow will need to be rewritten, but the included file in this template provides a useful starting point and some handy patterns to exploit.
The `analysis/about.Rmd`, `analysis/index.Rmd` and `analysis/license.Rmd` files need to be completed, but the existing templates from a previous project provide a useful guide and starting point.
## Project organisation and management
### File naming conventions
Files should have names that follow this pattern: `<project code>_<date file created>_<descriptive name>.<file extension>`. The four-letter project code should be used (e.g. AAAA). The YYYY-MM-DD format should be used (e.g. 2019-01-01 for January 1 2019). Hyphens should be used to separate words in the descriptive name (e.g. exploratory-data-analysis).
Thus a valid file name is `AAAA_2019-01-01_exploratory-data-analysis.Rmd`.
Applying these conventions ensures that user-created files are easily identifiable and uniquely named.
### Correspondence about the project
All emails sent regarding the project should have the four-letter project code at the start of the email subject (e.g. `AAAA - <topic of email>`). This ensures that email correspondence for the project is easily searchable in overstuffed email archives and inboxes.
Particularly important emails should be exported to PDF and saved in a subfolder (e.g. `correspondence`) of the `org` folder.
### To-dos and notes
The `org` folder is intended to house files used for organising and managing the project. These might include markdown (`.md`) files for notes, and `.org` mode files (another plaintext document format compatible with Emacs org-mode and org-mode emulators in other modern text editors) for to-do lists and outlines.
There is an included org-mode file `org/project_management.org` that can act as a template for organising the project with a roadmap and to-do lists.
## Reproducibility and version control
We aim to make all of our projects and analyses completely open and reproducible. The [workflowr][] package makes this aim easier to achieve by providing a set of tools and conventions for reproducible analysis workflows in R and the simultaneous building of a website that presents the analyses, with source code, in a readable way.
The system integrates seamlessly with the git version control system.
As many project files as possible should be under version control. All user-created text files (e.g. `.md`, `.Rmd`, etc) and code (`.R`, `.py`, etc. files) should be under version control. Files that are large in size should not be under version control (too many large files in the repository makes git become unwieldy), and, in general, there is no need for files produced by code or analysis files to be versioned as running the workflow will produce them.
Docker containers encapsulating software environments enable reproducibility by allowing the same software and environments to be easily used by different people across different platforms. As such, we use them extensively.
Workflow management software makes a very big difference when trying to run complicated computational workflows and making them portable across local, cluster, and other HPC computing environments.
## Acknowledgements
This project is a [workflowr][] project. Making use of the workflowr package for reproducible analyses dictates certain structures for the project file.
[workflowr]: https://github.com/jdblischak/workflowr
"""
Snakefile for
Author: Davis McCarthy
Affiliation: St Vincent's Institute of Medical Research and the University of Melbourne
Run: snakemake -s Snakefile_canopy --jobs 1000 --latency-wait 30 --cluster-config cluster.json --cluster 'bsub -J {cluster.name} -q {cluster.queue} -n {cluster.n} -R "rusage[mem={cluster.memory}]" -M {cluster.memory} -o {cluster.output} -e {cluster.error}' --keep-going --rerun-incomplete
Davis McCarthy, 02 January 2019
"""
import glob
import os
from subprocess import run
import subprocess
import pandas as pd
import re
import h5py
shell.prefix("set -euo pipefail;")
donors = ['bima', 'bubh', 'ceik', 'ciwj', 'cuhk', 'deyz', 'diku', 'eipl', 'eofe', 'euts', \
'fawm', 'feec', 'fiaj', 'fikt', 'garx', 'gesg', 'gifk', 'hehd', 'heja', 'hipn', 'ieki', \
'jogf', 'joxm', 'kajh', 'kuco', 'laey', 'lexy', 'melw', 'miaj', 'naju', 'nusw', 'oaaz', \
'oaqd', 'oicx', 'oilg', 'pamv', 'pelm', 'pipw', 'puie', 'qayj', 'qolg', 'qonc', 'rozh', \
'rutc', 'sebz', 'sehl', 'sohd', 'tixi', 'toss', 'ualf', 'vabj', 'vass', 'vils', 'vuna', \
'wahn', 'wetu', 'wigw', 'wopl', 'wuye', 'xugn', 'xuja', 'zihe', 'zoxy']
## too few variants for clonal analysis:
singlecell_donors_all = ['bima', 'bubh', 'ceik', 'ciwj', 'cuhk', 'deyz', 'diku',\
'eipl', 'eofe', 'euts', 'fawm', 'feec', 'fiaj', 'fikt',\
'garx', 'gesg', 'gifk', 'hehd', 'heja', 'hipn', 'ieki',\
'joxm', 'kajh', 'kuco', 'laey', 'lexy', 'melw',\
'miaj', 'naju', 'nusw', 'oaaz', 'oaqd', 'oilg',\
'pamv', 'pelm', 'pipw', 'puie', 'qayj', 'qolg', 'qonc',\
'rozh', 'rutc', 'sebz', 'sehl', 'sohd', 'toss', 'ualf',\
'vabj', 'vass', 'vils', 'vuna', 'wahn', 'wetu', 'wigw',\
'wopl', 'wuye', 'xugn', 'xuja', 'zihe', 'zoxy'] # 60 donors
## lenient variant filtering
## donors with <10 variants with coverage in at least one cell:
## bima, bubh, ceik, cuhk, deyz, diku, dons, eika, fiaj, gifk, hehd, jogf, kajh, lise, pamv, pelm, rutc, sebz, tolg, toss, tuju, vabj, wigw, wopl, wuye, xuja, zihe
## not enough QC-passing cells (<30): ciwj, eipl, eofe, miaj, oaqd,
donors_lenient_all = ['euts', 'fawm', 'feec', 'fikt', \
'garx', 'gesg', 'heja', 'hipn', 'ieki', 'joxm', 'kuco', 'laey', 'lexy', 'melw', \
'naju', 'nusw', 'oaaz', 'oilg', 'pipw', 'puie', 'qayj', 'qolg', 'qonc', 'rozh', \
'sehl', 'sohd', 'ualf', 'vass', 'vils', 'vuna', 'wahn', 'wetu', 'xugn', 'zoxy'] ## 34 donors
## Canopy will not fit (variant clustering fails): melw, sohd
donors_lenient_cell_cov = ['euts', 'fawm', 'feec', 'fikt', 'garx', 'gesg', \
'heja', 'hipn', 'ieki', 'joxm', 'kuco', 'laey', 'lexy', 'naju', 'nusw', \
'oaaz', 'oilg', 'pipw', 'puie', 'qayj', 'qolg', 'qonc', 'rozh', 'sehl', \
'ualf', 'vass', 'vils', 'vuna', 'wahn', 'wetu', 'xugn', 'zoxy'] ## 32 donors
## strict variant filtering
## donors with <10 variants with coverage in at least one cell:
## bima, bubh, ceik, ciwj, cuhk, deyz, diku, dons, eika, fiaj, gifk, hehd, jogf, kajh, lexy, lise, pamv, pelm, rutc, sebz, tolg, toss, tuju, vabj, vils, wigw, wopl, wuye, xuja, zihe
## not enough QC-passing cells (<30): eipl, eofe, melw, miaj, oaqd
donors_strict_all = ['euts', 'fawm', 'feec', 'fikt', 'garx', 'gesg', \
'heja', 'hipn', 'ieki', 'joxm', 'kuco', 'laey', 'naju', 'nusw', \
'oaaz', 'oilg', 'pipw', 'puie', 'qayj', 'qolg', 'qonc', 'rozh', 'sehl', \
'sohd', 'ualf', 'vass', 'vuna', 'wahn', 'wetu', 'xugn', 'zoxy'] # 31 donors
## Canopy will not fit (variant clustering fails): kuco, sohd
donors_strict_cell_cov = ['euts', 'fawm', 'feec', 'fikt', 'garx', 'gesg', \
'heja', 'hipn', 'ieki', 'joxm', 'laey', 'naju', 'nusw', \
'oaaz', 'oilg', 'pipw', 'puie', 'qayj', 'qolg', 'qonc', 'rozh', 'sehl', \
'ualf', 'vass', 'vuna', 'wahn', 'wetu', 'xugn', 'zoxy'] # 29 donors
sce_list = {}
sce_list['filt_lenient'] = {}
sce_list['filt_lenient']['all_filt_sites'] = expand(\
'data/sces/sce_{donor}_with_clone_assignments.filt_lenient.all_filt_sites.rds',\
donor = donors_lenient_all)
sce_list['filt_lenient']['cell_coverage_sites'] = expand(\
'data/sces/sce_{donor}_with_clone_assignments.filt_lenient.cell_coverage_sites.rds',\
donor = donors_lenient_cell_cov)
sce_list['filt_strict'] = {}
sce_list['filt_strict']['all_filt_sites'] = expand(\
'data/sces/sce_{donor}_with_clone_assignments.filt_strict.all_filt_sites.rds',\
donor = donors_strict_all)
sce_list['filt_strict']['cell_coverage_sites'] = expand(\
'data/sces/sce_{donor}_with_clone_assignments.filt_strict.cell_coverage_sites.rds',\
donor = donors_strict_cell_cov)
sces_flat = []
sces_flat.append(sce_list['filt_lenient']['all_filt_sites'])
sces_flat.append(sce_list['filt_lenient']['cell_coverage_sites'])
sces_flat.append(sce_list['filt_strict']['all_filt_sites'])
sces_flat.append(sce_list['filt_strict']['cell_coverage_sites'])
sces_flat = [filename for elem in sces_flat for filename in elem]
rule all:
input:
expand('data/exome-point-mutations/high-vs-low-exomes.v62.ft.filt_{strictness}-{donor}.txt.gz',\
strictness = ['lenient', 'strict'], donor = singlecell_donors_all),
expand('data/raw/mpileup/{donor}.mpileup.vcf{suffix}', \
donor = singlecell_donors_all, suffix = ['.gz', '.gz.csi']),
expand('data/sces/sce_{donor}_with_clone_assignments.{strictness}.{sites}.rds',\
donor = singlecell_assign_donors, strictness = ['filt_strict', 'filt_lenient'],\
sites = ['all_filt_sites', 'cell_coverage_sites']),
expand('reports/de_pathway/de_pathway.{cells}.{strictness}.{sites}.html', \
cells = ['unst_cells'], strictness = ['filt_strict', 'filt_lenient'],\
sites = ['all_filt_sites', 'cell_coverage_sites']), # 'cell_coverage_sites'
expand('reports/de_pathway/de_pathway.{cells}.cellcycle_analyses.{strictness}.{sites}.html', \
cells = ['unst_cells'], strictness = ['filt_strict', 'filt_lenient'],\
sites = ['all_filt_sites', 'cell_coverage_sites']), # 'cell_coverage_sites'
expand('reports/de_pathway/de_pathway.{cells}.permutations.{strictness}.{sites}.html', \
cells = ['unst_cells'], strictness = ['filt_strict', 'filt_lenient'],\
sites = ['all_filt_sites', 'cell_coverage_sites']),
expand('data/exome-point-mutations/high-vs-low-exomes.v62.ft.alldonors-{strictness}.all_filt_sites.ped', \
strictness = ['filt_strict', 'filt_lenient']),
expand('data/exome-point-mutations/high-vs-low-exomes.v62.ft.alldonors-{strictness}.all_filt_sites.vcf', \
strictness = ['filt_strict', 'filt_lenient']),
expand('data/simulations/{donor}.simulate.rds', \
donor = donors_lenient_cell_cov),
expand('data/variance_components/donorVar/{donor}.var_part.var1.csv' \
donor = donors_lenient_cell_cov)
rule run_varpart_per_donor:
input:
sce=lambda wildcards: sce_list['filt_lenient']['cell_coverage_sites']
output:
'data/variance_components/donorVar/{donor}.var_part.var1.csv'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'Rscript src/R/var_part_donor.R {wildcards.donor}'
rule run_simulation_per_donor:
input:
card='data/cell_assignment/cardelino_results.{donor}.filt_lenient.cell_coverage_sites.rds'
output:
real_data='data/simulations/{donor}.filt_lenient.cell_coverage_sites.mult.rds',
simu_data='data/simulations/{donor}.simulate.rds'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'Rscript src/R/simulation_per_donor.R {wildcards.donor}'
rule run_de_pathway_analysis_unst_cells_permutation:
input:
sce=lambda wildcards: sce_list[wildcards.strictness][wildcards.sites]
output:
html='reports/de_pathway/de_pathway.unst_cells.permutations.{strictness}.{sites}.html',
unst_rds='data/de_analysis_FTv62/permutations/{strictness}.{sites}.de_results_unstimulated_cells.rds'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'{rscript_cmd} src/R/compile_report_de_pathways.R '
'-c {wildcards.strictness}.{wildcards.sites} '
'-o {output.html} '
'--template src/Rmd/DE_pathways_FTv62_callset_clones_pairwise_vs_base.unst_cells.permutations.Rmd '
'--title "DE Pathway permutation analysis using unstimulated cells: {wildcards.strictness} {wildcards.sites}" '
'--to_working_dir ../../ '
rule run_de_pathway_analysis_unst_cells_cellcycle:
input:
sce=lambda wildcards: sce_list[wildcards.strictness][wildcards.sites]
output:
html='reports/de_pathway/de_pathway.unst_cells.cellcycle_analyses.{strictness}.{sites}.html',
unst_rds='data/de_analysis_FTv62/cellcycle_analyses/{strictness}.{sites}.de_results_unstimulated_cells.cc.rds'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'Rscript src/R/compile_report_de_pathways.R '
'-c {wildcards.strictness}.{wildcards.sites} '
'-o {output.html} '
'--template src/Rmd/DE_pathways_FTv62_callset_clones_pairwise_vs_base.cell_cycle.unst_cells.Rmd '
'--title "DE Pathway Analysis using unstimulated cells accounting for cell cycle : {wildcards.strictness} {wildcards.sites}" '
'--to_working_dir ../../ '
rule run_de_pathway_analysis_unst_cells:
input:
sce=lambda wildcards: sce_list[wildcards.strictness][wildcards.sites]
output:
html='reports/de_pathway/de_pathway.unst_cells.{strictness}.{sites}.html',
unst_rds='data/de_analysis_FTv62/{strictness}.{sites}.de_results_unstimulated_cells.rds'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'Rscript src/R/compile_report_de_pathways.R '
'-c {wildcards.strictness}.{wildcards.sites} '
'-o {output.html} '
'--template src/Rmd/DE_pathways_FTv62_callset_clones_pairwise_vs_base.unst_cells.Rmd '
'--title "DE Pathway Analysis using unstimulated cells: {wildcards.strictness} {wildcards.sites}" '
'--to_working_dir ../../ '
rule run_cell_assignment:
input:
can='data/canopy/canopy_results.{donor}.{strictness}.{sites}.rds',
sce='data/sces/sce_{donor}_qc.rds',
vcf='data/raw/mpileup/{donor}.mpileup.vcf.gz',
csi='data/raw/mpileup/{donor}.mpileup.vcf.gz.csi'
output:
html = 'reports/cell_assignment/cell_assignment.{donor}.{strictness}.{sites}.html',
sce = 'data/sces/sce_{donor}_with_clone_assignments.{strictness}.{sites}.rds',
card = 'data/cell_assignment/cardelino_results.{donor}.{strictness}.{sites}.rds'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'Rscript src/R/compile_report_cell_assign.R '
'-i {input.sce} --vcf_file {input.vcf} --tree_file {input.can} '
'-o {output.html} --results_sce {output.sce} --results_card {output.card} '
'--template src/Rmd/cell_assignment_template.Rmd '
'--title "Assigning single cells to clones: {wildcards.donor}" '
'--donor {wildcards.donor} --to_working_dir ../../ '
rule run_canopy_donor_specific_coverage:
input:
'Data/exome-point-mutations/high-vs-low-exomes.v62.ft.{strictness}-{donor}.txt.gz'
output:
html = 'reports/canopy/canopy.analysis.{donor}.{strictness}.cell_coverage_sites.html',
rds = 'data/canopy/canopy_results.{donor}.{strictness}.cell_coverage_sites.rds'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'Rscript '
'src/R/compile_report.R -i {input} -o {output.html} '
'--results_out {output.rds} '
'--template src/Rmd/canopy_analysis_template.Rmd '
'--title "Canopy analysis: {wildcards.donor}" '
'--donor {wildcards.donor} --to_working_dir ../../ '
rule run_canopy:
input:
'Data/exome-point-mutations/high-vs-low-exomes.v62.ft.{strictness}-alldonors.txt.gz'
output:
html = 'reports/canopy/canopy.analysis.{donor}.{strictness}.all_filt_sites.html',
rds = 'data/canopy/canopy_results.{donor}.{strictness}.all_filt_sites.rds'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'Rscript '
'src/R/compile_report.R -i {input} -o {output.html} '
'--results_out {output.rds} '
'--template src/Rmd/canopy_analysis_template.Rmd '
'--title "Canopy analysis: {wildcards.donor}" '
'--donor {wildcards.donor} --to_working_dir ../../ '
rule filter_somatic_variants_per_donor_strict:
input:
flat='data/exome-point-mutations/high-vs-low-exomes.v62.ft.filt_strict-alldonors.txt.gz',
vcf='data/raw/mpileup/{donor}.mpileup.vcf.gz',
csi='data/raw/mpileup/{donor}.mpileup.vcf.gz.csi'
output:
'data/exome-point-mutations/high-vs-low-exomes.v62.ft.filt_strict-{donor}.txt.gz'
conda:
"envs/myenv.yaml"
shell:
'Rscript src/R/filter_variants.R -i {input.flat} -o {output} '
'--donor_cell_vcf {input.vcf} --max_fdr 0.2 '
'--min_prop_covered_cells 0.005 --donor_name {wildcards.donor}'
rule filter_somatic_variants_per_donor_lenient:
input:
flat='data/exome-point-mutations/high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz',
vcf='data/raw/mpileup/{donor}.mpileup.vcf.gz',
csi='data/raw/mpileup/{donor}.mpileup.vcf.gz.csi'
output:
'data/exome-point-mutations/high-vs-low-exomes.v62.ft.filt_lenient-{donor}.txt.gz'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'Rscript src/R/filter_variants.R -i {input.flat} -o {output} '
'--donor_cell_vcf {input.vcf} --max_fdr 0.2 '
'--min_prop_covered_cells 0.005 --donor_name {wildcards.donor}'
rule filter_somatic_variants_strict:
input:
'data/exome-point-mutations/high-vs-low-exomes.v62.ft.txt.gz'
output:
'data/exome-point-mutations/high-vs-low-exomes.v62.ft.filt_strict-alldonors.txt.gz'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'Rscript src/R/filter_variants.R -i {input} -o {output} '
'--max_fdr 0.05 --min_vaf_fibro 0.03 --max_vaf_fibro 0.45 '
'--min_nalt_fibro 2.5 --max_vaf_ips 0.7 --combo_max_vaf_fibro 0.35 '
'--combo_max_vaf_ips 0.3'
rule filter_somatic_variants_lenient:
input:
'data/exome-point-mutations/high-vs-low-exomes.v62.ft.txt.gz'
output:
'data/exome-point-mutations/high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz'
conda:
"envs/myenv.yaml"
singularity:
"docker://davismcc/r-singlecell-img"
shell:
'Rscript src/R/filter_variants.R -i {input} -o {output} '
'--max_fdr 0.1 --min_vaf_fibro 0.01 --max_vaf_fibro 0.45 '
'--min_nalt_fibro 1.5 --max_vaf_ips 0.8 --combo_max_vaf_fibro 0.45 '
'--combo_max_vaf_ips 0.45'
# workflowr options
# Version 1.1.1
# The seed to use for random number generation. See ?set.seed for details.
seed: 20190102
# The working directory to build the R Markdown files. The path is relative to
# _workflowr.yml. See ?rmarkdown::render for details.
knit_root_dir: "."
# Session information function
sessioninfo: "devtools::session_info()"
name: "AAAA_2019_Project-Template"
output_dir: "../docs"
navbar:
title: "AAAA_2019_Project-Template"
left:
- text: "Home"
href: index.html
- text: "About"
href: about.html
- text: "License"
href: license.html
right:
- icon: fa-github
href: https://github.com/davismcc/AAAA_2019_Project-Template
output:
workflowr::wflow_html:
toc: true
toc_float: true
theme: journal
highlight: pygments
code_folding: hide
---
title: "About"
output:
workflowr::wflow_html:
toc: false
---
## *Cardelino* : Integrating whole exomes and single-cell transcriptomes to reveal phenotypic impact of somatic variants
**Key findings:**
* A new approach for integrating DNA-seq and single-cell RNA-seq data to reconstruct clonal substructure and single-cell transcriptomes.
* A new computational method to map single-cell RNA-seq profiles to clones.
* Evidence for non-neutral evolution of clonal populations in human fibroblasts.
* Proliferation and cell cycle pathways are commonly distorted in mutated clonal populations, with implications for cancer and ageing.
**Abstract**
Decoding the clonal substructures of somatic tissues sheds light on cell growth, development and differentiation in health, ageing and disease. DNA-sequencing, either using bulk or using single-cell assays, has enabled the reconstruction of clonal trees from somatic variants. However, approaches to characterize phenotypic and functional variations between clones are not established.
Here we present cardelino (https://github.com/PMBio/cardelino), a computational method to assign single-cell transcriptome profiles to somatic clones using variant information contained in single-cell RNA-seq (scRNA-seq) data. After validating our model using simulations, we apply cardelino to matched scRNA-seq and exome sequencing data from 32 human dermal fibroblast lines
We identify hundreds of differentially expressed genes between cells assigned to different clones. These genes were frequently enriched for the cell cycle and pathways related to cell proliferation, and our data point to clone gene expression phenotypes that support previous work showing non-neutral somatic evolution in nominally healthy human skin cells.
## Authors
The full author list is as follows:
Davis J. McCarthy<sup>1,4,\*</sup>, Raghd Rostom<sup>1,2,\*</sup>, Yuanhua Huang<sup>1,\*</sup>, Daniel J. Kunz<sup>2,5,6</sup>, Petr Danecek<sup>2</sup>, Marc Jan Bonder<sup>1</sup>, Tzachi Hagai<sup>1,2</sup>, HipSci Consortium, Wenyi Wang<sup>8</sup>, Daniel J. Gaffney<sup>2</sup>, Benjamin D. Simons<sup>5,6,7</sup>, Oliver Stegle<sup>1,3,9,#</sup>, Sarah A. Teichmann<sup>1,2,5,#</sup>
<sup>1</sup>European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, CB10 1SD
Hinxton, Cambridge, UK; <sup>2</sup>Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK; <sup>3</sup>European Molecular Biology Laboratory, Genome Biology Unit, 69117 Heidelberg, Germany; <sup>4</sup>St Vincent’s Institute of Medical Research, Fitzroy, Victoria 3065, Australia. <sup>5</sup>Cavendish Laboratory, Department of Physics, JJ Thomson Avenue, Cambridge, CB3 0HE, UK. <sup>6</sup>The Wellcome Trust/Cancer Research UK Gurdon Institute, University of Cambridge, Cambridge, CB2 1QN, UK. <sup>7</sup>The Wellcome Trust/Medical Research Council Stem Cell Institute, University of Cambridge, Cambridge, UK. <sup>8</sup>Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA. <sup>9</sup>Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), 69120, Heidelberg, Germany.
<sup>*</sup> These authors contributed equally to this work.
<sup>#</sup> Corresponding authors.
---
title: "Human dermal fibroblast clonality project"
author: "Davis J. McCarthy"
site: workflowr::wflow_site
output:
workflowr::wflow_html:
toc: false
---
## Project overview
This project investigates clonality in human dermal fibroblast cell populations
in 32 cell lines from distinct donors, using bulk whole-exome sequencing and
single-cell RNA-sequencing data.
**Key findings:**
* A novel approach for integrating DNA-seq and single-cell RNA-seq data to
reconstruct clonal substructure and single-cell transcriptomes.
* A new computational method, [cardelino](https://github.com/PMBio/cardelino), to map
single-cell RNA-seq profiles to clones.
* Evidence for non-neutral evolution of clonal populations in human fibroblasts.
* Proliferation and cell cycle pathways are commonly distorted in mutated clonal
populations, with implications for cancer and ageing.
For a richer overview, see the [About](about.html) page.
## Data pre-processing
The data pre-processing for this project from the raw data described above is
complicated and computationally expensive, so this repository does not reproduce
the data pre-processing in an automated way. However, we provide the source code
for the [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for
data pre-processing in this repository. Docker images providing the computing
environment and software used are publicly available, split into an image for
command line [bioinformatics tools](https://hub.docker.com/r/davismcc/fibroblast-clonality/)
and an [R installation](https://hub.docker.com/r/davismcc/r-singlecell-img/) with
necessary packages installed.
If you would like to pre-process the data from raw reads to results as we have,
please consult our description of [how to run](data_preprocessing.html) the
workflow.
## Analyses
Here we present the reproducible the results of our analyses. They were
generated by rendering the
[R Markdown documents](https://github.com/davismcc/fibroblast-clonality/tree/master/analysis)
into webpages available at the links below.
The results presented in the paper were produced with these analyses.
1. [Simulation results.](simulations.html)
1. [Overview of lines.](overview_lines.html)
1. [Selection models.](selection_models.html)
1. [Analysis of clonal prevalences.](clone_prevalences.html)
1. [Analysis for the example cell line *joxm*.](analysis_for_joxm.html)
1. [Variance components analysis.](variance_components.html)
1. [Differential expression analysis.](differential_expression.html)
1. [Analysis of effects of somatic variants on cis gene expression.](mutated_genes.html)
## Data availability
This is a complicated project, and reproducing all of the results presented, especially from raw data is highly non-trivial. Nevertheless, we have made all data available so that everything is entirely reproducible.
Single-cell RNA-seq data have been deposited in the
[ArrayExpress](https://www.ebi.ac.uk/arrayexpress) database at EMBL-EBI under accession
number [E-MTAB-7167](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-7167).
Whole-exome sequencing data is available through the
[HipSci portal](http://www.hipsci.org). Processed data and large results files are
available from [Zenodo](http://doi.org/10.5281/zenodo.1403510) with DOI 10.5281/zenodo.1403510.
To set up the project to reproduce our analyses, first clone the [source code repository](https://github.com/davismcc/fibroblast-clonality) from GitHub. Next, download all of the reference, metadata and results files and add them to the (cloned) project folder with the following structure:
```
.
├── data
│   ├── canopy
│   │   ├── canopy_results.*.rds
│   ├── cell_assignment
│   │   ├── cardelino_results.*.rds
│   ├── de_analysis_FTv62
│   │   ├── cellcycle_analyses
│   │   │   ├── filt_lenient.all_filt_sites.de_results_unstimulated_cells.cc.rds
│   │   │   ├── filt_lenient.cell_coverage_sites.de_results_unstimulated_cells.cc.rds
│   │   │   ├── filt_strict.all_filt_sites.de_results_unstimulated_cells.cc.rds
│   │   │   └── filt_strict.cell_coverage_sites.de_results_unstimulated_cells.cc.rds
│   │   ├── filt_lenient.all_filt_sites.de_results_unstimulated_cells.rds
│   │   ├── filt_lenient.cell_coverage_sites.de_results_unstimulated_cells.rds
│   │   ├── filt_strict.all_filt_sites.de_results_unstimulated_cells.rds
│   │   └── filt_strict.cell_coverage_sites.de_results_unstimulated_cells.rds
│   ├── donor_info_070818.txt
│   ├── donor_info_core.csv
│   ├── donor_neutrality.tsv
│   ├── exome-point-mutations
│   │   ├── high-vs-low-exomes.v62.ft.alldonors-filt_lenient.all_filt_sites.vep_most_severe_csq.txt
│   │   └── high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz
│   ├── human_H_v5p2.rdata
│   ├── human_c2_v5p2.rdata
│   ├── human_c6_v5p2.rdata
│   ├── neg-bin-rsquared-petr.csv
│   ├── neutralitytestr-petr.tsv
| ├── sces
│   │   ├── sce_*.rds
│   ├── selection
│   │   ├── neg-bin-params-fit.csv
│   │   ├── neg-bin-rsquared-fit.csv
│   ├── simulations
│   │   ├── *.filt_lenient.cell_coverage_sites.mult.rds
│   │   ├── *.simulate.rds
│   └── variance_components
│   ├── covar_all.csv
│   ├── donorVar
│   │   ├── *.var_part.var1.csv
│   ├── fit_all_gene_highVar.csv
│   ├── fit_per_gene_highVar.csv
│   ├── gene_info_all.csv
│   └── logcnt_all.csv
├── metadata
│   ├── cell_metadata.csv
│   └── data_processing_metadata.tsv
├── references
│   ├── 1000G_phase1.indels.hg19.sites.vcf.gz
│   ├── GRCh37.p13.genome.ERCC92.fa
│   ├── Homo_sapiens.GRCh37.rel75.cdna.all.ERCC92.fa.gz
│   ├── Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
│   ├── dbsnp_138.hg19.biallelicSNPs.HumanCoreExome12.Top1000ExpressedIpsGenes.Maf0.01.HWE0.0001.HipSci.vcf.gz
│   ├── dbsnp_138.hg19.vcf.gz