Skip to content
Snippets Groups Projects
Commit 5ccd8cb8 authored by Ruqian Lyu's avatar Ruqian Lyu
Browse files

update readme

parent ddf2dbfc
No related branches found
No related tags found
No related merge requests found
Pipeline #9325 passed
...@@ -12,7 +12,7 @@ fastq reads and mapping reads to the mouse reference genome mm10 ...@@ -12,7 +12,7 @@ fastq reads and mapping reads to the mouse reference genome mm10
### Step3 subsample reads ### Step3 subsample reads
`sscocaller` is designed to process DNA reads with CB (cell barcode) tags from all single sperm cells stored in one BAM file. And to reduce some processing burdens, the `sgcocaller` is designed to process DNA reads with CB (cell barcode) tags from all single sperm cells stored in one BAM file. And to reduce some processing burdens, the
mapped reads for each sperm were de-duplicated and subsamples to a fraction of 0.5. mapped reads for each sperm were de-duplicated and subsamples to a fraction of 0.5.
In addition, before merging reads from each sperm, the CB (cell barcode, the SRR ID) tag was appended to each DNA read using [appendCB](https://github.com/ruqianl/appendCB). In addition, before merging reads from each sperm, the CB (cell barcode, the SRR ID) tag was appended to each DNA read using [appendCB](https://github.com/ruqianl/appendCB).
...@@ -37,12 +37,12 @@ The SNPs were further filtered to only keep the positions which have been called ...@@ -37,12 +37,12 @@ The SNPs were further filtered to only keep the positions which have been called
as Homo_alternative `CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz` downloaded from the as Homo_alternative `CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz` downloaded from the
dbsnp database from Mouse Genome Project. dbsnp database from Mouse Genome Project.
## Running sscocaller ## Running sgcocaller
`run_sscocaller.snk` defines the snakemake rule for running sscocaller for each chromosome. `run_sgcocaller.snk` defines the snakemake rule for running sgcocaller for each chromosome.
## Downstream crossover analysis with comapr ## Downstream crossover analysis with comapr
https://biocellgen-public.svi.edu.au/hinch-single-sperm-DNA-seq-processing/public/Crossover-identification-with-sscocaller-and-comapr.html https://biocellgen-public.svi.edu.au/hinch-single-sperm-DNA-seq-processing/Crossover-identification-with-sgcocaller-and-comapr.html
---
title: "Crossover-identification-with-sgcocaller-and-comapr"
output: bookdown::html_document2
author: Ruqian Lyu
Date: "05/14/2021"
bibliography: rejy.bib
---
## Introduction
We will demonstrate the usage of [`sgcocaller`](https://gitlab.svi.edu.au/biocellgen-public/sgcocaller)
and [`comapr`](https://github.com/ruqianl/comapr) for identifying and visualising
crossovers regions from single-sperm DNA sequencing dataset.
`sgcocaller`(https://gitlab.svi.edu.au/biocellgen-public/sgcocaller) applies a binomial
Hidden Markov Model for inferring haplotypes of single sperm genomes from the aligned
DNA reads in a BAM file. The inferred haplotype sequence can then be used for
calling crossovers by identifying haplotype shifts (see [`comapr`](https://github.com/ruqianl/comapr) ).
## Downloading example dataset
An individual mouse genetic map was constructed by DNA sequencing of 217 sperm
cells from a F1 hybrid mouse (B6 X CAST) [@Hinch2019-dt]. We will apply `sgcocaller`
on this dataset and it can be downloaded from GEO (Gene Expression Omnibus) with
accession [GSE125326](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE125326)
The slurm submission script `submit-wgetSRAFastqdump.sh` at [repo](https://gitlab.svi.edu.au/biocellgen-public/hinch-single-sperm-DNA-seq-processing.git)
can be used for downloading the `.sra` files and dumping them into paired fastq
files for each sperm (including two bulk sperm samples).
## Dataset preprocessing
The preprocessing steps include read filtering and mapping, subsample reads and
append cell barcodes to reads, merge bams, and find informative SNP markers.
### 1 Alignment
The downloaded fastq files for each sperm cells (and the bulk sperm samples) were
aligned to mouse reference genome mm10. The workflow [`run_alignment.snk`](https://gitlab.svi.edu.au/biocellgen-public/hinch-single-sperm-DNA-seq-processing.git)
which is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) file that defined steps/rules including
- running [`fastp`](https://github.com/OpenGene/fastp) for filtering reads and adapter trimming
- running [`minimap2`](https://github.com/lh3/minimap2) for mapping reads to reference genome mm10
- running GATK MarkDuplicates
- running GATK AddOrReplaceReadGroup
- running sorting and indexing bam files using `samtools`
### 2 Subsample reads and append CB tag
`sgcocaller` is designed to process DNA reads with CB (cell barcode) tags from
all single sperm cells stored in one BAM file. And to reduce some processing burdens, the
mapped reads for each sperm were de-duplicated and subsamples to a fraction of 0.5.
In addition, before merging reads from each sperm, the CB (cell barcode, the SRR ID)
tag was appended to each DNA read using [appendCB](https://github.com/ruqianl/appendCB).
Refer to steps defined in `run_subsample.snk`.
### 3 Merge single-sperm bam files into one Bam
`samtools` was used for merge CB-taged reads from all single sperm to one BAM file. See
`submit-mergeBams.sh`.
### 4 Find informative SNP markers
The informative SNP markers are those SNPs which differ between the two mouse stains
that were used to generate the F1 hybrid mouse (CAST and BL6). The following steps were
applied which largely align with what has been described in the original paper [@Hinch2019-dt].
The bulk sperm sample `SRR8454653` was used for calling de no vo variants for this
mouse individual using GATK HaplotypeCaller. Only the HET SNPs with `MQ>50` AND
`DP>10` AND `DP<80` were kept.
The SNPs were further filtered to only keep the positions which have been called
as Homo_alternative `CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz` downloaded from the
dbsnp database from Mouse Genome Project[@Keane2011-be].
## Running sgcocaller
With the DNA reads from each sperm were tagged and merged into one BAM file, we
can run `sgcocaller` for inferring the haplotype states against the list of
informative SNP markers for each chromosome in each sperm.
The required input files are:
```
mergedBam = "output/alignment/mergedBam/mergedAll.bam",
vcfRef="output/variants/denovoVar/SRR8454653.mkdup.sort.rg.filter.snps.castVar.vcf.gz",
bcFile="output/alignment/mergedBam/mergedAll.bam.barcodes.txt"
```
`run_sgcocaller.snk` defines the rule for running `sgcocaller` on each chromosome
for sperm cells. The command line was:
```
sgcocaller --threads 4 --chrom "chr1" --chrName chr {input.mergedBam} \
{input.vcfRef} {input.bcFile} --maxTotalReads 150 --maxDP 10 \
sgcocaller/hinch/hinch_
```
## Output files
The generated output files (for each chromosome, here showing chr1):
- hinch_chr1_altCount.mtx, sparse matrix file, containing the alternative allele counts (the CAST alleles)
- hinch_chr1_totalCount.mtx, sparse matrix file, containing the total allele counts (the CAST + BL6 alleles)
- hinch_chr1_vi.mtx, sparse matrix file, containing the inferred Viterbi state
(haplotype state) for each chromosome against the list of SNP markers in "_snpAnnot.txt".
- hinch_chr1_viSegInfo.txt, txt file, containing the inferred Viterbi state segments information. Details below
- hinch_chr1_snpAnnot.txt, txt file, containing the row annotations (SNPs) for the above sparse matrices.
*Note*, the columns in these sparse matrices correspond to cells in the input `bcFile`.
**_viSegInfo.txt** contains summary statistics of inferred Viterbi state segments.
A Viterbi segment is defined by a list of consecutive SNPs having the same Viterbi state.
The columns in the `*_viSegInfo.txt` are:
- ithSperm,
- Starting SNP position,
- Ending SNP position,
- the number of SNPs supporting the segment
- the log likelihood ratio of the Viterbi segment
- the inferred hidden state
**log likelihood ratio**
The loglikelihood ratio is calculated by taking the inferred log likelihood and
subtract the reversed log likelihood.
For example, the segment with two SNPs in the figure below:
![](../public/meta_images/ratio_ll.png)
The numbers in brackets indicating the (alternative allele counts, total allele counts)
aligned to the two SNP positions.
The inferred log likelihood can be expressed as:
$$
inferredLogll = log(Trans_L)+log(dbinom(3,4,0.9))+log(dbinom(4,4,0.9))+log(Trans_R)
$$
The reversed log likelihood is then:
$$
reversedLogll = log(noTrans_L)+log(dbinom(3,4,0.1))+log(dbinom(4,4,0.1))+log(noTrans_R)
$$
Hence the logllRatio:
$$
logllRatio = inferredLogll - reversedLogll
$$
A larger `logllRatio` indicating we are more confident with the inferred Viterbi states for markers
in the segment.
## Diagnosic plots
The output files from `sgcocaller` can be directly parsed through `readHapState`
function. However, we have a look at some cell-level metrics and segment-level
metrics before we parse the `sgcocaller` output files.
### Per cell QC
The function `perCellQC` generates cell-level metrics in a data.frame and the
plots in a list.
We first identify the relevant file paths:
`dataset_dir` is the ouput directory from running `sgcocaller` and
`barcodeFile_path` points to the file containing the list of cell barcodes.
```{r}
suppressPackageStartupMessages({
library(comapr)
library(ggplot2)
library(dplyr)
library(Gviz)
library(BiocParallel)
library(SummarizedExperiment)
})
```
```{r}
path_dir <- "/mnt/beegfs/mccarthy/scratch/general/Datasets/Hinch2019/"
dataset_dir <- paste0(path_dir,"output/sgcocaller/hinch/")
barcodeFile_path <-paste0(path_dir,"output/alignment/mergedBam/mergedAll.bam.barcodes.txt")
```
We can locate the files and list the files to have a look:
```{r}
list.files(path=dataset_dir)[1:5]
```
```{r}
BiocParallel::register(BiocParallel::MulticoreParam(workers = 4))
#BiocParallel::register(BiocParallel::SerialParam())
```
Running `perCellChrQC` function to find the cell-level statistics:
```{r}
pcqc <- perCellChrQC("hinch",
chroms=paste0("chr",1:19),
path=dataset_dir,
barcodeFile=barcodeFile_path)
```
The generated scatter plots for selected chromosomes:
```{r}
pcqc$plot
```
X-axis plots the number of haplotype transitions (`nCORaw`) for each cell and
Y-axis plots the number of total SNPs detected in a cell. A large `nCORaw` might
indicate the cell being a diploid cell included by accident or doublets. Cells
with a lower `totalSNPs` might indicate poor cell quality.
A data.frame with cell-level metric is also returned:
```{r}
pcqc$cellQC
```
### perSegChrQC
`PerSegQC` function visualises statistics of inferred haplotype state segments,
which helps decide filtering thresholds for removing crossovers that do not have
enough evidence by the data and the very close double crossovers which are biologically
unlikely.
```{r}
psqc <- perSegChrQC("hinch",chroms=paste0("chr",1),
path=dataset_dir,
barcodeFile=barcodeFile_path,
maxRawCO = 30)
psqc+theme_classic()
```
## Parsing files using `comapr`
Now we have some idea about the features of this dataset, we can read in the
files from `sgcocaller` which can be directly parsed through `readHapState`
function. This function returns a `RangedSummarizedExperiment` object with
`rowRanges` containing SNP positions that have ever contributed to crossovers
in a cell, while `colData` contains the cell annotations such as barcodes.
The following filters have been applied:
- Segment level filters:
- minSNP=30, the segment that results in one/two crossovers has to have more than
30 SNPs of support
- minlogllRatio=150, the segment that results in one/two crossovers has to have
logllRatio larger than 150.
- bpDist=1e5, the segment that results in one/two crossovers has to have base pair
distances larger than 1e5
- Cell level filters:
- maxRawCO, the maximum number of raw crossovers (the number of state transitions
from the _vi.mtx file) for a cell
- minCellSNP=200, there have to be more than 200 SNPs detected within a cell,
otherwise this cell is removed.
```{r,eval=FALSE}
hinch_rse <- readHapState(sampleName = "hinch",
path = dataset_dir,
chrom=paste0("chr",1:19),
barcodeFile = barcodeFile_path,
minSNP = 30, minCellSNP = 200,
maxRawCO = 55,
minlogllRatio = 150,
bpDist = 1e5)
#saveRDS(hinch_rse,file = "output/outputR/analysisRDS/hinch_rse.rds")
```
```{r,include=FALSE}
#hinch_rse <- readRDS(file="output/outputR/analysisRDS/hinch_rse.rds")
```
The `hinch_rse` object:
```{r}
hinch_rse
```
The `rowRanges` of `hinch_rse`
```{r}
SummarizedExperiment::rowRanges(hinch_rse)
```
The `assay` slot contains the Viterbi state matrix (SNP by Cell):
```{r}
SummarizedExperiment::assay(hinch_rse)[1:5,1:5]
```
**Note** this matrix is more `sparse` which only contains the SNPs that
contribute to crossovers in cells.
## Samples group factor
We have sperm cells from only one individual in this dataset. However, to
demonstrate the functions in `comapr` we split the cells into two groups.
```{r}
## set the first 80 cells as group1 and rest as group2
colData(hinch_rse)$sampleGroup <- c(rep("Group1",ceiling(ncol(hinch_rse)/2)),rep("Group2",ncol(hinch_rse)-ceiling(ncol(hinch_rse)/2)))
colData(hinch_rse)
```
*Note* `combineHapState` can be applied of there are multiple sets of outputs from
`sgcocaller`.
## Count crossovers in cells
The function `countCOs` can then be executed to find the crossover intervals and
the number of crossovers for each cell within each crossover interval.
```{r}
hinch_co_counts <- countCOs(hinch_rse)
```
The SNP intervals list in the rowRanges slot:
```{r}
rowRanges(hinch_co_counts)
```
The `assay` slot of `hinch_co_counts` contains the number of crossovers per cell
per SNP interval:
```{r}
assay(hinch_co_counts)[1:5,1:5]
```
```{r}
mean(colSums(as.matrix(assay(hinch_co_counts))))
```
## Plot crossover counts
To get the number of crossovers per sperm cell, we just need to sum each column
of the matrix in the `assay` slot. And the `plotCount` function plots the number
of crossovers for each sperm.
```{r,fig.width=6,fig.height=4,fig.width=4,dpi=300}
plotCount(hinch_co_counts)+theme_classic()+
scale_color_manual(values = c("all"="red"))+theme_classic(base_size = 25)+
theme(axis.text = element_text(size = 25),
axis.title = element_text(size =25),
legend.position = "none",
axis.text.x = element_blank(),
axis.ticks.x = element_blank())+ylab("Crossover counts")+xlab("Mouse sperm cells")
```
Or we can plotCount for each sample group:
```{r}
plotCount(hinch_co_counts, group_by = "sampleGroup")
```
In addition, we can also plot the number of crossovers per chromosome (with
mean number of crossovers and standard error bar):
```{r,fig.width=14,fig.height=6}
plotCount(hinch_co_counts, ,by_chr = TRUE)+
theme(axis.text.x = element_text(angle=90))
```
We can also generate bar plot counts of number of crossovers for each
chromosome. We can see that for fewer double crossovers were called for
smaller chromosomes.
```{r,include=FALSE,fig.width=16,fig.height =8,dpi=300}
tmp <- assay(hinch_co_counts)
tmp$chr <- seqnames(hinch_co_counts)
tmp <- data.frame(tmp,check.names = FALSE) %>%
tidyr::pivot_longer(cols = colnames(hinch_co_counts),
values_to = "COs", names_to = "BC")
tmp$sampleGroup <- "all"
p <- tmp %>% group_by(chr,BC) %>% summarise(ChrCOs = sum(COs),
sampleGroup = unique(sampleGroup)) %>% mutate(ChrCOs=as.integer(ChrCOs)) %>%
ggplot()+geom_bar(mapping = aes(x=ChrCOs))+facet_wrap(.~chr)
p+theme_classic(base_size = 25)+facet_wrap(.~chr,ncol=7)+xlab("Crossover counts per chromosome")+ylab("Counts")
```
```{r,fig.width=12}
plotCount(hinch_co_counts,by_chr = TRUE,
plot_type ="hist")+theme_classic(base_size = 22)+facet_wrap(.~chr,ncol=8)
```
## Plot SNP density track
The informative SNP markers' distributions along the chromosome affects the
crossover resolutions, therefore it is helpful to visualize the SNP density
distribution.
We can generate the SNP density DataTrack with function `getSNPDensityTrack`
which returns a `DataTrack` object from `Gviz` package.
```{r}
## log=TRUE, the result after aggregation is returned on a log10 scale
snp_track_chr10 <- getSNPDensityTrack(chrom = "chr10",
path_loc = dataset_dir,
sampleName = "hinch",
nwindow = 80,
log = TRUE,
plot_type = "hist")
```
```{r,fig.width=8,fig.height=3}
plotTracks(snp_track_chr10)
snp_track_chr10 <- setPar(snp_track_chr10,"cex.axis",1.5)
```
To change visualisation parameters we can use setPar:
```{r,fig.height=3,fig.width=8}
snp_track_chr10 <- setPar(snp_track_chr10,"background.title","firebrick")
plotTracks(snp_track_chr10)
```
Change aggregation function to "sum"
```{r,fig.width=8,fig.height=3}
snp_track_chr10 <- setPar(snp_track_chr10,"aggregation","sum")
plotTracks(snp_track_chr10)
```
## Plot Mean DP (read depth) across cells for each chromosome
```{r}
meanDP_track_chr10 <- getMeanDPTrack(chrom = "chr10",
path_loc = dataset_dir,
nwindow = 80,
sampleName ="hinch",
barcodeFile=barcodeFile_path,
plot_type = "hist",
selectedBarcodes = colnames(hinch_co_counts),
snp_track = snp_track_chr10,
log =TRUE)
```
```{r,fig.height=3,fig.width=8}
plotTracks(meanDP_track_chr10)
```
## Visulise the raw Alternative Frequency (AF) plot with crossover region highlighted
for the selected cell
We can select a cell and visulise the raw Alternative Frequency (AF) plot with the
called crossover region highlighted.
```{r}
cell_af <- getCellAFTrack(chrom = "chr10",
path_loc = dataset_dir,
sampleName = "hinch",
barcodeFile = barcodeFile_path,
nwindow = 80,
snp_track = snp_track_chr10,
cellBarcode = colnames(hinch_co_counts)[1],
co_count = hinch_co_counts,
chunk = 8000L)
```
Generate a Highlight track with the returned list object `cell_af`
```{r,fig.width=8,fig.height=3}
changed_bgcolor <- cell_af$af_track
changed_bgcolor <- setPar(changed_bgcolor, "background.title","#4C5270")
ht <- HighlightTrack(changed_bgcolor,
range = cell_af$co_range[seqnames(cell_af$co_range)=="chr10"],
chromosome = "chr10")
plotTracks(ht,cex = 1.5)
```
Easily combined with `GenomeAxisTrack` and `IdeogramTrack`
```{r,fig.width=10,fig.height=4}
gtrack <- GenomeAxisTrack()
chr10_ideo <- IdeogramTrack(genome = "mm10", chromosome = "chr10")
plotTracks(list(chr10_ideo,gtrack, ht),cex.title = 1.2,
cex.axis = 1.5,cex = 1.5)
```
## Plot SNP density along with crossover counts
While one can get the DataTracks for the AF and the called crossover regions of
a set of cells with `getAFTracks`, comapr also offers the function for plotting
crossover counts for each cell or averaged crossover counts across sample groups.
```{r,fig.height=8,fig.width=8}
crossover_count_track <- DataTrack(range = rowRanges(hinch_co_counts),
genome = "mm10",
data = data.frame(assay(hinch_co_counts)),
name = "expected crossover counts across SNP intervals",
type = "heatmap",
groups = hinch_co_counts$sampleGroup,
col = c("red","blue"),
#aggregateGroups = TRUE,
aggregation = mean,
window =80)
plotTracks(list(gtrack,snp_track_chr10,crossover_count_track),
chromosome = "chr10")
```
**Chromosome 10**
```{r,fig.height=5,fig.width=10}
gtrack <- GenomeAxisTrack(cex=1.5)
snp_track_chr10 <- setPar(snp_track_chr10,"cex.axis",1.5)
snp_track_chr10 <- setPar(snp_track_chr10,"cex.title",1.5)
#plotTracks(snp_track_chr10)
crossover_count_track <- DataTrack(range = rowRanges(hinch_co_counts),
genome = "mm10",
data = data.frame(assay(hinch_co_counts)),
name = "Mean\ncrossovers",
type = c("heatmap"),
groups = hinch_co_counts$sampleGroup,
col = c("red","blue"),
aggregateGroups = TRUE,
aggregation = mean,
window =80,cex.title=1.5,
cex.axis =1.5,
background.title = "#F652A0")
plotTracks(list(gtrack,snp_track_chr10,crossover_count_track),
chromosome = "chr10",sizes = c(1,2,2),window =50)
```
**Chromosome 1**
```{r}
snp_track_chr1 <- getSNPDensityTrack(chrom = "chr1",
path_loc = dataset_dir,
sampleName = "hinch",
nwindow = 80,
log = FALSE,
plot_type = "hist")
snp_track_chr1 <- setPar(snp_track_chr1,"background.title","firebrick")
plotTracks(list(gtrack,snp_track_chr1,crossover_count_track),
chromosome = "chr1")
```
## Calculate Genetic distances from crossover rates
The raw crossover rates estimated from observed crossovers across SNP intervals
for a group of samples are commonly converted into genetic distances in units
of centiMorgans via mapping functions such as the Kosambi or the Haldane function.
- Haldane, cM =−0.5×ln(1−2r)×100,
- Kosambi, cM=0.25×ln ((1+2r)/(1−2r))×100,
where r is the recombination fraction.
The Haldane mapping function adds mathematical adjustments to the recombination
fraction. It assumes that crossover events are random and independent along the
chromosome, and the number of crossover events between two loci follows a
Poisson distribution. Haldane’s mapping function adjusts underestimated crossover
rate in larger intervals that are likely to have unobserved even number of crossovers.
Kosambi’s mapping function was derived based on Haldane’s and takes consideration
of crossover interference.
We can calculate the genetic distances with the sperm dataset using
`calGeneticDist` function:
```{r}
# mapping_fun = "k" for applying the kosambi function
hinch_dist <- calGeneticDist(hinch_co_counts,
mapping_fun = "k")
```
The total genetic distances across the autosomes are then:
```{r}
sum(rowData(hinch_dist)$kosambi)
```
The genetic distances can also be calculated per sample group. It is useful for
doing comparative analysis. We can also supply a `bin_size` parameter to get
the genetic distances calcuated on binned chromosome intervals.
```{r}
hinch_dist_groups <- calGeneticDist(hinch_co_counts,
group_by = "sampleGroup",
bin_size = 1e7)
```
The genetic distances per group can be derived as:
```{r}
Matrix::colSums(as.matrix(mcols(hinch_dist_groups)))
```
### Plot genetic distances
The genetic distances across chromosome bins can be visualized by `plotGeneticDist`
function:
```{r,fig.width=6.5,fig.height=5.5,dpi=300}
plotGeneticDist(hinch_dist_groups,chr = "chr10")
plotGeneticDist(hinch_dist_groups,chr = "chr1")+theme_classic(base_size = 25)+
scale_color_manual(values = c("Group1"="#122620",
"Group2" = "#D6AD60" ))+
theme(legend.position = "top",
plot.margin=margin(t = 0.5, r = 1.5, b = 0, l = 0.5, unit = "cm"))+
ylab("CentiMorgans per 10 Mb")
```
```{r,fig.width=7,fig.height=4.5,dpi=300}
hinch_dist_groups_non_bin <- calGeneticDist(hinch_co_counts,
group_by = "sampleGroup")
hinch_dist_groups_non_bin_gr <- rowRanges(hinch_dist_groups_non_bin)
mcols(hinch_dist_groups_non_bin_gr) <- mcols(hinch_dist_groups_non_bin_gr)$kosambi
plotGeneticDist(hinch_dist_groups_non_bin_gr,chr = "chr1",cumulative = TRUE)+
theme_classic(base_size = 25)+
theme(legend.position = "none",
plot.margin=margin(t = 0.5, r = 1.5, b = 0, l = 0.5, unit = "cm"))+
scale_color_manual(values = c("Group1"="#122620",
"Group2" = "#D6AD60" ))+
ylab("Cumulative\ncentiMorgans")
```
```{r,fig.width=6,fig.height=4,dpi=300}
non_group_gr <- calGeneticDist(hinch_co_counts,bin_size = 1e7)
#mcols(non_group_gr) <- mcols(non_group_gr)$kosambi
colnames(mcols(non_group_gr)) <- "allCells"
plotGeneticDist(non_group_gr,chr = "chr1")+theme_classic(base_size = 25)+
theme(legend.position = "none",
plot.margin=margin(t = 0.5, r = 1.5, b = 0, l = 0.5, unit = "cm"))+
scale_color_manual(values =c("allCells" = "darkblue"))+ylab("CentiMorgans per 10 Mb")
```
```{r}
plotGeneticDist(hinch_dist_groups,chr = c("chr1","chr2"))
plotGeneticDist(hinch_dist_groups,chr = c("chr15","chr16"))
```
We can also do cumulative centiMorgans plots and the whole genome plot:
```{r,fig.width=8,fig.height=6}
plotGeneticDist(hinch_dist_groups,chr = c("chr15","chr16"),cumulative = TRUE)
```
```{r, fig.width=12,fig.height=5}
plotWholeGenome(hinch_dist_groups)+theme_classic(base_size = 25)+
theme(axis.text.x = element_text(angle = 90),
legend.position = "none")+
scale_color_manual(values = c("Group1"="#122620",
"Group2" = "#D6AD60" ))+xlab("Cumulative whole genome")+
ylab("Cumulative\ncentiMorgans")
```
The two sample groups are similar by looking at the cumulative centiMorgan
growth curves of the two.
## Group comparison
The calculated total genetic distances for the two groups show that Group1 has
slightly larger total genetic distances resulted from more meiotic crossovers
observed.
To test whether the observed difference is statistically significant, we can
apply Bootstrapping test to get confidence intervals of group differences and
permutation testing for calculating a significance level.
**Bootstrapping**
```{r}
set.seed(100)
bootsResult <- bootstrapDist(hinch_co_counts,
group_by = "sampleGroup",
B = 2000)
```
The 95% confidence intervals for the group differences by bootstrapping is then:
```{r}
quantile(bootsResult,c(0.025,0.975))
```
which includes zero thus the observed difference is not significant at level of
0.05.
The histogram of the bootstrapping results:
```{r,bootstrapExampleFig,fig.width=6.5,fig.height=6,dpi = 300}
btrp_quantile <- quantile(bootsResult,c(0.025,0.975))
ggplot()+geom_histogram(mapping = aes(x = bootsResult),
fill = "#7c7b89")+theme_classic(base_size = 18)+
geom_vline(mapping = aes(xintercept = btrp_quantile[1],color = "2.5%"),size =1.5)+
geom_vline(mapping = aes(xintercept = btrp_quantile[2],
color = "97.5%"),size =1.5)+
scale_x_continuous(breaks = c( -100,-50,
round(btrp_quantile[1],2),
0 , 50 , 75 ,
round(btrp_quantile[2],2),
150, 200) ,
labels = c( -100,-50,
round(btrp_quantile[1],2),
0 , 50 , 75 ,
round(btrp_quantile[2],2),
150, 200))+
xlab("Bootstrapping results")+
theme_classic(base_size = 25)+
scale_color_manual(values = c("97.5%"="#0b7fab","2.5%"="#e9723d"),
name = "Quantile")+ylab("Count")+
theme(legend.position = "top",
axis.text.x = element_text(angle = 90))
## seq(from = -100, to = 200, by = 50)
```
```{r}
dat <- with(density(bootsResult), data.frame(x, y))
ggplot(dat,mapping = aes(x = x,y=y))+geom_line(fill = "#7c7b89")+
geom_area(mapping = aes(x = ifelse(x > btrp_quantile[1] &
x < btrp_quantile[2],
x, NA)),
fill = "hotpink",alpha=0.3)+theme_classic()+
xlab("Bootstrapping results")
```
**Permutation**
We next apply permutation testing using the `permuteDist` function.
```{r}
set.seed(2021)
BiocParallel::register(BiocParallel::SerialParam())
perms <- permuteDist(hinch_co_counts,group_by = "sampleGroup",
B=2000)
perms$observed_diff
perms$nSample
```
```{r}
sum(is.na(perms$permutes))
```
We can then use the `statmod::permp()` function [@Phipson2010-xi] to calculate
an exact p-value for this set of permutation results:
```{r}
padjust <- statmod::permp(x = sum(perms$permutes> perms$observed_diff),
nperm = 2000,
n1 = perms$nSample[1],
n2 = perms$nSample[2],
twosided = F)
padjust
```
We can see that the calculated p-value was not significant.
```{r,permutateResultExample,fig.width=7,fig.height=6,dpi = 300}
ggplot()+geom_histogram(mapping = aes(x = perms$permutes),
fill = "#7c7b89")+theme_classic()+
geom_vline(mapping = aes(xintercept = perms$observed_diff,color ="observed difference"),
size = 1.5,)+theme_classic(base_size = 25)+
geom_text(mapping = aes(x= 108, y = 180,
label = paste0("p-value = ",round(padjust,2))),
size = 7)+xlab("Permutation results")+
scale_color_manual(values = c("observed difference"="black"),
name = "")+theme(legend.position = "top",
axis.title.x = element_text(margin = margin(t = 30, r = 20, b = 0, l = 0)))+ylab("Count")
```
## Match with crossovers called in the original paper
The cell's crossover ranges can be found by `getCellCORange` function:
```{r}
called_co_cell <- getCellCORange(co_count = hinch_co_counts,
cellBarcode = colnames(hinch_co_counts)[1])
```
We now compare the crossovers called by `ssocaller` and `comapr` with the crossovers
positions identified in the original paper [@Hinch2019-dt].
We first collect the crossover regions for each cell:
```{r}
called_co_df <- lapply(colnames(hinch_co_counts), function(srr){
called_co_cell <- getCellCORange(co_count = hinch_co_counts,
cellBarcode = srr)
called_co_df <- as.data.frame(called_co_cell)
called_co_df$SRR <- srr
called_co_df
})
called_co_df <- do.call("rbind",called_co_df)
```
```{r}
head(called_co_df)
called_co_df$chr <- called_co_df$seqnames
```
The published crossover positions from [@Hinch2019-dt] was downloaded from
GEO with GSE125326:
```{r}
pub_co <- read.table(file ="references/publishedCrossovers.txt")
pub_co$chr <- paste0("chr",pub_co$chr)
pub_co <- pub_co[pub_co$chr!="chrX",]
merged_df <- merge(called_co_df, pub_co, by.x =c("SRR","chr"),
suffixes = c(".called",".pub"))
```
The number of crossovers called for each cell are highly concordant. The differences
in number of crossovers called per cell are plotted in histogram down below.
and we can see that our approach is more conservative. However, one can adjust
the filtering thresholds mentioned at the start of this section to be less
conservative.
```{r codiff,fig.cap="Differences in number of crossovers called by the two methods",fig.width=6,fig.height=5}
sgcocaller_cos <- called_co_df %>% group_by(SRR) %>% summarise(COs_sgcocaller = n())
hinch_cos <- pub_co %>% group_by(SRR) %>% summarise(COs_hinch = n())
sgcocaller_cos %>% left_join(hinch_cos) %>%
ggplot()+geom_histogram(mapping =
aes(x= (COs_sgcocaller - COs_hinch)))+theme_classic()
sgcocaller_cos %>% left_join(hinch_cos) %>%
ggplot()+geom_jitter(mapping =
aes(x= (COs_sgcocaller - COs_hinch), y ="diff"),
width = 0.3)+
theme_classic()
sgcocaller_cos %>% left_join(hinch_cos) %>%
mutate(diff = as.character(abs(COs_sgcocaller-COs_hinch))) %>% ggplot()+
geom_jitter(mapping = aes(x = COs_sgcocaller,
y= COs_hinch,
color = diff))+
theme_classic(base_size = 25)+scale_color_manual(
values = c("0"="#58c8c9","1"="#48a3a4","2"="#367a7a","4"="#235050")
)+theme(legend.position = c(0.5, .96),
legend.direction = "horizontal")
```
```{r}
# ggplot(data = hinch_cos,
# mapping = aes(x = "Hinch Published",
# y = COs_hinch)) + geom_boxplot()+ geom_jitter()+
# theme_classic()
#
# ggplot(data = hinch_cos,
# mapping = aes( x = COs_hinch)) + geom_histogram(stat = "count")+
# theme_classic()
# ggplot(data = sgcocaller_cos,
# mapping = aes( x = COs_sgcocaller)) + geom_histogram(stat = "count")+
# theme_classic()
```
The cells with discrepancy in number of crossovers called:
```{r}
sgcocaller_cos %>% left_join(hinch_cos) %>% filter(COs_sgcocaller != COs_hinch)
```
We can find the cell which has the largest difference in the number of
crossovers called by the two methods:
```{r}
sgcocaller_cos %>% left_join(hinch_cos) %>%
mutate(diff = (COs_hinch - COs_sgcocaller)) %>%
filter(diff>3)
```
We can then list the cells and chrs that have different number of crossovers
called by the two methods for cell `SRR8454715`:
```{r}
sgcocaller_cos <- called_co_df %>% group_by(SRR,chr) %>% summarise(ChrCOs_sgcocaller = n())
hinch_cos <- pub_co %>% group_by(SRR,chr) %>% summarise(ChrCOs_hinch = n())
hinch_cos %>% full_join(sgcocaller_cos) %>% filter(SRR == "SRR8454715")
```
And then plot the alternative allele frequencies for the chromosomes that do
not have the same number of crossovers called:
```{r}
cell <- "SRR8454715"
SRR8454715_af <- getCellAFTrack(chrom = "chr7",
path_loc = dataset_dir,sampleName = "hinch",
barcodeFile = barcodeFile_path,
co_count = hinch_co_counts,
nwindow = 500,
chunk = 50000L,
cellBarcode = cell)
```
```{r}
pub_co_range <- GRanges(seqnames = pub_co[pub_co$SRR==cell,]$chr,
IRanges(start = pub_co[pub_co$SRR==cell,]$crossover_breakpoint_leftpos,
end = pub_co[pub_co$SRR==cell,]$crossover_breakpoint_rightpos))
```
Two extra crossovers on chromosome 17 were called from the original paper:
```{r,fig.width=12,fig.height=4}
pub_co_chr7 <- AnnotationTrack(pub_co_range[seqnames(pub_co_range)=="chr7"],
name = "chr7 published CO ranges")
pub_co_chr7 <- setPar(pub_co_chr7,"background.title","lightblue")
SRR8454715_af_ht <- HighlightTrack(trackList = list(gtrack, SRR8454715_af$af_track,
pub_co_chr7),
range = SRR8454715_af$co_range[seqnames(SRR8454715_af$co_range)=="chr7"])
plotTracks(SRR8454715_af_ht)
```
```{r}
SRR8454715_af_chr18 <- getCellAFTrack(chrom = "chr18",
path_loc = dataset_dir,sampleName = "hinch",
barcodeFile = barcodeFile_path,
co_count = hinch_co_counts,
nwindow = 300,
chunk = 50000L,
cellBarcode = cell)
```
Two extra crossovers on chromosome 18 were called from the original paper:
```{r,fig.width=12,fig.height=4}
pub_co_chr18 <- AnnotationTrack(pub_co_range[seqnames(pub_co_range)=="chr18"],
name = "chr18 published CO ranges")
pub_co_chr18 <- setPar(pub_co_chr18,"background.title","lightblue")
SRR8454715_af_ht <- HighlightTrack(trackList =
list(gtrack, SRR8454715_af_chr18$af_track,
pub_co_chr18),
range = SRR8454715_af_chr18$co_range[seqnames(SRR8454715_af_chr18$co_range)=="chr18"])
plotTracks(SRR8454715_af_ht)
```
It is justifiable that `comapr` classify these 4 crossovers as false positives.
## Summary
We have demonstrated the application of `sgcocaller` to find the haplotype states
for the list of cell barcodes against the list of informative SNP markers using
a binomial Hidden Markov Model. We have also showed the functionality of `comapr`
for downstream analyses including cell quality control, finding crossover intervals,
visualising crossover regions, calculating genetic distances and resampling testings
for sample group comparisons. In addition, the tunable filtering parameters for
calling crossovers enable `comapr` to be applied for datasets with different
coverages.
## Session info
```{r}
sessionInfo()
```
## References
--- ---
title: "index" title: "Crossover-identification-with-sscocaller-and-comapr"
author: "Ruqian Lyu" author: "Ruqian Lyu"
date: "11/28/2021" date: "11/28/2021"
output: html_document output: html_document
......
@ARTICLE{Phipson2010-xi,
title = "Permutation P-values should never be zero: calculating exact
P-values when permutations are randomly drawn",
author = "Phipson, Belinda and Smyth, Gordon K",
abstract = "Permutation tests are amongst the most commonly used statistical
tools in modern genomic research, a process by which p-values are
attached to a test statistic by randomly permuting the sample or
gene labels. Yet permutation p-values published in the genomic
literature are often computed incorrectly, understated by about
1/m, where m is the number of permutations. The same is often
true in the more general situation when Monte Carlo simulation is
used to assign p-values. Although the p-value understatement is
usually small in absolute terms, the implications can be serious
in a multiple testing context. The understatement arises from the
intuitive but mistaken idea of using permutation to estimate the
tail probability of the test statistic. We argue instead that
permutation should be viewed as generating an exact discrete null
distribution. The relevant literature, some of which is likely to
have been relatively inaccessible to the genomic community, is
reviewed and summarized. A computation strategy is developed for
exact p-values when permutations are randomly drawn. The strategy
is valid for any number of permutations and samples. Some simple
recommendations are made for the implementation of permutation
tests in practice.",
journal = "Stat. Appl. Genet. Mol. Biol.",
volume = 9,
pages = "Article39",
month = oct,
year = 2010,
language = "en"
}
@ARTICLE{Keane2011-be,
title = "Mouse genomic variation and its effect on phenotypes and gene
regulation",
author = "Keane, Thomas M and Goodstadt, Leo and Danecek, Petr and White,
Michael A and Wong, Kim and Yalcin, Binnaz and Heger, Andreas and
Agam, Avigail and Slater, Guy and Goodson, Martin and Furlotte,
Nicholas A and Eskin, Eleazar and Nell{\aa}ker, Christoffer and
Whitley, Helen and Cleak, James and Janowitz, Deborah and
Hernandez-Pliego, Polinka and Edwards, Andrew and Belgard, T
Grant and Oliver, Peter L and McIntyre, Rebecca E and Bhomra,
Amarjit and Nicod, J{\'e}r{\^o}me and Gan, Xiangchao and Yuan,
Wei and van der Weyden, Louise and Steward, Charles A and Bala,
Sendu and Stalker, Jim and Mott, Richard and Durbin, Richard and
Jackson, Ian J and Czechanski, Anne and Guerra-Assun{\c c}{\~a}o,
Jos{\'e} Afonso and Donahue, Leah Rae and Reinholdt, Laura G and
Payseur, Bret A and Ponting, Chris P and Birney, Ewan and Flint,
Jonathan and Adams, David J",
abstract = "We report genome sequences of 17 inbred strains of laboratory
mice and identify almost ten times more variants than previously
known. We use these genomes to explore the phylogenetic history
of the laboratory mouse and to examine the functional
consequences of allele-specific variation on transcript
abundance, revealing that at least 12\% of transcripts show a
significant tissue-specific expression bias. By identifying
candidate functional variants at 718 quantitative trait loci we
show that the molecular nature of functional variants and their
position relative to genes vary according to the effect size of
the locus. These sequences provide a starting point for a new era
in the functional analysis of a key model organism.",
journal = "Nature",
volume = 477,
number = 7364,
pages = "289--294",
month = sep,
year = 2011,
language = "en"
}
@ARTICLE{Hinch2019-dt,
title = "Factors influencing meiotic recombination revealed by
whole-genome sequencing of single sperm",
author = "Hinch, Anjali G and Zhang, Gang and Becker, Philipp W and
Moralli, Daniela and Hinch, Robert and Davies, Benjamin and
Bowden, Rory and Donnelly, Peter",
abstract = "Recombination is critical to meiosis and evolution, yet many
aspects of the physical exchange of DNA via crossovers remain
poorly understood. We report an approach for single-cell
whole-genome DNA sequencing by which we sequenced 217 individual
hybrid mouse sperm, providing a kilobase-resolution genome-wide
map of crossovers. Combining this map with molecular assays
measuring stages of recombination, we identified factors that
affect crossover probability, including PRDM9 binding on the
non-initiating template homolog and telomere proximity. These
factors also influence the time for sites of
recombination-initiating DNA double-strand breaks to find and
engage their homologs, with rapidly engaging sites more likely to
form crossovers. We show that chromatin environment on the
template homolog affects positioning of crossover breakpoints.
Our results also offer insights into recombination in the
pseudoautosomal region.",
journal = "Science",
volume = 363,
number = 6433,
month = mar,
year = 2019,
language = "en"
}
#!/bin/bash
#SBATCH --job-name=hinchDNA
#SBATCH --ntasks=1
#SBATCH --time=324:10:00
#SBATCH --mem=1000
#SBATCH --partition=primary
/mnt/mcfiles/rlyu/Software/miniconda3.4.7/envs/rlyubase/bin/snakemake -s run_alignment.snk --rerun-incomplete --use-singularity --singularity-args "-B /mnt/mcscratch/:/mnt/mcscratch/,/mnt/beegfs/:/mnt/beegfs/" -j 8 -c "sbatch --partition=gpu --mem {resources.mem} --time=120:00:00 --cpus-per-task={resources.cpus} --error=Snakefile_hinch.log.out"
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment