update readme

5ccd8cb8 · Ruqian Lyu · ddf2dbfc · 5ccd8cb8 · 5ccd8cb8 · 5ccd8cb8
Commit 5ccd8cb8 authored 3 years ago by Ruqian Lyu
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ fastq reads and mapping reads to the mouse reference genome mm10
 ### Step3 subsample reads
-`sscocaller` is designed to process DNA reads with CB (cell barcode) tags from all single sperm cells stored in one BAM file. And to reduce some processing burdens, the
+`sgcocaller` is designed to process DNA reads with CB (cell barcode) tags from all single sperm cells stored in one BAM file. And to reduce some processing burdens, the
 mapped reads for each sperm were de-duplicated and subsamples to a fraction of 0.5.
 In addition, before merging reads from each sperm, the CB (cell barcode, the SRR ID) tag was appended to each DNA read using [appendCB](https://github.com/ruqianl/appendCB).
@@ -37,12 +37,12 @@ The SNPs were further filtered to only keep the positions which have been called
 as Homo_alternative `CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz` downloaded from the 
 dbsnp database from Mouse Genome Project.
-## Running sscocaller
+## Running sgcocaller
-`run_sscocaller.snk` defines the snakemake rule for running sscocaller for each chromosome.
+`run_sgcocaller.snk` defines the snakemake rule for running sgcocaller for each chromosome.
 ## Downstream crossover analysis with comapr
-https://biocellgen-public.svi.edu.au/hinch-single-sperm-DNA-seq-processing/public/Crossover-identification-with-sscocaller-and-comapr.html
+https://biocellgen-public.svi.edu.au/hinch-single-sperm-DNA-seq-processing/Crossover-identification-with-sgcocaller-and-comapr.html
--- a/analysis/Crossover-identification-with-sscocaller-and-comapr.Rmd
+++ b/analysis/Crossover-identification-with-sscocaller-and-comapr.Rmd
+---
+title: "Crossover-identification-with-sgcocaller-and-comapr"
+output: bookdown::html_document2
+author: Ruqian Lyu
+Date: "05/14/2021"
+bibliography: rejy.bib
+---
+## Introduction
+We will demonstrate the usage of [`sgcocaller`](https://gitlab.svi.edu.au/biocellgen-public/sgcocaller)
+and [`comapr`](https://github.com/ruqianl/comapr) for identifying and visualising 
+crossovers regions from single-sperm DNA sequencing dataset. 
+`sgcocaller`(https://gitlab.svi.edu.au/biocellgen-public/sgcocaller) applies a binomial
+Hidden Markov Model for inferring haplotypes of single sperm genomes from the aligned 
+DNA reads in a BAM file. The inferred haplotype sequence can then be used for 
+calling crossovers by identifying haplotype shifts (see [`comapr`](https://github.com/ruqianl/comapr) ).
+## Downloading example dataset
+An individual mouse genetic map was constructed by DNA sequencing of 217 sperm 
+cells from a F1 hybrid mouse (B6 X CAST) [@Hinch2019-dt]. We will apply `sgcocaller`
+on this dataset and it can be downloaded from GEO (Gene Expression Omnibus) with
+accession [GSE125326](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE125326)
+The slurm submission script `submit-wgetSRAFastqdump.sh` at [repo](https://gitlab.svi.edu.au/biocellgen-public/hinch-single-sperm-DNA-seq-processing.git)
+can be used for downloading the `.sra` files and dumping them into paired fastq 
+files for each sperm (including two bulk sperm samples).
+## Dataset preprocessing
+The preprocessing steps include read filtering and mapping, subsample reads and
+append cell barcodes to reads, merge bams, and find informative SNP markers.
+### 1 Alignment
+The downloaded fastq files for each sperm cells (and the bulk sperm samples) were
+aligned to mouse reference genome mm10. The workflow [`run_alignment.snk`](https://gitlab.svi.edu.au/biocellgen-public/hinch-single-sperm-DNA-seq-processing.git)
+which is a [Snakemake](https://snakemake.readthedocs.io/en/stable/) file  that defined steps/rules including 
+- running [`fastp`](https://github.com/OpenGene/fastp) for filtering reads and adapter trimming
+- running [`minimap2`](https://github.com/lh3/minimap2) for mapping reads to reference genome mm10
+- running GATK MarkDuplicates 
+- running GATK AddOrReplaceReadGroup
+- running sorting and indexing bam files using `samtools`
+### 2 Subsample reads and append CB tag 
+`sgcocaller` is designed to process DNA reads with CB (cell barcode) tags from 
+all single sperm cells stored in one BAM file. And to reduce some processing burdens, the
+mapped reads for each sperm were de-duplicated and subsamples to a fraction of 0.5.
+In addition, before merging reads from each sperm, the CB (cell barcode, the SRR ID)
+tag was appended to each DNA read using [appendCB](https://github.com/ruqianl/appendCB).
+Refer to steps defined in `run_subsample.snk`.
+### 3 Merge single-sperm bam files into one Bam
+`samtools` was used for merge CB-taged reads from all single sperm to one BAM file. See
+`submit-mergeBams.sh`. 
+### 4 Find informative SNP markers
+The informative SNP markers are those SNPs which differ between the two mouse stains
+that were used to generate the F1 hybrid mouse (CAST and BL6). The following steps were
+applied which largely align with what has been described in the original paper [@Hinch2019-dt].
+The bulk sperm sample `SRR8454653` was used for calling de no vo variants for this
+mouse individual using GATK HaplotypeCaller. Only the HET SNPs with `MQ>50` AND 
+`DP>10` AND `DP<80` were kept.
+The SNPs were further filtered to only keep the positions which have been called 
+as Homo_alternative `CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz` downloaded from the 
+dbsnp database from Mouse Genome Project[@Keane2011-be].
+## Running sgcocaller 
+With the DNA reads from each sperm were tagged and merged into one BAM file, we
+can run `sgcocaller` for inferring the haplotype states against the list of 
+informative SNP markers for each chromosome in each sperm. 
+The required input files are:
+```
+mergedBam = "output/alignment/mergedBam/mergedAll.bam",
+vcfRef="output/variants/denovoVar/SRR8454653.mkdup.sort.rg.filter.snps.castVar.vcf.gz",
+bcFile="output/alignment/mergedBam/mergedAll.bam.barcodes.txt"
+```
+`run_sgcocaller.snk` defines the rule for running `sgcocaller` on each chromosome
+for sperm cells. The command line was:
+```
+sgcocaller --threads 4 --chrom "chr1" --chrName chr {input.mergedBam} \
+           {input.vcfRef} {input.bcFile} --maxTotalReads 150 --maxDP 10 \
+           sgcocaller/hinch/hinch_
+```
+## Output files
+The generated output files (for each chromosome, here showing chr1):
+- hinch_chr1_altCount.mtx, sparse matrix file, containing the alternative allele counts (the CAST alleles)
+- hinch_chr1_totalCount.mtx, sparse matrix file, containing the total allele counts (the CAST + BL6 alleles)
+- hinch_chr1_vi.mtx, sparse matrix file, containing the inferred Viterbi state 
+(haplotype state) for each chromosome against the list of SNP markers in "_snpAnnot.txt".
+- hinch_chr1_viSegInfo.txt, txt file, containing the inferred Viterbi state segments information. Details below
+- hinch_chr1_snpAnnot.txt, txt file, containing the row annotations (SNPs) for the above sparse matrices.
+*Note*, the columns in these sparse matrices correspond to cells in the input `bcFile`.
+**_viSegInfo.txt** contains summary statistics of inferred Viterbi state segments. 
+A Viterbi segment is defined by a list of consecutive SNPs having the same Viterbi state.
+The columns in the `*_viSegInfo.txt` are: 
+- ithSperm,
+- Starting SNP position, 
+- Ending SNP position, 
+- the number of SNPs supporting the segment
+- the log likelihood ratio of the Viterbi segment
+- the inferred hidden state
+**log likelihood ratio**
+The loglikelihood ratio is calculated by taking the inferred log likelihood and 
+subtract the reversed log likelihood.
+For example, the segment with two SNPs in the figure below: 
+![](../public/meta_images/ratio_ll.png)
+The numbers in brackets indicating the (alternative allele counts, total allele counts)
+aligned to the two SNP positions.
+The inferred log likelihood can be expressed as:
+$$
+inferredLogll = log(Trans_L)+log(dbinom(3,4,0.9))+log(dbinom(4,4,0.9))+log(Trans_R)
+$$
+The reversed log likelihood is then:
+$$
+reversedLogll = log(noTrans_L)+log(dbinom(3,4,0.1))+log(dbinom(4,4,0.1))+log(noTrans_R)
+$$
+Hence the logllRatio:
+$$
+logllRatio = inferredLogll - reversedLogll
+$$
+A larger `logllRatio` indicating we are more confident with the inferred Viterbi states for markers
+in the segment. 
+## Diagnosic plots
+The output files from `sgcocaller` can be directly parsed through `readHapState` 
+function. However, we have a look at some cell-level metrics and segment-level
+metrics before we parse the `sgcocaller` output files.
+### Per cell QC
+The function `perCellQC` generates cell-level metrics in a data.frame and the 
+plots in a list.
+We first identify the relevant file paths:
+`dataset_dir` is the ouput directory from running `sgcocaller` and 
+`barcodeFile_path` points to the file containing the list of cell barcodes.
+```{r}
+suppressPackageStartupMessages({
+  library(comapr)
+  library(ggplot2)
+  library(dplyr)
+  library(Gviz)
+  library(BiocParallel)
+  library(SummarizedExperiment)
+})
+```
+```{r}
+path_dir <- "/mnt/beegfs/mccarthy/scratch/general/Datasets/Hinch2019/"
+dataset_dir <- paste0(path_dir,"output/sgcocaller/hinch/")
+barcodeFile_path <-paste0(path_dir,"output/alignment/mergedBam/mergedAll.bam.barcodes.txt")
+```
+We can locate the files and list the files to have a look:
+```{r}
+list.files(path=dataset_dir)[1:5]
+```
+```{r}
+BiocParallel::register(BiocParallel::MulticoreParam(workers = 4))
+#BiocParallel::register(BiocParallel::SerialParam())
+```
+Running `perCellChrQC` function to find the cell-level statistics:
+```{r}
+pcqc <- perCellChrQC("hinch",
+                     chroms=paste0("chr",1:19),
+                     path=dataset_dir,
+                     barcodeFile=barcodeFile_path)
+```
+The generated scatter plots for selected chromosomes:
+```{r}
+pcqc$plot
+```
+X-axis plots the number of haplotype transitions (`nCORaw`) for each cell and 
+Y-axis plots the number of total SNPs detected in a cell. A large `nCORaw` might 
+indicate the cell being a diploid cell included by accident or doublets. Cells
+with a lower `totalSNPs` might indicate poor cell quality.
+A data.frame with cell-level metric is also returned:
+```{r}
+pcqc$cellQC
+```
+### perSegChrQC
+`PerSegQC` function visualises statistics of inferred haplotype state segments, 
+which helps decide filtering thresholds for removing crossovers that do not have
+enough evidence by the data and the very close double crossovers which are biologically
+unlikely.
+```{r}
+psqc <- perSegChrQC("hinch",chroms=paste0("chr",1),
+                    path=dataset_dir,
+                    barcodeFile=barcodeFile_path,
+                    maxRawCO = 30)
+psqc+theme_classic()
+```
+## Parsing files using `comapr`
+Now we have some idea about the features of this dataset, we can read in the 
+files from `sgcocaller` which can be directly parsed through `readHapState` 
+function. This function returns a `RangedSummarizedExperiment` object with 
+`rowRanges` containing SNP positions that have ever contributed to crossovers
+in a cell, while `colData` contains the cell annotations such as barcodes.
+The following filters have been applied:
+- Segment level filters:
+  - minSNP=30, the segment that results in one/two crossovers has to have more than
+  30 SNPs of support
+  - minlogllRatio=150, the segment that results in one/two crossovers has to have
+  logllRatio larger than 150.
+  - bpDist=1e5, the segment that results in one/two crossovers has to have base pair 
+  distances larger than 1e5
+- Cell level filters:
+  - maxRawCO, the maximum number of raw crossovers (the number of state transitions
+  from the _vi.mtx file) for a cell
+  - minCellSNP=200, there have to be more than 200 SNPs detected within a cell, 
+  otherwise this cell is removed.
+```{r,eval=FALSE}
+hinch_rse <- readHapState(sampleName = "hinch",
+                          path = dataset_dir,
+                          chrom=paste0("chr",1:19),
+                          barcodeFile = barcodeFile_path,
+                          minSNP = 30, minCellSNP = 200,
+                          maxRawCO = 55,
+                          minlogllRatio = 150,
+                          bpDist = 1e5)
+#saveRDS(hinch_rse,file = "output/outputR/analysisRDS/hinch_rse.rds")
+```
+```{r,include=FALSE}
+#hinch_rse <- readRDS(file="output/outputR/analysisRDS/hinch_rse.rds")
+```
+The `hinch_rse` object:
+```{r}
+hinch_rse
+```
+The `rowRanges` of `hinch_rse`
+```{r}
+SummarizedExperiment::rowRanges(hinch_rse)
+```
+The `assay` slot contains the Viterbi state matrix (SNP by Cell):
+```{r}
+SummarizedExperiment::assay(hinch_rse)[1:5,1:5]
+```
+**Note** this matrix is more `sparse` which only contains the SNPs that 
+contribute to crossovers in cells.
+## Samples group factor
+We have sperm cells from only one individual in this dataset. However, to 
+demonstrate the functions in `comapr` we split the cells into two groups.
+```{r}
+## set the first 80 cells as group1 and rest as group2
+colData(hinch_rse)$sampleGroup <- c(rep("Group1",ceiling(ncol(hinch_rse)/2)),rep("Group2",ncol(hinch_rse)-ceiling(ncol(hinch_rse)/2)))
+colData(hinch_rse)
+```
+*Note* `combineHapState` can be applied of there are multiple sets of outputs from 
+`sgcocaller`.
+## Count crossovers in cells
+The function `countCOs` can then be executed to find the crossover intervals and
+the number of crossovers for each cell within each crossover interval.
+```{r}
+hinch_co_counts <- countCOs(hinch_rse)
+```
+The SNP intervals list in the rowRanges slot:
+```{r}
+rowRanges(hinch_co_counts)
+```
+The `assay` slot of `hinch_co_counts` contains the number of crossovers per cell
+per SNP interval:
+```{r}
+assay(hinch_co_counts)[1:5,1:5]
+```
+```{r}
+mean(colSums(as.matrix(assay(hinch_co_counts))))
+```
+## Plot crossover counts
+To get the number of crossovers per sperm cell, we just need to sum each column
+of the matrix in the `assay` slot. And the `plotCount` function plots the number
+of crossovers for each sperm.
+```{r,fig.width=6,fig.height=4,fig.width=4,dpi=300}
+plotCount(hinch_co_counts)+theme_classic()+
+  scale_color_manual(values = c("all"="red"))+theme_classic(base_size = 25)+
+  theme(axis.text = element_text(size = 25),
+        axis.title = element_text(size =25),
+        legend.position = "none",
+          axis.text.x = element_blank(),
+  axis.ticks.x = element_blank())+ylab("Crossover counts")+xlab("Mouse sperm cells")
+```
+Or we can plotCount for each sample group:
+```{r}
+plotCount(hinch_co_counts, group_by = "sampleGroup")
+```
+In addition, we can also plot the number of crossovers per chromosome (with
+mean number of crossovers and standard error bar):
+```{r,fig.width=14,fig.height=6}
+plotCount(hinch_co_counts, ,by_chr = TRUE)+
+  theme(axis.text.x = element_text(angle=90))
+```
+We can also generate bar plot counts of number of crossovers for each
+chromosome. We can see that  for fewer double crossovers were called for
+smaller chromosomes.
+```{r,include=FALSE,fig.width=16,fig.height =8,dpi=300}
+tmp <- assay(hinch_co_counts)
+tmp$chr <- seqnames(hinch_co_counts)
+tmp <- data.frame(tmp,check.names = FALSE) %>%
+  tidyr::pivot_longer(cols = colnames(hinch_co_counts),
+                      values_to = "COs", names_to = "BC")
+tmp$sampleGroup <- "all"
+p <- tmp %>% group_by(chr,BC) %>% summarise(ChrCOs = sum(COs),
+                    sampleGroup = unique(sampleGroup)) %>% mutate(ChrCOs=as.integer(ChrCOs)) %>%
+  ggplot()+geom_bar(mapping = aes(x=ChrCOs))+facet_wrap(.~chr)
+p+theme_classic(base_size = 25)+facet_wrap(.~chr,ncol=7)+xlab("Crossover counts per chromosome")+ylab("Counts")
+```
+```{r,fig.width=12}
+plotCount(hinch_co_counts,by_chr = TRUE,
+          plot_type ="hist")+theme_classic(base_size = 22)+facet_wrap(.~chr,ncol=8)
+```
+## Plot SNP density track
+The informative SNP markers' distributions along the chromosome affects the 
+crossover resolutions, therefore it is helpful to visualize the SNP density 
+distribution.
+We can generate the SNP density DataTrack with function `getSNPDensityTrack` 
+which returns a `DataTrack` object from `Gviz` package.
+```{r}
+## log=TRUE, the result after aggregation is returned on a log10 scale
+snp_track_chr10 <- getSNPDensityTrack(chrom = "chr10",
+                                      path_loc = dataset_dir,
+                                      sampleName = "hinch",
+                                      nwindow = 80,
+                                      log = TRUE,
+                                      plot_type = "hist")
+```
+```{r,fig.width=8,fig.height=3}
+plotTracks(snp_track_chr10)
+snp_track_chr10 <- setPar(snp_track_chr10,"cex.axis",1.5)
+```
+To change visualisation parameters we can use setPar:
+```{r,fig.height=3,fig.width=8}
+snp_track_chr10 <- setPar(snp_track_chr10,"background.title","firebrick")
+plotTracks(snp_track_chr10)
+```
+Change aggregation function to "sum"
+```{r,fig.width=8,fig.height=3}
+snp_track_chr10 <- setPar(snp_track_chr10,"aggregation","sum")
+plotTracks(snp_track_chr10)
+```
+## Plot Mean DP (read depth) across cells for each chromosome 
+```{r}
+meanDP_track_chr10 <- getMeanDPTrack(chrom = "chr10",
+                            path_loc = dataset_dir,
+                            nwindow = 80,
+                            sampleName ="hinch",
+                            barcodeFile=barcodeFile_path,
+                            plot_type = "hist",
+                            selectedBarcodes = colnames(hinch_co_counts),
+                            snp_track = snp_track_chr10,
+                            log =TRUE)
+```
+```{r,fig.height=3,fig.width=8}
+plotTracks(meanDP_track_chr10)
+```
+## Visulise the raw Alternative Frequency (AF) plot with crossover region highlighted
+for the selected cell
+We can select a cell and visulise the raw Alternative Frequency (AF) plot with the
+called crossover region highlighted.
+```{r}
+cell_af <- getCellAFTrack(chrom = "chr10",
+               path_loc = dataset_dir,
+               sampleName = "hinch",
+               barcodeFile = barcodeFile_path,
+               nwindow = 80,
+               snp_track = snp_track_chr10,
+               cellBarcode = colnames(hinch_co_counts)[1],
+               co_count = hinch_co_counts,
+               chunk = 8000L)
+```
+Generate a Highlight track with the returned list object `cell_af`
+```{r,fig.width=8,fig.height=3}
+changed_bgcolor <- cell_af$af_track
+changed_bgcolor <- setPar(changed_bgcolor, "background.title","#4C5270")
+ht <- HighlightTrack(changed_bgcolor,
+                     range = cell_af$co_range[seqnames(cell_af$co_range)=="chr10"],
+                     chromosome = "chr10")
+plotTracks(ht,cex = 1.5)
+```
+Easily combined with `GenomeAxisTrack` and `IdeogramTrack`
+```{r,fig.width=10,fig.height=4}
+gtrack <- GenomeAxisTrack()
+chr10_ideo <- IdeogramTrack(genome = "mm10", chromosome = "chr10")
+plotTracks(list(chr10_ideo,gtrack, ht),cex.title = 1.2,
+           cex.axis = 1.5,cex = 1.5)
+```
+## Plot SNP density along with crossover counts
+While one can get the DataTracks for the AF and the called crossover regions of
+a set of cells with `getAFTracks`, comapr also offers the function for plotting
+crossover counts for each cell or averaged crossover counts across sample groups.
+```{r,fig.height=8,fig.width=8}
+crossover_count_track <- DataTrack(range = rowRanges(hinch_co_counts),
+                       genome = "mm10",
+                       data = data.frame(assay(hinch_co_counts)),
+                       name = "expected crossover counts across SNP intervals",
+                       type = "heatmap",
+                       groups = hinch_co_counts$sampleGroup,
+                       col = c("red","blue"),
+                       #aggregateGroups = TRUE,
+                       aggregation = mean,
+                       window =80)
+plotTracks(list(gtrack,snp_track_chr10,crossover_count_track),
+           chromosome = "chr10")
+```
+**Chromosome 10**
+```{r,fig.height=5,fig.width=10}
+gtrack <- GenomeAxisTrack(cex=1.5)
+snp_track_chr10 <- setPar(snp_track_chr10,"cex.axis",1.5)
+snp_track_chr10 <- setPar(snp_track_chr10,"cex.title",1.5)
+#plotTracks(snp_track_chr10)
+crossover_count_track <- DataTrack(range = rowRanges(hinch_co_counts),
+                       genome = "mm10",
+                       data = data.frame(assay(hinch_co_counts)),
+                       name = "Mean\ncrossovers",
+                       type =  c("heatmap"),
+                       groups = hinch_co_counts$sampleGroup,
+                       col = c("red","blue"),
+                       aggregateGroups = TRUE,
+                       aggregation = mean,
+                       window =80,cex.title=1.5,
+                       cex.axis =1.5,
+                       background.title = "#F652A0")
+plotTracks(list(gtrack,snp_track_chr10,crossover_count_track),
+           chromosome = "chr10",sizes = c(1,2,2),window =50)
+```
+**Chromosome 1**
+```{r}
+snp_track_chr1 <- getSNPDensityTrack(chrom = "chr1",
+                                      path_loc = dataset_dir,
+                                      sampleName = "hinch",
+                                      nwindow = 80,
+                                      log = FALSE,
+                                      plot_type = "hist")
+snp_track_chr1 <- setPar(snp_track_chr1,"background.title","firebrick")
+plotTracks(list(gtrack,snp_track_chr1,crossover_count_track),
+           chromosome = "chr1")
+```
+## Calculate Genetic distances from crossover rates
+The raw crossover rates estimated from observed crossovers across SNP intervals
+for a group of samples are commonly converted into genetic distances in units
+of centiMorgans via mapping functions such as the Kosambi or the Haldane function.
+- Haldane, cM =−0.5×ln(1−2r)×100,
+- Kosambi, cM=0.25×ln ((1+2r)/(1−2r))×100,
+where r is the recombination fraction.
+The Haldane mapping function adds mathematical adjustments to the recombination 
+fraction. It assumes that crossover events are random and independent along the 
+chromosome, and the number of crossover events between two loci follows a 
+Poisson distribution. Haldane’s mapping function adjusts underestimated crossover
+rate in larger intervals that are likely to have unobserved even number of crossovers.
+Kosambi’s mapping function was derived based on Haldane’s and takes consideration 
+of crossover interference.
+We can calculate the genetic distances with the sperm dataset using 
+`calGeneticDist` function:
+```{r}
+# mapping_fun = "k" for applying the kosambi function
+hinch_dist <- calGeneticDist(hinch_co_counts,
+                             mapping_fun = "k")
+```
+The total genetic distances across the autosomes are then:
+```{r}
+sum(rowData(hinch_dist)$kosambi)
+```
+The genetic distances can also be calculated per sample group. It is useful for
+doing comparative analysis. We can also supply a `bin_size` parameter to get
+the genetic distances calcuated on binned chromosome intervals.
+```{r}
+hinch_dist_groups <- calGeneticDist(hinch_co_counts, 
+                                    group_by = "sampleGroup",
+                                    bin_size = 1e7)
+```
+The genetic distances per group can be derived as:
+```{r}
+Matrix::colSums(as.matrix(mcols(hinch_dist_groups)))
+```
+### Plot genetic distances 
+The genetic distances across chromosome bins can be visualized by `plotGeneticDist`
+function:
+```{r,fig.width=6.5,fig.height=5.5,dpi=300}
+plotGeneticDist(hinch_dist_groups,chr = "chr10")
+plotGeneticDist(hinch_dist_groups,chr = "chr1")+theme_classic(base_size = 25)+
+  scale_color_manual(values = c("Group1"="#122620",
+                                "Group2" = "#D6AD60" ))+
+  theme(legend.position = "top",
+        plot.margin=margin(t =  0.5, r = 1.5, b = 0, l = 0.5, unit = "cm"))+
+  ylab("CentiMorgans per 10 Mb")
+```
+```{r,fig.width=7,fig.height=4.5,dpi=300}
+hinch_dist_groups_non_bin <- calGeneticDist(hinch_co_counts,
+                                            group_by = "sampleGroup")
+hinch_dist_groups_non_bin_gr <- rowRanges(hinch_dist_groups_non_bin)
+mcols(hinch_dist_groups_non_bin_gr) <- mcols(hinch_dist_groups_non_bin_gr)$kosambi
+plotGeneticDist(hinch_dist_groups_non_bin_gr,chr = "chr1",cumulative = TRUE)+
+  theme_classic(base_size = 25)+
+  theme(legend.position = "none",
+                plot.margin=margin(t =  0.5, r = 1.5, b = 0, l = 0.5, unit = "cm"))+
+  scale_color_manual(values = c("Group1"="#122620",
+                                "Group2" = "#D6AD60" ))+
+  ylab("Cumulative\ncentiMorgans")
+```
+```{r,fig.width=6,fig.height=4,dpi=300}
+non_group_gr <- calGeneticDist(hinch_co_counts,bin_size = 1e7)
+#mcols(non_group_gr) <- mcols(non_group_gr)$kosambi
+colnames(mcols(non_group_gr)) <- "allCells"
+plotGeneticDist(non_group_gr,chr = "chr1")+theme_classic(base_size = 25)+
+  theme(legend.position = "none",
+        plot.margin=margin(t =  0.5, r = 1.5, b = 0, l = 0.5, unit = "cm"))+
+  scale_color_manual(values =c("allCells" = "darkblue"))+ylab("CentiMorgans per 10 Mb")
+```
+```{r}
+plotGeneticDist(hinch_dist_groups,chr = c("chr1","chr2"))
+plotGeneticDist(hinch_dist_groups,chr = c("chr15","chr16"))
+```
+We can also do cumulative centiMorgans plots and the whole genome plot:
+```{r,fig.width=8,fig.height=6}
+plotGeneticDist(hinch_dist_groups,chr = c("chr15","chr16"),cumulative = TRUE)
+```
+```{r, fig.width=12,fig.height=5}
+plotWholeGenome(hinch_dist_groups)+theme_classic(base_size = 25)+
+  theme(axis.text.x = element_text(angle = 90),
+        legend.position = "none")+
+  scale_color_manual(values = c("Group1"="#122620",
+                                "Group2" = "#D6AD60" ))+xlab("Cumulative whole genome")+
+  ylab("Cumulative\ncentiMorgans")
+```
+The two sample groups are similar by looking at the cumulative centiMorgan 
+growth curves of the two.
+## Group comparison
+The calculated total genetic distances for the two groups show that Group1 has
+slightly larger total genetic distances resulted from more meiotic crossovers 
+observed.
+To test whether the observed difference is statistically significant, we can 
+apply Bootstrapping test to get confidence intervals of group differences and 
+permutation testing for calculating a significance level.
+**Bootstrapping**
+```{r}
+set.seed(100)
+bootsResult <- bootstrapDist(hinch_co_counts,
+                             group_by = "sampleGroup",
+                             B = 2000)
+```
+The 95% confidence intervals for the group differences by bootstrapping is then:
+```{r}
+quantile(bootsResult,c(0.025,0.975))
+```
+which includes zero thus the observed difference is not significant at level of 
+0.05.
+The histogram of the bootstrapping results:
+```{r,bootstrapExampleFig,fig.width=6.5,fig.height=6,dpi = 300}
+btrp_quantile <- quantile(bootsResult,c(0.025,0.975))
+ggplot()+geom_histogram(mapping = aes(x = bootsResult),
+                        fill = "#7c7b89")+theme_classic(base_size = 18)+
+  geom_vline(mapping = aes(xintercept  = btrp_quantile[1],color = "2.5%"),size =1.5)+
+  geom_vline(mapping = aes(xintercept  = btrp_quantile[2],
+             color = "97.5%"),size =1.5)+
+  scale_x_continuous(breaks = c( -100,-50, 
+                                 round(btrp_quantile[1],2),
+                                 0  , 50 , 75 ,
+                                round(btrp_quantile[2],2),
+                                150,  200) , 
+                     labels = c( -100,-50, 
+                                 round(btrp_quantile[1],2),
+                                 0  , 50 , 75 ,
+                                round(btrp_quantile[2],2),
+                                150,  200))+
+  xlab("Bootstrapping results")+
+  theme_classic(base_size = 25)+
+  scale_color_manual(values = c("97.5%"="#0b7fab","2.5%"="#e9723d"),
+                     name = "Quantile")+ylab("Count")+
+  theme(legend.position = "top",
+        axis.text.x = element_text(angle = 90))
+## seq(from = -100, to = 200, by = 50)
+```
+```{r}
+dat <- with(density(bootsResult), data.frame(x, y))
+ggplot(dat,mapping = aes(x = x,y=y))+geom_line(fill = "#7c7b89")+
+  geom_area(mapping = aes(x = ifelse(x > btrp_quantile[1] &
+                                       x < btrp_quantile[2],
+                                     x, NA)),
+                            fill = "hotpink",alpha=0.3)+theme_classic()+
+  xlab("Bootstrapping results")
+```
+**Permutation**
+We next apply permutation testing using the `permuteDist` function.
+```{r}
+set.seed(2021)
+BiocParallel::register(BiocParallel::SerialParam())
+perms <- permuteDist(hinch_co_counts,group_by = "sampleGroup",
+                     B=2000)
+perms$observed_diff
+perms$nSample
+```
+```{r}
+sum(is.na(perms$permutes))
+```
+We can then use the `statmod::permp()` function [@Phipson2010-xi] to calculate
+an exact p-value for this set of permutation results:
+```{r}
+padjust <- statmod::permp(x = sum(perms$permutes> perms$observed_diff),
+               nperm = 2000,
+               n1 = perms$nSample[1],
+               n2 = perms$nSample[2],
+               twosided = F)
+padjust
+```
+We can see that the calculated p-value was not significant. 
+```{r,permutateResultExample,fig.width=7,fig.height=6,dpi = 300}
+ggplot()+geom_histogram(mapping = aes(x = perms$permutes),
+                        fill = "#7c7b89")+theme_classic()+
+  geom_vline(mapping = aes(xintercept = perms$observed_diff,color ="observed difference"),
+             size = 1.5,)+theme_classic(base_size = 25)+
+  geom_text(mapping = aes(x= 108, y = 180,
+                          label = paste0("p-value = ",round(padjust,2))),
+            size = 7)+xlab("Permutation results")+ 
+  scale_color_manual(values = c("observed difference"="black"),
+                     name = "")+theme(legend.position = "top",
+                                      axis.title.x = element_text(margin = margin(t = 30, r = 20, b = 0, l = 0)))+ylab("Count")
+```
+## Match with crossovers called in the original paper
+The cell's crossover ranges can be found by `getCellCORange` function:
+```{r}
+called_co_cell <- getCellCORange(co_count = hinch_co_counts,
+                                 cellBarcode = colnames(hinch_co_counts)[1])
+```
+We now compare the crossovers called by `ssocaller` and `comapr` with the crossovers
+positions identified in the original paper [@Hinch2019-dt].
+We first collect the crossover regions for each cell:
+```{r}
+called_co_df <- lapply(colnames(hinch_co_counts), function(srr){
+  called_co_cell <- getCellCORange(co_count = hinch_co_counts,
+                                   cellBarcode = srr)
+  called_co_df <- as.data.frame(called_co_cell)
+  called_co_df$SRR <- srr
+  called_co_df
+})
+called_co_df <- do.call("rbind",called_co_df)
+```
+```{r}
+head(called_co_df)
+called_co_df$chr <- called_co_df$seqnames
+```
+The published crossover positions from [@Hinch2019-dt] was downloaded from
+GEO with GSE125326: 
+```{r}
+pub_co <- read.table(file ="references/publishedCrossovers.txt")
+pub_co$chr <- paste0("chr",pub_co$chr)
+pub_co <- pub_co[pub_co$chr!="chrX",]
+merged_df <- merge(called_co_df, pub_co, by.x =c("SRR","chr"), 
+                   suffixes = c(".called",".pub"))
+```
+The number of crossovers called for each cell are highly concordant. The differences
+in number of crossovers called per cell are plotted in histogram down below.
+and we can see that our approach is more conservative. However, one can adjust 
+the filtering thresholds mentioned at the start of this section to be less 
+conservative. 
+```{r codiff,fig.cap="Differences in number of crossovers called by the two methods",fig.width=6,fig.height=5}
+sgcocaller_cos <- called_co_df %>% group_by(SRR) %>% summarise(COs_sgcocaller = n())
+hinch_cos <- pub_co %>% group_by(SRR) %>% summarise(COs_hinch = n())
+sgcocaller_cos %>% left_join(hinch_cos) %>%
+  ggplot()+geom_histogram(mapping = 
+                            aes(x= (COs_sgcocaller - COs_hinch)))+theme_classic()
+sgcocaller_cos %>% left_join(hinch_cos) %>%
+  ggplot()+geom_jitter(mapping = 
+                            aes(x= (COs_sgcocaller - COs_hinch), y ="diff"),
+                       width = 0.3)+
+  theme_classic()
+sgcocaller_cos %>% left_join(hinch_cos) %>% 
+  mutate(diff = as.character(abs(COs_sgcocaller-COs_hinch))) %>% ggplot()+
+  geom_jitter(mapping = aes(x = COs_sgcocaller,
+                            y= COs_hinch,
+                            color = diff))+
+  theme_classic(base_size = 25)+scale_color_manual(
+    values = c("0"="#58c8c9","1"="#48a3a4","2"="#367a7a","4"="#235050")
+  )+theme(legend.position = c(0.5, .96),
+          legend.direction = "horizontal")
+```
+```{r}
+# ggplot(data = hinch_cos,
+#        mapping = aes(x = "Hinch Published",
+#                      y = COs_hinch)) + geom_boxplot()+ geom_jitter()+
+#   theme_classic()
+# 
+# ggplot(data = hinch_cos,
+#        mapping = aes( x = COs_hinch)) + geom_histogram(stat = "count")+
+#   theme_classic()
+# ggplot(data = sgcocaller_cos,
+#        mapping = aes( x = COs_sgcocaller)) + geom_histogram(stat = "count")+
+#   theme_classic()
+```
+The cells with discrepancy in number of crossovers called:
+```{r}
+sgcocaller_cos %>% left_join(hinch_cos) %>% filter(COs_sgcocaller != COs_hinch)
+```
+We can find the cell which has the largest difference in the number of
+crossovers called by the two methods: 
+```{r}
+sgcocaller_cos %>% left_join(hinch_cos) %>% 
+  mutate(diff = (COs_hinch - COs_sgcocaller)) %>% 
+  filter(diff>3)
+```
+We can then list the cells and chrs that have different number of crossovers 
+called by the two methods for cell `SRR8454715`:
+```{r}
+sgcocaller_cos <- called_co_df %>% group_by(SRR,chr) %>% summarise(ChrCOs_sgcocaller = n())
+hinch_cos <- pub_co %>% group_by(SRR,chr) %>% summarise(ChrCOs_hinch = n())
+hinch_cos  %>% full_join(sgcocaller_cos) %>% filter(SRR == "SRR8454715")
+```
+And then plot the alternative allele frequencies for the chromosomes that do
+not have the same number of crossovers called:
+```{r}
+cell <- "SRR8454715"
+SRR8454715_af <- getCellAFTrack(chrom = "chr7",
+                                path_loc = dataset_dir,sampleName = "hinch",
+                                barcodeFile = barcodeFile_path,
+                                co_count = hinch_co_counts,
+                                nwindow = 500,
+                                chunk = 50000L,
+                                cellBarcode = cell)
+```
+```{r}
+pub_co_range <- GRanges(seqnames = pub_co[pub_co$SRR==cell,]$chr,
+                        IRanges(start = pub_co[pub_co$SRR==cell,]$crossover_breakpoint_leftpos,
+                        end = pub_co[pub_co$SRR==cell,]$crossover_breakpoint_rightpos))
+```
+Two extra crossovers on chromosome 17 were called from the original paper:
+```{r,fig.width=12,fig.height=4}
+pub_co_chr7 <- AnnotationTrack(pub_co_range[seqnames(pub_co_range)=="chr7"],
+                              name = "chr7 published CO ranges")
+pub_co_chr7 <- setPar(pub_co_chr7,"background.title","lightblue")
+SRR8454715_af_ht <- HighlightTrack(trackList = list(gtrack, SRR8454715_af$af_track,
+                                                    pub_co_chr7),
+                                   range = SRR8454715_af$co_range[seqnames(SRR8454715_af$co_range)=="chr7"])
+plotTracks(SRR8454715_af_ht)
+```
+```{r}
+SRR8454715_af_chr18 <- getCellAFTrack(chrom = "chr18",
+                                path_loc = dataset_dir,sampleName = "hinch",
+                                barcodeFile = barcodeFile_path,
+                                co_count = hinch_co_counts,
+                                nwindow = 300,
+                                chunk = 50000L,
+                                cellBarcode = cell)
+```
+Two extra crossovers on chromosome 18 were called from the original paper:
+```{r,fig.width=12,fig.height=4}
+pub_co_chr18 <- AnnotationTrack(pub_co_range[seqnames(pub_co_range)=="chr18"],
+                                  name = "chr18 published CO ranges")
+pub_co_chr18 <- setPar(pub_co_chr18,"background.title","lightblue")
+SRR8454715_af_ht <- HighlightTrack(trackList =  
+                                     list(gtrack, SRR8454715_af_chr18$af_track,
+                                          pub_co_chr18),
+                                   range = SRR8454715_af_chr18$co_range[seqnames(SRR8454715_af_chr18$co_range)=="chr18"])
+plotTracks(SRR8454715_af_ht)
+```
+It is justifiable that `comapr` classify these 4 crossovers as false positives.
+## Summary
+We have demonstrated the application of `sgcocaller` to find the haplotype states
+for the list of cell barcodes against the list of informative SNP markers using
+a binomial Hidden Markov Model. We have also showed the functionality of `comapr`
+for downstream analyses including cell quality control, finding crossover intervals, 
+visualising crossover regions, calculating genetic distances and resampling testings
+for sample group comparisons. In addition, the tunable filtering parameters for
+calling crossovers enable `comapr` to be applied for datasets with different 
+coverages.
+## Session info
+```{r}
+sessionInfo()
+```
+## References
--- a/analysis/index.Rmd
+++ b/analysis/index.Rmd
 ---
-title: "index"
+title: "Crossover-identification-with-sscocaller-and-comapr"
 author: "Ruqian Lyu"
 date: "11/28/2021"
 output: html_document

--- a/analysis/rejy.bib
+++ b/analysis/rejy.bib
+@ARTICLE{Phipson2010-xi,
+  title    = "Permutation P-values should never be zero: calculating exact
+              P-values when permutations are randomly drawn",
+  author   = "Phipson, Belinda and Smyth, Gordon K",
+  abstract = "Permutation tests are amongst the most commonly used statistical
+              tools in modern genomic research, a process by which p-values are
+              attached to a test statistic by randomly permuting the sample or
+              gene labels. Yet permutation p-values published in the genomic
+              literature are often computed incorrectly, understated by about
+              1/m, where m is the number of permutations. The same is often
+              true in the more general situation when Monte Carlo simulation is
+              used to assign p-values. Although the p-value understatement is
+              usually small in absolute terms, the implications can be serious
+              in a multiple testing context. The understatement arises from the
+              intuitive but mistaken idea of using permutation to estimate the
+              tail probability of the test statistic. We argue instead that
+              permutation should be viewed as generating an exact discrete null
+              distribution. The relevant literature, some of which is likely to
+              have been relatively inaccessible to the genomic community, is
+              reviewed and summarized. A computation strategy is developed for
+              exact p-values when permutations are randomly drawn. The strategy
+              is valid for any number of permutations and samples. Some simple
+              recommendations are made for the implementation of permutation
+              tests in practice.",
+  journal  = "Stat. Appl. Genet. Mol. Biol.",
+  volume   =  9,
+  pages    = "Article39",
+  month    =  oct,
+  year     =  2010,
+  language = "en"
+}
+@ARTICLE{Keane2011-be,
+  title    = "Mouse genomic variation and its effect on phenotypes and gene
+              regulation",
+  author   = "Keane, Thomas M and Goodstadt, Leo and Danecek, Petr and White,
+              Michael A and Wong, Kim and Yalcin, Binnaz and Heger, Andreas and
+              Agam, Avigail and Slater, Guy and Goodson, Martin and Furlotte,
+              Nicholas A and Eskin, Eleazar and Nell{\aa}ker, Christoffer and
+              Whitley, Helen and Cleak, James and Janowitz, Deborah and
+              Hernandez-Pliego, Polinka and Edwards, Andrew and Belgard, T
+              Grant and Oliver, Peter L and McIntyre, Rebecca E and Bhomra,
+              Amarjit and Nicod, J{\'e}r{\^o}me and Gan, Xiangchao and Yuan,
+              Wei and van der Weyden, Louise and Steward, Charles A and Bala,
+              Sendu and Stalker, Jim and Mott, Richard and Durbin, Richard and
+              Jackson, Ian J and Czechanski, Anne and Guerra-Assun{\c c}{\~a}o,
+              Jos{\'e} Afonso and Donahue, Leah Rae and Reinholdt, Laura G and
+              Payseur, Bret A and Ponting, Chris P and Birney, Ewan and Flint,
+              Jonathan and Adams, David J",
+  abstract = "We report genome sequences of 17 inbred strains of laboratory
+              mice and identify almost ten times more variants than previously
+              known. We use these genomes to explore the phylogenetic history
+              of the laboratory mouse and to examine the functional
+              consequences of allele-specific variation on transcript
+              abundance, revealing that at least 12\% of transcripts show a
+              significant tissue-specific expression bias. By identifying
+              candidate functional variants at 718 quantitative trait loci we
+              show that the molecular nature of functional variants and their
+              position relative to genes vary according to the effect size of
+              the locus. These sequences provide a starting point for a new era
+              in the functional analysis of a key model organism.",
+  journal  = "Nature",
+  volume   =  477,
+  number   =  7364,
+  pages    = "289--294",
+  month    =  sep,
+  year     =  2011,
+  language = "en"
+}
+@ARTICLE{Hinch2019-dt,
+  title    = "Factors influencing meiotic recombination revealed by
+              whole-genome sequencing of single sperm",
+  author   = "Hinch, Anjali G and Zhang, Gang and Becker, Philipp W and
+              Moralli, Daniela and Hinch, Robert and Davies, Benjamin and
+              Bowden, Rory and Donnelly, Peter",
+  abstract = "Recombination is critical to meiosis and evolution, yet many
+              aspects of the physical exchange of DNA via crossovers remain
+              poorly understood. We report an approach for single-cell
+              whole-genome DNA sequencing by which we sequenced 217 individual
+              hybrid mouse sperm, providing a kilobase-resolution genome-wide
+              map of crossovers. Combining this map with molecular assays
+              measuring stages of recombination, we identified factors that
+              affect crossover probability, including PRDM9 binding on the
+              non-initiating template homolog and telomere proximity. These
+              factors also influence the time for sites of
+              recombination-initiating DNA double-strand breaks to find and
+              engage their homologs, with rapidly engaging sites more likely to
+              form crossovers. We show that chromatin environment on the
+              template homolog affects positioning of crossover breakpoints.
+              Our results also offer insights into recombination in the
+              pseudoautosomal region.",
+  journal  = "Science",
+  volume   =  363,
+  number   =  6433,
+  month    =  mar,
+  year     =  2019,
+  language = "en"
+}
--- a/submit-runAlign.sh
+++ b/submit-runAlign.sh
+#!/bin/bash
+#SBATCH --job-name=hinchDNA
+#SBATCH --ntasks=1
+#SBATCH --time=324:10:00
+#SBATCH --mem=1000
+#SBATCH --partition=primary
+/mnt/mcfiles/rlyu/Software/miniconda3.4.7/envs/rlyubase/bin/snakemake -s run_alignment.snk --rerun-incomplete --use-singularity --singularity-args "-B /mnt/mcscratch/:/mnt/mcscratch/,/mnt/beegfs/:/mnt/beegfs/" -j 8 -c "sbatch --partition=gpu --mem {resources.mem} --time=120:00:00 --cpus-per-task={resources.cpus} --error=Snakefile_hinch.log.out"