Skip to content
Snippets Groups Projects
README.md 11.00 KiB

sgcocaller: Calling crossovers from single-gamete DNA sequencing reads

Two main modules are available in sgcocaller, phasing and crossover calling. sgcocaller phase generates donor haplotype from single-gamete DNA sequencing data. sgcocaller xo and sgcocaller sxo both call crossovers in each single gamete using a Hidden Markov Model.

sgcocaller processes DNA reads from each single gamete in the aligned and sorted BAM file for inferring the haplotypes of single gamete genomes that can later be used for identifying crossovers through finding haplotype shifts (see comapr).

It takes the large bam file which contains all aligned DNA reads from gamete cells and summarizes allele counts for the provided informative SNP markers. While counting the alleles, the Viterbi algorithm is executed for finding the haplotype sequence for the list of SNP markers.

Publication

BioRxiv paper

sgcocaller_fig

Hidden Markov Model configuration

HMM_fig

Inputs

  • Bam, sorted and index bam file which contains DNA reads of single gamete cells with CB tag, eg. from single-cell preprocessing pipeline (CellRanger, STARSolo, etc)
  • VCF, variant call file that contains the list of informative SNPs (phased or unphased SNPs)
  • barcodeFile, the list of cell barcodes of the gametes

Main outputs

  • *.mtx
    • sparse matrix with columns corresponding to the list of gamete cell barcodes and rows corresponding to the list of SNP positions in VCF file
    • {sample}_chr1_altCount.mtx, a sparse mtx with entries representing alternative allele counts
    • {sample}_chr1_totalCount.mtx, a sparse mtx with entries representing total allele counts
    • {sample}_chr1_vi.mtx, a sparse mtx with entries representing inferred viterbi state (haplotype state)
  • {sample}_chr1_snpAnnot.txt, the SNP positions and allele
  • {sample}_chr1_viSegInfo.txt, statistics of the Viterbi state segments in text file format. It contains consecutive viterbi states for each chromosome with statistics including, starting SNP position, ending SNP position, the number of SNPs supporting the segment, the log likelihood ratio of the viterbi segment and the inferred hidden state.

Usage


Usage:
      sgcocaller phase [options] <BAM> <VCF> <barcodeFile> <out_prefix>
      sgcocaller swphase [options] <gtMtxFile> <phasedSnpAnnotFile> <referenceVCF> <out_prefix>
      sgcocaller sxo [options] <SNPPhaseFile> <phaseOutputPrefix> <barcodeFile> <out_prefix>
      sgcocaller xo [options] <BAM> <VCF> <barcodeFile> <out_prefix>

Arguments:

  <BAM> the read alignment file with records of single-cell DNA reads

  <VCF> the variant call file with records of SNPs with hetSNPs phased in the form of REF/ALT or the GT field

  <barcodeFile> the text file containing the list of cell barcodes

  <out_prefix>  the prefix of output files

Options:
  -t --threads <threads>  number of BAM decompression threads [default: 4]
  --barcodeTag <barcodeTag>  the cell barcode tag in BAM [default: CB]
  --minMAPQ <mapq>  Minimum MAPQ for read filtering [default: 20]
  --baseq <baseq>  base quality threshold for a base to be used for counting [default: 13]
  --chrom <chrom>  the selected chromsome (whole genome if not supplied,separate by comma if multiple chroms)
  --minDP <minDP>  the minimum DP for a SNP to be included in the output file [default: 1]
  --maxDP <maxDP>  the maximum DP for a SNP to be included in the output file [default: 5]
  --maxTotalDP <maxTotalDP>  the maximum DP across all barcodes for a SNP to be included in the output file [default: 25]
  --minTotalDP <minTotalDP>  the minimum DP across all barcodes for a SNP to be included in the output file [default: 10]
  --minSNPdepth <minSNPdepth>  the minimum depth of cell coverage for a SNP to be includes in generated genotype matrix file [default: 1]
  --thetaREF <thetaREF>  the theta for the binomial distribution conditioning on hidden state being REF [default: 0.1]
  --thetaALT <thetaALT>  the theta for the binomial distribution conditioning on hidden state being ALT [default: 0.9]
  --cmPmb <cmPmb>  the average centiMorgan distances per megabases default 0.1 cm per Mb [default: 0.1]
  --phased  the input VCF for calling crossovers contains the phased GT of heterozygous SNPs
  --outvcf  generate the output in vcf format (phase)
  --templateCell <templateCell>  the cell's genotype to be used a template cell, as the cell's index (0-starting) in the barcode file, default as not supplied [default: -1]
  --maxDissim <maxDissim>  the maximum dissimilarity for a pair of cell to be selected as potential template cells due to not having crossovers in either cell [default: 0.0099]
  --maxExpand <maxExpand>  the maximum number of iterations to look for locally coexisting positions for inferring missing SNPs in template haplotype sequence [default: 1000]
  --posteriorProbMin <posteriorProbMin>  the min posterior probability when inferring missing SNPs [default: 0.99]
  --lookBeyondSnps <lookBeyondSnps>  the number of local SNPs to use when finding switch positions [default: 25]
  --minSwitchScore <minSwitchScore>  the minimum switch score for a site to be identified as having a switch error in the inferred haplotype and corrected [default: 50.0]
  --minPositiveSwitchScores <minPositiveSwitchScores>  the min number of continuing SNPs with positive switch scores to do switch error correction [default: 8]
  -h --help  show help


  Examples
      ./sgcocaller phase --threads 4 --barcodeTag CB --chrom 'chr1' --minDP 2 possorted_bam hetSNPs.vcf.gz barcodeFile.txt outdir/path/withPrefix_ 
      ./sgcocaller xo --threads 4 possorted_bam.bam dbSNP-hetSNPs.vcf.gz barcodeFile.tsv ./percell/ccsnp
      ./sgcocaller sxo phaseOutputPrefix barcodeFile.tsv ./percell/ccsnp

Run for single-cell DNA sequenced gametes with donor haplotype known

In cases where the haplotype of the donors is known (i.e the list of hetSNPs has been phased, or the donor is an F1 hybrid sample), sgcocaller xo can be called directly:

For F1 hybrid donor:

      ./sgcocaller xo --threads 4 possorted_bam.bam --cmPmb 1e-3  dbSNP-hetSNPs.vcf.gz barcodeFile.tsv ./percell/ccsnp

cmPmb controls the transition probabilites in the HMM model and it is dependent on the physical distances of the two markers[1]. It is recommended to set a smaller transition probability for sparse and low coverage dataset.

dbSNP-hetSNPs.vcf.gz contains the list of hetSNPs, and the list of REF alleles (or the list of ALT alleles) is the haplotype of the F1 hybrid donor.

For phased hetSNPs provided in VCF file, the GT field is used to extract the haplotype of the donor and sgcocaller xo can be called directly:

      ./sgcocaller xo --threads 4 --cmPmb 1e-3  --phased possorted_bam.bam phased-hetSNPs.vcf.gz barcodeFile.tsv ./percell/ccsnp

Run for single-cell DNA sequenced gametes with donor haplotype unknown

When hetSNPs' phase or haplotype of the donor is unknown, sgcocaller phase module can produce the donor haplotype using the single-gamate DNA sequencing dataset.

      ./sgcocaller phase --threads 4 --barcodeTag CB --chrom 'chr1' --minDP 2 possorted_bam hetSNPs.vcf.gz barcodeFile.txt outdir/path/withPrefix_ 

when specifying --outvcf with above command, ./sgcocaller phase will generate the phased hetSNPs in VCF format.

./sgcocaller swphase is an additional module for identifying potential switch errors in the genrated haplotype from ./sgcocaller phase. The provided utility R code can be used for plotting diagnostic plots which reveal whether ./sgcocaller swphase should be run to scrutinize switching errors.

       ./sgcocaller swphase --chrom "chr1" gtMtxfile.mtx chr1_phased_snpAnnot.txt unphasedHetSNPs.vcf.gz output/dirpath/phaseCorrected/chr1_

Since the ./sgcocaller phase has done the heavy lifting of parsing through the BAM and VCF file that generated the allele counts file and genotype matrix file, the crossover calling step in this use-case can be done through calling sgcocaller sxo that takes these outputs directly:

        ./sgcocaller sxo --thetaREF 0.1 --thetaALT 0.9 --cmPmb 1e-3 --chrom "chr1" sgphase_snp_annot.txt output/sgcocaller/phaseOneStep/apricot_ 

Run for a Bam of a single cell (bulk sample)

For cases where the DNA reads do not have a cell barocode tag (CB) ie. Bam file for one gamete or a bulk sample, sgcocaller can still be applied by adding one simple step:

1, Prepare a bulkBC.txt file in which only one dummy cell barcode "bulk" is listed. In other words, in the bulkBC.txt file, there is only one row and it has the text "bulk".

2, Apply sgcocaller as above :

    ./sgcocaller xo --threads 10 bulk.bam SNPs.vcf.gz bulkBC.txt ./output/bulksample

Setup/installation

Static builds

The static bianry can be simply downloaded which works for GNU/Linux type OS: ./src/sgcocaller

The static build was generated by using docker image docker://svirlyu/sgcocaller_nsb adapted from https://github.com/brentp/hts-nim/blob/master/Dockerfile