FindGVCFsBlocks

Last commit

Find common blocks of calleable regions from a set of gvcfs

Usage

This program is now part of the main jvarkit tool. See jvarkit for compiling.

Usage: java -jar dist/jvarkit.jar findgvcfsblocks  [options] Files

Usage: findgvcfsblocks [options] Files
  Options:
    --bed
      Restrict output blocks to those overlapping this bed
    --min-size, --block-size
      min block size. A distance specified as a positive integer.Commas are 
      removed. The following suffixes are interpreted : b,bp,k,kb,m,mb,g,gb
      Default: 0
    -c, --chrom, --chromosome, --contig
      limit to that contig
    -h, --help
      print help and exit
    --helpFormat
      What kind of help. One of [usage,markdown,xml].
    --merge-size, -M
      merge adjacent blocks distance. A distance specified as a positive 
      integer.Commas are removed. The following suffixes are interpreted : 
      b,bp,k,kb,m,mb,g,gb 
      Default: 1
    -o, --out
      Output file. Optional . Default: stdout
    --version
      print version and exit

Keywords

  • gvcf
  • gatk
  • vcf

Creation Date

20210806

Source code

https://github.com/lindenb/jvarkit/tree/master/src/main/java/com/github/lindenb/jvarkit/tools/gvcf/FindGVCFsBlocks.java

Contribute

License

The project is licensed under the MIT license.

Citing

Should you cite findgvcfsblocks ? https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md

The current reference is:

http://dx.doi.org/10.6084/m9.figshare.1425030

Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030

Motivation

find regions for running GATK CombineGVCFs in parallel.

Input

input is a set of path to the indexed g.vcf files or a picard-style interval file generated with a previous invocation of findgvcfsblocks.jar with one sample. or it's a file with the '.list' suffix containing the path to the g.vcf files/interval files

g.vcf files must be indexed if option -c is used.

Output

output is a picard-style Interval file containing the calleable GVCFs blocks.

Example

$ java -jar dist/findgvcfsblocks.jar --min-size 100 --chrom RF11 S1.g.vcf.gz S2.g.vcf.gz S3.g.vcf.gz 
@HD VN:1.6
@SQ SN:RF01 LN:3302
@SQ SN:RF02 LN:2687
@SQ SN:RF03 LN:2592
@SQ SN:RF04 LN:2362
@SQ SN:RF05 LN:1579
@SQ SN:RF06 LN:1356
@SQ SN:RF07 LN:1074
@SQ SN:RF08 LN:1059
@SQ SN:RF09 LN:1062
@SQ SN:RF10 LN:751
@SQ SN:RF11 LN:666
@CO findgvcfsblocks. compilation:20210807160340 githash:b442941 htsjdk:2.24.1 date:20210807160354. cmd:--min-size 100 --chrom RF11 S1.g.vcf.gz S2.g.vcf.gz S3.g.vcf.gz
RF11    1   95  +   .
RF11    96  182 +   .
RF11    183 237 +   .
RF11    238 428 +   .
RF11    429 528 +   .
RF11    529 628 +   .
RF11    629 666 +   .
(...)

Nextflow

(...)
Channel.fromPath(params.reference+".fai").
    splitCsv(header: false,sep:'\t',strip:true).
    filter{T->T[0].matches("(chr)?[0-9XY]+")}.
    map{T->[T[0]]}.
    set{each_contig}

process gvcflists {
executor "local"
output:
    path("gvcfs.list") into (gvcfs_list1,gvcfs_list2)
script:
"""
test -s "${params.gvcfs}"

SQRT=`awk 'END{z=sqrt(NR); print (z==int(z)?z:int(z)+1);}' "${params.gvcfs}"`

split -a 9 --additional-suffix=.list --lines=\${SQRT} "${params.gvcfs}" chunck.

find \${PWD} -type f -name "chunck.*.list"  > gvcfs.list

"""
}


gvcfs_list2.splitCsv(header: false,sep:'\t',strip:true).
    map{T->T[0]}.
    set{one_gvcf_list}


process findBlocks {
tag "${file(vcfs).name} ${contig}"
memory "5g"
input:
    tuple vcfs,contig from one_gvcf_list.combine(each_contig)
output:
    tuple contig,path("block.interval_list.gz") into blocks_interval
script:
"""
module load jvarkit/0.0.0

java -Xmx${task.memory.giga}g -Djava.io.tmpdir=. -jar \${JVARKIT_DIST}/findgvcfsblocks.jar \
    --contig "${contig}" \
    -T . \
    -o "block.interval_list.gz" \
    "${vcfs}"

test -s block.interval_list.gz
"""
}


process findCommonBlocks {
tag "${contig} N=${L.size()}"
memory "10g"
input:
    tuple contig,L from blocks_interval.groupTuple()
output:
    path("block.interval_list") into blocks_common
script:
"""
module load jvarkit/0.0.0

cat << EOF > jeter.list
${L.join("\n")}
EOF

java -Xmx${task.memory.giga}g -Djava.io.tmpdir=. -jar \${JVARKIT_DIST}/findgvcfsblocks.jar \
    --contig "${contig}" \
    -T . \
    --block-size "${params.blocksize}" \
    -o "block.interval_list" \
    jeter.list

rm jeter.list

test -s block.interval_list
"""
}


blocks_common.splitCsv(header: false,sep:'\t',strip:true).
    filter{T->!T[0].startsWith("@")}.
    map{T->T[0]+":"+T[1]+"-"+T[2]}.
    set{intervals}

process genotypeRegion {
tag "${region}"
cache "lenient"
errorStrategy "finish"
afterScript "rm -rf TMP"
memory "15g"
input:
    val region from intervals
    path gvcfs from gvcfs_list1
output:
    path("genotyped.vcf.gz")  into genotyped
    path("genotyped.vcf.gz.tbi")
script:
(...)