Rules specific to Poppy that are not defined in Hydra Genetics

pindel_processing.smk

These are custom rules created for Poppy to process the output from Pindel so that it can be processed by VEP.

Pindel creates an older type of VCF and therefore has to be processed slightly differently than more modern VCFs. Here we add the AF and DP fields to the VCF INFO column, annotate the calls using VEP, and add artifact annotation based on an artifact panel created with the reference pipeline.

Rule

rule pindel_processing_annotation_vep:
    input:
        cache=config.get("vep", {}).get("vep_cache", ""),
        fasta=config["reference"]["fasta"],
        tabix="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vcf.gz.tbi",
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vcf.gz",
    output:
        vcf=temp("cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf"),
    params:
        extra=config.get("vep", {}).get("extra", "--pick"),
        mode=config.get("vep", {}).get("mode", "--offline --cache --merged "),
    log:
        "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.benchmark.tsv",
            config.get("vep", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("vep", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("vep", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("vep", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("vep", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("vep", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("vep", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("vep", {}).get("container", config["default_container"])
    message:
        "{rule}: vep annotate {input.vcf}"
    script:
        "../scripts/pindel_processing_annotation_vep.sh"

input / output files

Rule parameters	Key	Value	Description
input	cache	`config.get("vep", {}).get("vep_cache", "")`	path to vep cache directory from config["vep"]["vep_cache"]
	fasta	`config["reference"]["fasta"]`	path to fasta reference genome
	tabix	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vcf.gz.tbi"`	vcf index file
	vcf	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vcf.gz"`	gzipped vcf file to be annotated
output	vcf	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf"`	annotated vcf file (or, if the input vcf is empty, a plain copy of it)
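
The output description above says that an empty VCF is copied rather than annotated. The rule delegates this to a shell script that is not shown here, so the following is only a minimal Python sketch of how such an emptiness check could look. The helper name `vcf_has_records` and the demo files are hypothetical.

```python
import gzip
import os
import tempfile


def vcf_has_records(path):
    """Return True if the (optionally gzipped) VCF has at least one variant line."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            # Header lines start with '#'; anything else is a variant record.
            if line.strip() and not line.startswith("#"):
                return True
    return False


# Hypothetical demo files: one header-only VCF and one with a single call.
header = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
tmpdir = tempfile.mkdtemp()
empty_vcf = os.path.join(tmpdir, "empty.vcf.gz")
filled_vcf = os.path.join(tmpdir, "filled.vcf.gz")
with gzip.open(empty_vcf, "wt") as handle:
    handle.write(header)
with gzip.open(filled_vcf, "wt") as handle:
    handle.write(header + "chr13\t100\t.\tA\tAT\t.\tPASS\t.\n")

empty_has_records = vcf_has_records(empty_vcf)
filled_has_records = vcf_has_records(filled_vcf)
```

A wrapper could run VEP only when the check returns True and otherwise copy the input to the output path.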

Configuration

Software settings (`config.yaml`)

Key	Type	Description
container	string	From config["vep"]: Name or path to container containing the vep executable
vep_cache	string	From config["vep"]: Path to offline VEP cache
mode	string	From config["vep"]: VEP arguments for run mode
extra	string	From config["vep"]: Additional command line arguments for VEP
benchmark_repeats	integer	From config["vep"]: set number of times benchmark should be repeated

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time
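
Every rule on this page reads its settings with the same pattern: a rule-specific value is used if present, otherwise the value from `default_resources`. The sketch below reproduces that lookup outside Snakemake. The helper name `get_setting` and all values in `example_config` are hypothetical and only serve to illustrate the fallback.

```python
def get_setting(config, rule, key):
    """Mirror config.get(rule, {}).get(key, config["default_resources"][key])."""
    return config.get(rule, {}).get(key, config["default_resources"][key])


# Hypothetical configuration; real values come from config.yaml and resources.yaml.
example_config = {
    "default_resources": {
        "threads": 1,
        "mem_mb": 6144,
        "mem_per_cpu": 6144,
        "partition": "core",
        "time": "4:00:00",
    },
    "vep": {"threads": 4},
}

vep_threads = get_setting(example_config, "vep", "threads")  # rule-specific value
vep_mem_mb = get_setting(example_config, "vep", "mem_mb")  # falls back to default
other_time = get_setting(example_config, "not_configured_rule", "time")  # no section at all
```

Note that, as in the rules themselves, the default is looked up eagerly, so `default_resources` must define every key even when a rule overrides it.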

Rule

There are instances where the VEP annotation is not added to a variant. This rule adds missing CSQ annotations back to the VCF file.

rule pindel_processing_add_missing_csq:
    input:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.gz",
        tbi="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.gz.tbi",
    output:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf",
    params:
        field="CSQ",
        extra=config.get("pindel_processing_add_missing_csq", {}).get("extra", ""),
    log:
        "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.benchmark.tsv",
            config.get("pindel_processing_add_missing_csq", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("pindel_processing_add_missing_csq", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("pindel_processing_add_missing_csq", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("pindel_processing_add_missing_csq", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("pindel_processing_add_missing_csq", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("pindel_processing_add_missing_csq", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("pindel_processing_add_missing_csq", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("pindel_processing_add_missing_csq", {}).get("container", config["default_container"])
    message:
        "{rule}: if need be, add missing CSQ annotation to variants in {input.vcf}"
    script:
        "../scripts/pindel_processing_add_missing_csq.py"

input / output files

Rule parameters	Key	Value	Description
input	vcf	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.gz"`	gzipped vcf to be corrected for missing CSQ
	tbi	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.gz.tbi"`	tbi index to input.vcf
output	vcf	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf"`	annotated vcf file with blank CSQ if needed
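
As a rough illustration of what adding a blank CSQ means, the sketch below works on plain VCF text lines. It is not the script used by the rule. The function name `add_missing_csq` and the form of the blank value (an empty string by default) are assumptions; the real script may write a different placeholder.

```python
def add_missing_csq(vcf_line, field="CSQ", blank_value=""):
    """Append `field=blank_value` to the INFO column if the key is absent."""
    if vcf_line.startswith("#"):
        return vcf_line  # header lines are passed through untouched
    columns = vcf_line.rstrip("\n").split("\t")
    info = columns[7]  # INFO is the eighth VCF column
    keys = [entry.split("=", 1)[0] for entry in info.split(";")]
    if field not in keys:
        addition = f"{field}={blank_value}"
        columns[7] = addition if info in ("", ".") else f"{info};{addition}"
    return "\t".join(columns) + "\n"


# Hypothetical records: the first lacks CSQ, the second already has it.
without_csq = "chr13\t100\t.\tA\tAT\t.\tPASS\tAF=0.1;DP=100\n"
with_csq = "chr13\t200\t.\tG\tGC\t.\tPASS\tAF=0.2;CSQ=C|frameshift\n"

fixed_line = add_missing_csq(without_csq)
unchanged_line = add_missing_csq(with_csq)
```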

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

Rule

rule pindel_processing_fix_af:
    input:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.vcf",
    output:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.fix_af.vcf",
    params:
        extra=config.get("pindel_processing_fix_af", {}).get("extra", ""),
    log:
        "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.fix_af.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.fix_af.vcf.benchmark.tsv",
            config.get("pindel_processing_fix_af", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("pindel_processing_fix_af", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("pindel_processing_fix_af", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("pindel_processing_fix_af", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("pindel_processing_fix_af", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("pindel_processing_fix_af", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("pindel_processing_fix_af", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("pindel_processing_fix_af", {}).get("container", config["default_container"])
    message:
        "{rule}: add af and dp to info field in {input.vcf}"
    script:
        "../scripts/pindel_processing_fix_af.py"

input / output files

Rule parameters	Key	Value	Description
input	vcf	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.vcf"`	vcf where AF and DP are needed in the INFO field
output	vcf	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.fix_af.vcf"`	vcf with added AF and DP in INFO field
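
The following sketch shows one way AF and DP could be derived and appended to INFO. It assumes that each call carries an AD value with reference and alternate supporting read counts, and that DP is taken as their sum; both are assumptions, not a description of the script used by the rule. The function names are hypothetical.

```python
def af_dp_from_ad(ad):
    """Compute (AF, DP) from an (ref_reads, alt_reads) pair."""
    ref_reads, alt_reads = ad
    depth = ref_reads + alt_reads
    # Guard against division by zero for calls without supporting reads.
    allele_frequency = alt_reads / depth if depth > 0 else 0.0
    return allele_frequency, depth


def add_af_dp(info, ad):
    """Return the INFO string with AF and DP appended."""
    allele_frequency, depth = af_dp_from_ad(ad)
    addition = f"AF={allele_frequency:.4f};DP={depth}"
    return addition if info in ("", ".") else f"{info};{addition}"


# Hypothetical call: 90 reference reads and 10 alternate reads.
example_af, example_dp = af_dp_from_ad((90, 10))
example_info = add_af_dp("END=105;SVTYPE=INS", (90, 10))
```

A complete implementation would also add matching `##INFO` header lines for AF and DP so that downstream tools accept the new keys.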

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

Rule

rule pindel_processing_artifact_annotation:
    input:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.gz",
        tbi="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.gz.tbi",
        artifacts=config["reference"]["artifacts_pindel"],
    output:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.artifact_annotated.vcf",
    params:
        extra=config.get("pindel_processing_artifact_annotation", {}).get("extra", ""),
    log:
        "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.artifact_annotated.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.artifact_annotated.vcf.benchmark.tsv",
            config.get("pindel_processing_artifact_annotation", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("pindel_processing_artifact_annotation", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("pindel_processing_artifact_annotation", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("pindel_processing_artifact_annotation", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("pindel_processing_artifact_annotation", {}).get(
            "partition", config["default_resources"]["partition"]
        ),
        threads=config.get("pindel_processing_artifact_annotation", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("pindel_processing_artifact_annotation", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("pindel_processing_artifact_annotation", {}).get("container", config["default_container"])
    message:
        "{rule}: add artifact annotation on {input.vcf}, based on artifact_panel_pindel.tsv "
    script:
        "../scripts/pindel_processing_artifact_annotation.py"

input / output files

Rule parameters	Key	Value	Description
input	vcf	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.gz"`	gzipped vcf to be artifact annotated
	tbi	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.gz.tbi"`	tbi index to input.vcf
	artifacts	`config["reference"]["artifacts_pindel"]`	tsv file with artifact pindel calls, created in reference pipeline
output	vcf	`"cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.artifact_annotated.vcf"`	vcf with artifact annotation
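
A minimal sketch of the lookup behind artifact annotation is given below. It assumes the artifact panel is a tab-separated file with a header row naming the columns chr, pos, svtype, median, sd and num_obs, and that calls are matched on chromosome, position and SV type. The function names and the returned dictionary keys are hypothetical; the script used by the rule may match and report differently.

```python
import csv
import io


def load_artifact_panel(handle):
    """Read the panel into a dict keyed by (chr, pos, svtype)."""
    panel = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["chr"], int(row["pos"]), row["svtype"])
        panel[key] = {
            "median": float(row["median"]),
            "sd": float(row["sd"]),
            "num_obs": int(row["num_obs"]),
        }
    return panel


def artifact_annotation(panel, chrom, pos, svtype):
    """Return panel statistics for a call, or zeros if it was never observed."""
    return panel.get((chrom, pos, svtype), {"median": 0.0, "sd": 0.0, "num_obs": 0})


# Hypothetical panel content.
panel_text = "chr\tpos\tsvtype\tmedian\tsd\tnum_obs\nchr13\t100\tINS\t0.02\t0.01\t12\n"
panel = load_artifact_panel(io.StringIO(panel_text))

known_call = artifact_annotation(panel, "chr13", 100, "INS")
novel_call = artifact_annotation(panel, "chr13", 500, "DEL")
```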

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

svdb.smk

When svdb --merge is run with the priority flag set, svdb cuts off the FORMAT column for cnvkit variants (reported in a git issue). Poppy therefore uses a non-Hydra Genetics rule that runs svdb --merge without the priority flag.

Rule

rule svdb_merge_wo_priority:
    input:
        vcfs=get_vcfs_for_svdb_merge,
    output:
        vcf=temp("cnv_sv/svdb_merge/{sample}_{type}.{tc_method}.merged.vcf"),
    params:
        extra=config.get("svdb_merge", {}).get("extra", ""),
        overlap=config.get("svdb_merge", {}).get("overlap", 0.6),
        bnd_distance=config.get("svdb_merge", {}).get("bnd_distance", 10000),
    log:
        "cnv_sv/svdb_merge/{sample}_{type}.{tc_method}.merged.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/svdb_merge/{sample}_{type}.{tc_method}.merged.benchmark.tsv",
            config.get("svdb_merge", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("svdb_merge", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("svdb_merge", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("svdb_merge", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("svdb_merge", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("svdb_merge", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("svdb_merge", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("svdb_merge", {}).get("container", config["default_container"])
    message:
        "{rule}: merges vcf files from different cnv callers into {output.vcf}"
    shell:
        "(svdb --merge "
        "--vcf {input.vcfs} "
        "--bnd_distance {params.bnd_distance} "
        "--overlap {params.overlap} "
        "{params.extra} "
        "> {output.vcf}) 2> {log}"

input / output files

Rule parameters	Key	Value	Description
input	vcfs	`get_vcfs_for_svdb_merge`	the function get_vcfs_for_svdb_merge (common.smk) lists all files (e.g. from different callers) that should be merged into an SVDB 'vcf'
output	vcf	`"cnv_sv/svdb_merge/{sample}_{type}.{tc_method}.merged.vcf"`	a 'vcf' file containing the merged SV calls
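
To make the shell directive easier to follow, the sketch below assembles the same command string in plain Python. Only flags that appear in the rule are used, and no priority flag is added. The function name `build_svdb_merge_command` and the file names are hypothetical.

```python
import shlex


def build_svdb_merge_command(vcfs, output_vcf, log, overlap=0.6, bnd_distance=10000, extra=""):
    """Build the svdb --merge command line used by the rule, without a priority flag."""
    parts = [
        "svdb",
        "--merge",
        "--vcf",
        *vcfs,
        "--bnd_distance",
        str(bnd_distance),
        "--overlap",
        str(overlap),
    ]
    if extra:
        parts.extend(shlex.split(extra))  # forward any additional arguments
    command = " ".join(shlex.quote(part) for part in parts)
    # stdout goes to the merged VCF, stderr to the log file
    return f"({command} > {shlex.quote(output_vcf)}) 2> {shlex.quote(log)}"


# Hypothetical input files from two callers.
example_command = build_svdb_merge_command(
    ["cnvkit.vcf", "gatk.vcf"], "merged.vcf", "merged.vcf.log"
)
```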

Configuration

Software settings (`config.yaml`)

Key	Type	Description
container	string	Name or path to container containing the svdb executable
tc_method	array	Tumor cell content estimation methods
overlap	number	Minimum overlap between regions for merging
extra	string	Additional arguments to pass to svdb

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

reference_rules.smk

Rules used specifically to create the reference files for Poppy.

Rule

rule reference_rules_create_artifact_file_pindel:
    input:
        vcfs=set([f"cnv_sv/pindel_vcf/{t.sample}_{t.type}.no_tc.normalized.vep_annotated.vcf.gz" for t in units.itertuples()]),
        tbis=set([f"cnv_sv/pindel_vcf/{t.sample}_{t.type}.no_tc.normalized.vep_annotated.vcf.gz.tbi" for t in units.itertuples()]),
    output:
        artifact_panel=temp("references/create_artifact_file_pindel/artifact_panel.tsv"),
    params:
        extra=config.get("create_artifact_file_pindel", {}).get("extra", ""),
    log:
        "references/create_artifact_file_pindel/artifact_panel.tsv.log",
    benchmark:
        repeat(
            "references/create_artifact_file_pindel/artifact_panel.tsv.benchmark.tsv",
            config.get("create_artifact_file_pindel", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("create_artifact_file_pindel", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("create_artifact_file_pindel", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("create_artifact_file_pindel", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("create_artifact_file_pindel", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("create_artifact_file_pindel", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("create_artifact_file_pindel", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("create_artifact_file_pindel", {}).get("container", config["default_container"])
    message:
        "{rule}: create artifact PoN for pindel"
    script:
        "../scripts/create_artifact_file_pindel.py"

input / output files

Rule parameters	Key	Value	Description
input	vcfs	`set([f"cnv_sv/pindel_vcf/{t.sample}_{t.type}.no_tc.normalized.vep_annotated.vcf.gz" for t in units.itertuples()])`	all (gzipped) vcfs to be used for artifact panel
	tbis	`set([f"cnv_sv/pindel_vcf/{t.sample}_{t.type}.no_tc.normalized.vep_annotated.vcf.gz.tbi" for t in units.itertuples()])`	tbi index to all input vcfs
output	artifact_panel	`"references/create_artifact_file_pindel/artifact_panel.tsv"`	tsv file with chr, pos, svtype, median, sd, num_obs of detected variants
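
The sketch below shows how rows with the columns listed above could be computed from a set of observed calls. It assumes that median and sd are taken over the allele frequencies of each (chr, pos, svtype) group and that sd is the sample standard deviation; neither is confirmed by this page. The function name `build_artifact_panel` and the observations are hypothetical.

```python
import statistics
from collections import defaultdict


def build_artifact_panel(observations):
    """Group (chr, pos, svtype, af) observations and summarise each group."""
    grouped = defaultdict(list)
    for chrom, pos, svtype, allele_frequency in observations:
        grouped[(chrom, pos, svtype)].append(allele_frequency)

    rows = []
    for (chrom, pos, svtype), afs in sorted(grouped.items()):
        # The sample standard deviation needs at least two observations.
        sd = statistics.stdev(afs) if len(afs) > 1 else 0.0
        rows.append((chrom, pos, svtype, statistics.median(afs), sd, len(afs)))
    return rows


# Hypothetical calls collected from several normal samples.
example_observations = [
    ("chr13", 100, "INS", 0.1),
    ("chr13", 100, "INS", 0.3),
    ("chr13", 100, "INS", 0.2),
    ("chr7", 5, "DEL", 0.05),
]
panel_rows = build_artifact_panel(example_observations)
```

Each tuple in `panel_rows` corresponds to one line of the tsv file, in the column order chr, pos, svtype, median, sd, num_obs.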

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time
