Rules specific to Poppy that are not defined in Hydra Genetics

pindel_processing.smk

These are custom rules created for Poppy to process the output from Pindel so that it can be processed by VEP.

Pindel creates an older type of VCF and therefore has to be processed slightly differently from more modern VCFs. Here we add the AF and DP fields to the VCF INFO column, annotate the calls using VEP, and add artifact annotation based on an artifact panel created with the reference pipeline.

🐍 Rule

rule pindel_processing_annotation_vep:
    input:
        cache=config.get("vep", {}).get("vep_cache", ""),
        fasta=config["reference"]["fasta"],
        tabix="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vcf.gz.tbi",
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vcf.gz",
    output:
        vcf=temp("cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf"),
    params:
        extra=config.get("vep", {}).get("extra", "--pick"),
        mode=config.get("vep", {}).get("mode", "--offline --cache --merged "),
    log:
        "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.benchmark.tsv",
            config.get("vep", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("vep", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("vep", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("vep", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("vep", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("vep", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("vep", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("vep", {}).get("container", config["default_container"])
    message:
        "{rule}: vep annotate {input.vcf}"
    script:
        "../scripts/pindel_processing_annotation_vep.sh"

↔ input / output files

Rule parameters Key Value Description
input cache config.get("vep", {}).get("vep_cache", "") path to vep cache directory from config["vep"]["vep_cache"]
fasta config["reference"]["fasta"] path to fasta reference genome
tabix "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vcf.gz.tbi" vcf index file
vcf "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vcf.gz" gzipped vcf file to be annotated
output vcf "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf" annotated vcf file (or, if the input vcf is empty, just a copy of it)
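
The output description above says an empty input is copied rather than annotated. The actual logic lives in pindel_processing_annotation_vep.sh, which is not shown here; the following is only a minimal Python sketch of that idea. It assumes that "empty" means the VCF has header lines but no variant records, and the function names `has_variants` and `copy_if_empty` are hypothetical.

```python
import gzip
import shutil


def has_variants(vcf_gz_path):
    """Return True if the gzipped VCF has at least one non-header, non-blank line."""
    with gzip.open(vcf_gz_path, "rt") as handle:
        return any(line.strip() and not line.startswith("#") for line in handle)


def copy_if_empty(vcf_gz_path, out_path):
    """Decompress the input to out_path when it holds no variants.

    Returns True if the file was copied, False if annotation would be needed.
    """
    if has_variants(vcf_gz_path):
        return False  # the real rule would run VEP at this point
    with gzip.open(vcf_gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return True
```

Copying the header-only file keeps the downstream rules working, since they still receive a valid VCF at the expected path.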

🔧 Configuration

Software settings (config.yaml)

Key Type Description
container string From config["vep"]: Name or path to container containing the vep executable
vep_cache string From config["vep"]: Path to offline VEP cache
mode string From config["vep"]: VEP arguments for run mode
extra string From config["vep"]: Additional command line arguments for VEP
benchmark_repeats integer From config["vep"]: set number of times benchmark should be repeated

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time
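
Every rule on this page resolves its settings with the same two-level lookup: a rule-specific value if one is configured, otherwise the value from `default_resources`. The sketch below illustrates that pattern in plain Python. The config values and the helper name `resource` are made up for illustration, and it assumes the contents of resources.yaml end up in the same `config` dictionary.

```python
# Hypothetical config values, for illustration only.
config = {
    "default_resources": {"threads": 1, "mem_mb": 6144, "time": "4:00:00"},
    "vep": {"threads": 4},
}


def resource(rule_name, key):
    """Return the rule-specific value if set, else the pipeline-wide default."""
    return config.get(rule_name, {}).get(key, config["default_resources"][key])


print(resource("vep", "threads"))  # rule-specific override is used
print(resource("vep", "mem_mb"))  # key missing under "vep": default is used
print(resource("pindel_processing_fix_af", "time"))  # whole section missing: default is used
```

Note that `config["default_resources"][key]` is evaluated even when an override exists, so the defaults section must always define every resource key.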

🐍 Rule

There are instances where the VEP annotation is not added to a variant. This rule adds missing CSQ annotations back to the VCF file.

rule pindel_processing_add_missing_csq:
    input:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.gz",
        tbi="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.gz.tbi",
    output:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf",
    params:
        field="CSQ",
        extra=config.get("pindel_processing_add_missing_csq", {}).get("extra", ""),
    log:
        "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.benchmark.tsv",
            config.get("pindel_processing_add_missing_csq", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("pindel_processing_add_missing_csq", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("pindel_processing_add_missing_csq", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("pindel_processing_add_missing_csq", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("pindel_processing_add_missing_csq", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("pindel_processing_add_missing_csq", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("pindel_processing_add_missing_csq", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("pindel_processing_add_missing_csq", {}).get("container", config["default_container"])
    message:
        "{rule}: if need be, add missing CSQ annotation to variants in {input.vcf}"
    script:
        "../scripts/pindel_processing_add_missing_csq.py"

↔ input / output files

Rule parameters Key Value Description
input vcf "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.gz" gzipped vcf to be corrected for missing CSQ
tbi "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.vcf.gz.tbi" tbi index to input.vcf
output vcf "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf" annotated vcf file with blank CSQ if needed
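
As a rough illustration of what "blank CSQ" could mean, the sketch below appends an empty CSQ entry to an INFO string that lacks one. It is not the code in pindel_processing_add_missing_csq.py. The function name `add_missing_csq` is hypothetical, and it assumes the blank value consists of empty fields separated by `|`, with `n_fields` taken from the CSQ format declared in the VCF header.

```python
def add_missing_csq(info, n_fields, field="CSQ"):
    """Append an empty CSQ entry to an INFO string that does not have one."""
    keys = [entry.split("=", 1)[0] for entry in info.split(";")]
    if field in keys:
        return info  # already annotated, leave untouched
    blank = "|" * (n_fields - 1)  # n_fields empty values joined by "|"
    if info in ("", "."):
        return f"{field}={blank}"
    return f"{info};{field}={blank}"


print(add_missing_csq("END=105;SVTYPE=DEL", n_fields=3))
print(add_missing_csq("END=105;CSQ=T|intron_variant|FLT3", n_fields=3))
```

Keeping the number of `|`-separated fields equal to the header declaration means downstream tools that split CSQ by position can still parse the record.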

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

🐍 Rule

rule pindel_processing_fix_af:
    input:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.vcf",
    output:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.fix_af.vcf",
    params:
        extra=config.get("pindel_processing_fix_af", {}).get("extra", ""),
    log:
        "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.fix_af.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.fix_af.vcf.benchmark.tsv",
            config.get("pindel_processing_fix_af", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("pindel_processing_fix_af", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("pindel_processing_fix_af", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("pindel_processing_fix_af", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("pindel_processing_fix_af", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("pindel_processing_fix_af", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("pindel_processing_fix_af", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("pindel_processing_fix_af", {}).get("container", config["default_container"])
    message:
        "{rule}: add af and dp to info field in {input.vcf}"
    script:
        "../scripts/pindel_processing_fix_af.py"

↔ input / output files

Rule parameters Key Value Description
input vcf "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.vcf" vcf where AF and DP are needed in INFO field
output vcf "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.fix_af.vcf" vcf with added AF and DP in INFO field
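
To make the AF and DP step concrete, here is a minimal sketch that works on a single tab-separated VCF record. It is not the code in pindel_processing_fix_af.py. It assumes the sample column carries an `AD` value of the form `ref,alt`, that DP is the sum of the two, and that AF is the alt fraction; the function name `add_af_dp` and the example record are hypothetical.

```python
def add_af_dp(vcf_line):
    """Add AF and DP to INFO using the first sample's AD (ref,alt) read counts."""
    cols = vcf_line.rstrip("\n").split("\t")
    fmt_keys = cols[8].split(":")
    sample = dict(zip(fmt_keys, cols[9].split(":")))
    ref_reads, alt_reads = (int(x) for x in sample["AD"].split(",")[:2])
    depth = ref_reads + alt_reads
    af = alt_reads / depth if depth else 0.0  # avoid division by zero
    cols[7] = f"{cols[7]};AF={af:.4g};DP={depth}"
    return "\t".join(cols)


record = "chr13\t28608250\t.\tA\tATTT\t.\tPASS\tEND=28608250;SVTYPE=INS\tGT:AD\t0/1:90,10"
print(add_af_dp(record).split("\t")[7])
```

A complete implementation would also add matching `##INFO` header lines for AF and DP, which this sketch leaves out.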

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

🐍 Rule

rule pindel_processing_artifact_annotation:
    input:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.gz",
        tbi="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.gz.tbi",
        artifacts=config["reference"]["artifacts_pindel"],
    output:
        vcf="cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.artifact_annotated.vcf",
    params:
        extra=config.get("pindel_processing_artifact_annotation", {}).get("extra", ""),
    log:
        "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.artifact_annotated.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.artifact_annotated.vcf.benchmark.tsv",
            config.get("pindel_processing_artifact_annotation", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("pindel_processing_artifact_annotation", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("pindel_processing_artifact_annotation", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("pindel_processing_artifact_annotation", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("pindel_processing_artifact_annotation", {}).get(
            "partition", config["default_resources"]["partition"]
        ),
        threads=config.get("pindel_processing_artifact_annotation", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("pindel_processing_artifact_annotation", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("pindel_processing_artifact_annotation", {}).get("container", config["default_container"])
    message:
        "{rule}: add artifact annotation on {input.vcf}, based on artifact_panel_pindel.tsv"
    script:
        "../scripts/pindel_processing_artifact_annotation.py"

↔ input / output files

Rule parameters Key Value Description
input vcf "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.gz" gzipped vcf to be artifact annotated
tbi "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.csq_corrected.vcf.gz.tbi" tbi index to input.vcf
artifacts config["reference"]["artifacts_pindel"] tsv file with artifact pindel calls, created in reference pipeline
output vcf "cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.artifact_annotated.vcf" vcf with artifact annotation
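
The sketch below shows one way a call could be compared against the artifact panel. It is not the code in pindel_processing_artifact_annotation.py. It assumes the panel is a tab-separated file with a header row naming the columns `chr`, `pos`, `svtype`, `median`, `sd` and `num_obs`, that median and sd refer to allele frequency, and that lookups are by exact position and type. The function names and the returned keys are hypothetical, and writing the values into the VCF is left out.

```python
import csv
import io


def load_panel(tsv_text):
    """Index panel rows by (chr, pos, svtype)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(r["chr"], int(r["pos"]), r["svtype"]): r for r in reader}


def artifact_stats(panel, chrom, pos, svtype, af):
    """Return how often the call was seen in the panel and how far its AF is from the panel median."""
    row = panel.get((chrom, pos, svtype))
    if row is None:
        return {"num_obs": 0, "median": None, "nr_sd": None}
    median, sd = float(row["median"]), float(row["sd"])
    nr_sd = (af - median) / sd if sd > 0 else None  # undefined without spread
    return {"num_obs": int(row["num_obs"]), "median": median, "nr_sd": nr_sd}


panel_text = "chr\tpos\tsvtype\tmedian\tsd\tnum_obs\nchr13\t28608250\tINS\t0.02\t0.01\t12\n"
panel = load_panel(panel_text)
print(artifact_stats(panel, "chr13", 28608250, "INS", af=0.05))
print(artifact_stats(panel, "chr1", 100, "DEL", af=0.30))
```

A call seen in many panel samples at a similar allele frequency is likely a recurrent artifact, whereas a call absent from the panel carries no such evidence.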

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

svdb.smk

When svdb --merge is run with the priority flag set, svdb cuts off the FORMAT column for cnvkit variants (git issue). We therefore use a non-Hydra Genetics rule that runs svdb --merge without the priority flag.

🐍 Rule

rule svdb_merge_wo_priority:
    input:
        vcfs=get_vcfs_for_svdb_merge,
    output:
        vcf=temp("cnv_sv/svdb_merge/{sample}_{type}.{tc_method}.merged.vcf"),
    params:
        extra=config.get("svdb_merge", {}).get("extra", ""),
        overlap=config.get("svdb_merge", {}).get("overlap", 0.6),
        bnd_distance=config.get("svdb_merge", {}).get("bnd_distance", 10000),
    log:
        "cnv_sv/svdb_merge/{sample}_{type}.{tc_method}.merged.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/svdb_merge/{sample}_{type}.{tc_method}.merged.benchmark.tsv",
            config.get("svdb_merge", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("svdb_merge", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("svdb_merge", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("svdb_merge", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("svdb_merge", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("svdb_merge", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("svdb_merge", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("svdb_merge", {}).get("container", config["default_container"])
    message:
        "{rule}: merges vcf files from different cnv callers into {output.vcf}"
    shell:
        "(svdb --merge "
        "--vcf {input.vcfs} "
        "--bnd_distance {params.bnd_distance} "
        "--overlap {params.overlap} "
        "{params.extra} "
        "> {output.vcf}) 2> {log}"

↔ input / output files

Rule parameters Key Value Description
input vcfs get_vcfs_for_svdb_merge a function get_vcfs_for_svdb_merge (common.smk) is used to list all files (e.g. from different callers) that should be merged into a SVDB 'vcf'
output vcf "cnv_sv/svdb_merge/{sample}_{type}.{tc_method}.merged.vcf" a 'vcf' file containing the merged SV calls
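
To show what the shell directive of this rule expands to, the following sketch assembles the same command string in Python. The defaults mirror the rule's params (overlap 0.6, bnd_distance 10000); the function name `svdb_merge_command` and the file names are hypothetical, and nothing is executed.

```python
import shlex


def svdb_merge_command(vcfs, output_vcf, log, overlap=0.6, bnd_distance=10000, extra=""):
    """Build the svdb --merge command line in the same order as the rule's shell directive."""
    parts = [
        "svdb", "--merge",
        "--vcf", *vcfs,
        "--bnd_distance", str(bnd_distance),
        "--overlap", str(overlap),
        *shlex.split(extra),  # optional extra arguments from the config
    ]
    return f"({shlex.join(parts)} > {shlex.quote(output_vcf)}) 2> {shlex.quote(log)}"


print(svdb_merge_command(["cnvkit.vcf", "gatk.vcf"], "merged.vcf", "merged.vcf.log"))
```

The point of the custom rule is visible in what is absent: no priority argument is passed, so the FORMAT column of the cnvkit variants is left intact.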

🔧 Configuration

Software settings (config.yaml)

Key Type Description
container string Name or path to container containing the svdb executable
tc_method array Tumor cell content estimation methods
overlap number Minimum overlap between regions for merging
extra string Additional arguments to pass to svdb

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

reference_rules.smk

Rules used specifically to create the reference files for Poppy.

🐍 Rule

rule reference_rules_create_artifact_file_pindel:
    input:
        vcfs=set([f"cnv_sv/pindel_vcf/{t.sample}_{t.type}.no_tc.normalized.vep_annotated.vcf.gz" for t in units.itertuples()]),
        tbis=set([f"cnv_sv/pindel_vcf/{t.sample}_{t.type}.no_tc.normalized.vep_annotated.vcf.gz.tbi" for t in units.itertuples()]),
    output:
        artifact_panel=temp("references/create_artifact_file_pindel/artifact_panel.tsv"),
    params:
        extra=config.get("create_artifact_file_pindel", {}).get("extra", ""),
    log:
        "references/create_artifact_file_pindel/artifact_panel.tsv.log",
    benchmark:
        repeat(
            "references/create_artifact_file_pindel/artifact_panel.tsv.benchmark.tsv",
            config.get("create_artifact_file_pindel", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("create_artifact_file_pindel", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("create_artifact_file_pindel", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("create_artifact_file_pindel", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("create_artifact_file_pindel", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("create_artifact_file_pindel", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("create_artifact_file_pindel", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("create_artifact_file_pindel", {}).get("container", config["default_container"])
    message:
        "{rule}: create artifact PoN for pindel"
    script:
        "../scripts/create_artifact_file_pindel.py"

↔ input / output files

Rule parameters Key Value Description
input vcfs set([f"cnv_sv/pindel_vcf/{t.sample}_{t.type}.no_tc.normalized.vep_annotated.vcf.gz" for t in units.itertuples()]) all (gzipped) vcfs to be used for artifact panel
tbis set([f"cnv_sv/pindel_vcf/{t.sample}_{t.type}.no_tc.normalized.vep_annotated.vcf.gz.tbi" for t in units.itertuples()]) tbi index to all input vcfs
output artifact_panel "references/create_artifact_file_pindel/artifact_panel.tsv" tsv file with chr, pos, svtype, median, sd, num_obs of detected variants
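
The columns of the panel suggest a per-variant summary over all input samples. The sketch below shows one way to compute such a summary; it is not the code in create_artifact_file_pindel.py. It assumes that median and sd are taken over the allele frequencies of each (chr, pos, svtype) combination and that the population standard deviation is used. The function name `build_panel` and the input tuples are hypothetical.

```python
from collections import defaultdict
from statistics import median, pstdev


def build_panel(observations):
    """Summarise (chr, pos, svtype, af) observations, one per call per sample.

    Returns rows of (chr, pos, svtype, median, sd, num_obs), sorted by key.
    """
    grouped = defaultdict(list)
    for chrom, pos, svtype, af in observations:
        grouped[(chrom, pos, svtype)].append(af)
    rows = []
    for (chrom, pos, svtype), afs in sorted(grouped.items()):
        rows.append((chrom, pos, svtype, median(afs), pstdev(afs), len(afs)))
    return rows


calls = [
    ("chr1", 100, "DEL", 0.1),
    ("chr1", 100, "DEL", 0.3),
    ("chr2", 5, "INS", 0.2),
]
for row in build_panel(calls):
    print("\t".join(str(value) for value in row))
```

A variant seen in only one sample gets an sd of 0, so any consumer of the panel has to handle that case when expressing a new call's distance from the median in standard deviations.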

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time