Poppy setup and configuration
- Create python venv
- Generate reference input files
- Required files for references pipeline and Poppy
- Set up config files
- Run References pipeline
- Prepare Poppy run
- Launch Poppy
To run Poppy, first go through the setup steps to download and generate the reference files (requires an internet connection).
Generating the necessary references only needs to be done once. To start Poppy, go to Prepare Poppy run.
1. Create python venv
Make sure you have python 3.8 installed.
python3 -m venv poppy_env
source poppy_env/bin/activate
pip3 install -r requirements.txt
Note: creating the environment with micromamba, or with micromamba plus pip, did not work. Use a plain Python venv as shown above.
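Since this setup expects Python 3.8, it can be worth checking the interpreter version before creating the venv. The following is a minimal bash sketch; the function name is_python_38 is made up for illustration and is not part of Poppy.

```shell
# Hypothetical helper (not part of Poppy): succeeds only if the given
# version string, e.g. the output of `python3 --version`, is a 3.8.x release.
is_python_38() {
    case "$1" in
        "Python 3.8" | "Python 3.8."*) return 0 ;;
        *) return 1 ;;
    esac
}
```

It could be used as `is_python_38 "$(python3 --version 2>&1)" || echo "Python 3.8 expected" >&2` before running python3 -m venv poppy_env.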
2. Generate reference input files
This step only needs to be done once, but it is required before Poppy can run.
Create samples.tsv and units.tsv for panel of normals
Read more about samples.tsv and units.tsv files at the Hydra-Genetics page: create_sample_files
Requirements:
- Fastq directory with samples to include in the PoN
Generate the input files with Hydra-Genetics create-input-files:
hydra-genetics create-input-files -d <path to fastqs> -p <seq machine>
Selected flags:
-d, --directory TEXT path to dir where fastq-files should be looked
for when platform is Illumina.Path to unmapped
BAM files when platform is ONT or PACBIO
-p, --platform TEXT Sequence platform that the data originate from, e.g., NextSeq, MiSeq, Illumina.
-f, --force overwrite existing files
Specify the path to the directory with the PoN fastq files and the sequencing platform that was used.
This creates units.tsv and samples.tsv, used as input to the Poppy References pipeline: workflow/Snakefile_references.smk.
Example units.tsv:
sample type platform barcode machine flowcell lane fastq1 fastq2 adapter
N1 N Illumina GTGAAGTG+GAGCAATC @NB501037 HGMMJAFX2 L001 /path/to/fastqs/N1_S9_R1_001.fastq.gz /path/to/fastqs/N1_S9_R2_001.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
N2 N Illumina CATGGCTA+CACACATC @NB501037 HGMMJAFX2 L001 /path/to/fastqs/N2_S10_R1_001.fastq.gz /path/to/fastqs/N2_S10_R2_001.fastq.gz AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Example samples.tsv:
sample
N1
N2
Note: N1 and N2 are just examples to show what the samples.tsv and units.tsv files should look like. There are no N1 and N2 fastq files included in this repository.
Rename files to units_ref.tsv and samples_ref.tsv.
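After renaming, a quick consistency check can confirm that every sample in samples_ref.tsv also has at least one row in units_ref.tsv. This is only a sketch: the function name is hypothetical, and it assumes both files are tab-separated, start with a header line, and keep the sample name in the first column, as in the examples above.

```shell
# Hypothetical helper (not part of Poppy or Hydra-Genetics).
# Prints every sample listed in the samples file that has no row in the units file.
# Usage: samples_missing_from_units samples_ref.tsv units_ref.tsv
samples_missing_from_units() {
    local samples="$1" units="$2"
    awk -F'\t' '
        FNR == 1 || NF == 0 { next }      # skip header and blank lines
        NR == FNR { seen[$1] = 1; next }  # first file read: units
        !($1 in seen) { print $1 }        # second file read: samples
    ' "$units" "$samples"
}
```

An empty output means the two files agree on the sample names.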
3. Required files for references pipeline and Poppy
The files specified in the table below are required to run both the references pipeline and Poppy. Some files can be downloaded with hydra-genetics references download, others need to be provided by the user.
See Download files for hg19 and GRCh38
| File / Resource | Description / Usage | Source / Notes |
|---|---|---|
| FASTA & indexes | Reference genome FASTA and associated index files | Reference genome (GRCh38;GRCh37) |
| Design files* | Target region design files | Custom file* |
| bcftools annotation_db | Annotation database for bcftools | hg19 - Twist Solid v0.6.1 references or GRCh38 - GATK best practices |
| gatk_collect_allelic_counts: SNP_interval | SNP interval list for allelic counts collection | genomic-medicine-sweden/Twist_Solid_pipeline_files/refs: hg19 - gnomad_SNP_0.001_target.annotated.interval_list or GRCh38 - gnomad_SNP_0.001_target.annotated.hg38.interval_list (remove "chr" prefix if needed) |
| pindel_call: include_bed** | BED file for Pindel calling | twist_shortlist_pindel.bed - custom file** |
| vep: vep_cache | VEP cache for variant annotation | homo_sapiens_merged_vep_111_GRCh37.tar.gz or homo_sapiens_merged_vep_111_GRCh38.tar.gz |
* Design file: Custom file designed by geneticists.
** pindel bed file: custom file with regions of interest - regions that are known to show longer indels (> 50 bp) which are relevant to diagnostics. Custom file designed by geneticists.
Four columns:
- Chromosome
- Start position
- End position
- Gene name
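The four-column layout above can be checked before the file is used. The sketch below is an assumption-laden illustration, not a Poppy tool: it assumes a tab-separated file without a header line, numeric start and end positions, and start smaller than end. The function name is hypothetical.

```shell
# Hypothetical check (not part of Poppy): prints "<line number>: <line>"
# for every record that does not look like chromosome, start, end, gene name.
invalid_pindel_bed_lines() {
    awk -F'\t' '
        NF != 4 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 || $4 == "" {
            print FNR ": " $0
        }
    ' "$1"
}
```

If the function prints nothing, all records have four columns with a valid coordinate pair.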
hg19 only (version included in repo): in the gatk_collect_allelic_counts: SNP_interval file, each line starts with "chr".
Use the command below to remove the "chr" prefix. Otherwise, the pipeline will crash at the cnv_sv_gatk_collect_allelic_counts step.
sed -i 's/^chr//' initial_references/ref_data/GNOMAD/gnomad_SNP_0.001_target.annotated.interval_list
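To see whether the prefix removal is needed, or whether it worked, the lines that still start with "chr" can be counted. A minimal sketch; the function name is hypothetical and it only looks at the start of each line, which is the same part the sed command above rewrites.

```shell
# Hypothetical helper (not part of Poppy): prints the number of lines
# in the given file that start with "chr".
count_chr_prefixed_lines() {
    # grep -c prints 0 but exits with status 1 when nothing matches,
    # so the status is deliberately ignored.
    grep -c '^chr' "$1" || true
}
```

A result of 0 for the interval list means no line-initial "chr" prefix is left.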
Download files for hg19 and GRCh38
The initial reference files are defined in config/references/required_references_hg19.yaml or config/references/required_references_GRCh38.yaml. Some are downloaded from public resources. The following need to be provided by the user and should be created together with the center's geneticists (see above):
- design files
- pindel bed file
The following commands should be run from the Poppy directory.
Run hydra-genetics references download to download the references to the initial_references directory (requires internet connection):
source poppy_env/bin/activate
mkdir -p initial_references
hydra-genetics references download -v config/references/required_references_GRCh38.yaml -o initial_references/
The bwa indexes and the Picard sequence dictionary (.dict file) need to be created manually after the genome files are downloaded. Use the commands below and adjust the paths to your reference genome.
SINGULARITY_PREFIX="<path to your singularity cache>" # same as in the cluster profile
apptainer exec -e \
--bind $PWD:$PWD \
docker://hydragenetics/bwa_mem:0.7.17 \
bwa index initial_references/ref_data/genome/GRCh38/human_GRCh38_no_alt_analysis_set/human_GRCh38_no_alt_analysis_set.fasta
SINGULARITY_PREFIX="<path to your singularity cache>" # same as in the cluster profile
apptainer exec -e \
--bind $PWD:$PWD \
docker://hydragenetics/picard:2.25.0 \
picard CreateSequenceDictionary \
R=initial_references/ref_data/genome/GRCh38/human_GRCh38_no_alt_analysis_set/human_GRCh38_no_alt_analysis_set.fasta \
O=initial_references/ref_data/genome/GRCh38/human_GRCh38_no_alt_analysis_set/human_GRCh38_no_alt_analysis_set.dict
These are the general references needed to run Poppy. The references pipeline will create additional files needed to run Poppy, based on the provided panel of normals (PoN).
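Once the bwa index and Picard commands have finished, it can help to confirm that the index files sit next to the FASTA. The sketch below assumes the usual bwa index outputs (.amb, .ann, .bwt, .pac, .sa appended to the FASTA name) and a .dict file named as in the O= argument above, i.e. with .fasta replaced by .dict. The function name is hypothetical.

```shell
# Hypothetical helper (not part of Poppy): prints the expected index
# files that are missing for the given FASTA path.
missing_reference_indexes() {
    local fasta="$1" ext f
    for ext in amb ann bwt pac sa; do   # files written by `bwa index`
        f="$fasta.$ext"
        [ -e "$f" ] || echo "$f"
    done
    f="${fasta%.fasta}.dict"            # sequence dictionary from Picard
    [ -e "$f" ] || echo "$f"
    return 0
}
```

Calling it with the path to the downloaded genome FASTA should print nothing once both commands have completed.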
4. Set up config files
Genome version, GENOME: hg19 or GRCh38
The config_references_pipeline_<GENOME>.yaml and config_<GENOME>.yaml files don't have to be changed before running the references pipeline or Poppy, provided that the references pipeline is run from the Poppy directory.
- config/config_references_pipeline_<GENOME>.yaml - necessary to run the references pipeline.
- config_<GENOME>.yaml - main config necessary to run both the references pipeline and Poppy.
- config/cnv_genes.<GENOME>.bed - Optional - make sure that the chromosome column matches the chromosome notation your references are using (chromosome name starts with or without "chr", according to your reference genome).
- profiles/grid_engine/config.yaml - config with cluster execution parameters. The config provided is an example config for a SGE cluster using Singularity. Adjust settings as needed. The snakefile and the config files can be specified in this file or on the command line.
Command line:
--snakefile <filename>
--configfiles config_references_pipeline_<GENOME>.yaml config_<GENOME>.yaml
In the profile config (attention: the order of the config files matters, as settings in later files override those in earlier ones):
snakefile: /path/to/Snakefile
configfile:
- /path/to/config_references_pipeline_<GENOME>.yaml
- /path/to/config_<GENOME>.yaml
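For the optional config/cnv_genes.<GENOME>.bed file, the chromosome notation can be inspected before it is compared with the reference genome. The following sketch reports which notation a BED file uses. The function name is hypothetical, and the file is assumed to be tab-separated with the chromosome in the first column.

```shell
# Hypothetical helper (not part of Poppy): prints "chr", "no-chr",
# "mixed" or "empty" depending on the chromosome names in column 1.
chrom_notation() {
    awk -F'\t' '
        /^#/ || /^track/ || /^browser/ || NF == 0 { next }  # skip BED header lines
        $1 ~ /^chr/ { n_chr++; next }
        { n_plain++ }
        END {
            if (n_chr && n_plain) print "mixed"
            else if (n_chr)       print "chr"
            else if (n_plain)     print "no-chr"
            else                  print "empty"
        }
    ' "$1"
}
```

The result should match the notation of the reference genome in use. A "mixed" result points to a file that needs cleaning up.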
5. Run References pipeline
Activate the venv and run from the Poppy directory:
POPPY_HOME=/path/to/poppy_repo
source $POPPY_HOME/poppy_env/bin/activate
snakemake --snakefile $POPPY_HOME/workflow/Snakefile_references.smk \
--profile $POPPY_HOME/profiles/grid_engine/ \
--configfiles \
$POPPY_HOME/config/config_references_pipeline_<GENOME>.yaml \
$POPPY_HOME/config/config_<GENOME>.yaml \
--config POPPY_HOME=$POPPY_HOME
The pipeline creates a reference_files folder in the Poppy directory (if run from there), containing the reference files needed to run Poppy:
- artifact_panel_pindel.tsv
- background_panel.tsv
- design.preprocessed.interval_list
- purecn_mapping_bias.rds
- purecn_targets_intervals.txt
- artifact_panel.tsv
- cnvkit.PoN.cnn
- gatk.PoN.hdf5
- purecn_normal_db.rds
- svdb_cnv.vcf
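Whether the references pipeline produced all of these files can be checked with a short loop. This is a sketch only: the function name is hypothetical, and the file names are copied from the list above and assumed to sit directly in the reference_files folder.

```shell
# Hypothetical helper (not part of Poppy): prints the expected reference
# files that are missing from the given directory (default: reference_files).
missing_poppy_references() {
    local dir="${1:-reference_files}" f
    for f in \
        artifact_panel_pindel.tsv background_panel.tsv \
        design.preprocessed.interval_list purecn_mapping_bias.rds \
        purecn_targets_intervals.txt artifact_panel.tsv \
        cnvkit.PoN.cnn gatk.PoN.hdf5 purecn_normal_db.rds svdb_cnv.vcf
    do
        [ -e "$dir/$f" ] || echo "$f"
    done
    return 0
}
```

An empty output means all ten files are present.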
6. Prepare Poppy run
Create samples.tsv and units.tsv for your samples
Use the Hydra-Genetics create-input-files tool to create the samples and units files for the samples you want to process with Poppy:
hydra-genetics create-input-files -d /path/to/fastq/ -p <seq machine> -f
Selected flags:
-d, --directory TEXT path to dir where fastq-files should be looked
for when platform is Illumina.Path to unmapped
BAM files when platform is ONT or PACBIO
-p, --platform TEXT Sequence platform that the data originate from, e.g., NextSeq, MiSeq, Illumina.
-f, --force overwrite existing files
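Before launching, it may be useful to confirm that the fastq paths written to units.tsv actually exist. The sketch below looks up the fastq1 and fastq2 columns by header name, as in the example units.tsv earlier in this document, and assumes a tab-separated file. The function name is hypothetical.

```shell
# Hypothetical helper (not part of Poppy or Hydra-Genetics): prints every
# path from the fastq1/fastq2 columns of a units file that does not exist.
missing_fastqs() {
    awk -F'\t' '
        NR == 1 {
            # remember which columns hold fastq paths
            for (i = 1; i <= NF; i++)
                if ($i == "fastq1" || $i == "fastq2") col[i] = 1
            next
        }
        { for (i in col) if ($i != "") print $i }
    ' "$1" | while IFS= read -r path; do
        [ -e "$path" ] || echo "$path"
    done
}
```

Running `missing_fastqs units.tsv` should print nothing if all input files are in place.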
Check config files
- The desired output files are defined in this config (no need to change anything if running default):
config/output_files.yaml
7. Launch Poppy
Execute the command from the location where snakemake should run and where the results will be saved.
POPPY_HOME=/path/to/poppy_repo
source $POPPY_HOME/poppy_env/bin/activate
snakemake --snakefile $POPPY_HOME/workflow/Snakefile \
--profile $POPPY_HOME/profiles/grid_engine/ \
--configfile $POPPY_HOME/config/config_<GENOME>.yaml \
--config POPPY_HOME=$POPPY_HOME