class: title-slide, center, middle background-image: url(https://www.biovoicenews.com/wp-content/uploads/2016/11/Bioinfo-imp.jpg) background-size: cover # Bioinformatics pipelines ## Alejandra Hernández Segura ### IDS Coding Club ### 22-March-2021 .white-font[ Image credit: [biovoicenews](https://www.biovoicenews.com/wp-content/uploads/2016/11/Bioinfo-imp.jpg) ] --- class: center, middle # Why to build bioinformatics pipelines <img src = "https://f1000researchdata.s3.amazonaws.com/manuscripts/32078/2dd5741c-0e33-41b8-a76b-a26a36ae899c_figure1.gif" width = 950px /> --- class: left, middle # Why to build bioinformatics pipelines. Some extras... - Easier to reuse code (part of Adaptability) -- - Web-based tools are not ideal + Security + Automation + Repetitive (error-prone) -- - We can run them in iRODS + Availability RIVM-wide + Proper file management --- class: inverse, center, middle # Snakemake ### [Latest Publication](https://f1000research.com/articles/10-33/v1) --- class: middle, center <img src = "https://f1000researchdata.s3.amazonaws.com/manuscripts/32078/2dd5741c-0e33-41b8-a76b-a26a36ae899c_figure2.gif", width = 500px /> --- class: middle, center <img src = "https://f1000researchdata.s3.amazonaws.com/manuscripts/32078/2dd5741c-0e33-41b8-a76b-a26a36ae899c_figure3.gif", width = 550px /> --- ## Understanding the scheduling... <img src="https://f1000researchdata.s3.amazonaws.com/manuscripts/32078/2dd5741c-0e33-41b8-a76b-a26a36ae899c_figure4.gif" width="60%" style="display: block; margin: auto;" /> ??? Greenish area depicts jobs ready for scheduling (input files present) Red job generates temporary file which may be deleted once all blue jobs are finished. (b) temporary file has to remain on disk until all blue jobs are finished (c) 3 blue jobs scheduled. Temporary file can be deleted afterwards. --- ## Scheduling has rules: - Availability of resources (cores, memory, etc.) -- - Dependency of jobs -- - First priority jobs -- - Delete temporary files ASAP --- class: middle ## Snakemake report <img src="files/snakemake_report.html" width="70%" style="display: block; margin: auto;" /> --- padding: 1px 1px 1px 1px margin: 5px 5px 5px 5px ``` rule all : input : expand(OUT + "/prokka/{sample}/{sample}.gbk", sample = SAMPLES) rule circlator : input : lambda wildcards : SAMPLES[wildcards.sample]["fasta"] output: OUT + "/circlator/{sample}/{sample}.fasta" shell : circlator fixstart {input} {params} &> {log} rule annotation_prokka : input : fasta = OUT + "/circlator/{sample}/{sample}.fasta" output: OUT + "/prokka/{sample}/{sample}.gbk" shell : """ output_dir={output} sample_name=`basename ${{output_dir}}` output_dir=`dirname ${{output_dir}}` prokka --outdir ${{output_dir}} --force --genus {params.genus} \ --prefix ${{sample_name%.gbk}} --proteins /path/to/db/ \ --addgenes --cpus {threads} {input.fasta} &>> {log} """ ``` --- class: middle ## Include set-up with Python code ``` SAMPLES = { "sample1": "fasta": "path/to/sample1.fasta" "sample2": "fasta": "path/to/sample3.fasta" "sample3": "fasta": "path/to/sample3.fasta" } OUT = "paht/to/my/desired/output_dir" ``` --- class: middle ## Rule all gives final expected output ``` rule all : input : expand(OUT + "/prokka/{sample}/{sample}.gbk", sample = SAMPLES) ``` -- ONLY expected output is created! --- class: middle ## Some rules are included... / Other not... .pull-left[ ``` rule circlator : input : lambda wildcards : SAMPLES[wildcards.sample]["fasta"] output: OUT + "/circlator/{sample}/{sample}.fasta" shell : circlator {input} -o {output} rule annotation_prokka : input : fasta = OUT + "/circlator/{sample}/{sample}.fasta" output: OUT + "/prokka/{sample}/{sample}.gbk" log: OUT + "/log/prokka_log_{sample}.txt" shell : """ prokka {input} &> {log} """ ``` ] -- .pull-right[ ``` rule make_plot : input : OUT + "/prokka/{sample}/{sample}.gbk" output: OUT + "/plots/{sample}.jpg" python : "plotter.py" ``` ] --- class: middle ## Order does not matter ``` rule annotation_prokka : input : fasta = OUT + "/circlator/{sample}/{sample}.fasta" output: OUT + "/prokka/{sample}/{sample}.gbk" log: OUT + "/log/prokka_log_{sample}.txt" shell : """ prokka {input} &> {log} """ rule circlator : input : lambda wildcards : SAMPLES[wildcards.sample]["fasta"] output: OUT + "/circlator/{sample}/{sample}.fasta" shell : circlator {input} -o {output} ``` --- class: left, middle ## Not so basics - Add package management: `conda` or `docker` <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 640 512" class="rfa" style="height:0.75em;fill:currentColor;position:relative;"><path d="M349.9 236.3h-66.1v-59.4h66.1v59.4zm0-204.3h-66.1v60.7h66.1V32zm78.2 144.8H362v59.4h66.1v-59.4zm-156.3-72.1h-66.1v60.1h66.1v-60.1zm78.1 0h-66.1v60.1h66.1v-60.1zm276.8 100c-14.4-9.7-47.6-13.2-73.1-8.4-3.3-24-16.7-44.9-41.1-63.7l-14-9.3-9.3 14c-18.4 27.8-23.4 73.6-3.7 103.8-8.7 4.7-25.8 11.1-48.4 10.7H2.4c-8.7 50.8 5.8 116.8 44 162.1 37.1 43.9 92.7 66.2 165.4 66.2 157.4 0 273.9-72.5 328.4-204.2 21.4.4 67.6.1 91.3-45.2 1.5-2.5 6.6-13.2 8.5-17.1l-13.3-8.9zm-511.1-27.9h-66v59.4h66.1v-59.4zm78.1 0h-66.1v59.4h66.1v-59.4zm78.1 0h-66.1v59.4h66.1v-59.4zm-78.1-72.1h-66.1v60.1h66.1v-60.1z"/></svg> (BETTER REPRODUCIBILITY!) -- - Computer resources <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" class="rfa" style="height:0.75em;fill:currentColor;position:relative;"><path d="M480 160H32c-17.673 0-32-14.327-32-32V64c0-17.673 14.327-32 32-32h448c17.673 0 32 14.327 32 32v64c0 17.673-14.327 32-32 32zm-48-88c-13.255 0-24 10.745-24 24s10.745 24 24 24 24-10.745 24-24-10.745-24-24-24zm-64 0c-13.255 0-24 10.745-24 24s10.745 24 24 24 24-10.745 24-24-10.745-24-24-24zm112 248H32c-17.673 0-32-14.327-32-32v-64c0-17.673 14.327-32 32-32h448c17.673 0 32 14.327 32 32v64c0 17.673-14.327 32-32 32zm-48-88c-13.255 0-24 10.745-24 24s10.745 24 24 24 24-10.745 24-24-10.745-24-24-24zm-64 0c-13.255 0-24 10.745-24 24s10.745 24 24 24 24-10.745 24-24-10.745-24-24-24zm112 248H32c-17.673 0-32-14.327-32-32v-64c0-17.673 14.327-32 32-32h448c17.673 0 32 14.327 32 32v64c0 17.673-14.327 32-32 32zm-48-88c-13.255 0-24 10.745-24 24s10.745 24 24 24 24-10.745 24-24-10.745-24-24-24zm-64 0c-13.255 0-24 10.745-24 24s10.745 24 24 24 24-10.745 24-24-10.745-24-24-24z"/></svg> + threads + memory -- - Parameters -- - Log files <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 384 512" class="rfa" style="height:0.75em;fill:currentColor;position:relative;"><path d="M288 256H96v64h192v-64zm89-151L279.1 7c-4.5-4.5-10.6-7-17-7H256v128h128v-6.1c0-6.3-2.5-12.4-7-16.9zm-153 31V0H24C10.7 0 0 10.7 0 24v464c0 13.3 10.7 24 24 24h336c13.3 0 24-10.7 24-24V160H248c-13.2 0-24-10.8-24-24zM64 72c0-4.42 3.58-8 8-8h80c4.42 0 8 3.58 8 8v16c0 4.42-3.58 8-8 8H72c-4.42 0-8-3.58-8-8V72zm0 64c0-4.42 3.58-8 8-8h80c4.42 0 8 3.58 8 8v16c0 4.42-3.58 8-8 8H72c-4.42 0-8-3.58-8-8v-16zm256 304c0 4.42-3.58 8-8 8h-80c-4.42 0-8-3.58-8-8v-16c0-4.42 3.58-8 8-8h80c4.42 0 8 3.58 8 8v16zm0-200v96c0 8.84-7.16 16-16 16H80c-8.84 0-16-7.16-16-16v-96c0-8.84 7.16-16 16-16h224c8.84 0 16 7.16 16 16z"/></svg> -- - Benchmarking <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" class="rfa" style="height:0.75em;fill:currentColor;position:relative;"><path d="M256,8C119,8,8,119,8,256S119,504,256,504,504,393,504,256,393,8,256,8Zm92.49,313h0l-20,25a16,16,0,0,1-22.49,2.5h0l-67-49.72a40,40,0,0,1-15-31.23V112a16,16,0,0,1,16-16h32a16,16,0,0,1,16,16V256l58,42.5A16,16,0,0,1,348.49,321Z"/></svg> -- --- ## Using snakemake + conda environments for package management .pull-left[ Specify dependencies in a `envs/prokka.yaml` file: ``` name: prokka channels: - bioconda - defaults dependencies: - prokka - blast=2.9 ``` ] -- .pull-right[ Passed to snakemake as: ``` rule prokka_annotation: ... conda: "envs/prokka.yaml" ... ``` ] --- ## Reserving computer resources within a rule Specify resources per rule: ``` rule annotation_prokka: ... threads: 4 resources: mem_mb=12000 ... ``` -- Specify resources per rule using a `configfile.yaml` file: ``` config: configfile.yaml rule annotation_prokka: ... threads: config["threads"]["prokka"] resources: mem_mb=config["mem_mb"]["prokka"] ... ``` --- ## Other additions to rules (parameters, log files, benchmarking) ``` rule annotation_prokka: ... params: genus = "Klebsiella", species = "pneumoniae" log: OUT + "/log/prokka/{sample}.log" benchmark: OUT + "/log/benchmark/prokka_{sample}.txt" shell : """ prokka --genus {params.genus} --species {params.species} {input} &> {log} """ ``` --- ## Real rule (or almost) ``` rule annotation_prokka: input: fasta = OUT + "/circlator/{sample}/{sample}.fasta", protein_db = PROTEIN_DB output: OUT + "/prokka/{sample}/{sample}.gbk" threads: config["threads"]["prokka"] resources: mem_mb=config["mem_mb"]["prokka"] conda: "envs/prokka.yaml" params: genus = lambda wildcards: SAMPLES[wildcards.sample]["genus"], species = lambda wildcards: SAMPLES[wildcards.sample]["species"] log: OUT + "/log/prokka/{sample}.log" benchmark: OUT + "/log/benchmark/prokka_{sample}.txt" shell: """ output_dir={output} sample_name=`basename ${{output_dir}}` output_dir=`dirname ${{output_dir}}` prokka --outdir ${{output_dir}} --force \ --genus {params.genus} \ --species {params.species} \ --prefix ${{sample_name%.gbk}} \ --proteins {input.protein_db} \ --addgenes \ --cpus {threads} {input.fasta} &>> {log} """ ``` --- class: middle ## Modularization is possible - Available wrappers -- - Common-Workflow-Language ([CWL](https://www.commonwl.org/)) support -- - `include:` statement to add another Snakefile -- - Use some (or all) rules from another `Snakemake` pipelines as "modules" or as a sub-workflow. + In modules you truly use (some) rules from another Snakefile. + In sub-workflow you declare dependency to an output from another pipeline. -- - Learn more at the [Snakemake documentation](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#) ??? - Use of remote (on the web) files. + AWS, Google storage, SFTP, HTTP[S], Dropbox, iRODS... + NCBI Entrez/GenBank - Smooth use with containers. - Automatically generate unit tests. + Do not use big files for testing! Containarization of Conda based workflows [Docs](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#containerization-of-conda-based-workflows): `snakemake --containerize > Dockerfile` Automated unit tests: [Docs](https://snakemake.readthedocs.io/en/stable/snakefiles/testing.html) `snakemake --generate-unit-tests` --- ## General options in Snakemake .pull-left[ - `cores` (and `local_cores`) - `default_resources` - `config` (overrides config file) - `dryrun` - `keepgoing` - `drmaa` - `unlock` - `cleanup-metadata` - `conda_cleanup_envs` - `restart-times` - `use-conda` - `use-singularity` - `singularity_args` ] .pull-right[ ``` snakemake --profile config --cores $CORES \ --drmaa " -q ${QUEUE} -n {threads} \ -o ${OUTPUT_DIR}/log/drmaa/{name}_{wildcards}_{jobid}.out \ -e ${OUTPUT_DIR}/log/drmaa/{name}_{wildcards}_{jobid}.err \ -R \"span[hosts=1] rusage[mem={resources.mem_mb}]\" " \ --drmaa-log-dir ${OUTPUT_DIR}/log/drmaa ${@} ``` ] --- class: middle # Why a wrapper? - Who is going go use the pipeline? + iRODS 😊 -- + Non-bioinformaticians 😖 -- - Future you might not remember how to use it... -- - Able to add "set-up" -- - `bash` or `python` <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 448 512" class="rfa" style="height:0.75em;fill:currentColor;position:relative;"><path d="M439.8 200.5c-7.7-30.9-22.3-54.2-53.4-54.2h-40.1v47.4c0 36.8-31.2 67.8-66.8 67.8H172.7c-29.2 0-53.4 25-53.4 54.3v101.8c0 29 25.2 46 53.4 54.3 33.8 9.9 66.3 11.7 106.8 0 26.9-7.8 53.4-23.5 53.4-54.3v-40.7H226.2v-13.6h160.2c31.1 0 42.6-21.7 53.4-54.2 11.2-33.5 10.7-65.7 0-108.6zM286.2 404c11.1 0 20.1 9.1 20.1 20.3 0 11.3-9 20.4-20.1 20.4-11 0-20.1-9.2-20.1-20.4.1-11.3 9.1-20.3 20.1-20.3zM167.8 248.1h106.8c29.7 0 53.4-24.5 53.4-54.3V91.9c0-29-24.4-50.7-53.4-55.6-35.8-5.9-74.7-5.6-106.8.1-45.2 8-53.4 24.7-53.4 55.6v40.7h106.9v13.6h-147c-31.1 0-58.3 18.7-66.8 54.2-9.8 40.7-10.2 66.1 0 108.6 7.6 31.6 25.7 54.2 56.8 54.2H101v-48.8c0-35.3 30.5-66.4 66.8-66.4zm-6.7-142.6c-11.1 0-20.1-9.1-20.1-20.3.1-11.3 9-20.4 20.1-20.4 11 0 20.1 9.2 20.1 20.4s-9 20.3-20.1 20.3z"/></svg> wrapper? --- class: middle, center, inverse ## Questions? ### Ask them now or later! ☺️