From 0a35b98ae493dd197b36770fe13aa360ff9c3101 Mon Sep 17 00:00:00 2001
From: Benjamin Rombaut
Date: Thu, 12 Sep 2024 10:33:29 +0200
Subject: [PATCH] add slides

---
 book/disk_based/disk_based_pipelines.qmd |  2 +-
 slides/slides.qmd                        | 95 ++++++++++++++++++++++++
 2 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/book/disk_based/disk_based_pipelines.qmd b/book/disk_based/disk_based_pipelines.qmd
index 9ae160a..fede57f 100644
--- a/book/disk_based/disk_based_pipelines.qmd
+++ b/book/disk_based/disk_based_pipelines.qmd
@@ -197,4 +197,4 @@ docker run -it -v $(pwd)/usecase:/app/usecase -v $(pwd)/book:/app/book berombau/
 
 Another approach is to use **multi-package containers**. Tools like [Multi-Package BioContainers](https://midnighter.github.io/mulled/) and [Seqera Containers](https://seqera.io/containers/) can make this quick and easy, by allowing for custom combinations of packages.
 
-You can go a long way with a folder of notebooks or scripts and the right tools. But as your project grows more bespoke, it can be worth the effort to use a **[workflow framework](../workflow_frameworks)** like Nextflow or Snakemake to manage the pipeline for you.
+You can go a long way with a folder of notebooks or scripts and the right tools. But as your project grows more bespoke, it can be worth the effort to use a **[workflow framework](../workflow_frameworks)** like Viash, Nextflow or Snakemake to manage the pipeline for you.
diff --git a/slides/slides.qmd b/slides/slides.qmd
index b9982d9..6b31b39 100644
--- a/slides/slides.qmd
+++ b/slides/slides.qmd
@@ -340,6 +340,24 @@ adata
 
 # Disk-based interoperability
 
+Disk-based interoperability is a strategy for making tools written in different programming languages work together by **storing intermediate results in standardized, language-agnostic file formats**.
+
+Upsides:
+- Simple: just add lines to read and write the intermediate files
+- Modular scripts
+
+Downsides:
+- Increased disk usage
+- Less direct interaction, harder debugging
+
+# Important features of interoperable file formats
+
+- Compression
+- Sparse matrix support
+- Large images
+- Lazy chunk loading
+- Remote storage
+
 ## General single cell file formats of interest for Python and R
 
 {{< include ../book/disk_based/_general_file_formats.qmd >}}
@@ -348,6 +366,83 @@ adata
 
 {{< include ../book/disk_based/_specialized_file_formats.qmd >}}
+
+# Disk-based pipelines
+
+Script pipeline:
+```bash
+#!/bin/bash
+
+bash scripts/1_load_data.sh
+python scripts/2_compute_pseudobulk.py
+Rscript scripts/3_analysis_de.R
+```
+
+Notebook pipeline:
+```bash
+# Every step can be a new notebook execution with inspectable output
+jupyter nbconvert --to notebook --execute my_notebook.ipynb --allow-errors --output-dir outputs/
+```
+
+## Just stay in your language and call scripts
+```python
+import subprocess
+
+subprocess.run("bash scripts/1_load_data.sh", shell=True)
+# Alternatively, you can run Python code here instead of calling a Python script
+subprocess.run("python scripts/2_compute_pseudobulk.py", shell=True)
+subprocess.run("Rscript scripts/3_analysis_de.R", shell=True)
+```
+
+# Pipelines with different environments
+
+1. Interleave with environment (de)activation functions
+2. Use rvenv
+3. Use Pixi
+
+## Pixi to manage different environments
+
+```bash
+pixi run -e bash scripts/1_load_data.sh
+pixi run -e scverse scripts/2_compute_pseudobulk.py
+pixi run -e rverse scripts/3_analysis_de.R
+```
+
+## Define tasks in Pixi
+
+```bash
+...
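+# Note (illustrative comment, not part of the original pixi.toml): each
+# [feature.<name>.tasks] table defines the tasks of one feature. In Pixi, an
+# [environments] table in the same file typically maps environment names such
+# as "scverse" and "rverse" to these features, which is what `pixi run -e`
+# selects on the previous slide.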
+[feature.bash.tasks] +load_data = "bash book/disk_based/scripts/1_load_data.sh" +... +[feature.scverse.tasks] +compute_pseudobulk = "python book/disk_based/scripts/2_compute_pseudobulk.py" +... +[feature.rverse.tasks] +analysis_de = "Rscript --no-init-file book/disk_based/scripts/3_analysis_de.R" +... +[tasks] +pipeline = { depends-on = ["load_data", "compute_pseudobulk", "analysis_de"] } +``` +```bash +pixi run pipeline +``` + +## Also possible to use containers + +```bash +docker pull berombau/polygloty-docker:latest +docker run -it -v $(pwd)/usecase:/app/usecase -v $(pwd)/book:/app/book berombau/polygloty-docker:latest pixi run pipeline +``` + +Another approach is to use multi-package containers to create custom combinations of packages. +- [Multi-Package BioContainers](https://midnighter.github.io/mulled/) +- [Seqera Containers](https://seqera.io/containers/) + + # Workflows +You can go a long way with a folder of notebooks or scripts and the right tools. But as your project grows more bespoke, it can be worth the effort to use a **[workflow framework](../workflow_frameworks)** like Viash, Nextflow or Snakemake to manage the pipeline for you. + +See https://saeyslab.github.io/polygloty/book/workflow_frameworks/ + # Takeaways
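+To make the workflow framework suggestion above concrete, here is a minimal Snakemake sketch of the same three-step pipeline (a hypothetical Snakefile: the rule names and intermediate file paths are made up; the scripts are the ones from the earlier slides):
+
+```python
+# Snakefile: each rule wraps one script and declares the files it reads and
+# writes, so Snakemake can order the steps and re-run only what is out of date.
+rule all:
+    input: "usecase/results/de_results.csv"
+
+rule load_data:
+    output: "usecase/data/adata.h5ad"
+    shell: "bash scripts/1_load_data.sh"
+
+rule compute_pseudobulk:
+    input: "usecase/data/adata.h5ad"
+    output: "usecase/data/pseudobulk.csv"
+    shell: "python scripts/2_compute_pseudobulk.py"
+
+rule analysis_de:
+    input: "usecase/data/pseudobulk.csv"
+    output: "usecase/results/de_results.csv"
+    shell: "Rscript scripts/3_analysis_de.R"
+```
+
+Running `snakemake --cores 1` would then execute only the steps whose outputs are missing or outdated.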