add slides
berombau committed Sep 12, 2024
1 parent 3e1bf49 commit 0a35b98
Showing 2 changed files with 96 additions and 1 deletion.
2 changes: 1 addition & 1 deletion book/disk_based/disk_based_pipelines.qmd
@@ -197,4 +197,4 @@ docker run -it -v $(pwd)/usecase:/app/usecase -v $(pwd)/book:/app/book berombau/polygloty-docker:latest pixi run pipeline
Another approach is to use **multi-package containers**. Tools like [Multi-Package BioContainers](https://midnighter.github.io/mulled/) and [Seqera Containers](https://seqera.io/containers/) can make this quick and easy, by allowing for custom combinations of packages.
-You can go a long way with a folder of notebooks or scripts and the right tools. But as your project grows more bespoke, it can be worth the effort to use a **[workflow framework](../workflow_frameworks)** like Nextflow or Snakemake to manage the pipeline for you.
+You can go a long way with a folder of notebooks or scripts and the right tools. But as your project grows more bespoke, it can be worth the effort to use a **[workflow framework](../workflow_frameworks)** like Viash, Nextflow or Snakemake to manage the pipeline for you.
95 changes: 95 additions & 0 deletions slides/slides.qmd
@@ -340,6 +340,24 @@ adata

# Disk-based interoperability

Disk-based interoperability is a strategy for achieving interoperability between tools written in different programming languages by **storing intermediate results in standardized, language-agnostic file formats**.

Upside:
- Simple: just add reading and writing lines
- Modular scripts

Downside:
- Increased disk usage
- Less direct interaction and debugging
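
A minimal sketch of this strategy in Python, assuming `pandas` with a Parquet engine (e.g. `pyarrow`) is installed; the file name and columns are illustrative:

```python
import pandas as pd

# A Python step computes an intermediate result...
pseudobulk = pd.DataFrame({"sample": ["s1", "s2"], "gene_a": [10, 20]})

# ...and stores it in a standardized, language-agnostic format (Parquet),
# so the next step can be written in R, Julia, or any other language
pseudobulk.to_parquet("pseudobulk.parquet")

# Any later step reads the intermediate result back from disk
df = pd.read_parquet("pseudobulk.parquet")
```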

# Important features of interoperable file formats

- Compression
- Sparse matrix support
- Large images
- Lazy chunk loading
- Remote storage
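
Several of these features show up in one sketch using `zarr`, assuming the `zarr` and `numpy` packages are installed; the array shape and file name are illustrative. The same API also works against remote object stores.

```python
import numpy as np
import zarr

# A chunked on-disk array, compressed with the default codec;
# only the chunks that are touched get read into memory
z = zarr.open(
    "image.zarr",
    mode="w",
    shape=(100_000, 100_000),
    chunks=(1_000, 1_000),
    dtype="uint16",
)

# Writing one tile only materializes the chunks it overlaps
z[0:1_000, 0:1_000] = np.random.randint(
    0, 2**16, size=(1_000, 1_000), dtype="uint16"
)

# Reading a slice lazily loads just the relevant chunks
tile = z[0:500, 0:500]
```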

## General single cell file formats of interest for Python and R

{{< include ../book/disk_based/_general_file_formats.qmd >}}
@@ -348,6 +366,83 @@ adata

{{< include ../book/disk_based/_specialized_file_formats.qmd >}}

# Disk-based pipelines

Script pipeline:
```bash
#!/bin/bash
set -e # abort the pipeline as soon as one step fails

bash scripts/1_load_data.sh
python scripts/2_compute_pseudobulk.py
Rscript scripts/3_analysis_de.R
```

Notebook pipeline:
```bash
# Every step can be a new notebook execution with inspectable output
jupyter nbconvert --to notebook --execute my_notebook.ipynb --allow-errors --output-dir outputs/
```
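
The same execution can also be driven programmatically. A sketch using the `nbformat` and `nbconvert` APIs, assuming both packages are installed and `my_notebook.ipynb` exists:

```python
import os

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Load the notebook and execute every cell
nb = nbformat.read("my_notebook.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=600, allow_errors=True)
ep.preprocess(nb, {"metadata": {"path": "."}})

# Save the executed notebook, outputs included, for inspection
os.makedirs("outputs", exist_ok=True)
nbformat.write(nb, "outputs/my_notebook.ipynb")
```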

## Just stay in your language and call scripts
```python
import subprocess

subprocess.run("bash scripts/1_load_data.sh", shell=True)
# Alternatively you can run Python code here instead of calling a Python script
subprocess.run("python scripts/2_compute_pseudobulk.py", shell=True)
subprocess.run("Rscript scripts/3_analysis_de.R", shell=True)
```

# Pipelines with different environments

1. Interleave with environment (de)activation functions (see the sketch after this list)
2. Use rvenv
3. Use Pixi
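
A sketch of the first option driven from Python via `conda run`, which avoids explicit activate/deactivate calls; it assumes conda is available and that environments named `scverse` and `rverse` exist (the names are illustrative):

```python
import subprocess

# Each step runs inside its own environment, so the Python and R
# dependency stacks never have to coexist in a single environment
subprocess.run(
    "conda run -n scverse python scripts/2_compute_pseudobulk.py",
    shell=True, check=True,
)
subprocess.run(
    "conda run -n rverse Rscript scripts/3_analysis_de.R",
    shell=True, check=True,
)
```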

## Pixi to manage different environments

```bash
pixi run -e bash bash scripts/1_load_data.sh
pixi run -e scverse python scripts/2_compute_pseudobulk.py
pixi run -e rverse Rscript scripts/3_analysis_de.R
```

## Define tasks in Pixi

```toml
...
[feature.bash.tasks]
load_data = "bash book/disk_based/scripts/1_load_data.sh"
...
[feature.scverse.tasks]
compute_pseudobulk = "python book/disk_based/scripts/2_compute_pseudobulk.py"
...
[feature.rverse.tasks]
analysis_de = "Rscript --no-init-file book/disk_based/scripts/3_analysis_de.R"
...
[tasks]
pipeline = { depends-on = ["load_data", "compute_pseudobulk", "analysis_de"] }
```

```bash
pixi run pipeline
```

## Also possible to use containers

```bash
docker pull berombau/polygloty-docker:latest
docker run -it -v $(pwd)/usecase:/app/usecase -v $(pwd)/book:/app/book berombau/polygloty-docker:latest pixi run pipeline
```

Another approach is to use **multi-package containers** to create custom combinations of packages:

- [Multi-Package BioContainers](https://midnighter.github.io/mulled/)
- [Seqera Containers](https://seqera.io/containers/)

# Workflows

You can go a long way with a folder of notebooks or scripts and the right tools. But as your project grows more bespoke, it can be worth the effort to use a **[workflow framework](../workflow_frameworks)** like Viash, Nextflow or Snakemake to manage the pipeline for you.

See https://saeyslab.github.io/polygloty/book/workflow_frameworks/

# Takeaways
