Commit

Merge branch 'main' of github.com:saeyslab/polygloty

LouiseDck committed Sep 7, 2024
2 parents bb43f45 + 0862e45 commit 677f90d
Showing 24 changed files with 807 additions and 62 deletions.
55 changes: 0 additions & 55 deletions .github/workflows/test-mac-arm.yml

This file was deleted.

4 changes: 4 additions & 0 deletions book/in_memory_interoperability.qmd
@@ -62,6 +62,10 @@ The `anndata2ri` package provides, apart from functionality to convert SingleCellExperiment …

TODO: how to subscript a sparse matrix? Is it possible?

```{r include=FALSE}
library(SingleCellExperiment)
```

```{python rpy2_sparse}
import scipy as sp
# ... (rest of this chunk truncated in the diff view)
```
50 changes: 43 additions & 7 deletions book/on_disk_interoperability.qmd
@@ -11,29 +11,65 @@ The downside is that it can lead to increased storage requirements and I/O overhead.

Debugging is only possible for individual scripts, and the workflow is not as interactive and explorative as the in-memory approach.

## Different files
## Different file formats

Data format based interoperability
It's important to differentiate between language-specific and language-agnostic file formats. Most languages can serialize objects to disk: R has the `.RDS` format and Python has the `.pickle`/`.pkl` format, but these are not interoperable. Older versions of a language can also have problems reading serialized objects created by the latest version.
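
As a minimal sketch of the difference (file names are hypothetical), compare Python's `pickle` with a language-agnostic format such as CSV:

```python
import pickle

import pandas as pd

df = pd.DataFrame({"gene": ["A", "B"], "count": [10, 3]})

# language-specific: only Python can reliably read this back
with open("df.pkl", "wb") as f:
    pickle.dump(df, f)

# language-agnostic: any language with a CSV parser can read this
df.to_csv("df.csv", index=False)
```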

For a file format to be language agnostic, it should have a mature standard describing how the data is stored on disk. This standard should be implemented in multiple languages, and the implementations should be documented and tested for compatibility. Some file formats have a reference implementation in C, which can be used to create bindings for other languages. Most also have a status page that lists the available implementations and which version, or how much, of the standard each supports.



### Dataframes

| File Format | Support in Python | Support in R | Text-Based | Binary | High-Performance | Ease-of-Use | Compression | Suitable for Single-Cell | Suitable for Spatial |
|-------------|-------------------|--------------|------------|--------|------------------|-------------|-------------|--------------------------|----------------------|
| CSV | Yes | Yes | Yes | No | No | High | No | Limited | No |
| JSON | Yes | Yes | Yes | No | No | Medium | No | Limited | No |
| Parquet | Yes | Yes | No | Yes | Yes | High | Yes | Limited | No |
| Feather | Yes | Yes | No | Yes | Yes | High | Yes | Yes | No |
| HDF5 | Yes | Yes | No | Yes | Yes | Medium | Yes | Yes | Limited |
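
For example, a data frame written to Parquet from Python can be read in R with the `arrow` package. A minimal sketch, assuming `pandas` with the `pyarrow` engine installed (file names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"cell": ["c1", "c2"], "n_genes": [2100, 1850]})

# write a language-agnostic, compressed, columnar file
df.to_parquet("cells.parquet")

# read it back in Python ...
df2 = pd.read_parquet("cells.parquet")
# ... or in R with: arrow::read_parquet("cells.parquet")
```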


### n-dimensional arrays

| File Format | Support in Python | Support in R | Text-Based | Binary | High-Performance | Ease-of-Use | Compression | Suitable for Single-Cell | Suitable for Spatial |
|-------------|-------------------|--------------|------------|--------|------------------|-------------|-------------|--------------------------|----------------------|
| h5ad | Yes | Limited | No | Yes | Yes | Medium | Yes | Yes | Yes |
| Zarr | Yes | Yes | No | Yes | Yes | Medium | Yes | Yes | Yes |
| NumPy (npy) | Yes | Limited | No | Yes | Yes | High | No | Yes | No |
| NetCDF | Yes | Yes | No | Yes | Yes | Medium | Yes | Yes | Yes |
| TIFF | Yes | Limited | No | Yes | Medium | Medium | Yes | No | Yes |

The formats most relevant for single-cell data are h5ad, Zarr and Apache Arrow; reading and writing h5ad and Zarr is sketched below.
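
A minimal sketch with toy data, assuming the `anndata` Python package is installed (file names are hypothetical):

```python
import anndata as ad
import numpy as np

# toy AnnData object: 10 cells x 5 genes
adata = ad.AnnData(X=np.random.rand(10, 5))

# h5ad: HDF5-based; readable from R via e.g. zellkonverter or anndataR
adata.write_h5ad("data.h5ad")
adata_h5 = ad.read_h5ad("data.h5ad")

# Zarr: chunked on-disk store, convenient for lazy and cloud access
adata.write_zarr("data.zarr")
adata_zarr = ad.read_zarr("data.zarr")
```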

## Different on-disk pipelines

You can use a shell script to run the pipeline in a sequential manner. This requires all the dependencies to be installed in one large environment.

Usually you start in a notebook with exploratory analysis, then move to scripts for reproducibility, and finally to a pipeline for scalability.

The scripts in such a script pipeline are a collection of the code snippets from the notebooks; they can be written in different languages and executed in sequence, as sketched below.
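
A minimal sketch of such a driver script (script names and outputs are hypothetical):

```bash
#!/usr/bin/env bash
set -euo pipefail

# each step reads the previous step's output from disk
python scripts/01_preprocess.py               # writes data/preprocessed.h5ad
python scripts/02_compute_pseudobulk.py       # writes data/pseudobulk.h5ad
Rscript scripts/03_differential_expression.R  # writes data/de_results.csv
```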

Alternatively, there are frameworks that keep the notebooks and create a pipeline from them. The upside is that you avoid converting the code snippets in the notebooks to scripts. The downside is that you have to use a specific framework, and the notebooks can become very large and unwieldy.

## Notebook pipelines

You can use [Quarto](https://quarto.org/) to run code snippets in different languages in the same `.qmd` notebook. Our [Use-case chapter](./usecase/) is one example of this.

For example, [Papermill](https://github.com/nteract/papermill) can execute Jupyter notebooks in sequence and pass variables between them. [Ploomber](https://github.com/ploomber/ploomber) is another example.
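
A minimal sketch of executing a parameterized notebook with Papermill (notebook names and parameters are hypothetical):

```python
import papermill as pm

# execute the notebook with injected parameters;
# the executed copy, including outputs, is written to outputs/
pm.execute_notebook(
    "my_notebook.ipynb",
    "outputs/my_notebook_executed.ipynb",
    parameters={"input_path": "data/raw.h5ad"},
)
```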

### Execute notebooks via the CLI

Jupyter notebooks can be executed via [nbconvert](https://nbconvert.readthedocs.io/en/latest/#):
```bash
jupyter nbconvert --to notebook --execute my_notebook.ipynb --allow-errors --output-dir outputs/
```

[RMarkdown](https://rmarkdown.rstudio.com/):
```bash
Rscript -e "rmarkdown::render('my_notebook.Rmd', params = list(args = myarg))"
```

## Script pipelines

### Calling scripts in the same environment
3 changes: 3 additions & 0 deletions book/workflow_frameworks/examples/viash_nextflow/.gitignore
@@ -0,0 +1,3 @@
/work
/target
/.nextflow*
3 changes: 3 additions & 0 deletions book/workflow_frameworks/examples/viash_nextflow/_viash.yaml
@@ -0,0 +1,3 @@
name: polygloty_usecase
version: 0.1.0
viash_version: 0.9.0
@@ -0,0 +1,56 @@
name: compute_pseudobulk
description: Compute pseudobulk expression from anndata object

argument_groups:
  - name: Inputs
    arguments:
      - type: file
        name: --input
        description: Path to the input h5ad file
        example: /path/to/input.h5ad
        required: true
  - name: Pseudobulk arguments
    arguments:
      - type: string
        name: --obs_column_index
        description: Name of the column to pseudobulk on
        example: cell_type
        required: true
      - type: string
        name: --obs_column_values
        description: List of column names for the new obs data frame
        example: ["batch", "sample"]
        multiple: true
        required: true
  - name: Outputs
    arguments:
      - type: file
        name: --output
        description: Path to the output h5ad file
        example: /path/to/output.h5ad
        required: true
        direction: output

resources:
  - type: python_script
    path: script.py

test_resources:
  - type: python_script
    path: test.py

engines:
  - type: docker
    image: python:3.10
    setup:
      - type: python
        pypi:
          - anndata
    test_setup:
      - type: python
        pypi:
          - viashpy

runners:
  - type: executable
  - type: nextflow
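
Assuming this config is saved as `config.vsh.yaml` (the actual path is not shown in this diff) and the viash CLI is installed, the component can be run and tested from the command line:

```bash
# build and run the component in one go
viash run config.vsh.yaml -- \
  --input input.h5ad \
  --obs_column_index cell_type \
  --obs_column_values batch \
  --output pseudobulk.h5ad

# run the component's test suite (test.py via viashpy)
viash test config.vsh.yaml
```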
@@ -0,0 +1,41 @@
import anndata as ad
import pandas as pd
import numpy as np

## VIASH START
par = {"input": "", "obs_column_index": "", "obs_column_values": [], "output": ""}
## VIASH END

print("Load data", flush=True)
adata = ad.read_h5ad(par["input"])

print(f"Format of input data: {adata}", flush=True)
assert par["obs_column_index"] in adata.obs.columns, f"Column '{par['obs_column']}' not found in obs."
for col in par["obs_column_values"]:
assert col in adata.obs.columns, f"Column '{col}' not found in obs."

print("Compute pseudobulk", flush=True)
X = adata.X
if not isinstance(X, np.ndarray):
X = X.toarray()
combined = pd.DataFrame(
X,
index=adata.obs[par["obs_column_index"]],
)
combined.columns = adata.var_names
pb_X = combined.groupby(level=0).sum()

print("Construct obs for pseudobulk")
pb_obs = adata.obs[par["obs_column_values"]].copy()
pb_obs.index = adata.obs[par["obs_column_index"]]
pb_obs = pb_obs.drop_duplicates()

print("Create AnnData object")
pb_adata = ad.AnnData(
X=pb_X.loc[pb_obs.index].values,
obs=pb_obs,
var=adata.var,
)

print("Store to disk")
pb_adata.write_h5ad(par["output"], compression="gzip")
@@ -0,0 +1,43 @@
import sys
import anndata as ad
import pytest
import numpy as np

def test_compute_pseudobulk(run_component, tmp_path):
    input_path = tmp_path / "input.h5ad"
    output_path = tmp_path / "output.h5ad"

    # create data
    adata_in = ad.AnnData(
        X=np.array([[1, 2], [3, 4], [5, 6], [7, 8]]),
        obs={
            "cell_type": ["A", "A", "B", "B"],
            "time": [1, 2, 1, 2],
            "condition": ["ctrl", "ctrl", "trt", "trt"],
        },
        var={"highly_variable": [True, False]},
    )

    adata_in.write_h5ad(input_path)

    # run component
    run_component([
        "--input", str(input_path),
        "--obs_column_index", "cell_type",
        "--obs_column_values", "condition",
        "--output", str(output_path),
    ])

    # load output
    adata_out = ad.read_h5ad(output_path)

    # check output
    assert adata_out.X.shape == (2, 2)
    assert np.all(adata_out.X == np.array([[4, 6], [12, 14]]))
    assert adata_out.obs.index.tolist() == ["A", "B"]
    assert adata_out.obs["condition"].tolist() == ["ctrl", "trt"]
    assert adata_out.var["highly_variable"].tolist() == [True, False]


if __name__ == "__main__":
    sys.exit(pytest.main([__file__]))
@@ -0,0 +1,68 @@
name: differential_expression
description: Compute differential expression between two observation types

argument_groups:
  - name: Inputs
    arguments:
      - type: file
        name: --input
        description: Path to the input h5ad file
        example: /path/to/input.h5ad
        required: true
  - name: Differential expression arguments
    arguments:
      - type: string
        name: --contrast
        description: |
          Contrast to compute. Must be of length 3:
          1. The name of the column to contrast on
          2. The name of the first observation type
          3. The name of the second observation type
        example: ["cell_type", "B", "A"]
        multiple: true
        required: true
      - type: string
        name: --design_formula
        description: Design formula for the differential expression model
        example: ~ batch + cell_type
  - name: Outputs
    arguments:
      - type: file
        name: --output
        description: Path to the output h5ad file
        example: /path/to/output.h5ad
        required: true
        direction: output

resources:
  - type: r_script
    path: script.R

test_resources:
  - type: r_script
    path: test.R

engines:
  - type: docker
    image: rocker/r2u:22.04
    setup:
      - type: apt
        packages:
          - python3
          - python3-pip
          - python3-dev
          - python-is-python3
      - type: python
        pypi:
          - anndata
      - type: r
        cran:
          - anndata
          - processx
        bioc:
          - DESeq2

runners:
  - type: executable
  - type: nextflow
