Commit

Merge branch 'main' of github.com:saeyslab/polygloty

LouiseDck committed Sep 7, 2024
2 parents bb43f45 + 0862e45 commit 677f90d
Showing 24 changed files with 807 additions and 62 deletions.
55 changes: 0 additions & 55 deletions .github/workflows/test-mac-arm.yml

This file was deleted.

4 changes: 4 additions & 0 deletions book/in_memory_interoperability.qmd
@@ -62,6 +62,10 @@ The `anndata2ri` package provides, apart from functionality to convert SingleCellExperiment …

TODO: how to subscript a sparse matrix? Is it possible?

```{r include=FALSE}
library(SingleCellExperiment)
```

```{python rpy2_sparse}
import scipy as sp
# ... (rest of this chunk truncated in the diff view)
```
50 changes: 43 additions & 7 deletions book/on_disk_interoperability.qmd
@@ -11,29 +11,65 @@ The downside is that it can lead to increased storage requirements and I/O overhead.

Debugging is only possible for individual scripts, and the workflow is not as interactive and explorative as the in-memory approach.

## Different files
## Different file formats

Data format based interoperability
It's important to differentiate between language-specific and language-agnostic file formats. Most languages can serialize objects to disk: R has the `.RDS` format and Python has the `.pickle`/`.pkl` format, but these are not interoperable. Older versions of a language can also have problems reading serialized objects created by the latest version.
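
As a minimal sketch of the difference (file names are hypothetical), compare Python's `pickle` with a language-agnostic format such as CSV:

```python
import pickle

import pandas as pd

df = pd.DataFrame({"gene": ["A", "B"], "count": [10, 3]})

# language-specific: only Python can reliably read this back
with open("df.pkl", "wb") as f:
    pickle.dump(df, f)

# language-agnostic: any language with a CSV parser can read this
df.to_csv("df.csv", index=False)
```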

For a file format to be language agnostic, it should have a mature standard describing how the data is stored on disk. This standard should be implemented in multiple languages, and the implementations should be documented and tested for compatibility. Some file formats have a reference implementation in C, which can be used to create bindings for other languages. Most also have a status page that lists the available implementations and which version, or how much, of the standard each supports.



### Dataframes

| File Format | Support in Python | Support in R | Text-Based | Binary | High-Performance | Ease-of-Use | Compression | Suitable for Single-Cell | Suitable for Spatial |
|-------------|-------------------|--------------|------------|--------|------------------|-------------|-------------|--------------------------|----------------------|
| CSV | Yes | Yes | Yes | No | No | High | No | Limited | No |
| JSON | Yes | Yes | Yes | No | No | Medium | No | Limited | No |
| Parquet | Yes | Yes | No | Yes | Yes | High | Yes | Limited | No |
| Feather | Yes | Yes | No | Yes | Yes | High | Yes | Yes | No |
| HDF5 | Yes | Yes | No | Yes | Yes | Medium | Yes | Yes | Limited |
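
For example, a data frame written to Parquet from Python can be read in R with the `arrow` package. A minimal sketch, assuming `pandas` with the `pyarrow` engine installed (file names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"cell": ["c1", "c2"], "n_genes": [2100, 1850]})

# write a language-agnostic, compressed, columnar file
df.to_parquet("cells.parquet")

# read it back in Python ...
df2 = pd.read_parquet("cells.parquet")
# ... or in R with: arrow::read_parquet("cells.parquet")
```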


### n-dimensional arrays

| File Format | Support in Python | Support in R | Text-Based | Binary | High-Performance | Ease-of-Use | Compression | Suitable for Single-Cell | Suitable for Spatial |
|-------------|-------------------|--------------|------------|--------|------------------|-------------|-------------|--------------------------|----------------------|
| h5ad | Yes | Limited | No | Yes | Yes | Medium | Yes | Yes | Yes |
| Zarr | Yes | Yes | No | Yes | Yes | Medium | Yes | Yes | Yes |
| NumPy (npy) | Yes | Limited | No | Yes | Yes | High | No | Yes | No |
| NetCDF | Yes | Yes | No | Yes | Yes | Medium | Yes | Yes | Yes |
| TIFF | Yes | Limited | No | Yes | Medium | Medium | Yes | No | Yes |

The formats most relevant for single-cell data are h5ad, Zarr and Apache Arrow; reading and writing h5ad and Zarr is sketched below.
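
A minimal sketch with toy data, assuming the `anndata` Python package is installed (file names are hypothetical):

```python
import anndata as ad
import numpy as np

# toy AnnData object: 10 cells x 5 genes
adata = ad.AnnData(X=np.random.rand(10, 5))

# h5ad: HDF5-based; readable from R via e.g. zellkonverter or anndataR
adata.write_h5ad("data.h5ad")
adata_h5 = ad.read_h5ad("data.h5ad")

# Zarr: chunked on-disk store, convenient for lazy and cloud access
adata.write_zarr("data.zarr")
adata_zarr = ad.read_zarr("data.zarr")
```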

## Different on-disk pipelines

You can use a shell script to run the pipeline in a sequential manner. This requires all the dependencies to be installed in one large environment.

Usually you start in a notebook with exploratory analysis, then move to scripts for reproducibility, and finally to a pipeline for scalability.

The scripts in such a script pipeline are a collection of the code snippets from the notebooks; they can be written in different languages and executed in sequence, as sketched below.
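
A minimal sketch of such a driver script (script names and outputs are hypothetical):

```bash
#!/usr/bin/env bash
set -euo pipefail

# each step reads the previous step's output from disk
python scripts/01_preprocess.py               # writes data/preprocessed.h5ad
python scripts/02_compute_pseudobulk.py       # writes data/pseudobulk.h5ad
Rscript scripts/03_differential_expression.R  # writes data/de_results.csv
```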

Alternatively, there are frameworks that keep the notebooks and create a pipeline from them. The upside is that you avoid converting the code snippets in the notebooks to scripts. The downside is that you have to use a specific framework, and the notebooks can become very large and unwieldy.

## Notebook pipelines

You can use [Quarto](https://quarto.org/) to run code snippets in different languages in the same `.qmd` notebook. Our [Use-case chapter](./usecase/) is one example of this.

For example, [Papermill](https://github.com/nteract/papermill) can execute Jupyter notebooks in sequence and pass variables between them. [Ploomber](https://github.com/ploomber/ploomber) is another example.
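
A minimal sketch of executing a parameterized notebook with Papermill (notebook names and parameters are hypothetical):

```python
import papermill as pm

# execute the notebook with injected parameters;
# the executed copy, including outputs, is written to outputs/
pm.execute_notebook(
    "my_notebook.ipynb",
    "outputs/my_notebook_executed.ipynb",
    parameters={"input_path": "data/raw.h5ad"},
)
```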

### Execute notebooks via the CLI

Jupyter notebooks can be executed via [nbconvert](https://nbconvert.readthedocs.io/en/latest/#):
```bash
jupyter nbconvert --to notebook --execute my_notebook.ipynb --allow-errors --output-dir outputs/
```

[RMarkdown](https://rmarkdown.rstudio.com/):
```bash
Rscript -e "rmarkdown::render('my_notebook.Rmd', params = list(args = myarg))"
```

## Script pipelines

### Calling scripts in the same environment
3 changes: 3 additions & 0 deletions book/workflow_frameworks/examples/viash_nextflow/.gitignore
@@ -0,0 +1,3 @@
/work
/target
/.nextflow*
3 changes: 3 additions & 0 deletions book/workflow_frameworks/examples/viash_nextflow/_viash.yaml
@@ -0,0 +1,3 @@
name: polygloty_usecase
version: 0.1.0
viash_version: 0.9.0
@@ -0,0 +1,56 @@
name: compute_pseudobulk
description: Compute pseudobulk expression from anndata object

argument_groups:
  - name: Inputs
    arguments:
      - type: file
        name: --input
        description: Path to the input h5ad file
        example: /path/to/input.h5ad
        required: true
  - name: Pseudobulk arguments
    arguments:
      - type: string
        name: --obs_column_index
        description: Name of the column to pseudobulk on
        example: cell_type
        required: true
      - type: string
        name: --obs_column_values
        description: List of column names for the new obs data frame
        example: ["batch", "sample"]
        multiple: true
        required: true
  - name: Outputs
    arguments:
      - type: file
        name: --output
        description: Path to the output h5ad file
        example: /path/to/output.h5ad
        required: true
        direction: output

resources:
  - type: python_script
    path: script.py

test_resources:
  - type: python_script
    path: test.py

engines:
  - type: docker
    image: python:3.10
    setup:
      - type: python
        pypi:
          - anndata
    test_setup:
      - type: python
        pypi:
          - viashpy

runners:
  - type: executable
  - type: nextflow
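
Assuming this config is saved as `config.vsh.yaml` (the actual path is not shown in this diff) and the viash CLI is installed, the component can be run and tested from the command line:

```bash
# build and run the component in one go
viash run config.vsh.yaml -- \
  --input input.h5ad \
  --obs_column_index cell_type \
  --obs_column_values batch \
  --output pseudobulk.h5ad

# run the component's test suite (test.py via viashpy)
viash test config.vsh.yaml
```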
@@ -0,0 +1,41 @@
import anndata as ad
import pandas as pd
import numpy as np

## VIASH START
par = {"input": "", "obs_column_index": "", "obs_column_values": [], "output": ""}
## VIASH END

print("Load data", flush=True)
adata = ad.read_h5ad(par["input"])

print(f"Format of input data: {adata}", flush=True)
assert par["obs_column_index"] in adata.obs.columns, f"Column '{par['obs_column']}' not found in obs."
for col in par["obs_column_values"]:
assert col in adata.obs.columns, f"Column '{col}' not found in obs."

print("Compute pseudobulk", flush=True)
X = adata.X
if not isinstance(X, np.ndarray):
X = X.toarray()
combined = pd.DataFrame(
X,
index=adata.obs[par["obs_column_index"]],
)
combined.columns = adata.var_names
pb_X = combined.groupby(level=0).sum()

print("Construct obs for pseudobulk")
pb_obs = adata.obs[par["obs_column_values"]].copy()
pb_obs.index = adata.obs[par["obs_column_index"]]
pb_obs = pb_obs.drop_duplicates()

print("Create AnnData object")
pb_adata = ad.AnnData(
X=pb_X.loc[pb_obs.index].values,
obs=pb_obs,
var=adata.var,
)

print("Store to disk")
pb_adata.write_h5ad(par["output"], compression="gzip")
@@ -0,0 +1,43 @@
import sys
import anndata as ad
import pytest
import numpy as np

def test_compute_pseudobulk(run_component, tmp_path):
    input_path = tmp_path / "input.h5ad"
    output_path = tmp_path / "output.h5ad"

    # create data
    adata_in = ad.AnnData(
        X=np.array([[1, 2], [3, 4], [5, 6], [7, 8]]),
        obs={
            "cell_type": ["A", "A", "B", "B"],
            "time": [1, 2, 1, 2],
            "condition": ["ctrl", "ctrl", "trt", "trt"],
        },
        var={"highly_variable": [True, False]},
    )

    adata_in.write_h5ad(input_path)

    # run component
    run_component([
        "--input", str(input_path),
        "--obs_column_index", "cell_type",
        "--obs_column_values", "condition",
        "--output", str(output_path),
    ])

    # load output
    adata_out = ad.read_h5ad(output_path)

    # check output
    assert adata_out.X.shape == (2, 2)
    assert np.all(adata_out.X == np.array([[4, 6], [12, 14]]))
    assert adata_out.obs.index.tolist() == ["A", "B"]
    assert adata_out.obs["condition"].tolist() == ["ctrl", "trt"]
    assert adata_out.var["highly_variable"].tolist() == [True, False]


if __name__ == "__main__":
    sys.exit(pytest.main([__file__]))
@@ -0,0 +1,68 @@
name: differential_expression
description: Compute differential expression between two observation types

argument_groups:
  - name: Inputs
    arguments:
      - type: file
        name: --input
        description: Path to the input h5ad file
        example: /path/to/input.h5ad
        required: true
  - name: Differential expression arguments
    arguments:
      - type: string
        name: --contrast
        description: |
          Contrast to compute. Must be of length 3:
          1. The name of the column to contrast on
          2. The name of the first observation type
          3. The name of the second observation type
        example: ["cell_type", "B", "A"]
        multiple: true
        required: true
      - type: string
        name: --design_formula
        description: Design formula for the differential expression model
        example: ~ batch + cell_type
  - name: Outputs
    arguments:
      - type: file
        name: --output
        description: Path to the output h5ad file
        example: /path/to/output.h5ad
        required: true
        direction: output

resources:
  - type: r_script
    path: script.R

test_resources:
  - type: r_script
    path: test.R

engines:
  - type: docker
    image: rocker/r2u:22.04
    setup:
      - type: apt
        packages:
          - python3
          - python3-pip
          - python3-dev
          - python-is-python3
      - type: python
        pypi:
          - anndata
      - type: r
        cran:
          - anndata
          - processx
        bioc:
          - DESeq2

runners:
  - type: executable
  - type: nextflow
