Various docs rejigging
jeromekelleher committed May 14, 2024
1 parent 65efc3e commit b297246
Showing 7 changed files with 140 additions and 165 deletions.
123 changes: 4 additions & 119 deletions README.md
@@ -1,124 +1,9 @@
[![CI](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml)
[![Coverage Status](https://coveralls.io/repos/github/sgkit-dev/bio2zarr/badge.svg)](https://coveralls.io/github/sgkit-dev/bio2zarr)
![PyPI](https://img.shields.io/pypi/v/PACKAGE?label=pypi%20bio2zarr)
![PyPI - Downloads](https://img.shields.io/pypi/dm/bio2zarr)

# bio2zarr
Convert bioinformatics file formats to Zarr

Initially supports converting VCF to the
[sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/)

**This is early alpha-status code: everything is subject to change,
and it has not been thoroughly tested**

## Install

```
$ python3 -m pip install bio2zarr
```

This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
into your local Python path. You may need to update your $PATH to call the
executables directly.

Alternatively, calling
```
$ python3 -m bio2zarr vcf2zarr <args>
```
is equivalent to

```
$ vcf2zarr <args>
```
and will always work.


## vcf2zarr


Convert a VCF to zarr format:

```
$ vcf2zarr convert <VCF1> <VCF2> <zarr>
```

Converts the VCF to zarr format.

**Do not use this for anything but the smallest files**

The recommended approach is to use a multi-stage conversion

First, convert the VCF into the intermediate format:

```
vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
```

Then, (optionally) inspect this representation to get a feel for your dataset
```
vcf2zarr inspect tmp/sample.exploded
```

Then, (optionally) generate a conversion schema to describe the corresponding
Zarr arrays:

```
vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
```

View and edit the schema, deleting any columns you don't want, or tweaking
dtypes and compression settings to your taste.

Finally, encode to Zarr:
```
vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
```

Use the ``-p, --worker-processes`` argument to control the number of workers used
in the ``explode`` and ``encode`` phases.

### Shell completion

To enable shell completion for a particular session in Bash do:

```
eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"
```

If you add this to your ``.bashrc``, vcf2zarr shell completion should be available
in all new shell sessions.

See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion)
for instructions on how to enable completion in other shells.

## plink2zarr

Convert a plink ``.bed`` file to zarr format. **This is incomplete**

## vcf_partition

Partition a given VCF file into (approximately) a given number of regions:

```
vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
```
gives
```
chr20:1-6799360
chr20:6799361-14319616
chr20:14319617-21790720
chr20:21790721-28770304
chr20:28770305-31096832
chr20:31096833-38043648
chr20:38043649-45580288
chr20:45580289-52117504
chr20:52117505-58834944
chr20:58834945-
```

These region strings can then be used to split computation of the VCF
into chunks for parallelisation.

**TODO give a nice example here using xargs**

**WARNING that this does not take into account that indels may overlap
partitions and you may count variants twice or more if they do**
See the [documentation](https://sgkit-dev.github.io/bio2zarr/) for details.
47 changes: 2 additions & 45 deletions bio2zarr/cli.py
@@ -459,51 +459,8 @@ def vcf2zarr_main():
"""
Convert VCF file(s) to the vcfzarr format.
The simplest usage is:
$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
step. As this writes the intermediate columnar format to a temporary directory,
we only recommend this approach for small files (< 1GB, say).
The recommended approach is to run the conversion in two passes, and
to keep the intermediate columnar format ("exploded") around to facilitate
experimentation with chunk sizes and compression settings:
\b
$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
The inspect command provides a way to view contents of an exploded ICF
or Zarr:
$ vcf2zarr inspect [PATH]
This is useful when tweaking chunk sizes and compression settings to suit
your dataset, using the mkschema command and --schema option to encode:
\b
$ vcf2zarr mkschema [ICF_PATH] > schema.json
$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
By editing the schema.json file you can drop columns that are not of interest
and edit column specific compression settings. The --max-variant-chunks option
to encode allows you to try out these options on small subsets, hopefully
arriving at settings with the desired balance of compression and query
performance.
ADVANCED USAGE
For very large datasets (terabyte scale) it may be necessary to distribute the
explode and encode steps across a cluster:
\b
$ vcf2zarr dexplode-init [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH] [NUM_PARTITIONS]
$ vcf2zarr dexplode-partition [ICF_PATH] [PARTITION_INDEX]
$ vcf2zarr dexplode-finalise [ICF_PATH]
See the online documentation at [FIXME] for more details on distributed explode.
See the online documentation at https://sgkit-dev.github.io/bio2zarr/
for more information.
"""


1 change: 1 addition & 0 deletions docs/_toc.yml
@@ -3,4 +3,5 @@ root: intro
chapters:
- file: installation
- file: vcf2zarr
- file: vcfpartition
- file: cli
2 changes: 1 addition & 1 deletion docs/cli.md
@@ -1,4 +1,4 @@
# Command Line Interface
# CLI Reference

% A note on cross references... There's some weird long-standing problem with
% cross referencing program values in Sphinx, which means that we can't use
14 changes: 14 additions & 0 deletions docs/installation.md
@@ -20,3 +20,17 @@ $ vcf2zarr <args>
```
and will always work.


## Shell completion

To enable shell completion for a particular session in Bash do:

```
eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"
```

If you add this to your ``.bashrc``, vcf2zarr shell completion should be available
in all new shell sessions.

See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion)
for instructions on how to enable completion in other shells.
88 changes: 88 additions & 0 deletions docs/vcf2zarr.md
@@ -12,6 +12,94 @@ kernelspec:
# vcf2zarr


## Overview


Convert a VCF to zarr format:

```
$ vcf2zarr convert <VCF1> <VCF2> <zarr>
```

Converts the VCF to zarr format.

**Do not use this for anything but the smallest files**

The recommended approach is to use a multi-stage conversion.

First, convert the VCF into the intermediate format:

```
vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
```

Then, (optionally) inspect this representation to get a feel for your dataset:
```
vcf2zarr inspect tmp/sample.exploded
```

Then, (optionally) generate a conversion schema to describe the corresponding
Zarr arrays:

```
vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
```

View and edit the schema, deleting any columns you don't want, or tweaking
dtypes and compression settings to your taste.

Finally, encode to Zarr:
```
vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
```

Use the ``-p, --worker-processes`` argument to control the number of workers used
in the ``explode`` and ``encode`` phases.

## To be merged with above

The simplest usage is:

```
$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
```


This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
step. As this writes the intermediate columnar format to a temporary directory,
we only recommend this approach for small files (< 1GB, say).

The recommended approach is to run the conversion in two passes, and
to keep the intermediate columnar format ("exploded") around to facilitate
experimentation with chunk sizes and compression settings:

```
$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
```

The inspect command provides a way to view contents of an exploded ICF
or Zarr:

```
$ vcf2zarr inspect [PATH]
```

This is useful when tweaking chunk sizes and compression settings to suit
your dataset, using the mkschema command and --schema option to encode:

```
$ vcf2zarr mkschema [ICF_PATH] > schema.json
$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
```

By editing the schema.json file you can drop columns that are not of interest
and edit column specific compression settings. The --max-variant-chunks option
to encode allows you to try out these options on small subsets, hopefully
arriving at settings with the desired balance of compression and query
performance.
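
As a rough sketch of that experimentation loop, the following tries two candidate schemas on a small subset of the data. The schema file names, output paths, and the chunk count of 2 are placeholders, and it assumes ``--max-variant-chunks`` takes an integer. The ``run`` helper is hypothetical: by default it only prints each command, so the loop can be checked before anything is encoded.

```shell
# Placeholder path to an already exploded ICF.
ICF=tmp/sample.exploded

# Dry-run by default; set DRY_RUN=0 to execute the commands.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One trial encode per candidate schema, limited to a few variant chunks.
# schema-a.json and schema-b.json stand for hand-edited copies of the
# mkschema output.
for schema in schema-a.json schema-b.json; do
    out="tmp/trial-${schema%.json}.zarr"
    run vcf2zarr encode "$ICF" "$out" --schema "$schema" --max-variant-chunks 2
done
```

With ``DRY_RUN=0`` the encodes run for real, and the resulting stores can then be compared, for example with ``du -sh tmp/trial-*.zarr``.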



## Tutorial

30 changes: 30 additions & 0 deletions docs/vcfpartition.md
@@ -0,0 +1,30 @@
# vcfpartition

## Overview

Partition a given VCF file into (approximately) a given number of regions:

```
vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
```
gives
```
chr20:1-6799360
chr20:6799361-14319616
chr20:14319617-21790720
chr20:21790721-28770304
chr20:28770305-31096832
chr20:31096833-38043648
chr20:38043649-45580288
chr20:45580289-52117504
chr20:52117505-58834944
chr20:58834945-
```

These region strings can then be used to split computation of the VCF
into chunks for parallelisation.

**TODO give a nice example here using xargs**

**WARNING: this does not take into account that indels may overlap
partitions, so variants may be counted twice or more if they do**
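
Pending the xargs example the TODO asks for, here is a minimal sketch of the pattern. It assumes the region strings arrive one per line, as in the output above. The scratch directory, the file names, and the per-region job are all placeholders: the job shown only prints the region it was given, and would be replaced by whatever command should run on each region.

```shell
# Hypothetical scratch directory so the sketch leaves no files in the cwd.
workdir=$(mktemp -d)

# Region strings, one per line, copied from the example output above.
# In practice: vcf_partition "$VCF" -n 10 > "$workdir/regions.txt"
cat > "$workdir/regions.txt" <<'EOF'
chr20:1-6799360
chr20:6799361-14319616
chr20:14319617-21790720
EOF

# Run one job per region (-n 1), at most 4 at a time (-P 4).
# The job is a placeholder that only prints its region; swap the echo
# for the real per-region command.
xargs -P 4 -n 1 sh -c 'echo "processing $1"' _ \
    < "$workdir/regions.txt" > "$workdir/jobs.log"

# Jobs finish in arbitrary order, so sort before reading the log.
sort "$workdir/jobs.log"
```

Each region is passed to the job as ``$1``, which keeps the region string out of the shell command text itself. The double-counting caveat in the warning above still applies to whatever is computed per region.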
