From 65efc3e65ac69fdbab0e4f9bb19a65d0f5a7ad0e Mon Sep 17 00:00:00 2001
From: Jerome Kelleher
Date: Tue, 14 May 2024 09:45:55 +0100
Subject: [PATCH 1/2] Docs rejig

---
 docs/_toc.yml                              |  3 +-
 docs/installation.md                       | 22 +++++++
 docs/intro.md                              | 77 ++--------------------
 docs/{vcf2zarr_tutorial.md => vcf2zarr.md} | 12 ++--
 4 files changed, 37 insertions(+), 77 deletions(-)
 create mode 100644 docs/installation.md
 rename docs/{vcf2zarr_tutorial.md => vcf2zarr.md} (92%)

diff --git a/docs/_toc.yml b/docs/_toc.yml
index 008d36f..19591dd 100644
--- a/docs/_toc.yml
+++ b/docs/_toc.yml
@@ -1,5 +1,6 @@
 format: jb-book
 root: intro
 chapters:
-- file: vcf2zarr_tutorial
+- file: installation
+- file: vcf2zarr
 - file: cli
diff --git a/docs/installation.md b/docs/installation.md
new file mode 100644
index 0000000..30a67a7
--- /dev/null
+++ b/docs/installation.md
@@ -0,0 +1,22 @@
+# Installation
+
+`bio2zarr` can be installed from PyPI using pip:
+```
+$ python3 -m pip install bio2zarr
+```
+
+This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
+into your local Python path. You may need to update your $PATH to call the
+executables directly.
+
+Alternatively, calling
+```
+$ python3 -m bio2zarr vcf2zarr
+```
+is equivalent to
+
+```
+$ vcf2zarr
+```
+and will always work.
+
diff --git a/docs/intro.md b/docs/intro.md
index 01a4e58..07f11c2 100644
--- a/docs/intro.md
+++ b/docs/intro.md
@@ -1,76 +1,9 @@
-# bio2zarr Documentation
+# bio2zarr
 
-`bio2zarr` efficiently converts common bioinformatics formats to
-[Zarr](https://zarr.readthedocs.io/en/stable/) format. Initially supporting converting
-VCF to the [sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/).
+`bio2zarr` efficiently converts common bioinformatics formats to
+[Zarr](https://zarr.readthedocs.io/en/stable/) format, initially supporting conversion of
+VCF to the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/).
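The new installation page notes that `$PATH` may need updating before the entry points can be called directly. A minimal sketch, assuming pip placed the scripts in `~/.local/bin` (the actual directory varies by platform and Python setup):

```shell
# Make pip-installed entry points callable directly.
# ~/.local/bin is a common default for `pip install --user` on Linux;
# the real script directory varies by platform (an assumption here).
export PATH="$HOME/.local/bin:$PATH"
# vcf2zarr, plink2zarr and vcf_partition should then resolve without
# the `python3 -m bio2zarr` fallback.
```

Adding the `export` line to your shell profile makes the change persistent.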
-`bio2zarr` is in early alpha development, contributions, feedback and issues are welcome
+`bio2zarr` is in development; contributions, feedback and issues are welcome
 at the [GitHub repository](https://github.com/sgkit-dev/bio2zarr).
 
-## Installation
-`bio2zarr` can be installed from PyPI using pip:
-
-```bash
-$ python3 -m pip install bio2zarr
-```
-
-This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
-into your local Python path. You may need to update your $PATH to call the
-executables directly.
-
-Alternatively, calling
-```
-$ python3 -m bio2zarr vcf2zarr
-```
-is equivalent to
-
-```
-$ vcf2zarr
-```
-and will always work.
-
-## Basic vcf2zarr usage
-For modest VCF files (up to a few GB), a single command can be used to convert a VCF file
-(or set of VCF files) using the {ref}`convert` command:
-
-```bash
-$ vcf2zarr convert ...
-```
-
-For larger files a multi-step process is recommended.
-
-
-First, convert the VCF into the intermediate format:
-
-```bash
-$ vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
-
-Then, (optionally) inspect this representation to get a feel for your dataset
-```bash
-$ vcf2zarr inspect tmp/sample.exploded
-```
-
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```bash
-$ vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
-
-View and edit the schema, deleting any columns you don't want, or tweaking
-dtypes and compression settings to your taste.
-
-Finally, encode to Zarr:
-```bash
-$ vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
-```
-
-Use the ``-p, --worker-processes`` argument to control the number of workers used
-in the ``explode`` and ``encode`` phases.
-
-
-
-
-```{tableofcontents}
-```
diff --git a/docs/vcf2zarr_tutorial.md b/docs/vcf2zarr.md
similarity index 92%
rename from docs/vcf2zarr_tutorial.md
rename to docs/vcf2zarr.md
index fcf734a..5b10754 100644
--- a/docs/vcf2zarr_tutorial.md
+++ b/docs/vcf2zarr.md
@@ -9,7 +9,11 @@ kernelspec:
   language: bash
   name: bash
 ---
-# Vcf2zarr tutorial
+# vcf2zarr
+
+
+
+## Tutorial
 
 This is a step-by-step tutorial showing you how to
 convert your VCF data into Zarr format. There's three different ways to
@@ -17,7 +21,7 @@ convert your data, basically providing different levels of
 convenience and flexibility corresponding to what you might
 need for small, intermediate and large datasets.
 
-## Small
+### Small
 
 
 
@@ -32,6 +36,6 @@ need for small, intermediate and large datasets.
 });
 
-## Intermediate
+### Intermediate
 
-## Large
+### Large

From b297246e9351cfaaf661c566980c7e2ffaec4288 Mon Sep 17 00:00:00 2001
From: Jerome Kelleher
Date: Tue, 14 May 2024 09:56:15 +0100
Subject: [PATCH 2/2] Various docs rejigging

---
 README.md            | 123 ++-----------------------------------------
 bio2zarr/cli.py      |  47 +----------------
 docs/_toc.yml        |   1 +
 docs/cli.md          |   2 +-
 docs/installation.md |  14 +++++
 docs/vcf2zarr.md     |  88 +++++++++++++++++++++++++++++++
 docs/vcfpartition.md |  30 +++++++++++
 7 files changed, 140 insertions(+), 165 deletions(-)
 create mode 100644 docs/vcfpartition.md

diff --git a/README.md b/README.md
index 3ffbc2c..8c09b17 100644
--- a/README.md
+++ b/README.md
@@ -1,124 +1,9 @@
 [![CI](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml)
+[![Coverage Status](https://coveralls.io/repos/github/sgkit-dev/bio2zarr/badge.svg)](https://coveralls.io/github/sgkit-dev/bio2zarr)
+![PyPI](https://img.shields.io/pypi/v/bio2zarr?label=pypi%20bio2zarr)
+![PyPI - Downloads](https://img.shields.io/pypi/dm/bio2zarr)
 
 # bio2zarr
 Convert bioinformatics file formats to Zarr
 
-Initially supports converting VCF to
the -[sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/) - -**This is early alpha-status code: everything is subject to change, -and it has not been thoroughly tested** - -## Install - -``` -$ python3 -m pip install bio2zarr -``` - -This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition`` -into your local Python path. You may need to update your $PATH to call the -executables directly. - -Alternatively, calling -``` -$ python3 -m bio2zarr vcf2zarr -``` -is equivalent to - -``` -$ vcf2zarr -``` -and will always work. - - -## vcf2zarr - - -Convert a VCF to zarr format: - -``` -$ vcf2zarr convert -``` - -Converts the VCF to zarr format. - -**Do not use this for anything but the smallest files** - -The recommended approach is to use a multi-stage conversion - -First, convert the VCF into the intermediate format: - -``` -vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded -``` - -Then, (optionally) inspect this representation to get a feel for your dataset -``` -vcf2zarr inspect tmp/sample.exploded -``` - -Then, (optionally) generate a conversion schema to describe the corresponding -Zarr arrays: - -``` -vcf2zarr mkschema tmp/sample.exploded > sample.schema.json -``` - -View and edit the schema, deleting any columns you don't want, or tweaking -dtypes and compression settings to your taste. - -Finally, encode to Zarr: -``` -vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json -``` - -Use the ``-p, --worker-processes`` argument to control the number of workers used -in the ``explode`` and ``encode`` phases. - -### Shell completion - -To enable shell completion for a particular session in Bash do: - -``` -eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)" -``` - -If you add this to your ``.bashrc`` vcf2zarr shell completion should available -in all new shell sessions. 
- -See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion) -for instructions on how to enable completion in other shells. -a - -## plink2zarr - -Convert a plink ``.bed`` file to zarr format. **This is incomplete** - -## vcf_partition - -Partition a given VCF file into (approximately) a give number of regions: - -``` -vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10 -``` -gives -``` -chr20:1-6799360 -chr20:6799361-14319616 -chr20:14319617-21790720 -chr20:21790721-28770304 -chr20:28770305-31096832 -chr20:31096833-38043648 -chr20:38043649-45580288 -chr20:45580289-52117504 -chr20:52117505-58834944 -chr20:58834945- -``` - -These reqion strings can then be used to split computation of the VCF -into chunks for parallelisation. - -**TODO give a nice example here using xargs** - -**WARNING that this does not take into account that indels may overlap -partitions and you may count variants twice or more if they do** +See the [documentation](https://sgkit-dev.github.io/bio2zarr/) for details. diff --git a/bio2zarr/cli.py b/bio2zarr/cli.py index 034b6a2..67399ab 100644 --- a/bio2zarr/cli.py +++ b/bio2zarr/cli.py @@ -459,51 +459,8 @@ def vcf2zarr_main(): """ Convert VCF file(s) to the vcfzarr format. - The simplest usage is: - - $ vcf2zarr convert [VCF_FILE] [ZARR_PATH] - - This will convert the indexed VCF (or BCF) into the vcfzarr format in a single - step. As this writes the intermediate columnar format to a temporary directory, - we only recommend this approach for small files (< 1GB, say). - - The recommended approach is to run the conversion in two passes, and - to keep the intermediate columnar format ("exploded") around to facilitate - experimentation with chunk sizes and compression settings: - - \b - $ vcf2zarr explode [VCF_FILE_1] ... 
[VCF_FILE_N] [ICF_PATH] - $ vcf2zarr encode [ICF_PATH] [ZARR_PATH] - - The inspect command provides a way to view contents of an exploded ICF - or Zarr: - - $ vcf2zarr inspect [PATH] - - This is useful when tweaking chunk sizes and compression settings to suit - your dataset, using the mkschema command and --schema option to encode: - - \b - $ vcf2zarr mkschema [ICF_PATH] > schema.json - $ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json - - By editing the schema.json file you can drop columns that are not of interest - and edit column specific compression settings. The --max-variant-chunks option - to encode allows you to try out these options on small subsets, hopefully - arriving at settings with the desired balance of compression and query - performance. - - ADVANCED USAGE - - For very large datasets (terabyte scale) it may be necessary to distribute the - explode and encode steps across a cluster: - - \b - $ vcf2zarr dexplode-init [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH] [NUM_PARTITIONS] - $ vcf2zarr dexplode-partition [ICF_PATH] [PARTITION_INDEX] - $ vcf2zarr dexplode-finalise [ICF_PATH] - - See the online documentation at [FIXME] for more details on distributed explode. + See the online documentation at https://sgkit-dev.github.io/bio2zarr/ + for more information. """ diff --git a/docs/_toc.yml b/docs/_toc.yml index 19591dd..b744322 100644 --- a/docs/_toc.yml +++ b/docs/_toc.yml @@ -3,4 +3,5 @@ root: intro chapters: - file: installation - file: vcf2zarr +- file: vcfpartition - file: cli diff --git a/docs/cli.md b/docs/cli.md index f0f8afb..cd096a4 100644 --- a/docs/cli.md +++ b/docs/cli.md @@ -1,4 +1,4 @@ -# Command Line Interface +# CLI Reference % A note on cross references... 
There's some weird long-standing problem with
 % cross referencing program values in Sphinx, which means that we can't use
diff --git a/docs/installation.md b/docs/installation.md
index 30a67a7..52f7155 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -20,3 +20,17 @@ $ vcf2zarr
 ```
 and will always work.
+
+## Shell completion
+
+To enable shell completion for a particular session in Bash, do:
+
+```
+eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"
+```
+
+If you add this to your ``.bashrc``, vcf2zarr shell completion should be available
+in all new shell sessions.
+
+See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion)
+for instructions on how to enable completion in other shells.
diff --git a/docs/vcf2zarr.md b/docs/vcf2zarr.md
index 5b10754..872c4c8 100644
--- a/docs/vcf2zarr.md
+++ b/docs/vcf2zarr.md
@@ -12,6 +12,94 @@
 # vcf2zarr
 
+## Overview
+
+Convert a VCF to Zarr format in a single step:
+
+```
+$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
+```
+
+**Do not use this for anything but the smallest files**
+
+The recommended approach is to use a multi-stage conversion.
+
+First, convert the VCF into the intermediate format:
+
+```
+vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
+```
+
+Then, (optionally) inspect this representation to get a feel for your dataset:
+```
+vcf2zarr inspect tmp/sample.exploded
+```
+
+Then, (optionally) generate a conversion schema to describe the corresponding
+Zarr arrays:
+
+```
+vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
+```
+
+View and edit the schema, deleting any columns you don't want, or tweaking
+dtypes and compression settings to your taste.
+
+Finally, encode to Zarr:
+```
+vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
+```
+
+Use the ``-p, --worker-processes`` argument to control the number of workers used
+in the ``explode`` and ``encode`` phases.
+
+## To be merged with above
+
+The simplest usage is:
+
+```
+$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
+```
+
+This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
+step. As this writes the intermediate columnar format to a temporary directory,
+we only recommend this approach for small files (< 1GB, say).
+
+The recommended approach is to run the conversion in two passes, and
+to keep the intermediate columnar format ("exploded") around to facilitate
+experimentation with chunk sizes and compression settings:
+
+```
+$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
+$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
+```
+
+The inspect command provides a way to view the contents of an exploded ICF
+or Zarr:
+
+```
+$ vcf2zarr inspect [PATH]
+```
+
+This is useful when tweaking chunk sizes and compression settings to suit
+your dataset, using the mkschema command and --schema option to encode:
+
+```
+$ vcf2zarr mkschema [ICF_PATH] > schema.json
+$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
+```
+
+By editing the schema.json file you can drop columns that are not of interest
+and edit column-specific compression settings. The --max-variant-chunks option
+to encode allows you to try out these options on small subsets, hopefully
+arriving at settings with the desired balance of compression and query
+performance.
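The schema-editing step described above can be sketched concretely. This is a hypothetical example: the top-level `"fields"` list, the `"name"` keys, and the `call_PL` field name are assumptions about the schema layout, so inspect your own `schema.json` for the actual structure. A tiny stand-in file is created here to keep the snippet self-contained (in practice the schema comes from `vcf2zarr mkschema`):

```shell
# Create a toy stand-in for a mkschema-generated schema (assumed layout;
# a real one comes from `vcf2zarr mkschema`).
printf '%s' '{"fields": [{"name": "variant_position"}, {"name": "call_PL"}]}' > sample.schema.json

# Drop the (assumed) call_PL field and write the edited schema back.
python3 - <<'PY'
import json

with open("sample.schema.json") as f:
    schema = json.load(f)

schema["fields"] = [f for f in schema["fields"] if f.get("name") != "call_PL"]

with open("sample.schema.json", "w") as f:
    json.dump(schema, f, indent=2)
PY

cat sample.schema.json
```

The edited file can then be passed to encode via `--schema`.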
+
+
 ## Tutorial

diff --git a/docs/vcfpartition.md b/docs/vcfpartition.md
new file mode 100644
index 0000000..ab91b70
--- /dev/null
+++ b/docs/vcfpartition.md
@@ -0,0 +1,28 @@
+# vcfpartition
+
+## Overview
+
+Partition a given VCF file into (approximately) a given number of regions:
+
+```
+vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
+```
+gives
+```
+chr20:1-6799360
+chr20:6799361-14319616
+chr20:14319617-21790720
+chr20:21790721-28770304
+chr20:28770305-31096832
+chr20:31096833-38043648
+chr20:38043649-45580288
+chr20:45580289-52117504
+chr20:52117505-58834944
+chr20:58834945-
+```
+
+These region strings can then be used to split computation of the VCF
+into chunks for parallelisation.
+
+**Warning: this does not account for indels that overlap partition
+boundaries, so variants may be counted twice or more.**
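Following on from the `vcfpartition` page above, the region strings lend themselves to simple parallelisation with `xargs`. In this hedged sketch, `echo` stands in for a real per-region command (for example `bcftools view -r {} ...`), and the region strings and file names are illustrative:

```shell
# Write two example region strings (in practice: vcf_partition ... -n N).
printf '%s\n' 'chr20:1-6799360' 'chr20:6799361-14319616' > regions.txt

# Fan the regions out to parallel workers; -P sets the worker count and
# -I '{}' substitutes each region string into the command line.
xargs -I '{}' -P 2 echo "processing {}" < regions.txt
```

Each worker's per-region output can be written to its own file and merged afterwards; the overlap caveat above still applies.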