From 65efc3e65ac69fdbab0e4f9bb19a65d0f5a7ad0e Mon Sep 17 00:00:00 2001
From: Jerome Kelleher
Date: Tue, 14 May 2024 09:45:55 +0100
Subject: [PATCH 1/2] Docs rejig

---
 docs/_toc.yml                              |  3 +-
 docs/installation.md                       | 22 +++++++
 docs/intro.md                              | 77 ++--------------------
 docs/{vcf2zarr_tutorial.md => vcf2zarr.md} | 12 ++--
 4 files changed, 37 insertions(+), 77 deletions(-)
 create mode 100644 docs/installation.md
 rename docs/{vcf2zarr_tutorial.md => vcf2zarr.md} (92%)

diff --git a/docs/_toc.yml b/docs/_toc.yml
index 008d36f..19591dd 100644
--- a/docs/_toc.yml
+++ b/docs/_toc.yml
@@ -1,5 +1,6 @@
 format: jb-book
 root: intro
 chapters:
-- file: vcf2zarr_tutorial
+- file: installation
+- file: vcf2zarr
 - file: cli
diff --git a/docs/installation.md b/docs/installation.md
new file mode 100644
index 0000000..30a67a7
--- /dev/null
+++ b/docs/installation.md
@@ -0,0 +1,22 @@
+# Installation
+
+`bio2zarr` can be installed from PyPI using pip:
+```
+$ python3 -m pip install bio2zarr
+```
+
+This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
+into your local Python path. You may need to update your $PATH to call the
+executables directly.
+
+Alternatively, calling
+```
+$ python3 -m bio2zarr vcf2zarr
+```
+is equivalent to
+
+```
+$ vcf2zarr
+```
+and will always work.
+
diff --git a/docs/intro.md b/docs/intro.md
index 01a4e58..07f11c2 100644
--- a/docs/intro.md
+++ b/docs/intro.md
@@ -1,76 +1,9 @@
-# bio2zarr Documentation
+# bio2zarr
 
-`bio2zarr` efficiently converts common bioinformatics formats to
-[Zarr](https://zarr.readthedocs.io/en/stable/) format. Initially supporting converting
-VCF to the [sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/).
+`bio2zarr` efficiently converts common bioinformatics formats to
+[Zarr](https://zarr.readthedocs.io/en/stable/) format, initially supporting conversion of
+VCF to the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/).
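The new installation page notes that `$PATH` may need updating before the entry points can be called directly. A minimal sketch, assuming pip placed the scripts in `~/.local/bin` (the actual directory varies by platform and Python setup):

```shell
# Make pip-installed entry points callable directly.
# ~/.local/bin is a common default for `pip install --user` on Linux;
# the real script directory varies by platform (an assumption here).
export PATH="$HOME/.local/bin:$PATH"
# vcf2zarr, plink2zarr and vcf_partition should then resolve without
# the `python3 -m bio2zarr` fallback.
```

Adding the `export` line to your shell profile makes the change persistent.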
-`bio2zarr` is in early alpha development, contributions, feedback and issues are welcome
+`bio2zarr` is in development; contributions, feedback and issues are welcome
 at the [GitHub repository](https://github.com/sgkit-dev/bio2zarr).
 
-## Installation
-`bio2zarr` can be installed from PyPI using pip:
-
-```bash
-$ python3 -m pip install bio2zarr
-```
-
-This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
-into your local Python path. You may need to update your $PATH to call the
-executables directly.
-
-Alternatively, calling
-```
-$ python3 -m bio2zarr vcf2zarr
-```
-is equivalent to
-
-```
-$ vcf2zarr
-```
-and will always work.
-
-## Basic vcf2zarr usage
-For modest VCF files (up to a few GB), a single command can be used to convert a VCF file
-(or set of VCF files) using the {ref}`convert` command:
-
-```bash
-$ vcf2zarr convert ...
-```
-
-For larger files a multi-step process is recommended.
-
-
-First, convert the VCF into the intermediate format:
-
-```bash
-$ vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
-
-Then, (optionally) inspect this representation to get a feel for your dataset
-```bash
-$ vcf2zarr inspect tmp/sample.exploded
-```
-
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```bash
-$ vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
-
-View and edit the schema, deleting any columns you don't want, or tweaking
-dtypes and compression settings to your taste.
-
-Finally, encode to Zarr:
-```bash
-$ vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
-```
-
-Use the ``-p, --worker-processes`` argument to control the number of workers used
-in the ``explode`` and ``encode`` phases.
-
-
-
-
-```{tableofcontents}
-```
diff --git a/docs/vcf2zarr_tutorial.md b/docs/vcf2zarr.md
similarity index 92%
rename from docs/vcf2zarr_tutorial.md
rename to docs/vcf2zarr.md
index fcf734a..5b10754 100644
--- a/docs/vcf2zarr_tutorial.md
+++ b/docs/vcf2zarr.md
@@ -9,7 +9,11 @@ kernelspec:
   language: bash
   name: bash
 ---
-# Vcf2zarr tutorial
+# vcf2zarr
+
+
+
+## Tutorial
 
 This is a step-by-step tutorial showing you how to
 convert your VCF data into Zarr format. There's three different ways to
@@ -17,7 +21,7 @@ convert your data, basically providing different levels of
 convenience and flexibility corresponding to what you might
 need for small, intermediate and large datasets.
 
-## Small
+### Small
 
 
 
@@ -32,6 +36,6 @@ need for small, intermediate and large datasets.
 });
 
-## Intermediate
+### Intermediate
 
-## Large
+### Large

From b297246e9351cfaaf661c566980c7e2ffaec4288 Mon Sep 17 00:00:00 2001
From: Jerome Kelleher
Date: Tue, 14 May 2024 09:56:15 +0100
Subject: [PATCH 2/2] Various docs rejigging

---
 README.md            | 123 ++-----------------------------------------
 bio2zarr/cli.py      |  47 +----------------
 docs/_toc.yml        |   1 +
 docs/cli.md          |   2 +-
 docs/installation.md |  14 +++++
 docs/vcf2zarr.md     |  88 +++++++++++++++++++++++++++++++
 docs/vcfpartition.md |  30 +++++++++++
 7 files changed, 140 insertions(+), 165 deletions(-)
 create mode 100644 docs/vcfpartition.md

diff --git a/README.md b/README.md
index 3ffbc2c..8c09b17 100644
--- a/README.md
+++ b/README.md
@@ -1,124 +1,9 @@
 [![CI](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml)
+[![Coverage Status](https://coveralls.io/repos/github/sgkit-dev/bio2zarr/badge.svg)](https://coveralls.io/github/sgkit-dev/bio2zarr)
+![PyPI](https://img.shields.io/pypi/v/bio2zarr?label=pypi%20bio2zarr)
+![PyPI - Downloads](https://img.shields.io/pypi/dm/bio2zarr)
 
 # bio2zarr
 Convert bioinformatics file formats to Zarr
 
-Initially supports converting VCF to
the -[sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/) - -**This is early alpha-status code: everything is subject to change, -and it has not been thoroughly tested** - -## Install - -``` -$ python3 -m pip install bio2zarr -``` - -This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition`` -into your local Python path. You may need to update your $PATH to call the -executables directly. - -Alternatively, calling -``` -$ python3 -m bio2zarr vcf2zarr -``` -is equivalent to - -``` -$ vcf2zarr -``` -and will always work. - - -## vcf2zarr - - -Convert a VCF to zarr format: - -``` -$ vcf2zarr convert -``` - -Converts the VCF to zarr format. - -**Do not use this for anything but the smallest files** - -The recommended approach is to use a multi-stage conversion - -First, convert the VCF into the intermediate format: - -``` -vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded -``` - -Then, (optionally) inspect this representation to get a feel for your dataset -``` -vcf2zarr inspect tmp/sample.exploded -``` - -Then, (optionally) generate a conversion schema to describe the corresponding -Zarr arrays: - -``` -vcf2zarr mkschema tmp/sample.exploded > sample.schema.json -``` - -View and edit the schema, deleting any columns you don't want, or tweaking -dtypes and compression settings to your taste. - -Finally, encode to Zarr: -``` -vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json -``` - -Use the ``-p, --worker-processes`` argument to control the number of workers used -in the ``explode`` and ``encode`` phases. - -### Shell completion - -To enable shell completion for a particular session in Bash do: - -``` -eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)" -``` - -If you add this to your ``.bashrc`` vcf2zarr shell completion should available -in all new shell sessions. 
- -See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion) -for instructions on how to enable completion in other shells. -a - -## plink2zarr - -Convert a plink ``.bed`` file to zarr format. **This is incomplete** - -## vcf_partition - -Partition a given VCF file into (approximately) a give number of regions: - -``` -vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10 -``` -gives -``` -chr20:1-6799360 -chr20:6799361-14319616 -chr20:14319617-21790720 -chr20:21790721-28770304 -chr20:28770305-31096832 -chr20:31096833-38043648 -chr20:38043649-45580288 -chr20:45580289-52117504 -chr20:52117505-58834944 -chr20:58834945- -``` - -These reqion strings can then be used to split computation of the VCF -into chunks for parallelisation. - -**TODO give a nice example here using xargs** - -**WARNING that this does not take into account that indels may overlap -partitions and you may count variants twice or more if they do** +See the [documentation](https://sgkit-dev.github.io/bio2zarr/) for details. diff --git a/bio2zarr/cli.py b/bio2zarr/cli.py index 034b6a2..67399ab 100644 --- a/bio2zarr/cli.py +++ b/bio2zarr/cli.py @@ -459,51 +459,8 @@ def vcf2zarr_main(): """ Convert VCF file(s) to the vcfzarr format. - The simplest usage is: - - $ vcf2zarr convert [VCF_FILE] [ZARR_PATH] - - This will convert the indexed VCF (or BCF) into the vcfzarr format in a single - step. As this writes the intermediate columnar format to a temporary directory, - we only recommend this approach for small files (< 1GB, say). - - The recommended approach is to run the conversion in two passes, and - to keep the intermediate columnar format ("exploded") around to facilitate - experimentation with chunk sizes and compression settings: - - \b - $ vcf2zarr explode [VCF_FILE_1] ... 
[VCF_FILE_N] [ICF_PATH] - $ vcf2zarr encode [ICF_PATH] [ZARR_PATH] - - The inspect command provides a way to view contents of an exploded ICF - or Zarr: - - $ vcf2zarr inspect [PATH] - - This is useful when tweaking chunk sizes and compression settings to suit - your dataset, using the mkschema command and --schema option to encode: - - \b - $ vcf2zarr mkschema [ICF_PATH] > schema.json - $ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json - - By editing the schema.json file you can drop columns that are not of interest - and edit column specific compression settings. The --max-variant-chunks option - to encode allows you to try out these options on small subsets, hopefully - arriving at settings with the desired balance of compression and query - performance. - - ADVANCED USAGE - - For very large datasets (terabyte scale) it may be necessary to distribute the - explode and encode steps across a cluster: - - \b - $ vcf2zarr dexplode-init [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH] [NUM_PARTITIONS] - $ vcf2zarr dexplode-partition [ICF_PATH] [PARTITION_INDEX] - $ vcf2zarr dexplode-finalise [ICF_PATH] - - See the online documentation at [FIXME] for more details on distributed explode. + See the online documentation at https://sgkit-dev.github.io/bio2zarr/ + for more information. """ diff --git a/docs/_toc.yml b/docs/_toc.yml index 19591dd..b744322 100644 --- a/docs/_toc.yml +++ b/docs/_toc.yml @@ -3,4 +3,5 @@ root: intro chapters: - file: installation - file: vcf2zarr +- file: vcfpartition - file: cli diff --git a/docs/cli.md b/docs/cli.md index f0f8afb..cd096a4 100644 --- a/docs/cli.md +++ b/docs/cli.md @@ -1,4 +1,4 @@ -# Command Line Interface +# CLI Reference % A note on cross references... 
There's some weird long-standing problem with
 % cross referencing program values in Sphinx, which means that we can't use
diff --git a/docs/installation.md b/docs/installation.md
index 30a67a7..52f7155 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -20,3 +20,17 @@ $ vcf2zarr
 ```
 and will always work.
+
+## Shell completion
+
+To enable shell completion for a particular session in Bash, do:
+
+```
+eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"
+```
+
+If you add this to your ``.bashrc``, vcf2zarr shell completion should be available
+in all new shell sessions.
+
+See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion)
+for instructions on how to enable completion in other shells.
diff --git a/docs/vcf2zarr.md b/docs/vcf2zarr.md
index 5b10754..872c4c8 100644
--- a/docs/vcf2zarr.md
+++ b/docs/vcf2zarr.md
@@ -12,6 +12,94 @@
 # vcf2zarr
 
+## Overview
+
+Convert a VCF to Zarr format in a single step:
+
+```
+$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
+```
+
+**Do not use this for anything but the smallest files**
+
+The recommended approach is to use a multi-stage conversion.
+
+First, convert the VCF into the intermediate format:
+
+```
+vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
+```
+
+Then, (optionally) inspect this representation to get a feel for your dataset:
+```
+vcf2zarr inspect tmp/sample.exploded
+```
+
+Then, (optionally) generate a conversion schema to describe the corresponding
+Zarr arrays:
+
+```
+vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
+```
+
+View and edit the schema, deleting any columns you don't want, or tweaking
+dtypes and compression settings to your taste.
+
+Finally, encode to Zarr:
+```
+vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
+```
+
+Use the ``-p, --worker-processes`` argument to control the number of workers used
+in the ``explode`` and ``encode`` phases.
+
+## To be merged with above
+
+The simplest usage is:
+
+```
+$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
+```
+
+This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
+step. As this writes the intermediate columnar format to a temporary directory,
+we only recommend this approach for small files (< 1GB, say).
+
+The recommended approach is to run the conversion in two passes, and
+to keep the intermediate columnar format ("exploded") around to facilitate
+experimentation with chunk sizes and compression settings:
+
+```
+$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
+$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
+```
+
+The inspect command provides a way to view the contents of an exploded ICF
+or Zarr:
+
+```
+$ vcf2zarr inspect [PATH]
+```
+
+This is useful when tweaking chunk sizes and compression settings to suit
+your dataset, using the mkschema command and --schema option to encode:
+
+```
+$ vcf2zarr mkschema [ICF_PATH] > schema.json
+$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
+```
+
+By editing the schema.json file you can drop columns that are not of interest
+and edit column-specific compression settings. The --max-variant-chunks option
+to encode allows you to try out these options on small subsets, hopefully
+arriving at settings with the desired balance of compression and query
+performance.
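The schema-editing step described above can be sketched concretely. This is a hypothetical example: the top-level `"fields"` list, the `"name"` keys, and the `call_PL` field name are assumptions about the schema layout, so inspect your own `schema.json` for the actual structure. A tiny stand-in file is created here to keep the snippet self-contained (in practice the schema comes from `vcf2zarr mkschema`):

```shell
# Create a toy stand-in for a mkschema-generated schema (assumed layout;
# a real one comes from `vcf2zarr mkschema`).
printf '%s' '{"fields": [{"name": "variant_position"}, {"name": "call_PL"}]}' > sample.schema.json

# Drop the (assumed) call_PL field and write the edited schema back.
python3 - <<'PY'
import json

with open("sample.schema.json") as f:
    schema = json.load(f)

schema["fields"] = [f for f in schema["fields"] if f.get("name") != "call_PL"]

with open("sample.schema.json", "w") as f:
    json.dump(schema, f, indent=2)
PY

cat sample.schema.json
```

The edited file can then be passed to encode via `--schema`.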
+
+
 ## Tutorial

diff --git a/docs/vcfpartition.md b/docs/vcfpartition.md
new file mode 100644
index 0000000..ab91b70
--- /dev/null
+++ b/docs/vcfpartition.md
@@ -0,0 +1,28 @@
+# vcfpartition
+
+## Overview
+
+Partition a given VCF file into (approximately) a given number of regions:
+
+```
+vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
+```
+gives
+```
+chr20:1-6799360
+chr20:6799361-14319616
+chr20:14319617-21790720
+chr20:21790721-28770304
+chr20:28770305-31096832
+chr20:31096833-38043648
+chr20:38043649-45580288
+chr20:45580289-52117504
+chr20:52117505-58834944
+chr20:58834945-
+```
+
+These region strings can then be used to split computation of the VCF
+into chunks for parallelisation.
+
+**Warning: this does not account for indels that overlap partition
+boundaries, so variants may be counted twice or more.**
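Following on from the `vcfpartition` page above, the region strings lend themselves to simple parallelisation with `xargs`. In this hedged sketch, `echo` stands in for a real per-region command (for example `bcftools view -r {} ...`), and the region strings and file names are illustrative:

```shell
# Write two example region strings (in practice: vcf_partition ... -n N).
printf '%s\n' 'chr20:1-6799360' 'chr20:6799361-14319616' > regions.txt

# Fan the regions out to parallel workers; -P sets the worker count and
# -I '{}' substitutes each region string into the command line.
xargs -I '{}' -P 2 echo "processing {}" < regions.txt
```

Each worker's per-region output can be written to its own file and merged afterwards; the overlap caveat above still applies.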