Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft bed2zarr code #281

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

Draft bed2zarr code #281

wants to merge 17 commits into from

Conversation

percyfal
Copy link

Draft code to convert Bed3-format to Zarr. Addresses sgkit-dev/sgkit#1219 where we briefly discuss the need for a tool to convert bed to Zarr. As suggested, I started a draft (at least that was my interpretation, or would you have preferred an issue to start with @jeromekelleher?).

Before going too far in development, I submit this draft to discuss some of the issues I have at the moment:

  1. currently the tool throws an error if contig lengths are undefined; this information is not available in BED files, but could be supplied by passing a fasta index as an argument. That should probably go in a different tool (e.g. fai2zarr) for filling in missing contig lengths? There are therefore only two small tests at the moment but I plan to add more once we agree on the implementation.
  2. I am still a bit confused by Zarr terminology. The tool CLI is currently bed2zarr [OPTIONS] BED_PATH ZARR_PATH BED_ARRAY where BED_ARRAY is the name of the Zarr dataset.
  3. By default BED_ARRAY is bed_mask where the BED file is store as a 0/1-array, along with bed_mask_contig which contains the contig for each site, modeled after variant_contig. Both these arrays equal the total genome length.
  4. Should there be a specification for these array types, either in a schema or other format?

I guess one could add support for other BED formats later on. Here, I focus on the more specific task of generating 0/1-based sequence masks to indicate missing data / genome accessibilty.

@jeromekelleher
Copy link
Contributor

Thanks for this @percyfal, and welcome to sgkit-dev 👋!

Some high-level thoughts:

  1. Can we use pandas or numpy to parse the BED file? I don't want to get into the business of text parsing, it's tedious and hard to do well/fast.
  2. It looks like you are trying to integrate the BED data into an existing VCF Zarr store? I would think that first making a 1-1 mapping of the BED data into it's equivalent Zarr would be a simpler approach, and then make the problem of integrating the different types of data together the problem of a different tool (keeping bio2zarr focused on this one-to-one literal translation of bio formats into Zarr).
  3. As a general guiding principle, we want to do what htslib does. So, we interpret a given BED file in the same way that it does (rather that what we think the Right Way should be).

Does that clarify at all?

@jeromekelleher
Copy link
Contributor

As a follow-up, we probably would want to write a spec somewhere that lays out exactly the array names etc, but I don't think there's any need initially. Let's get something working first and spec it out later.

@percyfal
Copy link
Author

Thanks for the feedback. Let me address your points one by one.

  1. sure - for simplicity I'll use pandas for testing from now on to then see if numpy is necessary later on

  2. Agreed, this would be simpler: generate a separate Zarr store for bed and then have a separate tool to combine with, say, VcfZarr store? Then the spec you mentioned could be tied to a store defined by a BedZarr (say) class, modeled on VcfZarr? This approach would require an additional required option, a fasta index or file with genome lengths (genome_info), since we want to make sure the arrays map to every nucleotide position in the reference. I wonder if chromosomes should be stored as different datasets? In any case, this would make it easy to add any number of Bed datasets, from masks to genomic features (e.g., exons, CDS).

  3. this I would need to read up on

Once this tool works, my long-term wish is to have functionality in sgkit that corrects for summary statistics for accessible sites (be it in windows or not) but that also summarizes statistics across features from annotations. I would be happy to draft that too if that is a good fit, but I think in this case opening a discussion on sgkit prior to development would be beneficial?

@percyfal
Copy link
Author

To clarify: with chromosomes as different datasets I mean having a group for a bed feature (say mask) where a dataset is created within the group, instead of as in the current draft where I make one long array equal to the total genome length. Some organisms I work on have very fragmented genomes (+100k scaffolds/contigs).

@jeromekelleher
Copy link
Contributor

jeromekelleher commented Sep 18, 2024

Agreed, this would be simpler: generate a separate Zarr store for bed and then have a separate tool to combine with, say, VcfZarr store? Then the spec you mentioned could be tied to a store defined by a BedZarr (say) class, modeled on VcfZarr? This approach would require an additional required option, a fasta index or file with genome lengths (genome_info), since we want to make sure the arrays map to every nucleotide position in the reference. I wonder if chromosomes should be stored as different datasets? In any case, this would make it easy to add any number of Bed datasets, from masks to genomic features (e.g., exons, CDS)

I'm hoping that BedZarr can be as lightweight as possible and purely just take the input text file and convert it to a Zarr with the corresponding arrays. I think we can then place the burden checking the validity of these intervals on annotating a VCF Zarr with the things we're interested in.

So, suppose we have a BED that specifies an accessibility mask, accessibility.bed. The workflow might look like:

# convert on the command line
$ bed2zarr accessibility.bed accessibility.zarr

then in sgkit we have something like

access_zarr = zarr.open("accessibility.zarr")
# NB: These are names from the top of my head, not an actual proposal!
ds = sgkit.add_accessibility_mask(ds, start=access_zarr["start"], end=access_zarr["end"])

That is, sgkit just takes the start and end coordinates of the intervals as arrays, and is decoupled from the BedZarr format. Sgkit then adds a new boolean variant_accessibility_mask and marks each variant as true or false, depending on whether it falls in one of those intervals. (Let's brush the details of zero-based vs one-based and close-vs-open intervals aside for now!). That way, there's no need to worry too much about whether the intervals cover the entire genome or not,they just have to cover the variants that are in the dataset.

Does that help clarify?

@percyfal
Copy link
Author

Ok, I see what you mean; basically convert the BED file columns to separate arrays:

chr1 5 10
chr1 15 20

would translate to

chrom: ['chr1', 'chr1']
start: [5, 15]
end: [10, 20]

in Zarr.

However, I don't think it is sufficient to just mark variants as accessible or not as we also need to keep track of the accessibility of non-variant sites. If you want a genome-wide summary of pi, you need to know the accessibility up unto the end of the chromosome.

VARIANT         *      *       *      * *   
POS            0123456789012345678901234567890
MASK           0000011111000001111100000000000

If *=variant site, 0=accessible, 1=masked, and chopping up in 10bp-windows, the first window has 1 variant site but the actual window size is 5bp, not 10.

I'll focus on the BED conversion for now, provided we agree on the output format, and deal with the other stuff later.

@jeromekelleher
Copy link
Contributor

However, I don't think it is sufficient to just mark variants as accessible or not as we also need to keep track of the accessibility of non-variant sites.

I see - well that's a different problem. Let's just convert the file to Zarr first and worry about how to use it later!

- use pandas for reading
- write to isolated zarr archive
- map BED columns to arrays named after BED specification (hts-specs)
@percyfal
Copy link
Author

Ok, I updated the code to translate the three mandatory fields chrom, chromStart, and chromEnd (following naming conventions in https://samtools.github.io/hts-specs/BEDv1.pdf) to Zarr arrays. I created a class BedZarrWriter, modeled on VcfZarrWriter, and added placeholder classes and functions for field definitions and metadata (BedMetadata, BedField, mandatory_bed_field_definitions) to mimic the setup in bio2zarr.vcf2zarr.vcz. Do you want it simpler still?

@jeromekelleher
Copy link
Contributor

This seems good for a starting point. I guess we should align with VCF Zarr in terms of names, so contig, start and end seems sensible? There's an annoying difference here where we use integer contig IDs in VCZ whereas we're using the string names here. We don't want to use the same name with different types, I guess. How should we handle this do you think?

Can you sketch out (maybe as comments in the file) how you envisage handling BED files with more columns?

@percyfal
Copy link
Author

percyfal commented Sep 19, 2024

This seems good for a starting point. I guess we should align with VCF Zarr in terms of names, so contig, start and end seems sensible? There's an annoying difference here where we use integer contig IDs in VCZ whereas we're using the string names here. We don't want to use the same name with different types, I guess. How should we handle this do you think?

In that case I guess we could adopt the VCZ approach and generate a ID mapping to the chrom field (chrom <-> contig_id), using the IDs in a contig field with length equal to the number of entries and which replaces chrom for this purpose . It disrupts the 1-to-1 mapping for the chrom field though.

Can you sketch out (maybe as comments in the file) how you envisage handling BED files with more columns?

Sure. The remaining columns are well-defined (table 2 in spec) and would make drafting a spec pretty straight-forward:

4 name String [\x20-\x7e]{1,255} Feature description
5 score Int [0, 1000] A numerical value
6 strand String [-+.] Feature strand
7 thickStart Int [0, 2^64 − 1] Thick start position
8 thickEnd Int [0, 2^64 − 1] Thick end position
9 itemRgb Int,Int,Int ([0, 255], [0, 255], [0, 255]) | 0 Display color
10 blockCount Int [0, chromEnd − chromStart] Number of blocks
11 blockSizes List[Int] ([[:digit:]]+,){blockCount−1}[[:digit:]]+,? Block sizes
12 blockStarts List[Int] ([[:digit:]]+,){blockCount−1}[[:digit:]]+,? Block start positions

Would you have the spec end up in a schema or be stored separately, as you do with the VCFZarr spec? This begs the question: I haven't figured out where you set the schema, such as self.schema.samples_chunk_size in VcfZarrWriter.encode_samples?

- guess BED file type
- add draft schema
- add tests
@percyfal
Copy link
Author

percyfal commented Sep 20, 2024

I went ahead and implemented most needed functionality based on code in icf.py and vcz.py.

I have added:

Remaining issues:

  • specs for multi-dimensional BED fields (columns 9-12) and how to specify them
  • schema generation is based on the number of records and BED type and I may have over-complicated matters by following the setup in vcf.py. Briefly, schema generation should(?) depend on the entire BED data frame being read, in which case some of the steps in init seem unecessary. For instance, I believe you in the intermediate icf step check for ranges of values in order to set correct dtype.
  • add arrays defining contig ID to name mappings and reset values on the contig column; this requires a schema update too
  • could also the name column benefit from an ID to name mapping? Often times feature names are redundant (CDS, exon, etc)

@jeromekelleher
Copy link
Contributor

Thanks @percyfal!

We have an sgkit developers call every second Monday at 1600 UK time, and the next one is on Monday (30th). If you're available, it would be great to discuss this there? I can send you the details if you're interested.

- simplification of main bed2zarr function
- encode features, contigs as Categories
- add specs for BED9-BED12
@coveralls
Copy link
Collaborator

coveralls commented Oct 15, 2024

Coverage Status

coverage: 98.248% (-0.7%) from 98.91%
when pulling 1bf2b00 on percyfal:bed2zarr
into 6aaee4f on sgkit-dev:main.

@percyfal percyfal marked this pull request as ready for review October 15, 2024 08:16
@percyfal
Copy link
Author

@jeromekelleher I have refactored and simplified the code somewhat and added specs for most cases. I have left out support for the BED9 and BED12 formats for now since the encoding of columns 9-12 require some thought. To reiterate, the BED specifications for the last 4 columns are

9 itemRgb Int,Int,Int ([0, 255], [0, 255], [0, 255]) | 0 Display color
10 blockCount Int [0, chromEnd − chromStart] Number of blocks
11 blockSizes List[Int] ([[:digit:]]+,){blockCount−1}[[:digit:]]+,? Block sizes
12 blockStarts List[Int] ([[:digit:]]+,){blockCount−1}[[:digit:]]+,? Block start positions

The tricky(?) thing about columns 11 and 12 is that every record (line) consists of arrays of potentially different lengths - blockCount is not consistent across records. I guess one way could be to drop the conversion altogether and encode them as strings. Column 9 is either a single digit or 3 RGB value; I'm not sure at this point if mixing the two is allowed. The parse function currently throws an error and notifies the user that BED9 and BED12 are unsupported.

I haven't added any documentation yet but could do so if required before merging the PR. I added test data files and I have modified test_core::test_du accordingly, verifying that the tests pass locally, but for whatever reason they still fail in CI 🤷

@jeromekelleher
Copy link
Contributor

This looks great @percyfal! I agree we shouldn't bother with the 9 and 12 col formats, they seem deeply obscure to me.

I'm not sure what's happened with du here, but I think this test is more trouble that it's worth and we should think about changing it to working on some known pre-written files or something.

I'll take a close look over the coming days, but I think we can merge this pretty much as-is.

Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @percyfal - but can we simplify the code down a bit to just to the minimum pandas -> Zarr with name-remapping? Scale isn't really an issue here, right?



@dataclasses.dataclass
class BedZarrWriterMetadata(core.JsonDataclass):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need this level of complexity for BED files. I had to work quite hard with VCF because things get arbitrarily big and it becomes necessary to allow for incomplete bits of fail and be restrarted. But, I feel like BED files are much smaller and could just be converted from pandas to zarr in a single shot, without worrying too much about optimal field sizes etc in the first instance.

What do you think @percyfal - is the layer of paranioa helpful here, or could be just get rid of it and be comfortable that a BED->conversion that fails could just be rerun entirely?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I started out by modelling some of the code after the VCF implementation, but I agree it could be simpler - the code is a tad too convoluted atm. I'll have a look.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeromekelleher I have removed large chunks of the code base. The current implementation relies on xarray.DataSet.to_zarr to do the actual conversion. I did keep the field type definitions to set the dtypes, but that could be removed too I guess.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry @percyfal I should have been clear about this. Xarray is not a dependency we want in bio2zarr - it pulls in way too much stuff and depends on Dask. The motivation for writing this package in the first place was to move away from Xarray/Dask based methods for generating Zarrs! It's just not worth the extra complexity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was imagining something simple like

array = zarr.empty(...)
array[:] = df[field]

So, you just set the zarr array in one go, letting Zarr deal with all the low-level details. It'll work fine at this scale.

Copy link
Author

@percyfal percyfal Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it. Just pushed working code without xarray, but I'm guessing you want to go simpler still? I'm still picking up on Zarr as it were.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, I think it's the right starting point. Let's get it cleaned up and merged 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants