Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Class method to create simple VariantData files for demos #924

Open
hyanwong opened this issue May 23, 2024 · 2 comments
Open

Class method to create simple VariantData files for demos #924

hyanwong opened this issue May 23, 2024 · 2 comments
Milestone

Comments

@hyanwong
Copy link
Member

hyanwong commented May 23, 2024

It's pretty helpful for teaching purposes to be able to demonstrate small tsinfer input files without needing a VCF. I wonder if we could define a class method like this:

class VariantData:
    ...
    def from_arrays(
        cls,
        path,
        variant_positions,
        variant_matrix_phased,
        alleles,
        ancestral_allele,
        sample_id=None,
    ):
        """
        Create an VariantData instance directly from array data. Only
        useful for small datasets. Larger datasets should use e.g. bio2zarr
        to create a zarr datastore containing the required data and call
        VariantData(path_to_zarr)

        :param path str: The path used to store the data
        :param variant_positions array: a 1D array of variant positions
        :param variant_matrix_phased array: a 3D array of variants X samples x ploidy,
            giving an index into the allele array for each corresponding variant. Values must
            be coercable into 8-bit (np.int8) integers. Data for all samples is assumed to be phased
        :param alleles array: a 2D string array of variants x max_num_alleles at a site.
            Each allele list for a variant must be the same length, equal to the
            maximum value in the `variant_matrix_phased` array. Each allele list can be
            padded with `""` to ensure the correct length
        :param ancestral_allele array: a 1D string array specifying the ancestral allele
            for each variant. For unknown ancestral alleles, any character which is not
            in the allele list can be used.
        :param sample_id: a 1D string array of sample names. If None, samples
            (each corresponding to `ploidy` variant values) will be allocated the IDs
            "0", "1", "2", .. etc.
        """
        call_genotype=np.array(variant_matrix_phased, dtype=np.int8)
        if sample_id is None:
            sample_id = np.arange(call_genotype.shape[1]).astype(str)
        # assume all phased
        call_genotype_phased = np.ones(call_genotype.shape[:2], dtype=bool)
        zarr.save_group(
            path,
            variant_position=np.array(pos),
            call_genotype=call_genotype,
            call_genotype_phased=call_genotype_phased,
            variant_allele=np.array(alleles),
            variant_ancestral_allele=np.array(ancestral_allele),
            sample_id=sample_id,
        )
        return cls(path)

Then we could use it like this:

# For demo only: larger examples could pass `data` as a large numpy array directly
pos, data, alleles, ancestral = list(zip(*[
    (3,  [[0, 1], [0, 0], [0, 0]], ["A", "T", ""], "A"),
    (10, [[0, 1], [1, 1], [0, 0]], ["C", "A", ""], "C"),
    (13, [[0, 1], [1, 0], [0, 0]], ["G", "C", ""], "-"),
    (19, [[0, 0], [0, 1], [1, 0]], ["A", "C", ""], "A"),
    (20, [[0, 1], [2, 0], [0, 0]], ["T", "G", "C"], "T"),
]))
sd = tsinfer.VariantData.from_arrays("test.vcz", pos, data, alleles, ancestral)
ts = tsinfer.infer(sd)
@jeromekelleher
Copy link
Member

We're going to need something like this for testing anyway - even better if it was just an in-memory store.

@hyanwong hyanwong changed the title Class method to create simple SgkitSampleData files for demos Class method to create simple VariantData files for demos Aug 28, 2024
@hyanwong
Copy link
Member Author

hyanwong commented Aug 30, 2024

This could also be an alternative to sgkit-dev/bio2zarr#232, as a method for getting a tree sequence into VariantData format. It would be useful to have an in-memory way to feed the data from a tree sequence into tsinfer, which I think is part of #783

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants