Store the dimension order when using Zarr #9924

Open
golmschenk opened this issue Jan 4, 2025 · 21 comments

@golmschenk

golmschenk commented Jan 4, 2025

Is your feature request related to a problem?

Currently, if a dataset is saved to Zarr and then reloaded, the dimensions are reloaded alphabetically.

Example:

import xarray as xr
import numpy as np
from pathlib import Path

original_dataset = xr.Dataset(
    {
        'temperature': (('time', 'lat', 'lon'), np.random.rand(5, 4, 3)),
        'precipitation': (('time', 'lat', 'lon'), np.random.rand(5, 4, 3))
    },
    coords={
        'time': np.arange(5),
        'lat': np.linspace(-90, 90, 4),
        'lon': np.linspace(0, 360, 3)
    }
)

print('Original Dimensions:', list(original_dataset.dims))
zarr_path = Path('example_dataset.zarr')
original_dataset.to_zarr(zarr_path, mode='w')
reloaded_dataset = xr.open_zarr(zarr_path)
print('Reloaded Dimensions:', list(reloaded_dataset.dims))

outputs:

Original Dimensions: ['time', 'lat', 'lon']
Reloaded Dimensions: ['lat', 'lon', 'time']

Describe the solution you'd like

Store the dimension order when using Zarr, such that when the dataset is reloaded from the Zarr file, the original dimension order is maintained.

Describe alternatives you've considered

N/A

Additional context

N/A


welcome bot commented Jan 4, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@keewis
Collaborator

keewis commented Jan 4, 2025

For xarray (and Zarr, I believe), the order in which a dataset lists its dimensions affects very little: as far as I can tell, only the string and HTML reprs and the default values for to_dataframe (see also #9921 for a recent discussion).

Can you give us a bit more context on why you'd need to keep the dataset dimension order?

@golmschenk
Author

golmschenk commented Jan 4, 2025

Can you give us a bit more context on why you'd need to keep the dataset dimension order?

For consistency when converting to and from Pandas (and from there to and from CSV), and because a lack of consistency will affect the adoption of xarray in my team. Converting to and from Pandas is where I originally noticed the inconsistency in the order.

Many of my team members are scientists who are not going to want to deal with xarray/Zarr directly. They have legacy code that works with an ad hoc text format (sorta CSV-like, but not quite) they've dealt with previously. These members would be perfectly happy working with CSV data, though. However, it's often useful for some members of the team (the ones who would be happy to work with xarray/Zarr directly) to have larger-than-memory and distributed processing. I'm looking to switch to xarray/Zarr to make this part easier for those involved in that component of the work, while still easily importing and exporting subsets of the data to the CSVs other members of the team will use. The loss of column order between the CSVs and xarray makes switching to xarray more difficult to justify. One solution would be to add my own saving and loading of the order. But then any team members who work with the xarray/Zarr data directly will also need to perform the same steps. This is doable, but it is an extra hurdle when justifying the switch to xarray. If the ordering were automatically consistent through xarray, that would remove this issue.

@keewis
Collaborator

keewis commented Jan 4, 2025

If you're using to_dataframe (to_pandas also calls that but doesn't let you pass arguments) you can try using Dataset.to_dataframe's dim_order parameter to choose a fixed/constant dimension order (see also #9718 for a recent discussion on that topic).

@golmschenk
Author

golmschenk commented Jan 4, 2025

Thanks, but unfortunately, the columns/dimensions in our data are not fixed. For a given experiment, they are consistent across all the data from that experiment. But different experiments will have different columns/dimensions, so I'm not able to simply define a consistent order somewhere. It needs to be inferred from the data files. That is, one experiment might produce a collection of CSV(-like) files that all have the same columns. This is where xarray would come in, to store the larger results in Zarr, process the data, and at some point spit back out some CSVs with the same columns. A different experiment would then have different columns, but would still need similar processing.

@jhamman
Member

jhamman commented Jan 4, 2025

The ordering of list(Dataset.dims) is a red herring here. Or at a minimum, it isn't the core part of the problem you are after. The dims property of the Dataset is a mapping including all the names and sizes of dimensions for all variables in your dataset. Unlike the DataArray.dims property, it may not be ordered.

With that in mind, the actual problem here appears to be that to_dataframe seems to be relying on the ordering of the Dataset.dims property (or something similar -- I haven't looked at the code in a while).

We could change Dataset.dims to always be sorted somehow. Or we could update how to_dataframe chooses its dimension order.
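To make the distinction between the two dims properties concrete, here is a minimal sketch. The dataset is made up for illustration, and Dataset.sizes is used for the name-to-length mapping.

```python
import numpy as np
import xarray as xr

# Illustrative dataset, not taken from the thread.
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.zeros((5, 4, 3)))},
    coords={"lat": np.linspace(-90, 90, 4)},
)

# DataArray.dims is a tuple that matches the axis order of the underlying array.
array_dims = ds["temperature"].dims

# Dataset.sizes maps every dimension name in the dataset to its length.
# Treat it as a mapping: its iteration order is not something to rely on.
dataset_sizes = dict(ds.sizes)

print(array_dims)
print(dataset_sizes)
```

The tuple on the DataArray is tied to the array layout, while the dataset-level mapping only collects names and sizes across all variables.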

@keewis
Collaborator

keewis commented Jan 4, 2025

list(Dataset.dims) is the default value for the dim_order parameter. Sorting Dataset.dims won't work since we don't know (in general) how to compare dimension names, which can be any hashable.

However, user code that knows all dims are strings can sort them itself (use sorted's key parameter if you don't want to sort alphabetically):

ds.to_dataframe(dim_order=sorted(ds.dims))

which will give you a consistent (sorted) order, even if you don't know the actual dimensions.
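As a sketch of the key parameter mentioned above: the `preferred` list and the `rank` lookup are hypothetical names introduced only for this example, and the dataset is made up.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.zeros((5, 4, 3)))}
)

# Alphabetical order, independent of how the dataset was built or loaded.
alphabetical = ds.to_dataframe(dim_order=sorted(ds.dims))

# Hypothetical project-specific preference; names not listed here sort last,
# alphabetically among themselves.
preferred = ["time", "lat", "lon"]
rank = {name: position for position, name in enumerate(preferred)}
custom_order = sorted(ds.dims, key=lambda dim: (rank.get(dim, len(rank)), str(dim)))
custom = ds.to_dataframe(dim_order=custom_order)

print(list(alphabetical.index.names))
print(list(custom.index.names))
```

The dimension order shows up as the order of the levels in the resulting MultiIndex.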

@golmschenk
Author

golmschenk commented Jan 4, 2025

Narrowing down to my more concrete situation (i.e., to and from Pandas), I just want to clarify that the issue only occurs after saving to and reloading from Zarr. Converting between Pandas and xarray alone does not trigger the issue I'm running into. For example:

import pandas as pd
import xarray as xr
from pathlib import Path

original_pandas_data_frame = pd.DataFrame({'b': [0, 1], 'a': [2, 3]})

original_xarray_dataset = original_pandas_data_frame.to_xarray()
pandas_data_frame_from_plain_xarray = original_xarray_dataset.to_pandas()

zarr_path = Path('example_dataset.zarr')
original_xarray_dataset.to_zarr(zarr_path, mode='w')
zarr_loaded_xarray_dataset = xr.open_zarr(zarr_path)
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.to_pandas()

print(original_pandas_data_frame.columns)
print(pandas_data_frame_from_plain_xarray.columns)
print(pandas_data_frame_from_zarr_loaded_xarray.columns)

outputs:

Index(['b', 'a'], dtype='object')
Index(['b', 'a'], dtype='object')
Index(['a', 'b'], dtype='object')

(This might have already been clear, but I just wanted to make sure)
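Note that in the example above the DataFrame columns become data variables, and the frame's index becomes the only dimension, so the column order is the order of the data variables. A minimal sketch of recording and restoring that order follows. The attribute name `column_order` is hypothetical, and the alphabetical reselection is only a stand-in for a Zarr round trip.

```python
import pandas as pd
import xarray as xr

frame = pd.DataFrame({"b": [0, 1], "a": [2, 3]})

# Columns become data variables; record their order in the dataset attrs.
# `column_order` is a made-up attribute name, not an xarray convention.
ds = frame.to_xarray()
ds = ds.assign_attrs(column_order=list(ds.data_vars))

# Stand-in for a store that hands the variables back alphabetically.
reloaded = ds[sorted(ds.data_vars)]

# Selecting with a list of names returns the variables in that order.
restored = reloaded[list(reloaded.attrs["column_order"])].to_dataframe()

print(list(reloaded.to_dataframe().columns))
print(list(restored.columns))
```

A list of strings in attrs is plain JSON, so it should survive being written to a Zarr store.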

@keewis
Collaborator

keewis commented Jan 4, 2025

I got that, thanks.

The dimension order of a dataset depends on the order in which it is seen on the variables from which the dataset is constructed. The zarr store returns variables in an alphabetical order (most likely because that's how it got them from the filesystem), which means xarray will see the dimensions in this order. For your first example that would be lat (from the lat coordinate), then lon (from the lon coordinate), then time (from precipitation).

Either way, if you rely on the order somehow it is good practice to explicitly define that order somewhere.
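The explanation above, that a dataset's dimension order follows the order in which dimensions are first seen on its variables, can be sketched without touching Zarr. The second dataset passes the same variables in alphabetical order as a stand-in for what the store returns. This reflects current behaviour as described in this thread, not a documented guarantee.

```python
import numpy as np
import xarray as xr

dims = ("time", "lat", "lon")
data = np.zeros((5, 4, 3))
lat = ("lat", np.linspace(-90, 90, 4))
lon = ("lon", np.linspace(0, 360, 3))

# Data variable first, so "time" is the first dimension encountered.
as_written = xr.Dataset({"temperature": (dims, data), "lat": lat, "lon": lon})

# Same variables in alphabetical order, so "lat" and "lon" are seen first.
as_reloaded = xr.Dataset({"lat": lat, "lon": lon, "temperature": (dims, data)})

print(list(as_written.dims))
print(list(as_reloaded.dims))
```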

@golmschenk
Author

golmschenk commented Jan 5, 2025

For my case then, I suppose I will try to store the dataset dimensions in the Zarr group attrs, similar to how xarray currently stores the array dimensions in the Zarr array attrs. The downside is that we'll need to consistently use wrapped versions of xarray's Zarr saving and loading. I guess this might also be a solution for this feature request, but I don't know what obstacles might prevent it, or whether it would fall outside the scope expected of xarray. Thank you very much!

@keewis
Collaborator

keewis commented Jan 5, 2025

What I was thinking of was something like this:

def to_ordered_dataframe(ds):
    # use `sorted(ds.dims, key=lambda k: ...)` for a sorting other than alphabetical
    return ds.to_dataframe(dim_order=sorted(ds.dims))

...
pandas_data_frame_from_plain_xarray = original_xarray_dataset.pipe(to_ordered_dataframe)
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.pipe(to_ordered_dataframe)
...

If you want to additionally store the expected order for exact roundtripping, I would define a wrapper around .to_dataframe:

def persist_dim_order(ds):
    return ds.assign_attrs(dim_order=list(ds.dims))

def to_ordered_dataframe(ds):
    dim_order = ds.attrs["dim_order"]
    return ds.to_dataframe(dim_order=dim_order)

...
original_xarray_dataset = original_pandas_data_frame.to_xarray().pipe(persist_dim_order)
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.pipe(to_ordered_dataframe)
...

you could even put that into an accessor:

@xr.register_dataset_accessor("ordered_df")
class OrderedDFAccessor:
    def __init__(self, ds):
        self._ds = ds

    def persist_dim_order(self):
        return self._ds.assign_attrs(dim_order=list(self._ds.dims))

    def to_dataframe(self):
        dim_order = self._ds.attrs["dim_order"]
        return self._ds.to_dataframe(dim_order=dim_order)

...
original_xarray_dataset = original_pandas_data_frame.to_xarray().ordered_df.persist_dim_order()
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.ordered_df.to_dataframe()
...

@keewis
Collaborator

keewis commented Jan 6, 2025

Does this answer your question? If so, can we close this?

@golmschenk
Author

@keewis, I suppose I would still prefer that xarray had this built in, in which case this issue might remain open as a feature request. However, I understand if this is deemed to be outside of xarray's goals and the issue is closed.

@keewis
Collaborator

keewis commented Jan 6, 2025

The only way I could imagine resolving this is to change the default of dim_order. However, since we can't determine a particular order in general (we can't sort), the only default available is the (somewhat unstable) Dataset.dims, short of requiring the user to pass dim_order explicitly. So yes, unless anyone has a better idea, I'd probably close the issue.

@dcherian, do you have any opinions on this?

@dcherian
Contributor

dcherian commented Jan 6, 2025

My personal opinion is that

  1. Dataset.dims is unordered and reflects the data model.
  2. I think the repr should present sorted(Dataset.dims) to make it easy to scan, and
  3. For to_dataframe we could use order of appearance of dimension names as we iterate through the list of data_vars followed by coords since it is somewhat important to preserve the ordering of values.
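Point 3 can be sketched in user code today. The helper name `dims_by_appearance` is hypothetical and this is not how xarray currently picks its default; it only illustrates the proposed rule of walking data_vars first and coords second.

```python
import numpy as np
import xarray as xr


def dims_by_appearance(ds):
    """Return dimension names in order of first appearance: data_vars, then coords."""
    seen = {}  # a dict keeps insertion order and drops duplicates
    for variable in list(ds.data_vars.values()) + list(ds.coords.values()):
        for dim in variable.dims:
            seen.setdefault(dim, None)
    return list(seen)


# Variables passed alphabetically, as a stand-in for a dataset reloaded from a store.
dims = ("time", "lat", "lon")
ds = xr.Dataset(
    {
        "lat": ("lat", np.linspace(-90, 90, 4)),
        "lon": ("lon", np.linspace(0, 360, 3)),
        "temperature": (dims, np.zeros((5, 4, 3))),
    }
)

order = dims_by_appearance(ds)
frame = ds.to_dataframe(dim_order=order)
print(order)
```

As noted later in the thread, the result still depends on the order of the data variables themselves.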

@keewis
Collaborator

keewis commented Jan 6, 2025

I believe we won't be able to do 2 since sorting arbitrary hashables is not really possible (we can't in general compare hashables).

Should we change the order of Dataset.dims to follow 3 instead? Though that would still be affected by a change in the order of the data variables.

@dcherian
Contributor

dcherian commented Jan 6, 2025

2 since sorting arbitrary hashables is not really possible (we can't in general compare hashables).

Seems like we could catch that error though. The vast majority of xarray users use strings....

Should we change the order of Dataset.dims to follow 3 instead? Though that would still be affected by a change in the order of the data variables.

Ya that's unavoidable though. nD->1D reshape will depend on dimension order, and we need to align dimension order across all variables for the output to make sense. We can't hide that from the user.

@keewis
Collaborator

keewis commented Jan 6, 2025

Seems like we could catch that error though. The vast majority of xarray users use strings....

As in, only sort the string typed dimension names and keep the order of everything else? Or just skip sorting if sorted raises because it can't compare dimension names?
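The two options in this question could look roughly like the following. Both function names are hypothetical, and the second function is only one possible reading of "keep the order of everything else": string names are sorted among the positions that held strings, and all other names stay where they were.

```python
def sorted_or_unchanged(names):
    """Sort the names, or return them unchanged if they cannot be compared."""
    names = list(names)
    try:
        return sorted(names)
    except TypeError:
        # Mixed hashables (e.g. str and int) are not orderable in general.
        return names


def sort_strings_keep_rest(names):
    """Sort only the string names; non-string names keep their positions."""
    names = list(names)
    strings = iter(sorted(name for name in names if isinstance(name, str)))
    return [next(strings) if isinstance(name, str) else name for name in names]


print(sorted_or_unchanged(["time", "lat", "lon"]))
print(sorted_or_unchanged(["time", 0, "lat"]))
print(sort_strings_keep_rest(["time", 0, "lat"]))
```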

@dcherian
Contributor

dcherian commented Jan 6, 2025

either would be OK IMO

@keewis
Collaborator

keewis commented Jan 6, 2025

We can't hide that from the user.

And either way, for to_dataframe we're only talking about the default value for the dimension order; if there's a reason to choose a different order, the user should specify that explicitly (as I've tried to argue above).

to_pandas doesn't allow passing dim_order, but maybe it should?

@dcherian
Contributor

dcherian commented Jan 6, 2025

Agree with both.
