Store the dimension order when using Zarr #9924

golmschenk · 2025-01-04T04:29:16Z

Is your feature request related to a problem?

Currently, if a dataset is saved to Zarr and then reloaded, the dimensions are reloaded alphabetically.

Example:

import xarray as xr
import numpy as np
from pathlib import Path

original_dataset = xr.Dataset(
    {
        'temperature': (('time', 'lat', 'lon'), np.random.rand(5, 4, 3)),
        'precipitation': (('time', 'lat', 'lon'), np.random.rand(5, 4, 3))
    },
    coords={
        'time': np.arange(5),
        'lat': np.linspace(-90, 90, 4),
        'lon': np.linspace(0, 360, 3)
    }
)

print('Original Dimensions:', list(original_dataset.dims))
zarr_path = Path('example_dataset.zarr')
original_dataset.to_zarr(zarr_path, mode='w')
reloaded_dataset = xr.open_zarr(zarr_path)
print('Reloaded Dimensions:', list(reloaded_dataset.dims))

outputs

Original Dimensions: ['time', 'lat', 'lon']
Reloaded Dimensions: ['lat', 'lon', 'time']

Describe the solution you'd like

Store the dimension order when using Zarr, such that when the dataset is reloaded from the Zarr file, the original dimension order is maintained.

Describe alternatives you've considered

N/A

Additional context

N/A


welcome · 2025-01-04T04:29:20Z

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

keewis · 2025-01-04T16:23:24Z

For xarray (and zarr, I believe), the dimension order as listed in the dataset dimensions affects very little: as far as I can tell, the only places it shows up are the string and HTML reprs and the default values for to_dataframe (see also #9921 for a recent discussion).

Can you give us a bit more context on why you'd need to keep the dataset dimension order?

golmschenk · 2025-01-04T17:02:55Z

> Can you give us a bit more context on why you'd need to keep the dataset dimension order?

For consistency when converting to and from Pandas (and from there to and from CSV), and because a lack of consistency will affect adoption of xarray in my team. Converting to and from Pandas is where I originally noticed the inconsistency in the order.

Many of my team members are scientists who are not going to want to deal with xarray/Zarr directly. They have legacy code that works with an ad hoc text format (sorta CSV-like, but not quite) they've dealt with previously. These members would be perfectly happy working with CSV data, though. However, it's often useful for some members of the team to have larger-than-memory and distributed processing (the ones who would be happy to work with xarray/Zarr directly). I'm looking to switch to xarray/Zarr to make this part easier for those who are involved in that component of the work, while still easily importing and exporting subsets of the data back to the CSV that other members of the team will use. The loss of column order from/to the CSVs makes switching to xarray more difficult to justify. One solution would be to add my own saving and loading of the order. But then any of the team members who work with the xarray/Zarr data directly will also need to make sure to perform the same steps. This is doable, but it is an extra hurdle to justify switching to xarray. If this ordering were automatically consistent through xarray, that would remove the issue.

keewis · 2025-01-04T17:16:50Z

If you're using to_dataframe (to_pandas also calls that but doesn't let you pass arguments) you can try using Dataset.to_dataframe's dim_order parameter to choose a fixed/constant dimension order (see also #9718 for a recent discussion on that topic).

golmschenk · 2025-01-04T17:32:16Z

Thanks, but unfortunately, the columns/dimensions in our data are not fixed. For a given experiment, they are consistent across all the data from that experiment. But different experiments will have different columns/dimensions, so I'm not able to simply define a consistent order somewhere. It needs to be inferred from the data files. That is, one experiment might produce a collection of CSV(-like) files that all have the same columns. Here is where xarray would come in: to store the larger results in Zarr, process the data, and at some point spit back out some CSVs with the same columns. A different experiment would then have different columns, but would still need similar processing.

jhamman · 2025-01-04T17:33:35Z

The ordering of list(Dataset.dims) is a red herring here. Or at a minimum, it isn't the core part of the problem you are after. The dims property of the Dataset is a mapping including all the names and sizes of dimensions for all variables in your dataset. Unlike the DataArray.dims property, it may not be ordered.

With that in mind, the actual problem here appears to be that to_dataframe seems to be relying on the ordering of the Dataset.dims property (or something similar -- I haven't looked at the code in a while).

We could change Dataset.dims to always be sorted somehow. Or we could update how to_dataframe chooses its dimension order.

keewis · 2025-01-04T17:40:56Z

list(Dataset.dims) is the default value for the dim_order parameter. Sorting Dataset.dims won't work since we don't know (in general) how to compare dimension names, which can be any hashable.

However, user code that does know that all dims are strings can just do (use sorted's key parameter if you don't want to sort alphabetically):

ds.to_dataframe(dim_order=sorted(ds.dims))

which will give you a consistent (sorted) order, even if you don't know the actual dimensions.
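The comparison problem can be seen in plain Python, without xarray. This is a minimal sketch; the dimension names below are made up for illustration, and the mixed list stands in for a dataset whose dimension names are arbitrary hashables:

```python
# Dimension names may be any hashable, so a mixed collection has no total order.
mixed_dims = ["time", 0, ("lat", "lon")]  # hypothetical dimension names

try:
    sorted(mixed_dims)
    mixed_is_sortable = True
except TypeError:
    # str, int and tuple cannot be compared with "<", so sorting fails.
    mixed_is_sortable = False

# When every name is a string, sorting is well defined.
string_dims = ["time", "lat", "lon"]
alphabetical = sorted(string_dims)

# `key` selects a different rule, here: shortest name first (stable for ties).
by_length = sorted(string_dims, key=len)

print(mixed_is_sortable)  # False
print(alphabetical)       # ['lat', 'lon', 'time']
print(by_length)          # ['lat', 'lon', 'time']
```

Library code cannot assume the all-strings case, but user code that knows its own dimension names can.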

golmschenk · 2025-01-04T17:52:00Z

Narrowing down to my more concrete situation (e.g., to and from Pandas), I just want to clarify that the issue only occurs after saving and reloading from Zarr. Just converting to and from Pandas and xarray alone does not encounter the issue I'm running into. For example:

import pandas as pd
import xarray as xr
from pathlib import Path

original_pandas_data_frame = pd.DataFrame({'b': [0, 1], 'a': [2, 3]})

original_xarray_dataset = original_pandas_data_frame.to_xarray()
pandas_data_frame_from_plain_xarray = original_xarray_dataset.to_pandas()

zarr_path = Path('example_dataset.zarr')
original_xarray_dataset.to_zarr(zarr_path, mode='w')
zarr_loaded_xarray_dataset = xr.open_zarr(zarr_path)
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.to_pandas()

print(original_pandas_data_frame.columns)
print(pandas_data_frame_from_plain_xarray.columns)
print(pandas_data_frame_from_zarr_loaded_xarray.columns)

outputs:

Index(['b', 'a'], dtype='object')
Index(['b', 'a'], dtype='object')
Index(['a', 'b'], dtype='object')

(This might have already been clear, but I just wanted to make sure)

keewis · 2025-01-04T21:16:06Z

I got that, thanks.

The dimension order of a dataset depends on the order in which it is seen on the variables from which the dataset is constructed. The zarr store returns variables in an alphabetical order (most likely because that's how it got them from the filesystem), which means xarray will see the dimensions in this order. For your first example that would be lat (from the lat coordinate), then lon (from the lon coordinate), then time (from precipitation).

Either way, if you rely on the order somehow it is good practice to explicitly define that order somewhere.
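The order-of-appearance behaviour described above can be modelled in a few lines of plain Python. This is a simplified sketch, not xarray's actual implementation: the helper `dims_in_order_of_appearance` is hypothetical, and each variable is reduced to its tuple of dimension names.

```python
def dims_in_order_of_appearance(variables):
    """Collect dimension names in the order they are first seen.

    `variables` maps a variable name to its tuple of dimension names.
    """
    seen = {}  # dicts preserve insertion order
    for dims in variables.values():
        for dim in dims:
            seen.setdefault(dim, None)
    return list(seen)


# Variables in the order they were given when the dataset was built.
original = {
    "temperature": ("time", "lat", "lon"),
    "precipitation": ("time", "lat", "lon"),
    "time": ("time",),
    "lat": ("lat",),
    "lon": ("lon",),
}

# The same variables, but handed over in alphabetical order of their names,
# as a store that lists its arrays alphabetically would do.
as_read_from_store = {name: original[name] for name in sorted(original)}

print(dims_in_order_of_appearance(original))            # ['time', 'lat', 'lon']
print(dims_in_order_of_appearance(as_read_from_store))  # ['lat', 'lon', 'time']
```

The two results match the "Original Dimensions" and "Reloaded Dimensions" output from the first example.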

golmschenk · 2025-01-05T04:20:39Z

For my case then, I suppose I will try to store the dataset dimensions in the Zarr group attrs similar to how xarray currently stores the array dimensions in the Zarr array attrs. The downside of this is that we'll need to consistently use wrapped versions of xarray's Zarr saving and loading. I guess this might also be a solution for this feature request, but I don't know the types of obstacles that might prevent this, or if this would even fall outside the scope expected of xarray. Thank you much!

keewis · 2025-01-05T11:33:01Z

What I was thinking of was something like this:

def to_ordered_dataframe(ds):
    # use `sorted(ds.dims, key=lambda k: ...)` for a sorting other than alphabetical
    return ds.to_dataframe(dim_order=sorted(ds.dims))

...
pandas_data_frame_from_plain_xarray = original_xarray_dataset.pipe(to_ordered_dataframe)
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.pipe(to_ordered_dataframe)
...

If you want to additionally store the expected order for exact roundtripping, I would define a wrapper around .to_dataframe:

def persist_dim_order(ds):
    return ds.assign_attrs(dim_order=list(ds.dims))

def to_ordered_dataframe(ds):
    dim_order = ds.attrs["dim_order"]
    return ds.to_dataframe(dim_order=dim_order)

...
original_xarray_dataset = original_pandas_data_frame.to_xarray().pipe(persist_dim_order)
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.pipe(to_ordered_dataframe)
...

you could even put that into an accessor:

@xr.register_dataset_accessor("ordered_df")
class OrderedDFAccessor:
    def __init__(self, ds):
        self._ds = ds

    def persist_dim_order(self):
        return self._ds.assign_attrs(dim_order=list(self._ds.dims))

    def to_dataframe(self):
        dim_order = self._ds.attrs["dim_order"]
        return self._ds.to_dataframe(dim_order=dim_order)

...
original_xarray_dataset = original_pandas_data_frame.to_xarray().ordered_df.persist_dim_order()
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.ordered_df.to_dataframe()
...

keewis · 2025-01-06T13:51:32Z

Does this answer your question? If so, can we close this?

golmschenk · 2025-01-06T15:48:29Z

@keewis, I suppose I would still prefer that xarray had this built in, in which case this issue might remain open as a feature request. However, I understand if this is deemed to be outside of xarray's goals and the issue is closed.

keewis · 2025-01-06T15:58:39Z

The only way I could imagine resolving this is to change the default of dim_order. However, since we can't really determine a particular order (we can't sort), the only alternative to falling back to the (somewhat unstable) Dataset.dims is requiring the user to pass it explicitly. So yes, unless anyone has a better idea I'd probably close the issue.

@dcherian, do you have any opinions on this?

dcherian · 2025-01-06T19:09:20Z

My personal opinion is that

1. Dataset.dims is unordered and reflects the data model.
2. I think the repr should present sorted(Dataset.dims) to make it easy to scan, and
3. For to_dataframe we could use order of appearance of dimension names as we iterate through the list of data_vars followed by coords since it is somewhat important to preserve the ordering of values.

keewis · 2025-01-06T19:40:33Z

I believe we won't be able to do 2 since sorting arbitrary hashables is not really possible (we can't in general compare hashables).

Should we change the order of Dataset.dims to follow 3 instead? Though that would still be affected by a change in the order of the data variables.

dcherian · 2025-01-06T21:12:47Z

> 2 since sorting arbitrary hashables is not really possible (we can't in general compare hashables).

Seems like we could catch that error though. The vast majority of xarray users use strings....

> Should we change the order of Dataset.dims to follow 3 instead? Though that would still be affected by a change in the order of the data variables.

Ya that's unavoidable though. nD->1D reshape will depend on dimension order, and we need to align dimension order across all variables for the output to make sense. We can't hide that from the user.
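Why the flattened output depends on dimension order can be shown with a plain-Python model of an nD->1D reshape. This is only a sketch: the coordinate values are made up, and `flattened_index` is a hypothetical helper that enumerates the row labels a flattened table would have for a given dimension order.

```python
from itertools import product

# Hypothetical coordinate values for two dimensions.
coords = {"time": [0, 1], "lat": [10, 20]}


def flattened_index(dim_order):
    """Row labels of the flattened table; the last dimension varies fastest."""
    return list(product(*(coords[dim] for dim in dim_order)))


print(flattened_index(["time", "lat"]))  # [(0, 10), (0, 20), (1, 10), (1, 20)]
print(flattened_index(["lat", "time"]))  # [(10, 0), (10, 1), (20, 0), (20, 1)]
```

Both the order of the index levels and the order of the rows change, so some dimension order has to be chosen and applied to every variable.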

keewis · 2025-01-06T22:27:04Z

> Seems like we could catch that error though. The vast majority of xarray users use strings....

As in, only sort the string typed dimension names and keep the order of everything else? Or just skip sorting if sorted raises because it can't compare dimension names?

dcherian · 2025-01-06T22:31:48Z

either would be OK IMO
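The two options just discussed could look roughly like this in plain Python. Both helpers are hypothetical sketches, not xarray code, and the first is only one possible reading of "sort the string typed names and keep the order of everything else" (sorted strings first, then the remaining names in their original order).

```python
def sort_strings_keep_rest(dims):
    """Sort string names alphabetically; append other names in original order."""
    strings = sorted(d for d in dims if isinstance(d, str))
    others = [d for d in dims if not isinstance(d, str)]
    return strings + others


def sorted_or_unchanged(dims):
    """Sort if the names are mutually comparable, else keep the given order."""
    dims = list(dims)
    try:
        return sorted(dims)
    except TypeError:
        return dims


mixed = ["time", 2, "lat", 1]  # hypothetical dimension names

print(sort_strings_keep_rest(mixed))         # ['lat', 'time', 2, 1]
print(sorted_or_unchanged(mixed))            # ['time', 2, 'lat', 1]
print(sorted_or_unchanged(["time", "lat"]))  # ['lat', 'time']
```

For the common all-strings case both helpers give the same alphabetical result; they differ only when non-string names are present.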

keewis · 2025-01-06T22:34:44Z

> We can't hide that from the user.

And either way, for to_dataframe we're only talking about the default value for the dimension order, if there's a reason to choose a different order the user should specify that explicitly (as I've tried to argue above).

to_pandas doesn't allow passing dim_order, but maybe it should?

dcherian · 2025-01-06T22:36:44Z

Agree with both.

golmschenk added the enhancement label Jan 4, 2025

keewis added usage question and removed enhancement labels Jan 6, 2025
