Store the dimension order when using Zarr #9924
Comments
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! |
Can you give us a bit more context on why you'd need to keep the dataset dimension order? |
For consistency when converting to and from Pandas (and from there to and from CSV), and because a lack of consistency will affect adoption of xarray in my team. Converting to and from Pandas is where I originally noticed the inconsistency in the order.

Many of my team members are scientists who are not going to want to deal with xarray/Zarr directly. They have legacy code that works with an ad hoc text format (sort of CSV-like, but not quite) that they've dealt with previously. These members would be perfectly happy working with CSV data, though. However, it's often useful for some members of the team to have larger-than-memory and distributed processing (these are the ones who would be happy to work with xarray/Zarr directly). I'm looking to switch to xarray/Zarr to make this part easier for those involved in that component of the work, while still easily importing and exporting subsets of the data to and from the CSVs that other members of the team will use. The loss of column order from/to the CSVs makes switching to xarray more difficult to justify.

One solution would be to add my own saving and loading of the order. But then any team members who work with the xarray/Zarr data directly will also need to make sure to perform the same steps. This is doable, but it is an extra hurdle to justifying the switch to xarray. If the ordering were automatically consistent through xarray, that would remove this issue. |
If you're using |
Thanks, but unfortunately, the columns/dimensions in our data are not fixed. For a given experiment, they are consistent across all the data from that experiment. But different experiments will have different columns/dimensions, so I'm not able to simply define a consistent order somewhere. It needs to be inferred from the data files. That is, one experiment might produce a collection of CSV(-like) files that all have the same columns. This is where xarray would come in: to store the larger results in Zarr, process the data, and at some point spit back out some CSVs with the same columns. A different experiment would then have different columns, but would still need similar processing. |
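To make "inferred from the data files" concrete, here is a minimal sketch of capturing the column order from a CSV header and reusing it on export. It uses only the standard library `csv` module and does not touch xarray or Zarr. The helper names `read_with_column_order` and `write_with_column_order` are hypothetical, and it assumes plain CSV with a header row, which may not match the ad hoc format described above.

```python
import csv
import io


def read_with_column_order(text):
    """Return (column_order, rows); the order is taken from the header row.

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    # `fieldnames` is None for empty input, so fall back to an empty list.
    return list(reader.fieldnames or []), rows


def write_with_column_order(rows, column_order):
    """Write rows back out with the columns in the captured order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=column_order, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


text = "b,a\n0,2\n1,3\n"
column_order, rows = read_with_column_order(text)
print(column_order)
print(write_with_column_order(rows, column_order) == text)
```

The captured `column_order` list is the piece of information that would have to travel with the data through any intermediate storage for the exported CSV to match the imported one.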
The ordering of With that in mind, the actual problem here appears to be that We could change |
However, user code that does know that all dims are strings can just use ds.to_dataframe(dim_order=sorted(ds.dims)), which will give you a consistent (sorted) order, even if you don't know the actual dimensions. |
Narrowing down to my more concrete situation (e.g., to and from Pandas), I just want to clarify that the issue only occurs after saving to and reloading from Zarr. Converting between Pandas and xarray alone does not trigger the issue I'm running into. For example:
import pandas as pd
import xarray as xr
from pathlib import Path
original_pandas_data_frame = pd.DataFrame({'b': [0, 1], 'a': [2, 3]})
original_xarray_dataset = original_pandas_data_frame.to_xarray()
pandas_data_frame_from_plain_xarray = original_xarray_dataset.to_pandas()
zarr_path = Path('example_dataset.zarr')
original_xarray_dataset.to_zarr(zarr_path, mode='w')
zarr_loaded_xarray_dataset = xr.open_zarr(zarr_path)
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.to_pandas()
print(original_pandas_data_frame.columns)
print(pandas_data_frame_from_plain_xarray.columns)
print(pandas_data_frame_from_zarr_loaded_xarray.columns)
outputs:
(This might have already been clear, but I just wanted to make sure) |
I got that, thanks. The dimension order of a dataset depends on the order in which it is seen on the variables from which the dataset is constructed. The Either way, if you rely on the order somehow it is good practice to explicitly define that order somewhere. |
For my case then, I suppose I will try to store the dataset dimensions in the Zarr group |
What I was thinking of was something like this:
def to_ordered_dataframe(ds):
    # use `sorted(ds.dims, key=lambda k: ...)` for a sorting other than alphabetical
    return ds.to_dataframe(dim_order=sorted(ds.dims))
...
pandas_data_frame_from_plain_xarray = original_xarray_dataset.pipe(to_ordered_dataframe)
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.pipe(to_ordered_dataframe)
...
If you want to additionally store the expected order for exact roundtripping, I would define a wrapper around
def persist_dim_order(ds):
    # record the current dimension order in the dataset attributes
    return ds.assign_attrs(dim_order=list(ds.dims))
def to_ordered_dataframe(ds):
    # read the recorded order back and use it for the conversion
    dim_order = ds.attrs["dim_order"]
    return ds.to_dataframe(dim_order=dim_order)
...
original_xarray_dataset = original_pandas_data_frame.to_xarray().pipe(persist_dim_order)
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.pipe(to_ordered_dataframe)
...
You could even put that into an accessor:
@xr.register_dataset_accessor("ordered_df")
class OrderedDFAccessor:
    def __init__(self, ds):
        self._ds = ds
    def persist_dim_order(self):
        return self._ds.assign_attrs(dim_order=list(self._ds.dims))
    def to_dataframe(self):
        dim_order = self._ds.attrs["dim_order"]
        return self._ds.to_dataframe(dim_order=dim_order)
...
original_xarray_dataset = original_pandas_data_frame.to_xarray().ordered_df.persist_dim_order()
...
pandas_data_frame_from_zarr_loaded_xarray = zarr_loaded_xarray_dataset.ordered_df.to_dataframe()
... |
does this answer your question? If so, can we close this? |
@keewis, I suppose I would still prefer that xarray had this built in, in which case this issue might remain open as a feature request. However, I understand if this is deemed to be outside of xarray's goals and the issue is closed. |
The only way I could imagine resolving this is to change the default of @dcherian, do you have any opinions on this? |
My personal opinion is that
|
I believe we won't be able to do 2 since sorting arbitrary hashables is not really possible (we can't in general compare hashables). Should we change the order of |
Seems like we could catch that error though. The vast majority of xarray users use strings....
Ya that's unavoidable though. nD->1D reshape will depend on dimension order, and we need to align dimension order across all variables for the output to make sense. We can't hide that from the user. |
As in, only sort the string typed dimension names and keep the order of everything else? Or just skip sorting if |
either would be OK IMO |
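As a rough illustration of the two fallbacks discussed above, here is a plain-Python sketch that does not use any xarray internals. The function names are hypothetical. The first variant assumes "skip sorting" means keeping the original order whenever comparing the names raises `TypeError`. The second variant is one possible reading of "keep the order of everything else", in which non-string names stay in their original positions.

```python
from collections.abc import Hashable, Iterable


def sort_if_comparable(names: Iterable[Hashable]) -> list:
    """Sort all names; keep the original order if they cannot be compared."""
    names = list(names)
    try:
        return sorted(names)
    except TypeError:
        # Mixed types such as "b" and 0 have no defined ordering in Python 3.
        return names


def sort_strings_only(names: Iterable[Hashable]) -> list:
    """Sort only the string names; every other name keeps its position."""
    names = list(names)
    sorted_strings = iter(sorted(n for n in names if isinstance(n, str)))
    return [next(sorted_strings) if isinstance(n, str) else n for n in names]


print(sort_if_comparable(["b", "a"]))     # ['a', 'b']
print(sort_if_comparable(["b", 0, "a"]))  # ['b', 0, 'a']
print(sort_strings_only(["b", 0, "a"]))   # ['a', 0, 'b']
```

The first variant gives an all-or-nothing result, while the second always sorts the string names but yields an order that depends on where the non-string names happen to sit.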
And either way, for
|
Agree with both. |
Is your feature request related to a problem?
Currently, if a dataset is saved to Zarr and then reloaded, the dimensions are reloaded alphabetically.
Example:
outputs
Describe the solution you'd like
Store the dimension order when using Zarr, such that when the dataset is reloaded from the Zarr file, the original dimension order is maintained.
Describe alternatives you've considered
N/A
Additional context
N/A