Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integer values in pmb.df become float when reading a pyMBE dataframe from file #102

Open
pm-blanco opened this issue Nov 13, 2024 · 8 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@pm-blanco
Copy link
Collaborator

This becomes an issue when restarting simulations that have been set up with pyMBE since it breaks the setup of the reaction methods, which cannot be checkpointed with the current funtionalities in ESPResSo.

Here a minimal working example that shows this issue:

import pyMBE

# Create an instance of pyMBE library
pmb = pyMBE.pymbe_library(seed=42)

# Acidic particle
pmb.define_particle(
    name = "A",
    acidity = "basic",
    pka = 4,
    sigma = 1*pmb.units('reduced_length'),
    epsilon = 1*pmb.units('reduced_energy'))

pmb.define_particle(
    name = "C",
    z = 1,
    sigma = 1*pmb.units('reduced_length'),
    epsilon = 1*pmb.units('reduced_energy'))

# Save the pyMBE dataframe in a CSV file
pmb.write_pmb_df(filename='df.csv')

# Here all  charge numbers and espresso types are integers
print(pmb.df)

# Read the pyMBE dataframe 
pmb.read_pmb_df(filename='df.csv')

# Here some of the charge numbers and espresso types are float
print(pmb.df)

# This breaks the setup of the reaction methods
pmb.setup_cpH(counter_ion="C", constant_pH=2)

@paobtorres do you want to take care of solving this?

@pm-blanco pm-blanco added the bug Something isn't working label Nov 13, 2024
@paobtorres
Copy link
Contributor

Yes! I will try to solve it

@pm-blanco pm-blanco added this to the pyMBE 1.0.0 milestone Nov 13, 2024
@paobtorres
Copy link
Contributor

paobtorres commented Nov 23, 2024

@pm-blanco A quick update I have partially solved the issue changing:

for column_name in columns_dtype_int:
            df[column_name] = df[column_name].astype(object)

for:

for column_name in columns_dtype_int:
            df[column_name] = df[column_name].astype(pd.Int64Dtype())

However, it changes the nan values from NaN (np.NaN) to <NA> which is pd.NA in that column. It does not affect functionality but in the df, there is a mix of NaN and , it does not look good. (I'm still working on it but I have not created a branch just yet)

@paobtorres
Copy link
Contributor

Hi! @jngrad We would like your opinion on this issue. In the previous comment I mentioned the solution I found for this issue, however, we still have the problem with np.NAN and pd.NA .

We were talking with @pm-blanco to change in pyMBE all the functions that have np.NaN by default to pd.NA. What do you think about this?

@jngrad
Copy link
Member

jngrad commented Dec 4, 2024

That's really tricky. According to chapter Working with missing data, pd.NA behaves like None and NaN, which came as a surprise to me. I'm used to Javascript's null vs undefined vs NaN semantics, and Python's None vs NaN semantics. Merging None and NaN into a unified data type introduces new side effects, for example sometimes it propagates (even through comparison operations) and sometimes it doesn't (for example you can divide 0 by 0 to yield np.nan). This means we can't use np.isnan() reliably, and should always use pd.isna() instead. In addition, NumPy produces a warning the first time we generate a NaN, while Pandas sometimes flat out refuses to generate a NaN.

Here is a MWE:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(data={"time": [1., 2., pd.NaT], "pos": [1.5, 2.5, 0.]})
>>> df
  time  pos
0  1.0  1.5
1  2.0  2.5
2  NaT  0.0
>>> (0. / df["pos"])  # generates np.nan
0    0.0
1    0.0
2    NaN
Name: pos, dtype: float64
>>> np.isnan(0. / df["pos"])  # works as expected
0    False
1    False
2     True
Name: pos, dtype: bool
>>> (0. / df["time"])  # generates NaN of type object
0    0.0
1    0.0
2    NaN
Name: time, dtype: object
>>> (0. / df["time"]).to_numpy().dtype  # not convertible to numpy array of float64
dtype('O')
>>> np.isnan(np.nan)
True
>>> np.isnan(pd.NA)  # doesn't work
<NA>
>>> pd.isna(pd.NA)
True
>>> pd.isna(np.nan)  # always works
True
>>> pd.Series([1., np.inf, np.nan, pd.NA]).isnull()  # np.nan decays to NULL
0    False
1    False
2     True
3     True
dtype: bool
>>> 1 / np.array([0., 1.])  # warning the first time
<stdin>:1: RuntimeWarning: divide by zero encountered in divide
array([inf,  1.])
>>> 1 / np.array([0., 1.])  # no warning the second time
array([inf,  1.])
>>> (0. / pd.Series([0., 1.]))  # yields NaN
0    NaN
1    0.0
dtype: float64
>>> (0. / pd.Series([0., 1., pd.NaT]))  # doesn't run because pd.NaT changes series type
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division by zero

I would say that np.nan isn't the best candidate to represent missing data, because it forces the data series to become a floating-point series. If there was a pd.NULL type, I would recommend it for missing data over np.nan. Looking through the Pandas docs, it seems like pd.NA is the right candidate to represent null, especially considering Pandas treats np.nan as null. But I must admit it's the first time I read about it, and its semantics confuses me a lot, which makes me reluctant to recommend its usage. If we decide to adopt it, everyone in the pyMBE community should get familiar with pd.NA semantics.

@pm-blanco
Copy link
Collaborator Author

Thank you for your insight @jngrad! This is indeed a very tricky issue indeed and not solution seems to be completely satisfying. At the moment, it seems that defaulting to pd.NA is the lesser evil but we should make our community aware of this indeed. Maybe we could add an entry to our wiki about it, so we can refer people to it in the future?

@jngrad
Copy link
Member

jngrad commented Dec 4, 2024

I agree. We can link to the relevant chapter in the Pandas user manual, but should explicitly mention that pd.isna() should always be preferred over np.isnan().

@paobtorres
Copy link
Contributor

Thanks @jngrad for your feedback about this issue. Perfect, then I will make the change to pd.NA. @pm-blanco

@pm-blanco
Copy link
Collaborator Author

@paobtorres I almost forgot about this issue, what is the status?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants