-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Wasserstein Distance Computation Bug #79
Fix Wasserstein Distance Computation Bug #79
Conversation
@MohammadYasinKarbasian What bug does this PR specifically fix? Also, it seems that your fix incorrectly computes the Wasserstein distance when points in the diagram contain an infinite death--see this test failing here. |
If you run this code: data = np.random.random((1000,2)) You see that when we create the matrix UL, values on the diagonal are not zero. Then we place that matrix on D[M: N+M, 0:N], so we have those non-zero values on D[M: N+M, 0:N]. Then we run the Hungarian algorithm using this line "matchi, matchj = optimize.linear_sum_assignment(D)". If you look at matchj as a result, you can see that usually at least for one i from [M: N+M], matchj[i] = i - M. So it means that some values from dgm2 match with diagonal and add some extra distance on matchdist. so this problem is happening here and the computed matchdist is not correct. In the code I have fixed, this issue does not happen because the cost of matching dgm2 data to all points is zero and I only calculate the distance for matching dgm1 points to dgm2 or diagonal. so for every matching on i from [M: N+M], matchj[i] will match to some point that creates no distance and all the distance is for i from [0: M]. This is the real distance because in Wasserstein distance we want to move the points from dgm1 to dgm2. right? So we should not add distance for points that are in dgm2. Also, I have debugged the test error and figured out what happens when we have this error. |
Thanks for the detailed reply. I still think I'm failing to understand what the problem is. Are you saying you believe the code is inefficient or that it is outright wrong? It might be helpful for me if you could give a concrete example (instead of using random persistence diagrams) of an incorrect or extraneous computation being done. However, I think that your explanation is incorrect because it seems you are not treating the two persistence diagrams on equal footing. On your PR branch, I get the following In [1]: import persim
In [2]: import numpy as np
In [3]: dgm1 = np.array([[5,6]])
In [4]: dgm2 = np.array([[6,7],[7,8]])
In [5]: persim.wasserstein(dgm1, dgm2)
Out[5]: 0.7071067811865483
In [6]: persim.wasserstein(dgm2, dgm1)
Out[6]: 1.4142135623730971 Your proposed solution is not symmetric, so it can't be a metric. Also, calculating by hand gives # Switch to master branch
In [7]: persim.wasserstein(dgm1, dgm2)
Out[7]: 2.1213203435596437
In [8]: np.sqrt(2)*3/2
Out[8]: 2.121320343559643 |
Yes, you're right, the code I've modified causes other problems, including that it's no longer symmetrical. I think the problem is due to the difference in our view of Wasserstein's Distance calculation. And with these assumptions that you have, there is probably no problem in the code. |
Thank you very much for bringing this up though, and thank you Michael for
your oversight. I haven't had the cycles to contribute the past week (pun
intended), but I will keep this in my inbox and check it out later.
Message ID: ***@***.***>
… |
This pull request addresses an issue related to the computation of Wasserstein distance in our codebase. Specifically, the UL matrix is removed due to a phenomenon where certain rows after the M-th row in the Hungarian algorithm, match with themselves which creates a distance greater than zero so it creates inaccuracies in the resultant distance calculation.
Changes Made
Removing the UL matrix results in eliminating erroneous values in the Wasserstein distance calculation.
Impact
This fix ensures more accurate computation of Wasserstein distance by eliminating erroneous additions to the distance metric, thereby improving the overall reliability of our codebase.