-
Notifications
You must be signed in to change notification settings - Fork 610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Leiden clustering returns different results for Scanpy 1.10.4
and 1.9.3
#3422
Comments
Hm, I need to take a closer look at that. I didn‘t try to reproduce it, but logically, it seems impossible, the code’s behavior looks identical to me: git diff 1.9.3:scanpy/tools/_leiden.py 1.10.4:src/scanpy/tools/_leiden.py |
@flying-sheep I just tested it again and it still seems to be inconsistent between Another possibility is that the implementation of |
@alam-shahul Thanks for the pointers, both the |
I think the culprit lie with our random state handling: #2946 (comment) |
I suspect (at least for #2946) the culprit is not ScanPy directly but one of its dependencies (specifically, a change in the dependency's internals between two versions). I am not sure which dependency specifically at the moment. |
#2536 So this PR caused the change. I'll need to look into why. |
That PR was huge. I did my best to keep it compatible but I’m not surprised that something slipped through. IIRC think the main difference is that after the PR, If it does, the ideal fix would be
On the other hand, people have been using the new code for a year, so going back might break their backwards compatibility. |
Ok @flying-sheep following your suggestion I was able to get the connectivities to match but not the distances. I would assume the culprit lies here: https://github.com/scverse/scanpy/pull/2536/files#diff-934eec0a4b88db7c4f7c099d803a25f6b81ca654579bd1ee84fd28b7858b2de2L380-L423 We don't re-get the self._distances = _get_sparse_matrix_from_indices_distances_umap(
knn_indices, knn_distances, self._adata.shape[0], self.n_neighbors
) seems to fix things up to 5 decimal places. But I am super out of my depth here - I don't really understand why it was removed or there in the first place. Here's the very rough diff --git a/src/scanpy/neighbors/__init__.py b/src/scanpy/neighbors/__init__.py
index 21404372..d4e24d07 100644
--- a/src/scanpy/neighbors/__init__.py
+++ b/src/scanpy/neighbors/__init__.py
@@ -10,6 +10,7 @@ from warnings import warn
import numpy as np
import scipy
from scipy.sparse import issparse
+from sklearn.metrics import pairwise_distances
from sklearn.utils import check_random_state
from .. import _utils
@@ -19,6 +20,7 @@ from .._settings import settings
from .._utils import NeighborsView, _doc_params, get_literal_vals
from . import _connectivity
from ._common import (
+ _get_indices_distances_from_dense_matrix,
_get_indices_distances_from_sparse_matrix,
_get_sparse_matrix_from_indices_distances,
)
@@ -571,8 +573,8 @@ class Neighbors:
self.n_neighbors = n_neighbors
self.knn = knn
X = _choose_representation(self._adata, use_rep=use_rep, n_pcs=n_pcs)
- self._distances = transformer.fit_transform(X)
- knn_indices, knn_distances = _get_indices_distances_from_sparse_matrix(
+ self._distances = pairwise_distances(X, metric=metric, **metric_kwds)
+ knn_indices, knn_distances = _get_indices_distances_from_dense_matrix(
self._distances, n_neighbors
)
if shortcut:
@@ -593,14 +595,42 @@ class Neighbors:
with contextlib.suppress(Exception):
self._rp_forest = _make_forest_dict(index)
start_connect = logg.debug("computed neighbors", time=start_neighbors)
-
if method == "umap":
+
+ def _get_sparse_matrix_from_indices_distances_umap(
+ knn_indices, knn_dists, n_obs, n_neighbors
+ ):
+ rows = np.zeros((n_obs * n_neighbors), dtype=np.int64)
+ cols = np.zeros((n_obs * n_neighbors), dtype=np.int64)
+ vals = np.zeros((n_obs * n_neighbors), dtype=np.float64)
+
+ for i in range(knn_indices.shape[0]):
+ for j in range(n_neighbors):
+ if knn_indices[i, j] == -1:
+ continue # We didn't get the full knn for i
+ if knn_indices[i, j] == i:
+ val = 0.0
+ else:
+ val = knn_dists[i, j]
+
+ rows[i * n_neighbors + j] = i
+ cols[i * n_neighbors + j] = knn_indices[i, j]
+ vals[i * n_neighbors + j] = val
+ import scipy.sparse as sp
+
+ result = sp.coo_matrix((vals, (rows, cols)), shape=(n_obs, n_obs))
+ result.eliminate_zeros()
+ return result.tocsr()
+
self._connectivities = _connectivity.umap(
knn_indices,
knn_distances,
n_obs=self._adata.shape[0],
n_neighbors=self.n_neighbors,
)
+ self._distances = _get_sparse_matrix_from_indices_distances_umap(
+ knn_indices, knn_distances, self._adata.shape[0], self.n_neighbors
+ )
elif method == "gauss":
self._connectivities = _connectivity.gauss(
self._distances, self.n_neighbors, knn=self.knn As for what we should do about this, I am also a bit at a loss. I would tend towards making a public announcement and going back to the original behavior. Certainly, something should be done, or at least stated. |
OK, let’s think about this. What it does should already happen here: scanpy/src/scanpy/neighbors/__init__.py Lines 581 to 584 in 75c246d
Indeed: why and how does
|
I think I am also curious as to why was |
The umap version also deals with -1s which we never feed in. Otherwise I can’t see a difference other than the umap version being dog slow. That’s why it was removed: it’s a slower version of something we have with features we don’t use. But you’re saying switching it out makes a difference so seems like I’m wrong. I just wonder what I’m not seeing |
Ok great to know. I'll have a closer look then. |
Please make sure these conditions are met
What happened?
I am doing some data analysis that depends on Scanpy, and for reproducibility reasons it is important that the results from Scanpy's implementation of Leiden clustering are consistent.
It seems that the changes to the
sc.tl.leiden
implementation inv1.10
change the clustering results, although they are not supposed to (there is a warning that the defaults will change in the future, but that they have not yet).Can this be patched?
Minimal code sample
Error output
Versions
The text was updated successfully, but these errors were encountered: