Is there a way to specify different algo args based on the dataset? #178
Comments
There's no way. I think it's fine to include both parameter sets, although if they are wildly different I'd be a bit nervous that they are cherry-picked. It looks like there's a very large set of points on the scatter plot, so I'd recommend pruning it down to no more than 20-50 different parameters. Very excited about including elastiknn!
@erikbern I'm still stuck on this and want to propose a solution and get your feedback before PRing it. To recap, basically there is one parameter (the LSH hashing width parameter I've thought about ways to add some early stopping heuristic within Elastiknn, but hate the idea of introducing magic numbers when, IMO, the real solution is to understand the distribution of your data and how it affects parameter choice. I've also documented good parameters here. So, my proposed solution: in the Elastiknn "algorithm" class in this repo, I'll monitor response times for queries. If after 100 queries the mean response time is > 100ms, I just
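The monitoring idea in this comment can be sketched roughly as below. This is a minimal illustration, not code from Elastiknn or ann-benchmarks: the `LatencyMonitor` class and its method names are hypothetical, and only the 100-query window and the 100 ms threshold come from the comment. What should happen once a setting is flagged as too slow is left to the caller.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LatencyMonitor:
    """Tracks per-query response times and flags a slow parameter setting.

    Hypothetical helper: the defaults (100 queries, 100 ms) mirror the
    numbers mentioned in the comment; everything else is an assumption.
    """

    min_queries: int = 100
    max_mean_seconds: float = 0.100
    _times: List[float] = field(default_factory=list)

    def record(self, seconds: float) -> None:
        # Store the wall-clock duration of one query.
        self._times.append(seconds)

    def mean(self) -> float:
        # Mean response time so far; 0.0 if nothing has been recorded.
        return sum(self._times) / len(self._times) if self._times else 0.0

    def too_slow(self) -> bool:
        # Only judge once enough queries have been observed.
        enough = len(self._times) >= self.min_queries
        return enough and self.mean() > self.max_mean_seconds
```

A caller could time each query with `time.perf_counter()`, pass the elapsed seconds to `record`, and check `too_slow()` after every query.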
@alexklibisz I'm wondering if you cannot figure out In general, it might improve the time to run the full benchmark quite a bit if we allowed dataset-specific settings (e.g., https://github.com/erikbern/ann-benchmarks/blob/master/algos.yaml#L545-L553 also looks fishy). For example, the dataset could be added as a third level of the hierarchy in https://github.com/erikbern/ann-benchmarks/blob/master/algos.yaml. The standard format would be something like float -> any -> any, with the option of saying float -> euclidean -> sift-128. This would only need a few lines in https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/algorithms/definitions.py#L105. I'm split here, because it opens the window to dataset-specific micro-optimizations. (Of course, this window was always open by just adding many run groups to the definition file, as above.)
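The three-level lookup suggested here (point type -> distance -> dataset, with "any" as a wildcard) could look roughly like the sketch below. The nested dictionary is a simplified stand-in, not the actual structure of algos.yaml, and `resolve_run_groups` is a hypothetical helper, not a function in definitions.py.

```python
from typing import Any, Dict, List

ANY = "any"  # wildcard key, as in "float -> any -> any"

# Simplified, assumed layout: point type -> distance -> dataset -> run groups.
Definitions = Dict[str, Dict[str, Dict[str, List[Any]]]]


def resolve_run_groups(
    definitions: Definitions, point_type: str, distance: str, dataset: str
) -> List[Any]:
    """Return the most specific run groups for a (distance, dataset) pair.

    Precedence: exact distance + exact dataset, exact distance + "any",
    "any" distance + exact dataset, then "any" + "any".
    """
    by_distance = definitions.get(point_type, {})
    for dist_key in (distance, ANY):
        by_dataset = by_distance.get(dist_key)
        if by_dataset is None:
            continue
        for ds_key in (dataset, ANY):
            if ds_key in by_dataset:
                return by_dataset[ds_key]
    return []  # nothing configured for this combination


example: Definitions = {
    "float": {
        "any": {"any": ["default-args"]},
        "euclidean": {"sift-128": ["sift-args"]},
    }
}
```

With this layout, a dataset without its own entry falls back to the "any" settings, so existing definitions would keep working unchanged.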
Hi @maumueller. Thanks for the input! Generally Elastiknn operates under the assumption that there is no explicit "fitting" or "building" phase. As with regular Elasticsearch documents, you can insert/update/delete vectors as you would any old Elasticsearch document. I'm not sure if a fitting/sampling step would solve this problem, since I still couldn't say "these sampled values are for dataset foo, these others are for dataset bar". I agree it's a tough call whether to allow dataset-specific parameters. I haven't surveyed all of the models, but I'd imagine there are at least a few others with some sensitivity to the values of the data. I think technically there's nothing wrong with having some of the containers run for two hours and then just time out. They don't blow up the whole run. However, I hate to waste @erikbern's money :).
I think just killing the container after 2h is fine. I might make it 1h actually. Going forward I'm planning to run on a high-RAM machine and run a lot of algorithms in parallel so that the benchmark doesn't run for several weeks.
I've been working on integrating elastiknn: https://github.com/alexklibisz/elastiknn
Getting close to making a PR, but I have a case where the good args for the one dataset are different than the good args for another, both with euclidean distance.
I guess I could just include them all, but that seems like a waste of compute resources since I know what's good for one dataset will work poorly for another.
Since ES is on the JVM, data is on disk, and every query is an HTTP request, it's quite a bit slower than the other in-memory C/C++ implementations.
Would hate to bottleneck your updates.
Here's a sneak peek for SIFT (using my own benchmarking plots):
So, is there some way I can set up algos.yaml to only use one set of args for dataset A and another set of args for dataset B?