Is there a way to specify different algo args based on the dataset? #178

Closed
alexklibisz opened this issue Sep 9, 2020 · 5 comments

Comments

@alexklibisz
Contributor

alexklibisz commented Sep 9, 2020

I've been working on integrating elastiknn: https://github.com/alexklibisz/elastiknn
I'm getting close to making a PR, but I have a case where the good args for one dataset are different from the good args for another, both with euclidean distance.
I guess I could just include them all, but that seems like a waste of compute resources since I know what's good for one dataset will work poorly for another.
Since ES is on the JVM, data is on disk, and every query is an HTTP request, it's quite a bit slower than the other in-memory C/C++ implementations.
Would hate to bottleneck your updates.

Here's a sneak peek for SIFT (using my own benchmarking plots):
[Image: bokeh_plot]

So, is there some way I can set up algos.yaml to use only one set of args for dataset A and another set of args for dataset B?

@erikbern
Owner

erikbern commented Sep 10, 2020

There's no way. I think it's fine to include both parameter sets, although if they are wildly different I'd be a bit nervous they are cherry-picked. It looks like there's a very large set of points on the scatter plot, so I'd recommend pruning it down to no more than 20-50 different parameters.

Very excited about including elastiknn!

@alexklibisz
Contributor Author

@erikbern I'm still stuck on this and want to propose a solution and get your feedback before PRing it.

To recap, basically there is one parameter (the LSH hashing width parameter w) which needs to be around 6 or 7 for good performance on the Fashion-MNIST dataset, and around 1 or 2 for the SIFT dataset. There's a pretty intuitive explanation, which I detailed here. If you set w=1 or w=2 for Fashion-MNIST, there's really no issue, the recall is just poor and you move on. If you set w to 6 or 7 for SIFT, each query matches 50-70% of the corpus as approximate candidates and Lucene takes an extremely long time to count up the top k approximate matches. So the run ends up timing out and wasting time/money.
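
To illustrate why a wider w pulls in more candidates, here is a minimal sketch of one hash function from the standard p-stable (Euclidean) LSH scheme, h(x) = floor((a . x + b) / w). This is an assumption about the general technique, not a copy of Elastiknn's implementation; the names `make_l2_hash` and `collision_fraction` are hypothetical. A larger w makes each bucket wider, so more corpus vectors share the query's bucket.

```python
import math
import random


def make_l2_hash(dim, w, seed=0):
    """Return h(x) = floor((a . x + b) / w), with Gaussian a and b drawn from [0, w]."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(x):
        # Project onto the random direction, shift by b, then bucket by width w.
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / w)

    return h


def collision_fraction(query, corpus, w, seed=0):
    """Fraction of corpus vectors that land in the same bucket as the query."""
    h = make_l2_hash(len(query), w, seed)
    target = h(query)
    return sum(1 for v in corpus if h(v) == target) / len(corpus)
```

Running `collision_fraction` over a sample of a dataset for a few values of w gives a rough feel for how quickly the candidate set grows on that dataset. A real index combines many such hashes, so the numbers are only indicative.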

I've thought about ways to add some early stopping heuristic within Elastiknn, but hate the idea of introducing magic numbers when, IMO, the real solution is to understand the distribution of your data and how it affects parameter choice. I've also documented good parameters here.

So, my proposed solution: in the Elastiknn "algorithm" class in this repo, I'll monitor response times for queries. If after 100 queries the mean response time is > 100ms, I just sys.exit(0). It's a bit hacky but seems to be the most reasonable compromise. LMK your thoughts when you get a chance.
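
The proposal above could be sketched roughly as follows. This is an illustration only: `LatencyGuard` and `timed_query` are hypothetical names, and `run_query` stands in for whatever function actually sends the query. The 100-query and 100ms thresholds come from the comment above.

```python
import sys
import time


class LatencyGuard:
    """Tracks query latencies and flags a run whose mean latency is too high."""

    def __init__(self, min_queries=100, max_mean_seconds=0.1):
        self.min_queries = min_queries
        self.max_mean_seconds = max_mean_seconds
        self.count = 0
        self.total_seconds = 0.0

    def record(self, seconds):
        self.count += 1
        self.total_seconds += seconds

    def should_abort(self):
        # Only judge the run once enough queries have been observed.
        if self.count < self.min_queries:
            return False
        return self.total_seconds / self.count > self.max_mean_seconds


def timed_query(guard, run_query, vector, k):
    """Run one query, record its latency, and exit if the run is too slow."""
    start = time.perf_counter()
    result = run_query(vector, k)
    guard.record(time.perf_counter() - start)
    if guard.should_abort():
        sys.exit(0)
    return result
```

Note that this version keeps checking the running mean on every query after the first 100, rather than checking exactly once at query 100.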

@maumueller
Collaborator

@alexklibisz I'm wondering whether you could figure out w by sampling points during index building and basing w on the observed distances between points in the sample? There is also always the option to figure out which dataset you are currently running on by looking at the first coordinates of the first vector in the array provided in fit. (This is of course very hacky, but it seems that this is basically what you want to know.)
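
A minimal sketch of the sampling idea, assuming the vectors are available as a list of equal-length sequences. The function names are hypothetical, and how to turn the resulting statistic into a value of w is not specified in this thread, so that step is deliberately left out.

```python
import math
import random
import statistics


def sampled_pairwise_distances(vectors, num_pairs=1000, seed=0):
    """Euclidean distances between randomly chosen pairs of distinct vectors."""
    rng = random.Random(seed)
    distances = []
    for _ in range(num_pairs):
        # Pick two different indices so a vector is never compared with itself.
        a, b = rng.sample(range(len(vectors)), 2)
        distances.append(math.dist(vectors[a], vectors[b]))
    return distances


def median_sampled_distance(vectors, num_pairs=1000, seed=0):
    """Median of the sampled distances, as a rough scale for the dataset."""
    return statistics.median(sampled_pairwise_distances(vectors, num_pairs, seed))
```

Comparing this statistic across datasets would at least show whether their distance scales differ, which is the property that makes one w work for one dataset and not for another.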

In general, it might improve the time to run the full benchmark quite a bit if we were to allow specifying dataset-specific settings (e.g., https://github.com/erikbern/ann-benchmarks/blob/master/algos.yaml#L545-L553 also looks fishy). For example, the dataset could be added as a third level of the hierarchy in https://github.com/erikbern/ann-benchmarks/blob/master/algos.yaml. The standard format would be something like float -> any -> any, with the option of saying float -> euclidean -> sift-128. It would take only a few lines in https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/algorithms/definitions.py#L105.

I'm split here, because it opens up the window for micro-optimizations to the dataset. (Of course, this window was always open by just adding many run groups to the definition file, as above.)
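
The three-level lookup described above (point type -> metric -> dataset, with "any" as the fallback) might look roughly like this. It is a sketch under assumptions: `resolve_run_groups` is a hypothetical helper, not the existing code in definitions.py, and the nested dict stands in for a parsed algos.yaml with illustrative keys and values.

```python
# Hypothetical parsed definitions; dataset keys and arg values are illustrative only.
EXAMPLE_DEFINITIONS = {
    "float": {
        "euclidean": {
            "sift-128": {"w": [1, 2]},
            "fashion-mnist-784": {"w": [6, 7]},
            "any": {"w": [1, 2, 4, 7]},
        },
        "any": {"any": {"w": [1, 2, 3, 4, 5, 6, 7]}},
    }
}


def resolve_run_groups(definitions, point_type, metric, dataset):
    """Pick the most specific entry, falling back to "any" at each level."""
    by_metric = definitions.get(point_type, {})
    # Prefer the exact metric, then the generic "any" metric.
    for metric_key in (metric, "any"):
        by_dataset = by_metric.get(metric_key)
        if by_dataset is None:
            continue
        # Within a metric, prefer the exact dataset, then "any".
        for dataset_key in (dataset, "any"):
            if dataset_key in by_dataset:
                return by_dataset[dataset_key]
    return None
```

With this layout, `resolve_run_groups(EXAMPLE_DEFINITIONS, "float", "euclidean", "sift-128")` picks the SIFT-specific entry, while a dataset with no entry of its own falls back to the metric's "any" entry.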

@alexklibisz
Contributor Author

Hi @maumueller. Thanks for the input! Generally Elastiknn operates under the assumption that there is no explicit "fitting" or "building" phase. As with regular Elasticsearch documents, you can insert/update/delete vectors as you would any old Elasticsearch document. I'm not sure if a fitting/sampling step would solve this problem, since I still couldn't say "these sampled values are for dataset foo, these others are for dataset bar".

I agree it's a tough call whether to allow dataset-specific parameters. I haven't surveyed all of the models, but I'd imagine there are at least a few others with some sensitivity to the values of the data.

I think technically there's nothing wrong with having some of the containers run for two hours and then just time out. They don't blow up the whole run. However, I hate to waste @erikbern's money :).

@erikbern
Owner

I think just killing the container after 2h is fine. I might make it 1h actually. Going forward I'm planning to run on a high-RAM machine and run a lot of algorithms in parallel so that it doesn't run for several weeks.
