
Use torch.searchsorted instead of our ad-hoc implementation #19

Open
jmarshrossney opened this issue Sep 16, 2020 · 2 comments
Labels
enhancement New feature or request

Comments

@jmarshrossney

Hi!

I compared the searchsorted function implemented here, which does torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1, with the C++/CUDA implementation at https://github.com/aliutkus/torchsearchsorted, and it appears to be a lot slower, at least on CPU.
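As a side note, the sum-based trick quoted above should be numerically interchangeable with the native op. A minimal sketch (my own, not from nflows; it assumes `torch.searchsorted` is available, i.e. PyTorch >= 1.6) checking that the two agree:

```python
import torch

def searchsorted_sum(bin_locations, inputs):
    # Ad-hoc version: count the bin edges that each input meets or exceeds.
    # Broadcasting materialises a (batch, num_bins) boolean tensor first.
    return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1

# Mirror the benchmark shapes: 5000 rows of 16 bins (17 edges each).
bin_locations = torch.linspace(0, 1, 17).expand(5000, -1).contiguous()
inputs = torch.rand(5000)

ours = searchsorted_sum(bin_locations, inputs)
# With right=True, torch.searchsorted returns the count of edges <= input,
# so subtracting 1 gives the same bin index as the sum-based version.
builtin = torch.searchsorted(
    bin_locations, inputs.unsqueeze(-1), right=True
).squeeze(-1) - 1
assert torch.equal(ours, builtin)
```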

I modified benchmark.py in torchsearchsorted and copy-pasted the function from nflows for comparison.
The output (all on CPU) was:

Benchmark searchsorted:
- a [5000 x 16]
- v [5000 x 1]
- reporting fastest time of 10 runs
- each run executes searchsorted 100 times

Numpy: 	0.9516626670001642
torchsearchsorted: 	0.009861100999842165
nflows: 	50.19729063499926

i.e. bucketizing 5000 inputs, each into its own set of 16 bins.
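For anyone who wants to reproduce this without the torchsearchsorted harness, here is a rough timing sketch of my own following the same protocol (fastest of 10 runs, 100 executions per run); it assumes PyTorch >= 1.6 for the native op, and the absolute numbers will of course differ by machine:

```python
import timeit
import torch

bin_locations = torch.linspace(0, 1, 17).expand(5000, -1).contiguous()
inputs = torch.rand(5000)

def run_sum():
    # Broadcast-and-sum version, as in nflows.
    return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1

def run_builtin():
    # Native op; right=True matches the sum-based semantics.
    return torch.searchsorted(
        bin_locations, inputs.unsqueeze(-1), right=True
    ).squeeze(-1) - 1

# Fastest of 10 runs, each run executing the function 100 times.
t_sum = min(timeit.repeat(run_sum, number=100, repeat=10))
t_builtin = min(timeit.repeat(run_builtin, number=100, repeat=10))
print(f"sum-based: {t_sum:.4f}s  torch.searchsorted: {t_builtin:.4f}s")
```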

Am I missing something here? If not, could the spline flows be sped up significantly by using torchsearchsorted or something similar?

Cheers.

@arturbekasov
Contributor

Hi Joe,

Indeed, our searchsorted implementation is slower than the CUDA implementation that you reference. We've done similar comparisons ourselves at some point. We've decided against using the CUDA implementation for a few reasons:

  1. Custom CUDA kernels can be a headache to use. They need to be compiled at runtime, which means the machine you're running on must have the right compiler and development headers.
  2. Given that both TensorFlow and NumPy have it, we were quite confident that searchsorted would eventually be merged into PyTorch. And indeed it has since been merged: see the related issue and docs.
  3. You're right that the difference in performance can be drastic when running searchsorted in isolation. However, bucketization is only one of many operations performed when running a spline flow. As a result, in end-to-end benchmarks we observed a ~30% speed-up when using the custom CUDA kernel. A noticeable improvement, but not an order-of-magnitude one.

Hope this makes sense. In terms of next steps: now that searchsorted is in PyTorch, a ~30% speed-up makes it well worth replacing our ad-hoc implementation with torch.searchsorted. My only concern is that we'd then depend on a very recent version of PyTorch, in fact the latest stable release (1.6). I don't know how big of a deal that would be.
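One way to sidestep the version concern would be a small wrapper that prefers the native op when present and falls back otherwise. A sketch of my own (the function name, the `eps` nudge on the top edge, and the feature-detection approach are all illustrative assumptions, not the nflows API):

```python
import torch

def searchsorted(bin_locations, inputs, eps=1e-6):
    """Drop-in sketch: use the native op on PyTorch >= 1.6, else fall back."""
    # Nudge the top edge so inputs sitting exactly on the upper boundary
    # land in the last bin (illustrative guard; exact value is an assumption).
    bin_locations = bin_locations.clone()
    bin_locations[..., -1] += eps
    if hasattr(torch, "searchsorted"):
        # right=True counts edges <= input, matching the sum-based semantics.
        return torch.searchsorted(
            bin_locations, inputs.unsqueeze(-1), right=True
        ).squeeze(-1) - 1
    # Fallback for older PyTorch: broadcast-and-sum.
    return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1

# Usage: 8 inputs, each with its own 16 bins (17 edges).
bins = torch.linspace(0, 1, 17).expand(8, -1).contiguous()
idx = searchsorted(bins, torch.rand(8))
```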

Thanks,

Artur

@arturbekasov
Contributor

For future reference: #9 has been merged, which also uses a feature that is only available in PyTorch 1.6 (non-persistent buffers).

@arturbekasov arturbekasov changed the title searchsorted not very efficient? Use torch.searchsorted instead of our ad-hoc implementation Oct 19, 2020
@arturbekasov arturbekasov added the enhancement New feature or request label Oct 16, 2022