
Use torch.searchsorted instead of our ad-hoc implementation #19

Open
jmarshrossney opened this issue Sep 16, 2020 · 2 comments
Labels
enhancement New feature or request

Comments

@jmarshrossney

Hi!

I compared the searchsorted function implemented here, which does torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1, with the C++/CUDA implementation at https://github.com/aliutkus/torchsearchsorted, and it appears to be a lot slower, at least on CPU.
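As a side note, the sum-based trick quoted above should be numerically interchangeable with the native op. A minimal sketch (my own, not from nflows; it assumes `torch.searchsorted` is available, i.e. PyTorch >= 1.6) checking that the two agree:

```python
import torch

def searchsorted_sum(bin_locations, inputs):
    # Ad-hoc version: count the bin edges that each input meets or exceeds.
    # Broadcasting materialises a (batch, num_bins) boolean tensor first.
    return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1

# Mirror the benchmark shapes: 5000 rows of 16 bins (17 edges each).
bin_locations = torch.linspace(0, 1, 17).expand(5000, -1).contiguous()
inputs = torch.rand(5000)

ours = searchsorted_sum(bin_locations, inputs)
# With right=True, torch.searchsorted returns the count of edges <= input,
# so subtracting 1 gives the same bin index as the sum-based version.
builtin = torch.searchsorted(
    bin_locations, inputs.unsqueeze(-1), right=True
).squeeze(-1) - 1
assert torch.equal(ours, builtin)
```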

I modified benchmark.py in torchsearchsorted and copy-pasted the function from nflows for comparison.
The output (all on CPU) was:

Benchmark searchsorted:
- a [5000 x 16]
- v [5000 x 1]
- reporting fastest time of 10 runs
- each run executes searchsorted 100 times

Numpy: 	0.9516626670001642
torchsearchsorted: 	0.009861100999842165
nflows: 	50.19729063499926

i.e. bucketizing 5000 inputs, each into its own set of 16 bins.
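For anyone who wants to reproduce this without the torchsearchsorted harness, here is a rough timing sketch of my own following the same protocol (fastest of 10 runs, 100 executions per run); it assumes PyTorch >= 1.6 for the native op, and the absolute numbers will of course differ by machine:

```python
import timeit
import torch

bin_locations = torch.linspace(0, 1, 17).expand(5000, -1).contiguous()
inputs = torch.rand(5000)

def run_sum():
    # Broadcast-and-sum version, as in nflows.
    return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1

def run_builtin():
    # Native op; right=True matches the sum-based semantics.
    return torch.searchsorted(
        bin_locations, inputs.unsqueeze(-1), right=True
    ).squeeze(-1) - 1

# Fastest of 10 runs, each run executing the function 100 times.
t_sum = min(timeit.repeat(run_sum, number=100, repeat=10))
t_builtin = min(timeit.repeat(run_builtin, number=100, repeat=10))
print(f"sum-based: {t_sum:.4f}s  torch.searchsorted: {t_builtin:.4f}s")
```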

Am I missing something here? If not, could the spline flows be sped up significantly by using torchsearchsorted or something similar?

Cheers.

@arturbekasov
Contributor

Hi Joe,

Indeed, our searchsorted implementation is slower than the CUDA implementation that you reference. We've done similar comparisons ourselves at some point. We've decided against using the CUDA implementation for a few reasons:

  1. Custom CUDA kernels can be a headache to use. They need to be compiled at runtime, which means the machine you're running on must have the right compiler and development headers.
  2. Given that both TensorFlow and NumPy have it, we were quite confident that searchsorted would eventually be merged into PyTorch. And indeed it has since been merged: see the related issue and docs.
  3. You're right that the difference in performance can be drastic when running searchsorted in isolation. However, bucketization is only one of many operations performed when running a spline flow. As a result, in end-to-end benchmarks we observed a ~30% speed-up when using the custom CUDA kernel. A noticeable improvement, but not an order-of-magnitude one.

Hope this makes sense. In terms of next steps: now that searchsorted is in PyTorch, a ~30% speed-up makes it well worth replacing our ad-hoc implementation with torch.searchsorted. My only concern is that we'd then depend on a very recent version of PyTorch, in fact the latest stable release (1.6). I don't know how big of a deal that would be.
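One way to sidestep the version concern would be a small wrapper that prefers the native op when present and falls back otherwise. A sketch of my own (the function name, the `eps` nudge on the top edge, and the feature-detection approach are all illustrative assumptions, not the nflows API):

```python
import torch

def searchsorted(bin_locations, inputs, eps=1e-6):
    """Drop-in sketch: use the native op on PyTorch >= 1.6, else fall back."""
    # Nudge the top edge so inputs sitting exactly on the upper boundary
    # land in the last bin (illustrative guard; exact value is an assumption).
    bin_locations = bin_locations.clone()
    bin_locations[..., -1] += eps
    if hasattr(torch, "searchsorted"):
        # right=True counts edges <= input, matching the sum-based semantics.
        return torch.searchsorted(
            bin_locations, inputs.unsqueeze(-1), right=True
        ).squeeze(-1) - 1
    # Fallback for older PyTorch: broadcast-and-sum.
    return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1

# Usage: 8 inputs, each with its own 16 bins (17 edges).
bins = torch.linspace(0, 1, 17).expand(8, -1).contiguous()
idx = searchsorted(bins, torch.rand(8))
```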

Thanks,

Artur

@arturbekasov
Contributor

For future reference: #9 has been merged, which also uses a feature that is only available in PyTorch 1.6 (non-persistent buffers).

@arturbekasov arturbekasov changed the title searchsorted not very efficient? Use torch.searchsorted instead of our ad-hoc implementation Oct 19, 2020
@arturbekasov arturbekasov added the enhancement New feature or request label Oct 16, 2022