
Efficient way to integrate lossyless into a PyTorch Dataset subclass #40

lbhm opened this issue Jul 26, 2021 · 5 comments

lbhm commented Jul 26, 2021

Hey @YannDubs,

I recently discovered your paper and find the idea very interesting. Therefore, I would like to integrate lossyless into a project I am currently working on. However, there are two requirements in my project that, as far as I understand, your compressor on PyTorch Hub does not cover:

  • I assume that the training data do not fit into memory, so I cannot decompress the entire dataset at once.
  • Because I cannot load all the data into memory and shuffle them there, I need access to individual samples of the dataset (for random permutations) without touching the rest of the data, or as little of it as possible.

Basically, I would like to integrate lossyless into a subclass of PyTorch's Dataset that implements the `__getitem__(index)` interface. Before I start experimenting on my own and potentially overlook something that you already thought about, I wanted to ask whether you have already considered approaches for integrating your idea into a PyTorch Dataset.
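
For concreteness, here is a rough sketch of the kind of Dataset subclass I have in mind. The one-file-per-sample layout and the `decompress([bytes])` call are my assumptions, not necessarily your actual API:

```python
import os

from torch.utils.data import Dataset


class LossylessDataset(Dataset):
    """Sketch: one compressed file per sample, decompressed on access."""

    def __init__(self, root, compressor, labels):
        self.root = root                      # directory with one file per sample
        self.compressor = compressor          # assumed lossyless compressor object
        self.labels = labels                  # targets, aligned with the sorted file list
        self.files = sorted(os.listdir(root))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # Read only the bytes belonging to the requested sample.
        with open(os.path.join(self.root, self.files[index]), "rb") as f:
            byte_string = f.read()
        # Assumed interface: decompress takes a list of byte strings.
        z = self.compressor.decompress([byte_string])
        return z, self.labels[index]
```

Shuffling would then only touch the file index, not the data itself.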

Looking forward to a discussion!


YannDubs commented Aug 8, 2021

Hey Lennart,

The compression function was simply meant to show how to use the model. As you can see in the code, it's extremely simple to compress a single batch at a time (see `Z_bytes += self.compress(x.to(self.device).half())`), and decompressing a single batch is also very simple (see `Z_hat[i] = self.decompress([s]).cpu().numpy()`).

The only changes I see that should be made are:

1/ Using a batch size of 1 when compressing (that will make it slower but is a simple way to ensure that you can perform permutations), i.e., here:

`kwargs_dataloader=dict(batch_size=128, num_workers=16),`

2/ Saving (and loading) each compressed image separately rather than all at once, i.e., the lines under `# save representations` should go inside the for loop (see the sketch below).
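
Roughly something like the following (untested sketch; the `out_dir` argument and the assumption that `compress` returns one byte string per image in the batch are placeholders on my side):

```python
import os

import torch


def compress_to_files(compressor, dataloader, out_dir, device="cuda"):
    """Sketch: compress batch by batch but write one file per image."""
    os.makedirs(out_dir, exist_ok=True)
    idx = 0
    with torch.no_grad():
        for x, _ in dataloader:
            # Assumed: compress() returns one byte string per image in the batch.
            byte_strings = compressor.compress(x.to(device).half())
            for s in byte_strings:
                with open(os.path.join(out_dir, f"{idx:08d}.bin"), "wb") as f:
                    f.write(s)
                idx += 1
```

Loading is then just the reverse: read one file and call `decompress` on its byte string.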

Neither of those points is very complex if you want to give it a try.

I'm quite busy right now, but I might be able to do it in the next 2-3 weeks if you haven't done it by then :)


lbhm commented Aug 12, 2021

Hi Yann,

Thank you for your reply and for pointing out the relevant lines of code. I'll see how I can best integrate lossyless into my code.

I already tried something similar with a few models from compressai, and the execution time of my data loader was unfortunately rather subpar, given that the decoder NN can only process a single image, or at most one mini-batch of images, in parallel. Would you say it is generally possible to achieve the same decompression speed as "classic" codecs like JPEG with neural decoders? Even if I decode an entire dataset at once, the execution time of the neural codecs I have tried so far still lags behind classic codecs.

YannDubs commented Aug 12, 2021

Yes, unfortunately removing batch compression will make the compressor very slow. But it's actually not really needed: you can compress in batches and still save images separately.

Concerning the decompressor: right now compressai doesn't allow batch decompression, which is why decompressing the entire dataset at once is so slow, i.e., it doesn't actually take advantage of batches. I don't think this is an issue with lossyless though, but simply the compressai implementation. In theory lossyless could even be quicker at decompression than standard codecs as it doesn't require reconstructing the image.

The simplest way to make the decoder quicker for now is to at least parallelise decompression of each image.
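
Something along those lines, for example (untested sketch; the torch.hub entrypoint name and the one-file-per-image layout from above are assumptions):

```python
import glob
import multiprocessing as mp

import torch

_compressor = None  # one compressor instance per worker process


def _init_worker():
    global _compressor
    # Assumed hub entrypoint name; check the README for the actual one.
    _compressor, _ = torch.hub.load("YannDubs/lossyless:main", "clip_compressor_b01")
    _compressor.eval()


def _decompress_file(path):
    with open(path, "rb") as f:
        byte_string = f.read()
    with torch.no_grad():
        # Same assumed interface as above: decompress a list of byte strings.
        return _compressor.decompress([byte_string]).cpu().numpy()


def decompress_all(in_dir, n_workers=8):
    files = sorted(glob.glob(f"{in_dir}/*.bin"))
    with mp.Pool(n_workers, initializer=_init_worker) as pool:
        return pool.map(_decompress_file, files)
```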


lbhm commented Aug 12, 2021

I see. Maybe converting a lossyless model into a TensorRT engine could also help improve decompression speed. I haven't worked with TensorRT before, though, so I'm not sure whether there is a straightforward way to integrate a TensorRT engine into a training pipeline.

I'll be quite busy throughout the next week, but afterwards I'll see if there is anything I can do to achieve decent/better decompression speed without having to mess with the entire underlying software stack.


lbhm commented Sep 6, 2021

Hi Yann,

Over the last few days, I had a chance to work on this topic again and experimented with a few ways to achieve good encoding/decoding performance. I couldn't test my prototypes on server-grade hardware yet, but I'd like to share some of the insights I got from running tests on my laptop today. I ran the tests on a small ImageNet subsample with 900 images and a total disk size of 2 MB.

Encoding the dataset with lossyless (beta 0.1) takes about 40 seconds with my MX330 laptop GPU. Using WebP and Python multiprocessing, my CPU processes the same dataset in less than a second. Do these numbers sound reasonable to you? I can post a snippet of my code if you think that lossyless encoding should not be that far off WebP.
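
For reference, the WebP baseline is essentially the following (directory names and the quality setting are placeholders):

```python
from multiprocessing import Pool
from pathlib import Path

from PIL import Image

SRC = Path("imagenet_subset")  # placeholder input directory
DST = Path("webp_out")         # placeholder output directory


def to_webp(path):
    # Re-encode one image as WebP; quality=90 is an arbitrary placeholder.
    img = Image.open(path).convert("RGB")
    img.save(DST / (path.stem + ".webp"), format="WEBP", quality=90)


if __name__ == "__main__":
    DST.mkdir(exist_ok=True)
    with Pool() as pool:
        pool.map(to_webp, sorted(SRC.glob("*.JPEG")))
```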

Decoding the dataset using a PyTorch data loader with multiple workers takes about 25 s on my laptop, whereas WebP decoding takes less than 0.1 s. One problem with the PyTorch data loader (afaik) is that the separate worker processes do not use multi-threading due to the Python GIL, which makes executing the model rather slow.
Therefore, I also implemented a prototype that uses Python's multiprocessing library instead of a PyTorch data loader. In this scenario, decoding takes about 4-9 seconds. However, it came as a surprise to me that basically the entire time is spent on loading the model (4-8 s), while decoding itself is pretty quick (but still slower than WebP). Loading other models from CompressAI takes less than 0.1 s though. Does it make sense that loading a lossyless model takes so much longer than loading other learned compression models?
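
For context, this is roughly how I separate model loading from decoding time in the multiprocessing prototype (the hub entrypoint name is a placeholder, and `0000.bin` stands for one compressed sample produced during encoding):

```python
import time

import torch

t0 = time.perf_counter()
# Placeholder load call; the actual lossyless hub entrypoint may differ.
compressor, transform = torch.hub.load("YannDubs/lossyless:main", "clip_compressor_b01")
print(f"model load: {time.perf_counter() - t0:.2f}s")

with open("0000.bin", "rb") as f:  # one compressed sample from the encoding step
    byte_string = f.read()

t0 = time.perf_counter()
with torch.no_grad():
    z = compressor.decompress([byte_string])
print(f"decode one sample: {time.perf_counter() - t0:.4f}s")
```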
