-
Notifications
You must be signed in to change notification settings - Fork 625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resizing images is slow #2223
Comments
It's tempting to just use The drawback is that there seems to be some |
Would it make sense to sacrifice a bit of perf and make the After a quick look it seems like it could still yield decent perf improvements here. |
Yes, that sounds like a very good idea! |
I might do performance rescaling without Here I already have scaling with performace close to fast-image-resize so it is not so hard to made a new one scalar implementation without unsafety. |
Yes, having fast and zero-unsafe resizing would be very beneficial. Then For integer types you should be able to get close to the performance of explicit SIMD using autovectorization. |
This type of convolution is very unlikely to be correctly auto-vectorized especially for very common I would say that when rely on auto-vectorization this convolution of RGBA f32 image performance might be exact or even better sometimes on Ok, I think I'll do so. When question probably I want to ask before: are you expecting this using |
I think it's best to stick to a single-threaded implementation for now. The scalar fallback path in |
So, here some intermediate results with Seems to be good for a safety price. Test were made on laptop so about 10-15% moving performance timing should be considered. Pic scale RGB: Lanczos 3 Fast image resize RGB: Lanczos 3 Pic scale RGBA: Lanczos 3 Fast image resize RGBA: Lanczos 3 Also tests with optional Pic scale RGB: Lanczos 3 Fast image resize RGB: Lanczos 3 Pic scale RGBA: Lanczos 3 Fast image resize RGBA: Lanczos 3 |
Also I'm compiling on |
That looks like a great improvement! As for bounds checks, I've written a very detailed article about avoiding them without unsafe code: |
Yup, I've read that, thanks. Perhaps, I know everything of this :) At the moment state, in single-threading mode on Pic scale RGBA: Lanczos 3
time: [35.462 ms 35.788 ms 36.122 ms]
Fast image resize RGBA: Lanczos 3
time: [49.988 ms 50.339 ms 50.852 ms]
Pic scale RGB: Lanczos 3
time: [46.607 ms 46.942 ms 47.275 ms]
Fast image resize RGB: Lanczos 3
time: [44.320 ms 44.941 ms 45.676 ms]
Pic scale LA: Lanczos 3 time: [37.117 ms 37.350 ms 37.595 ms]
Fast image resize LA: Lanczos 3
time: [43.058 ms 43.459 ms 43.912 ms]
Pic scale Plane: Lanczos 3
time: [26.315 ms 26.593 ms 26.892 ms]
Fast image resize Plane: Lanczos 3
time: [19.720 ms 19.928 ms 20.137 ms] Planar 8-bit images are little suffer, perhaps I will stop on this because this is not very common. |
So, here is final results. Surprusingly, I didn't expect that can be so good sometimes with Tests on I'll publish soon final crate to Here and here still a few bottlenecks with bounds checks, but at the moments this is the best one I have achieved. Perhaps, if there will be some ideas how to completely avoid these bounds checks this will add up to ~10-15% also. If you wish to see there something specific, let me know. |
Package is at Hope, that helps. |
I've tried out @awxkee I could not find a way to scale a grayscale+alpha u16 image, which is one of the formats that the For RGB and RGBA images I've tried on the thumbnailing workload (Nearest to 5x the target size, then Lancsoz3 down to the requested size) the difference in end-to-end benchmarks is within 1% compared to Also the boilerplate I had to write to wire things up was non-intuitive and quite unpleasant, so I'd really like to get this into |
Glad to hear it helps!
Will be done.
I'll think about that to make this convenient for everyone. I often do so because most of users works in 1 or 2 pixel formats and thats all, without sometimes crazy codegen, and compilation time for all formats at once. However, yes, for whom interested in all at one times this is not so good.
Are you intentionally do not using linear or perceptual colorspace? For scaling, blurring, and other almost any convolution ops except maybe edge detection, it is quite preferred since after scaling you completely erasing high intensities, and make dark tones even darker, and as much you downscale, as much blurring are receiving distortion grows. There is also artice in imagemagick how that works. |
Planar16 has been added in |
Also, please, note: scaling as general rendering op as blur and others, requires premultiplied alpha, since Premultiplication is not really cheap also, so it might be reasonable to do fast check if premultiplication is necessary: alpha is not constant or do not equal to a bit-depth maximum Here is the method, or I can integrate it into the library: fn has_non_constant_cap_alpha<
V: Copy + PartialEq + BitXor<V, Output = V> + AddAssign + 'static,
const ALPHA_CHANNEL_INDEX: usize,
const CHANNELS: usize,
>(
store: &[V],
width: usize,
bit_depth: u32,
) -> bool
where
i32: AsPrimitive<V>, u32: AsPrimitive<V>,
{
assert!(ALPHA_CHANNEL_INDEX < CHANNELS);
assert!(CHANNELS <= 4);
if store.is_empty() {
return false;
}
let max = ((1 << bit_depth) - 1).as_();
if max != store[ALPHA_CHANNEL_INDEX] {
return true;
}
let mut row_sums: V = 0.as_();
for row in store.chunks_exact(width * CHANNELS) {
for color in row.chunks_exact(CHANNELS) {
row_sums += color[ALPHA_CHANNEL_INDEX].bitxor(max);
}
if row_sums != 0.as_() {
return true
}
}
let zeros = 0.as_();
row_sums.ne(&zeros)
} |
Here in wrapper premultiplication is made, however, the check above are not. |
Than you!
Yes. In As for alpha pre-multiplication for other image ops - yes, we do have issues with blur due to non-premultiplied alpha. That is something that we would very much like to fix in I see that |
Premultiplied alpha adds its own challenges. If you want to export as PNG or whatever then you'll need to undo the premultiplication before saving. But unfortunately premultiplied alpha it isn't fully reversible: The color data is lost for fully transparent areas, and precision is sharply degraded in the nearly transparent areas. |
That is a good point. And that is why I think premultiplied alpha should not be the default. But both blurring and resizing already alter pixel values significantly, so I don't think loss of precision in nearly transparent areas would be a problem for these operations. I guess could store alpha-premultiplied images in a higher precision to avoid precision loss? For example, we could expand u8 pixels to u16 simply by bit-shifting them (equivalent to multiplying by 256) and then dividing by the originally u8 alpha channel would still keep precision nearly intact. And I guess u16 pixels would have to be upsampled to f32. But that would require 2x more memory and would be quite slow due to the sheer amount of memory we need to read and write, so I don't think it's worth it. |
This will at least double processing cost. If you want to save precision at least at reasonable amount you have to go with |
I think 99.99% of users just want to get correctly blurred/scaled image instead of saved precision of pixel with intensity |
The cases I worry about are resizing solid color images and either having the pixel values in the output image slightly off, or having a fully transparent regions getting their red/green/blue channels all set to zero |
For solid images with constant alpha pre-multiplication can be avoided by method above by checking if alpha is constant and max. This is an easy part. |
I mean, the color of fully transparent regions doesn't matter since they are fully transparent, and preserving it only causes bloat in the final PNG file. That's why e.g. GIMP usually preprocesses the file to strip that color when saving as PNG. For solid color images with no alpha variation, the proposed check on whether premultiplication is needed would indeed take care of it. And for an image with solid color an a gradient in the alpha channel, the precision loss would be exactly identical to what you'd get when blending it anyway. I agree that the loss of precision is unfortunate, and it seems that there is no solution that's both entirely correct and efficient, only trade-offs. But I think it should be less bad than the color bleed artifacts I'm getting when resizing the test image from #2324. That's also the option both imagemagick and GIMP picked. |
For the second case it seems you'll need to keep 2 versions of image if want to preserve it. Not really clear what to do next with that. Perhaps if you'll want to store original rescaled pixels will have to transfer alpha channel from one to another, but this will be almost everytime incorrect. |
I've lost track of what is the actual proposal at this point. Is the idea that |
I'm just going to wire it up to |
Okay, I've wired up pre-multiplication to For comparison, the version using You can check them out and run tests on them. How it worksIf the image has an alpha channel:
What this fixesI tested this image that contains only white and transparency: https://commons.wikimedia.org/wiki/File:Transparent_Clouds_Map_Without_Asia.png Command used for testing: If resized without pre-multiplication, the black from fully transparent pixels bleeds into the clouds, and darkens the image. This is clearly incorrect. View this image on a white background to see that it now darkens it: And this is the correct result when using premultiplied alpha: The behavior with premultiplied alpha matches both imagemagick and GIMP. You can also clearly see color bleed on my synthetic testcase from #2324 What this doesn't breakHere's a test image filled with color Thanks to the heuristic for not premultiplying by alpha when alpha is constant, this image survives completely unscathed and the exact color information is preserved after resizing: What's left to doThe current check for constant alpha is slow. Here's a faster version, but I couldn't get the generics to work out because Code that fails to compile#[must_use]
fn has_constant_alpha<P, Container>(img: &ImageBuffer<P, Container>) -> bool
where
P: Pixel + 'static,
Container: std::ops::Deref<Target = [P::Subpixel]>,
{
let first_pixel_alpha = match img.pixels().next() {
Some(pixel) => pixel.channels().last().unwrap(), // there doesn't seem to be a better way to retrieve the alpha channel :(
None => return true, // empty input image
};
// A naive, slower implementation that branches on every pixel:
//
// img.pixels().map(|pixel| pixel.channels().last().unwrap()).all(|alpha| alpha == first_pixel_alpha);
//
// Instead of doing that we scan every row first with cheap arithmetic instructions
// and only compare the sum of divergences on every row, which should be 0
let mut sum_of_diffs = Default::default(); // primitive numbers default to 0
for row in img.rows() {
row.for_each(|pixel| {
let alpha = pixel.channels().last().unwrap(); // there doesn't seem to be a better way to retrieve the alpha channel :(
sum_of_diffs += alpha.bitxor(first_pixel_alpha);
});
if sum_of_diffs != Default::default() {
return false
}
}
true
} I guess I'll just have to rewrite it in terms of a Where do we go from hereOnce I optimize the alpha check I'd like to upstream this resizing method into I do not want to expose pre-multiplication through the public API. Combined with the check we do for the same alpha in the entire image, exposing these details would make the API quite inwieldy. The drawback is that if someone wants to downscale an image and then immediately blur it, we would do some unnecessary work, but that's the price I'm willing to pay for a much saner API. The open questions about switching to this resize method are:
|
Sure, I almost did that for image crate.
If bitxor won't be added as trait better do always premultiplication because medium tier and low tier, and often even high tier,
Sure, this is my mistake, accumulator should be Something like that for color in row.chunks_exact(CHANNELS) {
row_sums += (color[ALPHA_CHANNEL_INDEX] - max).abs();
}
if row_sums > f32::EPSILON {
return true;
} Or check by ULP's let ones_bits = 1f32.to_bits();
for color in row.chunks_exact(CHANNELS) {
row_ulp += (color[ALPHA_CHANNEL_INDEX].to_bits() as i64 - ones_bits as i64).abs() as u32;
}
// Some noticeable ULP, for ex. 500
if row_ulp > 500 {
return true;
}
Without a @fintelia I can also provide safe replacement for |
Also premultiplication might be completely avoided for |
Good suggestion, thanks! I've applied it in I also got the optimized alpha check to compile by consulting an LLM and using some cursed generic bounds. So as of Shnatsel/wondermagick@05d6916 I believe the implementation is complete, and the performance is representative of what it would be like if integrated into |
As I'm helping with resizing I decided theck this one. So here is definetly Box filter in class named resizeAreaFast_Invoker is clearly that it averages pixel over the area for downscaling. Here is the statement where thet saying that And here they are computing coefficients for So if it makes sense to solve for you I can reproduce this behavior in the lib. |
The image resizing routines in
image
seem to be really slow compared to the competition.For example,
fast_image_resize
crate benchmarks show thatimage
is 5x slower thanlibvips
andfast_image_resize
's fallback implementation, and 20x slower thanfast_image_resize
's AVX2 implementation. It also does runtime CPU feature detection.I have run the RGB benchmarks locally against
image
v0.25.1 to reproduce the results, and I can confirm that they are accurate on my machine - the performance gap is real:The text was updated successfully, but these errors were encountered: