Improve speed of ZSTD_compressSequencesAndLiterals() using AVX2 #4232
Conversation
@@ -7103,15 +7103,214 @@ size_t ZSTD_compressSequences(ZSTD_CCtx* cctx,
    return cSize;
}

#if defined(__AVX2__)
We should add a constant ZSTD_ARCH_X86_AVX2 to compiler.h here, and make sure we respect the ZSTD_NO_INTRINSICS macro.

Lines 225 to 239 in a610550:
/* compile time determination of SIMD support */
#if !defined(ZSTD_NO_INTRINSICS)
#  if defined(__SSE2__) || defined(_M_AMD64) || (defined (_M_IX86) && defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
#    define ZSTD_ARCH_X86_SSE2
#  endif
#  if defined(__ARM_NEON) || defined(_M_ARM64)
#    define ZSTD_ARCH_ARM_NEON
#  endif
#
#  if defined(ZSTD_ARCH_X86_SSE2)
#    include <emmintrin.h>
#  elif defined(ZSTD_ARCH_ARM_NEON)
#    include <arm_neon.h>
#  endif
#endif
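For instance, the detection block above might be extended along these lines (a sketch only; the exact placement and any extra compiler-version guards are up to the maintainers):

```
/* sketch: AVX2 detection gated on ZSTD_NO_INTRINSICS,
 * mirroring the existing SSE2/NEON pattern */
#if !defined(ZSTD_NO_INTRINSICS)
#  if defined(__AVX2__)
#    define ZSTD_ARCH_X86_AVX2
#  endif
#  if defined(ZSTD_ARCH_X86_AVX2)
#    include <immintrin.h>
#  endif
#endif
```

Code that currently tests `__AVX2__` directly would then test `ZSTD_ARCH_X86_AVX2` instead, so `ZSTD_NO_INTRINSICS` disables the vector path in one place.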
lib/compress/zstd_compress.c (outdated):
if (!repcodeResolution) {
    offBase = OFFSET_TO_OFFBASE(inSeqs[seqNb].offset);
} else {
This is impossible
lib/compress/zstd_compress.c (outdated):
    size_t blockSize;
    size_t litSize;
} BlockSummary;

#if defined(__AVX2__)
Use ZSTD_ARCH_X86_AVX2
lib/compress/zstd_compress.c (outdated):
#if defined(__GNUC__)
#  define ALIGNED32 __attribute__((aligned(32)))
#elif defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) /* C11 */
#  define ALIGNED32 alignas(32)
#else
/* this compiler will require its own alignment instruction */
#  define ALIGNED32
#endif
I know there are other places we align. Should we unify this into a common macro in compiler.h?
lib/compress/zstd_compress.c (outdated):
DEBUGLOG(5, "long literals length detected at pos %zu", longl-nbSequences);
assert(longl <= 2* (nbSequences-1));
cctx->seqStore.longLengthType = ZSTD_llt_literalLength;
cctx->seqStore.longLengthPos = (U32)(longl-nbSequences);
nit: This is worth a comment, because there seems to be +1 and -1 cancelling out here, which is confusing.
Restored the full equation, for clarity.
ZSTD_STATIC_ASSERT(offsetof(SeqDef, mlBase) == 6);

/* Process 2 sequences per loop iteration */
for (; i + 1 < nbSequences; i += 2) {
Not sure how much time you want to spend on this, but it would likely be a bit faster to unroll once more and process 4 sequences per loop. With 4 sequences per iteration you can do half as many cross-lane shuffles: keep the first 2 sequences as-is, put the second 2 sequences in the top half of each lane, blend them together with _mm256_blend_epi32(), and then you only need a single _mm256_permute4x64_epi64() for 4 sequences.
This is better kept as a potential optimization for a future iteration. At this stage, we don't know yet if AVX2 will be useful in kernel mode.
Lots of CI issues suddenly. If I were to guess, the CI VM probably just got silently updated.

Commit notes:
- needs to take care of long lengths > 65535
- compressSequencesAndLiterals: fixed long lengths in scalar mode
- that were automatically added by the editor without notification
- the branch is not in the hot loop
- since it depends on a specific definition of the ZSTD_Sequence structure
- do not solve the equation: even though some members cancel each other out, the full form is kept for clarity; we'll let the compiler do the resolution at compile time
All comments addressed.
This PR improves the speed of ZSTD_compressSequencesAndLiterals(), especially when compiled with AVX2 support.

For illustration, here are some benchmark numbers, on an i7-9700K @ 3.6 GHz (turbo off), using enwik5 (100 KB) and the sequences produced by level 5, which is an unfavorable scenario because it produces many smaller sequences.

Scenarios measured, each on enwik5 at level 5, with gcc v13.3 and clang v18.1 (numeric results not preserved in this copy):
- ZSTD_compressSequences()
- ZSTD_compressSequencesAndLiterals() on dev
- ZSTD_compressSequencesAndLiterals() on this PR, scalar mode
- ZSTD_compressSequencesAndLiterals() with AVX2 enabled

The improvements to the scalar code path are small but generic. The AVX2 code path improves even more, but obviously requires the corresponding vector support, which is not guaranteed, or may come with strings attached, for example within kernel mode.

The vector code path could be even faster, and is mostly hampered by the need to check for exceptional cases. There might be ways to improve performance further by streamlining the path in favorable scenarios. But note that all this work does is reduce the overhead of translating the public Sequence format into the internal one,
so there's a limit to how much overhead can be removed, and we are already getting pretty close to that limit after these recent optimizations.
Also: improved the benchmark framework by adding an (optional) validation function.