
Improve speed of ZSTD_compressSequencesAndLiterals() using AVX2 #4232

Merged: 18 commits into dev on Jan 16, 2025

Conversation

@Cyan4973 (Contributor) commented Jan 8, 2025

This PR improves the speed of ZSTD_compressSequencesAndLiterals(), especially when compiled with AVX2 support.

For illustration, here are some benchmark numbers, measured on an i7-9700K @ 3.6 GHz (turbo off),
using enwik5 (100 KB) and the sequences produced by level 5,
an unfavorable scenario because it produces many small sequences.

| dataset | compiler | ZSTD_compressSequences() | ZSTD_compressSequencesAndLiterals() on dev | ZSTD_compressSequencesAndLiterals() on this PR, scalar mode | ZSTD_compressSequencesAndLiterals() with AVX2 enabled |
|---|---|---|---|---|---|
| enwik5, level 5 | gcc v13.3 | 528 MB/s | 598 MB/s | 630 MB/s | 670 MB/s |
| enwik5, level 5 | clang v18.1 | 512 MB/s | 584 MB/s | 601 MB/s | 665 MB/s |

The improvements to the scalar code path are small but generic.

The AVX2 code path improves speed even more, but it obviously requires the corresponding vector support, which is not guaranteed and may come with strings attached, for example in kernel mode.

The vector code path could be even faster; it is mostly hampered by the need to check for exceptional cases.
There might be ways to improve performance further by streamlining the path in favorable scenarios.
Note, however, that all this work does is reduce the overhead of translating the public Sequence format into the internal one,
so there is a limit to how much overhead can be removed, and after these recent optimizations we are already getting pretty close to that limit.

Also: improved the benchmark framework by adding an (optional) validation function.

@Cyan4973 Cyan4973 self-assigned this Jan 8, 2025
@Cyan4973 Cyan4973 force-pushed the convertSequences_SSE branch from 8106b4d to bea2e52 on January 8, 2025 07:36
@Cyan4973 Cyan4973 marked this pull request as ready for review January 8, 2025 19:21
@Cyan4973 Cyan4973 force-pushed the convertSequences_SSE branch 2 times, most recently from 60ae6de to b431d7d on January 8, 2025 22:51
@@ -7103,15 +7103,214 @@ size_t ZSTD_compressSequences(ZSTD_CCtx* cctx,
return cSize;
}


#if defined(__AVX2__)
Contributor:
We should add a constant ZSTD_ARCH_X86_AVX2 to compiler.h here, and make sure we respect the ZSTD_NO_INTRINSICS macro.

zstd/lib/common/compiler.h

Lines 225 to 239 in a610550

/* compile time determination of SIMD support */
#if !defined(ZSTD_NO_INTRINSICS)
# if defined(__SSE2__) || defined(_M_AMD64) || (defined (_M_IX86) && defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
# define ZSTD_ARCH_X86_SSE2
# endif
# if defined(__ARM_NEON) || defined(_M_ARM64)
# define ZSTD_ARCH_ARM_NEON
# endif
#
# if defined(ZSTD_ARCH_X86_SSE2)
# include <emmintrin.h>
# elif defined(ZSTD_ARCH_ARM_NEON)
# include <arm_neon.h>
# endif
#endif
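Following the existing pattern in that excerpt, the suggested `ZSTD_ARCH_X86_AVX2` constant could be sketched as follows (an illustration of the reviewer's suggestion, not the code actually merged):

```c
/* sketch: extend compiler.h's SIMD detection with an AVX2 constant,
 * respecting ZSTD_NO_INTRINSICS like the existing SSE2/NEON checks */
#if !defined(ZSTD_NO_INTRINSICS)
#  if defined(__AVX2__)
#    define ZSTD_ARCH_X86_AVX2
#  endif
#  if defined(ZSTD_ARCH_X86_AVX2)
#    include <immintrin.h>
#  endif
#endif
```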

Comment on lines 7354 to 7356
if (!repcodeResolution) {
offBase = OFFSET_TO_OFFBASE(inSeqs[seqNb].offset);
} else {
Contributor:
This is impossible

size_t blockSize;
size_t litSize;
} BlockSummary;
#if defined(__AVX2__)
Contributor:
Use ZSTD_ARCH_X86_AVX2

Comment on lines 7396 to 7403
#if defined(__GNUC__)
# define ALIGNED32 __attribute__((aligned(32)))
#elif defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) /* C11 */
# define ALIGNED32 alignas(32)
#else
/* this compiler will require its own alignment instruction */
# define ALIGNED32
#endif
Contributor:
I know there are other places we align. Should we unify this into a common macro in compiler.h?

DEBUGLOG(5, "long literals length detected at pos %zu", longl-nbSequences);
assert(longl <= 2* (nbSequences-1));
cctx->seqStore.longLengthType = ZSTD_llt_literalLength;
cctx->seqStore.longLengthPos = (U32)(longl-nbSequences);
Contributor:
nit: This is worth a comment, because there seems to be +1 and -1 cancelling out here, which is confusing.

Contributor Author:
restored full equation, for clarity

ZSTD_STATIC_ASSERT(offsetof(SeqDef, mlBase) == 6);

/* Process 2 sequences per loop iteration */
for (; i + 1 < nbSequences; i += 2) {
Contributor:
Not sure how much time you want to spend on this, but it would likely be a bit faster to unroll it once more to process 4 sequences per loop.

If you process 4 sequences per loop you can do half as many cross-lane shuffles. You would keep the first 2 sequences as-is, and then put the second 2 sequences in the top half of each lane. Then use _mm256_blend_epi32() to blend them together, and you only need the single _mm256_permute4x64_epi64() for 4 sequences.

Contributor Author:

This is better kept as a potential optimization for a future iteration.
At this stage, we don't know yet if AVX2 will be useful in kernel mode.

@Cyan4973 (Contributor Author) commented Jan 15, 2025

Lots of CI issues suddenly,
though they seem unrelated to this PR.

Noticed so far:

  • liblzma and lzma.h no longer available
  • clang-14 no longer available
  • -lgcc missing at linking stage (??)

If I were to guess, the CI VM probably just got silently updated,
but I sure am surprised by the number of tests it broke in the process.

@Cyan4973 Cyan4973 force-pushed the convertSequences_SSE branch from 2f55e2e to 87f0a4f on January 16, 2025 01:11
@Cyan4973 (Contributor Author):
All comments addressed

@Cyan4973 Cyan4973 merged commit 33747e2 into dev Jan 16, 2025
94 checks passed