Improve speed of ZSTD_compressSequencesAndLiterals() using AVX2 #4232
Conversation
@@ -7103,15 +7103,214 @@ size_t ZSTD_compressSequences(ZSTD_CCtx* cctx,
    return cSize;
}

#if defined(__AVX2__)
We should add a constant ZSTD_ARCH_X86_AVX2 to compiler.h here, and make sure we respect the ZSTD_NO_INTRINSICS macro.

Lines 225 to 239 in a610550:
/* compile time determination of SIMD support */
#if !defined(ZSTD_NO_INTRINSICS)
#  if defined(__SSE2__) || defined(_M_AMD64) || (defined (_M_IX86) && defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
#    define ZSTD_ARCH_X86_SSE2
#  endif
#  if defined(__ARM_NEON) || defined(_M_ARM64)
#    define ZSTD_ARCH_ARM_NEON
#  endif
#
#  if defined(ZSTD_ARCH_X86_SSE2)
#    include <emmintrin.h>
#  elif defined(ZSTD_ARCH_ARM_NEON)
#    include <arm_neon.h>
#  endif
#endif
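For instance, the detection block above might be extended along these lines (a sketch only; the exact placement and any extra compiler-version guards are up to the maintainers):

```
/* sketch: AVX2 detection gated on ZSTD_NO_INTRINSICS,
 * mirroring the existing SSE2/NEON pattern */
#if !defined(ZSTD_NO_INTRINSICS)
#  if defined(__AVX2__)
#    define ZSTD_ARCH_X86_AVX2
#  endif
#  if defined(ZSTD_ARCH_X86_AVX2)
#    include <immintrin.h>
#  endif
#endif
```

Code that currently tests `__AVX2__` directly would then test `ZSTD_ARCH_X86_AVX2` instead, so `ZSTD_NO_INTRINSICS` disables the vector path in one place.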
lib/compress/zstd_compress.c (outdated):
if (!repcodeResolution) {
    offBase = OFFSET_TO_OFFBASE(inSeqs[seqNb].offset);
} else {
This is impossible
lib/compress/zstd_compress.c (outdated):
    size_t blockSize;
    size_t litSize;
} BlockSummary;

#if defined(__AVX2__)
Use ZSTD_ARCH_X86_AVX2
lib/compress/zstd_compress.c (outdated):
#if defined(__GNUC__)
#  define ALIGNED32 __attribute__((aligned(32)))
#elif defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) /* C11 */
#  define ALIGNED32 alignas(32)
#else
/* this compiler will require its own alignment instruction */
#  define ALIGNED32
#endif
I know there are other places we align. Should we unify this into a common macro in compiler.h?
lib/compress/zstd_compress.c (outdated):
DEBUGLOG(5, "long literals length detected at pos %zu", longl-nbSequences);
assert(longl <= 2* (nbSequences-1));
cctx->seqStore.longLengthType = ZSTD_llt_literalLength;
cctx->seqStore.longLengthPos = (U32)(longl-nbSequences);
nit: This is worth a comment, because there seems to be +1 and -1 cancelling out here, which is confusing.
Restored the full equation, for clarity.
ZSTD_STATIC_ASSERT(offsetof(SeqDef, mlBase) == 6);

/* Process 2 sequences per loop iteration */
for (; i + 1 < nbSequences; i += 2) {
Not sure how much time you want to spend on this, but it would likely be a bit faster to unroll once more and process 4 sequences per loop. With 4 sequences per iteration you can do half as many cross-lane shuffles: keep the first 2 sequences as-is, put the second 2 sequences in the top half of each lane, blend them together with _mm256_blend_epi32(), and then you only need a single _mm256_permute4x64_epi64() for 4 sequences.
This is better kept as a potential optimization for a future iteration. At this stage, we don't know yet if AVX2 will be useful in kernel mode.
Lots of CI issues suddenly. If I were to guess, the CI VM probably just got silently updated.

Commit notes:
- needs to take care of long lengths > 65535
- compressSequencesAndLiterals: fixed long lengths in scalar mode
- that were automatically added by the editor without notification
- the branch is not in the hot loop
- since it depends on a specific definition of the ZSTD_Sequence structure
- do not solve the equation: even though some members cancel each other out, the full form is kept for clarity; we'll let the compiler do the resolution at compile time
All comments addressed.
This PR improves the speed of ZSTD_compressSequencesAndLiterals(), especially when compiled with AVX2 support.

For illustration, here are some benchmark numbers, on an i7-9700K @ 3.6 GHz (turbo off), using enwik5 (100 KB) and the sequences produced by level 5, which is an unfavorable scenario because it produces many smaller sequences.

Scenarios measured, each on enwik5 at level 5, with gcc v13.3 and clang v18.1 (numeric results not preserved in this copy):
- ZSTD_compressSequences()
- ZSTD_compressSequencesAndLiterals() on dev
- ZSTD_compressSequencesAndLiterals() on this PR, scalar mode
- ZSTD_compressSequencesAndLiterals() with AVX2 enabled

The improvements to the scalar code path are small but generic. The AVX2 code path improves even more, but obviously requires the corresponding vector support, which is not guaranteed, or may come with strings attached, for example within kernel mode.

The vector code path could be even faster, and is mostly hampered by the need to check for exceptional cases. There might be ways to improve performance further by streamlining the path in favorable scenarios. But note that all this work does is reduce the overhead of translating the public Sequence format into the internal one,
so there's a limit to how much overhead can be removed, and we are already getting pretty close to that limit after these recent optimizations.
Also: improved the benchmark framework by adding an (optional) validation function.