Microarchitecture: Sapphire Rapids
Setting: 2 Sockets x 32 Golden Cove Cores
For single core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AMX_INT8        | MM(s32,s8,s8)         | 6.3726 Tops      |
| AMX_INT8        | MM(s32,s8,u8)         | 7.5746 Tops      |
| AMX_INT8        | MM(s32,u8,s8)         | 7.5733 Tops      |
| AMX_INT8        | MM(s32,u8,u8)         | 7.5718 Tops      |
| AMX_BF16        | MM(f32,bf16,bf16)     | 3.7868 Tflops    |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 998.07 Gops      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 499.07 Gops      |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 498.96 Gops      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 249.47 Gops      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 115.16 Gflops    |
| AVX512_FP16     | FMA(f16,f16,f16)      | 499.08 Gflops    |
| AVX512F         | FMA(f32,f32,f32)      | 230.28 Gflops    |
| AVX512F         | FMA(f64,f64,f64)      | 115.17 Gflops    |
| FMA             | FMA(f32,f32,f32)      | 118.35 Gflops    |
| FMA             | FMA(f64,f64,f64)      | 62.385 Gflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 91.59 Gflops     |
| AVX             | ADD(MUL(f64,f64),f64) | 45.85 Gflops     |
| SSE             | ADD(MUL(f32,f32),f32) | 46.493 Gflops    |
| SSE2            | ADD(MUL(f64,f64),f64) | 23.235 Gflops    |
--------------------------------------------------------------
For 64 cores:
$ ./cpufp --thread_pool=[0-63]
Number Threads: 64
Thread Pool Binding: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AMX_INT8        | MM(s32,s8,s8)         | 390.67 Tops      |
| AMX_INT8        | MM(s32,s8,u8)         | 380.93 Tops      |
| AMX_INT8        | MM(s32,u8,s8)         | 391.32 Tops      |
| AMX_INT8        | MM(s32,u8,u8)         | 380.28 Tops      |
| AMX_BF16        | MM(f32,bf16,bf16)     | 192.47 Tflops    |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 48.114 Tops      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 24.169 Tops      |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 30.818 Tops      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 15.74 Tops       |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 7.09 Tflops      |
| AVX512_FP16     | FMA(f16,f16,f16)      | 31.473 Tflops    |
| AVX512F         | FMA(f32,f32,f32)      | 14.329 Tflops    |
| AVX512F         | FMA(f64,f64,f64)      | 6.5406 Tflops    |
| FMA             | FMA(f32,f32,f32)      | 7.4039 Tflops    |
| FMA             | FMA(f64,f64,f64)      | 3.9067 Tflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 5.4087 Tflops    |
| AVX             | ADD(MUL(f64,f64),f64) | 2.7339 Tflops    |
| SSE             | ADD(MUL(f32,f32),f32) | 2.9077 Tflops    |
| SSE2            | ADD(MUL(f64,f64),f64) | 1.4791 Tflops    |
--------------------------------------------------------------
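The single-core and 64-core Sapphire Rapids figures above can be compared by dividing the multi-core peak by 64 times the single-core peak. The following is a minimal sketch of that arithmetic, not part of cpufp; the helper name `scaling_efficiency` and the choice of which rows to include are illustrative assumptions, and the values are copied from the results above with Tops/Tflops converted to Gops/Gflops.

```python
def scaling_efficiency(single_core: float, multi_core: float, cores: int) -> float:
    """Return multi-core throughput as a fraction of ideal linear scaling."""
    return multi_core / (single_core * cores)


# (single-core peak, 64-core peak) in Gops/Gflops, copied from the
# Sapphire Rapids results above; T-prefixed values were multiplied by 1000.
SPR_RESULTS = {
    "AMX_BF16    MM(f32,bf16,bf16)": (3786.8, 192470.0),
    "AVX512_VNNI DP4A(s32,u8,s8)": (998.07, 48114.0),
    "AVX512F     FMA(f32,f32,f32)": (230.28, 14329.0),
    "AVX512F     FMA(f64,f64,f64)": (115.17, 6540.6),
}

for name, (one, many) in SPR_RESULTS.items():
    eff = scaling_efficiency(one, many, cores=64)
    print(f"{name}: {eff:.1%} of linear scaling")
```

A value below 100% means the 64-core run delivered less than 64 times the single-core peak. The results do not say why; lower all-core clock frequencies are one plausible cause, but that is an assumption rather than something measured here.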
Architecture: Zen4
Setting: 8 Zen4 Cores
For single core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 647.97 GOPS      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 324.27 GOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 324.92 GFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 163.58 GFLOPS    |
| AVX512F         | FMA(f64,f64,f64)      | 81.786 GFLOPS    |
| FMA             | FMA(f32,f32,f32)      | 163.57 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 81.785 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 157.36 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 79.045 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 80.34 GFLOPS     |
| SSE2            | ADD(MUL(f64,f64),f64) | 40.371 GFLOPS    |
--------------------------------------------------------------
For 8 cores:
$ ./cpufp --thread_pool=[0-7]
Number Threads: 8
Thread Pool Binding: 0 1 2 3 4 5 6 7
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 5113.8 GOPS      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 2559.1 GOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 2551.6 GFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 1283.6 GFLOPS    |
| AVX512F         | FMA(f64,f64,f64)      | 641.21 GFLOPS    |
| FMA             | FMA(f32,f32,f32)      | 1271.7 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 632.3 GFLOPS     |
| AVX             | ADD(MUL(f32,f32),f32) | 1193.6 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 590.85 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 613.54 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 307.67 GFLOPS    |
--------------------------------------------------------------
Architecture: Zen3+
Setting: 8 Zen3+ Cores
For single core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| FMA             | FMA(f32,f32,f32)      | 151.84 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 75.702 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 150.86 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 75.476 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 75.452 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 37.737 GFLOPS    |
--------------------------------------------------------------
For 8 cores:
$ ./cpufp --thread_pool=[0,2,4,6,8,10,12,14]
Number Threads: 8
Thread Pool Binding: 0 2 4 6 8 10 12 14
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| FMA             | FMA(f32,f32,f32)      | 1057.8 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 534.37 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 1037.6 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 516.21 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 518.32 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 258.92 GFLOPS    |
--------------------------------------------------------------
Product Code Name: Raptor Lake
Setting: 4 Raptor Cove (P-Core) Cores + 8 Gracemont (E-Core) Cores
For single P-Core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 586.84 Gops      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 293.5 Gops       |
| FMA             | FMA(f32,f32,f32)      | 146.76 Gflops    |
| FMA             | FMA(f64,f64,f64)      | 73.373 Gflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 107.7 Gflops     |
| AVX             | ADD(MUL(f64,f64),f64) | 53.512 Gflops    |
| SSE             | ADD(MUL(f32,f32),f32) | 54.49 Gflops     |
| SSE2            | ADD(MUL(f64,f64),f64) | 27.243 Gflops    |
--------------------------------------------------------------
For 4 P-Cores:
$ ./cpufp --thread_pool=[0,2,4,6]
Number Threads: 4
Thread Pool Binding: 0 2 4 6
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 2.2454 Tops      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 1.1215 Tops      |
| FMA             | FMA(f32,f32,f32)      | 546.31 Gflops    |
| FMA             | FMA(f64,f64,f64)      | 267.62 Gflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 356.72 Gflops    |
| AVX             | ADD(MUL(f64,f64),f64) | 176.89 Gflops    |
| SSE             | ADD(MUL(f32,f32),f32) | 183.39 Gflops    |
| SSE2            | ADD(MUL(f64,f64),f64) | 91.293 Gflops    |
--------------------------------------------------------------
For single E-Core:
$ ./cpufp --thread_pool=[8]
Number Threads: 1
Thread Pool Binding: 8
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 108.5 Gops       |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 54.251 Gops      |
| FMA             | FMA(f32,f32,f32)      | 54.248 Gflops    |
| FMA             | FMA(f64,f64,f64)      | 27.125 Gflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 27.126 Gflops    |
| AVX             | ADD(MUL(f64,f64),f64) | 13.563 Gflops    |
| SSE             | ADD(MUL(f32,f32),f32) | 27.122 Gflops    |
| SSE2            | ADD(MUL(f64,f64),f64) | 13.561 Gflops    |
--------------------------------------------------------------
For 8 E-Cores:
$ ./cpufp --thread_pool=[8-15]
Number Threads: 8
Thread Pool Binding: 8 9 10 11 12 13 14 15
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 791.36 Gops      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 395.68 Gops      |
| FMA             | FMA(f32,f32,f32)      | 395.67 Gflops    |
| FMA             | FMA(f64,f64,f64)      | 197.83 Gflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 197.84 Gflops    |
| AVX             | ADD(MUL(f64,f64),f64) | 98.921 Gflops    |
| SSE             | ADD(MUL(f32,f32),f32) | 197.83 Gflops    |
| SSE2            | ADD(MUL(f64,f64),f64) | 98.916 Gflops    |
--------------------------------------------------------------
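To put the Raptor Lake P-core and E-core numbers side by side, the single-core peaks above can be divided kernel by kernel. This is a small illustrative sketch; the dictionary keys and the helper `per_core_ratio` are hypothetical names, and the values are copied from the single P-core and single E-core results above (Gops/Gflops).

```python
# Single-core peaks in Gops/Gflops, copied from the Raptor Lake results above.
P_CORE = {
    "AVX_VNNI DP4A": 586.84,
    "FMA f32": 146.76,
    "FMA f64": 73.373,
    "AVX f32": 107.7,
}
E_CORE = {
    "AVX_VNNI DP4A": 108.5,
    "FMA f32": 54.248,
    "FMA f64": 27.125,
    "AVX f32": 27.126,
}


def per_core_ratio(p_core: dict, e_core: dict) -> dict:
    """Return P-core peak divided by E-core peak for kernels present in both."""
    return {k: p_core[k] / e_core[k] for k in p_core if k in e_core}


for kernel, ratio in per_core_ratio(P_CORE, E_CORE).items():
    print(f"{kernel}: one P-core is about {ratio:.2f}x one E-core")
```

The ratio differs by kernel, so a single "P-core equals N E-cores" figure would be misleading for this machine.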
Product Code Name: Alder Lake-N
Setting: 4 Gracemont Cores
For single core:
$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 108.51 GOPS      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 54.244 GOPS      |
| FMA             | FMA(f32,f32,f32)      | 54.247 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 27.128 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 27.128 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 13.564 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 27.126 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 13.563 GFLOPS    |
--------------------------------------------------------------
For 4 cores:
$ ./cpufp --thread_pool=[0-3]
Number Threads: 4
Thread Pool Binding: 0 1 2 3
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 369.66 GOPS      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 185.09 GOPS      |
| FMA             | FMA(f32,f32,f32)      | 185.08 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 92.55 GFLOPS     |
| AVX             | ADD(MUL(f32,f32),f32) | 92.546 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 46.269 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 92.546 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 46.27 GFLOPS     |
--------------------------------------------------------------
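The result tables above mix G- and T-prefixed units and two capitalization styles (`Gops`/`Tflops` vs `GOPS`/`GFLOPS`), which makes cross-machine comparison awkward. The sketch below shows one way to parse pasted output into rows normalized to Gops/Gflops. It is not part of cpufp: `parse_cpufp`, `ROW`, and `SCALE` are hypothetical names, and the parser assumes the three-column row layout seen in the output above.

```python
import re

# One result row: | <instruction set> | <core computation> | <value> <unit> |
# The header row is skipped because its cells contain spaces.
ROW = re.compile(
    r"\|\s*([^\s|]+)\s*\|\s*([^\s|]+)\s*\|\s*([0-9.]+)\s*([A-Za-z]+)\s*\|"
)

# First letter of the unit -> multiplier that converts the value to G units.
# Only the G and T prefixes appear in the results above.
SCALE = {"g": 1.0, "t": 1000.0}


def parse_cpufp(text: str) -> list:
    """Return (instruction_set, computation, peak_in_G_units) tuples."""
    rows = []
    for isa, kernel, value, unit in ROW.findall(text):
        rows.append((isa, kernel, float(value) * SCALE[unit[0].lower()]))
    return rows


sample = """
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 2.2454 Tops      |
| FMA             | FMA(f32,f32,f32)      | 546.31 Gflops    |
"""

for row in parse_cpufp(sample):
    print(row)
```

Because the pattern matches on the `|` delimiters rather than on line breaks, it also accepts output that has been pasted onto a single line. A unit with any other prefix raises `KeyError`, which is deliberate: silently guessing a scale would corrupt the comparison.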