Dynamically Determining Where a Rogue AVX-512 Instruction Is Executing

Determine number of AVX-512 FMA units

The Intel® 64 and IA-32 Architectures Optimization Reference Manual (February 2022), Section 18.21, "Servers with a Single FMA Unit", contains assembly-language source code that identifies the number of AVX-512 FMA units per core on an AVX-512-capable processor; see Example 18-25. It works by comparing the timing of two functions: one that executes only FMA instructions, and another that executes both FMA and shuffle instructions.

Intel's optimization manual can be downloaded from: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html#inpage-nav-8.

The source code from this manual is available at: https://github.com/intel/optimization-manual
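
The manual's listing is in assembly; as a rough C++ illustration of the same timing idea (my own sketch, not Intel's Example 18-25; the names, constants, and threshold are invented), something like this can distinguish one FMA unit from two. Both loops run heavy 512-bit instructions, so they settle into the same frequency license and the ratio of their rdtsc timings is meaningful:

// Sketch only: compile with e.g. g++ -O2 -mavx512f fma_units.cpp
// Idea: vpermps zmm executes only on port 5, the port a second FMA unit
// (if present) shares. If adding shuffles roughly doubles the time of an
// FMA-saturated loop, the FMA-only loop was using both ports.
#include <immintrin.h>
#include <x86intrin.h>   // __rdtsc
#include <cstdint>
#include <cstdio>

static volatile float g_sink;   // defeats dead-code elimination

static uint64_t fma_only(int iters) {
    __m512 acc[8];               // 8 independent chains hide the ~4-cycle FMA latency
    for (int j = 0; j < 8; ++j) acc[j] = _mm512_set1_ps(1.0f + j);
    const __m512 b = _mm512_set1_ps(1.0000001f), c = _mm512_set1_ps(1e-7f);
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; ++i)
        for (int j = 0; j < 8; ++j)
            acc[j] = _mm512_fmadd_ps(acc[j], b, c);
    uint64_t dt = __rdtsc() - t0;
    for (int j = 0; j < 8; ++j)
        g_sink = g_sink + _mm_cvtss_f32(_mm512_castps512_ps128(acc[j]));
    return dt;
}

static uint64_t fma_plus_shuffle(int iters) {
    __m512 acc[8], sh[8];        // separate chains for the FMAs and the shuffles
    for (int j = 0; j < 8; ++j) {
        acc[j] = _mm512_set1_ps(1.0f + j);
        sh[j]  = _mm512_set1_ps(2.0f + j);
    }
    const __m512 b = _mm512_set1_ps(1.0000001f), c = _mm512_set1_ps(1e-7f);
    const __m512i idx = _mm512_set1_epi32(1);
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; ++i)
        for (int j = 0; j < 8; ++j) {
            acc[j] = _mm512_fmadd_ps(acc[j], b, c);
            sh[j]  = _mm512_permutexvar_ps(idx, sh[j]);   // port-5-only shuffle
        }
    uint64_t dt = __rdtsc() - t0;
    for (int j = 0; j < 8; ++j)
        g_sink = g_sink + _mm_cvtss_f32(_mm512_castps512_ps128(acc[j]))
                        + _mm_cvtss_f32(_mm512_castps512_ps128(sh[j]));
    return dt;
}

int main() {
    const int iters = 10000000;
    fma_only(iters); fma_plus_shuffle(iters);   // warm up, settle into the AVX-512 license
    double ratio = (double)fma_plus_shuffle(iters) / (double)fma_only(iters);
    // ~1.0: one FMA unit (the shuffles filled an otherwise idle port 5)
    // ~2.0: two FMA units (the FMA-only loop had been using port 5 as well)
    printf("ratio %.2f -> %d FMA unit(s)\n", ratio, ratio > 1.5 ? 2 : 1);
}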

Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?

No, a vpcmpeqb into a mask register does not trigger slow mode if you use a zmm register as one of the comparands, at least on SKX.

This is also true of any other instruction (as far as I tested) that only reads the "key" 512-bit registers, the key registers being zmm0 - zmm15. For example, vpxord zmm16, zmm0, zmm1 also does not dirty the uppers: although it involves the key registers zmm0 and zmm1, it only reads from them, while the register it writes, zmm16, is not a key register.

I tested this using avx-turbo on a Xeon W-2104, which has a nominal speed of 3.2 GHz, an L1 turbo license (AVX2 turbo) of 2.8 GHz, and an L2 license (AVX-512 turbo) of 2.4 GHz. I used the --dirty-upper option to dirty the uppers before each test with vpxord zmm15, zmm14, zmm15. This causes any test that uses any SIMD registers at all (including scalar SSE FP) to run at the slower 2.8 GHz speed, as shown in these results (look at the A/M-MHz column for CPU frequency):

CPUID highest leaf  : [16h]
Running as root : [YES]
MSR reads supported : [YES]
CPU pinning enabled : [YES]
CPU supports AVX2 : [YES]
CPU supports AVX-512: [YES]
cpuid = eax = 2, ebx = 266, ecx = 0, edx = 0
cpu: family = 6, model = 85, stepping = 4
tsc_freq = 3191.8 MHz (from calibration loop)
CPU brand string: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
4 available CPUs: [0, 1, 2, 3]
4 physical cores: [0, 1, 2, 3]
Will test up to 1 CPUs
Cores | ID | Description | OVRLP1 | OVRLP2 | OVRLP3 | Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
1 | pause_only | pause instruction | 1.000 | 1.000 | 1.000 | 2256 | 0.99 | 3173 | 1.00
1 | ucomis_clean | scalar ucomis (w/ vzeroupper) | 1.000 | 1.000 | 1.000 | 790 | 1.00 | 3192 | 1.00
1 | ucomis_dirty | scalar ucomis (no vzeroupper) | 1.000 | 1.000 | 1.000 | 466 | 0.88 | 2793 | 1.00
1 | scalar_iadd | Scalar integer adds | 1.000 | 1.000 | 1.000 | 3192 | 0.99 | 3165 | 1.00
1 | avx128_iadd | 128-bit integer serial adds | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx256_iadd | 256-bit integer serial adds | 1.000 | 1.000 | 1.000 | 2793 | 0.87 | 2793 | 1.00
1 | avx512_iadd | 512-bit integer adds | 1.000 | 1.000 | 1.000 | 2794 | 0.88 | 2793 | 1.00
1 | avx128_iadd_t | 128-bit integer parallel adds | 1.000 | 1.000 | 1.000 | 8380 | 0.88 | 2793 | 1.00
1 | avx256_iadd_t | 256-bit integer parallel adds | 1.000 | 1.000 | 1.000 | 8380 | 0.88 | 2793 | 1.00
1 | avx128_mov_sparse | 128-bit reg-reg mov | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx256_mov_sparse | 256-bit reg-reg mov | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx512_mov_sparse | 512-bit reg-reg mov | 1.000 | 1.000 | 1.000 | 2794 | 0.87 | 2793 | 1.00
1 | avx128_merge_sparse | 128-bit reg-reg merge mov | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx256_merge_sparse | 256-bit reg-reg merge mov | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx512_merge_sparse | 512-bit reg-reg merge mov | 1.000 | 1.000 | 1.000 | 2794 | 0.88 | 2793 | 1.00
1 | avx128_vshift | 128-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx256_vshift | 256-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx512_vshift | 512-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 2794 | 0.88 | 2793 | 1.00
1 | avx128_vshift_t | 128-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 5587 | 0.88 | 2793 | 1.00
1 | avx256_vshift_t | 256-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 5588 | 0.88 | 2793 | 1.00
1 | avx512_vshift_t | 512-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 2794 | 0.88 | 2793 | 1.00
1 | avx128_imul | 128-bit integer muls | 1.000 | 1.000 | 1.000 | 559 | 0.88 | 2793 | 1.00
1 | avx256_imul | 256-bit integer muls | 1.000 | 1.000 | 1.000 | 559 | 0.88 | 2793 | 1.00
1 | avx512_imul | 512-bit integer muls | 1.000 | 1.000 | 1.000 | 559 | 0.88 | 2793 | 1.00
1 | avx128_fma_sparse | 128-bit 64-bit sparse FMAs | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx256_fma_sparse | 256-bit 64-bit sparse FMAs | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx512_fma_sparse | 512-bit 64-bit sparse FMAs | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx128_fma | 128-bit serial DP FMAs | 1.000 | 1.000 | 1.000 | 698 | 0.88 | 2793 | 1.00
1 | avx256_fma | 256-bit serial DP FMAs | 1.000 | 1.000 | 1.000 | 698 | 0.87 | 2793 | 1.00
1 | avx512_fma | 512-bit serial DP FMAs | 1.000 | 1.000 | 1.000 | 698 | 0.88 | 2793 | 1.00
1 | avx128_fma_t | 128-bit parallel DP FMAs | 1.000 | 1.000 | 1.000 | 4789 | 0.75 | 2394 | 1.00
1 | avx256_fma_t | 256-bit parallel DP FMAs | 1.000 | 1.000 | 1.000 | 4790 | 0.75 | 2394 | 1.00
1 | avx512_fma_t | 512-bit parallel DP FMAs | 1.000 | 1.000 | 1.000 | 2394 | 0.75 | 2394 | 1.00
1 | avx512_vpermw | 512-bit serial WORD permute | 1.000 | 1.000 | 1.000 | 466 | 0.88 | 2793 | 1.00
1 | avx512_vpermw_t | 512-bit parallel WORD permute | 1.000 | 1.000 | 1.000 | 1397 | 0.87 | 2793 | 1.00
1 | avx512_vpermd | 512-bit serial DWORD permute | 1.000 | 1.000 | 1.000 | 931 | 0.87 | 2793 | 1.00
1 | avx512_vpermd_t | 512-bit parallel DWORD permute | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00

The only tests that ran at full speed were scalar_iadd (Scalar integer adds), which uses no SSE/AVX registers at all, and ucomis_clean (scalar ucomis w/ vzeroupper), which executes an explicit vzeroupper before each test and so doesn't run with dirty uppers.

Then I changed the dirtying instruction to vpcmpeqb k0, zmm0, [rsp], the instruction you are interested in. The new results:

Cores | ID                  | Description                     | OVRLP1 | OVRLP2 | OVRLP3 | Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
1 | pause_only | pause instruction | 1.000 | 1.000 | 1.000 | 2256 | 1.00 | 3192 | 1.00
1 | ucomis_clean | scalar ucomis (w/ vzeroupper) | 1.000 | 1.000 | 1.000 | 790 | 1.00 | 3192 | 1.00
1 | ucomis_dirty | scalar ucomis (no vzeroupper) | 1.000 | 1.000 | 1.000 | 790 | 1.00 | 3192 | 1.00
1 | scalar_iadd | Scalar integer adds | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3192 | 1.00
1 | avx128_iadd | 128-bit integer serial adds | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3190 | 1.00
1 | avx256_iadd | 256-bit integer serial adds | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3192 | 1.00
1 | avx512_iadd | 512-bit integer adds | 1.000 | 1.000 | 1.000 | 2794 | 0.88 | 2793 | 1.00
1 | avx128_iadd_t | 128-bit integer parallel adds | 1.000 | 1.000 | 1.000 | 9575 | 1.00 | 3192 | 1.00
1 | avx256_iadd_t | 256-bit integer parallel adds | 1.000 | 1.000 | 1.000 | 9577 | 1.00 | 3192 | 1.00
1 | avx128_mov_sparse | 128-bit reg-reg mov | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3192 | 1.00
1 | avx256_mov_sparse | 256-bit reg-reg mov | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3192 | 1.00
1 | avx512_mov_sparse | 512-bit reg-reg mov | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx128_merge_sparse | 128-bit reg-reg merge mov | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3192 | 1.00
1 | avx256_merge_sparse | 256-bit reg-reg merge mov | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3192 | 1.00
1 | avx512_merge_sparse | 512-bit reg-reg merge mov | 1.000 | 1.000 | 1.000 | 2793 | 0.88 | 2793 | 1.00
1 | avx128_vshift | 128-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3192 | 1.00
1 | avx256_vshift | 256-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3192 | 1.00
1 | avx512_vshift | 512-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 2794 | 0.88 | 2793 | 1.00
1 | avx128_vshift_t | 128-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 6386 | 1.00 | 3192 | 1.00
1 | avx256_vshift_t | 256-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 6386 | 1.00 | 3192 | 1.00
1 | avx512_vshift_t | 512-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 2794 | 0.88 | 2793 | 1.00
1 | avx128_imul | 128-bit integer muls | 1.000 | 1.000 | 1.000 | 638 | 1.00 | 3192 | 1.00
1 | avx256_imul | 256-bit integer muls | 1.000 | 1.000 | 1.000 | 639 | 1.00 | 3192 | 1.00
1 | avx512_imul | 512-bit integer muls | 1.000 | 1.000 | 1.000 | 559 | 0.88 | 2793 | 1.00
1 | avx128_fma_sparse | 128-bit 64-bit sparse FMAs | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3192 | 1.00
1 | avx256_fma_sparse | 256-bit 64-bit sparse FMAs | 1.000 | 1.000 | 1.000 | 3193 | 1.00 | 3192 | 1.00
1 | avx512_fma_sparse | 512-bit 64-bit sparse FMAs | 1.000 | 1.000 | 1.000 | 2793 | 0.87 | 2793 | 1.00
1 | avx128_fma | 128-bit serial DP FMAs | 1.000 | 1.000 | 1.000 | 798 | 1.00 | 3192 | 1.00
1 | avx256_fma | 256-bit serial DP FMAs | 1.000 | 1.000 | 1.000 | 798 | 1.00 | 3192 | 1.00
1 | avx512_fma | 512-bit serial DP FMAs | 1.000 | 1.000 | 1.000 | 698 | 0.88 | 2793 | 1.00
1 | avx128_fma_t | 128-bit parallel DP FMAs | 1.000 | 1.000 | 1.000 | 6384 | 1.00 | 3192 | 1.00
1 | avx256_fma_t | 256-bit parallel DP FMAs | 1.000 | 1.000 | 1.000 | 5587 | 0.87 | 2793 | 1.00
1 | avx512_fma_t | 512-bit parallel DP FMAs | 1.000 | 1.000 | 1.000 | 2394 | 0.75 | 2394 | 1.00
1 | avx512_vpermw | 512-bit serial WORD permute | 1.000 | 1.000 | 1.000 | 466 | 0.87 | 2793 | 1.00
1 | avx512_vpermw_t | 512-bit parallel WORD permute | 1.000 | 1.000 | 1.000 | 1397 | 0.88 | 2793 | 1.00
1 | avx512_vpermd | 512-bit serial DWORD permute | 1.000 | 1.000 | 1.000 | 931 | 0.88 | 2793 | 1.00
1 | avx512_vpermd_t | 512-bit parallel DWORD permute | 1.000 | 1.000 | 1.000 | 2794 | 0.88 | 2793 | 1.00

Most tests now run at full speed. The ones still running at 2.8 GHz (or, in one case, 2.4 GHz for the parallel 512-bit FMAs) are those that actually use 512-bit vectors, or that use 256-bit vectors with heavy FP instructions like FMA, as expected.

Enabling AVX-512 support at compile time significantly decreases performance

The project's performance is significantly decreased (by 30% on average).

In code that cannot be easily vectorized, sporadic AVX-512 instructions here and there downclock your CPU but do not provide any benefit. In such scenarios you may want to turn off AVX-512 code generation completely (for example, GCC and Clang accept -mno-avx512f, or -mprefer-vector-width=256 to keep using 256-bit vectors even with AVX-512 enabled).

See Advanced Vector Extensions, Downclocking:

Since AVX instructions are wider and generate more heat, Intel processors have provisions to reduce the Turbo Boost frequency limit when such instructions are being executed. The throttling is divided into three levels:

  • L0 (100%): The normal turbo boost limit.
  • L1 (~85%): The "AVX boost" limit. Soft-triggered by 256-bit "heavy" (floating-point unit: FP math and integer multiplication) instructions. Hard-triggered by "light" (all other) 512-bit instructions.
  • L2 (~60%): The "AVX-512 boost" limit. Soft-triggered by 512-bit heavy instructions.

The frequency transition can be soft or hard. A hard transition means the frequency is reduced as soon as such an instruction is spotted; a soft transition means the frequency is reduced only after reaching a threshold number of matching instructions. The limit is per thread.

Downclocking means that using AVX-512 in a mixed workload with an Intel processor can incur a frequency penalty even though the AVX-512 code is faster in a "pure" context. Avoiding wide and heavy instructions helps minimize the impact in these cases. AVX-512VL, which applies AVX-512 instructions to 256-bit (or narrower) operands, is a sensible default for mixed loads; see the sketch below.
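
As an illustration of that advice (my own sketch, not from the sources cited here), AVX-512VL lets you keep AVX-512 niceties such as per-lane masking while staying at 256-bit width:

// Sketch: compile with e.g. gcc -O2 -mavx512f -mavx512vl
#include <immintrin.h>

// An AVX-512VL encoding at 256-bit width: per-lane masking without any
// 512-bit (L2-license) instruction. Lanes where m is 0 are zeroed.
__m256i masked_add(__m256i a, __m256i b, __mmask8 m) {
    return _mm256_maskz_add_epi32(m, a, b);
}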

Also, see

  • On the dangers of Intel's frequency scaling.
  • Gathering Intel on Intel AVX-512 Transitions.
  • How to Fix Intel?

Which Linux OS supports AVX-512 VNNI (Vector Neural Network Instructions)?

No kernel support is needed beyond that for AVX-512 (i.e., context-switch handling of the new AVX-512 zmm and k registers). AVX-512VNNI instructions just operate on those registers, so there's no new architectural state to save/restore on a context switch. See https://en.wikichip.org/wiki/x86/avx512_vnni and https://en.wikipedia.org/wiki/AVX-512#VNNI.

(Unlike AMX (Advanced Matrix Extensions), new in Sapphire Rapids; that does introduce large new "2D tile" registers, 8 x 1 KiB, which context switches need to handle; see footnote 1.)
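
So from user space, the practical step is just a CPU feature check before using the instructions. A minimal sketch with the GCC/Clang __builtin_cpu_supports builtin (assumes a newer GCC or Clang that knows the "avx512vnni" feature name):

// Sketch: runtime feature detection with a compiler builtin.
#include <cstdio>

int main() {
    if (__builtin_cpu_supports("avx512vnni"))   // CPUID.(EAX=7,ECX=0):ECX bit 11
        puts("AVX-512 VNNI available");
    else
        puts("AVX-512 VNNI not available");
}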


The other relevant thing for distros is compiler versions, like GCC or Clang. https://godbolt.org/z/668rvhWPx shows GCC 8.1 and clang 7.0 (both released in 2018) compiling the AVX-512VNNI intrinsic _mm512_dpbusd_epi32 with -march=icelake-server or -march=icelake-client. Versions before that fail, so those are the minimum versions. (Or clang 6.0 with -mavx512vnni, but that doesn't enable the other extensions an Ice Lake CPU supports, or set tuning options.)
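
For reference, the kind of code that Godbolt link compiles boils down to something like this (a sketch; the intrinsic's signature is per the Intel Intrinsics Guide):

// Sketch: compile with e.g. g++ -O2 -march=icelake-server (GCC >= 8.1, Clang >= 7.0)
#include <immintrin.h>

// vpdpbusd: for each 32-bit lane, multiply 4 unsigned bytes of a with the
// corresponding 4 signed bytes of b and accumulate the sum into src.
__m512i dot_accumulate(__m512i src, __m512i a, __m512i b) {
    return _mm512_dpbusd_epi32(src, a, b);
}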

So if you want to use the latest hotness, you need a compiler that's at least somewhat up to date. It's generally a good idea to use a compiler newer than the CPU you're using, so compiler devs have had a chance to tweak tuning settings for it. And code-gen from intrinsics, especially for newish instruction sets like AVX-512, has generally improved across compiler versions, so if you care about the performance of the generated code, you typically want a newer compiler. (Regressions happen in some releases for some loops/functions, and thus for some programs, but on average newer compilers make faster code than old ones; that's a big part of what compiler devs spend their time improving.)

You can install a new compiler on an old distro via backport packages or manually. Or you can just use a distro release that isn't old and crusty.


Footnote 1: See also a Phoronix article about non-empty AMX register state keeping the CPU from entering deep sleep. Normally CPUs fully power down the core in deeper sleep states, stashing registers somewhere that stays powered. I'm guessing they didn't provide space to do that for the AMX tiles, so having state there prevents deep sleep. So if you're using AMX, you'll want at least Linux kernel 5.19.

AVX-512 Instruction Encoding - {er} Meaning

From Intel SDM Volume 2A, Section 3.1.1.3, "Instruction Column in the Opcode Summary Table":

{er} — Indicates support for embedded rounding control, which is only applicable to the register-register form of the instruction. This also implies support for SAE (Suppress All Exceptions).

Section 2.6.8, a bit earlier, states that {er}, when applicable, is encoded in EVEX.L’L:

Static rounding control embedded in the EVEX encoding system applies only to the register-to-register flavor of floating-point instructions with rounding semantic at two distinct vector lengths: (i) scalar, (ii) 512-bit. In both cases, the field EVEX.L’L expresses rounding mode control overriding MXCSR.RC if EVEX.b is set. When EVEX.b is set, “suppress all exceptions” is implied. The processor behaves as if all MXCSR masking controls are set.
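
In intrinsics, embedded rounding surfaces as the _round_ variants. A small sketch (assuming GCC or Clang with -mavx512f):

#include <immintrin.h>

// Embedded rounding ({er}): the rounding mode is encoded in the EVEX prefix
// itself (EVEX.b set, mode in EVEX.L'L), overriding MXCSR.RC for this one
// instruction. SAE is implied, hence the mandatory _MM_FROUND_NO_EXC.
__m512 add_round_to_zero(__m512 a, __m512 b) {
    return _mm512_add_round_ps(a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}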

Horizontal add with __m512 (AVX512)

The Intel compiler has the following intrinsics defined to do horizontal sums:

_mm512_reduce_add_ps    // horizontal sum of 16 floats
_mm512_reduce_add_pd    // horizontal sum of 8 doubles
_mm512_reduce_add_epi32 // horizontal sum of 16 32-bit integers
_mm512_reduce_add_epi64 // horizontal sum of 8 64-bit integers

However, as far as I can tell, these are compiled to multiple instructions anyway, so I don't think you gain anything over summing the upper and lower halves of the AVX-512 register yourself:

// float
__m256 low  = _mm512_castps512_ps256(zmm);
__m256 high = _mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(zmm), 1));

// double
__m256d low  = _mm512_castpd512_pd256(zmm);
__m256d high = _mm512_extractf64x4_pd(zmm, 1);

// 32-bit or 64-bit integer
__m256i low  = _mm512_castsi512_si256(zmm);
__m256i high = _mm512_extracti64x4_epi64(zmm, 1);

To get the horizontal sum you then do sum = horizontal_add(low + high) (use _mm256_add_ps / _mm256_add_pd explicitly if your compiler doesn't support the + operator on vector types).

static inline float horizontal_add(__m256 a) {
    __m256 t1 = _mm256_hadd_ps(a, a);
    __m256 t2 = _mm256_hadd_ps(t1, t1);
    __m128 t3 = _mm256_extractf128_ps(t2, 1);
    __m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2), t3);
    return _mm_cvtss_f32(t4);
}

static inline double horizontal_add(__m256d a) {
    __m256d t1 = _mm256_hadd_pd(a, a);
    __m128d t2 = _mm256_extractf128_pd(t1, 1);
    __m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1), t2);
    return _mm_cvtsd_f64(t3);
}
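
Putting the pieces together for the float case (the same approach, just wrapped into one function that reuses the __m256 overload above):

static inline float horizontal_add(__m512 zmm) {
    __m256 low  = _mm512_castps512_ps256(zmm);
    __m256 high = _mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(zmm), 1));
    return horizontal_add(_mm256_add_ps(low, high));
}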

I got all this information and these functions from Agner Fog's Vector Class Library and the Intel Intrinsics Guide online.


