SDE 9.38.0 regressing GFNI?

Beulich__Jan · ‎05-29-2024

According to my testing with the Xen built-in emulator test harness, at least the 512-bit form of vgf2p8mulb appears to no longer be handled correctly, whereas 9.33.0 handles the exact same code just fine. The non-EVEX forms, otoh, still appear to be working correctly. Sadly, digging out details on what exactly is going wrong isn't easy, first and foremost because I have no idea whether, and if so how, SDE can be run with a debugger (gdb) attached to the process being emulated.

AdyT_Intel · ‎05-30-2024

Intel SDE is using the same emulation routine for the VEX and EVEX forms.

The only difference is the handling of the K-mask operand (supported only in the EVEX form).

Beulich__Jan · ‎05-30-2024

Which may make it less obvious what's going wrong in SDE, but the observation / regression is there nevertheless.

AdyT_Intel · ‎06-06-2024

Can you provide more details on the issue? We can verify it against real HW.

You can use the debug-trace utility to create an instruction trace with emulation and another without emulation (when running on a platform that supports the instructions), and then compare the two traces. See the Intel SDE user's guide, available inside the kit under the doc directory, for more details. You can also instruct Intel SDE to force emulation even when the host supports the instructions.

Thanks.

Beulich__Jan · ‎06-11-2024

I can confirm that on an SPR things work unless emulation is forced (the original observation was on SKL/SKX). This log fragment when emulation is active is perhaps already telling enough (stripping the double/float part of the ZMM<n> printing):

Read 3f3e3d3c_3b3a3938_37363534_33323130_
2f2e2d2c_2b2a2928_27262524_23222120_
1f1e1d1c_1b1a1918_17161514_13121110_
0f0e0d0c_0b0a0908_07060504_03020100 = *(UINT512*)0x7ffe9c86ce80
INS 0x0000000000100034 AVX512EVEX vmovdqa64 zmm5, zmmword ptr [rsp-0x78]
ZMM5 := 3f3e3d3c_3b3a3938_37363534_33323130
_2f2e2d2c_2b2a2928_27262524_23222120
_1f1e1d1c_1b1a1918_17161514_13121110
_0f0e0d0c_0b0a0908_07060504_03020100
Read 0x0102040810204080 = *(UINT64*)0x1000e0
INS 0x000000000010003f AVX512EVEX vgf2p8affineinvqb zmm1, zmm5, qword ptr [rip+0x96]{1to8}, 0x0
ZMM1 := 00000000_00000000_00000000_00000000
_00000000_00000000_00000000_00000000
_247a35bd_3efc76e6_7d2cbd52_91ce70c0
_74da8d6c_a2a17deb_d17b52cb_f68d0100

Pretty clearly the upper half of ZMM1 shouldn't be all zero here. For comparison the same fragment when no emulation is in use:

Read 3f3e3d3c_3b3a3938_37363534_33323130_
2f2e2d2c_2b2a2928_27262524_23222120_
1f1e1d1c_1b1a1918_17161514_13121110_
0f0e0d0c_0b0a0908_07060504_03020100 = *(UINT512*)0x7ffcb84e4cc0
INS 0x0000000000100034 AVX512EVEX vmovdqa64 zmm5, zmmword ptr [rsp-0x78]
ZMM5 := 3f3e3d3c_3b3a3938_37363534_33323130
_2f2e2d2c_2b2a2928_27262524_23222120
_1f1e1d1c_1b1a1918_17161514_13121110
_0f0e0d0c_0b0a0908_07060504_03020100
Read 0x0102040810204080 = *(UINT64*)0x1000e0
INS 0x000000000010003f AVX512EVEX vgf2p8affineinvqb zmm1, zmm5, qword ptr [rip+0x96]{1to8}, 0x0
ZMM1 := 1959bb77_6f2035f2_426639f3_6c92452c
_c2a24430_15980ac1_c9a84d55_f15a6e3a
_b2ee40ff_ccfd3f58_5f602b99_4baab474
_c7e5e1b0_c0294fe8_d17b52cb_f68d0100
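
To see which 64-bit lanes of the two ZMM1 dumps above agree, the values can be compared mechanically. The following is only a sketch: `parse_zmm` and `matching_qwords` are hypothetical helper names, not part of SDE, and the input is simply the two ZMM1 values pasted from the traces.

```python
# Hypothetical helpers for comparing two ZMM dumps copied from the traces.
QWORD_MASK = (1 << 64) - 1


def parse_zmm(dump: str) -> int:
    """Turn an underscore/newline-separated hex dump into one integer."""
    return int("".join(dump.split()).replace("_", ""), 16)


def matching_qwords(a: int, b: int, lanes: int = 8) -> list[int]:
    """Return the indices of the 64-bit lanes (0 = lowest) that are equal."""
    return [i for i in range(lanes)
            if (a >> (64 * i)) & QWORD_MASK == (b >> (64 * i)) & QWORD_MASK]


# ZMM1 after vgf2p8affineinvqb with emulation forced.
emulated = parse_zmm("""
00000000_00000000_00000000_00000000
_00000000_00000000_00000000_00000000
_247a35bd_3efc76e6_7d2cbd52_91ce70c0
_74da8d6c_a2a17deb_d17b52cb_f68d0100
""")

# ZMM1 after the same instruction executed natively.
native = parse_zmm("""
1959bb77_6f2035f2_426639f3_6c92452c
_c2a24430_15980ac1_c9a84d55_f15a6e3a
_b2ee40ff_ccfd3f58_5f602b99_4baab474
_c7e5e1b0_c0294fe8_d17b52cb_f68d0100
""")

print(matching_qwords(emulated, native))
```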

Notably only the low 64 bits of ZMM1 actually match between both, so it's more than just result truncation. A similar pattern can be observed for a subsequent vgf2p8mulb:

Read 3f3e3d3c_3b3a3938_37363534_33323130_
2f2e2d2c_2b2a2928_27262524_23222120_
1f1e1d1c_1b1a1918_17161514_13121110_
0f0e0d0c_0b0a0908_07060504_03020100 = *(UINT512*)0x7ffe9c86ce80
INS 0x0000000000100055 AVX512EVEX vmovdqa64 zmm6, zmmword ptr [rsp-0x78]
ZMM6 := 3f3e3d3c_3b3a3938_37363534_33323130
_2f2e2d2c_2b2a2928_27262524_23222120
_1f1e1d1c_1b1a1918_17161514_13121110
_0f0e0d0c_0b0a0908_07060504_03020100

...

Read 00000000_00000000_00000000_00000000_
00000000_00000000_00000000_00000000_
247a35bd_3efc76e6_7d2cbd52_91ce70c0_
74da8d6c_a2a17deb_d17b52cb_f68d0100 = *(UINT512*)0x7ffe9c86cf00
INS 0x000000000010006b AVX512EVEX vgf2p8mulb zmm1, zmm6, zmmword ptr [rsp+0x8]
ZMM1 := 00000000_00000000_00000000_00000000
_00000000_00000000_00000000_00000000
_b11b2f78_641bca93_f91e5a04_7bd331b4
_da608be6_9a26b819_01010101_01010100
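
For reference, the expected results can be modelled in a few lines of Python. This is a minimal sketch, not SDE's implementation: it assumes the GFNI reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) and the per-byte bit ordering of the affine matrix as described in the SDM, and the helper names (`gf_mul`, `gf_inv`, `affine_inv_byte`) are made up for illustration. It also assumes the memory operand of vgf2p8mulb is the stored result of the earlier vgf2p8affineinvqb, as the Read line suggests.

```python
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo POLY (per-byte vgf2p8mulb)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return r


def gf_inv(x: int) -> int:
    """Brute-force multiplicative inverse; 0 maps to 0."""
    if x == 0:
        return 0
    return next(y for y in range(1, 256) if gf_mul(x, y) == 1)


def affine_inv_byte(matrix: int, x: int, imm: int = 0) -> int:
    """Per-byte model of vgf2p8affineinvqb: invert x, then apply matrix and imm."""
    inv = gf_inv(x)
    out = 0
    for i in range(8):
        row = (matrix >> (8 * (7 - i))) & 0xFF      # matrix byte 7-i feeds bit i
        parity = bin(row & inv).count("1") & 1
        out |= (parity ^ ((imm >> i) & 1)) << i
    return out


MATRIX = 0x0102040810204080       # broadcast qword from the trace (identity)
src = bytes(range(64))            # ZMM5 / ZMM6 contents: 00, 01, ..., 3f

inv_result = bytes(affine_inv_byte(MATRIX, b) for b in src)
mul_result = bytes(gf_mul(a, b) for a, b in zip(src, inv_result))

print(f"{int.from_bytes(inv_result, 'little'):0128x}")
print(f"{int.from_bytes(mul_result, 'little'):0128x}")
```

With the identity matrix and a zero immediate, the first instruction reduces to a plain per-byte inverse, so multiplying the source by that result should give 0x01 in every byte except byte 0. That matches the low 64 bits of the emulated vgf2p8mulb result above, but not its remaining lanes.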

AdyT_Intel · ‎06-17-2024

I found that the first issue you reported is due to an optimization that didn't handle broadcasting correctly.

I fixed this issue internally. Until a new Intel SDE version is available, you can turn off the optimization by using the knob: -emu-mem-fast 0

I checked the second report (with vgf2p8mulb) and I was not able to see differences with native execution.

Beulich__Jan · ‎06-19-2024

I can confirm that with this extra option things work properly again here. Perhaps the vgf2p8mulb issue was just a knock-on failure. I'll hold off on marking this resolved, though, until I can test with the next SDE version, without the extra command line option.

AdyT_Intel · ‎06-19-2024

Thanks for the update, Ady.