I'm seeing a catastrophic slowdown in floating point code that appears to be caused by use of vldmxcsr. With some effort I could make a test case, but before doing that I'm wondering whether there is already information somewhere that I can refer to.
My test case has a block of code in a loop with 50-100 instructions, a mixture of floating-point instructions (e.g. vsubsd, vfmadd123sd, vstmxcsr) and integer instructions. There is a single vldmxcsr at the start of the loop to initialize the FP state; the MXCSR value it loads is the default, 1F80H.
When I remove the one vldmxcsr instruction, the loop runs more than 5x faster. When I run the same code on an (admittedly rather old) AMD processor, I see no difference in performance at all when the instruction is removed.
On Linux, /proc/cpuinfo says this:
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
stepping : 13
microcode : 0xf0
Is there any documentation that explains why use of this instruction has such a serious effect on performance, and what I could do to mitigate it?
Thanks.
You could start by telling us what values you are loading into the MXCSR register. Non-default values might enable all sorts of floating-point exceptions that you don't see with the default settings.
As I said in the original post, the value being loaded is the default, 1F80H. The only flag that gets set in the code block is #P (inexact result), and I want to reset it. The code block is generated by a JIT engine, so it would take a bit of effort to convert it into an equivalent static test case. I can do that if absolutely necessary, but I wanted to check first whether this is a known problem.
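Just to be concrete about the bits involved, here is an illustration in plain C; the _mm_getcsr()/_mm_setcsr() intrinsics from <xmmintrin.h> are only stand-ins for the vstmxcsr/vldmxcsr that the JIT actually emits. 1F80H sets the six exception mask bits (bits 7-12) with round-to-nearest, and #P is the inexact status flag, bit 5 of MXCSR:

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    _mm_setcsr(0x1f80);                  /* default MXCSR: all exceptions masked, round-to-nearest */

    volatile double a = 1.0, b = 3.0;
    volatile double x = a / b;           /* inexact result, so the #P (PE) status flag gets set */
    (void)x;

    unsigned int csr = _mm_getcsr();     /* plays the role of vstmxcsr */
    printf("MXCSR = %#x, inexact = %u\n", csr, (csr >> 5) & 1);

    _mm_setcsr(csr & ~0x3fu);            /* clear just the six status flags, keep masks/rounding */
    return 0;
}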
Thanks.
I apologize for missing the numerical value in your original post.... Eyes getting old along with the rest of me....
I don't see any errata that look relevant, so I would start by adding performance counter instrumentation to look for cycles, stalls, uops, etc. Somewhere I read that the RESOURCE_STALLS.ANY event (Event code 0xA2, Umask 0x01) includes stalls for writing the MXCSR register.
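If it helps, something along these lines (an untested sketch, not production code) reads that event around a region of interest using the perf_event_open() syscall. For raw events the config packs the umask into bits 15:8 and the event code into bits 7:0, so event 0xA2 with umask 0x01 becomes 0x01a2; it is worth double-checking that encoding against the event list for your particular model.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_raw_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = config;                /* (umask << 8) | event code */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0 (this process), cpu = -1 (any CPU), no group, no flags */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int fd = open_raw_counter(0x01a2);   /* RESOURCE_STALLS.ANY: event 0xA2, umask 0x01 */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* stand-in for the loop under test */
    volatile double r = 1.0;
    for (long i = 0; i < 100000000; i++)
        r *= 1.0000001;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        perror("read");
    printf("RESOURCE_STALLS.ANY = %llu (r = %g)\n", (unsigned long long)count, r);
    close(fd);
    return 0;
}

For a quick look without writing any code, "perf stat -e cycles,instructions,r01a2 ./test" should count the same raw event.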
Ok, thanks. If nothing is known about this problem, then I'll put the effort into building an equivalent simple C test case and investigate that. If I manage to make something representative, I'll post it here.
I've made a C test case that shows the same behavior I'm reporting here. It doesn't do anything real, but its inner loop compiles to code similar to what the JIT engine generates. I'm using gcc and compiling with -O3.
#include <stdio.h>

/* Recursive helper (Fibonacci-style), just to generate some input values. */
static int myfun(int i) {
    return (i > 1) ? myfun(i - 1) + myfun(i - 2) : i;
}

int main(int argc, const char *argv[]) {
    int i, loops;
    int mxcsrOut = 0;
    double inputs[20];
    double res;

    for (i = 0; i < 20; i++) {
        inputs[i] = myfun(i) + 1;
    }

    for (loops = 0; loops < 100000000; loops++) {
        int mxcsrIn = 0x1f80;   /* default MXCSR value */
        asm volatile("vldmxcsr %0" :: "m"(mxcsrIn));
        res = 1;
        for (i = 0; i < 20; i++) {
            res *= inputs[i];
        }
        asm volatile("vstmxcsr %0" : "=m"(mxcsrOut));
    }

    printf("result: %g mxcsrOut=%x\n", res, mxcsrOut);
    return 0;
}
Compiling and running this on my i9-9900K, I see:
real 0m5.123s
user 0m5.122s
sys 0m0.001s
If I comment out the vldmxcsr line, I see:
real 0m0.507s
user 0m0.502s
sys 0m0.004s
So more than 10x faster, even though the inner loop contains about 80 instructions in total, including 20 double-precision floating point multiplies. On the (rather elderly) AMD A10-5800K, both versions of the code run in 1.6 seconds.
I get the same speedup if I instead comment out the vstmxcsr line, so it seems to be an interlock between the two instructions. Obviously in this example I could move the vstmxcsr outside the loop, but in the JIT code its result is used on every iteration. I'm not surprised that there is some slowdown, but this seems really excessive.
Any insight into this would be welcome.
Thanks.
For clarification, the nested loops of the test compile to this:
  400470:  c7 44 24 0c 80 1f 00 00          movl     $0x1f80,0xc(%rsp)
  400478:  0f ae 54 24 0c                   ldmxcsr  0xc(%rsp)
  40047d:  48 8d 44 24 10                   lea      0x10(%rsp),%rax
  400482:  66 0f 28 c1                      movapd   %xmm1,%xmm0
  400486:  66 2e 0f 1f 84 00 00 00 00 00    nopw     %cs:0x0(%rax,%rax,1)
  400490:  f2 0f 59 00                      mulsd    (%rax),%xmm0
  400494:  48 83 c0 08                      add      $0x8,%rax
  400498:  48 39 d0                         cmp      %rdx,%rax
  40049b:  75 f3                            jne      400490
  40049d:  0f ae 5c 24 08                   stmxcsr  0x8(%rsp)
  4004a2:  83 e9 01                         sub      $0x1,%ecx
  4004a5:  75 c9                            jne      400470
So the inner loop does 20 double-precision multiplies.
