Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

vldmxcsr causing catastrophic slowdown

FastAndFurious

I'm seeing a catastrophic slowdown in floating point code that appears to be caused by use of vldmxcsr. With some effort I could make a test case, but before doing that I'm wondering whether there is already information somewhere that I can refer to.

My test case has a block of code in a loop with 50-100 instructions, a mixture of floating point instructions (e.g. vsubsd, vfmadd132sd, vstmxcsr) and integer instructions. There is a single vldmxcsr at the start of the loop to initialize FP state. The mxcsr value set by vldmxcsr is the default value 1F80H.

When I remove the one vldmxcsr instruction, the loop runs more than 5x faster. When I run the same code on an (admittedly rather old) AMD processor, I see no difference in performance at all when the instruction is removed.

On Linux, /proc/cpuinfo says this:

     vendor_id : GenuineIntel
     cpu family : 6
    model : 158
    model name : Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
    stepping : 13
    microcode : 0xf0

Is there any documentation that explains why use of this instruction has such a serious effect on performance and what I could try to mitigate it?

Thanks.

McCalpinJohn

You could start by telling us what values you are loading into the MXCSR register. Non-default values might enable all sorts of interrupts that you don't see with the default settings.

FastAndFurious

As I said in the original post, the value being loaded is the default 1F80H. The only flag that gets set in the code block is #P (inexact result), and I want to reset it. The code block is generated by a JIT engine, so it would take a bit of effort to convert it into an equivalent static test case. I can do this if absolutely necessary, but I wanted to check first whether this is a known problem.
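
For reference, here is a minimal sketch of how an MXCSR value such as 1F80H breaks down into its fields. The bit positions follow the documented MXCSR layout (sticky exception flags in bits 0-5, exception masks in bits 7-12, rounding control in bits 13-14); the macro and function names are illustrative only and do not come from any library.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative names; only the bit positions are architectural. */
#define MXCSR_FLAG_BITS  0x003Fu  /* bits 0-5: IE DE ZE OE UE PE (sticky flags) */
#define MXCSR_PE_FLAG    0x0020u  /* bit 5: precision (inexact) flag, i.e. #P */
#define MXCSR_DAZ        0x0040u  /* bit 6: denormals-are-zero */
#define MXCSR_FTZ        0x8000u  /* bit 15: flush-to-zero */

/* Sticky exception flags currently set. */
unsigned mxcsr_flags(uint32_t v)    { return v & MXCSR_FLAG_BITS; }

/* Exception mask bits (bits 7-12), shifted down to bits 0-5. */
unsigned mxcsr_masks(uint32_t v)    { return (v >> 7) & 0x3Fu; }

/* Rounding control (bits 13-14); 0 means round to nearest. */
unsigned mxcsr_rounding(uint32_t v) { return (v >> 13) & 0x3u; }

/* Print one MXCSR value field by field. */
void mxcsr_print(uint32_t v) {
    printf("mxcsr=%04x flags=%02x masks=%02x rc=%u daz=%d ftz=%d\n",
           (unsigned)v, mxcsr_flags(v), mxcsr_masks(v), mxcsr_rounding(v),
           (v & MXCSR_DAZ) != 0, (v & MXCSR_FTZ) != 0);
}
```

With these helpers, 0x1F80 decodes to all six exceptions masked, no flags set, and round to nearest, while a value of 0x1FA0 read back by vstmxcsr would mean that only the inexact flag was raised.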

Thanks.

McCalpinJohn

I apologize for missing the numerical value in your original post....  Eyes getting old along with the rest of me....

I don't see any errata that look relevant, so I would start by adding performance counter instrumentation to look at cycles, stalls, uops, etc. Somewhere I read that the RESOURCE_STALLS.ANY event (Event code 0xA2, Umask 0x01) includes stalls for writing the MXCSR register.
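
On Linux, one way to collect such counts from inside the program is the perf_event_open system call. The sketch below is an assumption about how the measurement could be set up, not something taken from this thread: the function names are hypothetical, and the raw encoding (umask in bits 8-15, event code in bits 0-7) is the usual convention for PERF_TYPE_RAW events on Intel cores.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Build a PERF_TYPE_RAW config value from an event code and a umask. */
uint64_t raw_event_config(uint8_t event, uint8_t umask) {
    return ((uint64_t)umask << 8) | event;
}

/* Open a user-mode-only counter on the calling thread (any CPU).
 * Returns a file descriptor, or -1 on failure (check errno). */
int open_counter(uint32_t type, uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;        /* start stopped; enabled in start_counter() */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Zero the counter and start counting. */
void start_counter(int fd) {
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
}

/* Stop counting and return the count, or -1 if the read fails. */
int64_t stop_counter(int fd) {
    uint64_t count = 0;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return -1;
    }
    return (int64_t)count;
}
```

A typical use would open one counter with open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES) and another with open_counter(PERF_TYPE_RAW, raw_event_config(0xA2, 0x01)), call start_counter() before the loop under test and stop_counter() after it. The open can fail with -1 when the kernel's perf_event_paranoid setting restricts access.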

FastAndFurious

Ok, thanks. If nothing is known about this problem then I'll put the effort into making something equivalent as a simple C test case and investigate that. If I manage to make something representative, I'll post it here.

FastAndFurious

I've made a C test case that shows the same behavior I reported above. It doesn't do anything real, but it has an inner loop that generates code similar to what the JIT engine produces. I'm using gcc and compiling with -O3.

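#include <stdio.h>

/* Recursive Fibonacci, used only to generate non-trivial input values. */
static int myfun(int i) {
    return (i>1) ? myfun(i-1) + myfun(i-2) : i;
}

int main(int argc, const char *argv[]) {

    int i, loops;
    int mxcsrOut = 0;

    double inputs[20];
    double res;

    for(i=0; i<20; i++) {
        inputs[i] = myfun(i)+1;
    }

    for(loops=0; loops<100000000; loops++) {

        /* Default MXCSR value. */
        int mxcsrIn = 0x1f80;

        /* Reset MXCSR at the start of every outer iteration. */
        asm volatile("vldmxcsr %0" :: "m"(mxcsrIn));

        res = 1;

        /* 20 dependent double-precision multiplies. */
        for(i=0; i<20; i++) {
            res *= inputs[i];
        }

        /* Read MXCSR back at the end of every outer iteration. */
        asm volatile("vstmxcsr %0" : "=m"(mxcsrOut));
    }

    printf("result: %g mxcsrOut=%x\n", res, mxcsrOut);

    return 0;
}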

Compiling and running this on my i9-9900K, I see:

real 0m5.123s
user 0m5.122s
sys 0m0.001s

If I comment out the vldmxcsr line, I see:

real 0m0.507s
user 0m0.502s
sys 0m0.004s

So more than 10x faster, even though each pass through the inner loop executes about 80 instructions in total, including 20 double-precision floating point multiplies. On the (rather elderly) AMD A10-5800K, both versions of the code run in 1.6 seconds.
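
To put these timings on a per-iteration basis, the arithmetic can be sketched as follows. The times and the iteration count are the ones reported above; the 3.6 GHz clock is an assumption taken from the nominal frequency in the model name and ignores turbo, so the cycle figures are only rough estimates. The function names are illustrative.

```c
#include <stdio.h>

/* Average wall-clock time per outer-loop iteration, in nanoseconds. */
double ns_per_iteration(double seconds, double iterations) {
    return seconds * 1e9 / iterations;
}

/* Rough cycle estimate; ghz is an assumed, fixed core clock. */
double cycles_per_iteration(double seconds, double iterations, double ghz) {
    return ns_per_iteration(seconds, iterations) * ghz;
}

/* Print the per-iteration cost of both variants of the test case. */
void print_summary(void) {
    const double iters      = 100000000.0; /* outer loop count in the test */
    const double with_ld    = 5.123;       /* "real" time with vldmxcsr */
    const double without_ld = 0.507;       /* "real" time without it */
    const double ghz        = 3.6;         /* ASSUMPTION: nominal clock, no turbo */

    printf("with vldmxcsr:    %.2f ns/iter, ~%.0f cycles\n",
           ns_per_iteration(with_ld, iters),
           cycles_per_iteration(with_ld, iters, ghz));
    printf("without vldmxcsr: %.2f ns/iter, ~%.0f cycles\n",
           ns_per_iteration(without_ld, iters),
           cycles_per_iteration(without_ld, iters, ghz));
    printf("ratio: %.1fx\n", with_ld / without_ld);
}
```

That works out to roughly 51 ns per outer iteration with the vldmxcsr and roughly 5 ns without it.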

I get the same speedup if I instead comment out the vstmxcsr line and leave the vldmxcsr in place, so it seems to be an interlock between these two instructions. Obviously in this example I could move the vstmxcsr outside the loop, but in the JIT code its result is used on every iteration. I'm not surprised that there is some slowdown, but this seems really excessive.
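
Since the stored value is needed on every iteration, one experiment that might be worth trying is to write MXCSR only when a flag is actually set. The sketch below is only an idea under stated assumptions: it uses the _mm_getcsr and _mm_setcsr intrinsics from <xmmintrin.h> in place of inline asm, the function name is hypothetical, and nothing in this thread shows that it avoids the slowdown. It also cannot help on iterations where the inexact flag is raised, because those still perform the write.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr (x86 SSE only) */

#define MXCSR_FLAG_BITS 0x003Fu  /* bits 0-5: sticky exception flags */

/* Read MXCSR and clear the sticky exception flags only if at least one
 * of them is set. Control bits (masks, rounding mode, DAZ, FTZ) are
 * left unchanged. Returns the value observed before any reset, which is
 * what a vstmxcsr at this point would have stored. */
unsigned int read_and_reset_flags(void) {
    unsigned int seen = _mm_getcsr();
    if (seen & MXCSR_FLAG_BITS) {
        _mm_setcsr(seen & ~MXCSR_FLAG_BITS);
    }
    return seen;
}
```

When the control bits are at their defaults, the value written back is 1F80H, the same value the test case loads unconditionally.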

Any insight into this would be welcome.

Thanks.

FastAndFurious

For clarification, the nested loops of the test compile to this:

400470: c7 44 24 0c 80 1f 00 00        movl   $0x1f80,0xc(%rsp)
400478: 0f ae 54 24 0c                 ldmxcsr 0xc(%rsp)
40047d: 48 8d 44 24 10                 lea    0x10(%rsp),%rax
400482: 66 0f 28 c1                    movapd %xmm1,%xmm0
400486: 66 2e 0f 1f 84 00 00 00 00 00  nopw   %cs:0x0(%rax,%rax,1)

400490: f2 0f 59 00                    mulsd  (%rax),%xmm0
400494: 48 83 c0 08                    add    $0x8,%rax
400498: 48 39 d0                       cmp    %rdx,%rax
40049b: 75 f3                          jne    400490

40049d: 0f ae 5c 24 08                 stmxcsr 0x8(%rsp)
4004a2: 83 e9 01                       sub    $0x1,%ecx
4004a5: 75 c9                          jne    400470

So the inner loop does 20 double-precision multiplies.
