Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

ICC is short on YMM registers for AVX FLOPs test while GCC runs just fine

Adam_Strzelecki
Beginner
I wanted to test FLOPs of my CPU using x86 AVX code.
Please find my test tool at https://github.com/nanoant/flops
The code was taken from Mystical's Stack Overflow post.
Run make CXX=icpc to build with ICC and make CXX=gcc to build with GCC.
This synthetic test was written to use all available YMM registers in 64-bit mode. Unfortunately, ICC does not use all of the available YMM registers and instead passes some values back and forth through memory. Moreover, the instructions specified as intrinsics are not emitted in the order in which they appear in the source code.
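For readers who do not want to open the repository, here is a minimal sketch of the kind of loop being described. It is not the code from the linked project: the function name flops_sketch, the macros, and the constant values are all hypothetical. It assumes the layout visible in the GCC listing further down, namely 12 accumulators plus 4 loop-invariant constants, which together fill the 16 YMM registers available in 64-bit mode.

```cpp
#include <immintrin.h>

// Lets GCC/Clang build this one function without -mavx (an assumption about
// the toolchain); when AVX is already enabled the macro expands to nothing.
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Apply one AVX operation with the same constant to all 12 accumulators.
#define STEP12(op, c)                                                   \
    r0 = op(r0, c); r1 = op(r1, c); r2  = op(r2, c);  r3  = op(r3, c);  \
    r4 = op(r4, c); r5 = op(r5, c); r6  = op(r6, c);  r7  = op(r7, c);  \
    r8 = op(r8, c); r9 = op(r9, c); r10 = op(r10, c); r11 = op(r11, c);

// Hypothetical kernel: 12 accumulators + 4 constants = 16 YMM registers.
AVX_TARGET double flops_sketch(long iterations) {
    // Four loop-invariant constants (illustrative values). mul0 * mul1 is
    // about 1 and sub0 is about add0 * mul1, so the accumulators stay bounded.
    const __m256d mul0 = _mm256_set1_pd(1.4142135623730951);
    const __m256d mul1 = _mm256_set1_pd(0.7071067811865476);
    const __m256d add0 = _mm256_set1_pd(0.1);
    const __m256d sub0 = _mm256_set1_pd(0.1 * 0.7071067811865476);

    // Twelve independent accumulators, so no instruction waits on another.
    __m256d r0 = _mm256_set1_pd(1.0),  r1  = _mm256_set1_pd(2.0);
    __m256d r2 = _mm256_set1_pd(3.0),  r3  = _mm256_set1_pd(4.0);
    __m256d r4 = _mm256_set1_pd(5.0),  r5  = _mm256_set1_pd(6.0);
    __m256d r6 = _mm256_set1_pd(7.0),  r7  = _mm256_set1_pd(8.0);
    __m256d r8 = _mm256_set1_pd(9.0),  r9  = _mm256_set1_pd(10.0);
    __m256d r10 = _mm256_set1_pd(11.0), r11 = _mm256_set1_pd(12.0);

    for (long i = 0; i < iterations; ++i) {
        STEP12(_mm256_mul_pd, mul0)
        STEP12(_mm256_add_pd, add0)
        STEP12(_mm256_mul_pd, mul1)
        STEP12(_mm256_sub_pd, sub0)
    }

    // Fold everything into one value so the compiler cannot discard the work.
    __m256d s = _mm256_add_pd(_mm256_add_pd(r0, r1), _mm256_add_pd(r2, r3));
    s = _mm256_add_pd(s, _mm256_add_pd(_mm256_add_pd(r4, r5), _mm256_add_pd(r6, r7)));
    s = _mm256_add_pd(s, _mm256_add_pd(_mm256_add_pd(r8, r9), _mm256_add_pd(r10, r11)));
    double lanes[4];
    _mm256_storeu_pd(lanes, s);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

With this shape, a compiler that keeps every value in a register needs exactly 16 YMM registers and no stack traffic inside the loop, which is the behavior the test expects.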
Surprisingly, when compiling the SSE sample, everything is fine.
I tested it with ICC 11, 12, and the latest 13 Beta, on both Mac and Linux, to be sure it isn't a platform-dependent issue. All of them have this problem.
Here is an excerpt from objdump -d avxflops-icpc.o:

vsubpd %ymm14,%ymm5,%ymm10
vaddpd %ymm12,%ymm2,%ymm2
vaddpd 0x20(%rsp),%ymm12,%ymm3
vaddpd %ymm14,%ymm13,%ymm13
vmulpd 0x40(%rsp),%ymm9,%ymm1
vmovupd %ymm0,0xe0(%rsp)
vmovupd %ymm10,0xa0(%rsp)
vmulpd 0xe5b(%rip),%ymm8,%ymm5
vmulpd 0xe53(%rip),%ymm2,%ymm12
vmulpd 0xe0b(%rip),%ymm11,%ymm8
vsubpd %ymm14,%ymm7,%ymm0
vsubpd 0xe1e(%rip),%ymm4,%ymm7
vsubpd 0xe16(%rip),%ymm5,%ymm11

And here is the code generated by GCC 4.7.1. It uses all available registers, and all instructions are in the order specified in the source:

vsubpd %ymm2,%ymm13,%ymm13
vmulpd %ymm1,%ymm6,%ymm6
vaddpd %ymm0,%ymm11,%ymm11
vmulpd %ymm3,%ymm4,%ymm4
vsubpd %ymm2,%ymm9,%ymm9
vmulpd %ymm3,%ymm15,%ymm15
vaddpd %ymm0,%ymm7,%ymm7
vmulpd %ymm1,%ymm13,%ymm13
vsubpd %ymm2,%ymm5,%ymm5
vmulpd %ymm3,%ymm11,%ymm11
vaddpd %ymm2,%ymm14,%ymm14
vmulpd %ymm1,%ymm9,%ymm9
vsubpd %ymm0,%ymm12,%ymm12
vmulpd %ymm3,%ymm7,%ymm7
vaddpd %ymm2,%ymm10,%ymm10
vmulpd %ymm1,%ymm5,%ymm5
vsubpd %ymm0,%ymm8,%ymm8
vmulpd %ymm1,%ymm14,%ymm14
vaddpd %ymm2,%ymm6,%ymm6
vmulpd %ymm3,%ymm12,%ymm12
vsubpd %ymm0,%ymm4,%ymm4
vmulpd %ymm1,%ymm10,%ymm10
vaddpd %ymm0,%ymm15,%ymm15
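
One way to put numbers on the difference between the two listings is to count the distinct YMM registers and the memory operands in each. The helper below is only a sketch: the names YmmUsage and analyze_listing are made up for illustration, and it assumes AT&T-syntax text as printed by objdump -d, one instruction per line. It does nothing beyond simple pattern matching.

```cpp
#include <cstddef>
#include <regex>
#include <set>
#include <sstream>
#include <string>

// Summary of one disassembly listing (hypothetical helper type).
struct YmmUsage {
    std::set<int> registers;    // distinct %ymmN indices seen
    std::size_t stack_ops = 0;  // lines with an N(%rsp) operand, i.e. likely spills/reloads
    std::size_t rip_ops = 0;    // lines with an N(%rip) operand, i.e. constants read from memory
};

// Scan AT&T-syntax disassembly text, one instruction per line.
YmmUsage analyze_listing(const std::string& listing) {
    static const std::regex ymm_re("%ymm([0-9]+)");
    YmmUsage usage;
    std::istringstream in(listing);
    std::string line;
    while (std::getline(in, line)) {
        // Record every YMM register index mentioned on this line.
        for (std::sregex_iterator it(line.begin(), line.end(), ymm_re), end;
             it != end; ++it) {
            usage.registers.insert(std::stoi((*it)[1].str()));
        }
        // Count each line at most once per kind of memory operand.
        if (line.find("(%rsp)") != std::string::npos) ++usage.stack_ops;
        if (line.find("(%rip)") != std::string::npos) ++usage.rip_ops;
    }
    return usage;
}
```

Run over the two excerpts above, the ICC listing should report several stack and RIP-relative operands, while the GCC listing should report none and touch all 16 registers.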

Is there any way to tell ICC to use all available YMM registers and not reorder the AVX code? I tried various optimization levels, but none of them makes the code behave as desired.
1 Reply
Georg_Z_Intel
Employee
Hello,

I took a look at the provided example and can confirm the overhead from superfluous load/store operations. Since your loop uses intrinsics exclusively, there should not be any register pressure that would force the compiler to save and restore registers. I have therefore filed a defect with engineering (DPD200293901). I'll let you know about the status.

Thank you for your test sample!

Best regards,

Georg Zitzlsberger