Hi,
I use the Intel C++ compiler 17.0.4 on an Intel Xeon E5-2697 v4 (Broadwell) processor. I know that this processor supports fused multiply-add (FMA) instructions.
For this line of code:
yy += (A * B);
If I compile the C++ code to assembly, I can see vfmadd231pd 16(%rdx,%r11,8), %xmm6, %xmm1
However, when I use vYY = _mm256_fmadd_pd(vA, vB, vYY) in the C++ code, the compiler emits only separate vector multiply and add instructions:
vmulpd (%r15,%rsi,8), %ymm4, %ymm5
vaddpd %ymm1, %ymm5, %ymm1
Is there an explanation for this?
Thanks,
- Tags:
- CC++
- Development Tools
- Intel® C++ Compiler
- Intel® Parallel Studio XE
- Intel® System Studio
- Optimization
- Parallel Computing
- Vectorization
An explanation would need to refer to a specific example. For a dot product written in plain C code, the Intel compiler optimizes with extra riffling (multiple interleaved partial sums) to overcome the extra latency of FMA, at the expense of the overhead of combining the partial sums later. Intrinsics are likely to suppress riffling, so a situation can occur where FMA reduces performance. Other compilers handle this differently. The Intel compiler's riffling tends to be more than optimal, depending on data movement and other factors.
