The program below runs at HUGELY different speeds depending on the destination operand of the vpbroadcastq instruction. On my computer the program finishes in ~200 ms when the xmm1 register is the destination; with the ymm1 register, execution time increases to about 9 seconds. How is that possible? The CPU is an i7-1260P. On another machine, with an i3-10110U CPU, the difference is minor.
#include <stdio.h>

unsigned long reverseasm(unsigned long edges) {
    unsigned long res;
    asm (
        "mov $1, %%r8\n"
        "movq %%r8, %%xmm0\n"
#if 0
        "vpbroadcastq %%xmm0, %%xmm1\n"
#else
        "vpbroadcastq %%xmm0, %%ymm1\n"
#endif
        "mov $1, %[res]\n"
        : [res] "=r" (res)
        :
        : "r8", "ymm0", "ymm1"
    );
    return res;
}

int main(int argc, char *argv[])
{
    unsigned long sum = 0;
    for (unsigned i = 0; i < 100000000; ++i)
        sum += reverseasm(i);
    printf("sum=%ld\n", sum);
    return 0;
}
Hi,
the explanation can be found in the Intel Architectures Optimization Reference Manual Volume 1 (www.intel.com/sdm), section 3.11.6, "Instruction Sequence Slowdowns". The Golden Cove performance core eliminated some hardware fast paths for transitions between SSE and VEX (AVX) code and replaced them with microcode assists. The solutions are: 1) use VEX-encoded instructions (such as vmovq) wherever possible, or 2) insert a VZEROUPPER. Many more details can be found in the manual.
Roman
My findings so far:
- the slowness occurs only on the performance cores of the CPU; the efficient cores are not affected
- the slowness occurs only when AVX versions of the instructions are mixed with legacy SSE instructions.
In the code above, the movq instruction is a legacy SSE one. When it is changed to vmovq, the problem disappears.
Thanks.
I admit I was not aware that SSE and AVX instructions should not be mixed. I did read the Software Developer's Manual and didn't find any such information. I had wondered why all the instructions come in two variants, for example movq/vmovq. The vmovq instruction clears the upper part of the ymm register, but who cares? Preserving or zeroing the upper ymm part is rarely useful; programmers can cope without it.
Similarly, I'm wondering why the movupd instruction description states that it moves double-precision floating-point values. Is it important to have valid floating-point numbers in the registers, or can it also be any integer data? Is there any caveat?
Another interesting thing: the shld instruction takes about 11 CPU cycles on an efficient core, but only about one cycle on a performance core.
