Support for Analyzers (Intel VTune™ Profiler, Intel Advisor, Intel Inspector)

movntdq instruction

I have an application in which I believe the Intel compiler generated the "movntdq" instruction to clear a memory region to zero. The compiler seems to generate the instruction twice, as in the following:
        pxor    xmm0, xmm0
label1: movntdq XMMWORD PTR [ebx+eax*4+04h], xmm0    ; 24,224
        movntdq XMMWORD PTR [ebx+eax*014h], xmm0     ; 2,013
        add     eax, 8h                              ; 380
        cmp     eax, edi
        jb      label1
I don't understand:
1. why movntdq was generated twice,
2. why the first one took more time than the second one according to the VTune analyzer, and
3. what XMMWORD is.
Can someone shed some light on this?

Hi vtuneuser!

1. The compiler is writing two memory locations per loop trip.

2. Probably a result of "event skid". I assume you are referring to the Clockticks event and that the numbers in your snippet are samples. With skid, a sample can be attributed to an instruction near the one that actually incurred the cost, so per-instruction counts are only approximate. Beyond that I can only speculate: the second store may incur less of a time penalty once the first has executed, perhaps because some processor internals work more efficiently on sequential accesses.

3. From the processor manual for MOVNTPD (the packed double-precision counterpart of MOVNTDQ):

"Moves the double quadword in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to minimize cache pollution during the write to memory. The source operand is an XMM register, which is assumed to contain two packed double-precision floating-point values. The destination operand is a 128-bit memory location."

So 'XMMWORD PTR [ebx+eax*4+04h]' means the 128-bit (16-byte) memory location whose address is ebx plus eax times 4 plus 4. XMMWORD is simply the assembler's size keyword for an operand of that width, the size of one XMM register.
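To make the address arithmetic concrete, here is a minimal sketch in C of how a [base + index*scale + displacement] operand resolves. The function name and the sample base value are made up for illustration. One thing it highlights: x86 addressing only encodes scale factors of 1, 2, 4, or 8, so the 'eax*014h' in the second store as posted may be a garbled 'eax*4+014h'. That reading is an assumption, not something the listing confirms.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of an x86 memory operand [base + index*scale + disp].
 * Hypothetical helper for illustration; the hardware only encodes
 * scale factors of 1, 2, 4, or 8. */
uint32_t effective_address(uint32_t base, uint32_t index,
                           uint32_t scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint32_t)disp;
}
```

With a made-up ebx of 0x1000, effective_address(0x1000, 0, 4, 0x04) gives 0x1004 on the first trip and effective_address(0x1000, 8, 4, 0x04) gives 0x1024 on the next, i.e. the address advances by 32 bytes each time eax grows by 8. That is exactly the room needed for two 16-byte stores per trip.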
Therefore, it looks like the compiler has optimized the zeroing of memory by using SSE/SSE2/SSE3 instructions, dividing the memory block into multiple chunks (since the second memory pointer is calculated as 'ebx+eax*014h') and writing both each time through the loop.
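For comparison, a loop of the same shape can be written by hand with SSE2 intrinsics. This is only a sketch, not the compiler's actual output: zero_stream32 is a hypothetical name, and it assumes the destination is already 16-byte aligned and the length is a multiple of 32. _mm_stream_si128 is the intrinsic that normally compiles to movntdq. On a target without SSE2 the sketch falls back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2: _mm_setzero_si128, _mm_stream_si128 */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Zero `bytes` bytes at `dst` with two 16-byte stores per loop trip.
 * Assumes dst is 16-byte aligned and bytes is a multiple of 32. */
void zero_stream32(void *dst, size_t bytes)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert(bytes % 32 == 0);
#if HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();               /* like pxor xmm0, xmm0 */
    char *p = (char *)dst;
    for (size_t off = 0; off < bytes; off += 32) {
        _mm_stream_si128((__m128i *)(p + off), zero);       /* first movntdq  */
        _mm_stream_si128((__m128i *)(p + off + 16), zero);  /* second movntdq */
    }
    _mm_sfence();   /* order the non-temporal stores before later stores */
#else
    memset(dst, 0, bytes);   /* portable fallback when SSE2 is unavailable */
#endif
}
```

Unrolling by two halves the number of add/cmp/jb instructions per byte written, which is one plausible reason the compiler emits the store twice.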
Hope that helps!
Thanks for your reply. I read in the Intel manual (System Programming Guide) that the double quadword data type needs to be aligned on a 16-byte boundary; a general-protection exception is generated if it is not. The source line from the C application that generated the movntdq instruction was
while (s < e) *s++ = 0; /* where s and e are pointers to pointers */
I got about 50 lines of assembly code for this single statement in the VTune time-based hot spot profile. I am still trying to understand the assembly. Is it correct that the compiler will generate code to ensure the alignment?
I believe that is correct, but I would direct you to the compiler support team on Intel Premier Support. A lot depends on the processor and compiler switches, and they would know how to help you.
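As a rough illustration of why one source line can expand into dozens of instructions, here is a sketch in C of the usual three-part shape: scalar stores until the pointer reaches a 16-byte boundary, aligned 16-byte non-temporal stores for the bulk, and scalar stores for the tail. This is an assumption about the general technique, not a reconstruction of what the Intel compiler actually emitted; zero_ptrs is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_setzero_si128, _mm_stream_si128 */
#endif

/* Zero the pointer slots in [s, e), like: while (s < e) *s++ = 0; */
void zero_ptrs(void **s, void **e)
{
    assert(s <= e);

    /* 1. Prologue: scalar stores until s sits on a 16-byte boundary. */
    while (s < e && ((uintptr_t)s & 15u) != 0)
        *s++ = 0;

#if defined(__SSE2__)
    /* 2. Main loop: aligned 16-byte non-temporal stores (movntdq). */
    const size_t per_vec = 16 / sizeof(void *);   /* pointers per 16 bytes */
    const __m128i zero = _mm_setzero_si128();
    while ((size_t)(e - s) >= per_vec) {
        _mm_stream_si128((__m128i *)s, zero);
        s += per_vec;
    }
    _mm_sfence();   /* order the non-temporal stores before later stores */
#endif

    /* 3. Epilogue: scalar stores for whatever is left. */
    while (s < e)
        *s++ = 0;
}
```

The prologue is what keeps the 16-byte stores from faulting: by the time the main loop runs, s is guaranteed to be aligned, whatever address the caller passed in.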
If you want us to look into the VTune analyzer issue, it would help for you to submit an issue and provide some sample code (both source and binary) so that we can investigate.

Thanks for your suggestion. I have submitted an issue with Premier Support to get some help with my questions.