movntdq instruction

vtuneuser · 12-19-2005

I have an application in which I believe the Intel compiler generated the "movntdq" instruction to clear a memory region to zero. The compiler seems to generate the instruction twice, as follows:

pxor xmm0, xmm0

label1: movntdq XMMWORD PTR [ebx+eax*4+04h], xmm0    24,224

movntdq XMMWORD PTR [ebx+eax*014h], xmm0    2,013

add eax, 0x8h    380

cmp eax, edi

jb label1

I don't understand:

1. why movntdq was generated twice,

2. why, according to the VTune analyzer, the first one took more time than the second, and

3. what XMMWORD is.

Can someone shed some light on this?

Thanks.

David_A_Intel1 · 12-19-2005

Hi vtuneuser!

1. The compiler is writing two memory locations per loop trip.

2. Probably a result of "event skid" (samples being attributed to an instruction near the one that actually caused the event). I assume you are referring to the Clockticks event and that the numbers in your snippet are samples. I can only speculate that the second access incurs less of a time penalty once the first instruction has executed. Perhaps some processor internals work more efficiently when accesses are sequential like that.

3. From the processor manual for MOVNTPD (the packed double-precision counterpart of MOVNTDQ):

"Moves the double quadword in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to minimize cache pollution during the write to memory. The source operand is an XMM register, which is assumed to contain two packed double-precision floating-point values. The destination operand is a 128-bit memory location."

So, 'XMMWORD PTR [ebx + eax*4 + 04h]' would mean the 128-bit memory location pointed to by adding eax times 4 to ebx and adding 4 to that.
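As a rough illustration of that address arithmetic, here is a small C sketch. The function name `effective_address` and the sample register values are made up for this example; only the base + index*scale + displacement form comes from the listing.

```c
#include <stdint.h>

/* Hypothetical helper: computes base + index*scale + disp, the same
 * arithmetic the processor performs for an operand such as
 * [ebx + eax*4 + 04h] (base = ebx, index = eax, scale = 4, disp = 04h). */
static uint32_t effective_address(uint32_t base, uint32_t index,
                                  uint32_t scale, uint32_t disp)
{
    return base + index * scale + disp;
}
```

For example, with ebx = 0x1000 and eax = 8 (illustrative values), `effective_address(0x1000, 8, 4, 0x04)` gives 0x1024. XMMWORD PTR then says that 16 bytes are accessed starting at that address.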

Therefore, it looks like the compiler has optimized zeroing memory by using SSE/SSE2/SSE3 instructions and dividing the memory block into multiple chunks (since the second memory pointer is calculated as 'ebx+eax*014h') and doing them both each time through the loop.
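To make the idea concrete, here is a minimal C sketch of a zeroing loop that issues two non-temporal stores per trip, using the SSE2 intrinsic `_mm_stream_si128` (which maps to movntdq). This is not the compiler's actual output. The function name `zero_stream_x2` is hypothetical, and the sketch assumes the two stores target adjacent 16-byte blocks, that the destination is 16-byte aligned, and that the size is a multiple of 32 bytes.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2: _mm_setzero_si128, _mm_stream_si128 */

/* Hypothetical helper: zero `bytes` bytes starting at `dst`.
 * Assumptions: dst is 16-byte aligned, bytes is a multiple of 32. */
static void zero_stream_x2(void *dst, size_t bytes)
{
    __m128i zero = _mm_setzero_si128();          /* like pxor xmm0, xmm0 */
    __m128i *p   = (__m128i *)dst;
    __m128i *end = p + bytes / sizeof(__m128i);

    while (p < end) {
        _mm_stream_si128(p,     zero);           /* first movntdq */
        _mm_stream_si128(p + 1, zero);           /* second movntdq, 16 bytes on */
        p += 2;                                  /* 32 bytes per loop trip */
    }
    _mm_sfence();  /* order the non-temporal stores before later stores */
}
```

Unrolling by two halves the number of compare-and-branch instructions per byte written, which is one plausible reason for a compiler to emit the store twice.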

Hope that helps!

vtuneuser · 12-20-2005

Thanks for your reply. I read in the Intel manual (System Programming Guide) that the double quadword data type needs to be aligned on a 16-byte boundary; a general-protection exception is generated if it is not. The source line from the C application that generated the movntdq instruction was:

while (s < e) *s++ = 0; /* where s and e are pointers to pointers */

I got about 50 lines of assembly code for this single statement from the VTune time-based hotspot profile. I am still trying to understand the assembly code. Is it correct that the compiler will generate code to ensure the alignment?

David_A_Intel1 · 01-04-2006

I believe that is correct, but I would direct you to the compiler support team on Intel Premier Support. A lot depends on the processor and compiler switches, and they would know how to help you.
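For reference, alignment handling of this kind is commonly structured as a scalar "peel" loop, an aligned main loop, and a scalar remainder loop, which would also explain why one source statement expands to many instructions. The C sketch below shows that general pattern for the `while (s < e) *s++ = 0;` loop. It is an illustration, not the Intel compiler's generated code. The function name `zero_ptrs` is hypothetical, and it assumes a null pointer is represented by all-zero bits, as on x86.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_setzero_si128, _mm_stream_si128 */

/* Hypothetical helper: set every pointer in [s, e) to null. */
static void zero_ptrs(void **s, void **e)
{
    /* Peel: ordinary stores until s sits on a 16-byte boundary. */
    while (s < e && ((uintptr_t)s & 15u) != 0)
        *s++ = 0;

    /* Main loop: aligned 16-byte non-temporal stores (movntdq)
     * while at least one full 16-byte block remains. */
    const size_t per_block = sizeof(__m128i) / sizeof(void *);
    const __m128i zero = _mm_setzero_si128();
    while ((size_t)(e - s) >= per_block) {
        _mm_stream_si128((__m128i *)s, zero);
        s += per_block;
    }
    _mm_sfence();  /* order the non-temporal stores before later stores */

    /* Remainder: ordinary stores for the leftover elements. */
    while (s < e)
        *s++ = 0;
}
```

Because the peel loop runs until the address is a multiple of 16, the streaming stores never see a misaligned operand, so the general-protection exception cannot occur in the main loop.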

If you want us to look into the VTune analyzer issue, it would help for you to submit an issue and provide some sample code (both source and binary) so that we can investigate.

Regards,

vtuneuser · 01-04-2006

Thanks for your suggestion. I have submitted an issue to Premier Support to get some help with my questions.