_mm512_prefetch_i32[ext]gather_ps clarifications

Peter_B_9 · ‎10-07-2013

The documentation at http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-254C3F9D-5DDD-4B27-95E2-B6986B4A852B.htm indicates that "Only the lower eight elements are used as indices. The upper eight elements are not used." Since this is a single-precision gather, shouldn't all 16 elements be used as indices? Is this a documentation error, or does this prefetch really operate on only half of the elements? (Perhaps the prefetch unit is limited to 8 addresses?)

What is the purpose of the conv argument to the prefetch instructions? Presumably the data isn't actually being converted yet. Is this just a hint about how many bytes will be read from each address?

The instruction is documented to prefetch a float32 vector. I assume that it's equally effective to prefetch an int32 vector (or, in fact, a number of int32s which will be read using legacy x86 instructions). Can someone please confirm this?

Kevin_D_Intel · ‎10-09-2013

I asked Development about your questions.

The statement cited from the User Guide is a documentation error, apparently a mistaken copy-and-paste from a variant that gathers 64-bit elements using only the lower eight 32-bit indices, such as _mm512_i32lo[ext]gather_pd. I notified our Documentation team about this (internal tracking id below) and will update this post once it is corrected.

Regarding conv, they said "yes, it is a hint about how many bytes will be read from each address."

Regarding prefetching an int32, they concur, "I believe this is true - it's equally effective to prefetch an int32 vector."

(Internal tracking id: DPD200248812)

Alastair_M_ · ‎06-24-2014

Hi Peter and Kevin,

Sorry to bump this thread after so long but I had a related question that doesn't seem to be addressed elsewhere on the forum.

Can the _mm512_prefetch_i32[ext]gather_ps intrinsics be used to prefetch doubles?

My understanding was that each index prefetches at least one 64-byte cache line. Is that correct?

E.g. if I want to prefetch doubles at indices {0,1,2,3,100,101,102,103, etc.}, would I need to create an index vector containing each 32-bit portion of the double, or is it sufficient to prefetch each unique cache line?

I am trying to prefetch each unique cache line at the moment (by doing a modulus operation on the gather indices and scaling appropriately), but without success: the performance of the sparse matrix operation actually degrades.
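
To make the approach described above concrete, here is a minimal sketch in portable C of reducing a set of gather indices to the unique cache lines they touch. It is not the code from this post and calls no Intel intrinsic; the helper name unique_cache_lines and its parameters are hypothetical. It assumes a 64-byte cache line, a 64-byte-aligned base address, naturally aligned elements and non-negative indices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u

/* Hypothetical helper, not part of any Intel API.
 * Maps each element index to the cache line it falls in (relative to a
 * 64-byte-aligned base) and writes each distinct line number once, in
 * first-seen order, to lines_out. Returns the number of distinct lines.
 * lines_out must have room for n entries. */
size_t unique_cache_lines(const int32_t *idx, size_t n,
                          size_t elem_bytes, int32_t *lines_out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(idx[i] >= 0); /* assumption: no negative indices */

        /* Byte offset of the element, divided by the line size. */
        int32_t line = (int32_t)(((uint64_t)idx[i] * elem_bytes)
                                 / CACHE_LINE_BYTES);

        /* Linear search is fine for the 8 or 16 indices of one vector. */
        int seen = 0;
        for (size_t j = 0; j < count; ++j) {
            if (lines_out[j] == line) {
                seen = 1;
                break;
            }
        }
        if (!seen) {
            lines_out[count++] = line;
        }
    }
    return count;
}
```

For the indices {0,1,2,3,100,101,102,103} with 8-byte elements, this returns two lines, 0 and 12, whose first elements are at indices 0 and 96. How those line numbers are turned back into an index vector depends on the scale passed to the prefetch intrinsic, which this sketch leaves open.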

I can't find any reference elsewhere on how to properly use the prefetch gather intrinsics with 64-bit types.

Best regards,

Alastair