Intel® ISA Extensions

Prefetch instructions

bronxzv
New Contributor II
1,662 Views

I'd be interested in information about the behavior of prefetch hint instructions such as prefetcht0, prefetchnta, prefetchw, ... on modern processors such as Sandy Bridge and Ivy Bridge. I ask because there is apparently nothing about it in the optimization guide [1]. It would arguably be a good thing for developers to know which cache level data are prefetched into with the different variants. I'd be glad if someone could provide a pointer to a detailed explanation.

[1] Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012

 

5 Replies
SergeyKostrov
Valued Contributor II
>>...I'll be interested to have information about the behavior of prefetch hints instructions such as prefetcht0, prefetchnta, prefetchw,...
>>for modern processors such as Sandy Bridge and Ivy Bridge...

There are also some optimization tips in the Intel C++ Compiler User and Reference Guides; please take a look.

I recently ran into an issue using _mm_prefetch on a computer with an Intel Core i7-3840QM CPU ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).

A piece of code with prefetching that works perfectly on older computers, for example with Pentium 4 or Atom N270 CPUs, doesn't provide performance gains when used on the computer with the Ivy Bridge CPU. I think this is because of the significantly larger L3, L2 and L1 cache lines, and it is clear that I don't fetch data properly in my for-loops. Unfortunately, I haven't yet had time to investigate it completely ( with VTune ), so _mm_prefetch is commented out for that configuration ( the code runs fast with and without prefetching ).
Bernard
Valued Contributor I

Hi bronxzv,

sorry for the off topic, but it is nice to see you again on the IDZ forums :)

bronxzv
New Contributor II

iliyapolak wrote:

Hi bronxzv,

sorry for the off topic, but it is nice to see you again on the IDZ forums :)

Hi iliyapolak,

indeed, it has been a while since I was last here, thanks for the warm welcome

hey, I see that in the meantime your black belt points have gone through the roof! 

bronxzv
New Contributor II

Sergey Kostrov wrote:

>>...I'll be interested to have information about the behavior of prefetch hints instructions such as prefetcht0, prefetchnta, prefetchw,...
>>for modern processors such as Sandy Bridge and Ivy Bridge...

There are also some optimization tips in the Intel C++ Compiler User and Reference Guides; please take a look.

thanks for your feedback Sergey,

I haven't found any processor-specific details in the C++ documentation so far; basically I have found:

- the documentation for the "prefetch insertion optimization" /Qopt-prefetch[:n]. I noticed that /Qopt-prefetch requires /O3, so I have to test it again: the last time I tried it I rushed my tests and compiled with /O2, with no visible change to my timings

- a minimal explanation of the "Cacheability Support Intrinsics" and the _MM_HINT_T0, etc. hints
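
For reference, here is a minimal sketch of how those hints are passed to _mm_prefetch. The function name touch_with_all_hints and the HAVE_SSE_PREFETCH macro are made up for illustration. The comments paraphrase the architectural description of each hint; which cache level a given processor actually fills remains implementation-specific, which is exactly the open question here.

```c
#include <stddef.h>

/* Use the SSE intrinsic only where it is available; otherwise do nothing. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */
#define HAVE_SSE_PREFETCH 1
#else
#define HAVE_SSE_PREFETCH 0
#endif

/* Hypothetical helper: issues one prefetch per hint for the cache line
   containing p. Returns the number of hints issued (0 without SSE).
   A prefetch is only a hint: it never faults and may be ignored. */
int touch_with_all_hints(const void *p)
{
#if HAVE_SSE_PREFETCH
    const char *addr = (const char *)p;
    _mm_prefetch(addr, _MM_HINT_T0);  /* temporal: all cache levels */
    _mm_prefetch(addr, _MM_HINT_T1);  /* temporal: level 2 cache and higher */
    _mm_prefetch(addr, _MM_HINT_T2);  /* temporal: level 3 and higher, or implementation-specific */
    _mm_prefetch(addr, _MM_HINT_NTA); /* non-temporal: minimize cache pollution */
    return 4;
#else
    (void)p;
    return 0;
#endif
}
```

The hint argument has to be a compile-time constant, so the four calls cannot be folded into a loop over a table of hints.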

 

Sergey Kostrov wrote:

I recently ran into an issue using _mm_prefetch on a computer with an Intel Core i7-3840QM CPU ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).

A piece of code with prefetching that works perfectly on older computers, for example with Pentium 4 or Atom N270 CPUs, doesn't provide performance gains when used on the computer with the Ivy Bridge CPU. I think this is because of the significantly larger L3, L2 and L1 cache lines, and it is clear that I don't fetch data properly in my for-loops. Unfortunately, I haven't yet had time to investigate it completely ( with VTune ), so _mm_prefetch is commented out for that configuration ( the code runs fast with and without prefetching ).

this is pretty much what I'm experiencing too on Ivy Bridge vs. P4 and older CPUs. I removed all explicit prefetches from loops a while ago (they are well handled by the hardware prefetchers). I have only a very few cases left with explicit prefetch, where I can see at best a 5% speedup in single-thread mode, down to 0% in multithread mode (8 threads with hyperthreading enabled on a Core i7 3770K)
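
To make the kind of loop concrete, here is a rough sketch of explicit prefetch in a sequential loop next to the plain version. The names sum_plain, sum_prefetch, PREFETCH_T0 and the PREFETCH_AHEAD distance of 64 elements are all assumptions for illustration, not taken from anyone's real code; the distance in particular would have to be tuned per machine.

```c
#include <stddef.h>

/* Map the prefetch to _mm_prefetch where SSE exists, else to a no-op. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)(p))
#endif

/* Hypothetical prefetch distance, in elements; must be tuned by measurement. */
#define PREFETCH_AHEAD 64

/* Baseline: rely on the hardware prefetchers only. */
long long sum_plain(const int *a, size_t n)
{
    long long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Same loop with an explicit prefetch PREFETCH_AHEAD elements ahead.
   The bounds check keeps the prefetched address inside the array. */
long long sum_prefetch(const int *a, size_t n)
{
    long long s = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_AHEAD < n)
            PREFETCH_T0(&a[i + PREFETCH_AHEAD]);
        s += a[i];
    }
    return s;
}
```

Both functions return the same sum; only the timing can differ. A linear walk like this is the pattern the hardware prefetchers already handle well, so the explicit version mostly adds instructions, and any gain has to be confirmed by timing both variants on the target CPU.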

btw, Linus Torvalds reports a serious slowdown in the Linux kernel due to explicit prefetch here: http://www.realworldtech.com/forum/?threadid=132668&curpostid=132772

Bernard
Valued Contributor I

Thanks bronxzv.

Yes, I am spending a lot of time on this forum, gaining knowledge and sharing what I know with other users.
