Hello friends,
I have successfully done threading and data parallelization, but I am really interested in task parallelization. How do I do it? Where should I start?
Please guide me.
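As one possible starting point, here is a minimal sketch of task parallelism using OpenMP tasks in C. It assumes an OpenMP-capable compiler (for example built with `-fopenmp`); the function names `sum_range`, `count_multiples` and `run_two_tasks` are made up for illustration. Two unrelated jobs are handed out as separate tasks instead of splitting one loop over the data.

```c
#include <stddef.h>

/* Job 1 (illustrative): sum of 1..n. */
static long sum_range(long n) {
    long s = 0;
    for (long i = 1; i <= n; ++i)
        s += i;
    return s;
}

/* Job 2 (illustrative): how many of 1..n are multiples of k. */
static long count_multiples(long n, long k) {
    long c = 0;
    for (long i = 1; i <= n; ++i)
        if (i % k == 0)
            ++c;
    return c;
}

/* Run both jobs as independent OpenMP tasks.
   If OpenMP is not enabled, the pragmas are ignored and the
   two jobs simply run one after the other. */
void run_two_tasks(long *total, long *multiples) {
    #pragma omp parallel
    #pragma omp single
    {
        /* One thread creates the tasks; any thread in the team may run them. */
        #pragma omp task
        *total = sum_range(1000);

        #pragma omp task
        *multiples = count_multiples(1000, 7);

        /* Wait until both tasks have finished. */
        #pragma omp taskwait
    }
}
```

The difference from data parallelization is that each task does a different kind of work; the two results are written to different locations, so no reduction or lock is needed here.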
Read this article: http://people.redhat.com/drepper/cpumemory.pdf

iliyapolak wrote:
>>>But I am interested in using the CPU cache in my own program manually or forcefully>>>
C/C++ are not cache-aware; you need to optimize your programs yourself. You can use the Intel manuals for that.
I will do some research on the web to find useful information.

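To illustrate what "optimizing by yourself" can mean in practice, here is a small sketch in C. The function names are hypothetical and the matrix is assumed to be stored row by row in one flat array. Both functions compute the same sum, but the first walks memory in address order while the second jumps `cols` elements on every step.

```c
#include <stddef.h>

/* Cache-friendly: the inner loop touches consecutive addresses,
   so every fetched cache line is fully used before moving on. */
long sum_row_major(const int *m, size_t rows, size_t cols) {
    long s = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

/* Same result, but each inner-loop step strides by cols elements,
   so for large matrices most accesses land on a different cache line. */
long sum_col_major(const int *m, size_t rows, size_t cols) {
    long s = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}
```

The language gives no way to place data in the cache explicitly; what you control is the order and layout of the accesses. How much the two versions differ in speed depends on the matrix size and the machine, so it is worth timing both.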
That means we cannot use hardware and software prefetch in the same application. If we disable hardware prefetch in the BIOS, then it will obviously do an effective job; I agree with you. That means there is no way to gain further optimization by using hardware and software prefetch in the same application. Suppose, in my code above, that when the inner loop starts to execute, the CPU has not cached any of the data in the CPU cache. Is that right or wrong? If we tell the processor which data the next loop iteration needs, it should help to reduce latency. I am trying to reduce the latency of memory access. If I am thinking in the wrong direction, please tell me how this code is executed by the CPU.
Thanks

TimP (Intel) wrote:
I'm not surprised if _mm_prefetch makes little difference here. You are depending primarily on automatic hardware prefetch, if you haven't turned it off (in BIOS setup or by MSR), and hardware prefetch ought to do the job well.
Nitpicks: if you run on a multiple-socket platform, one of your prefetches appears to prefetch to the wrong CPU. When that doesn't happen, you may accelerate the first cache line for the next inner loop but delay the effectiveness of hardware prefetch.

Hi Sergey,
Do you get good results with FastMemCopy128? Can you post full sample code so that I can study it? In my code I don't get any good results. Please give me any sample code with memory optimization.
Thanks

Sergey Kostrov wrote:
Hi everybody,
>>...
>>#pragma omp parallel for reduction (+:sum)
>>for(i=0;i<...)
>>{
>>_mm_prefetch(&A[(i+1)*LOOP],3);
>>for(j=0;j<...)
>>{
>>sum += A[(i*LOOP)+j];
>>}
>>}
>>...
Please take a look at the partial example of the FastMemCopy128 function which I posted a couple of days ago. You're using _mm_prefetch in a different way and it doesn't look good.
We constantly have discussions on the applications and usefulness of the _mm_prefetch intrinsic function or the prefetch instruction (as inline assembler in C/C++ code). Since Intel invented it, prefetch should work. However, it has to be applied and used properly. Your case is more complex because the _mm_prefetch intrinsic function is used inside an OpenMP construct (is that the reason for the problem?) and I have never tried to do the same.

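For anyone who wants to experiment with the quoted loop, here is a compilable sketch of the same pattern. The loop bounds were lost in the quote above, so the ones used here are assumptions: `LOOP` is set to 1024, the outer loop runs over `size / LOOP` rows, and the function name `sum_rows` is made up. The prefetch is wrapped in a macro that becomes a no-op when the compiler does not define `__SSE__`.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
/* Named hint instead of the literal 3: the numeric values of the
   hint constants may differ between compilers' headers. */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)0)
#endif

#define LOOP 1024 /* assumed inner-loop length */

/* Sum size ints, processed as rows of LOOP elements each. */
long sum_rows(const int *A, size_t size) {
    long sum = 0;
    long rows = (long)(size / LOOP);
    long i;

    assert(size % LOOP == 0); /* this sketch ignores a partial last row */

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < rows; i++) {
        /* Prefetch the start of the next row, but never past the array. */
        if (i + 1 < rows)
            PREFETCH_T0(&A[(i + 1) * LOOP]);
        for (long j = 0; j < LOOP; j++)
            sum += A[i * LOOP + j];
    }
    return sum;
}
```

Note that with a static schedule the row `i + 1` at the end of one thread's chunk belongs to another thread, which is one way a prefetch can end up on the wrong core. Timing this version with and without the `PREFETCH_T0` line is the simplest way to see whether it changes anything on a given machine.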
SoA layout usage makes more sense in perfectly vectorised data sets.

Sergey Kostrov wrote:
Note: SoA stands for Structure of Arrays
>>...SoA is not applicable...
As you can see in Sighere's example, just one block of memory for a 1-D array is created:
>>...
>>int *A = ( int * )malloc( sizeof( int ) * SIZE );
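For readers unfamiliar with the term, here is a short sketch of the two layouts in C. The record type with `x`, `y`, `z` fields is purely hypothetical; the array `A` in this thread has a single `int` per element, so there is nothing to split into separate arrays, which is the point made above.

```c
#include <stddef.h>

#define N 1024 /* assumed capacity for the illustration */

/* Array of Structures (AoS): x, y, z of one element are adjacent. */
struct PointAoS {
    float x, y, z;
};

/* Structure of Arrays (SoA): each field is its own contiguous array. */
struct PointsSoA {
    float x[N];
    float y[N];
    float z[N];
};

/* Reading only x from AoS strides over y and z as well. */
float sum_x_aos(const struct PointAoS *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;
    return s;
}

/* Reading only x from SoA is a unit-stride walk, which is the
   access pattern that vectorizes most easily. */
float sum_x_soa(const struct PointsSoA *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];
    return s;
}
```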