If you have this pdf file - Page 2

Swapnil_J_ · ‎12-30-2012

Hello friends,

I have successfully implemented threading and data parallelization, but I am really interested in task parallelization. How do I do it? Where should I start?

Please guide me.

Sigehere_S_ · ‎01-07-2013

Hi Sergey Kostrov, first, thanks for the immediate response. Yes, please give me some more detail as early as possible. I am new to Intel Composer XE, and reading all the documentation will take me a lot of time; I know it would help me, but it takes a long time. I have read the Intel Optimization Manual and it is really helpful, but it does not describe manual control of the CPU cache. If you have already read these documents, could you tell me which portion of the manual is useful? That would help me. Thanks!!!

Bernard · ‎01-07-2013

>>>But, I am interested in using CPU cache in my own program manually or forcefully>>>
C/C++ are not cache-aware, so you need to optimize your programs yourself; you can use the Intel manuals for that. I will do some research on the web to find useful information.

Bernard · ‎01-07-2013

iliyapolak wrote:
>>>But, I am interested in using CPU cache in my own program manually or forcefully>>>
C/C++ are not cache-aware, so you need to optimize your programs yourself; you can use the Intel manuals for that.
I will do some research on the web to find useful information.

Read this article: http://people.redhat.com/drepper/cpumemory.pdf

Bernard · ‎01-07-2013

>>>But, I am interested in using CPU cache in my own program manually or forcefully Do you have any idea about it? >>>
It can be done with smart cache-aware programming, which means finding and exploiting spatial and temporal locality in your program. Arrays are very good candidates for this. Sadly, I cannot help you more here; I'm not an expert on CPU caches and their optimization.

A very good podcast by Scott Meyers: http://skillsmatter.com/podcast/home/cpu-caches-and-why-you-care

SergeyKostrov · ‎01-07-2013

>>... I have read in which Intel Optimization Manual it's really help full but it hadn't specified and manual CPU cache...

Do you have the April 2012 edition? I wonder if you looked at Chapter 7, 'Optimizing Cache Usage'?

Here are a couple of tips:
- Warm your data before processing ( it is a very simple procedure / very helpful when some data were paged to a Virtual Memory paging file )
- Use a PREFETCH instruction ( it really improves performance when used in MemCpy or StrCpy functions ). Here is a small example:

    ...
    RTbool FastMemCopy128( RTvoid *pvDst, RTvoid *pvSrc, RTint iNumOfBytes )
    {
        ...
        RTint iPageSize = 4096;
        RTint iCacheLineSize = 32;
        ...
        for( RTint i = 0; i < iNumOfBytes; i += iPageSize )
        {
            RTint j;
            for( j = i + iCacheLineSize; j < ( i + iPageSize ); j += iCacheLineSize )
            {
                _mm_prefetch( ( RTchar * )pvSrc + j, _MM_HINT_NTA );
            }
            ...
        }
        ...
        return ( RTbool )bOk;
    }
    ...
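For readers who want something they can compile, here is a minimal sketch of the two tips above (warming pages, then prefetching ahead while copying). It is not the FastMemCopy128 function, whose body is elided in the post. The names warm_pages, copy_with_prefetch and PREFETCH_NTA, the 64-byte cache line, and the prefetch distance of 8 lines are all assumptions made for illustration; a good distance depends on the CPU and would need measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))   /* no-op on targets without SSE */
#endif

/* Assumed sizes, not queried from the hardware. */
enum {
    ASSUMED_PAGE_BYTES   = 4096,
    ASSUMED_LINE_BYTES   = 64,
    PREFETCH_AHEAD_BYTES = 8 * ASSUMED_LINE_BYTES
};

/* Tip 1: touch one byte per page so paged-out data is brought back in. */
static unsigned warm_pages(const void *buf, size_t n)
{
    const volatile unsigned char *p = (const volatile unsigned char *)buf;
    unsigned sink = 0;
    for (size_t i = 0; i < n; i += ASSUMED_PAGE_BYTES)
        sink += p[i];
    return sink;   /* returned so the reads are not optimized away */
}

/* Tip 2: copy line by line, hinting the line a fixed distance ahead. */
static void copy_with_prefetch(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;
    for (; i + ASSUMED_LINE_BYTES <= n; i += ASSUMED_LINE_BYTES) {
        if (i + PREFETCH_AHEAD_BYTES < n)   /* stay inside the source buffer */
            PREFETCH_NTA(s + i + PREFETCH_AHEAD_BYTES);
        memcpy(d + i, s + i, ASSUMED_LINE_BYTES);
    }
    memcpy(d + i, s + i, n - i);   /* remaining bytes, if any */
}
```

A prefetch is only a hint, so the copy is correct with or without it; whether it is faster has to be timed against a plain memcpy on the target machine.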

Sigehere_S_ · ‎01-07-2013

Thanks, I got it. I hope it will helpful to me.

SergeyKostrov · ‎01-07-2013

>>Thanks, I got it. I hope it will helpful to me.

Please take a look at:

Forum topic: A problem with 'prefetcht0' instruction ( AT&T inline-assembler syntax )
Web-link: http://software.intel.com/en-us/forums/topic/280798

Also, try to search the Intel forums with a key-word prefetch because there were lots of discussions in the past on that subject.

Sigehere_S_ · ‎01-08-2013

>>Also, try to search the Intel forums with a key-word prefetch

Sure, I will read, thanks Sergey.

Sigehere_S_ · ‎01-09-2013

Hi Sergey,

I have written a simple program that uses the _mm_prefetch function. The objective of my sample code is just to find the aggregate sum of an array.

CODE:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>    /* gettimeofday */
    #include <xmmintrin.h>   /* _mm_prefetch */
    #include <omp.h>

    #define SIZE 400000000
    #define LOOP 20000

    struct timeval starttime;

    // start time function implementation
    void startTimer()
    {
        gettimeofday(&starttime,0);
    }

    // end time function implementation: elapsed time in milliseconds
    double endTimer()
    {
        struct timeval endtime;
        gettimeofday(&endtime,0);
        return (endtime.tv_sec - starttime.tv_sec)*1000.0 + (endtime.tv_usec - starttime.tv_usec)/1000.0;
    }

    int main ()
    {
        long int sum=0;
        int *A = (int *)malloc(sizeof(int)*SIZE);
        int i,j;
        // fill the array with ones, so the expected sum is SIZE
        for(i=0;i<SIZE;i++)
        {
            A[i]=1;
        }
        startTimer();
        #pragma omp parallel for reduction (+:sum)
        for(i=0;i<LOOP;i++)
        {
            // hint the first cache line of the next block of LOOP elements
            _mm_prefetch(&A[(i+1)*LOOP],3);
            for(j=0;j<LOOP;j++)
            {
                sum += A[(i*LOOP)+j];
            }
        }
        printf("Result = %ld", sum);
        printf("Total Time Required = %lf ms\n",endTimer());
        return 0;
    }

#shell script:

    icc -O1 -openmp sum.c -o sum_O1
    icc -O2 -openmp sum.c -o sum_O2
    icc -O3 -openmp sum.c -o sum_03
    icc -O -openmp sum.c -o sum_O
    icc -Os -openmp sum.c -o sum_Os
    icc -O0 -openmp sum.c -o sum_O0
    icc -fast -openmp sum.c -o sum_fast
    icc -Ofast -openmp sum.c -o sum_Ofast
    icc -fno-alias -openmp sum.c -o sum_fno_alias
    icc -fno-fnalias -openmp sum.c -o sum_fno_fnalias

But the time required with _mm_prefetch is the same as the time required without it. Is there an option missing in the icc command? Can you tell me what I am missing?

I am using Ubuntu 12.04 (Intel i7 / 8GB RAM). More details about my processor are at the following link: http://ark.intel.com/products/64899/Intel-Core-i7-3610QM-Processor-6M-Cache-up-to-3_30-GHz

Can you give me any suggestions to improve this code? Thanks.

TimP · ‎01-09-2013

I'm not surprised if _mm_prefetch makes little difference here. You are depending primarily on automatic hardware prefetch, if you haven't turned it off (in BIOS setup or by MSR), and hardware prefetch ought to do the job well. Nitpicks: if you run on a multiple socket platform, one of your prefetches appears to prefetch to the wrong CPU. When that doesn't happen, you may accelerate the first cache line for the next inner loop but delay the effectiveness of hardware prefetch.
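One way to check this on a particular machine is to time the same loop with and without the hint. The following is only a sketch of such a pair, not the poster's exact program: sum_rows, use_prefetch and PREFETCH_T0 are names invented here. The hint value 3 in the original code corresponds to _MM_HINT_T0. The inner index is declared inside the loop so that each OpenMP thread gets its own copy, and the hint is skipped on the last row so it never points past the end of the array.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)(p))   /* no-op on targets without SSE */
#endif

/* Sum a rows x cols block stored contiguously in a[].
   use_prefetch selects whether the software hint is issued. */
static long sum_rows(const int *a, long rows, long cols, int use_prefetch)
{
    assert(a != NULL);
    long sum = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum)
#endif
    for (long i = 0; i < rows; i++) {
        /* Hint the first cache line of the next row. */
        if (use_prefetch && i + 1 < rows)
            PREFETCH_T0(&a[(i + 1) * cols]);
        for (long j = 0; j < cols; j++)   /* j is local to each iteration */
            sum += a[i * cols + j];
    }
    return sum;
}
```

Both variants return the same sum; only the timing can differ, and for a long stride-1 stream like this one the hardware prefetcher may already hide most of the latency.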

SergeyKostrov · ‎01-09-2013

Hi everybody,

>>...
>>#pragma omp parallel for reduction (+:sum)
>>for(i=0;i<LOOP;i++)
>>{
>>_mm_prefetch(&A[(i+1)*LOOP],3);
>>for(j=0;j<LOOP;j++)
>>{
>>sum += A[(i*LOOP)+j];
>>}
>>}
>>...

Please take a look at a partial example of FastMemCopy128 function which I posted a couple of days ago. You're using _mm_prefetch in a different way and it doesn't look good.

We constantly have discussions on applications and usefulness of _mm_prefetch intrinsic function or prefetch instruction ( as inline assembler in C/C++ codes ). Since Intel invented it prefetch should work. However, it has to be applied and used properly. Your case is more complex because _mm_prefetch intrinsic function is used inside of OpenMP clause ( is that the reason of the problem? ) and I never tried to do the same.

Sigehere_S_ · ‎01-09-2013

TimP (Intel) wrote:
I'm not surprised if _mm_prefetch makes little difference here. You are depending primarily on automatic hardware prefetch, if you haven't turned it off (in BIOS setup or by MSR), and hardware prefetch ought to do the job well.
Nitpicks: if you run on a multiple socket platform, one of your prefetches appears to prefetch to the wrong CPU. When that doesn't happen, you may accelerate the first cache line for the next inner loop but delay the effectiveness of hardware prefetch.

That means we cannot use hardware and software prefetch in the same application. If we disable hardware prefetch in the BIOS, then obviously it will do an effective job; I agree with you. That means there is no way to optimize further by using h/w and s/w prefetch in the same application.

Suppose, in my code above, that when the inner loop starts to execute, the CPU has not yet cached any of that data in CPU cache memory. Is that right or wrong? If we tell the processor which data the next loop execution will need, that should help to reduce latency. I am trying to reduce the latency of memory access. If I am thinking in the wrong direction, then please tell me how this code is executed by the CPU. Thanks

Sigehere_S_ · ‎01-09-2013

Sergey Kostrov wrote:
Hi everybody,

>>...
>>#pragma omp parallel for reduction (+:sum)
>>for(i=0;i<LOOP;i++)
>>{
>>_mm_prefetch(&A[(i+1)*LOOP],3);
>>for(j=0;j<LOOP;j++)
>>{
>>sum += A[(i*LOOP)+j];
>>}
>>}
>>...

Please take a look at a partial example of FastMemCopy128 function which I posted a couple of days ago. You're using _mm_prefetch in a different way and it doesn't look good.

We constantly have discussions on applications and usefulness of _mm_prefetch intrinsic function or prefetch instruction ( as inline assembler in C/C++ codes ). Since Intel invented it prefetch should work. However, it has to be applied and used properly. Your case is more complex because _mm_prefetch intrinsic function is used inside of OpenMP clause ( is that the reason of the problem? ) and I never tried to do the same.

Hi Sergey, do you get good results with FastMemCopy128? Can you post a full sample code so that I can study it? In my code I don't get any good result. Please give me any sample code with memory optimization. Thanks

Bernard · ‎01-09-2013

@Tim It would be interesting to see how much a SoA or hybrid SoA data layout, where applicable, coupled with the prefetch instruction, could improve performance.

Bernard · ‎01-09-2013

@Sighere Please try a SoA or hybrid SoA approach for your data layout. This layout is more effective for nicely vectorized input data such as 3D or 4D vectors, but you can try it on your data set. Below is a very interesting link: http://software.intel.com/en-us/articles/how-to-manipulate-data-structure-to-optimize-memory-use-on-32-bit-intel-architecture
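For anyone unfamiliar with the terms, the difference between an array of structures (AoS) and a structure of arrays (SoA) can be sketched as follows. The type and function names (PointAoS, PointsSoA, sum_x_aos, sum_x_soa) and the size N_POINTS are made up for this illustration and do not come from the linked article.

```c
#include <assert.h>
#include <stddef.h>

enum { N_POINTS = 1024 };   /* arbitrary size for the illustration */

/* AoS: the x, y, z of one point sit next to each other in memory. */
struct PointAoS { float x, y, z; };

/* SoA: all x values are contiguous, then all y values, then all z values. */
struct PointsSoA {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

/* Reading only x from AoS walks memory with a stride of 12 bytes,
   so two thirds of every fetched cache line is unused here. */
static float sum_x_aos(const struct PointAoS *p, size_t n)
{
    assert(p != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Reading only x from SoA is a stride-1 walk over a plain float array. */
static float sum_x_soa(const struct PointsSoA *p, size_t n)
{
    assert(p != NULL && n <= N_POINTS);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```

A single 1-D int array, as in the summation program in this thread, is already a stride-1 layout, so the distinction only matters once each element has several fields.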

TimP · ‎01-10-2013

The sample which was presented appears to conform to the preferred organization of a stride 1 inner vectorizable loop inside an outer parallel loop. I don't see the relevance of discussing array of structures here.

If you wish to combine hardware and software prefetch, you might start by examining what icc does with options such as
-xHost -ansi-alias -openmp -opt-prefetch -opt-report

If your objective is to cover the initial iterations of a loop by software prefetch, the category under which icc implements that for certain targets is called initial value prefetch. The original Pentium 4 presented some cases where this effect could be accelerated by methods resembling what is shown in this thread, but the interaction of software and hardware prefetch was generally bad and the characteristics of hardware prefetch had to be changed.

The point well taken is that the hardware prefetch doesn't become effective until the loop has traversed several cache lines, but that problem should be negligible in a case as large as this.

SergeyKostrov · ‎01-10-2013

>>...Do you get good result with FastMemCopy128 can you post any full sample code...

I could post test results with and without prefetch to demonstrate that it works and improves performance.

Bernard · ‎01-10-2013

>>>The sample which was presented appears to conform to the preferred organization of a stride 1 inner vectorizable loop inside an outer parallel loop. I don't see the relevance of discussing array of structures here>>>
For @Sighere's example SoA is not applicable, but it is for vectorized data sets such as the float coordinates of vertices and the float coordinates of light sources. When allocating memory for large data sets of vertices, I think the preferred option will be hybrid SoA, because the structures are located in close vicinity to each other.

SergeyKostrov · ‎01-10-2013

Note: SoA stands for Structure of Arrays

>>...SoA is not applicable...

As you can see in Sighere's example just one block of memory for 1-D array is created:

>>...
>>int *A = ( int * )malloc( sizeof( int ) * SIZE );

Bernard · ‎01-10-2013

Sergey Kostrov wrote:
Note: SoA stands for Structure of Arrays

>>...SoA is not applicable...

As you can see in Sighere's example just one block of memory for 1-D array is created:

>>...
>>int *A = ( int * )malloc( sizeof( int ) * SIZE );

SoA layout usage makes more sense in perfectly vectorised data sets.

Bernard · ‎01-10-2013

>>>SoA layout usage makes more sense in perfectly vectorised data sets.>>>
Although, in Sighere's case, I think that for the sake of curiosity a hybrid SoA approach could be tested. Packing his data into SoA structures aligned on 16-byte boundaries, designed as 3D or 4D vectors, and filling a 1D array with such structures might improve CPU cache performance.
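A hybrid SoA layout along these lines might look like the sketch below: a 1-D array of small blocks, each block holding four points component by component on 16-byte boundaries. The names Vec3x4, set_point, sum_x and LANES are hypothetical, and the block width of four floats is an assumption chosen to match one 16-byte SSE register. Whether this helps a plain integer summation is untested here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };   /* four floats = 16 bytes */

/* One block stores four 3D points, one component array at a time. */
struct Vec3x4 {
    _Alignas(16) float x[LANES];
    _Alignas(16) float y[LANES];
    _Alignas(16) float z[LANES];
};

/* Store point number i into a 1-D array of blocks. */
static void set_point(struct Vec3x4 *blocks, size_t i,
                      float x, float y, float z)
{
    assert(blocks != NULL);
    struct Vec3x4 *b = &blocks[i / LANES];   /* which block */
    size_t lane = i % LANES;                 /* slot inside the block */
    b->x[lane] = x;
    b->y[lane] = y;
    b->z[lane] = z;
}

/* Sum the x components of the first n_points points. */
static float sum_x(const struct Vec3x4 *blocks, size_t n_points)
{
    assert(blocks != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n_points; i++)
        s += blocks[i / LANES].x[i % LANES];
    return s;
}
```

Each block is 48 bytes, so the x, y and z values of neighbouring points stay close together while every component array still starts on a 16-byte boundary.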

Task Parallelization