Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Task Parallelization

Swapnil_J_
Beginner
2,397 Views

Hello friends,

    I have successfully done threading and data parallelization. But I am really interested in task parallelization. How do I do it? Where do I start?

Please guide me.

0 Kudos
48 Replies
Sigehere_S_
Beginner
754 Views
Hi Sergey Kostrov, first of all, thanks for the immediate response. Yes, Sergey, please give me some more detail as early as possible. I am new to Intel Composer XE, and reading all the documentation would take a lot of time; I know it would be helpful, but it requires a lot of time. I have read the Intel Optimization Manual and it is really helpful, but it doesn't cover controlling the CPU cache manually. If you have already read these documents, could you tell me which portion of the manual is useful? That would help me. Thanks!!!
0 Kudos
Bernard
Valued Contributor I
754 Views
>>>But, I am interested in using the CPU cache in my own program manually or forcefully>>> C/C++ are not cache-aware; you need to optimize your programs yourself, and you can use the Intel manuals for that. I will do some research on the web to find useful information.
0 Kudos
Bernard
Valued Contributor I
754 Views
iliyapolak wrote:

>>>But, I am interested in using the CPU cache in my own program manually or forcefully>>>
C/C++ are not cache-aware; you need to optimize your programs yourself, and you can use the Intel manuals for that.
I will do some research on the web to find useful information.

Read this article: http://people.redhat.com/drepper/cpumemory.pdf
0 Kudos
Bernard
Valued Contributor I
754 Views
>>>But, I am interested in using the CPU cache in my own program manually or forcefully. Do you have any idea about it?>>> It could be done with smart cache-aware programming; that means you need to find and exploit spatial and temporal locality in your program. Arrays are very good candidates for this. Sadly, I cannot help you more here; I am not an expert on the CPU cache and its optimization. Very good podcast by Scott Meyers: http://skillsmatter.com/podcast/home/cpu-caches-and-why-you-care
0 Kudos
SergeyKostrov
Valued Contributor II
754 Views
>>... I have read the Intel Optimization Manual, it's really helpful, but it doesn't cover manual CPU cache control...

Do you have the April 2012 edition? I wonder if you looked at Chapter 7, 'Optimizing Cache Usage'? Here are a couple of tips:

- Warm your data before processing ( it is a very simple procedure / very helpful when some data were paged to a Virtual Memory paging file )
- Use a PREFETCH instruction ( it really improves performance when used in MemCpy or StrCpy functions )

Here is a small example:

...
RTbool FastMemCopy128( RTvoid *pvDst, RTvoid *pvSrc, RTint iNumOfBytes )
{
    ...
    RTint iPageSize = 4096;
    RTint iCacheLineSize = 32;
    ...
    for( RTint i = 0; i < iNumOfBytes; i += iPageSize )
    {
        RTint j;
        for( j = i + iCacheLineSize; j < ( i + iPageSize ); j += iCacheLineSize )
        {
            _mm_prefetch( ( RTchar * )pvSrc + j, _MM_HINT_NTA );
        }
        ...
    }
    ...
    return ( RTbool )bOk;
}
...
0 Kudos
Sigehere_S_
Beginner
754 Views
Thanks, I got it. I hope it will be helpful to me.
0 Kudos
SergeyKostrov
Valued Contributor II
754 Views
>>Thanks, I got it. I hope it will be helpful to me.

Please take a look at:

Forum topic: A problem with 'prefetcht0' instruction ( AT&T inline-assembler syntax )
Web-link: http://software.intel.com/en-us/forums/topic/280798

Also, try to search the Intel forums with the keyword prefetch, because there were lots of discussions in the past on that subject.
0 Kudos
Sigehere_S_
Beginner
754 Views
>>Also, try to search the Intel forums with the keyword prefetch

Sure, I will read them, thanks Sergey.
0 Kudos
Sigehere_S_
Beginner
754 Views
Hi Sergey, I have written one simple piece of code using the _mm_prefetch function; the objective of my sample code is just to find the aggregate sum of an array.

CODE:

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <sys/time.h>    /* gettimeofday */
#include <xmmintrin.h>   /* _mm_prefetch */

#define SIZE 400000000
#define LOOP 20000

struct timeval starttime;

/* start time function implementation */
void startTimer()
{
    gettimeofday(&starttime, 0);
}

/* end time function implementation */
double endTimer()
{
    struct timeval endtime;
    gettimeofday(&endtime, 0);
    return (endtime.tv_sec - starttime.tv_sec) * 1000.0 +
           (endtime.tv_usec - starttime.tv_usec) / 1000.0;
}

int main()
{
    long int sum = 0;
    int *A = (int *)malloc(sizeof(int) * SIZE);
    int i, j;
    for (i = 0; i < SIZE; i++) {
        A[i] = 1;
    }
    startTimer();
    #pragma omp parallel for reduction (+:sum) private(j)
    for (i = 0; i < LOOP; i++) {
        _mm_prefetch((const char *)&A[(i + 1) * LOOP], _MM_HINT_T0);
        for (j = 0; j < LOOP; j++) {
            sum += A[(i * LOOP) + j];
        }
    }
    printf("Result = %ld", sum);
    printf("Total Time Required = %lf ms\n", endTimer());
    free(A);
    return 0;
}

#shell script:
icc -O1 -openmp sum.c -o sum_O1
icc -O2 -openmp sum.c -o sum_O2
icc -O3 -openmp sum.c -o sum_O3
icc -O -openmp sum.c -o sum_O
icc -Os -openmp sum.c -o sum_Os
icc -O0 -openmp sum.c -o sum_O0
icc -fast -openmp sum.c -o sum_fast
icc -Ofast -openmp sum.c -o sum_Ofast
icc -fno-alias -openmp sum.c -o sum_fno_alias
icc -fno-fnalias -openmp sum.c -o sum_fno_fnalias

But the time required after using _mm_prefetch is the same as the time required before using _mm_prefetch. Is there any option missing in the {icc} command? Can you tell me where I am missing something? I am using Ubuntu 12.04 (Intel i7 / 8 GB RAM); more details about my processor are at the following link: http://ark.intel.com/products/64899/Intel-Core-i7-3610QM-Processor-6M-Cache-up-to-3_30-GHz Can you give me any suggestion to improve this code? Thanks.
0 Kudos
TimP
Honored Contributor III
754 Views
I'm not surprised if _mm_prefetch makes little difference here. You are depending primarily on automatic hardware prefetch, if you haven't turned it off (in BIOS setup or by MSR), and hardware prefetch ought to do the job well. Nitpicks: if you run on a multiple-socket platform, one of your prefetches appears to prefetch to the wrong CPU. When that doesn't happen, you may accelerate the first cache line for the next inner loop but delay the effectiveness of hardware prefetch.
0 Kudos
SergeyKostrov
Valued Contributor II
754 Views
Hi everybody,

>>...
>>#pragma omp parallel for reduction (+:sum)
>>for(i=0;i<LOOP;i++)
>>{
>>_mm_prefetch(&A[(i+1)*LOOP],3);
>>for(j=0;j<LOOP;j++)
>>{
>>sum += A[(i*LOOP)+j];
>>}
>>}
>>...

Please take a look at the partial example of the FastMemCopy128 function which I posted a couple of days ago. You're using _mm_prefetch in a different way, and it doesn't look good.

We constantly have discussions on the applications and usefulness of the _mm_prefetch intrinsic function, or the prefetch instruction ( as inline assembler in C/C++ codes ). Since Intel invented it, prefetch should work. However, it has to be applied and used properly. Your case is more complex because the _mm_prefetch intrinsic function is used inside an OpenMP clause ( is that the reason for the problem? ), and I have never tried to do the same.
0 Kudos
Sigehere_S_
Beginner
754 Views
TimP (Intel) wrote:

I'm not surprised if mm_prefetch makes little difference here. You are depending primarily on automatic hardware prefetch, if you haven't turned it off (in BIOS setup or by MSR), and hardware prefetch ought to do the job well.
Nit picks: if you run on a multiple socket platform, one of your prefetches appears to prefetch to the wrong CPU. When that doesn't happen, you may accelerate the first cache line for the next inner loop but delay the effectiveness of hardware prefetch.

That means we cannot use hardware as well as software prefetch in the same application. If we disable hardware prefetch in the BIOS, then software prefetch will obviously do an effective job; I agree with you. So there is no way to gain more optimization by using hardware and software prefetch in the same application. Regarding my code above: when the inner loop starts to execute, has the CPU not yet cached any of the data in the CPU cache memory? Is that right or wrong? If we tell the processor which data is required for the next loop iteration, that should help to reduce latency. I am trying to reduce the latency of memory access. If I am thinking in the wrong direction, then please tell me how this code is executed on the CPU. Thanks
0 Kudos
Sigehere_S_
Beginner
754 Views
Sergey Kostrov wrote:

Hi everybody,

>>...
>>#pragma omp parallel for reduction (+:sum)
>>for(i=0;i<LOOP;i++)
>>{
>>_mm_prefetch(&A[(i+1)*LOOP],3);
>>for(j=0;j<LOOP;j++)
>>{
>>sum += A[(i*LOOP)+j];
>>}
>>}
>>...

Please take a look at the partial example of the FastMemCopy128 function which I posted a couple of days ago. You're using _mm_prefetch in a different way, and it doesn't look good.

We constantly have discussions on the applications and usefulness of the _mm_prefetch intrinsic function, or the prefetch instruction ( as inline assembler in C/C++ codes ). Since Intel invented it, prefetch should work. However, it has to be applied and used properly. Your case is more complex because the _mm_prefetch intrinsic function is used inside an OpenMP clause ( is that the reason for the problem? ), and I have never tried to do the same.

Hi Sergey, do you get good results with FastMemCopy128? Can you post any full sample code, so I can study it? In my code I don't get any good result. Please give me any sample code with memory optimization. Thanks
0 Kudos
Bernard
Valued Contributor I
754 Views
@Tim If an SoA or hybrid SoA approach to the data layout were implemented where applicable, coupled with the prefetch instruction, it would be interesting to see how much such an approach could improve performance.
0 Kudos
Bernard
Valued Contributor I
754 Views
@Sigehere Please try an SoA or hybrid SoA approach for your data layout. This data layout is more effective for nicely vectorizable data input like 3D or 4D vectors, but you can try it on your data set. Below is a very interesting link. http://software.intel.com/en-us/articles/how-to-manipulate-data-structure-to-optimize-memory-use-on-32-bit-intel-architecture
0 Kudos
TimP
Honored Contributor III
754 Views
The sample which was presented appears to conform to the preferred organization of a stride-1 inner vectorizable loop inside an outer parallel loop. I don't see the relevance of discussing array of structures here. If you wish to combine hardware and software prefetch, you might start by examining what icc does with options such as -xHost -ansi-alias -openmp -opt-prefetch -opt-report. If your objective is to cover the initial iterations of a loop with software prefetch, the category under which icc implements that for certain targets is called initial-value prefetch. The original Pentium 4 presented some cases where this effect could be accelerated by methods resembling what is shown in this thread, but the interaction of software and hardware prefetch was generally bad, and the characteristics of hardware prefetch had to be changed. The point well taken is that hardware prefetch doesn't become effective until the loop has traversed several cache lines, but that problem should be negligible in a case as large as this.
0 Kudos
SergeyKostrov
Valued Contributor II
754 Views
>>...Do you get good result with FastMemCopy128 can you post any full sample code... I could post test results with and without prefetch to demonstrate that it works and improves performance.
0 Kudos
Bernard
Valued Contributor I
754 Views
>>>The sample which was presented appears to conform to the preferred organization of a stride 1 inner vectorizable loop inside an outer parallel loop. I don't see the relevance of discussing array of structures here>>> For @Sigehere's example SoA is not applicable, but it is for a vectorized data set, for example float coordinates of vertices and float coordinates of light sources. When allocating memory for large data sets of vertices, I think the preferred option will be hybrid SoA, because the structures are located in close vicinity to each other.
0 Kudos
SergeyKostrov
Valued Contributor II
754 Views
Note: SoA stands for Structure of Arrays

>>...SoA is not applicable...

As you can see in Sigehere's example, just one block of memory for a 1-D array is created:

>>...
>>int *A = ( int * )malloc( sizeof( int ) * SIZE );
0 Kudos
Bernard
Valued Contributor I
675 Views
Sergey Kostrov wrote:

Note: SoA stands for Structure of Arrays

>>...SoA is not applicable...

As you can see in Sigehere's example, just one block of memory for a 1-D array is created:

>>...
>>int *A = ( int * )malloc( sizeof( int ) * SIZE );

SoA layout usage makes more sense in perfectly vectorised data sets.
0 Kudos
Bernard
Valued Contributor I
675 Views
>>>SoA layout usage makes more sense in perfectly vectorised data sets.>>> Although in Sigehere's case, I think that for the sake of curiosity a hybrid SoA approach could be tested: packing his data into an SoA aligned on 16-byte boundaries, designing it as 3D or 4D vectors, and filling a 1-D array with such structures. Maybe such a data-set design could improve the performance of the CPU cache.
0 Kudos
Reply