Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.
1696 Discussions

Optimizing the sum of large arrays

fyten1985
Beginner
3,320 Views

This is a question about optimizing the sum of large arrays.

There are two double-type arrays, and the code for this problem is as follows:

#pragma omp parallel for
for (long i = 0; i < 5000000; i++)
{
    array1[i] += array2[i];
}

My computer is a "Dell PowerEdge 2900III 5U" with two Xeon E5420 CPUs and 48 GB of memory.

And the OS is MS Windows Server 2003 R2 Enterprise x64 Edition sp2.

The C++ compilers are VC++ 2008 and Intel C++ 11.0.061, and the solution platform is x64.

I compiled the program with both VC++ and Intel C++; the two results were basically the same.

Then I used an Intel MKL 10.1 function to do the computation, as follows:

cblas_daxpy(5000000, 1.0, array2, 1, array1, 1);

The performance was no different.

Then I tried another Intel MKL 10.1 function:

vdAdd( n, a, b, y );

Performance decreased significantly, to only about 80% of the original.

I would like to know how this problem can be optimized to improve the program's performance.

0 Kudos
35 Replies
TimP
Honored Contributor III
1,308 Views
Intel C++ documentation includes examples of #pragma prefetch usage. Unrolling the loop by itself isn't likely to be effective in a memory limited case; the point is to ensure that time isn't wasted by issuing too many extra prefetches.
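A sketch of what a manually unrolled version might look like (hypothetical helper, not code from this thread; with the Intel compiler, a `#pragma prefetch array1, array2` could be placed just above the unrolled loop, and other compilers simply ignore it):

```cpp
// 4-way manual unroll of the array-sum loop. The remainder iterations
// are handled in a separate serial loop so the unrolled loop's bounds
// stay a multiple of 4.
void add_arrays_unrolled(double* array1, const double* array2, long n)
{
    long tail = n - (n % 4);        // last index covered by the unrolled loop
#pragma omp parallel for
    for (long i = 0; i < tail; i += 4) {
        array1[i]     += array2[i];
        array1[i + 1] += array2[i + 1];
        array1[i + 2] += array2[i + 2];
        array1[i + 3] += array2[i + 3];
    }
    for (long i = tail; i < n; ++i) // remainder iterations
        array1[i] += array2[i];
}
```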
0 Kudos
bigknife
Beginner
1,308 Views
Quoting tim18
Intel C++ documentation includes examples of #pragma prefetch usage. Unrolling the loop by itself isn't likely to be effective in a memory limited case; the point is to ensure that time isn't wasted by issuing too many extra prefetches.

So, Tim, I am eagerly looking forward to your source-code solution to this specific problem!

0 Kudos
bigknife
Beginner
1,308 Views
It depends.
It depends on what you consider a "big improvement", which compiler you are using, and how much you want to depend on implementation details of the compiler.
If performance is important to you, then doing manual loop unrolling puts you on the safer side. However, a lot of developers do not bother themselves with such things at all (parallelization, prefetching, etc.).

"It depends" are political men's words. Just kidding! :)

We are facing this specific problem with the VC++ 2008 compiler and Intel C++ 11.0 running on a Dell server with two Xeon E5420 CPUs and 48 GB of memory.

I have been doing a lot of work on it, but have not achieved a good improvement so far.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,308 Views
Well, simple addition of numbers is memory bandwidth bound, so little can be done here. A single addition is just too small an amount of work per 16 bytes of memory.
I guess Tim mentioned unrolling just as a good habit.
To get good improvement on this particular problem you must somehow reduce memory bandwidth requirements (http://software.intel.com/en-us/forums/showpost.php?p=114059) or upgrade to a NUMA system.
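If the surrounding algorithm allows it, one way to reduce the bandwidth requirement is to fuse several of the additions into a single pass over the destination. A sketch (hypothetical helper, not the poster's code):

```cpp
// Instead of N separate passes that each stream the destination through
// memory once per source vector, accumulate all sources in one pass, so
// each element of dst is loaded and stored only once.
void fused_add(double* dst, const double* const* srcs, int nsrc, long n)
{
#pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        double s = dst[i];
        for (int k = 0; k < nsrc; ++k)  // add all sources while dst[i] is in a register
            s += srcs[k][i];
        dst[i] = s;
    }
}
```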
0 Kudos
TimP
Honored Contributor III
1,308 Views
Ideally, #pragma prefetch would take care of issuing a prefetch once per automatically unrolled loop. I'm not in a position to check this, but if this case is an isolated loop (as implied above), and if there's a reason for making this effort for a possible small performance improvement on the old hardware, it may be worth a try.
0 Kudos
Grant_H_Intel
Employee
1,308 Views

Dmitriy wrote:
"I am not sure as to whether OpenMP guarantees stable binding of threads to data or not (i.e. given static schedule thread 0 always processes indexes from 0 to N/2 and thread 1 indexes from N/2 to N)."

The current OpenMP v3.0 API (Table 2.1, p.40) does not guarantee this property for the code you included:

"A compliant implementation of static schedule must ensure that the sameassignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied:
1) both loop regions have the same number of loop iterations,
2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, and
3) both loop regions bind to the same parallel region."

In your code, only condition 2) is satisfied. However, if you unroll the first loop like the second, and stick them in the same parallel region, then the property you seek is guaranteed.
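That arrangement could be sketched as follows (function name and second loop body are illustrative): both loops have the same iteration count, the same default chunk_size, and bind to the same parallel region, so each thread is guaranteed the same index range in both loops.

```cpp
void two_pass(double* a, const double* b, long n)
{
#pragma omp parallel
    {
#pragma omp for schedule(static)
        for (long i = 0; i < n; ++i)
            a[i] += b[i];            // pass 1
#pragma omp for schedule(static)
        for (long i = 0; i < n; ++i)
            a[i] *= 0.5;             // pass 2: the same thread revisits its chunk
    }
}
```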

Hope that helps.

- Grant

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,308 Views
Thank you, Grant!

Humm... can I specify nowait clause for the first loop then? Just curious.
0 Kudos
farneville
Beginner
1,308 Views

Wow, I am getting lots of info here. As a noob, I am really learning. Thanks for the code; this is what I was trying to figure out...

0 Kudos
bigknife
Beginner
1,308 Views

Quoting Grant_H_Intel
"...if you unroll the first loop like the second, and stick them in the same parallel region, then the property you seek is guaranteed."

Anyway, Grant, could you provide a source-code solution for this specific problem that makes a significant performance improvement?

0 Kudos
Grant_H_Intel
Employee
1,308 Views
Dmitriy wrote:

> Humm... can I specify nowait clause for the first loop then? Just curious.

Yes! As long as there are no dependences between corresponding chunks of the two loops. (Assuming of course no dependences between different chunks of the same loop in order to run them in parallel as well.)
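A sketch of that case (names illustrative): because both loops use the same static schedule in the same parallel region, each thread only reads the a[] elements it wrote itself, so the implicit barrier after the first loop can be dropped with nowait.

```cpp
void two_pass_nowait(double* a, double* c, const double* b, long n)
{
#pragma omp parallel
    {
#pragma omp for schedule(static) nowait  // skip barrier: chunks line up
        for (long i = 0; i < n; ++i)
            a[i] += b[i];
#pragma omp for schedule(static)
        for (long i = 0; i < n; ++i)
            c[i] = 2.0 * a[i];           // corresponding chunk: no dependence hazard
    }
}
```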




0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,308 Views
> Yes! As long as there are no dependences between corresponding chunks of the two loops.

Ok, got it. Thank you.


:)
0 Kudos
Grant_H_Intel
Employee
1,308 Views
Bigknife,

Likely I cannot provide a performance improvement for such a code because it is memory bandwidth limited, as Dmitriy already stated. There is simply not enough cache reuse of the data to overcome the more than 100:1 time ratio for data fetch/store time vs. floating point addition. To do that would require an architecture that allowed hundreds of outstanding memory loads/stores to be simultaneously in flight, which doesn't exist in commercial hardware.

I suggest that you must parallelize your code at a higher granularity than a simple vector addition. Surely, your entire code doesn't just add two vectors together. To get real multi-threaded speedup on any machine, you need to do more computation with the data per memory fetch and/or make it fit into the cache 1) so it doesn't take so long to fetch it each time AND 2) so that it doesn't saturate the memory system. Prefetching can only solve the long fetch time; it cannot solve the memory system saturation problem.

Is it possible to apply OpenMP to a loop that contains the array add loop?
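The blocking idea could look something like this sketch (hypothetical helper; the "scale" step stands in for whatever follow-up step the real algorithm has, and BLOCK is an assumed tuning parameter, not a measured value):

```cpp
#include <algorithm>

static const long BLOCK = 1L << 16;  // 64K doubles = 512 KB per slice

// Run both algorithm steps over a cache-sized slice before moving on,
// so the slice written by step 1 is still in cache when step 2 reads it.
void add_then_scale_blocked(double* dst, const double* src, long n)
{
    for (long base = 0; base < n; base += BLOCK) {
        long end = std::min(base + BLOCK, n);
        for (long i = base; i < end; ++i)
            dst[i] += src[i];        // step 1 on the slice
        for (long i = base; i < end; ++i)
            dst[i] *= 0.5;           // step 2 reuses the slice while it is hot
    }
}
```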

0 Kudos
bigknife
Beginner
1,308 Views
Grant,
Thank you very much for your suggestions.
But why can parallelizing the code at a higher granularity be better than a simple vector addition?
As we know, each vector in the code has a length of 5,000,000 and needs more than 38 MB of storage space, which is much bigger than the cache size (12 MB) of the E5420.
Therefore, if we need to add 100 vectors to one vector, wouldn't 'applying OpenMP to a loop that contains the array add loop' cause a higher cache miss rate?
Best regards,
Bigknife
0 Kudos
Grant_H_Intel
Employee
1,308 Views
Quoting bigknife
But why can parallelizing the code at a higher granularity be better than a simple vector addition?
As we know, each vector in the code has a length of 5,000,000 and needs more than 38 MB of storage space, which is much bigger than the cache size (12 MB) of the E5420.
Therefore, if we need to add 100 vectors to one vector, wouldn't 'applying OpenMP to a loop that contains the array add loop' cause a higher cache miss rate?


Well, I didn't know what the outer loop does until you told me above. If you are adding 100 vectors of length 5 million, then memory bandwidth problems are going to limit speedup no matter what you try.

Is that all the algorithm does with the 100 vectors? Is there another algorithmic step outside of the 100 vector additions (either around or after the vector adds) that might reuse some of the data? If not, then you are probably getting nearly all the speedup you can. If there is another step, then perhaps blocking the computation and use of the vectors so that the results fit into the cache and can be reused could result in better speedup.

But without knowing the entire algorithm or code, it is very difficult to suggest anything more detailed than what I've already suggested. Could you post more of the code so we can see what algorithm you are implementing?

0 Kudos
bigknife
Beginner
1,308 Views
Quoting Grant_H_Intel
"But without knowing the entire algorithm or code, it is very difficult to suggest anything more detailed than what I've already suggested. Could you post more of the code so we can see what algorithm you are implementing?"

My code is very simple and the only bottleneck is the addition of large vectors.

Thank you very much, Grant!

0 Kudos
Reply