This is an optimization problem involving the element-wise sum of two large arrays. There are two arrays of type double, and the code for this problem is as follows:
#pragma omp parallel for
for (long i = 0; i < 5000000; i++)
{
    array1[i] += array2[i];
}
My computer is a Dell PowerEdge 2900 III 5U with two Xeon 5420 CPUs and 48 GB of memory.
The OS is MS Windows Server 2003 R2 Enterprise x64 Edition SP2.
The C++ compilers are VC++ 2008 and Intel C++ 11.0.061, and the solution platform is x64.
I compiled the program with both VC++ and Intel C++, and the two results are basically the same.
I then used a function from Intel MKL 10.1 to do the computation, as follows:
cblas_daxpy(5000000, 1.0, array2, 1, array1, 1);
The performance of the program was no different.
Then I tried another Intel MKL 10.1 function:
vdAdd(n, a, b, y);
Program performance decreased significantly, to only about 80% of the original.
I would like to know how to optimize this problem and improve the program's performance.
So, Tim, I am eagerly looking forward to your source-code solution to this specific problem!
It depends on what you consider a "big improvement", what compiler you are using, and how much you want to depend on implementation details of the compiler.
If performance is important to you, then doing manual loop unrolling puts you on the safer side. However, a lot of developers do not bother themselves with such things at all (parallelization, prefetching, etc.).
"It depends" is a politician's answer. Just kidding! :)
We are facing this specific problem with the VC++ 2008 and Intel C++ 11.0 compilers, running on a Dell server with two Xeon 5420 CPUs and 48 GB of memory.
I have been doing a lot of work on it, but have not achieved a good improvement so far.
I guess Tim mentioned unrolling just as a good habit.
To get a good improvement on this particular problem you must somehow reduce the memory bandwidth requirements (http://software.intel.com/en-us/forums/showpost.php?p=114059) or upgrade to a NUMA system.
Dmitri wrote:
"I am not sure as to whether OpenMP guarantees stable binding of threads to data or not (i.e. given static schedule thread 0 always processes indexes from 0 to N/2 and thread 1 indexes from N/2 to N)."
The current OpenMP v3.0 API (Table 2.1, p.40) does not guarantee this property for the code you included:
"A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied:
1) both loop regions have the same number of loop iterations,
2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, and
3) both loop regions bind to the same parallel region."
In your code, only condition 2) is satisfied. However, if you unroll the first loop like the second, and stick them in the same parallel region, then the property you seek is guaranteed.
Hope that helps.
- Grant
Hmm... can I specify a nowait clause for the first loop then? Just curious.
Wow, I am getting a lot of info here. As a noob I am really learning a lot. Thanks for the code samples; this is what I am trying to figure out.
In any case, Grant, could you provide a source-code solution for this specific problem that achieves a significant performance improvement?
> Hmm... can I specify a nowait clause for the first loop then? Just curious.
Yes! As long as there are no dependencies between corresponding chunks of the two loops. (Assuming, of course, no dependencies between different chunks of the same loop, in order to run them in parallel as well.)
Ok, got it. Thank you.
Likely I cannot provide a performance improvement for such code, because it is memory bandwidth limited, as Dmitriy already stated. There is simply not enough cache reuse of the data to overcome the more-than-100:1 time ratio of data fetch/store versus floating-point addition. Overcoming that would require an architecture that allowed hundreds of outstanding memory loads/stores to be in flight simultaneously, which doesn't exist in commercial hardware.
I suggest parallelizing your code at a higher granularity than a simple vector addition. Surely your entire code doesn't just add two vectors together. To get real multi-threaded speedup on any machine, you need to do more computation with the data per memory fetch, and/or make the data fit into the cache 1) so it doesn't take so long to fetch each time AND 2) so that it doesn't saturate the memory system. Prefetching can only address the long fetch time; it cannot solve the memory-system saturation problem.
Is it possible to apply OpenMP to a loop that contains the array-add loop?
Well, I didn't know what the outer loop does until you told me above. If you are adding 100 vectors of length 5 million, then memory bandwidth problems are going to limit speedup no matter what you try.
Is that all the algorithm does with the 100 vectors? Is there another algorithmic step outside of the 100 vector additions (either around or after the vector adds) that might reuse some of the data? If not, then you are probably getting nearly all the speedup you can. If there is another step, then perhaps blocking the computation and use of the vectors, so that the results fit into the cache and can be reused, could result in better speedup.
But without knowing the entire algorithm or code, it is very difficult to suggest anything more detailed than what I've already suggested. Could you post more of the code so we can see what algorithm you are implementing?
My code is very simple and the only bottleneck is the addition of large vectors.
Thank you very much, Grant!