I am testing OpenMP to make use of multi-core processors. An example that computes e and pi, found on the web, works and runs faster than the single-core equivalent.
But the code snippet below (from a matrix-inversion routine) appears to work and shows 100% CPU usage on both cores, yet it actually takes longer to run than the single-core equivalent.
!$omp parallel sections shared(a,f)
!$omp section
      do 50 k=ip1,mp
         a(j,k)=a(j,k)+f*a(i,k)
 50   continue
!$omp end parallel sections
Question: Has anyone else seen cases where dual-core processing runs slower than the single-core equivalent? Any ideas why this happens?
I also tried using OpenMP to speed up a Quicksort algorithm and got the same result: the single-core version is faster than the dual-core one.
In the last example posted in this thread, I can't see why parallel sections would be used rather than parallel do, nor why the inner loop would be the one designated for OpenMP parallelism. If threaded parallelism is wanted without any thought given to optimization, /Qparallel would be preferable, even though it is still not often effective.
As to the minimum problem size for effective OpenMP parallelism, I have an example that achieves excellent threaded scaling on Core 2 Duo even though the non-threaded version takes only 1 millisecond. Of course, this is an ideal case: the cache sharing is effective, as are the persistent threads left over from a previous parallel region. The Intel OpenMP run-time does show reduced overhead compared with the Microsoft and GNU libraries.
The basic point, that OpenMP parallelism will not pay off for a simple inner loop of length 1000, does apply to the posted case.
Thanks for the feedback. Your comments give me something to work on. I can easily construct a test case where I gradually increase the work in the inner loop, so that I can evaluate the effect of the overhead. Knowing the overhead will also provide a better basis for evaluating other parts of our software that could be parallelized.
Thanks for the feedback. I am new to the world of parallelization (obviously), so I will look at parallel do as well.