Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Multi-core example with OpenMP slower than single core?

Tony_the_D
Beginner
1,175 Views

I am testing OpenMP as a way to make use of multi-core processors. An example found on the web that computes e and pi works and runs faster than the single-core equivalent.

But the code snippet below (from a routine that does matrix inversion) appears to work and shows 100% CPU usage on both processors, yet it actually takes longer to run than the single-core equivalent.

!$omp parallel sections shared(a,f)
!$omp section
do 50 k=ip1,mp
a(j,k)=a(j,k)+f*a(i,k)
50 continue
!$omp end parallel sections

Question: Has anyone else experienced cases where dual-core processing runs slower than the single-core equivalent? Any ideas why this is so?

I also tried using OpenMP to speed up a Quicksort algorithm and got the same result (single core faster than dual core).

4 Replies
Steve_Nuchia
New Contributor I
There is a lot of overhead in OpenMP. It will run slower unless the loop takes a significant fraction of a second single-threaded. Tens of milliseconds, minimum.
Also, if the task is memory-bandwidth bound rather than compute or cache bound, it will run no faster parallelized, regardless of the API used. At least on most single-socket hardware. You have to know your system architecture here.

TimP
Honored Contributor III

In the last example posted in this thread, I can't imagine why parallel sections would be used rather than parallel do, nor why the inner loop would be designated for OpenMP parallelization. If threaded parallelism is required without any thought given to optimization, /Qparallel would be preferable, even though it is still not often effective.

As to the minimum problem size for effective OpenMP parallelism, I have an example which achieves excellent threaded scaling on Core 2 Duo even when the non-threaded version takes only 1 millisecond. Of course, this is an ideal case: the cache sharing is effective, as are the persistent threads left over from a previous parallel region. The Intel OpenMP run-time does show reduced overhead compared with the Microsoft and GNU libraries.

The basic point, that OpenMP parallelism will not have an advantage for a simple inner loop of length 1000, does apply to the posted case.

Tony_the_D
Beginner
Quoting - Steve_Nuchia

There is a lot of overhead in OpenMP. It will run slower unless the loop takes a significant fraction of a second single-threaded. Tens of milliseconds, minimum.
Also, if the task is memory-bandwidth bound rather than compute or cache bound, it will run no faster parallelized, regardless of the API used. At least on most single-socket hardware. You have to know your system architecture here.

Thanks for the feedback. Your comments give me something to work on. I can easily construct a test case where I gradually increase the length of the inner loops, so that I can evaluate the effect of the overhead. Knowing about the overhead will also give me a better basis for evaluating other parts of our software that could be parallelized.

Tony_the_D
Beginner
Quoting - tim18

In the last example posted in this thread, I can't imagine why parallel sections would be used rather than parallel do, nor why the inner loop would be designated for OpenMP parallelization. If threaded parallelism is required without any thought given to optimization, /Qparallel would be preferable, even though it is still not often effective.

As to the minimum problem size for effective OpenMP parallelism, I have an example which achieves excellent threaded scaling on Core 2 Duo even when the non-threaded version takes only 1 millisecond. Of course, this is an ideal case: the cache sharing is effective, as are the persistent threads left over from a previous parallel region. The Intel OpenMP run-time does show reduced overhead compared with the Microsoft and GNU libraries.

The basic point, that OpenMP parallelism will not have an advantage for a simple inner loop of length 1000, does apply to the posted case.

Thanks for the feedback. I am new to the world of parallelization (obviously), so I will look at parallel do as well.
