<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Time domain simulation parallelisation in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827374#M1413</link>
    <description>There's nothing visibly wrong with your first version of OpenMP parallel do. An Intel OpenMP thread team stays active for the time interval set by KMP_BLOCKTIME (default 0.200 seconds), so it's difficult to see what you expected to accomplish differently with the second version. Your openmp-profile report would shed more light on what is happening in libiomp5; you could simply set LD_PRELOAD=&amp;lt;ifort library path&amp;gt;/libiompprof5.so.&lt;BR /&gt;Did you consider trying Inspector to see whether you have unintended data sharing between threads?&lt;BR /&gt;Do you get any parallel speedup from the OpenMP internal to MKL DGETRS when you turn off OpenMP in your calling code?</description>
    <pubDate>Sun, 08 May 2011 10:39:19 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2011-05-08T10:39:19Z</dc:date>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827373#M1412</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;I am developing a time-domain simulation tool for power systems. The heart of the simulator is this code:&lt;BR /&gt;&lt;CODE&gt;do while(not converged)&lt;BR /&gt;&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (prepare data)&lt;BR /&gt;&lt;BR /&gt;do i=1, "several thousands"&lt;BR /&gt;CALL DGETRS(data of i)&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (check convergence)&lt;BR /&gt;&lt;BR /&gt;enddo&lt;/CODE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;IMG alt="Serial code" src="http://dl.dropbox.com/u/258337/serial.png" height="96" width="700" /&gt;&lt;BR /&gt;&lt;BR /&gt;The middle loop has no data dependencies and accounts for 35% of my total CPU time in the serial version of the program (measured with Intel VTune). DGETRS is a function in mkl_lapack95. My first parallelisation attempt was to add a &lt;CODE&gt;!$omp parallel do&lt;/CODE&gt; directive right before the middle do-loop and play with the scheduling, number of threads etc. to optimise. I got only a marginal speed-up. The CPU time spent in DGETRS is evenly distributed between the threads, but suddenly I have a huge CPU consumption from libiomp5.so.&lt;BR /&gt;&lt;BR /&gt;&lt;IMG alt="Parallel code 1" src="http://dl.dropbox.com/u/258337/code1.png" height="164" width="700" /&gt;&lt;BR /&gt;&lt;BR /&gt;I thought this was because the threads are created and killed on each iteration of the do while loop.
So, my second approach was this:&lt;BR /&gt;&lt;CODE&gt;!$omp parallel&lt;BR /&gt;do while(not converged)&lt;BR /&gt;&lt;BR /&gt;!$omp single&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (prepare data)&lt;BR /&gt;!$omp end single&lt;BR /&gt;&lt;BR /&gt;!$omp do&lt;BR /&gt;do i=1, "several thousands"&lt;BR /&gt;CALL DGETRS(data of i)&lt;BR /&gt;enddo&lt;BR /&gt;!$omp end do&lt;BR /&gt;&lt;BR /&gt;!$omp single&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (check convergence)&lt;BR /&gt;!$omp end single&lt;BR /&gt;&lt;BR /&gt;enddo&lt;BR /&gt;!$omp end parallel&lt;/CODE&gt;&lt;BR /&gt;&lt;BR /&gt;This way I thought the threads would stay alive throughout the whole loop and give a better speed-up. Instead, all the numbers came out worse: more elapsed time, more CPU time and less concurrency. The time attributed to libiomp5.so is now halved, but a lot of time is spent on the two !$omp end single directives.&lt;BR /&gt;&lt;BR /&gt;&lt;IMG alt="Parallel code 2" src="http://dl.dropbox.com/u/258337/code2.png" height="159" width="700" /&gt;&lt;BR /&gt;&lt;BR /&gt;I can provide any screenshots and other info from VTune, or run any profiling you need. I use Fortran 95 with the latest Intel compiler on a Linux (Ubuntu) machine.&lt;BR /&gt;&lt;BR /&gt;Any comments (on the problem or in general) on how to optimise the parallelisation are welcome! The incentive for parallelising is that the middle do-loop is going to become more intensive soon with more detailed models; I expect the work done there to exceed 50% of the total.&lt;BR /&gt;&lt;BR /&gt;Thanks in advance,&lt;BR /&gt;Petros&lt;BR /&gt;</description>
      <pubDate>Sun, 08 May 2011 06:56:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827373#M1412</guid>
      <dc:creator>Petros</dc:creator>
      <dc:date>2011-05-08T06:56:22Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827374#M1413</link>
      <description>There's nothing visibly wrong with your first version of OpenMP parallel do. An Intel OpenMP thread team stays active for the time interval set by KMP_BLOCKTIME (default 0.200 seconds), so it's difficult to see what you expected to accomplish differently with the second version. Your openmp-profile report would shed more light on what is happening in libiomp5; you could simply set LD_PRELOAD=&amp;lt;ifort library path&amp;gt;/libiompprof5.so.&lt;BR /&gt;Did you consider trying Inspector to see whether you have unintended data sharing between threads?&lt;BR /&gt;Do you get any parallel speedup from the OpenMP internal to MKL DGETRS when you turn off OpenMP in your calling code?</description>
      <pubDate>Sun, 08 May 2011 10:39:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827374#M1413</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-05-08T10:39:19Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827375#M1414</link>
      <description>Petros,&lt;BR /&gt;&lt;BR /&gt;Some portions of code work best when parallelization occurs inside MKL, while others work best when parallelization is performed outside MKL (with MKL restricted to one thread). Performing parallelization both inside and outside MKL usually does not work well. From the MKL docs:&lt;BR /&gt;&lt;B&gt;WARNING.&lt;/B&gt; It is not recommended to simultaneously parallelize your program and employ the Intel MKL internal threading because this will slow down the performance. Note that in case "d" above, DFT computation is automatically initiated in a single-threading mode.&lt;BR /&gt;&lt;BR /&gt;Try the following experiment:&lt;BR /&gt;&lt;BR /&gt;!$omp parallel&lt;BR /&gt;call mkl_set_num_threads(1) ! see note below code snip&lt;BR /&gt;do while(not converged)&lt;BR /&gt;&lt;BR /&gt;!$omp single&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (prepare data)&lt;BR /&gt;!$omp end single&lt;BR /&gt;&lt;BR /&gt;!$omp do&lt;BR /&gt;do i=1, "several thousands"&lt;BR /&gt;CALL DGETRS(data of i)&lt;BR /&gt;enddo&lt;BR /&gt;!$omp end do&lt;BR /&gt;&lt;BR /&gt;!$omp single&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (check convergence)&lt;BR /&gt;!$omp end single&lt;BR /&gt;&lt;BR /&gt;enddo&lt;BR /&gt;!$omp end parallel&lt;BR /&gt;&lt;BR /&gt;Note (TimP may be able to answer this): what I do not know is whether mkl_set_num_threads(n) is global or thread-local.
Because of that uncertainty, I placed "call mkl_set_num_threads(1)" inside the parallel region (it should be benign to reset it repeatedly).&lt;BR /&gt;&lt;BR /&gt;Also note, if you have other sections of code that work better with MKL internally parallelized, you will have to reset the MKL number of threads. And there are other, more complex, ways of managing threads between your application and MKL.&lt;BR /&gt;&lt;BR /&gt;As to whether DGETRS works best with parallelization inside MKL or outside MKL, I cannot say. The experiment should be easy to run.&lt;BR /&gt;&lt;BR /&gt;From looking at the chart, a significant portion of time is spent inside MKL in start_thread. This would seem to indicate that for your usage of DGETRS, parallelization outside of MKL might be favorable.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Sun, 08 May 2011 15:35:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827375#M1414</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-05-08T15:35:31Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827376#M1415</link>
      <description>I suppose the reason for supporting MKL_NUM_THREADS as well as OMP_NUM_THREADS is to permit independent specification for MKL functions in an application which also has its own OpenMP parallel. I haven't seen documentation, but I suppose either set_num_threads option takes effect for the next parallel region (next MKL function call counts); I doubt that mkl_set_num_threads is thread local, but it's an interesting question.&lt;BR /&gt;As Jim points out, calling threaded MKL inside a parallel region could be problematical, and the MKL team doesn't recommend ways of doing it, not that you couldn't experiment with this form of OMP_NESTED. Among other things, you would want to set omp parallel num_threads and MKL_NUM_THREADS such that the product doesn't oversubscribe the resources. Affinity could be tricky, even with combined use of MKL_AFFINITY and KMP_AFFINITY.&lt;BR /&gt;It may be that extreme MKL thread overhead is due to an excessive total number of threads. By setting MKL_NUM_THREADS=1, you should get nearly as good performance as with mkl_sequential library linkage. As Jim said, it's common for parallelism outside MKL to prove superior to inside MKL, but this depends on your data set, how it fits cache, and I suppose for DGETRS whether the number of right hand sides is sufficient to use all your cores efficiently.</description>
      <pubDate>Sun, 08 May 2011 19:18:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827376#M1415</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-05-08T19:18:54Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827377#M1416</link>
      <description>The decision whether to use parallelization inside MKL or outside MKL should be taken on a case-by-case basis rather than a function-by-function basis. Take, for example, matrix multiply. If you had many different matrices to multiply, I would postulate that below some size, parallelization outside MKL would be more suitable, and above some size, parallelization inside MKL would be better. Also, on some configurations, say multi-processor/socket, a combination where each socket runs a separate thread that calls MKL, which then parallelizes within the socket, might work best, although I do not know if MKL has that degree of flexibility (it may have).&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Mon, 09 May 2011 14:30:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827377#M1416</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-05-09T14:30:19Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827378#M1417</link>
      <description>First of all, thank you very much for devoting the time to help me!
&lt;DIV&gt;I link the project with this:&lt;/DIV&gt;&lt;DIV&gt;&lt;CODE&gt;-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_lapack95_lp64 -liomp5 -lma37&lt;/CODE&gt;&lt;/DIV&gt;&lt;DIV&gt;(ma37 is a sparse solver used elsewhere).&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;The individual problems in DGETRS are not big enough to benefit from the threaded solvers (even in the serial version of the program, running the threaded MKL doesn't offer anything!).&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;I ran Inspector and it didn't find any data races or other problems.
&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Going into the parallel code, I see that I have a big delay on the parallel directive:&lt;/DIV&gt;&lt;DIV&gt;&lt;CODE&gt;Line	Source	CPU Time	Overhead Time	Wait Time
653	!$omp parallel do &amp;amp;	12.054s	49.952ms	14.241s&lt;/CODE&gt;&lt;/DIV&gt;&lt;DIV&gt;But also on a totally innocuous line:&lt;/DIV&gt;&lt;DIV&gt;&lt;CODE&gt;Line	Source	CPU Time	Overhead Time	Wait Time
687	if(dabs(x(a(j)+k-1)) &amp;gt; 1)f(a(j)+k-1)=bt*f(a(j)+k-1)/(x(a(j)+k-1)*bt2)	7.618s&lt;/CODE&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;BR /&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="white-space: pre;"&gt;x, f and a are shared vectors, but each parallelised loop affects a different set of elements of these vectors, i.e. the first loop touches elements 1-3, the second 4-8, etc.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="white-space: pre;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="white-space: pre;"&gt;Do you think this could create the big overhead? Should I break the big vectors into small individual ones?&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="white-space: pre;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="white-space: pre;"&gt;Petros&lt;/SPAN&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 09 May 2011 14:52:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827378#M1417</guid>
      <dc:creator>Petros</dc:creator>
      <dc:date>2011-05-09T14:52:47Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827379#M1418</link>
      <description>If different threads are updating data in the same cache line, it can create a large false-sharing overhead. It may be improved if you can arrange (e.g. by setting affinity) for most cache-line sharing to be local to a CPU.&lt;BR /&gt;This is a case where hyperthreading may help: two threads on the same core can tolerate updating elements less than 64 bytes apart.</description>
      <pubDate>Mon, 09 May 2011 18:32:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827379#M1418</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-05-09T18:32:44Z</dc:date>
    </item>
  </channel>
</rss>