Re: Should I implement parallel MKL along with OpenMP?

BiswasB · ‎09-15-2022

Hello everyone,

I am trying to incorporate MKL into an industrial HPC solver, and I am not sure in which scenarios MKL is faster. It makes sense that MKL will be faster for heavy-duty work such as sparse or Level 3 BLAS routines.

But what about when I have to apply level 1 or 2 routines? I read this thread, and it shows that MKL does not always parallelize vector addition/subtraction depending upon the size of the entity. But then if I try to use OMP pragmas to parallelize these small routines, in the same code which is using MKL for higher routines- it won't give optimum performance as was apparent from the comments (by @jimdempseyatthecove & @TimP ) on the same thread (due to thread SpinWaiting). So how am I supposed to use parallelized MKL in this situation? Is there a way to force it? (My problem size is huge, but still I want to ensure it). Also, how do I know if the version of the functions I'm using is threaded or not (in mkl.h)?

Dev

tl;dr: want optimum performance for the lower routines with 1D vectors while using MKL for higher routines

jimdempseyatthecove · ‎09-15-2022

>> But then if I try to use OMP pragmas to parallelize these small routines, in the same code which is using MKL for higher routines- it won't give optimum performance as was apparent from the comments (by @jimdempseyatthecove & @TimP ) on the same thread (due to thread SpinWaiting). So how am I supposed to use parallelized MKL in this situation? Is there a way to force it?

You can set the environment variable KMP_BLOCKTIME=0 (or to some small value found by experimentation). The default is 200 ms.

Alternatively, you can split the hardware threads between OpenMP and MKL via OMP_NUM_THREADS and MKL_NUM_THREADS.

Alternative 2:

Multi-thread the application using OpenMP and link with the MKL Sequential Library. This will require you to partition your data via OpenMP (or by hand) and then from each OpenMP thread call the MKL procedure of interest. This will give you the best control over threading at the cost of some additional work.
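A minimal sketch of this partition-and-call pattern, not taken from the thread: the vector is split into contiguous chunks and each OpenMP thread runs a sequential kernel on its chunks. The helper axpy_chunk is a hypothetical stand-in for a call into the sequential MKL library (for example cblas_daxpy), and axpy_partitioned and nchunks are illustrative names.

```c
#include <stddef.h>

/* Hypothetical stand-in for a sequential library kernel such as
   cblas_daxpy from the sequential MKL library: y = a*x + y. */
static void axpy_chunk(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* y = a*x + y, split into nchunks contiguous pieces. Each OpenMP thread
   takes whole chunks and calls the sequential kernel on them. */
void axpy_partitioned(size_t n, double a, const double *x, double *y,
                      int nchunks)
{
    if (nchunks < 1)
        nchunks = 1;
    size_t base = n / (size_t)nchunks;  /* minimum chunk length */
    size_t rem  = n % (size_t)nchunks;  /* first rem chunks get one extra */

    #pragma omp parallel for schedule(static)
    for (int c = 0; c < nchunks; ++c) {
        size_t uc    = (size_t)c;
        size_t start = uc * base + (uc < rem ? uc : rem);
        size_t len   = base + (uc < rem ? 1 : 0);
        axpy_chunk(len, a, x + start, y + start);
    }
}
```

Built without OpenMP support the pragma is ignored and the loop runs serially with the same result, which makes it easy to compare the two builds.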

Note, if you were to use the MKL threaded library from within an OMP parallel region, each (outer) OpenMP thread would instantiate its own MKL/OpenMP thread team (i.e., a nested OpenMP level). In this case you do not program the nested levels directly with OpenMP directives/pragmas; rather, you generally partition the data at the outer level using OpenMP in the application, and MKL then sub-partitions it. Using nested levels could potentially be the way to go, but this would depend on the overall needs of the application.

I suggest you experiment with each method.

Jim Dempsey

BiswasB · ‎09-16-2022

Thank you for the concise reply.

I will try out Alternative 1, since Alternative 2 sounds a bit improbable, given the scale and size of the solver as of now.

But coming back to the other question: how do I ensure that 1D MKL functions will be threaded? What is the smallest size of vector needed? And although my vector size is huge, is there still a way to force parallelization in MKL functions? I just want the optimum performance, and refrain from using OpenMP at all with MKL if possible.

Devmalya Biswas

jimdempseyatthecove · ‎09-16-2022

>>how do I ensure that 1D MKL functions will be threaded?

You link with the threaded MKL library. MKL will decide when it is beneficial to thread the call.

>>What is the smallest size of vector needed?

MKL (threaded library) decides at what size threading will be beneficial.

>> is there still a way to force parallelization in MKL functions?

I don't know, but (performance-wise) it wouldn't make sense to internally parallelize an MKL call that has been determined to be ineffective to be parallelized.

>> I just want the optimum performance, and refrain from using OpenMP at all with MKL if possible.

Depending on your application this might not be possible. In most cases of substantial code there are highly computational procedures that do not/will not/cannot use MKL but will benefit from parallelization. IOW if you want maximum performance, you will have to do some work.
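For an in-house Level 1 style routine, the same "thread only when it pays off" policy can be sketched with the OpenMP if() clause. This is an illustration only, not MKL's internal heuristic; vec_add and PAR_THRESHOLD are hypothetical names, and the cutoff value is an assumption that has to be measured on the target machine.

```c
/* Assumed cutoff below which the loop stays serial; the right value is
   machine- and kernel-specific and has to be found by measurement. */
#define PAR_THRESHOLD 100000L

/* z = x + y. The if() clause keeps short vectors on the calling thread,
   so small calls do not pay for waking the thread team. */
void vec_add(long n, const double *x, const double *y, double *z)
{
    #pragma omp parallel for schedule(static) if(n >= PAR_THRESHOLD)
    for (long i = 0; i < n; ++i)
        z[i] = x[i] + y[i];
}
```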

Jim Dempsey

BiswasB · ‎09-16-2022

>> I don't know, but (performance-wise) it wouldn't make sense to internally parallelize an MKL call that has been determined to be ineffective to be parallelized.

That makes sense.

Does the same happen for Level 3 BLAS functions as well?

Calling one function multiple times would mean opening and closing the threads with all the overhead added, be it for 1D functions or 3D. Does MKL consider this, and multithread only if the parallel function + overhead times are in total less than the sequential times (for all levels of BLAS)?

If yes, would it be safe to call only MKL functions and not worry about overheads? I have to consider the cons of threading (due to overheads) if I'm calling an in-house 1D function; does this not hold with MKL?

Devmalya Biswas

jimdempseyatthecove · ‎09-17-2022

>>Does the same happen for Level 3 BLAS functions as well?

If you use the MKL equivalent BLAS functions, then yes; for a third-party BLAS I cannot say.

>>Calling one function multiple times would mean opening and closing the threads with all the overhead added

That is not how it (OpenMP/MKL threading) works. MKL uses OpenMP internally.

OpenMP (both in your own code and internally in MKL) uses the concept of thread team(s). Note the possibility of plural. Each thread team is created only once. This happens (or is inhibited) on the first call from within the context of the calling thread. Subsequent calls from the same thread and context reuse the same threads. On a subsequent call, the team's non-calling threads may be in a spin-wait state (if the intermission was shorter than KMP_BLOCKTIME) or may have been suspended (waiting for an event/condition variable). When in spin-wait, a non-running team member has negligible overhead to begin the work. When it is waiting for an event/condition, the O/S thread is already set up, so the overhead is relatively low. Also, should no other process have used that same hardware thread, the core's cache levels may still hold data for use by the thread. Overhead in this case is considerably less than for starting a new thread.

Now then, additional information. OpenMP permits or denies nested parallel regions. Permission for nesting is enabled via environment variable or function call prior to first call into the OpenMP runtime system (as used by the process). When enabled, each thread has the provisions to maintain a thread team context for each nest level for that thread. Think of this as a tree.

Initially the process starts with a single thread. When that thread runs into a parallel region for the first time, the team context is examined and found to have no thread team established, so the thread incurs the overhead of creating a thread team and then starts the threads working on the parallel region. Subsequent executions of that parallel region from the same thread at the same nest level reuse the established thread team, thus saving the initiation overhead.

IOW, when nesting is enabled and the number of nest levels permits, something like this happens:

main process thread prior to OpenMP usage by self or MKL
loop:
  code
  parallel region ! 1st level, create team 1st call, reuse team next call
    parallel code
    parallel region ! 2nd level, create team 1st call, reuse team next call
      nested parallel code
      parallel region ! 3rd level, ...
        nested nested parallel code
        ...

Assume a system has 32 hardware threads available and you desire to code using one nest level. (generally) Optimal thread usages are:
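2 at 1st level, 16 at nest level
3 at 1st level, 10 at nest level
4 at 1st level, 8 at nest level
5 at 1st level, 6 at nest level
6 at 1st level, 5 at nest level
8 at 1st level, 4 at nest level
10 at 1st level, 3 at nest level
16 at 1st level, 2 at nest level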

Note how the product of the thread counts across levels does not exceed the number of hardware threads. For two nest levels the product would have 3 terms (1st level, 1st nest, 2nd nest).
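One way to read the list is that each pair is "maximal": neither level can gain a thread without the product going over the hardware thread count. The sketch below enumerates such pairs under that assumed reading; nested_splits and its parameters are hypothetical names, not part of OpenMP or MKL. For 32 hardware threads it yields the same eight pairs as listed.

```c
/* Fill outer[]/inner[] with every two-level split (both levels >= 2)
   where neither level can grow by one thread without the product
   exceeding hw_threads. Returns the number of pairs written, which is
   capped at max_pairs. */
int nested_splits(int hw_threads, int *outer, int *inner, int max_pairs)
{
    int count = 0;
    for (int a = 2; a <= hw_threads / 2 && count < max_pairs; ++a) {
        int b = hw_threads / a;               /* largest inner team for a */
        if (b >= 2 && hw_threads / b == a) {  /* a is also maximal for b */
            outer[count] = a;
            inner[count] = b;
            ++count;
        }
    }
    return count;
}
```

The resulting counts could then be passed to the outer level (for example through OMP_NUM_THREADS) and to MKL (through MKL_NUM_THREADS), and each candidate split benchmarked.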

This said, with careful tuning you can have the product exceed the number of hardware threads available. This is something you should not approach lightly.

Jim Dempsey

BiswasB · ‎09-28-2022

Hey Jim,

I apologize for the late reply. I was making myself familiar with the OpenMP execution model.

>> OpenMP (by your programming and internal to MKL) uses the concept of thread team(s). Note possibility of plural. A (each) thread team is created but once. This happens (or inhibited) on the first call from within the context of the calling thread. Subsequent calls from the same thread and context will reuse the same threads. On the subsequent call the particular team's non-calling threads may be in a state of spin-wait (intermission shorter than KMP_BLOCKTIME) or may have been suspended (waiting for event/condition variable).

As far as I can tell, this is the fork-join model. The threads go into spin-wait and are not available to any other non-OpenMP threaded code until the timeout. My question is: if I am using Intel's OpenMP library itself, would it not be better to set KMP_BLOCKTIME to infinite, so that the team threads are always in spin-wait and there's almost zero overhead whenever the master thread encounters another parallel construct inside some MKL function?

And are there really no other overheads except memory sharing and barriers? If I keep all the threads synchronized and keep opening and closing parallel constructs continuously, will there really be no overhead other than the implicit barriers (which are minimal due to sync)? If you check here, it says to use parallelization on the biggest for loops when not nesting them- why is it so? Purely because of the implicit barrier overheads?

Also, I appreciate you taking so much time to answer these questions.

Regards,

Dev

BiswasB · ‎10-08-2022

@jimdempseyatthecove Tagging you here in case you were not notified of the reply! Please let me know your conclusions on the infinite KMP_BLOCKTIME value and on the overheads and their reduction.

Dev

burnerop · ‎06-11-2023

Yes, you can implement parallel MKL along with OpenMP to optimize performance in your industrial HPC solver. By combining the parallelization capabilities of both MKL and OpenMP, you can potentially achieve improved efficiency and speed in your computations.

To implement this, you can follow these steps:

1. Identify the parts of your code that involve computationally intensive operations where MKL routines are used.

2. Determine the sections of the code where you can apply parallelism using OpenMP directives. This could be for parts of the code that involve level 1 or level 2 routines, as you mentioned.

3. Use OpenMP pragmas or directives to parallelize the identified sections of the code. This will enable multiple threads to execute the computations concurrently.

4. Ensure proper synchronization and thread management to avoid conflicts or performance bottlenecks. Consider factors such as load balancing, data dependencies, and thread settings to optimize performance.

5. Test and benchmark your implementation to measure the performance gains achieved by combining parallel MKL with OpenMP. This will help you assess the effectiveness of the parallelization approach for your specific scenarios and make any necessary adjustments.

Note that the performance improvement will depend on various factors such as the nature of your computations, the workload characteristics, and the hardware architecture. It's recommended to experiment, monitor performance, and fine-tune your implementation to achieve the best results for your specific use case.