Convolution transformation using SDCON

Ouissem_B · ‎04-26-2016

Hi all!

I'm using the SDCON routine to perform a convolution transformation in Fortran, and I have to admit that it performs far better than what we had written ourselves.

I did a simple profiling of my code to check which part is the heaviest in terms of CPU time. It turned out to be the part including the convolution calculation, essentially because I'm calling the SDCON routine 7 times each time step!

I'm wondering whether a parallel (or multi-threaded) version of it exists. I'm looking to improve the performance of my code, and it would be really helpful.

Thank you very much for helping me.

With best regards,

Ouissem

Zhang_Z_Intel · ‎04-26-2016

MKL does have equivalent functions to SDCON and many other convolution routines. The API and usage model are different, but performance is way better. Take a look at:

https://software.intel.com/en-us/node/521901

Like many other routines in MKL, these routines are multithreaded. At run time, MKL chooses the number of threads to use based on the array size passed to a call and the number of physical cores on your system. You can control threading by setting MKL_DYNAMIC=0 and setting MKL_NUM_THREADS to a number of your choice. This is about parallel execution of individual routine calls.

If you want to parallelize multiple SDCON calls with multiple threads such that each thread executes a subset of all calls, then it's outside the scope of MKL. You'll need to write user-level multithreading code to spawn OpenMP threads yourself.

Ouissem_B · ‎04-27-2016

Hello Zhang,

Thank you for your reply. I'll check the MKL_DYNAMIC and MKL_NUM_THREADS parameters.

Best regards,

Ouissem

Ouissem_B · ‎04-29-2016

Hello,

I've tried varying MKL_NUM_THREADS and ran a couple of tests with different numbers of threads. I came to the conclusion that my Parallel Elapsed CPU time is actually NUM_THREADS * Sequential CPU time, which means that my code seems to scale very poorly!

Ouissem

Ouissem_B · ‎04-29-2016

Hi Zhang,

In my code, I'm using 2 MKL calls: One for a Cubic Spline interpolation and one for a convolution with SDCON.

I tried to run it using only one MKL call (either the Spline interpolation or the SDCON convolution).

The /Qmkl:parallel option is enabled, and I set MKL_DYNAMIC=1 to let MKL choose the right number of threads to use.

Monitoring the CPU usage with the Task Manager, I found that when calling the Spline interpolation, all 4 threads of my CPU are used (almost 100% CPU usage), while only 1 thread is active when calling the SDCON convolution (barely 25% CPU usage).

Does that mean that the SDCON convolution does not actually take advantage of the multi-threading capability of MKL, or am I forgetting something in my options?

Thank you for your help.

Ouissem

Zhang_Z_Intel · ‎04-29-2016

Sorry, the statement in my earlier post wasn't accurate. You're correct that the MKL 1D convolution routines are not threaded. You'll have to use application-level threading to get parallel execution. See https://software.intel.com/en-us/node/521925 for "Usage Examples". Pay attention to the "Using Multiple Threads" section at the end of the page.
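As a rough illustration of application-level threading, here is a minimal Fortran sketch that spreads several independent convolutions over OpenMP threads, one call per loop iteration. It is not taken from the MKL documentation: `direct_conv` is a hypothetical stand-in for the library call so that the example compiles without MKL, and the sizes (`nh`, `nx`) and the count of 7 calls are only illustrative. It assumes the calls are independent of one another, i.e. each one reads and writes its own arrays.

```fortran
program parallel_convolutions
  implicit none
  integer, parameter :: ncalls = 7             ! e.g. seven convolutions per time step
  integer, parameter :: nh = 16, nx = 1000     ! illustrative kernel and signal lengths
  integer, parameter :: ny = nh + nx - 1       ! length of a full linear convolution
  real :: h(nh, ncalls), x(nx, ncalls)
  real :: y(ny, ncalls), yref(ny, ncalls)
  integer :: k

  call random_number(h)
  call random_number(x)

  ! Serial reference: one convolution after another.
  do k = 1, ncalls
     call direct_conv(h(:, k), x(:, k), yref(:, k))
  end do

  ! Application-level threading: each iteration is an independent call that
  ! only touches column k, so iterations can be handed to different threads.
  !$omp parallel do default(shared) private(k) schedule(static)
  do k = 1, ncalls
     call direct_conv(h(:, k), x(:, k), y(:, k))
  end do
  !$omp end parallel do

  ! Each call performs the same operations in the same order as in the
  ! serial loop, so the results should match exactly.
  if (maxval(abs(y - yref)) > 0.0) error stop 'parallel result differs from serial'
  print *, 'all', ncalls, 'convolutions match the serial reference'

contains

  ! Hypothetical stand-in for the library convolution call (NOT the SDCON API):
  ! plain direct convolution, y(n) = sum over i of h(i) * x(n - i + 1).
  subroutine direct_conv(h, x, y)
    real, intent(in)  :: h(:), x(:)
    real, intent(out) :: y(:)
    integer :: i, j
    y = 0.0
    do j = 1, size(x)
       do i = 1, size(h)
          y(i + j - 1) = y(i + j - 1) + h(i) * x(j)
       end do
    end do
  end subroutine direct_conv

end program parallel_convolutions
```

This can be built with OpenMP enabled, for example `gfortran -fopenmp` or `ifort /Qopenmp`; without the flag the directives are treated as comments and the loop simply runs serially. With only 7 calls per step, the speedup is bounded by how evenly 7 iterations divide among the available threads. When replacing the stand-in with the real library routine, the "Using Multiple Threads" section linked above is the place to check what the library requires of concurrent callers.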

Ouissem_B · ‎05-04-2016

Hi Zhang,

I've managed to parallelize, using OpenMP, the part of the code dealing with the SDCON call (actually, it's the most expensive part in terms of CPU). It works just fine, and it's very suitable as a first attempt.

I was wondering if there is an equivalent to the "CALL OMP_SET_NUM_THREADS ()" call for setting MKL_NUM_THREADS within the code, instead of setting the environment variable at run time?

Thanks a lot for your help.

With regards,

Ouissem
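For reference, MKL's support functions include `mkl_set_num_threads` and `mkl_set_dynamic`, which play the same role as the MKL_NUM_THREADS and MKL_DYNAMIC environment variables. The sketch below assumes the program is linked against MKL (for example with /Qmkl:parallel); the thread count of 4 is only an illustrative value. The routines are declared `external` here to keep the example short; MKL also ships Fortran interface files that could be included instead.

```fortran
program set_mkl_threads
  implicit none
  ! MKL support routines, resolved at link time (requires linking with MKL).
  external :: mkl_set_num_threads, mkl_set_dynamic
  integer, external :: mkl_get_max_threads

  ! Equivalent of MKL_DYNAMIC=0: ask MKL not to adjust the thread count itself.
  call mkl_set_dynamic(0)

  ! Equivalent of MKL_NUM_THREADS=4 (illustrative value).
  call mkl_set_num_threads(4)

  print *, 'MKL will use up to', mkl_get_max_threads(), 'threads'
end program set_mkl_threads
```

These settings only affect MKL routines that are threaded internally, such as the spline interpolation mentioned earlier in the thread. They do not make the 1D convolution calls run in parallel, so the OpenMP loop around those calls is still needed.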
