MKL performance decreases in a multithreaded application on a NUMA memory architecture

mmmmm__hamed · ‎01-10-2015

Hi,

I have a server with 80 logical cores, a NUMA memory architecture, and Windows Server 2012.

I want to create one thread per logical core, with each thread executing FFT, convolution, etc. independently.

With fewer than 5 threads, CPU usage of each involved core is 100%.

But when I create more than 5 threads in one NUMA group, CPU usage starts to decrease: with each added thread, the CPU usage of all cores drops slightly.

In the end, with 80 threads, CPU usage of all cores is between 30 and 40 percent.

I also found that when I reduce the input size of the MKL FFT, CPU usage of the cores gets a little better.

I do not do any memory operations in my code (only 3 allocations and 3 frees), so this problem is not related to memory management or serialized heap behavior.

My pseudocode in each thread looks like this:

void workerthread() {
    mkl_complex* A = (mkl_complex*)malloc(sizeof(mkl_complex) * 8092);   // memory for the 1st input, allocated once
    mkl_complex* B = (mkl_complex*)malloc(sizeof(mkl_complex) * 16000);  // memory for the 2nd input, allocated once
    int ConvResLen = 8092 + 16000 - 1;
    mkl_complex* ConvRes = (mkl_complex*)malloc(sizeof(mkl_complex) * ConvResLen);  // memory for the result, allocated once

    for (int i = 0; i < 1000; i++) {
        // internally uses vslzConvNewTask1D(), vslConvSetStart() and vslzConvExec1D()
        execute_mkl_convolution(A, B, ConvRes);
    }

    free(A); free(B); free(ConvRes);
}
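For reference, the result length in the pseudocode follows from the definition of a full linear convolution: inputs of length n and m give n + m - 1 outputs, here 8092 + 16000 - 1 = 24091. The sketch below is a direct O(n*m) reference in standard C++, not the MKL routine; the name direct_convolution is hypothetical, and std::complex<double> is assumed as a stand-in for the double-precision complex type implied by the vslz* calls. It is only meant for checking small cases.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Full (linear) convolution of a and b. The result has
// a.size() + b.size() - 1 elements, which matches ConvResLen above.
// Direct O(n*m) reference for small inputs; this is not how MKL computes it.
std::vector<std::complex<double>> direct_convolution(
    const std::vector<std::complex<double>>& a,
    const std::vector<std::complex<double>>& b) {
    if (a.empty() || b.empty()) return {};
    std::vector<std::complex<double>> out(a.size() + b.size() - 1);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            out[i + j] += a[i] * b[j];  // each product lands at index i + j
    return out;
}
```

Comparing the MKL output against such a reference on short inputs is one way to confirm that every worker thread produces correct results before looking at performance.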

I think MKL's internal memory management may be incompatible with NUMA architectures, because I did not encounter this problem on a non-NUMA (UMA) server (a 16-core server without a NUMA architecture).

Is there a way to enable NUMA support in MKL so that it uses 100 percent of the CPU?

Thanks.

TimP · ‎01-10-2015

You need to read up on the MKL default of at most 1 thread per physical core, and on KMP_AFFINITY or OMP_PROC_BIND.

mmmmm__hamed · ‎01-11-2015

Hi, dear Tim Prince,

If you mean that I should assign each thread to one logical core: I already do this, and I set thread affinity so that each thread runs on its own core.

I have 2 NUMA groups; each group has 2 NUMA nodes, and each NUMA node has 20 logical cores.

I distribute the threads across groups and nodes correctly: 40 threads per group, 20 threads per NUMA node, and 1 thread per logical core.
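The distribution described above (40 threads per group, 20 per node, 1 per logical core) can be sketched as plain index arithmetic. This is a minimal sketch, assuming the logical cores of each NUMA node are numbered contiguously within their group, which has to be checked against the real topology; the names Placement and place_thread are hypothetical and not taken from the attached program.

```cpp
#include <cstdint>

// Layout taken from the post: 2 processor groups, 2 NUMA nodes per group,
// 20 logical cores per node. Contiguous numbering of node cores inside a
// group is an assumption, not something the post confirms.
constexpr int kGroups = 2;
constexpr int kNodesPerGroup = 2;
constexpr int kCoresPerNode = 20;
constexpr int kCoresPerGroup = kNodesPerGroup * kCoresPerNode;  // 40
constexpr int kTotalThreads = kGroups * kCoresPerGroup;         // 80

struct Placement {
    int group;                 // processor group index
    int node_in_group;         // NUMA node index inside the group
    int core_in_node;          // logical core index inside the node
    std::uint64_t group_mask;  // single bit: logical core index inside the group
};

// Maps worker index t (0 .. kTotalThreads - 1) to exactly one logical core.
Placement place_thread(int t) {
    Placement p;
    p.group = t / kCoresPerGroup;
    const int in_group = t % kCoresPerGroup;
    p.node_in_group = in_group / kCoresPerNode;
    p.core_in_node = in_group % kCoresPerNode;
    p.group_mask = std::uint64_t{1} << in_group;
    return p;
}
```

On Windows, group and group_mask correspond to the Group and Mask fields of the GROUP_AFFINITY structure passed to SetThreadGroupAffinity. Keeping the mapping in one small function makes it easy to print the placement of all 80 threads and compare it with what the OS reports.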

Could you answer more clearly and give me more detailed information?

Thank you, Tim.

TimP · ‎01-11-2015

Among the apparent issues would be the default setting of the MKL_DYNAMIC environment variable. If you wish to verify performance with another setting, you will need to override it explicitly. But you will need to understand the distinction between logical and physical cores, and how MKL is optimized for full performance at 1 thread per physical core.

mmmmm__hamed · ‎02-08-2015

Hi, dear Tim Prince,

Excuse me for the long delay; I was busy with something else.

As you may remember, I have a problem (low CPU usage) with MKL on a NUMA server: with each thread I create, the CPU usage of all threads drops. I assign each thread to its own logical core by setting affinity.

After your description I read more and did the following:

1. Used mkl_intel_thread_dll.lib instead of mkl_sequential_dll.lib.

2. Previously I used the Visual C++ compiler; I changed to the Intel C++ compiler.

3. Created a KMP_AFFINITY system environment variable with this value: granularity=fine,compact,1,0

4. Called mkl_set_num_threads with various values.

But there was no change in performance or CPU usage.

Can you elaborate, or give me a link or document to read?

Thank you very much.

TimP · ‎02-08-2015

https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-intel-mkl-100-threading

To go further with this thread you may need to be less vague about your platform etc.

mmmmm__hamed · ‎02-10-2015

Hi Tim Prince,

Thanks for your help.

I read the link you sent, as well as these links:

https://software.intel.com/en-us/articles/recommended-settings-for-calling-intelr-mkl-routines-from-multi-threaded-applications

https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-using-intel-mkl-with-threaded-applications

https://software.intel.com/en-us/articles/using-threaded-intel-mkl-in-multi-thread-application

and found these points:

MKL has internal threading, and when my application is multithreaded I must turn off Intel MKL threading, either by using the sequential library or by setting MKL_NUM_THREADS.

I used the sequential library (mkl_sequential_dll.lib), but CPU usage was unchanged.

Then I used the parallel library (mkl_intel_thread_dll.lib) and, before starting my own threads in main, called the functions below to turn off MKL's internal threads:

mkl_set_num_threads(1);

omp_set_num_threads(1);

mkl_domain_set_num_threads(1, MKL_DOMAIN_FFT); // (I use MKL convolution)

But no change in CPU usage occurred.

Is there anything else I should consider?

Are you sure MKL works correctly on NUMA architectures? (On UMA I don't have this problem.) Can you test MKL on a NUMA server?

I have attached my small sample program (a simple app that only creates multiple threads, each of which only executes a convolution). Please check it; I would be grateful.

I am very invested in this issue; please help me get further.

I use MKL 11.2.

TimP · ‎02-10-2015

MKL might be expected to observe OMP_NESTED, thus not using additional threads when called within an omp parallel for. NUMA placement of threads depends on the omp_place_threads or KMP_AFFINITY setting, preferably at the top-level threaded loop. I will look at your sample later.

TimP · ‎02-11-2015

I didn't see anywhere in your previous posts that you were calling MKL from Windows threads (not using OpenMP except for the omp_set_num_threads() function and MKL itself). I agree that if you wish to do this, /Qmkl:sequential seems to be the way to go. So my comment about OMP_NESTED doesn't apply, but I don't know whether we can resolve your questions about how mkl:parallel will work.

If you must use Windows threads, questions arise which probably aren't topical on this forum and I'm not prepared to learn about.

In order to get started, I commented out your stdafx.h and changed _tmain etc. to standard code, besides removing calls to set number of threads in OpenMP and MKL.

Your code generates too many threads (total number of hyperthread logical processors) on my platform, even with mkl:sequential. This is why I have a strong preference for examples using environment variables in the usual ways to control number of threads and affinity.

I haven't figured out how to shorten your case or find out why it doesn't complete on my platform even with the loop count shortened.

mmmmm__hamed · ‎02-11-2015

Hi Tim,

Thank you for the time you have spent on this.

As you noticed, I use Windows threads. Do you think that if I used another threading method, my MKL problem on NUMA would be solved?

My NUMA server has 80 logical cores, which is why I create so many threads (80), even with MKL sequential. Each thread has its own input and executes a convolution on it.

In my code I have a threadworker() function containing a loop with 10000 iterations. If you reduce that count, the program can complete.

My server has two NUMA groups, each with 40 logical cores. Because it has 80 logical cores I create 80 threads; you can change the number of threads to fit your platform by changing the GroupCount and GroupCore variables.

What is your platform? Do you have access to a NUMA server for testing the code?

And if you suggest I post this topic on another forum, please tell me which forum is best for my problem.

TimP · ‎02-12-2015

I don't have access to a Windows E7 style box except occasionally at a customer site. I could install Windows on my old dual CPU Intel box, but it's not so interesting.

One prerequisite for satisfactory performance on the 4-CPU Intel boxes is avoiding remote memory access by keeping each block of threads local to one CPU. Your implicit concerns about conflicts between what you do in Windows threads and what MKL wants to do in OpenMP seem well taken. If you use /Qmkl:sequential and manage your own threads, MKL will not attempt to deal with NUMA, so it is left up to you.

You definitely need to provide for adjusting your numbers of threads and affinity consistent with the MKL implementation, even if you intend always to run on the same topology and OS. The easy way is to use OpenMP in your own code as well. As MKL is designed for full performance only when running 1 thread per physical core, you must expect worsening efficiency as you go up in numbers of cores when running more than 1 thread per core.

At one time, there were comments that Intel intended to change MKL from OpenMP compatibility to TBB and Cilk(tm) Plus compatibility, but this seems impractical and was not pursued. As OpenMP for Windows is built on Windows threads, there is potential compatibility, but I don't know if the somewhat obscure features for compatibility between OpenMP and pthreads on Linux have counterparts in Windows.

mmmmm__hamed · ‎02-16-2015

Dear Tim,

I want to raise two points:

1. I saw this link: https://software.intel.com/en-us/articles/intel-mkl-numa-notes

A person named Victor Pasko commented at the end of that page, saying "MKL memory manager is not intended for NUMA model". My understanding of that sentence is that MKL has no NUMA-specific implementation and cannot execute efficiently on a NUMA architecture. I then sent a message to Victor Pasko asking for more details, but he didn't reply. Is my understanding correct, in your opinion? Since I couldn't reach Victor Pasko, would you mind contacting him about further details on MKL on NUMA? Both of you work at Intel, so it should be easy for you to speak with him.

2. Since you don't have a NUMA server to test the code on, I could run TeamViewer on my server so that you or another expert can connect to it and test the code there. Would you or other Intel support staff be willing to do this and test your own product on a NUMA server?

Best regards.

TimP · ‎02-16-2015

Remember that the primary target of MKL benchmarks generally is a dual CPU platform, so it should be possible to get full performance on NUMA. Many applications which use MKL are targeting the more cost effective dual CPU platforms, as (e.g. when using MPI), those may be more attractive than the 4- and 8- CPU platforms, but there isn't necessarily any additional functionality needed to support optimum configuration on the bigger machines.

I think the point being made is that MKL memory management by itself doesn't predict or optimize memory locality. If you get data located by first touch (say on a single CPU), then try to access the data in multiple threads running on different CPUs, you could suffer from remote memory access in some threads.
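The first-touch point can be illustrated with a small sketch in which each worker allocates and initializes its own buffer inside the thread that will later use it, instead of receiving memory prepared by the main thread. This is a minimal sketch in standard C++, assuming the OS places pages by first touch and that each worker has already been pinned (pinning is platform-specific and omitted); the names worker_first_touch and run_workers are hypothetical. A general-purpose heap may also hand back pages that another thread touched earlier, so this does not guarantee locality.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Runs inside a worker thread. The vector is allocated and value-initialized
// here, so its pages are first touched by this thread rather than by main.
double worker_first_touch(std::size_t n) {
    std::vector<double> buf(n);          // allocation and first touch, in this thread
    for (std::size_t i = 0; i < n; ++i)  // fill the input in the same thread
        buf[i] = 1.0;
    return std::accumulate(buf.begin(), buf.end(), 0.0);  // stand-in for real work
}

// Hypothetical driver: starts `threads` workers and collects one result each.
std::vector<double> run_workers(int threads, std::size_t n) {
    std::vector<double> results(static_cast<std::size_t>(threads), 0.0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&results, t, n] {
            results[static_cast<std::size_t>(t)] = worker_first_touch(n);
        });
    for (auto& th : pool) th.join();
    return results;
}
```

The design choice is simply to move allocation and initialization to the thread that consumes the data, so no worker has to reach across CPUs for its own inputs.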

I retired from Intel, so it's not correct to say I work there.

mmmmm__hamed · ‎02-17-2015

Do you have an idea of how to allocate memory locally?

How can I allocate memory local to a CPU?

Is there a special command or some other solution?

I also hope Victor Pasko will participate in our discussion.

Thanks for your patience.
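One possible approach on Windows, offered as a sketch and not as something confirmed in this thread, is to ask the OS for memory with a preferred NUMA node from inside the already-pinned worker thread. GetCurrentProcessorNumberEx, GetNumaProcessorNodeEx, VirtualAllocExNuma and VirtualFree are Windows API functions; the wrapper names alloc_on_local_node and free_local are hypothetical, and the non-Windows branch is only a placeholder so the sketch builds elsewhere.

```cpp
#include <cstddef>
#include <cstdlib>
#ifdef _WIN32
#include <windows.h>
#endif

// Allocates `bytes` with a preference for the NUMA node of the processor the
// calling thread is currently running on. Intended to be called from a worker
// thread after its affinity has been set. Returns nullptr on failure.
void* alloc_on_local_node(std::size_t bytes) {
#ifdef _WIN32
    PROCESSOR_NUMBER proc;
    GetCurrentProcessorNumberEx(&proc);  // group + processor of this thread
    USHORT node = 0;
    if (!GetNumaProcessorNodeEx(&proc, &node)) return nullptr;
    return VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, node);
#else
    return std::malloc(bytes);  // placeholder: no NUMA preference expressed here
#endif
}

// Releases memory obtained from alloc_on_local_node.
void free_local(void* p) {
#ifdef _WIN32
    if (p) VirtualFree(p, 0, MEM_RELEASE);
#else
    std::free(p);
#endif
}
```

In the worker pseudocode from the first post, the three malloc calls would be replaced by alloc_on_local_node and the three free calls by free_local. The node is a preference rather than a guarantee, so touching the buffers in the same thread right after allocation is still advisable.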