On a previous thread, I asked a question about multi-threading and Vipin replied as follows...
2. You can set the OMP_NUM_THREADS variable to the number of logical processors to get a performance improvement. Yes, you are right: you can set it to 2 on a single-processor, dual-core machine. For a single processor with HT turned on, it is again 2, since there are 2 logical processors. On Windows, you can set OMP_NUM_THREADS to NUMBER_OF_PROCESSORS, which is a system variable. Again, you can set this to any number, and you need to check which one gives the best performance, which depends on the different functions, your data size, etc. But it is good to set it to the number of logical processors.
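For concreteness, setting the variable from NUMBER_OF_PROCESSORS as suggested might look roughly like the sketch below (an illustration only; _putenv is MSVC-specific, and the environment has to be set before MKL's first threaded call to have any effect).

```c
/* Sketch: mirror the Windows NUMBER_OF_PROCESSORS system variable into
   OMP_NUM_THREADS before calling MKL.  Equivalent at a command prompt:
       set OMP_NUM_THREADS=%NUMBER_OF_PROCESSORS%                         */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *nproc = getenv("NUMBER_OF_PROCESSORS");  /* e.g. "2" on a HT P4 */
    char setting[64];

    if (nproc != NULL) {
        sprintf(setting, "OMP_NUM_THREADS=%s", nproc);
        _putenv(setting);          /* MSVC CRT; POSIX systems would use setenv() */
    }
    printf("OMP_NUM_THREADS=%s\n", getenv("OMP_NUM_THREADS"));
    return 0;
}
```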
I have done some performance tests on linear least-squares problems (using dgels) and general matrix inverse problems with a variety of sizes of matrices.
Looking at the results, on a Pentium 4 processor with Hyperthreading switched on, it definitely performs better with OMP_NUM_THREADS set to 1, rather than 2.
When OMP_NUM_THREADS is set to 2, you can see it uses close to 100% CPU - i.e. both logical processors are being used to the full. However, the performance is slightly worse than in the OMP_NUM_THREADS=1 case. I am guessing that this will certainly not be the case for a dual-core processor or a full dual-processor system, but for HT processors it would seem that the optimum setting is OMP_NUM_THREADS=1.
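For reference, a minimal sketch of this kind of timing comparison is below (not the exact test code; the dgels_ symbol decoration and the use of omp_get_wtime() are platform/compiler assumptions, and the problem size is just a placeholder). The program is run once with OMP_NUM_THREADS=1 and once with 2, and the wall-clock times compared.

```c
/* Rough timing harness for a dgels least-squares solve. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Fortran LAPACK prototype; name decoration varies by platform/compiler. */
extern void dgels_(const char *trans, const int *m, const int *n,
                   const int *nrhs, double *a, const int *lda,
                   double *b, const int *ldb, double *work,
                   const int *lwork, int *info);

int main(void)
{
    const int m = 2000, n = 500, nrhs = 1, lda = m, ldb = m;
    int info, lwork = -1;
    double wkopt;

    double *a = malloc((size_t)lda * n * sizeof *a);
    double *b = malloc((size_t)ldb * nrhs * sizeof *b);
    for (int i = 0; i < lda * n;    ++i) a[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < ldb * nrhs; ++i) b[i] = rand() / (double)RAND_MAX;

    /* Workspace query, then the timed solve. */
    dgels_("N", &m, &n, &nrhs, a, &lda, b, &ldb, &wkopt, &lwork, &info);
    lwork = (int)wkopt;
    double *work = malloc((size_t)lwork * sizeof *work);

    double t0 = omp_get_wtime();
    dgels_("N", &m, &n, &nrhs, a, &lda, b, &ldb, work, &lwork, &info);
    double t1 = omp_get_wtime();

    printf("dgels (%d x %d): %.3f s, info = %d\n", m, n, t1 - t0, info);
    free(a); free(b); free(work);
    return 0;
}
```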
Have other people found this? Does anyone else have some general advice on setting this variable?
Like I suggested before, maybe there is the possibility of implementing a function in MKL that would return the optimum value of OMP_NUM_THREADS for the specific machine. This would still give the user the chance to set it manually if they want, but would also provide an automated mechanism along the lines of the dynamic DLL selection.
Any thoughts?
Thanks,
Graham
BTW, I am using MKL 9.0 beta, which seems to be performing well generally :)
Whether or not you will actually get a speedup with multiple threads on HT-enabled processors depends a great deal on the application's working set and cache affinity. Each logical processor in an HT system maintains the full architectural state, and so appears to software as a complete processor. But in fact, many hardware resources are shared between the logical processors, including the cache. If your single-threaded (OMP_NUM_THREADS=1) version performs well because it has high cache affinity, for example a hit rate of 95%, then it might suffer a slowdown on two threads, because effectively each thread only gets half the cache.
Since this effect might be due primarily to the data set the application is using, I don't think it's possible for the MKL runtime to know in advance what the optimum number of threads is. On one data set the optimum number might be one; on another data set (same application) it might be two.
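To put rough numbers on the cache point (the cycle counts below are assumed purely for illustration, not measured), the average access time with hit rate h is approximately

T_avg = h * t_cache + (1 - h) * t_mem

Taking t_cache = 2 cycles and t_mem = 200 cycles, a hit rate of h = 0.95 gives about 11.9 cycles per access, while a drop to h = 0.90 (plausible once each thread effectively sees only half the cache) gives about 21.8 cycles - nearly double - which can easily outweigh the benefit of the second logical processor.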
Patrick
Intel Compiler Lab
Developer Products Division
Thanks Patrick - this is useful info on the HT processor architecture. However, as to whether MKL should be able to determine the optimum number of threads - it would seem to me that in most cases it should be in a better position than the user of the library to make these decisions...
In a lot of cases the MKL functions (BLAS / LAPACK) take dense matrix parameters. Obviously the content of these blocks of memory is down to the user, but their arrangement in memory is defined - each dense matrix is a (potentially large) contiguous section of memory. MKL internally decides whether and how to split this up so that it can employ a blocked algorithm to make effective use of the processor caches. In the same way, it could decide how many threads would be beneficial. In the case of the HT processor, as you describe, it is important to know more than just the number of logical processors - it is important to know which resources they share and which they don't, logic that I would have thought would be well suited to MKL.
In this case, I wonder whether switching OMP_NUM_THREADS from 1 to 2 on HT processors actually causes the block size to become suboptimal? When we have 2 threads, we basically only have half the cache available per thread, so maybe halving the block size could allow it to make the best use of both the cache and the execution core?
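Purely to illustrate the idea (this is not how MKL actually sizes its blocks internally), one could derive a block size from the cache share available per thread, e.g. by requiring that three b-by-b tiles of doubles fit:

```c
/* Illustrative sketch only: size the tile so that three b*b double blocks
   (A, B and C tiles of a blocked multiply) fit in one thread's cache share. */
#include <math.h>
#include <stdio.h>

static int pick_block_size(size_t cache_bytes, int threads_sharing_cache)
{
    size_t per_thread = cache_bytes / (size_t)threads_sharing_cache;
    /* 3 * b * b * sizeof(double) <= per_thread  =>  solve for b */
    return (int)floor(sqrt((double)per_thread / (3.0 * sizeof(double))));
}

int main(void)
{
    size_t l2 = 512 * 1024;            /* e.g. a 512 KB L2, assumed for illustration */
    printf("1 thread : b = %d\n", pick_block_size(l2, 1));
    printf("2 threads: b = %d\n", pick_block_size(l2, 2));  /* shrinks by ~sqrt(2), not 2 */
    return 0;
}
```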
Graham
Can I set the number of threads programmatically instead? And when do I have to do it? What if I change it after calling some MKL functions (or worse, what if the host app that calls me has already called some)? Can I convince MKL to tear down and restart its worker threads?
thanks,
Gary O
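A minimal sketch of setting the thread count from code via the standard OpenMP API is below; whether MKL 9.0 picks up a change made after its first threaded call, or after a host application has already called into MKL, is exactly the open question above and would need testing.

```c
/* Sketch: influence MKL's threading via the standard OpenMP runtime call
   rather than the OMP_NUM_THREADS environment variable. */
#include <omp.h>

void solve_with_n_threads(int nthreads)
{
    omp_set_num_threads(nthreads);   /* affects subsequent parallel regions */
    /* ... call dgels / dgetri / other MKL routines here ... */
}
```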
The answer is actually in the release notes...
Hyper-Threading Technology (HT Technology) is especially effective when each thread is performing different types of operations and when there are under-utilized resources on the processor. Intel MKL fits neither of these criteria as the threaded portions of the library execute at high efficiencies (using most of the available resources) and perform identical operations on each thread. You may obtain higher performance when using Intel MKL without HT Technology enabled.
But still, it would be great to be able to determine dynamically what the best value to use is - one possible approach is sketched below.
Graham
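One possible approach, sketched under the assumption that the application can afford a one-off calibration pass at startup: time a representative MKL call at each candidate thread count and keep the fastest. benchmark_call() below is a placeholder for whatever routine and problem size the application actually uses.

```c
/* Sketch of an empirical thread-count calibration pass. */
#include <omp.h>

extern void benchmark_call(void);   /* placeholder: a representative MKL workload */

int pick_best_thread_count(int max_threads)
{
    int best_n = 1;
    double best_t = -1.0;

    for (int n = 1; n <= max_threads; ++n) {
        omp_set_num_threads(n);
        double t0 = omp_get_wtime();
        benchmark_call();
        double t = omp_get_wtime() - t0;
        if (best_t < 0.0 || t < best_t) {
            best_t = t;
            best_n = n;
        }
    }
    return best_n;
}
```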
