Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Severe performance degradation when CPUs are shared

kambrian
Beginner
When my code runs together with another parallel program, both utilizing all the CPUs on a 64-CPU server, I see severe performance degradation. It can become 20 times slower than when running alone, even if the other program uses only one CPU most of the time.
Any hints about possible problems? The code is too long to post here. It mainly consists of parallel loops with several reduction clauses, plus a critical section and some lightweight single sections that I'm fairly sure do not block the code. I run the loops with OMP_SCHEDULE=dynamic and a chunk size of 1, compiled with the Intel C compiler, and I have enough memory.
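To be concrete, the structure is roughly like the sketch below (the variable names and per-iteration work are placeholders, not the real code):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0, sumsq = 0.0;
    double shared_max = 0.0;               /* updated inside the critical section */

    #pragma omp parallel
    {
        /* OMP_SCHEDULE=dynamic,1 in the environment selects the schedule here */
        #pragma omp for schedule(runtime) reduction(+:sum) reduction(+:sumsq)
        for (int i = 0; i < n; i++) {
            double x = i * 0.001;          /* stand-in for the real per-iteration work */
            sum   += x;
            sumsq += x * x;

            #pragma omp critical
            {
                if (x > shared_max)        /* short critical section guarding a shared update */
                    shared_max = x;
            }
        }

        #pragma omp single
        {
            /* lightweight single section between the loops */
        }
    }

    printf("sum = %g, sumsq = %g, max = %g\n", sum, sumsq, shared_max);
    return 0;
}
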
Thanks in advance,

Jiaxin
piet_de_weer
Beginner
If you're using hyperthreading on your CPUs, you might want to look into the 'Intel 64K cache aliasing' problem (Google that). Turn hyperthreading off and check whether that helps; read on if it does:

http://software.intel.com/en-us/articles/resolve-64k-alias-conflicts-on-hyper-threading-technology-enabled-systems/

In a worst case situation, this problem 'disables' the cache, leading to a huge performance loss.
Dale_S_Intel
Employee
It's hard to say without actually seeing the test case. If it's too big to post the code in a message, you can attach files to the thread. If it's too big for that, you can submit it through premier.intel.com; we'd be interested in looking at it.

Thanks!
Dale
kambrian
Beginner
To piet:
My problem is that when the program occupies the machine alone, it runs well. But when another job is submitted that shares some of the CPUs with my program, the performance degradation appears. Do you think this may be related to the hyperthreading problem?
The machine is an SGI Altix server with Intel Itanium 2 processors, running SUSE Linux.

To Dale:
I'm sorry, but the code is too large to post and depends on a large dataset to run, so posting it probably wouldn't help. I'll test any hints or suggestions you can offer.

I tried OMP_DYNAMIC=TRUE, but it doesn't seem to help; in my tests it only seems to reduce the number of threads by one.
jimdempseyatthecove
Honored Contributor III
>>When my code runs together with another parallel program, both utilizing all the CPUs on a 64-CPU server, I see severe performance degradation. It can become 20 times slower than when running alone, even if the other program uses only one CPU most of the time.

Check whether the two applications are using a shared resource between them. An example of this might be a file.

If one application is running on 64 CPUs and the second app is using one CPU most of the time .AND. you are observing a 20x performance degradation, then this must be more of a resource conflict than a cache conflict (since the mostly one-CPU app presumably affects only 1/64 of the total cache).

Are these applications I/O bound?

Jim Dempsey
TimP
Honored Contributor III
Apparently you are using OpenMP. You could set the affinity and OMP_NUM_THREADS of each job (via the libiomp environment variable KMP_AFFINITY=proclist...) to a set of logical processors on separate sockets. If you don't set OMP_NUM_THREADS, the Intel library defaults to having each job attempt to use all the logical processors, which would account for your problem.
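For example, something along these lines for one of the jobs (the processor numbers are only illustrative; they depend on your machine's numbering):

export OMP_NUM_THREADS=4
export KMP_AFFINITY=proclist=[0,1,2,3],explicit

and a different proclist, e.g. [4,5,6,7], for the other job.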
pbkenned1
Employee
If you are building your OpenMP code with the Intel compiler, you might try setting our environment variable KMP_LIBRARY to select turnaround mode:
export KMP_LIBRARY=turnaround

An alternative way to set 'turnaround' mode is to call kmp_set_library_turnaround(), an Intel OpenMP extension routine.

This is designed for use in dedicated (single-user) parallel environments. The default is 'throughput', which is intended for multi-user environments.

Be aware that in 'turnaround' mode, you may over-subscribe the machine if too few processors are available at run time.
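A minimal sketch of the routine form, assuming the Intel extension prototypes come in through omp.h (the parallel work is a placeholder):

#include <omp.h>

int main(void)
{
    kmp_set_library_turnaround();   /* select turnaround mode before the first parallel region */

    #pragma omp parallel
    {
        /* parallel work; idle threads now spin at barriers instead of yielding */
    }
    return 0;
}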

Patrick Kennedy
Intel Developer Support
Olga_M_Intel
Employee
Quoting kambrian
My problem is that when the program occupies the machine alone, it runs well. But when another job is submitted that shares some of the CPUs with my program, the performance degradation appears. (...) The machine is an SGI Altix server with Intel Itanium 2 processors, running SUSE Linux.
(...)
I tried OMP_DYNAMIC=TRUE, but it doesn't seem to help; in my tests it only seems to reduce the number of threads by one.


Hello!
Intel Compiler 11.1 Update 7 includes an OpenMP library with a new implementation (for Linux* and Windows*) of the OMP_DYNAMIC=TRUE functionality that should help avoid oversubscription of the system and improve performance.
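If it is easier while testing, dynamic adjustment can also be requested from the code with the standard OpenMP routine instead of the environment variable; a minimal sketch:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_dynamic(1);   /* same effect as OMP_DYNAMIC=TRUE: let the runtime adjust team sizes */

    #pragma omp parallel
    {
        #pragma omp single
        printf("team size chosen by the runtime: %d\n", omp_get_num_threads());
    }
    return 0;
}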
We would be really glad if you had a chance to check whether it works for you and to give us feedback. Feel free to ask any questions.

Thanks and regards,
Olga

jimdempseyatthecove
Honored Contributor III
>>When my code runs together with another parallel program, both utilizing all the CPUs on a 64-CPU server, I see severe performance degradation. It can become 20 times slower than when running alone, even if the other program uses only one CPU most of the time.

20x slower is roughly the ratio between L2 cache and RAM access times. Or it could be a coincidence of timing with shared-resource interference between the applications. Are the applications heavy in the I/O department, i.e. experiencing additional seek times?

If your program is compute bound, then what this sounds like is:

The code in question (in both programs) completely evicts the other program's data from the L1, L2, and L3 caches in a short time interval (IOW before the data can be reused from the L2 cache).

This can be aggravated by false sharing, as indicated by the other poster, .OR. by coding characteristics in the program (large-ish memcpy's together with SwitchToThread() during synchronization loops). If you are using SwitchToThread(), try replacing it with _mm_pause().
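A minimal sketch of that substitution, assuming an x86 target where the _mm_pause() intrinsic is available (the flag and the loop are hypothetical):

#include <emmintrin.h>

/* Spin until another thread sets *flag, hinting to the core that this is a
   spin-wait loop rather than yielding to the O/S scheduler. */
static void spin_wait(volatile int *flag)
{
    while (*flag == 0)
        _mm_pause();
}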

If you have not yet resolved the problem, then consider using Tim's suggestion of KMP_AFFINITY=proclist... to isolate the applications by socket. IOW one app uses one socket, the other app uses another socket, and so on. Depending on the total cache requirements of the applications, this may mean each app runs in 2x the time, but not 20x the time.

An alternate route is to not use KMP_AFFINITY, but instead add code to each program to determine the number of such programs running on the system. This determination is made periodically in the serial loop that encapsulates the outermost parallel region(s). Then set omp_set_num_threads(std::max(1, maxThreads / numRunningPrograms)).
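A minimal sketch of that approach in C; count_running_instances() is a hypothetical helper (for example, one that scans the process list for the cooperating program names):

#include <omp.h>

/* Hypothetical helper: how many of the cooperating programs are running right now. */
extern int count_running_instances(void);

void outer_serial_loop(int steps)
{
    const int max_threads = omp_get_max_threads();   /* capture once, before adjusting */

    for (int step = 0; step < steps; step++) {
        int running = count_running_instances();
        int threads = max_threads / (running > 0 ? running : 1);
        if (threads < 1)
            threads = 1;
        omp_set_num_threads(threads);

        #pragma omp parallel
        {
            /* outermost parallel region for this step */
        }
    }
}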

Then hope the O/S optimally places the threads of different apps amongst the various sockets.

Jim Dempsey