Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

OpenMP calls with newer multi core/thread CPUs

Cable__Vaughn
Beginner
887 Views

My new workstation runs RH Release 6.6 with Linux kernel 2.6.32-504.el6.x86_64, with a Xeon CPU & 64 GB RAM, etc.  Nothing else is different; however, my Fortran routines with OpenMP calls to the BLAS lib (from Intel Libs) no longer maximize core/thread usage (it looks like core/thread swapping has gone "nutz") unless I set the number of threads to 1 in my execution script.  Have you seen this behavior with upgrades to the multicore/multithread CPUs?

7 Replies
TimP
Honored Contributor III

The MKL threaded library defaults to one thread per physical core.  If you call it inside an omp parallel region with OMP_NESTED set, this would over-subscribe the cores, as well as break the workings of OMP_PROC_BIND or KMP_AFFINITY.  So there are a lot of variables to consider.  If you have MKL at its single-thread setting, called inside your omp parallel region, your performance may still depend on setting affinity.  The MKL articles on software.intel.com as well as the MKL forum may be useful references for you.
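As a concrete starting point, those settings translate into environment variables along these lines (a sketch only; the thread count, the affinity string, and the executable name are assumptions that depend on your code and CPU):

```shell
# Run MKL single-threaded inside your own OpenMP parallel regions, and pin
# the OpenMP threads so they stop migrating between cores.
export MKL_NUM_THREADS=1                 # MKL works on the calling thread only
export OMP_NUM_THREADS=8                 # example: one thread per physical core
export KMP_AFFINITY="granularity=fine,compact,1,0"   # Intel OpenMP pinning
# ./your_solver                          # placeholder for the actual executable
echo "MKL=$MKL_NUM_THREADS OMP=$OMP_NUM_THREADS"
```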
 

Cable__Vaughn
Beginner

Tim, thank you for your response.  The OMP calls I use are buried in the CBLAS lib that MKL uses.  Exactly the same code & data that fail now worked in the past; the only difference is that I upgraded my computer from a 6-core Xeon to an 8-core Xeon CPU.  A difference in how the threads are handled might be the culprit, but I really don't know.  In any case, the OMP code in the CBLAS lib routine CGESV doesn't work any more on the newer multicore CPUs.  OR, & that's a big OR, I'm upping my meds...

John_D_6
New Contributor I

Hi Vaughn,

what do you mean by 'nutz'? Does it hang? We have seen hangs on some of our systems with Haswell CPUs. It happens only on nodes with a combination of Haswell CPUs, a newer kernel (I think pretty much from the version you mentioned) and a multi-threaded code. The cases where we have seen this are the Linpack benchmark and a hybrid MPI-OpenMP code. The code seems to wait for a lock that is never released. We do not yet have a fix or a more detailed explanation for this problem.

We noticed that the application would continue if you 'nudged' it with the command 'pstack [pid]'. Is that the case for you as well?

Steven_L_Intel1
Employee

We've seen some reports of issues with Haswell systems that haven't had the firmware fix to disable the TSX instructions (which have known issues on Haswell). Apparently some newer Linux distributions can use TSX instructions for thread synchronization, and things don't always work right. Check with your system vendor to see whether you are up to date on firmware, or run the Intel Processor Identification Utility to see if the TSX instructions are enabled.
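On Linux you can also get a quick read on whether TSX is still active by looking for the 'rtm' (and 'hle') CPU flags, which the kernel reports in /proc/cpuinfo (a rough check; a firmware/microcode update that disables TSX makes the flags disappear):

```shell
# TSX shows up as the 'rtm' and 'hle' flags in /proc/cpuinfo; if the BIOS
# or microcode update disabled TSX, the flags are no longer listed.
if grep -qw rtm /proc/cpuinfo 2>/dev/null; then
    tsx_status="TSX (RTM) reported by the kernel"
else
    tsx_status="TSX not reported (disabled or unsupported)"
fi
echo "$tsx_status"
```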

Martyn_C_Intel
Employee

You should also make sure to set KMP_AFFINITY, as Tim implied. That won't solve the sort of thing Steve described, but it might help, particularly if your new system has hyperthreading enabled. KMP_AFFINITY=scatter may not be the optimal setting, but it's a simple starting point. Perhaps also try setting OMP_NUM_THREADS=8 (the number of physical cores).
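In script form, that starting point looks like this (the thread count of 8 is an assumption matching the 8-core Xeon mentioned above, and the executable name is a placeholder):

```shell
# Spread OpenMP threads across physical cores and match the thread count
# to the core count; 'verbose' prints the resulting thread-to-core map.
export KMP_AFFINITY="verbose,scatter"
export OMP_NUM_THREADS=8
# ./your_solver                # placeholder for the actual executable
echo "$KMP_AFFINITY / $OMP_NUM_THREADS"
```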

Cable__Vaughn
Beginner

Thank you, Martyn & all.  I appreciate your help with this.  I haven't had a chance to work on the problem consistently, but I'm trying your suggestions as I get the chance.  The last suggestion was from Jim Dempsey, who wrote:

In the Fortran code you list above add

if (OMP_IN_PARALLEL()) then
    STOP "Break here, you shouldn't be in parallel region here without saying so to me"
endif

Instead of using export OMP_NUM_THREADS=nn, use export MKL_NUM_THREADS=nn

See what happens.

 

I tried that just now & the compiler choked, saying "OMP_IN_PARALLEL()" was the wrong logical type in the if context, but I never got around to trying some "logical" fixes like if(OMP_IN_PARALLEL.eq.1) or if(OMP_IN_PARALLEL().eq.TRUE).  That's the extent of my FORTRAN knowledge, pretty much.  Anyway, I haven't yet seen whether MKL_NUM_THREADS=nn makes a diff.  With your latest suggestion, Martyn, I have a few more things to try & I hope one of the tests reveals something that gets me back up & running big matrix solves.

  BTW, the attachment is a shot of the Fedora system monitor showing CPU history immediately after the code enters the CBLAS matrix-solve subroutine (with OMP calls).  Note the chaotic ("nutz") core/thread switching that takes place.  This particular case was compiled & run on an i7-4930K CPU with Fedora 20 (64-bit).  Something similar, but not exactly the same, happens when the same code is compiled & run on a new Xeon with the latest Red Hat Enterprise (64-bit).

  -- Vaughn
jimdempseyatthecove
Honored Contributor III

OMP_IN_PARALLEL() is a LOGICAL function. It returns .TRUE. if called within the dynamic extent of a parallel region executing in parallel; otherwise it returns .FALSE.

*** However, the routine in which it is used must also have USE OMP_LIB. The error message indicates you forgot USE OMP_LIB.

And it also implies you are not using IMPLICIT NONE (which would have caught this error).

Jim Dempsey
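Both of Jim's points can also be caught at compile time. With the Intel compiler, a build line along these lines would flag any implicitly typed name (a sketch under the assumption you build with ifort; the source file name is a placeholder):

```shell
# Hypothetical build line: -warn declarations reports implicitly typed names,
# which would have caught OMP_IN_PARALLEL defaulting to REAL without USE OMP_LIB.
FC=ifort
FFLAGS="-qopenmp -warn declarations"
# $FC $FFLAGS mysolver.f90 -o mysolver   # commented out: ifort not assumed here
echo "$FFLAGS"
```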
