Intel® Fortran Compiler

Poor Multithreaded parallelization efficiency

Hanyou_Chu
Beginner
I have a package (dating back to before MS VC supported OpenMP) in which I do coarse-grained multithreaded parallelization. The core computations rely on LAPACK and BLAS. When I compile LAPACK and BLAS with the 64-bit Fortran compiler, the parallelization efficiency becomes very poor. I have a Core 2 based system with 8 cores. The efficiency scales very nicely with 2 or 3 threads, but with 8 threads the computation is not much faster than with a single thread. However, if I use MKL it scales very nicely. I also have a switch to use MPI, with which everything works as expected.

This kind of problem has occurred to me in the past. The first time I encountered it was a few years ago, when I was experimenting with MKL 5.0 on a Pentium Xeon system with 4 CPUs. Later on, I noticed that occasionally, if I turned on certain optimizations with the Fortran compiler for LAPACK and BLAS (I vaguely remember it was related to SSE2 or something), the same problem appeared. To avoid it I simply used the default optimization flags. But with the 64-bit compiler I can't make the problem go away. The odd thing is that a few routines I have modified from LAPACK don't seem to be affected.

Why would I not simply use MKL? I have had crashes with MKL that may be related to how I use the threads. I am not using a thread pool, and I am having trouble releasing the memory even when I call MKL_Release_Buffer() or something before the threads end.
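For what it's worth, the cleanup I attempt at the end of each worker thread looks roughly like the sketch below. I am assuming here that the routine I half-remember is MKL_FREE_BUFFERS from MKL's memory management routines; the name and placement are illustrative rather than my exact code.

subroutine worker_cleanup()
    ! Sketch only: ask MKL to release the internal buffers it allocated,
    ! called just before a worker thread ends. Assumes MKL's service
    ! routine MKL_FREE_BUFFERS; not my exact code.
    implicit none
    external :: mkl_free_buffers
    call mkl_free_buffers()
end subroutine worker_cleanup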

My question is: what is preventing efficient parallelization in such a multithreaded environment?


11 Replies
TimP
Honored Contributor III
I don't think Visual Studio supported 64-bit prior to VS2005, which has its own OpenMP. If it matters which VS you mean, you haven't given enough information to tell. With ifort or MKL you aren't using the VS OpenMP anyway, so this appears irrelevant.
It's certainly likely that you don't get as good multi-thread scaling of SSE vectorized code as of non-vector code; since ifort 11.0, vectorization is on by default. One of the most effective routes to a high threading scale factor is to minimize single-thread performance, since a slower baseline makes the apparent speedup look better. Also, the better your single-thread performance on a platform with multiple caches, the more important it is to set appropriate affinity (e.g. KMP_AFFINITY=compact for ifort on a non-HT platform).
No doubt people have refrained from offering suggestions in view of the near-total uncertainty about what you are looking for.
Hanyou_Chu
Beginner
I started with 64-bit code using ifort 10. The problem persists with 11.xx.

I thought the problem had something to do with SSE. Here is what I don't understand: the problem dates back to before we had multiple cores and HT. Don't the current MKL versions (10 and 11) also use SSE? Why does it work well with multiple processes?
TimP
Honored Contributor III
In case you're asking about HT:
MKL recognizes HT and sets 1 thread per core by default, since a single thread uses all the SSE resources of a core efficiently. MKL also chooses the number of threads according to the problem size. On a single-socket Core i7 you would not need KMP_AFFINITY for MKL, regardless of whether HT is enabled or disabled.
According to the latest 11.1 documentation, the way to get 1 thread per core with OpenMP (if you don't disable HT in the BIOS setup) is to set KMP_AFFINITY according to the BIOS numbering of your CPUs, and to set OMP_NUM_THREADS to the number of cores.
If your platform uses logical processor numbers [0,1] for core 0, [2,3] for core 1, and so on, you would
set KMP_AFFINITY=scatter,1,verbose
or some other setting that uses logical 0 or 1, 2 or 3, 4 or 5, ..., with the verbose option giving you a screen report of what was done.
If your platform's BIOS numbering runs sequentially through one logical processor per core and then repeats with the other logical processor per core,
set KMP_AFFINITY=scatter,0,verbose (or equivalent)
There was an option KMP_AFFINITY=physical,0,verbose which worked with certain BIOS numbering schemes in ifort 11.1. If you try it, read the verbose output to see whether it worked correctly on your platform.
On a single-socket Core i7 with the Windows 7 scheduler, you could hope for good performance with HT enabled or disabled, up to a number of threads equal to the number of cores, without needing an affinity setting. Before the Windows 7 scheduler, you would depend on KMP_AFFINITY when HT is enabled.

By now you can see that generalities about OpenMP scaling don't cover the wide variety of applications, Windows versions, and CPU families.
Hanyou_Chu
Beginner
Thanks a lot for your explanations. I must admit that I haven't done much homework on my system; I don't even know whether it has HT. It's an E5440. I had never heard of KMP_AFFINITY before. I am not using OpenMP, since I didn't feel like changing the code at the moment. I will do some homework to understand what you said and will let you know the results.
TimP
Honored Contributor III
No, the E5440 doesn't have HT, but I assume the system has 2 sockets, each with a split cache, so it could benefit from setting KMP_AFFINITY, most likely
set KMP_AFFINITY=compact[,0,verbose]
if you are using all cores and the (default) OpenMP static schedule.
I got the impression from the earlier posts that you were evaluating threaded scaling under OpenMP. If that's not the case, answering your questions is all guesswork.
Hanyou_Chu
Beginner
KMP_AFFINITY is related to OpenMP, so I am out of luck there. I use the Win API SetThreadAffinityMask, and it makes no difference. I could not find a compiler option to disable SSE, so I am basically stuck with MPI, which parallelizes better even with MKL (which contradicts intuition).
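For reference, the affinity call I mean looks roughly like the sketch below. This is not my exact code; it assumes the GetCurrentThread and SetThreadAffinityMask interfaces that Intel Visual Fortran's IFWIN module declares.

subroutine pin_current_thread(cpu)
    ! Sketch only, not my exact code: pin the calling thread to one logical CPU,
    ! assuming IFWIN provides GetCurrentThread and SetThreadAffinityMask.
    use ifwin
    implicit none
    integer, intent(in)     :: cpu              ! zero-based logical CPU index
    integer(HANDLE)         :: thread
    integer(INT_PTR_KIND()) :: mask, old_mask
    thread   = GetCurrentThread()               ! pseudo-handle for the calling thread
    mask     = ishft(int(1, kind(mask)), cpu)   ! one bit per logical CPU
    old_mask = SetThreadAffinityMask(thread, mask)
    if (old_mask == 0) print *, 'SetThreadAffinityMask failed for CPU', cpu
end subroutine pin_current_thread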

Is there a way to disable SSE in future releases? I remember that setting the compiler option to use SSE for LAPACK and BLAS never gave me any performance boost. In addition, we are encountering repeatability issues associated with SSE optimizations in our 32-bit code, which is our main product. MKL is clearly out of the question, so we are also stuck with earlier compiler versions such as ifort 10.

Steven_L_Intel1
Employee
I would expect SSE to give you more repeatability, not less. You can use /arch:IA32 for 11.0 and later to disable SSE. SSE is not the default in earlier versions.
Hanyou_Chu
Beginner
Thanks Steve. /arch:IA32 indeed solves the repeatability issues.

Here is the link I found that explains the repeatability issues related to SSE:
http://software.intel.com/en-us/forums/showthread.php?t=57337
What I observed is consistent with what I read there, since I never saw any repeatability issues with the 64-bit code.
Steven_L_Intel1
Employee
Ah, it's data alignment that is the issue here.
Hanyou_Chu
Beginner
It has caused a lot of pain for me and many hours of wasted effort. I thought there were bugs in my program; I only found out a couple of days ago that it is related to SSE. Is there a way to tell either ifort or the Intel C++ compiler to allocate data on a 16-byte boundary? On the C++ side, I should be able to handle it myself.
Steven_L_Intel1
Employee

Yes, the compiler supports !DEC$ ATTRIBUTES ALIGN for variables, including allocatable arrays.
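For example, something along these lines (an illustrative sketch; the names and size are made up):

subroutine demo(n)
    implicit none
    integer, intent(in)  :: n
    real(8), allocatable :: a(:)
    !DEC$ ATTRIBUTES ALIGN : 16 :: a   ! request 16-byte alignment for A's data
    allocate (a(n))                    ! A's data should now start on a 16-byte boundary
    a = 0.0d0
    deallocate (a)
end subroutine demo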