Beginner

hybrid MPI+OpenMP

I am running a hybrid MPI/OpenMP code:

call MPI_Init( ierr )
call MPI_Comm_rank( MPI_COMM_WORLD, rank, ierr )
call MPI_Comm_size( MPI_COMM_WORLD, size, ierr )

t1 = MPI_Wtime( )

!$omp parallel do private(i, x) reduction(+ : pi_partial)
do i = rank, N-1, size
   x = (dble( i ) + 0.5_DP) * dx
   pi_partial = pi_partial + f( x )
end do
pi_partial = pi_partial * dx

call MPI_Reduce( pi_partial, pi_estimate, 1, MPI_DOUBLE, MPI_SUM, 0, &
                 MPI_COMM_WORLD, ierr )

pi_diff = dabs( pi_exact - pi_estimate )

t2 = MPI_Wtime( )
dt = t2 - t1

if ( rank == 0 ) then
   print *, 'size', size * num_omp_threads
   print *, 'pi_exact', pi_exact
   print *, 'pi_estimate', pi_estimate
   print *, 'pi_diff', pi_diff
   print *, 'time (seconds)', dt
end if

call MPI_Finalize( ierr )

When I run it with 16 MPI ranks and 1 OpenMP thread, it runs in 14.8229651451 seconds. When I run it with 1 MPI rank and 16 OpenMP threads, it runs in 22.7322349548 seconds. The node has 32 cores with hyper-threading on, but I only use logical CPUs 0 to 15, which are physical cores:

[cn009 hybrid-fortran]$ lscpu
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31

Any help will be greatly appreciated. Thanks in advance.
11 Replies
Black Belt

I don't know what help you are hoping for. Efficiency would require a SIMD reduction in each thread; not much has been published about the appropriate style, but it could be made explicit with dot_product inside the parallel reduction.

Several current MPI implementations include convenient methods for dividing cores efficiently among ranks. As you appear to have an obscure CPU whose nature you're unwilling to divulge fully, you would need to check the results of your MPI's topology discovery.

While the new charter of this forum has not been stated, it seems reasonable to expect discussion of some combination of Intel software tools or platforms.

Beginner

Hi Tim,

Thanks for the reply. The CPU is an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz:

[cn6008 hybrid-fortran]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2601.000
BogoMIPS:              5186.76
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

With openmpi/1.6.5 and gcc/4.9.1, the OpenMP version is slightly quicker, as expected:

$ OMP_NUM_THREADS=1 mpirun -n 16 ./pi_mpi
time (seconds) 78.108803987503052
$ OMP_NUM_THREADS=16 mpirun -n 1 ./pi_mpi
time (seconds) 74.039918899536133

But with intelmpi/5.0.3 and intel/15.0.2, the OpenMP version takes much longer:

$ OMP_NUM_THREADS=1 mpirun -n 16 ./pi_mpi
time (seconds) 15.3900067806244
$ OMP_NUM_THREADS=16 mpirun -n 1 ./pi_mpi
time (seconds) 23.6259081363678

This leads me to think that there is a bug in the Intel OpenMP runtime library.
Black Belt

To the contrary: if you have set gfortran options that make it so slow, it's not surprising that OpenMP would be slightly faster. An interesting comparison would be 2 ranks of 8 threads each, one rank per CPU, with SIMD optimization.
Beginner

For embarrassingly parallel code, I think OpenMP should be at least comparable to MPI, but what I am seeing here is that OpenMP is much slower even at this small scale (only 16 cores). I do not want to use SIMD optimisation because I want to compare the performance with MPI. I am convinced that this is a bug in the OpenMP runtime library because the performance difference is so large.
Black Belt

I was thinking of trying to reproduce and optimize this case, but given your point of view, such an effort may be pointless. I'll repeat, though: if you allow the MPI case more latitude to optimize than the OpenMP one, it can hardly be considered a bug when MPI comes out ahead.

My only available dual-CPU platform is a Westmere, which might support your point of view better than more recent CPUs (due to its unique asymmetry in competition among cores when accessing cache).

Black Belt

It is quite a leap to assume that the difference in performance is due to a "bug" in the OpenMP implementation, and it is more than a bit forced to claim a "bug" when the Intel-compiled OpenMP version is 3.1 times as fast as the gcc-compiled version.  

Since you are running on a Linux system, you should at least run the code under the "perf stat" command to see if the jobs are using all the cores.  Without explicit process binding (as provided by the KMP_AFFINITY environment variable) Linux schedulers will seldom use all of the cores for OpenMP jobs.  It is typical to see 12-13 cores used on a 16-core system without affinity.  This is a fault in the Linux scheduler, not in the OpenMP runtime.  

Using "mpirun" to launch jobs compiled with different MPI stacks adds another level of uncertainty to the understanding of the results.  The logic used to set up the runtime environment is completely different for OpenMPI and Intel MPI, and you need to make sure that this is not responsible for the performance differences.

It is also a very good idea to start with a serial execution (bound to a single core) and verify that the execution time is close to the expected value for the processor under consideration.  Assuming that the "f(x)" function is not loading values from memory and the "size" variable is large enough for the body of the loop to take longer than the synchronization operations, this code should be completely compute-bound and the execution time should decrease linearly with the number of cores used (for any combination of OpenMP threads and/or MPI tasks). 

"Dr. Bandwidth"
Beginner

When I print the thread pinning strategy, the MPI ranks and OpenMP threads are using CPU cores 0-15, so no process/thread is oversubscribing the CPU cores. The point I am trying to make is that OpenMP should at least match MPI for embarrassingly parallel code such as the one I am using, if not be even better. This is because thread synchronisation has a lower overhead than process synchronisation. What would be good to see is if you can re-create this situation because the performance of 2, 4, 8 threads/processes is nearly identical but diverges significantly for 16 threads/processes. Please see the attached graph for the discontinuity in the performance from 8 to 16 CPU cores. If you think the function f(x) is causing the performance degradation, then you can inline it. I doubt this is the case as f(x) is so small and is hardly doing anything!
Black Belt

If you'd like more informed comments, give us working source code along with your compile and runtime options. You leave too much guesswork as to what you might have done.

Most posted versions similar to yours use f(x) as an old-fashioned statement function, which would be inlined by optimizing compilers.

Once again, I don't see the point of trying to avoid SIMD optimization, particularly if you wish your OpenMP version to be competitive. Did you use -O0 for gfortran only? That is a fairly well-known cheat to get maximum threaded performance scaling by setting a low baseline.

There are pitfalls in OpenMP on a NUMA platform if you don't use affinity, some of which John mentioned. One would think an OpenMP tree reduction should also be encouraged, to take advantage of affinity.

Beginner

Hi,

Please find the source code attached. The build instructions are:

1. Put both files in the same directory;
2. ifort -c types_mod.F90
3. mpiifort -c -I. -openmp pi_mpi.F90 -o pi_mpi.o
4. mpiifort -openmp pi_mpi.o -o pi_mpi

Let me know if you need further information.

Regards,
Black Belt

I'm seeing OpenMP performance three times MPI performance, without even building for the current instruction set. ifort reports a potential SIMD speedup of 1.7 (even built the way you report), which seems to be realized in practice.

I suppose the MPI is more efficient on your platform than mine.  When running 1 rank, there seems little difference in performance between 1 thread per core (with affinity) and 2 per core, but I'm questioning whether MPI is dealing correctly with hyperthreads.

One possible issue is inefficiency of the code used to convert integer(int64) to real(real64): it is not a fast single instruction, as it would be for int32. The divide instruction is also slow and doesn't seem to speed up with AVX2. These seem to be responsible for the low quoted SIMD speedup.

ifort sometimes rejects your redundant inclusion of i in the private list.

I don't have the OpenMPI library for gfortran installed. I suppose gfortran may require some tweaking of options to get the SIMD speedup, but I'm guessing you may have left it at -O0. It's an old question to what extent compiler default differences between ifort and gfortran might be considered bugs. Perhaps you would like ifort better at -O0 (as implied by -g)?

Black Belt

Using the instructions above and compiling with the Intel 15.0.2 compiler and the Intel 5.0.2 MPI libraries, I get a runtime of 22.4x seconds for both the 1-thread/16-task and 16-thread/1-task cases.  (Xeon E5-2680 "Sandy Bridge EP", running at the max all-core Turbo frequency of 3.1 GHz.)

Using "perf stat" shows that the number of instructions varies by less than 0.3% between the two cases.

I had to modify the launching of the jobs slightly as my environment does not allow me to directly execute the "mpirun" command.    I set KMP_AFFINITY to "verbose,compact" and ran the jobs with the local job launcher:

  • 16 tasks, 1 thread each: export OMP_NUM_THREADS=1; perf stat ibrun ./pi_mpi
  • 1 task, 16 threads: export OMP_NUM_THREADS=16; perf stat ibrun -n 1 -o 0 ./pi_mpi

Running with 1 task and 8 threads bound to one socket resulted in exactly twice the runtime, as expected.

 

"Dr. Bandwidth"