I don't know what help you are hoping for. Efficiency would require a SIMD reduction within each thread. Not much has been published about appropriate style for this. It might be made explicit by using dot_product inside the parallel reduction.
Several current MPI implementations include convenient methods for dividing cores efficiently among ranks. As you appear to have an obscure CPU whose nature you're unwilling to divulge fully, you would need to check the results of the topology discovery of your MPI.
While the new charter of this forum has not been stated, it seems reasonable to expect discussion of some combination of Intel software tools or platforms.
I was thinking of trying to reproduce and optimize this case, but from your point of view such efforts may be meaningless. I'll repeat, though, that if you allow the MPI case more latitude to optimize than the OpenMP one, it can hardly be considered a bug if MPI comes out ahead.
My only available dual-CPU platform is a Westmere, which might support your point of view better than more recent CPUs (due to its unique asymmetry in competition among cores when accessing cache).
It is quite a leap to assume that the difference in performance is due to a "bug" in the OpenMP implementation, and it is more than a bit forced to claim a "bug" when the Intel-compiled OpenMP version is 3.1 times as fast as the gcc-compiled version.
Since you are running on a Linux system, you should at least run the code under the "perf stat" command to see if the jobs are using all the cores. Without explicit process binding (as provided by the KMP_AFFINITY environment variable) Linux schedulers will seldom use all of the cores for OpenMP jobs. It is typical to see 12-13 cores used on a 16-core system without affinity. This is a fault in the Linux scheduler, not in the OpenMP runtime.
Using "mpirun" to launch jobs compiled with different MPI stacks adds another level of uncertainty to the understanding of the results. The logic used to set up the runtime environment is completely different for OpenMPI and Intel MPI, and you need to make sure that this is not responsible for the performance differences.
It is also a very good idea to start with a serial execution (bound to a single core) and verify that the execution time is close to the expected value for the processor under consideration. Assuming that the "f(x)" function is not loading values from memory and the "size" variable is large enough for the body of the loop to take longer than the synchronization operations, this code should be completely compute-bound and the execution time should be inversely proportional to the number of cores used (for any combination of OpenMP threads and/or MPI tasks).
If you'd like more informed comments, give us working source code along with your compile and run-time options. You leave too much guesswork as to what you might have done.
Most posted versions that resemble yours use f(x) as an old-fashioned statement function, which optimizing compilers would inline.
Once again, I don't see the point of trying to avoid SIMD optimization, particularly if you wish your OpenMP to be competitive. Did you use -O0 for gfortran only? Yes, that's a fairly well-known cheat to get maximum threaded performance scaling by setting a low base performance.
There are pitfalls in OpenMP on a NUMA platform if you don't use affinity, some of which John mentioned. One would think that OpenMP tree reduction should be encouraged also to take advantage of affinity.
I'm seeing OpenMP performance 3 times MPI performance, without building for the current instruction set. ifort reports a potential speedup due to SIMD (even the way you report building it) of 1.7, which seems to be realized in practice.
I suppose the MPI is more efficient on your platform than mine. When running 1 rank, there seems little difference in performance between 1 thread per core (with affinity) and 2 per core, but I'm questioning whether MPI is dealing correctly with hyperthreads.
One issue is the possible inefficiency of the code used to convert integer(int64) to real(real64). It's not a fast single instruction as it would be for int32. The divide instruction is also slow and doesn't seem to speed up with AVX2. These seem to be responsible for the low quoted SIMD speedup.
ifort sometimes rejects your redundant inclusion of i in the private list.
I don't have the OpenMPI library for gfortran installed. I suppose gfortran may require some tweaking of options etc. to get the SIMD speedup, but I'm guessing you may have left it at -O0. It's an old question to what extent compiler default differences between ifort and gfortran might be considered bugs. Perhaps you would like ifort better at -O0 (as implied by -g)?
Using the instructions above and compiling with the Intel 15.0.2 compiler and the Intel 5.0.2 MPI libraries, I get a runtime of 22.4x seconds for both the 1-thread/16-task and 16-thread/1-task cases. (Xeon E5-2680 "Sandy Bridge EP", running at the max all-core Turbo frequency of 3.1 GHz.)
Using "perf stat" shows that the number of instructions varies by less than 0.3% between the two cases.
I had to modify the launching of the jobs slightly as my environment does not allow me to directly execute the "mpirun" command. I set KMP_AFFINITY to "verbose,compact" and ran the jobs with the local job launcher:
- 16 Tasks, 1 thread each: export OMP_NUM_THREADS=1; perf stat ibrun ./pi_mpi
- 1 Task, 16 threads: export OMP_NUM_THREADS=16; perf stat ibrun -n 1 -o 0 ./pi_mpi
Running with 1 task and 8 threads bound to one socket resulted in exactly twice the runtime, as expected.