Novice

To vectorize, or to MPI for distributed-memory parallelism?

When I need no actual "message passing", should I use SIMD or MPI? Or are they equivalent?

I have many functions and subroutines in modules that I would have to make ELEMENTAL (some of them IMPURE), along with many structural changes, if I am to use SIMD. With MPI, I wouldn't really have to change anything.

The impurity in these procedures mainly comes from calls to RANDOM_NUMBER. But I could pre-generate arrays of random numbers instead.

So will using SIMD be worth the effort?

7 Replies

SIMD means Single Instruction, Multiple Data: a single instruction is executed on one hardware thread of an application. MPI, by contrast, is the Message Passing Interface, a means of communication for distributed applications. These are applications broken into pieces, each piece a separate process, with the separate processes running on one or more systems.

You are fooling yourself if you think you "won't really have to change anything" when converting a single-process application into a multi-process (MPI) application.

Vectorization (use of SIMD instructions) is generally performed transparently by the compiler... provided you write your code and arrange your data favorably for vectorization. Programs written in a manner favorable to SIMD vectorization can at times see a 4x to 10x performance improvement over those that end up using scalar instructions (Single Instruction, Single Data).
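As a hypothetical illustration (the routine and array names are invented), a unit-stride loop over contiguous arrays with no loop-carried dependencies is the kind of code the compiler can vectorize on its own:

```fortran
! Sketch: a unit-stride loop with no dependencies between
! iterations -- a prime candidate for auto-vectorization.
subroutine axpy(n, a, x, y)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: a, x(n)
  real,    intent(inout) :: y(n)
  integer :: i
  do i = 1, n                 ! contiguous access, no branches
    y(i) = y(i) + a * x(i)    ! one SIMD instruction handles several i at once
  end do
end subroutine axpy
```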

You haven't mentioned intra-process parallelization (OpenMP, pthreads, ...).

Best practice usually is:

1) Write code such that the compiler can make use of the SIMD capability of your modern CPUs. The current generation has 4/8 lanes (double/float); the next generation will have 8/16 lanes.
2) Use intra-process parallelization, such as OpenMP.
3) Then, if needed, use inter-process parallelization, such as MPI.
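Steps 1 and 2 can be combined in one place. A minimal sketch (routine and array names invented) of SIMD lanes within each thread plus OpenMP threads within the process:

```fortran
! Sketch: OpenMP threads split the iteration space, and each
! thread's chunk is additionally vectorized (OpenMP 4.0+).
subroutine scale_add(n, x, y)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: x(n)
  real,    intent(inout) :: y(n)
  integer :: i
  !$omp parallel do simd
  do i = 1, n
    y(i) = y(i) + 2.0 * x(i)
  end do
  !$omp end parallel do simd
end subroutine scale_add
```

MPI (step 3) would then sit above this, distributing whole subdomains across processes.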

Jim Dempsey

Novice

@Jim, thanks for the clear advice on parallelization hierarchy; I will follow that from now on.

But I think you misunderstood the context of my question. The situation is one where SPMD (Single Program, Multiple Data) is possible. In such a case, do I convert all my variables into arrays of one dimension higher and all my procedures into ELEMENTAL ones, or do I fall back on very primitive SPMD MPI?

Since SIMD is the first priority, I assume that it is the best choice. Is that true?


The Fortran compiler is fully capable of vectorizing (SIMD-izing) multi-dimension arrays.

Vectorization is more effective with a Structure of Arrays (SoA) data layout than with an Array of Structures (AoS). Structure of Arrays is typically not taught in introductory CS classes (unless the class targets high-performance computing); Array of Structures is, because it aligns with the abstraction of the model: one thinks in terms of an array of objects (particles) rather than arrays of the same property across different objects (particles). Although it is a little more programming work to code in SoA format, the payback comes in reduced runtimes (you code once; your program may run millions of iterations).
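A minimal sketch of the two layouts (the type names, components, and sizes are invented for illustration):

```fortran
! Array of Structures (AoS): accessing p(i)%x strides over the
! whole derived type in memory -- unfriendly to vectorization.
type :: particle
  real :: x, y, z
end type particle
type(particle) :: p(1000)

! Structure of Arrays (SoA): each component is its own
! contiguous array, so a loop over ps%x is unit-stride
! and vectorizes well.
type :: particle_set
  real :: x(1000), y(1000), z(1000)
end type particle_set
type(particle_set) :: ps
```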

Jim Dempsey


Describe (sketch) your application and how the data are organized. If you can run VTune or another performance-analysis program, identify the hot spots and show (part of) the code, and possibly how the function/subroutine is called. This will help readers of your posts offer better advice.

Jim Dempsey

Novice

My code is a Markov Chain Monte Carlo simulation that uses the Metropolis-Hastings or Glauber algorithms. I have a system, which I attempt to modify at each Monte Carlo step.

The 'Markov Chain' part means that every step depends on the previous one, so I can't trivially parallelize (although locality of effects affords the possibility). The specific algorithm just specifies the rule set that determines whether or not each proposed change is accepted.

Top Hotspots:

    Function                          Module   CPU Time
    for_random_number_single          a.out    0.010s
    for__acquire_semaphore_threaded   a.out    0.010s
    montecarlo_mp_metropolis_         a.out    0.010s

I ran VTune, as you suggested, and found that the biggest hotspot is a random-number call. If I can SIMDize this somehow, I think it should improve things, i.e., get the compiler to use for_simd_random_number_single instead of for_random_number_single.

My first attempt was to turn my MonteCarlo function into an IMPURE ELEMENTAL one, so that it accepted arrays of inputs. But upon checking the resulting .o with nm, I saw that it still uses the non-SIMD version of the random number call. So it seems I will have to turn it into a subroutine. Does that sound right?


There are a number of random number generators to choose from. Some are parallel capable, some are not. You may require reproducibility or you may not. If the computation section is somewhat more complex than the random number generation, you may be able to use a good serial random number generator to generate batches of numbers where the current completed batch can be used in the simulation while you generate the next batch (IOW double-buffer-like).
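A sketch of that double-buffer idea (batch size, variable names, and the consume routine are all invented). Note that in this serial form the batches alternate rather than truly overlap; genuine overlap would need the refill to run on another thread, e.g. in an OpenMP task:

```fortran
! Sketch: alternate between two pre-filled batches of random
! numbers so the simulation never waits mid-step for the generator.
real    :: batch_a(100000), batch_b(100000)
logical :: use_a = .true.
integer :: step

call random_number(batch_a)        ! fill the first batch up front
do step = 1, nsteps
  if (use_a) then
    call consume(batch_a)          ! simulate using batch A ...
    call random_number(batch_b)    ! ... then refill batch B
  else
    call consume(batch_b)
    call random_number(batch_a)
  end if
  use_a = .not. use_a
end do
```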

As for a SIMD versus non-SIMD generator, this depends on your requirements for the numbers generated. It is relatively easy to use generator 'X' with each lane of the SIMD vector starting from a different seed; however, this may or may not fit your requirements. You may or may not require deterministic results (the same results regardless of the number of threads employed).

You need to specify your requirements before you choose a random number generator.

Jim Dempsey

Employee

Note that if you simply call the Fortran standard random number generator RANDOM_NUMBER() with an array argument, the random number generation will be vectorized at -O2 and above.
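In other words, a single array-valued call replaces an element-by-element loop (the array name and size here are illustrative):

```fortran
real :: r(4096)

! One array-valued call: the runtime can fill the whole
! array with its vectorized generator.
call random_number(r)

! Instead of the scalar form, one element per call:
! do i = 1, size(r)
!   call random_number(r(i))
! end do
```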

The Intel Math Kernel Library contains a variety of high quality random number generators including ones that are suitable for a threaded environment.
