Software Archive
Read-only legacy content

Fortran OpenMP on Intel Xeon Phi

Antonio_R_1
Beginner
448 Views

 

My questions are very simple. We have Intel Visual Fortran 2015 and Fortran subroutines parallelized with OpenMP directives. Will the compiled code be capable of using all the available threads on the Intel Xeon Phi? Would the code need to be modified to make it compliant with these new processors?

Our intention is to use already-parallelized code on Intel Xeon Phi or a similar MIC processor. Any suggestions or links on how to do this?

Thanks,

6 Replies
Frances_R_Intel
Employee

The answer is: it depends. If your code is already using OpenMP, the first question is how many threads your code can use productively. Are there any barriers or synchronization points that might not cause problems with only a few threads but that play havoc with your code's performance when you are running a couple hundred threads? The next question is whether your code vectorizes. The long vector registers account for a significant part of the performance of the Intel Xeon Phi coprocessor. Finally, does your code compile under Linux? The coprocessor runs only Linux. Using the offload model of programming, you can run part of your program on the host under either Windows or Linux, but the part of your program that runs on the coprocessor will be running under Linux.

If your code uses a standard form of parallelization such as OpenMP, scales well as you increase the number of threads, and vectorizes, you are in pretty good shape. If you are looking for the absolute best performance, you will want to tweak the code, particularly to improve memory and cache behavior. But aside from tuning, as long as your code compiles under Linux and you did not use any Intel-specific intrinsics (I am not talking about the Fortran standard intrinsics - those are fine), the odds are good that your code will compile for and run on the coprocessor without significant changes.

For more information, see: https://software.intel.com/en-us/articles/is-intelr-xeon-phitm-coprocessor-right-for-you
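As a minimal sketch (not from the original posts - the subroutine name and shapes are made up for illustration), the kind of loop that does well on the coprocessor is one that is both threaded across cores and vectorizable within each thread:

```fortran
! Hypothetical example: an OpenMP loop whose inner work vectorizes.
! The runtime splits iterations across threads, and the compiler can
! use the wide vector registers for the arithmetic in each iteration.
! The combined "parallel do simd" construct requires OpenMP 4.0.
subroutine saxpy_omp(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i

  !$omp parallel do simd
  do i = 1, n
     y(i) = a * x(i) + y(i)
  end do
  !$omp end parallel do simd
end subroutine saxpy_omp
```

Compiling with `-qopenmp` (or `/Qopenmp` on Windows) enables the directives; the compiler's vectorization report can confirm whether the inner loop actually vectorized.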

TimP
Honored Contributor III

Running an existing OpenMP application in MIC native mode is normally relatively easy. It's entirely normal to run identical hybrid MPI/OpenMP source code on host and coprocessor under MPI_THREAD_FUNNELED, adjusting the numbers of ranks and threads to balance and optimize performance.

I don't know what you have in mind when you say "all available threads." It's not unusual to see peak performance with 2 or 3 threads per core on one or two fewer than the total number of cores, and that should not disappoint you: 2 threads per core are sufficient for 90% of peak VPU performance. Depending on your application, available RAM, cache, and stack may keep you from going beyond the optimum num_threads.
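As a hedged illustration of how this is usually handled, the thread count is left to the environment rather than hard-coded, so the same binary can be tuned per machine. A tiny probe program (the name is made up) shows what the runtime will actually use:

```fortran
! Report how many threads the OpenMP runtime will actually use. On a
! coprocessor this is typically controlled via OMP_NUM_THREADS and an
! affinity setting such as KMP_AFFINITY rather than in the source.
program thread_probe
  use omp_lib
  implicit none
  !$omp parallel
  !$omp single
  print '(a,i0)', 'running with threads: ', omp_get_num_threads()
  !$omp end single
  !$omp end parallel
end program thread_probe
```

For example, on a hypothetical 60-core coprocessor, setting `OMP_NUM_THREADS=118` with `KMP_AFFINITY=balanced` would place two threads on each of 59 cores, leaving one core free - in line with stopping one or two cores short of the total.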

As Frances indicated, vectorization is even more important on the coprocessor for getting full performance.

Antonio_R_1
Beginner

Frances and Tim, thank you for your helpful comments.

I do not think that barriers or synchronization points in my code would damage performance significantly as the number of threads increases, but I cannot really know at this point. The code is functional on Windows and, using four cores (and 4 threads), the entire simulation speeds up by about 2.7x. I am satisfied with that, taking into account that only a small fraction of the code has been parallelized. My idea is to continue parallelizing other subroutines and find out how much I can improve performance by using more threads. Some key subroutines are amenable to parallelization, so I think there is room for improvement. I will need to look into vectorization and how to run the code on Linux for the Intel Xeon Phi option.

Maybe, for my purposes, it would be a wiser step to use another HPC processor/coprocessor that does not require running on Linux, since I am not familiar with it. My knowledge of this topic is quite limited, so any suggestion as to which processor to use would be very much appreciated. For example, would an 8-core (16-thread) processor be a good intermediate step to see how the code performs? Any other alternative?

Antonio

TimP
Honored Contributor III

The pros and cons of using Intel(r) Xeon Phi(tm) with a Windows vs. Linux host aren't big stumbling blocks, in my opinion. With Windows, there is no MPI support for the coprocessor, but programming OpenMP for the coprocessor should look just like targeting Windows. The emphasis on the offload model does complicate things; if you can use native MIC execution, that allows simpler, plain OpenMP. Anyway, one of the advantages of OpenMP is that the source code needn't change between Windows and Linux: you are insulated from changes in the underlying threading model (although both that and the target hardware characteristics may affect the performance issues you mention).

As you increase the number of threads, you can expect to need effective parallelization of more functions in your application. Making it work well on the Phi should solve most of the problems of running on smaller numbers of cores.

Many applications don't benefit from hyperthreading on the host, and you can save wasted effort by sticking to 1 thread per core. You can't count on hyperthreading as a way to get valid experience with more threads, even to the extent you would by using more than the optimum number of threads on the coprocessor.

James_C_Intel2
Employee

"The code is functional on Windows and, using four cores (and 4 threads), the entire simulation speeds up by about 2.7x."

Let's do a bit of Amdahl analysis.
Taking B as the serial proportion, the speedup at n threads is S(n) = n/(B(n-1)+1).

You say that S(4) = 2.7, which gives B ~= 0.16, so the maximum speedup you can ever get with your current code (on an infinite number of threads) is 1/0.16 ~= 6.
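The arithmetic can be checked with a throwaway program (a sketch for illustration, not anyone's production code) by solving Amdahl's law for B:

```fortran
! Solve Amdahl's law S(n) = n / (B*(n-1) + 1) for the serial
! fraction B given a measured speedup, then report the ceiling 1/B.
program amdahl
  implicit none
  real :: n, s, b
  n = 4.0      ! threads used in the measurement
  s = 2.7      ! measured speedup
  b = (n/s - 1.0) / (n - 1.0)
  print '(a,f5.2)', 'serial fraction B ~= ', b      ! about 0.16
  print '(a,f5.2)', 'max speedup 1/B  ~= ', 1.0/b   ! about 6.2
end program amdahl
```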

So if you're expecting to get 60x or 240x by moving to the Phi, you will be disappointed. Bearing in mind that the Phi's individual cores are slower than the host processor, my guess is that performance on the Phi may even be slower than your single host thread until you vectorize and reduce the amount of serial code.

Recompiling code to use the Phi should be simple, but your code needs to be highly parallel to run well on the Phi.

jimdempseyatthecove
Honored Contributor III

Antonio,

If you have a Windows host system with an installed Xeon Phi, then you may be able to use the offload model with minor code changes, and possibly no changes to your user-interface code.

As Cownie points out, your application will require a higher degree of parallelization and vectorization. Please keep in mind that parallelization efficiency can improve by relocating the parallel regions outwards (up nest levels). Some applications can also utilize parallel pipelining techniques; you might want to look at that as well.
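A minimal sketch of the offload model, assuming Intel's `!DIR$ OFFLOAD` directive syntax (the subroutine name and array shapes here are hypothetical):

```fortran
! The host program keeps the Windows user interface; only the marked
! compute region runs on the coprocessor. Arrays listed in the IN/OUT
! clauses are copied across the PCIe bus for the offload.
subroutine run_step(n, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: x(n)
  real, intent(out)   :: y(n)
  integer :: i

  !dir$ offload target(mic) in(x) out(y)
  !$omp parallel do
  do i = 1, n
     y(i) = 2.0 * x(i)
  end do
  !$omp end parallel do
end subroutine run_step
```

The same source still builds and runs as ordinary host OpenMP when no coprocessor is present, which keeps the change footprint small.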

Before you go through any conversion effort, look at what you need and compare this to what you have with 4 cores.

Jim Dempsey
