Software Archive
Read-only legacy content

Profiler to identify loops that are good candidates for offload.

aketh_t_
Beginner
295 Views

Hi all,

I am running an application known as CESM.

I have tried various profilers, both Intel (ITAC and VTune) and non-Intel (TAU and others).

However, I have not found any profiler that can suggest which loops are good candidates for offload or vectorization.

The --profile-loops option does not work on parallel applications, and CESM takes an eternity to complete if I try to run it MPI-serial.

Any suggestions?

Thanks in advance.

2 Replies
TimP
Honored Contributor III

The new Vectorization Advisor in the Parallel Studio beta is meant to augment the tools' ability to suggest where and how to vectorize.  This analysis would be done on a single process.  TAU or ITAC might help you identify which processes contain the bottlenecks; if the application is well balanced across processes, you can run VTune and, I suppose, Vectorization Advisor, on an MPI run with 1 rank and a suitably sized test case, rather than going to the trouble of collecting data per process.

It's often advised to find out which program regions respond to vectorization and OpenMP parallelism as part of assessing suitability for MIC.  MIC MPI performance scaling often falls off beyond 6 MPI processes or so, so it's important to engage OpenMP parallelization as well as vectorization in order to scale to at least 118 or so (preferably 180) total threads.  You must also identify a qualifying region of the program where the amount of processing that can be offloaded is high enough relative to the amount of data to be communicated between host and MIC.

As you seem to be running MPI already, "symmetric" MPI, where each coprocessor acts as an MPI node, is likely to be a strong candidate relative to OpenMP offload, keeping the OpenMP local to each process.  If your application is suitable for MIC, adding hybrid OpenMP threading may also pay off when running on host CPUs of 12 or more cores (6 cores in the case of the old Westmere).  Having MPI/OpenMP hybrid working also gives better options for balancing work between nodes on a heterogeneous cluster, as, on the current MIC, each MIC core offers only a fraction of the performance of a host core, even with effective vectorization.

Charles_C_Intel1
Employee

[Short version, since the forum just lost 0.75 hours of advice when I pressed submit - grrrr!  *ALWAYS* copy a long post to the clipboard before pressing submit]

VTune, sampling a single rank or an entire node and then viewing the result in "Loops only" mode, can give you an idea of which loops to thread, vectorize, or offload (the top-down tree view is good for figuring this out once you select a hot loop in the Bottom-up view).  See the VTune documentation on how to collect hotspots or advanced hotspots under MPI.
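As a rough sketch of what a single-rank collection might look like from the command line (the `amplxe-cl` driver name and flag spellings are from VTune of that era and vary by version; the `cesm.exe` binary name is hypothetical, so treat this as an illustration rather than exact syntax):

```shell
# Illustrative only: collect hotspots on a 1-rank MPI run of the
# application, then open the result and switch to "Loops only" grouping.
# Check the VTune documentation for the flags your version accepts.
mpirun -n 1 amplxe-cl -collect hotspots -result-dir vtune_r0 -- ./cesm.exe
amplxe-gui vtune_r0
```

Collecting on one rank, as TimP suggests above, keeps the result set small while still exposing the hot loops, provided the workload is reasonably balanced across ranks.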

Advisor gives similar information, and also matches compiler diagnostics with hot loops to help you figure out whether a given loop was vectorized.  It can also provide trip counts to give you an idea of whether or not it is profitable to consider vectorization or threading.

Don't bother trying to offload *well-optimized* Xeon loops that run for less than 2-10 seconds, and making threading or offload improve the runtime of highly tuned MPI code can be challenging.

Charles
