Hello,
I'm having trouble getting good performance from my code running on the MIC. My code is C/C++ and is compiled into a series of shared libraries with Python bindings accessible via SWIG. At this point, I have ported all of my solver routines to the MIC using the explicit memory model (#pragma offload...). Prior to my outer solver iteration loop, I allocate all of my memory arrays on the MIC so that they persist and do not have to be reallocated between kernel calls. I'm using OpenMP with enough concurrency for hundreds of threads, but am not seeing a performance gain for some reason. In fact, even with 120 or 240 threads, my solver is slower on the MIC than on a single Xeon core. I've run with the OFFLOAD_REPORT=2 environment variable, which confirms that my kernels are indeed running on the MIC and are not copying an unexpected amount of data to/from the MIC between kernel invocations (at most 4 KB per call). Since I achieve practically perfect scaling (both strong and weak) using OpenMP on standard Xeons, I'm confused as to what is going wrong in my code.
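Roughly, the offload structure looks like the following sketch (with placeholder names - flux, n, solve_iteration - rather than the real code), using the explicit alloc_if/free_if clauses to keep the device buffers resident between kernel calls:
[cpp]
__attribute__((target(mic))) void solve_iteration(double *flux, int n);  // placeholder kernel

void run_solver(double *flux, int n, int max_iters)
{
    // One-time allocation and copy-in before the outer iteration loop;
    // free_if(0) leaves the buffer resident on the coprocessor.
    #pragma offload_transfer target(mic:0) \
            in(flux : length(n) alloc_if(1) free_if(0))

    for (int iter = 0; iter < max_iters; ++iter) {
        // Reuse the persistent buffer: no allocation, no free, no data transfer.
        #pragma offload target(mic:0) \
                nocopy(flux : length(n) alloc_if(0) free_if(0))
        {
            solve_iteration(flux, n);
        }
    }

    // Copy the result back and release the coprocessor buffer.
    #pragma offload_transfer target(mic:0) \
            out(flux : length(n) alloc_if(0) free_if(1))
}
[/cpp]
The alloc_if(0)/free_if(0) pair on the per-iteration offload is what keeps the arrays persistent, so each call only moves a few kilobytes of control data.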
I am able to get some auto-vectorization and have gone through most of my kernels and added directives (i.e., #pragma simd...) to vectorize most regions of my code (at least according to Intel's vec-report). That said, I still have yet to align my data structures, which I need to take a look at tomorrow. Most regions of my code can only take advantage of an 8-wide vector unit with normal input data, so I will lose some performance with the 16-wide vector unit on the MIC as a result. Even so, I'm surprised that 60 MIC threads lose to 1 Xeon thread... is the latency to launch a kernel on the MIC significantly more than for a typical NVIDIA GPU?
How much time is your code spending in the Python sections?
Any time your code performs better on the Intel(r) Xeon(r) processor than on the Intel(r) Xeon Phi(tm) coprocessor, the first thing I would look at is parallelization and vectorization. From what you say, I take it that the Python sections are in the regions you have parallelized with OpenMP and that your C/C++ code is vectorizing, but I don't know of a version of Python that vectorizes on the coprocessor. (I would need to look around some more to say that with authority, but I believe that is the case.) Also, your statement that the vectorized portions can only take advantage of 8-wide rather than 16-wide vector registers worries me.
Practically no time is spent in the Python sections. I've just wrapped all of my C++ code so that a user can build a simulation (geometry, materials data, etc.) in Python and do seamless pre- and post-simulation data processing with Python's utilities. The solver itself is written in C++, so only one call is made from the host in Python; it invokes a C++ routine on the device, which executes a series of C++ function calls that consume the vast majority of the runtime.
I've parallelized the solver such that it can easily use hundreds of OpenMP threads (or CUDA threads, though it is performing MUCH better on GPUs than the MIC right now). I have not used Intel's intrinsics to vectorize my code by hand, but the Intel compiler's "vec-report" flag indicates that nearly all of my computationally intensive code is SIMD vectorized. Perhaps the SIMD vector report is falsely claiming vectorization?
It is curious to me that one thread on the MIC is about 3x slower than on a standard Xeon, which is about what I would expect based on the clock frequencies. When I use 2 and then 3 threads on the MIC, the runtime is reduced by half. Anything beyond 3 threads - even up to 240 threads - doesn't result in any speedup whatsoever. Since my solver contains many (8) parallel regions, this makes me wonder if it takes an inordinate amount of time to spawn a pool of OpenMP threads on the MIC.
I have been using the explicit memory model for all of my code and have written it such that the entire solver is offloaded to the MIC, with about 8 parallel regions and several regions in between that require synchronization. Would it be better to spawn a large pool of threads upfront for the extent of the solver and simply use OpenMP's synchronization mechanisms? Or is my current methodology of compartmentalizing my code into multiple OpenMP parallel regions with explicit synchronization in between just as good on the MIC?
wbinventor wrote:
I've parallelized the solver such that it can easily use hundreds of OpenMP threads (or CUDA threads, though it is performing MUCH better on GPUs than the MIC right now). I have not used Intel's intrinsics to vectorize my code by hand, but the Intel compiler's "vec-report" flag indicates that nearly all of my computationally intensive code is SIMD vectorized. Perhaps the SIMD vector report is falsely claiming vectorization?
The report will not lie; however, you need to check in detail what it says about remainder loops and so on. If the compiler does not know the alignment of the vectors, and they are short, it's possible that the whole loop is executed by the lead-in and lead-out code, and the main, well-vectorised loop never really runs.
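For example, something along these lines gives the compiler that alignment information (the names here are made up; _mm_malloc and __assume_aligned are the Intel compiler's helpers for this):
[cpp]
#include <malloc.h>          // _mm_malloc / _mm_free with the Intel compiler

// Hypothetical kernel, just to show the pattern: 64-byte alignment matches
// the coprocessor's 512-bit vector registers, and telling the compiler about
// it lets the vectoriser drop most of the peel/remainder code.
void scale(double *a, const double *b, double c, int n)
{
    __assume_aligned(a, 64);
    __assume_aligned(b, 64);
    #pragma simd
    for (int i = 0; i < n; ++i)
        a[i] = c * b[i];
}

void example(int n)
{
    double *a = (double *)_mm_malloc(n * sizeof(double), 64);
    double *b = (double *)_mm_malloc(n * sizeof(double), 64);
    for (int i = 0; i < n; ++i)
        b[i] = (double)i;
    scale(a, b, 2.0, n);
    _mm_free(a);
    _mm_free(b);
}
[/cpp]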
wbinventor wrote:
It is curious to me that one thread on the MIC is about 3x slower than on a standard Xeon, which is about what I would expect based on the clock frequencies.
If you are only seeing 3x slower you must have at least some vectorisation; on scalar integer code you may see a 10x slowdown... (Remember that as well as the clock speed, you're comparing a small, in-order core with a large, out-of-order one.)
wbinventor wrote:
When I use 2 and then 3 threads on the MIC the runtime is reduced by one half. Anything beyond 3 threads - even up to 240 threads - doesn't result in any speedup whatsoever. Since my solver contains many (8) parallel regions, this makes me wonder if it takes an inordinate amount of time to spawn a pool of OpenMP threads on the MIC.
I have been using the explicit memory model for all of my code and have written it such that the entire solver is offloaded on the MIC, with about 8 parallel regions and several regions in between that require synchronization. Would it be better to spawn a large pool of threads upfront and for the extent of the solver, and simply use OpenMP's synchronization mechanisms? Or is my current methodology of compartmentalizing my code into multiple OpenMP parallel regions with explicit synchronization in between just as good on the MIC?
I am still not completely clear what you are doing. Are you using a single offload that then has OpenMP code in it (which in turn has eight omp parallel regions)? Or are you doing eight offloads, each of which has a single parallel region?
I think you're saying you are doing the former, which seems sensible. From an OpenMP runtime point of view:
- The first OMP PARALLEL in the whole program is expensive (because it has to create all of the pthreads); you could potentially hide this by offloading an empty asynchronous OMP PARALLEL before doing anything else in main (there is a sketch of this after the code examples below).
- Subsequent OMP PARALLELs are much cheaper, since the threads already exist.
- Merging multiple parallel regions into a single OMP PARALLEL to avoid the fork/join cost: instead of writing
[cpp]
#pragma omp parallel for
for (…)
…;
// First serial code
#pragma omp parallel for
for (…)
…;
// Second serial code
[/cpp]
you write
[cpp]
#pragma omp parallel
{
#pragma omp for
for (…)
…;
#pragma omp single
// First serial code
#pragma omp for
for (…)
…;
#pragma omp single
// Second serial code
}
[/cpp]
This is not likely to improve performance, since you're trading a fork/join per parallel region for two full barriers (one at the end of the loop and one at the end of the OMP SINGLE). Since a fork/join is two half-barriers (so about the same as a single full barrier), that's not normally a good tradeoff.
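To make the warm-up idea in the first bullet concrete, here is a rough sketch using the offload signal/wait clauses (host_side_setup and run_offloaded_solver are placeholders for whatever your program actually does):
[cpp]
void host_side_setup(void);        // placeholder: read input, build geometry, ...
void run_offloaded_solver(void);   // placeholder: the real offloaded work

static char mic_ready;             // tag used only to pair signal() with wait()

int main(void)
{
    // Asynchronous, empty offload: forces the coprocessor-side OpenMP runtime
    // to create its thread pool while the host carries on with its own setup.
    #pragma offload target(mic:0) signal(&mic_ready)
    {
        #pragma omp parallel
        { /* empty: thread creation is the only purpose */ }
    }

    host_side_setup();

    // Make sure the warm-up has completed before the first real offload.
    #pragma offload_wait target(mic:0) wait(&mic_ready)

    run_offloaded_solver();
    return 0;
}
[/cpp]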
My code is similar to what you allude to in your first code fragment. I only have one offload, and within that section of code is a series of device function calls. Each device function has its own "#pragma omp for ..."
As I understand it from your post, it would be better to spawn my thread pool once upfront using "#pragma omp parallel" and to use "#pragma omp single" for synchronization, as shown in your second code fragment? Could this explain the drop in performance with up to 240 threads, or are there any other issues I should be aware of? I'll give this a try and see if it leads to any performance boost.
You have misread what I said. I do not expect the second style to be faster, but slower. (A fork/join is two half-barriers, so ~= 1 full barrier, whereas the second example has two full barriers, so the overhead would be doubled.)
Since you're seeing no scaling, I would
- Check that you really are running with the number of threads you expect (print omp_get_num_threads() inside your parallel regions; there is a small sketch after this list)
- Check that you have enough parallelism (I know you said you do, but since you're not seeing the performance you expect it's worth checking; what are the loop counts on your parallel loops?)
- Add some simple timing code to see where the time is really going, or use VTune
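Something along these lines inside the offloaded code would cover the first and third checks (x, n and the function name are just placeholders):
[cpp]
#include <stdio.h>
#include <omp.h>

// Quick sanity checks, callable from inside an offload region: report the
// thread count actually in use and time one representative parallel loop.
__attribute__((target(mic)))
void check_threads_and_timing(double *x, int n)
{
    #pragma omp parallel
    {
        #pragma omp single
        printf("threads in this region: %d\n", omp_get_num_threads());
    }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0;                 // stand-in for a real kernel
    double t1 = omp_get_wtime();
    printf("parallel loop: %f s for n = %d\n", t1 - t0, n);
}
[/cpp]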
This is likely a typo, but, if your offload code only has #pragma omp for (as you just wrote), then it's not going parallel. You need #pragma omp parallel for, or a #pragma omp parallel to create parallelism.
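To make the distinction concrete (placeholder loop bodies):
[cpp]
void scale_all(double *x, int n)
{
    // An orphaned "omp for" with no enclosing parallel region runs on a
    // single thread.
    #pragma omp for
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0;

    // "omp parallel for" both creates the team and distributes the loop.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0;
}
[/cpp]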
