Hi, my application needs to perform many operations (copy, accumulate, ...) on the overlapping region of two arrays. In most cases the overlap area is not very large; a typical size is 100x100x2. I have made a simple benchmark of these functions in the attached file.
Unfortunately, I found that performance on the MIC is much worse than on the Xeon CPU. For all three cases, a single OpenMP thread performs best:
Offload: 0.026 seconds
Native: 0.018 seconds
CPU: 0.0009 seconds
Any suggestions on improving performance on the MIC? Can we achieve better performance on the MIC than on the CPU?
Thanks!
If you will be running only 1 thread, MIC has no chance of being competitive with host.
You probably need to be using on the order of 100 threads effectively, with inner loop counts of at least 2000, for MIC to be competitive.
Did you set this up so that all threads are racing against the same data, which would account for the lack of threaded performance scaling?
Tim, I have tried more threads on the MIC, but the performance is much worse.
Unfortunately, in my application the inner loop count will not be as large as 2000.
The issue is that you are mistakenly including the time to construct the MIC's OpenMP thread pool in your timed section. To correct for this, in main(), in the offload section, make the for loop a #pragma omp parallel for loop.
You could alternatively add an innocuous #pragma omp parallel region before the timed section that does nothing but instantiate the thread pool:
#pragma omp parallel
{
    omp_get_wtime(); // anything here that will not get optimized out
}
Jim Dempsey
Jim, your suggestion does work. Now the time decreases to 0.009 -- 0.019 seconds. Thanks!
But it is still much slower than the CPU.
To take a quote from a Dirty Harry movie "A man's got to know his limitations."
I haven't tried this... (it may not work)
I would suggest experimenting with an initial asynchronous offload that starts a persistent parallel region on the MIC. That parallel region runs a continuous loop waiting for a message (a pointer). On the host, after returning from launching the listener parallel region, you call a modified version of ArrayOP, which starts a (now nested/concurrent) parallel region, packages the arguments into a message, then writes the pointer to the message into an atomic/volatile global variable (replace this with a queue later). The listener parallel region, upon seeing the pointer become non-NULL, dispatches to the function; when the function completes, it NULLs the shared pointer, thus signaling the shell function to return. Replace these mechanisms later with appropriate inter-thread communication objects.
What this will do is permit you to keep the threads running (trading wasted power for reduced latency).
OpenMP 4.x may provide offload extensions that implement a better (formal) way of doing this; I haven't had a chance to experiment with the newer features.
Jim Dempsey