Hi, my application needs to perform many operations (copy, accumulate, ...) on the overlapping region of two arrays. In most cases the overlap area is not very large; a typical size is 100x100x2. I have made a simple benchmark of these functions in the attached file.
Unfortunately, I found that performance on the MIC is much worse than on the Xeon CPU. For all three cases, a single OpenMP thread performs best:
Offload: 0.026 seconds
Native: 0.018 seconds
CPU: 0.0009 seconds
Any suggestions on improving performance on the MIC? Can we achieve better performance on the MIC than on the CPU?
Thanks!
If you will be running only 1 thread, MIC has no chance of being competitive with host.
You probably need to be using on the order of 100 threads effectively, with inner loop counts of at least 2000, for MIC to be competitive.
Did you set this up so that all threads are racing against the same data, which would account for the lack of threaded performance scaling?
Tim, I have tried more threads on the MIC, but the performance is much worse.
Unfortunately, in my application the inner loop count will not be as large as 2000.
The issue is that you are mistakenly including the time to construct the MIC's OpenMP thread pool in your timed section. To correct for this, in main(), in the offload section, make the for loop a #pragma omp parallel for loop.
You could alternatively add an innocuous #pragma omp parallel region before the timed section that does nothing but instantiate the thread pool:
#pragma omp parallel
{
    omp_get_wtime(); // anything here that will not get optimized out
}
Jim Dempsey
Jim, your suggestion does work. Now the time decreases to 0.009 -- 0.019 seconds. Thanks!
But it is still much slower than the CPU.
To take a quote from a Dirty Harry movie "A man's got to know his limitations."
I haven't tried this... (it may not work)
I would suggest experimenting with an initial asynchronous offload that starts a persistent parallel region on the MIC. That parallel region runs a continuous loop waiting for a message (a pointer). On the host, after returning from launching the listener parallel region, you call a modified version of ArrayOP, which starts a (now nested/concurrent) parallel region, packages the arguments into a message, then writes the pointer to the message into an atomic/volatile global variable (replace this with a queue later). The listener parallel region, upon seeing the pointer become non-NULL, dispatches to the function; when the function completes, it NULLs the shared pointer, thus signaling the shell function to return. Replace these mechanisms later with appropriate inter-thread communication objects.
What this will do is permit you to keep the threads running (trading wasted power for reduced latency).
OpenMP 4.x may provide offload extensions that implement a better (formal) way of doing this; I haven't had a chance to experiment with the newer features.
Jim Dempsey