Hi everyone,
I found that when I run axpy (y = a * x + y) on two separate sets of similar data, I get completely different execution times, as shown below. The attached file is the sample code for axpy.
My assumption is that the first run of the inout pragma has to spend time preparing (preconfiguring or warming up) the Xeon Phi coprocessor. If so, is there an official explanation for this odd behavior? If not, what is the reason? Is there a way to reduce or avoid this overhead? This is really important for benchmarking, because this situation never happens on NVIDIA/Intel GPUs/CPUs.
[liu@fornax Test_offomp]$ ./a.out
Total time for inout1 combined = 0.39732003 sec
Total time for inout2 combined = 0.01132083 sec
Best wishes,
Jiawen
For offload code, the first offload has to copy the MIC code of the application to the coprocessor and then instantiate the OpenMP thread pool.
The second and later offloads do not have to copy in the MIC code of the application, and they can reuse the existing OpenMP thread pool.
There is an option to specify that the MIC code can be pre-loaded at program start time.
Generally, when timing your code, you either disregard the first call of your timed region or, prior to the timed region, run an offload region that is not timed (this is needed only once, at application start).
Jim Dempsey
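The untimed warm-up offload described in the reply can be sketched roughly as follows. This is a minimal illustration, not the code from the attached sample: the function names (`now_sec`, `warm_up_offload`, `axpy`, `run_benchmark`) and the test values are hypothetical, and the offload pragmas are guarded by `__INTEL_OFFLOAD` so that the same file also builds and runs on the host with other compilers.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock time in seconds from a monotonic clock. */
double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Untimed dummy offload: intended to absorb the one-time cost of loading
 * the coprocessor binary and creating the OpenMP thread pool. */
void warm_up_offload(void)
{
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic)
#endif
    {
#ifdef _OPENMP
#pragma omp parallel
#endif
        { /* intentionally empty */ }
    }
}

/* y = a * x + y; offloaded only when built with Intel offload support. */
void axpy(int n, float a, float *x, float *y)
{
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic) in(x : length(n)) inout(y : length(n))
#endif
    {
#ifdef _OPENMP
#pragma omp parallel for
#endif
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical driver: warm up once, then time two identical axpy calls.
 * Returns 0 if the numerical result is as expected, nonzero otherwise. */
int run_benchmark(int n, double *t_first, double *t_second)
{
    assert(n > 0);
    float *x = malloc((size_t)n * sizeof *x);
    float *y = malloc((size_t)n * sizeof *y);
    if (!x || !y) { free(x); free(y); return -1; }
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    warm_up_offload();              /* not timed */

    double t0 = now_sec();
    axpy(n, 3.0f, x, y);            /* y becomes 5 */
    double t1 = now_sec();
    axpy(n, 3.0f, x, y);            /* y becomes 8 */
    double t2 = now_sec();

    *t_first = t1 - t0;
    *t_second = t2 - t1;

    int ok = 1;
    for (int i = 0; i < n; i++)
        if (y[i] != 8.0f) ok = 0;
    free(x);
    free(y);
    return ok ? 0 : 1;
}
```

Calling `run_benchmark` from `main` and printing the two times should show whether the gap between the first and second timed offload shrinks once the warm-up is in place. The pre-load option mentioned in the reply may correspond to the `OFFLOAD_INIT=on_start` environment variable of Intel's offload runtime; check the compiler documentation for your version before relying on it.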
