Hi there~
I've run into some problems and have a few questions, since I'm new to the Intel Phi.
Our system has 3 MIC cards per node. Now I'm trying to make these 3 cards and the host CPU work in parallel, much like MPI does.
I've finished distributing the data between the MIC cards and the CPU, and I tried to use "#pragma offload" to start the MIC work, like this:
The program clearly blocks here, waiting for the MIC work to complete before the next offload starts.
Is there a non-blocking way to do the offload?
A reply would help a lot! Thanks.
Sure, you can make the offloads non-blocking by adding a unique signal tag to each #pragma offload with the signal() clause. Then either issue a single #pragma offload_wait after the final #pragma offload to wait for the completion indications of all the unique tags at once, or wait on the completion indication for each tag individually.
Make sure each signal variable is initialized to a unique value. The brief discussions "About Asynchronous Computation" and "offload_wait and signal (tag)" in the User Guide have more details.
@Davis
I've tried it and it works!
Thank you very much~ I still have a lot to learn...
Now I'm trying to make these 3 cards and the host CPU work in parallel, just like what MPI does.
I'm sure you know this, but just in case, you do realize that you can use MPI and have MPI processes on each of the Phis and on the host? (Which would be exactly like MPI since it is MPI :-)).
Here are a couple of resources relating to James’ feedback.
How to run Intel MPI on Xeon Phi™
Using MPI and Xeon Phi™ Offload Together
Glad to hear the signals worked. Also, regarding the sample code you posted: the "-1" target number defers coprocessor selection to the runtime system. For greater control over coprocessor selection, you could use a program variable and assign each offload a unique target number so that it executes on a specific card.
James Cownie (Intel) wrote:
Now I'm trying to make these 3 cards and the host CPU work in parallel, just like what MPI does.
I'm sure you know this, but just in case, you do realize that you can use MPI and have MPI processes on each of the Phis and on the host? (Which would be exactly like MPI since it is MPI :-)).
Sorry, I gave a wrong picture of the parallelism between the MIC cards and the CPU. I know there can be MPI processes on each Phi when the Phis run in symmetric mode. I just meant that I need the CPU and the 3 Phis to work in parallel. : )
Kevin Davis (Intel) wrote:
Here are a couple of resources relating to James’ feedback.
How to run Intel MPI on Xeon Phi™
Using MPI and Xeon Phi™ Offload Together
Glad to hear the signals worked. Also, for your sample code you posted, the "-1" target number defers the coprocessor selection to the runtime system; however, for greater coprocessor selection/control you could use a specific program variable and assign a unique target number to each specific offload to execute on.
Thanks~ Since my code is quite simple, setting a few different constant target numbers for the devices will do~
My code now works with all MPI ranks on the host and the computation done asynchronously on the Phis (in offload mode) and the CPU; what's left is optimization.
Thank you again for your considerate help~
You're welcome. Glad I could help.