Software Archive
Read-only legacy content

Offload problems. Can I do offload in a non-blocking way?

Liu_N_
Beginner

Hi there~

I've run into some problems and have a few questions, since I'm new to the Intel Phi.

Our system has three MIC cards per node. Now I'm trying to make these three cards and the host CPU work in parallel, just like what MPI does.

I've completed the data distribution between the MICs and the CPU, and tried to use "#pragma offload" to start the MIC processes, like this:

[screenshot of the offload code; image not preserved]

It's quite clear that the program blocks here, waiting for the MIC process to complete before the next offload.

Is there a non-blocking way to do the offload?

A reply would help a lot! Thanks!

1 Solution
Kevin_D_Intel
Employee

Sure, you can make the offloads non-blocking by adding a unique signal tag to each #pragma offload via the signal() clause. Then either issue one #pragma offload_wait after the final #pragma offload to wait for the completion of all the tags, or wait on each tag's completion individually.

Make sure each signal variable is initialized to a unique value. The brief discussions "About Asynchronous Computation" and "offload_wait and signal(tag)" in the User Guide have more details.


7 Replies
Liu_N_
Beginner

@Davis

I've tried and it works!

Thank you very much~ I still have a lot to learn...

James_C_Intel2
Employee

Now I'm trying to make these 3 cards and the host CPU work in parallel, just like what MPI does.

I'm sure you know this, but just in case: you do realize that you can use MPI and have MPI processes on each of the Phis and on the host? (Which would be exactly like MPI, since it is MPI. :-))

Kevin_D_Intel
Employee

Here are a couple of resources relating to James’ feedback.

How to run Intel MPI on Xeon Phi™
Using MPI and Xeon Phi™ Offload Together

Glad to hear the signals worked. Also, in the sample code you posted, the "-1" target number defers coprocessor selection to the runtime system; for greater control over coprocessor selection, you could use a program variable and assign each offload a unique target number so it executes on a specific coprocessor.

Liu_N_
Beginner

James Cownie (Intel) wrote:

Now I'm trying to make these 3 cards and the host CPU work in parallel, just like what MPI does.

I'm sure you know this, but just in case: you do realize that you can use MPI and have MPI processes on each of the Phis and on the host? (Which would be exactly like MPI, since it is MPI. :-))

Sorry, I gave a wrong picture of the parallelism between the MIC and the CPU. I know there can be MPI processes on each Phi when the Phis work in symmetric mode; I just meant that I need to make the CPU and the three Phis work in parallel. : )

Liu_N_
Beginner

Kevin Davis (Intel) wrote:

Here are a couple of resources relating to James’ feedback.

How to run Intel MPI on Xeon Phi™
Using MPI and Xeon Phi™ Offload Together

Glad to hear the signals worked. Also, in the sample code you posted, the "-1" target number defers coprocessor selection to the runtime system; for greater control over coprocessor selection, you could use a program variable and assign each offload a unique target number so it executes on a specific coprocessor.

Thanks~ Since my code is quite simple, setting a different constant target number for each device does the job~

My code now works with all MPI ranks on the host, with the computation done asynchronously on the Phis (running in offload mode) and on the CPU; all that's left is optimization.

Thank you again for your considerate help~

Kevin_D_Intel
Employee

You're welcome. Glad I could help.
 
