OpenCL* for CPU

Xeon Phi ignores blocking flag?

moises_v_
Beginner

Dear all,

My Xeon Phi is ignoring the blocking flag in the enqueue[Read|Write]Buffer calls: they always behave as blocking calls. Is anyone else seeing the same behaviour?

Thanks in advance,

Moisés

 

Yuri_K_Intel
Employee
Hi,

Could you please share a simple reproducer that demonstrates this behavior, along with the reasoning that leads you to this conclusion?

Thanks,
Yuri
moises_v_
Beginner

For example, with a blocking read:

cQ.finish();
gettimeofday();
cQ.enqueueReadBuffer(CL_TRUE)    <--- 1 second
//foo/dummy loop                 <--- 100 seconds
cQ.finish();
gettimeofday();

Total = 101 seconds

And with a non-blocking read:

cQ.finish();
gettimeofday();
cQ.enqueueReadBuffer(CL_FALSE)   <--- 1 second
//foo/dummy loop                 <--- 100 seconds
cQ.finish();
gettimeofday();

Total = 101 seconds!! (expected: 100 seconds)

 

Thank you so much. It is probably a very simple issue, but I don't see where I am making the mistake.

PS: The non-blocking time doesn't change even if we put a flush immediately after the enqueueReadBuffer. Additionally, this "bad" behaviour also happens on NVIDIA cards (but not on AMD/ATI cards, where the calls are truly non-blocking).

 

Yuri_K_Intel
Employee
Sorry, I don't quite follow the logic here. I see only two measurement points (after the corresponding cQ.finish() calls). With that setup, the total execution time is expected to be the same in both cases, because the second cQ.finish() waits for the read command to finish (even if it was non-blocking). So you probably need to insert another measurement point after cQ.enqueueReadBuffer, and make sure the buffer is large enough that the real data transfer time stands out from any overhead of the enqueueReadBuffer call itself.

And by a reproducer I meant a complete code example that can be built and executed.

Thanks,
Yuri
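
The suggestion of a third measurement point can be sketched as follows. This is only a model under stated assumptions, not OpenCL code: the transfer is simulated with std::async and a sleep, the names measure, Timings, kTransfer and kHostLoop are hypothetical, and the durations are arbitrary. It shows that the time spent inside the enqueue call itself, rather than the total, is what reveals whether the call returned early.

```cpp
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-ins, not OpenCL calls: the "transfer" and the host-side
// dummy loop are modelled as sleeps of fixed length.
constexpr auto kTransfer = std::chrono::milliseconds(200);
constexpr auto kHostLoop = std::chrono::milliseconds(300);

struct Timings {
    double enqueue_s;  // time spent inside the "enqueue" call itself
    double total_s;    // time from the start until after the final wait
};

static double seconds(Clock::duration d) {
    return std::chrono::duration<double>(d).count();
}

// blocking == true  models a read enqueued with CL_TRUE,
// blocking == false models a read enqueued with CL_FALSE.
Timings measure(bool blocking) {
    const auto t0 = Clock::now();

    // "Enqueue": start the simulated transfer on another thread.
    std::future<void> transfer = std::async(std::launch::async, [] {
        std::this_thread::sleep_for(kTransfer);
    });
    if (blocking) transfer.wait();  // a blocking call returns only when done

    const auto t1 = Clock::now();   // the extra measurement point

    std::this_thread::sleep_for(kHostLoop);  // host-side dummy loop
    transfer.wait();                         // plays the role of cQ.finish()

    const auto t2 = Clock::now();
    return {seconds(t1 - t0), seconds(t2 - t0)};
}
```

Calling measure(true) and measure(false) and comparing the enqueue_s fields separates the two cases in this model. In a real program the equivalent would be one more gettimeofday() directly after cQ.enqueueReadBuffer; whether the total also shrinks depends on whether the runtime actually overlaps the transfer with host work.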
moises_v_
Beginner

Thank you Yuri,

 

I am sending you some code that you can use to check my problem. You only have to decompress it and compile it using the Makefile.

My output is:

Time blocking transfer:0.999292 seconds
Time non-blocking transfer:0.999324 seconds


If I comment out the "dummy" loop, I obtain:

Time blocking transfer:0.160764 seconds
Time non-blocking transfer:0.160766 seconds

That is, the memory copy takes 0.16 seconds in both cases (obviously).

I guess that my output should be:

Time blocking transfer:0.999292 seconds
Time non-blocking transfer:0.838560 seconds (0.999324-0.160764 seconds)

 

Where is the problem? As I said, I have probably forgotten something or made a mistake, but I don't see where :/

 

Thank you so much again Yuri,

Moisés

 


 
