Xeon Phi ignores blocking flag?

moises_v_ · ‎01-27-2015

Dear all,

My Xeon Phi appears to ignore the blocking flag in the enqueue[Read|Write]Buffer calls: they always behave as blocking calls. Is anyone else seeing the same behaviour?

Thanks in advance,

Moisés

Yuri_K_Intel · ‎02-02-2015

Hi, could you please share a simple reproducer that demonstrates this behavior, as well as the reasoning that leads you to this conclusion? Thanks, Yuri

moises_v_ · ‎02-02-2015

For example, the blocking case:

cQ.finish();
gettimeofday();
cQ.enqueueReadBuffer(CL_TRUE);    // 1 second
// foo/dummy loop                 // 100 seconds
cQ.finish();
gettimeofday();

Total = 101 seconds

And the non-blocking case:

cQ.finish();
gettimeofday();
cQ.enqueueReadBuffer(CL_FALSE);   // 1 second
// foo/dummy loop                 // 100 seconds
cQ.finish();
gettimeofday();

Total = 101 seconds!! (expected: 100 seconds)

Thank you so much. It is probably a very simple issue, but I don't see where I am making the mistake.

PS: The time for the non-blocking case doesn't change even if we put a flush immediately after the enqueueReadBuffer. Additionally, this "bad" behaviour also happens on NVIDIA cards (but not on AMD/ATI cards, where the calls are truly non-blocking).

Yuri_K_Intel · ‎02-03-2015

Sorry, I don't quite follow the logic here. I see only 2 measurement points (after the corresponding cQ.finish() calls). With those, the total execution time is expected to be the same in both cases, because the second cQ.finish() waits for the read command to finish (even if it was non-blocking). So you probably need to insert another measurement point right after cQ.enqueueReadBuffer, and make sure the buffer is large enough for the real data transfer time to stand out from any overhead of the enqueueReadBuffer call itself. Also, by a reproducer I meant a complete code example that can be built and executed. Thanks, Yuri
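To illustrate the extra measurement point, here is a minimal sketch in plain C++. It is a toy virtual-time model and does not call OpenCL; SimQueue, enqueue_read, host_work and measure are hypothetical names made up for this illustration. It assumes an idealised runtime in which a non-blocking copy overlaps fully with host work, which a real device may or may not do.

```cpp
#include <algorithm>

// Toy virtual-time model of an in-order command queue (NOT the OpenCL API).
struct SimQueue {
    double host_now = 0.0;     // host clock, in seconds
    double device_done = 0.0;  // time at which the device finishes queued work

    // Models enqueueReadBuffer: the copy occupies the device for copy_s seconds.
    void enqueue_read(bool blocking, double copy_s) {
        device_done = std::max(device_done, host_now) + copy_s;
        if (blocking) host_now = device_done;  // host waits for the copy to end
    }
    // Models the dummy loop: work done on the host only.
    void host_work(double work_s) { host_now += work_s; }
    // Models cQ.finish(): host waits until the device is idle.
    void finish() { host_now = std::max(host_now, device_done); }
};

struct Timings {
    double after_enqueue;  // time until the enqueue call returned
    double total;          // time until the final finish() returned
};

Timings measure(bool blocking, double copy_s, double work_s) {
    SimQueue q;
    q.finish();
    const double t0 = q.host_now;      // first gettimeofday()
    q.enqueue_read(blocking, copy_s);
    const double t1 = q.host_now;      // extra measurement point
    q.host_work(work_s);
    q.finish();
    const double t2 = q.host_now;      // last gettimeofday()
    return {t1 - t0, t2 - t0};
}
```

With the numbers from the example (1 second copy, 100 seconds loop), the model gives after_enqueue = 1 and total = 101 for the blocking case, and after_enqueue = 0 and total = 100 for the non-blocking case. On a real runtime, after_enqueue is the more direct indicator: if it is close to the copy time even with CL_FALSE, the call is effectively blocking.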

moises_v_ · ‎02-03-2015

Thank you Yuri,

I am attaching code that you can use to reproduce my problem. You only have to decompress it and compile it using the Makefile.

My output is:

Time blocking transfer:0.999292 seconds
Time non-blocking transfer:0.999324 seconds

If I comment out the "dummy" loop, I obtain:

Time blocking transfer:0.160764 seconds
Time non-blocking transfer:0.160766 seconds

That is, the memory copy takes 0.16 seconds in both cases (obviously).

I guess that my output should be:

Time blocking transfer:0.999292 seconds
Time non-blocking transfer:0.838560 seconds (0.999324-0.160764 seconds)

Where is the problem? Again, I have probably forgotten something or made a mistake, but I don't see where :/

Thank you so much again Yuri,

Moisés
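The expected figure above can be written out as a small calculation. This is only a sketch in plain C++ with hypothetical helper names (loop_time, overlapped_time). It assumes that the measured 0.999324 seconds is the copy followed by the loop, and that a truly non-blocking copy would overlap perfectly with the loop, which the measurements here do not establish.

```cpp
#include <algorithm>

// Loop time, assuming the measured total is copy + loop run back to back.
double loop_time(double total_s, double copy_s) { return total_s - copy_s; }

// Expected wall time if copy and loop overlap perfectly (an idealisation).
double overlapped_time(double copy_s, double loop_s) {
    return std::max(copy_s, loop_s);
}
```

For example, overlapped_time(0.160764, loop_time(0.999324, 0.160764)) evaluates to about 0.838560 seconds, since the loop is the longer of the two.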