Shared Context for CPU/GPU Application

andradx · ‎04-03-2013

I am developing a CPU/GPU application using the OpenCL SDK 2013. However, I'm having some issues and still have some doubts on how to properly manage a shared context.

Based on information from Intel Optimization Guides, one should define a shared context for distributing a workload across the CPU and GPU. Moreover, the same buffer (cl_mem dataBuffer) should be defined and allocated for storing the data which both CPU and GPU will access. The CPU accesses a portion by a buffer defined as a sub-buffer of dataBuffer (cl_mem cpuSubBuffer) and the GPU by another sub-buffer (cl_mem gpuSubBuffer). I have properly aligned both sub-buffers so that no alignment issues arise.
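For readers following along, the split described above can be sketched as plain arithmetic. This is a minimal illustration, not the poster's code: `Region`, `Split`, and `splitBuffer` are hypothetical names, `Region` only mirrors the two fields of `cl_buffer_region`, and `alignBytes` is assumed to be the larger `CL_DEVICE_MEM_BASE_ADDR_ALIGN` of the two devices converted from bits to bytes. A sub-buffer origin that is not aligned makes `clCreateSubBuffer` fail with `CL_MISALIGNED_SUB_BUFFER_OFFSET`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cl_buffer_region (same two fields).
struct Region {
    std::size_t origin;  // byte offset into the parent buffer
    std::size_t size;    // byte length of the sub-buffer
};

struct Split {
    Region cpu;
    Region gpu;
};

// Splits totalBytes between CPU and GPU so that the GPU region starts on an
// alignBytes boundary. cpuPercent (0..100) is the share given to the CPU.
// Assumption: alignBytes is CL_DEVICE_MEM_BASE_ADDR_ALIGN (reported in bits)
// divided by 8, taking the larger value of the two devices.
inline Split splitBuffer(std::size_t totalBytes, unsigned cpuPercent,
                         std::size_t alignBytes) {
    assert(alignBytes > 0 && cpuPercent <= 100);
    std::size_t cpuBytes = totalBytes / 100 * cpuPercent;
    cpuBytes -= cpuBytes % alignBytes;  // round down to an aligned boundary
    Split s;
    s.cpu = Region{0, cpuBytes};
    s.gpu = Region{cpuBytes, totalBytes - cpuBytes};
    return s;
}
```

Each resulting `origin`/`size` pair would then be copied into a `cl_buffer_region` and passed to `clCreateSubBuffer`. The caller should check that neither size is zero, since a very small CPU share rounds down to an empty region.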

The first question I have is, how do I initialize data on the dataBuffer? Should I use the dataBuffer on the command queue associated with the CPU or with the GPU? Or should I copy data to cpuSubBuffer through the CPU queue and to gpuSubBuffer through the GPU queue?

The second one is, just to confirm, do I need to define a cl_kernel for CPU execution and another for GPU execution?

Thank you for your kind help.

[EDIT] I've looked at the NBody simulation example by Intel and found that a zero-copy strategy is used in this approach. That resolves the first question, and the second is answered as well: the same kernel can have its arguments redefined, so that passing cpuSubBuffer and gpuSubBuffer as arguments makes each device process the appropriate region of dataBuffer.

However, I must ask a third question: how can I measure the overall execution time? If I take it as the difference between the end of the event associated with the GPU execution and the start of the event associated with the kernel running on the CPU (the CPU is queued first), the result is far larger than the actual execution time. I guess the command queue "clocks" are not synchronized, so the timestamps of CPU events are not coherent with those of GPU events. How do I get around this using the profiling information from the events rather than a CPU timer?

If I instead take the isolated CPU and GPU execution times and add them, the result does not account for the time both devices were concurrently active, so it does not reflect the actual execution time.

TY again.

Raghupathi_M_Intel · ‎04-03-2013

Do you need the precise total time it takes to execute just the kernels, or the overall execution time of your app? You can look into QueryPerformanceCounter for pretty accurate execution times. You are correct, you cannot use events in this case, as you cannot guarantee when and for how long the CPU and GPU execute concurrently.

andradx · ‎04-03-2013

I'm looking for kernel execution time only.

Although it may not be trivial to synchronize the command queue profiling counters, isn't it somewhat limiting that a host timer overestimates the actual execution time? Furthermore, the OpenCL profiling events offer a more interesting perspective on the command submission, queueing, and running status of an event. Given my preference for non-blocking enqueue functions, the alternative of imposing barriers or fences across the code would introduce some blocking points, no? Well, I'll give it a shot.

Thank you Raghu.

Raghupathi_M_Intel · ‎04-03-2013

Events do give you execution time, command submission time, etc. at a much finer granularity. But since you are measuring time on the CPU and GPU with the work overlapping, I cannot think of a better way to measure time than the method I suggested. For what you are trying to do, using barriers may be OK too.

Thanks,
Raghu

andradx · ‎04-03-2013

If you don't mind me asking another question: I noticed there are some timer class implementations that use the QueryPerformanceCounter function. This function returns clock ticks, and converting them to time requires the clock frequency. But given the variable clock frequencies of today's Intel CPUs, is this accurate? What happens when the measured interval spans different clock frequencies?
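The conversion being asked about is a single division. The sketch below is only an illustration with a hypothetical helper name, `ticksToSeconds`. On Windows the tick values would come from QueryPerformanceCounter and the rate from QueryPerformanceFrequency; here they are plain integers, so the arithmetic can be checked on any platform. It assumes the tick rate is constant over the measured interval, which is exactly what the question puts in doubt.

```cpp
#include <cassert>
#include <cstdint>

// Converts a tick interval to seconds.
// Assumption: ticksPerSecond is constant between the two readings.
inline double ticksToSeconds(std::int64_t startTicks, std::int64_t endTicks,
                             std::int64_t ticksPerSecond) {
    assert(ticksPerSecond > 0 && endTicks >= startTicks);
    // Subtract in integer ticks first, then divide once, to limit rounding.
    return static_cast<double>(endTicks - startTicks) /
           static_cast<double>(ticksPerSecond);
}
```

For example, 2,500,000 ticks at a rate of 10,000,000 ticks per second correspond to 0.25 s.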

Raghupathi_M_Intel · ‎04-03-2013

I am not sure which implementation you are talking about but, yes, you need to take the frequency into account to get accurate timing information. But then comes the question of dynamic frequency. The CPU (and GPU) have a mechanism to change frequencies for power savings. You cannot accurately get timing information in this case. One trick people use for performance measurement is to disable throttling.

Raghu

Rami_J_Intel · ‎04-03-2013

First, as for your question about whether the CPU and GPU events are in sync or not: I would say that they belong to the same clock domain, so it's fine to do calculations like you did in terms of time units.

As for how to measure kernel execution time, you will need to decide whether the time between the end of one command and the beginning of the next one (or the dependent one) should be counted or not. If it should be counted, then I think (and I think Raghu has suggested this too) you can query the CL_RUNNING time of the first command to be executed and subtract it from the CL_COMPLETE time of the last command to be executed, on either the CPU or the GPU. If you aren't sure which is first or last, you can use a Marker and a UserEvent to synchronize these start and end points. In the other case it's much more complicated; I think it's as complicated as measuring only the execution time of thread functions in a multi-threaded app.
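The two cases can be sketched as interval arithmetic over the profiling timestamps. This is an illustration under assumptions, not Intel's recommended method: `Interval`, `spanNs`, and `busyNs` are hypothetical names, the start and end values are assumed to have been read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, in nanoseconds), and the CPU and GPU timestamps are assumed to share one clock domain as stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One profiled command: start and end timestamps in nanoseconds.
struct Interval {
    std::uint64_t start;
    std::uint64_t end;
};

// Case 1: gaps count. Earliest start to latest end across all commands.
inline std::uint64_t spanNs(const std::vector<Interval>& evs) {
    assert(!evs.empty());
    std::uint64_t first = evs[0].start, last = evs[0].end;
    for (const Interval& e : evs) {
        first = std::min(first, e.start);
        last = std::max(last, e.end);
    }
    return last - first;
}

// Case 2: gaps do not count. Length of the union of all intervals, so time
// where CPU and GPU overlap is counted once and idle gaps are excluded.
inline std::uint64_t busyNs(std::vector<Interval> evs) {
    std::sort(evs.begin(), evs.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });
    std::uint64_t total = 0;
    std::size_t i = 0;
    while (i < evs.size()) {
        std::uint64_t s = evs[i].start, e = evs[i].end;
        // Merge every following interval that starts before this run ends.
        while (++i < evs.size() && evs[i].start <= e) e = std::max(e, evs[i].end);
        total += e - s;
    }
    return total;
}
```

With intervals (100, 400), (150, 300), and (600, 700), the span is 600 ns and the busy time is 400 ns, whereas simply adding the three durations gives 550 ns, which matches neither.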

Thanks, Rami

andradx · ‎04-04-2013

I've settled for a host timer, using the Windows QueryPerformanceCounter. It yields results equivalent to the preliminary ones I'm getting. In a later development phase I'll try to balance the workload across both devices; for now the CPU dominates the execution time. The difference between the CPU execution time measured with OpenCL events and with the host timer is ~10us, which is acceptable. I've measured it along these lines (with cpuTimer a class wrapping QueryPerformanceCounter):

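cpuTimer.start();

// Enqueue the CPU share; each launch gets its own event slot
for (int i = 0; i < MAX_ITERATIONS; i++)
    clEnqueueNDRangeKernel(cpuQueue, kernel, 1, NULL, globalThreads, NULL, 0, NULL, &globalEvents[i]);
clFlush(cpuQueue);

// set GPU arguments for kernel

// Enqueue the GPU share on the GPU queue, using the second half of the event array
for (int i = 0; i < MAX_ITERATIONS; i++)
    clEnqueueNDRangeKernel(gpuQueue, kernel, 1, NULL, globalThreads, NULL, 0, NULL, &globalEvents[MAX_ITERATIONS + i]);
clFlush(gpuQueue);

// Block until every CPU and GPU launch has completed
clWaitForEvents(2 * MAX_ITERATIONS, globalEvents);

cpuTimer.stop();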
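The cpuTimer class itself is not shown in the thread. A minimal portable sketch of such a start/stop wrapper is given below for reference; `HostTimer` is a hypothetical name, and it uses std::chrono::steady_clock rather than QueryPerformanceCounter, so it illustrates the interface rather than reproducing the poster's class.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the cpuTimer class used above.
// Assumption: std::chrono::steady_clock replaces QueryPerformanceCounter.
class HostTimer {
    using Clock = std::chrono::steady_clock;

public:
    void start() {
        begin_ = Clock::now();
        running_ = true;
    }

    void stop() {
        assert(running_);  // stop() without a matching start() is a bug
        end_ = Clock::now();
        running_ = false;
    }

    // Seconds between the last start()/stop() pair.
    double elapsedSeconds() const {
        return std::chrono::duration<double>(end_ - begin_).count();
    }

private:
    Clock::time_point begin_{};
    Clock::time_point end_{};
    bool running_ = false;
};
```

Because steady_clock is monotonic, the elapsed value is never negative, and the ticks-to-seconds conversion is handled by the library.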

Despite the low error between the OpenCL and host measurements, since QueryPerformanceCounter depends on the clock frequency, if I disable throttling there is still Turbo Boost to account for, right? Or are these counters provided in Windows.h frequency agnostic, even though one must divide by the queried frequency to convert ticks to seconds? So, rephrasing the previous question to Raghu, I assume that even reading the TSC would be frequency dependent, no?

Dmitry_K_Intel · ‎04-04-2013

For modern Intel CPUs you can assume the TSC counts at a constant rate. Please note that this issue is different from TSC counting in platform S-states, which should be checked independently.

From the Intel Software Developers Manual:

Processor families increment the time-stamp counter differently:
• For Pentium M processors (family [06H], models [09H, 0DH]); for Pentium 4 processors, Intel Xeon processors (family [0FH], models [00H, 01H, or 02H]); and for P6 family processors: the time-stamp counter increments with every internal processor clock cycle. The internal processor clock cycle is determined by the current core-clock to bus-clock ratio. Intel® SpeedStep® technology transitions may also impact the processor clock.
• For Pentium 4 processors, Intel Xeon processors (family [0FH], models [03H and higher]); for Intel Core Solo and Intel Core Duo processors (family [06H], model [0EH]); for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors (family [06H], model [0FH]); for Intel Core 2 and Intel Xeon processors (family [06H], display_model [17H]); for Intel Atom processors (family [06H], display_model [1CH]): the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the maximum qualified frequency of the processor, see Section 19.20.5 for more detail.
The specific processor configuration determines the behavior. Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward.