can single work items kernels run in parallel on same device

Altera_Forum · ‎12-17-2017

Can single work item kernels run in parallel on the same device (i.e. on the same board).

I've been trying to get a very simple example of task parallelism working but have not been able to

to get more than one kernel to run at the same time on the same board.

The kernel computes part of summing equations - let's say it sums numbers from "start" to "end".

In a .cl file there are multiple identical kernels that do this - let's say there at 12 of them.

Single work items kernels have been used to insure that the equation can be pipelined.

The host code creates multiple kernels and multiple contexts in an effort to run more than one in parallel.

After trying many, many things, I've yet to get them to run in parallel. Initially I used just the time profile

to see how much time they take to run. Each kernel takes about the same time (e.g. 25 ms). If 12 kernels

are started, the time is 300 ms.

There are four identical boards in the system. If 12 kernels are used and three are used on each of the four

boards then each one takes 25 ms but each board can run them in parallel so the total time is only 75 ms.

What else is needed to get the kernels to run in parallel on the same board. I've been able to turn on

profiling and can see that each one is started - one after the other.

Everything seems to work (i.e. the correct answer is produced) but the kernels don't run at the same time on

a single board.

Do I need to use NDR range kernels?

Any suggestion would be greatly appreciated! (this should be so hard?!?)

Altera_Forum · ‎12-17-2017

Any type of kernel can run in parallel with another, as long as they are invoked in a separate queue, and no event is used to forcibly sentimentalize them; the key point here is that they must run in a different queue and you should not force the host to wait for each kernel execution separately using commands like clFlush() or clFinish(), or by waiting on events. You can, and probably should, wait for an event associated with each kernel invocation, or use clFinish() on every single queue you have, after invoking all the kernels in the host, to make sure all kernels have finished execution, to be then able to read the data back from the device.

Another way this can be accomplished more efficiently is to use replicated autorun kernels; more details about this are available in "Intel FPGA SDK for OpenCL Programming Guide, Section 11.4".

Finally, I need to emphasize on the fact that since external memory bandwidth is shared between the kernels running in parallel, you should not expect to get linear speed-up by using multiple parallel kernels. In fact, assuming that one of your kernels is memory-bound on its own, you will not see any speed-up at all by replicating it.

P.S. I have done this multiple times, and it certainly works.

Altera_Forum · ‎01-22-2018

I've tried everything I can think of but I still cannot get my kernel tasks to run in parallel.

The kernels are only doing a simple sum of numbers from "start" to "finish". Other than their result, there is no other memory

being used. Sharing memory shouldn't be a problem preventing them from running together.

Each kernel is created in its own queue.

No waiting via cl_Finish() is being done. I get an event from each kernel and use it to determine when that kernel is complete

and print out the time. I can see from that output that they kernels are not running in parallel.

Because then are not running concurrently, but they can, there must be something being shared that is preventing them.

I don't know what though.

Can anyone take a look at the kernel code and C-code and tell me what is keeping them from running concurrently?

Here's the kernel code:

__kernel void sumN1(const double start,
                    const double stop,
                    const double step,
                 __global double *z) {
  //// get index of the work item
  //int index = get_global_id(0);
  //init result
  double sum = 0.0;
  for (double i=start; i<=stop; i+=step) {
    sum += i;
  }
  z = sum;
} // sumN1
... eight identical kernels ('cept the kernel name) - in the same .cl file
__kernel void sumN8(const double start,
                    const double stop,
                    const double step,
                 __global double *z) {
  //// get index of the work item
  //int index = get_global_id(0);
  //init result
  double sum = 0.0;
  for (double i=start; i<=stop; i+=step) {
    sum += i;
  }
  z = sum;
} // sumN8

Here's "most" of the host C code:


////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
// main - sumN
//
// - sumN num 
//
int main(int argc, char *argv) {
      cl_device_id  accDeviceID;                //device ID of first acceleator device
    cl_platform_id  accPlatformID;              //platform ID of platform w/first accelerator device
            cl_mem  answer;                  //array to collect results
       std::string  binary_file;                //name of OpenCL program
        cl_context  context;                    //OpenCL context for this application
    cl_device_type  deviceType;                 //type of a device (CPU, GPU, ACCELERATOR)
          cl_event  event;                      //wait event synchronization handle used by OpenCL API
              bool  foundintel = false;         //indicates that Intel FPGA card was found
              char  info_text;   //value of some returned text information
              bool  isACC = false;              //flag to remember that an accelerator has been found
         cl_kernel  kernel;                  //OpenCL kernal for this applicaiton
               int  kerns = 0;                  //number of kernels to use
           cl_uint  numDevices;                 //number of OpenCL computing devices for a platform
           cl_uint  numPlatforms;               //number of OpenCL platforms (typically 1)
            double  number = 0.0;               //number to compute sum to
        cl_program  program;                    //OpenCL program for this application
  cl_command_queue  queue;                   //OpenCL command queue for this application
            double  result = { 0.0 };        //result of the summation computation
            size_t  size;                       //size of returned information from OpenCL API
            double  start = 1.0;                //start of summing
            cl_int  status;                     //return code used by OpenCL API
            double  step  = 1.0;                //step  of summing
            double  stop  = 1.0;                //end   of summing
            cl_int  task_done;                  //info from event query
          cl_event  task_event;              //events from tasks
.... some code omitted here that handled input args and platform, device setup
  ////////////////////////////////////////
  // OpenCL context
  context = clCreateContext(NULL, 1, &accDeviceID, NULL, NULL, &status);
  exitOnFail(status, "create context");
  ////////////////////////////////////////
  // OpenCL command queue
  for ( int kz=0; kz<kerns; kz++) {
    queue = clCreateCommandQueue(context, accDeviceID, 0, &status);
    exitOnFail(status, "create command queue");
  }
  ////////////////////////////////////////
  // Create the program for all device. Use the first device as the
  // representative device (assuming all device are of the same type).
  binary_file = getBoardBinaryFile("sumN", accDeviceID);
  program = createProgramFromBinary(context, binary_file.c_str(), &accDeviceID, 1);
  ////////////////////////////////////////
  // Build the program that was just created.
  status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
  exitOnFail(status, "Failed to build program");
  const double start_time = getCurrentTimestamp();
  ////////////////////////////////////////
  // create the kernel
  // Create the kernel - name passed in here must match kernel name in the
  // original CL file, that was compiled into an AOCX file using the AOC tool
  char kernel_name = "sumNx";  // Kernel name, as defined in the CL file
  for ( int kz=0; kz<kerns; kz++) {
    sprintf(kernel_name, "sumN%d", kz+1);  // generate the Kernel name, as defined in the CL file
    kernel = clCreateKernel(program, kernel_name, &status);
    exitOnFail(status, "Failed to create kernel");
    // Set the kernel argument (argument 0)
    status = clSetKernelArg(kernel, 0, sizeof(cl_double), &start);
    exitOnFail(status, "Failed to set kernel arg 0");
    // Set the kernel argument (argument 1)
    status = clSetKernelArg(kernel, 1, sizeof(cl_double), &stop);
    exitOnFail(status, "Failed to set kernel arg 0");
    // Set the kernel argument (argument 2)
    status = clSetKernelArg(kernel, 2, sizeof(cl_double), &step);
    exitOnFail(status, "Failed to set kernel arg 0");
    // last OpenCL argument: memory buffer object for result
    answer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(cl_double), &result, &status);
    exitOnFail(status, "create buffer for answer");
    // set 4th argument
    status = clSetKernelArg(kernel, 3, sizeof(cl_mem), &answer);
    exitOnFail(status, "set kernel argument answer");
  }
  // Launch the kernels
  for ( int kz=0; kz<kerns; kz++) {
    status = clEnqueueTask(queue, kernel, 0, NULL, &task_event);
    exitOnFail(status, "Failed to launch kernel");
  }
  int total_done = 0;
  int its_done = { 0 };
  while (total_done < kerns) {
    for ( int kz=0; kz<kerns; kz++) {
      if ( its_done == 0 ) {
        status = clGetEventInfo(task_event, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &task_done, NULL);
        if (task_done == CL_COMPLETE) {
          printf("Task:%d complete (%0.3f ms)\n", kz, (getCurrentTimestamp() - start_time) * 1.0e3 );
          total_done++;
          its_done = 1;
        } else {
          //printf("Task:%d incomplete\n", kz);
        }
      } // if kz task not done
    } // foreach task event
  } // wait for kernels to complete
//// Wait for command queue to complete pending events
//for ( int kz=0; kz<kerns; kz++) {
//  status = clFinish(queue);
//  exitOnFail(status, "Failed to finish");
//}
  const double end_time = getCurrentTimestamp();
  // Wall-clock time taken.
  printf("\nTime: %0.3f ms (%0.3f ms / kernel)\n", (end_time - start_time) * 1e3, (end_time - start_time) * 1e3 / (double)kerns );
  for ( int kz=0; kz<kerns; kz++) {
    printf("Sum 0-%f (step %f) = %f\n", number, step, result);
  }
  // Free the resources allocated
  cleanup();
  if(kernel) {
    for ( int kz=0; kz<kerns; kz++) {
      clReleaseKernel(kernel);
    }
  }
  if(program) {
    clReleaseProgram(program);
  }
  if(queue) {
    for ( int kz=0; kz<kerns; kz++) {
      clReleaseCommandQueue(queue);
    }
  }
  if(context) {
    clReleaseContext(context);
  }
  exit(0);
} // main

This is the output from a run using four kernels.

There is no parallelism. I've tried using the profiler and it clearly shows that each kernel

runs, one after the other.

$ bin/host 100000 4

Reprogramming device [0] with handle 1

Task:0 complete (3.600 ms)

Task:1 complete (7.096 ms)

Task:2 complete (10.583 ms)

Task:3 complete (14.066 ms)

Time: 14.069 ms (3.517 ms / kernel)

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

Altera_Forum · ‎01-22-2018

Try putting "start_time = getCurrentTimestamp();" before the kernel invocation loop. I have a feeling your kernel run time is so short that total time is being dominated by the clCreateBuffer() call. Note that on most hardware, using host pointer will result in the OpenCL runtime actually allocating and transferring the whole buffer to device memory. Also considering increasing your input size so that your total run time is at least a few seconds.

Altera_Forum · ‎01-22-2018

I did try the longer runs. The result is the same (i.e. no apparent parallelism).

Here's a snapshot w/8 kernels.

$ bin/host 100000000 8

Task:0 complete (3437.999 ms)

Task:1 complete (6875.815 ms)

Task:2 complete (10313.553 ms)

Task:3 complete (13751.281 ms)

Task:4 complete (17189.009 ms)

Task:5 complete (20626.756 ms)

Task:6 complete (24064.509 ms)

Task:7 complete (27502.250 ms)

Time: 27502.254 ms (3437.782 ms / kernel)

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000

Altera_Forum · ‎01-22-2018

And what about moving the "start_time = getCurrentTimestamp()" call?

Altera_Forum · ‎01-22-2018

--- Quote Start ---

And what about moving the "start_time = getCurrentTimestamp()" call?

--- Quote End ---

Same result....

code update snippet....


  const double start_time = getCurrentTimestamp();
  // Launch the kernels
  for ( int kz=0; kz<kerns; kz++) {
    status = clEnqueueTask(queue, kernel, 0, NULL, &task_event);
    exitOnFail(status, "Failed to launch kernel");
  }

results with 8 kernels....

$ bin/host 100000000 8

get_plat_info: Intel(R) FPGA SDK for OpenCL(TM)

Reprogramming device [0] with handle 1

Task:0 complete (3437.884 ms)

Task:1 complete (6875.553 ms)

Task:2 complete (10313.318 ms)

Task:3 complete (13751.042 ms)

Task:4 complete (17188.784 ms)

Task:5 complete (20626.527 ms)

Task:6 complete (24064.256 ms)

Task:7 complete (27501.988 ms)

Time: 27501.995 ms (3437.749 ms / kernel)

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000

and

results with 4 kernels....

$ bin/host 100000000 4

Reprogramming device [0] with handle 1

Task:0 complete (3437.864 ms)

Task:1 complete (6875.626 ms)

Task:2 complete (10313.367 ms)

Task:3 complete (13751.130 ms)

Time: 13751.135 ms (3437.784 ms / kernel)

Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000

I have four boards. In other code, I can launch different numbers of kernels of each of the four boards.

When I do this, I do see the speed up I'm looking for.

e.g.

If I run 1 kernel on each of four boards, it takes time X ms.

But,

When I run 4 kernels on one board, it takes approx. 4 * X ms (as shown above).

Altera_Forum · ‎01-23-2018

I cannot really think of anything else, and I don't see any particular issues in your host code.

Altera_Forum · ‎01-24-2018

What about:

I - Add a clFlush after each clEnqueueTask()

II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if kernels overlap in time.

Altera_Forum · ‎01-24-2018

Maybe a suggestion, not sure if it is an issue, but you could try using CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR which will allow multiple copies of the input data to be generated for each cl_mem object rather than having them all point to the same chunk of allocated memory. Also adding the 'restrict' flag to your global variables in the kernel to let it know that no other pointers to the same data are modifying the data. You will need to do an enqueue read buffer to get the data back out since they aren't mapped.

In my experience, it looks like mapped buffers does the same thing as writeBuffers and readBuffers other than the timing when the kernel reads/writes over PCIe, but it does seem to work well to utilize pinned memory on GPUs.

I am curious if there is an inherit (but unintentional) memory dependency on the global memory.

Altera_Forum · ‎01-24-2018

--- Quote Start ---

What about:

I - Add a clFlush after each clEnqueueTask()

II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if kernels overlap in time.

--- Quote End ---

I - I thought this was a "no-no" for parallel operations. I'll try it.

II - I did "compile" with the profile on the original code (the above it a watered down version of the real objective) and it pretty clearly showed no over lapping.

Altera_Forum · ‎01-29-2018

--- Quote Start ---

What about:

I - Add a clFlush after each clEnqueueTask()

--- Quote End ---

This changed the way the kernels ran (each one ran longer) but the over all time was the same.

i.e. It seems like each kernel was started but it couldn't complete until the previous one completed.

e.g. Without the clFlush()

$ bin/host 100000 4

Reprogramming device [0] with handle 1

Task:0 complete (4.189 ms)

Task:1 complete (8.172 ms)

Task:2 complete (12.137 ms)

Task:3 complete (16.093 ms)

Time: 16.099 ms (4.025 ms / kernel)

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

e.g. w/clFlush()

$ bin/host 100000 4

Reprogramming device [0] with handle 1

Task:0 complete (12.253 ms)

Task:1 complete (12.283 ms)

Task:2 complete (12.286 ms)

Task:3 complete (16.191 ms)

Time: 16.197 ms (4.049 ms / kernel)

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

--- Quote Start ---

II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if kernels overlap in time.

--- Quote End ---

https://alteraforum.com/forum/attachment.php?attachmentid=14752&stc=1

Altera_Forum · ‎01-29-2018

--- Quote Start ---

but you could try using CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR

--- Quote End ---

This did not help.

--- Quote Start ---

adding the 'restrict' flag to your global variables in the kernel

--- Quote End ---

This did not help either.

However, this got me tinkering though. I did try changing the CL_MEM_READ_WRITE to CL_MEM_WRITE_ONLY.

This did work!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

$ bin/host 100000 4

Reprogramming device [0] with handle 1

Task:2 complete (4.529 ms)

Task:3 complete (4.556 ms)

Task:0 complete (4.559 ms)

Task:1 complete (4.561 ms)

Time: 4.563 ms (1.141 ms / kernel)

Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

https://alteraforum.com/forum/attachment.php?attachmentid=14753&stc=1

Thanks SO MUCH to nicolacdnll and fand for giving some new suggestions that FINALLY lead to a solution!

Altera_Forum · ‎01-29-2018

--- Quote Start ---

However, this got me tinkering though. I did try changing the CL_MEM_READ_WRITE to CL_MEM_WRITE_ONLY.

--- Quote End ---

That sounds like the host compiler/runtime was assuming a false dependency between the answer[] buffers, either because the buffers are defined as an array, or because you are using host pointers. I always use CL_MEM_READ_WRITE for the buffers being accessed by parallel kernels, and never had such problem. However, I do not use host pointers.

Altera_Forum · ‎01-29-2018

Thanks HRZ for answering my original post and getting me on a path to a solution!