Can single work-item kernels run in parallel on the same device (i.e., on the same board)?
I've been trying to get a very simple example of task parallelism working but have not been able to get more than one kernel to run at the same time on the same board. The kernel computes part of a summing equation - let's say it sums numbers from "start" to "end". The .cl file contains multiple identical kernels that do this - let's say there are 12 of them. Single work-item kernels have been used to ensure that the computation can be pipelined. The host code creates multiple kernels and multiple contexts in an effort to run more than one in parallel. After trying many, many things, I've yet to get them to run in parallel. Initially I used just the time profile to see how much time they take to run. Each kernel takes about the same time (e.g., 25 ms); if 12 kernels are started, the total time is 300 ms. There are four identical boards in the system. If 12 kernels are used and three run on each of the four boards, then each one still takes 25 ms, but the boards run them in parallel, so the total time is only 75 ms. What else is needed to get the kernels to run in parallel on the same board? I've turned on profiling and can see that each one is started - one after the other. Everything seems to work (i.e., the correct answer is produced) but the kernels don't run at the same time on a single board. Do I need to use NDRange kernels? Any suggestion would be greatly appreciated! (Should this be so hard?!?)
Any type of kernel can run in parallel with another, as long as each is invoked in a separate queue and no event is used to forcibly serialize them; the key point here is that they must run in different queues, and you should not force the host to wait for each kernel execution separately using commands like clFlush() or clFinish(), or by waiting on events. You can, and probably should, wait on an event associated with each kernel invocation, or call clFinish() on every single queue you have, after invoking all the kernels from the host, to make sure all kernels have finished execution before reading the data back from the device.
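The launch pattern described above can be sketched roughly as follows. This is a minimal host-side outline, not the poster's actual code: it assumes `n` kernels `kernel[0..n-1]` have already been created with their arguments set, and `MAX_KERNELS`, `q`, and `ev` are illustrative names.

```c
// One queue per kernel: kernels enqueued to the same in-order queue
// execute serially, so each task gets its own queue.
cl_command_queue q[MAX_KERNELS];   // MAX_KERNELS: assumed compile-time bound
cl_event         ev[MAX_KERNELS];

for (int i = 0; i < n; i++) {
    q[i] = clCreateCommandQueue(context, device, 0, &status);
    // Enqueue one task per queue; do NOT call clFinish() inside this loop,
    // or the host itself will serialize the kernels.
    status = clEnqueueTask(q[i], kernel[i], 0, NULL, &ev[i]);
}

// Only after everything is enqueued, wait for all kernels at once.
clWaitForEvents(n, ev);            // or: clFinish(q[i]) for each queue
```

The point is that the wait happens once, after all the enqueues, rather than inside the launch loop.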
Another way this can be accomplished more efficiently is to use replicated autorun kernels; more details about this are available in the "Intel FPGA SDK for OpenCL Programming Guide", Section 11.4. Finally, I need to emphasize that since external memory bandwidth is shared between the kernels running in parallel, you should not expect to get linear speed-up by using multiple parallel kernels. In fact, if one of your kernels is memory-bound on its own, you will not see any speed-up at all by replicating it. P.S. I have done this multiple times, and it certainly works.
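As a rough illustration of the replicated-autorun approach: an autorun kernel takes no arguments and communicates through channels, and `num_compute_units` replicates it. The attribute and channel syntax below is my recollection of the Intel FPGA SDK conventions - verify it against the Programming Guide section cited above before relying on it.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define NUM_COPIES 4
channel double in_ch[NUM_COPIES];
channel double out_ch[NUM_COPIES];

// Replicated autorun kernel: starts by itself, no host launch needed.
__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__attribute__((num_compute_units(NUM_COPIES)))
__kernel void worker(void) {
    int me = get_compute_id(0);              // which replica this copy is
    while (1) {
        double x = read_channel_intel(in_ch[me]);
        write_channel_intel(out_ch[me], x * 2.0);  // placeholder computation
    }
}
```

The host then only launches small producer/consumer kernels that feed and drain the channels; the replicated workers run continuously in parallel.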
I've tried everything I can think of but I still cannot get my kernel tasks to run in parallel.
The kernels are only doing a simple sum of numbers from "start" to "finish". Other than their result, no other memory is being used, so sharing memory shouldn't be the problem preventing them from running together. Each kernel is enqueued in its own queue. No waiting via clFinish() is being done. I get an event from each kernel and use it to determine when that kernel is complete and print out the time. I can see from that output that the kernels are not running in parallel. Since they could run concurrently but are not, there must be something shared that is preventing them - I don't know what, though. Can anyone take a look at the kernel code and C code and tell me what is keeping them from running concurrently? Here's the kernel code:
__kernel void sumN1(const double start,
const double stop,
const double step,
__global double *z) {
//// get index of the work item
//int index = get_global_id(0);
//init result
double sum = 0.0;
for (double i=start; i<=stop; i+=step) {
sum += i;
}
z[0] = sum;
} // sumN1
... eight identical kernels in total (identical except for the kernel name) - in the same .cl file
__kernel void sumN8(const double start,
const double stop,
const double step,
__global double *z) {
//// get index of the work item
//int index = get_global_id(0);
//init result
double sum = 0.0;
for (double i=start; i<=stop; i+=step) {
sum += i;
}
z[0] = sum;
} // sumN8
Here's "most" of the host C code:
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//
// main - sumN
//
// - sumN num
//
int main(int argc, char *argv[]) {
cl_device_id accDeviceID; //device ID of first accelerator device
cl_platform_id accPlatformID; //platform ID of platform w/first accelerator device
cl_mem answer[12]; //buffers to collect results (sized for up to 12 kernels)
std::string binary_file; //name of OpenCL program
cl_context context; //OpenCL context for this application
cl_device_type deviceType; //type of a device (CPU, GPU, ACCELERATOR)
cl_event event; //wait event synchronization handle used by OpenCL API
bool foundintel = false; //indicates that Intel FPGA card was found
char info_text[256]; //value of some returned text information
bool isACC = false; //flag to remember that an accelerator has been found
cl_kernel kernel[12]; //OpenCL kernels for this application
int kerns = 0; //number of kernels to use
cl_uint numDevices; //number of OpenCL computing devices for a platform
cl_uint numPlatforms; //number of OpenCL platforms (typically 1)
double number = 0.0; //number to compute sum to
cl_program program; //OpenCL program for this application
cl_command_queue queue[12]; //OpenCL command queues, one per kernel
double result[12] = { 0.0 }; //results of the summation computations
size_t size; //size of returned information from OpenCL API
double start = 1.0; //start of summing
cl_int status; //return code used by OpenCL API
double step = 1.0; //step of summing
double stop = 1.0; //end of summing
cl_int task_done; //info from event query
cl_event task_event[12]; //events from tasks
.... some code omitted here that handled input args and platform, device setup
////////////////////////////////////////
// OpenCL context
context = clCreateContext(NULL, 1, &accDeviceID, NULL, NULL, &status);
exitOnFail(status, "create context");
////////////////////////////////////////
// OpenCL command queue
for ( int kz=0; kz<kerns; kz++) {
queue[kz] = clCreateCommandQueue(context, accDeviceID, 0, &status);
exitOnFail(status, "create command queue");
}
////////////////////////////////////////
// Create the program for all devices. Use the first device as the
// representative device (assuming all devices are of the same type).
binary_file = getBoardBinaryFile("sumN", accDeviceID);
program = createProgramFromBinary(context, binary_file.c_str(), &accDeviceID, 1);
////////////////////////////////////////
// Build the program that was just created.
status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
exitOnFail(status, "Failed to build program");
const double start_time = getCurrentTimestamp();
////////////////////////////////////////
// create the kernel
// Create the kernel - name passed in here must match kernel name in the
// original CL file, that was compiled into an AOCX file using the AOC tool
char kernel_name[16] = "sumNx"; // Kernel name, as defined in the CL file
for ( int kz=0; kz<kerns; kz++) {
sprintf(kernel_name, "sumN%d", kz+1); // generate the Kernel name, as defined in the CL file
kernel[kz] = clCreateKernel(program, kernel_name, &status);
exitOnFail(status, "Failed to create kernel");
// Set the kernel argument (argument 0)
status = clSetKernelArg(kernel[kz], 0, sizeof(cl_double), &start);
exitOnFail(status, "Failed to set kernel arg 0");
// Set the kernel argument (argument 1)
status = clSetKernelArg(kernel[kz], 1, sizeof(cl_double), &stop);
exitOnFail(status, "Failed to set kernel arg 1");
// Set the kernel argument (argument 2)
status = clSetKernelArg(kernel[kz], 2, sizeof(cl_double), &step);
exitOnFail(status, "Failed to set kernel arg 2");
// last OpenCL argument: memory buffer object for result
answer[kz] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(cl_double), &result[kz], &status);
exitOnFail(status, "create buffer for answer");
// set argument 3
status = clSetKernelArg(kernel[kz], 3, sizeof(cl_mem), &answer[kz]);
exitOnFail(status, "set kernel argument answer");
}
// Launch the kernels
for ( int kz=0; kz<kerns; kz++) {
status = clEnqueueTask(queue[kz], kernel[kz], 0, NULL, &task_event[kz]);
exitOnFail(status, "Failed to launch kernel");
}
int total_done = 0;
int its_done[12] = { 0 };
while (total_done < kerns) {
for ( int kz=0; kz<kerns; kz++) {
if ( its_done[kz] == 0 ) {
status = clGetEventInfo(task_event[kz], CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &task_done, NULL);
if (task_done == CL_COMPLETE) {
printf("Task:%d complete (%0.3f ms)\n", kz, (getCurrentTimestamp() - start_time) * 1.0e3 );
total_done++;
its_done[kz] = 1;
} else {
//printf("Task:%d incomplete\n", kz);
}
} // if kz task not done
} // foreach task event
} // wait for kernels to complete
//// Wait for command queue to complete pending events
//for ( int kz=0; kz<kerns; kz++) {
// status = clFinish(queue[kz]);
// exitOnFail(status, "Failed to finish");
//}
const double end_time = getCurrentTimestamp();
// Wall-clock time taken.
printf("\nTime: %0.3f ms (%0.3f ms / kernel)\n", (end_time - start_time) * 1e3, (end_time - start_time) * 1e3 / (double)kerns );
for ( int kz=0; kz<kerns; kz++) {
printf("Sum 0-%f (step %f) = %f\n", number, step, result[kz]);
}
// Free the resources allocated
cleanup();
if(kernel) {
for ( int kz=0; kz<kerns; kz++) {
clReleaseKernel(kernel[kz]);
}
}
if(program) {
clReleaseProgram(program);
}
if(queue) {
for ( int kz=0; kz<kerns; kz++) {
clReleaseCommandQueue(queue[kz]);
}
}
if(context) {
clReleaseContext(context);
}
exit(0);
} // main
This is the output from a run using four kernels. There is no parallelism. I've tried using the profiler and it clearly shows that each kernel runs, one after the other.

$ bin/host 100000 4
Reprogramming device [0] with handle 1
Task:0 complete (3.600 ms)
Task:1 complete (7.096 ms)
Task:2 complete (10.583 ms)
Task:3 complete (14.066 ms)
Time: 14.069 ms (3.517 ms / kernel)
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Try putting "start_time = getCurrentTimestamp();" before the kernel invocation loop. I have a feeling your kernel run time is so short that the total time is dominated by the clCreateBuffer() call. Note that on most hardware, using a host pointer will result in the OpenCL runtime actually allocating and transferring the whole buffer to device memory. Also consider increasing your input size so that your total run time is at least a few seconds.
I did try the longer runs. The result is the same (i.e. no apparent parallelism).
Here's a snapshot w/8 kernels.

$ bin/host 100000000 8
Task:0 complete (3437.999 ms)
Task:1 complete (6875.815 ms)
Task:2 complete (10313.553 ms)
Task:3 complete (13751.281 ms)
Task:4 complete (17189.009 ms)
Task:5 complete (20626.756 ms)
Task:6 complete (24064.509 ms)
Task:7 complete (27502.250 ms)
Time: 27502.254 ms (3437.782 ms / kernel)
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
And what about moving the "start_time = getCurrentTimestamp()" call?
--- Quote Start --- And what about moving the "start_time = getCurrentTimestamp()" call? --- Quote End --- Same result. Updated code snippet:
const double start_time = getCurrentTimestamp();
// Launch the kernels
for ( int kz=0; kz<kerns; kz++) {
status = clEnqueueTask(queue[kz], kernel[kz], 0, NULL, &task_event[kz]);
exitOnFail(status, "Failed to launch kernel");
}
Results with 8 kernels:

$ bin/host 100000000 8
get_plat_info: Intel(R) FPGA SDK for OpenCL(TM)
Reprogramming device [0] with handle 1
Task:0 complete (3437.884 ms)
Task:1 complete (6875.553 ms)
Task:2 complete (10313.318 ms)
Task:3 complete (13751.042 ms)
Task:4 complete (17188.784 ms)
Task:5 complete (20626.527 ms)
Task:6 complete (24064.256 ms)
Task:7 complete (27501.988 ms)
Time: 27501.995 ms (3437.749 ms / kernel)
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000

and results with 4 kernels:

$ bin/host 100000000 4
Reprogramming device [0] with handle 1
Task:0 complete (3437.864 ms)
Task:1 complete (6875.626 ms)
Task:2 complete (10313.367 ms)
Task:3 complete (13751.130 ms)
Time: 13751.135 ms (3437.784 ms / kernel)
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000
Sum 0-100000000.000000 (step 1.000000) = 5000000050000000.000000

I have four boards. In other code, I can launch different numbers of kernels on each of the four boards. When I do this, I do see the speed-up I'm looking for. E.g., if I run 1 kernel on each of four boards, it takes X ms. But when I run 4 kernels on one board, it takes approx. 4 * X ms (as shown above).
I cannot really think of anything else, and I don't see any particular issues in your host code.
What about:

I - Add a clFlush() after each clEnqueueTask()
II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if the kernels overlap in time.
Maybe a suggestion - not sure if it is the issue, but you could try using CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR, which will create a separate copy of the input data for each cl_mem object rather than having them all point to the same chunk of allocated host memory. Also try adding the 'restrict' qualifier to the global pointers in your kernels, to tell the compiler that no other pointers modify the same data. You will need to do an enqueue-read-buffer to get the data back out, since the buffers aren't mapped.
In my experience, mapped buffers do the same thing as writeBuffer/readBuffer calls, other than the timing of when the kernel reads/writes over PCIe, but they do seem to work well for utilizing pinned memory on GPUs. I am curious whether there is an inherent (but unintentional) memory dependency on the global memory.
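Concretely, the two changes suggested above would look something like this. This is a sketch against the poster's variable names (`answer`, `result`, `queue`, with per-kernel `[kz]` indexing assumed as in the host loop), not tested code:

```c
// Copy the host data into a device-side buffer instead of aliasing host memory:
answer[kz] = clCreateBuffer(context,
                            CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_double), &result[kz], &status);

// ... enqueue the kernels ...

// Since the buffer no longer wraps host memory, read the result back explicitly:
status = clEnqueueReadBuffer(queue[kz], answer[kz], CL_TRUE, 0,
                             sizeof(cl_double), &result[kz], 0, NULL, NULL);
```

and in the kernel, mark the output pointer restrict so the compiler knows nothing else aliases it:

```c
__kernel void sumN1(const double start, const double stop, const double step,
                    __global double * restrict z) { /* ... as before ... */ }
```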
--- Quote Start --- What about: I - Add a clFlush after each clEnqueueTask() II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if kernels overlap in time. --- Quote End --- I - I thought this was a "no-no" for parallel operations, but I'll try it. II - I did "compile" with the profiler on the original code (the above is a watered-down version of the real objective), and it pretty clearly showed no overlapping.
--- Quote Start --- What about: I - Add a clFlush after each clEnqueueTask() --- Quote End ---

This changed the way the kernels ran (each one ran longer) but the overall time was the same. I.e., it seems like each kernel was started but couldn't complete until the previous one completed.

e.g. without the clFlush():

$ bin/host 100000 4
Reprogramming device [0] with handle 1
Task:0 complete (4.189 ms)
Task:1 complete (8.172 ms)
Task:2 complete (12.137 ms)
Task:3 complete (16.093 ms)
Time: 16.099 ms (4.025 ms / kernel)
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

e.g. w/clFlush():

$ bin/host 100000 4
Reprogramming device [0] with handle 1
Task:0 complete (12.253 ms)
Task:1 complete (12.283 ms)
Task:2 complete (12.286 ms)
Task:3 complete (16.191 ms)
Time: 16.197 ms (4.049 ms / kernel)
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

--- Quote Start --- II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if kernels overlap in time. --- Quote End ---

https://alteraforum.com/forum/attachment.php?attachmentid=14752&stc=1
--- Quote Start --- but you could try using CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR --- Quote End ---

This did not help.

--- Quote Start --- adding the 'restrict' flag to your global variables in the kernel --- Quote End ---

This did not help either. However, it got me tinkering: I tried changing CL_MEM_READ_WRITE to CL_MEM_WRITE_ONLY, and this did work!

$ bin/host 100000 4
Reprogramming device [0] with handle 1
Task:2 complete (4.529 ms)
Task:3 complete (4.556 ms)
Task:0 complete (4.559 ms)
Task:1 complete (4.561 ms)
Time: 4.563 ms (1.141 ms / kernel)
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

https://alteraforum.com/forum/attachment.php?attachmentid=14753&stc=1

Thanks SO MUCH to nicolacdnll and fand for giving some new suggestions that FINALLY led to a solution!
--- Quote Start --- However, this got me tinkering though. I did try changing the CL_MEM_READ_WRITE to CL_MEM_WRITE_ONLY. --- Quote End --- That sounds like the host compiler/runtime was assuming a false dependency between the answer[] buffers, either because the buffers are defined as an array, or because you are using host pointers. I always use CL_MEM_READ_WRITE for buffers accessed by parallel kernels and have never had such a problem; however, I do not use host pointers.
Thanks HRZ for answering my original post and getting me on a path to a solution!