Altera_Forum
Honored Contributor I
1,372 Views

Which OpenCL APIs load kernels to FPGA?

Hello, 

 

I'm running a hello world OpenCL example. I notice that there are two API functions: clCreateProgramWithBinary and clEnqueueNDRangeKernel. I tried to measure the overhead of configuring the FPGA through PCIe with this example and found that clCreateProgramWithBinary takes ~2.5 s to run while clEnqueueNDRangeKernel takes ~20 ms. I'm wondering which API actually programs the FPGA and writes the configuration bitstream? Another question: if I enqueue the hello world kernel several times in the same OpenCL program, will the bitstream be written once or every time the kernel runs? Thanks a lot in advance!
Altera_Forum

The clCreateProgramWithBinary call configures the FPGA; you can put a printf before and after it and look for the "Reprogramming device [0] with handle 1" line in stdout, checking its position relative to the printf outputs, to confirm this. If you are measuring time, make sure to call clFinish() before reading the end time, because many OpenCL calls are non-blocking. Unless you call clCreateProgramWithBinary multiple times, the FPGA will not be reconfigured multiple times; you can enqueue the same kernel as many times as you want after the initial configuration.
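To make the measurement concrete, here is a minimal host-side sketch of the timing approach described above. It is only an illustration, assuming a context, device, queue, and a pre-loaded .aocx binary (`binary`/`binary_size`) already exist, and a kernel named "hello_world"; note the clFinish() before the second end-of-timing read:

```c
#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

/* Wall-clock time in seconds. */
static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Assumes context, device, queue, binary, and binary_size are already set up. */
void measure(cl_context context, cl_device_id device, cl_command_queue queue,
             const unsigned char *binary, size_t binary_size) {
    cl_int status;

    double t0 = now();
    /* This call configures the FPGA with the bitstream -- expect seconds. */
    cl_program program = clCreateProgramWithBinary(context, 1, &device,
        &binary_size, &binary, NULL, &status);
    clBuildProgram(program, 1, &device, "", NULL, NULL);
    printf("FPGA configuration: %.3f s\n", now() - t0);

    cl_kernel kernel = clCreateKernel(program, "hello_world", &status);
    size_t global = 1;
    t0 = now();
    /* The enqueue is non-blocking: clFinish() is required before reading
       the end time, or only the enqueue overhead is measured. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(queue);
    printf("Kernel launch + execution: %.3f s\n", now() - t0);
}
```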

Altera_Forum

@HRZ, since clCreateProgramWithBinary() comes earlier than clEnqueueNDRangeKernel(), how does it know how many instances of the kernel to configure on the FPGA?

In addition, if I queue two different NDRange kernels one after the other in the same command queue, with clFinish() after the second one, will they both reside on the FPGA at once, or will the second one be loaded after the first has finished?
Altera_Forum

Which kernels reside on the FPGA is determined by what kernel(s) were put in the ".cl" file in the first place. If you put all of your kernels in one ".cl" file, compile it, and program the FPGA with that binary, all of those kernels, whether or not they are used during execution, will physically reside on the FPGA at the same time (I don't mean running; I mean the circuit is programmed onto the FPGA). The order of execution, the queues, or basically anything you put in the host code will not affect which kernels physically reside on the FPGA; only the binary file being loaded does. Remember, the host and the kernel are compiled separately; the kernel compiler has no idea what is happening in the host code.

 

Anyway, there will only ever be one instance (physical circuit) of each kernel on the FPGA, unless you manually replicate the code in the same ".cl" file under a different kernel name. Even with num_compute_units you will still have one "instance" of the kernel but multiple copies of the pipeline, which are automatically handled by the run-time scheduler (no user control).
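As an illustration of the distinction above, here is a hypothetical ".cl" file (kernel names are made up) showing manual replication versus num_compute_units. Everything in this file ends up as a circuit on the FPGA once the binary is loaded, regardless of what the host enqueues:

```c
// kernels.cl -- hypothetical example. All kernels below occupy the FPGA
// simultaneously because they are compiled into the same binary.

__kernel void vec_add_a(__global const float *x, __global const float *y,
                        __global float *out) {
    int i = get_global_id(0);
    out[i] = x[i] + y[i];
}

// Manual replication: a second, independently named copy -> a second circuit
// that the host can enqueue explicitly and concurrently.
__kernel void vec_add_b(__global const float *x, __global const float *y,
                        __global float *out) {
    int i = get_global_id(0);
    out[i] = x[i] + y[i];
}

// With num_compute_units there is still ONE kernel name; the run-time
// scheduler distributes work-groups over the replicated pipelines, with
// no user control over the distribution.
__attribute__((num_compute_units(2)))
__kernel void vec_add_cu(__global const float *x, __global const float *y,
                         __global float *out) {
    int i = get_global_id(0);
    out[i] = x[i] + y[i];
}
```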

 

If you have two different kernels in the same queue, even though I haven't tried it myself, I am pretty sure they would run serially (even without a clFinish in between), even though both kernels have their own separate circuit on the FPGA. This is because the OpenCL run-time has to guarantee global memory consistency for kernels in the same queue (but not between kernels in different queues). If you want to run two kernels in parallel, you have to run them from two different queues and either use OpenCL events to synchronize them or call clFinish on both queues. I have done this and it works; it is also the standard procedure when you are connecting two kernels to each other via channels.
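A minimal sketch of the two-queue pattern, assuming the context, device, and program are already set up and that the binary contains two argument-less kernels hypothetically named "producer" and "consumer" (e.g. connected via a channel):

```c
#include <CL/cl.h>

/* Launch two kernels concurrently from separate in-order queues. */
void run_concurrent(cl_context context, cl_device_id device, cl_program program) {
    cl_int status;
    cl_command_queue q0 = clCreateCommandQueue(context, device, 0, &status);
    cl_command_queue q1 = clCreateCommandQueue(context, device, 0, &status);
    cl_kernel producer = clCreateKernel(program, "producer", &status);
    cl_kernel consumer = clCreateKernel(program, "consumer", &status);

    size_t global = 1024;
    /* Enqueuing both in the same queue would serialize them; separate
       queues let the two circuits run at the same time. */
    clEnqueueNDRangeKernel(q0, producer, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, consumer, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Synchronize the host with both queues before using any results. */
    clFinish(q0);
    clFinish(q1);

    clReleaseKernel(producer);
    clReleaseKernel(consumer);
    clReleaseCommandQueue(q0);
    clReleaseCommandQueue(q1);
}
```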
Altera_Forum

Thanks. 

It seems there is some basic behavior I don't understand: at compile time, the compiler does not know how many instances of a kernel I am going to launch, since that is declared only in the call to clEnqueueNDRangeKernel() with the global and local sizes. The only clue the compiler has is the __attribute__((reqd_work_group_size())) kernel attribute, and that is not mandatory. Therefore I assumed the actual wiring takes place when clEnqueueNDRangeKernel() is called.

 

I am working on an image processing project and I am concerned about how long the wiring/launching takes (fractions of a second?) and whether the wiring will occur again and again if I write a loop like this:

for(i = ....) {
    // CPU writes image to InputBuffer
    clSetKernelArg(...&InputBuffer);
    clSetKernelArg(...&OutputBuffer);
    clEnqueueNDRangeKernel(... global size, local size, ...);
    clFinish();
    clEnqueueMapBuffer(...OutputBuffer);
    // CPU reads processed image
}

 

In such a scenario, when will the actual wiring occur, and will it occur only once?

Is there a better program flow structure to meet my needs?

 

thanks.
Altera_Forum

The "wiring" certainly takes place in clCreateProgramWithBinary(); it takes around a few hundred milliseconds on Stratix V and a couple of hundred milliseconds on Arria 10. In your case it would be outside of the loop and will only occur once. The run-time global and local sizes have no effect on the circuit or wiring on the FPGA; they only affect "scheduling", which is software-based and happens at run-time.

 

The important distinction here is that threads from the same work-group do NOT run in parallel on the FPGA unless you use SIMD (which requires __attribute__((reqd_work_group_size()))). The threads are instead pipelined, depending on the scheduler's behavior and possible local or global memory access contention. This is the major difference between the way things work on an Altera FPGA with the OpenCL SDK and on a standard GPU. Because of this, the exact same circuit can be used regardless of your global or local size. The major effect of supplying __attribute__((reqd_work_group_size())) is that it allows the compiler to optimize area usage and memory accesses for that specific work-group size, rather than assuming a worst-case scenario that might not occur at run-time and would leave resources underutilized.
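A small hypothetical kernel illustrating the attributes mentioned above (the kernel name and sizes are made up for the example):

```c
// Fixing the work-group size at compile time lets the compiler size local
// memory and the work-group logic exactly, and is required for SIMD.
__attribute__((reqd_work_group_size(64, 1, 1)))
// Issue 4 work-items per clock cycle; the SIMD width must evenly divide
// the required work-group size (64 / 4 here).
__attribute__((num_simd_work_items(4)))
__kernel void scale(__global const float *in, __global float *out, float k) {
    int i = get_global_id(0);
    out[i] = k * in[i];
}
```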

 

In your code, you can safely remove the clFinish call: even though clEnqueueNDRangeKernel is non-blocking, clEnqueueMapBuffer will always start after clEnqueueNDRangeKernel completes, and global memory consistency is guaranteed at the end of kernel execution in OpenCL. Just make sure to use a blocking clEnqueueMapBuffer if you are going to use the data on the host right away.
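One iteration of the loop from the question could be reworked as suggested, with no clFinish() and a blocking map. This is only a sketch: the queue, kernel, buffer objects, sizes, and the two-argument kernel signature are assumptions, not code from the thread:

```c
#include <string.h>
#include <CL/cl.h>

/* Process one frame: write input, run kernel, read output. */
void process_frame(cl_command_queue queue, cl_kernel kernel,
                   cl_mem input_buf, cl_mem output_buf,
                   const unsigned char *src, unsigned char *dst, size_t bytes,
                   size_t global, size_t local) {
    cl_int status;

    /* CPU writes the image into the input buffer (blocking write). */
    clEnqueueWriteBuffer(queue, input_buf, CL_TRUE, 0, bytes, src, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &output_buf);

    /* Non-blocking, but in-order queue semantics guarantee the map below
       starts only after the kernel finishes -- no clFinish() needed. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

    /* Blocking map (CL_TRUE): safe to read the result immediately. */
    unsigned char *mapped = (unsigned char *)clEnqueueMapBuffer(queue, output_buf,
        CL_TRUE, CL_MAP_READ, 0, bytes, 0, NULL, NULL, &status);
    memcpy(dst, mapped, bytes);
    clEnqueueUnmapMemObject(queue, output_buf, mapped, 0, NULL, NULL);
}
```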
Altera_Forum

OK, thanks. I am starting to understand. 

 

Is there a place to read about the compiler/optimizer policy? I.e., how many threads of each kernel are scheduled in each work-group when the compiler does or does not know reqd_work_group_size(), whether it adds extra work-groups of each kernel when there are enough resources on the FPGA even without that information, and so on.

 

Is there a way to know the resulting internal structure after compilation, or some summary of it?

 

Where can I find an explanation of how to read the resulting compiler log?
Altera_Forum

You should read the Intel FPGA SDK for OpenCL Programming Guide and the Intel FPGA SDK for OpenCL Best Practices Guide. Basic information about how the hardware is created, along with basic optimization techniques and their effects, is detailed there. Those guides also explain the reports and how to interpret them. At the end of the day, you will have to run a lot of experiments yourself to better understand the effect of different optimizations and code modifications.

Altera_Forum

Thank you.
