
Multiple kernels in a single OpenCL file

Altera_Forum
Honored Contributor II

I'm concerned about area usage when multiple kernels are used. If I understand correctly, when an OpenCL source file contains multiple kernel functions, all of the kernels are instantiated simultaneously on the FPGA, even if only one of them is used at a time. Is this correct?

Altera_Forum
Honored Contributor II

Yes, this is correct. Every kernel function in the compiled binary is instantiated as its own dedicated hardware on the FPGA, whether or not the host ever launches it.
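For illustration, take a minimal .cl file with two made-up kernels; compiling it with the offline compiler generates hardware for both, even if the host only ever enqueues one of them:

```c
// Hypothetical two-kernel .cl file: the offline compiler builds
// dedicated pipelines for BOTH kernels, so both consume FPGA area,
// regardless of which ones the host actually enqueues.
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c) {
    int i = get_global_id(0);
    c[i] = a[i] + b[i];   // only this kernel may ever run...
}

__kernel void vec_scale(__global float *a, float s) {
    int i = get_global_id(0);
    a[i] *= s;            // ...but this one still occupies logic
}
```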

Altera_Forum
Honored Contributor II

Thanks okebz. So, suppose I have hundreds of OpenCL kernels, each of which may be used at run time. Since any of them may be called, all of them need to be instantiated on the FPGA hardware, even though only some of them will actually run. This sounds like terribly inefficient application design, but coming from a GPU computing background, I can say that these types of applications do exist in CPU-GPU hybrid HPC applications. What alternative approaches would be recommended?

Altera_Forum
Honored Contributor II

This is an interesting question you pose. I'm not too familiar with GPU computing, but I feel these applications are not well suited to FPGA-based HPC. If you have hundreds of kernels and only some of them may actually be used, it sounds like you essentially have a library of kernels: depending on the specific application you are trying to run, you would run specific kernels, similar to choosing specific functions and connecting them up. (Correct me if I'm wrong. If you have an example problem or application, it would help me see what type of application you are targeting or what exactly you are trying to do.) This is a generic, modular approach that favors portability over performance. To get as much performance as possible out of an HPC system, I would instead fine-tune and create kernels that best accelerate your specific application.

 

On the other hand, to answer your question: if you have hundreds of OpenCL kernels that may be used at run time, and only some of them will actually be used, I would put all of the ones that need to be used in one OpenCL file and disregard the rest (although the required kernels may still not fit on the device). Another approach is to group the kernels that are commonly used together so that each group fits, and compile a set of binaries that together cover all the different kernels. This way, at run time, you can reprogram the FPGA so that the kernel of interest is loaded. Note, however, that the overhead of reprogramming an FPGA is on the order of seconds, so if you are constantly reprogramming it, this becomes very inefficient. Ideally you would target the FPGA at accelerating one specific application rather than carrying a library of kernels that you might use depending on the application in question.
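As a concrete sketch of that second approach: the host keeps several precompiled binaries and loads whichever group it needs, and creating a program from a new binary is what triggers the (seconds-long) device reconfiguration. The sketch below uses only the standard OpenCL host API; the file and kernel names are hypothetical and error handling is reduced to asserts:

```c
/* Sketch: switching between precompiled FPGA binaries at run time.
 * File and kernel names are hypothetical. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Read a whole binary file into memory; caller frees the buffer. */
static unsigned char *load_file(const char *path, size_t *size) {
    FILE *f = fopen(path, "rb");
    assert(f);
    fseek(f, 0, SEEK_END);
    *size = (size_t)ftell(f);
    rewind(f);
    unsigned char *buf = malloc(*size);
    assert(buf && fread(buf, 1, *size, f) == *size);
    fclose(f);
    return buf;
}

/* Create a program from an offline-compiled binary (e.g. an .aocx) and
 * fetch one kernel from it. Loading a different binary reprograms the
 * device, which costs on the order of seconds, so switch groups rarely.
 * (The program object is deliberately not released in this sketch.) */
static cl_kernel get_kernel(cl_context ctx, cl_device_id dev,
                            const char *binary_path, const char *name) {
    size_t size;
    unsigned char *bin = load_file(binary_path, &size);
    cl_int status, bin_status;
    cl_program prog = clCreateProgramWithBinary(
        ctx, 1, &dev, &size, (const unsigned char **)&bin,
        &bin_status, &status);
    assert(status == CL_SUCCESS && bin_status == CL_SUCCESS);
    status = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    assert(status == CL_SUCCESS);
    free(bin);
    return clCreateKernel(prog, name, &status);
}
```

For example, `get_kernel(ctx, dev, "group_a.aocx", "precipitation")` would pull one kernel out of one group; asking for a kernel that lives in a different binary means paying the reconfiguration cost again, which is why grouping kernels by how often they are used together matters.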
Altera_Forum
Honored Contributor II

Thanks again okebz. One example area is climate simulation, where an application is typically packaged with dozens of independent functions that each simulate some piece of climate physics, such as precipitation. On a GPU, you implement each function as a GPU kernel, and depending on the application's input data and configuration, a subset of the available physics functions is used at run time. There is also a large number of such physics functions. I'm not a climate scientist, but this is what I see in some of the well-known simulation codes in this field.

 

Furthermore, many large simulation codes with a long history have such a collection of routines in order to cover a wide range of simulation conditions. Even if most of them are in fact used at some point, I'd imagine they are usually not used at the same time, so at any point in the execution only a few of the routines would be exercised, occupying only a fraction of the logic.

 

I'm new to FPGAs, and I suspect this was not a big issue previously, since getting a large number of functions to run on FPGAs was itself a huge challenge before OpenCL.
Altera_Forum
Honored Contributor II

You're very welcome. Most applications I have seen with FPGAs target one very specific workload. So for simulation, you would target the specific simulation you want accelerated, and the questions become: where is the bottleneck in the simulation, can an FPGA be a good fit to accelerate it, and if so, how do you best deconstruct the problem so that it takes advantage of the properties and power of an FPGA? You would essentially choose the functions required by the current simulation, optimize and tune them, and target the device that way.

 

Of course, if you start scaling up to larger HPC systems, now you're talking about hundreds of FPGAs, so you can load specific FPGAs with certain kernels, and so on. Another thing to keep in mind with many kernels that implement physics functions (I'm not sure how complicated each kernel is) is the data movement from one kernel to the next, which can become an issue; the best-practices guide recommends combining small kernels into a larger one. Data can also be kept on-chip between kernels, as in the sketch below.
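Here is a minimal kernel-side sketch of that on-chip data movement, assuming the Altera channels extension is available on your board; the kernel and channel names are made up:

```c
// Two kernels connected by an on-chip channel instead of a round trip
// through global memory (Altera OpenCL channels extension).
#pragma OPENCL EXTENSION cl_altera_channels : enable

channel float stage_link;

__kernel void producer(__global const float *restrict in, int n) {
    for (int i = 0; i < n; i++)
        write_channel_altera(stage_link, in[i] * 2.0f);  // stage 1 work
}

__kernel void consumer(__global float *restrict out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = read_channel_altera(stage_link) + 1.0f; // stage 2 work
}
```

Keeping the intermediate values in a channel avoids the global-memory traffic that otherwise dominates when many small kernels are chained together.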

 

Yes, you are correct. FPGAs are usually targeted at a set application, or rather, have a fixed logic configuration. There are cases where you can do partial reconfiguration, where a region of the FPGA can have its logic reconfigured on the fly to what you need, or where a bitstream is rapidly assembled with the required function. But to my knowledge, Altera doesn't support partial reconfiguration with OpenCL.
Altera_Forum
Honored Contributor II

Thanks so much for your helpful comments.
