Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

SLAM on FPGA using Altera OpenCL

Altera_Forum
Honored Contributor II

Hello everyone, 

 

My final-year master's project is about evaluating the performance of a SLAM (simultaneous localisation and mapping) algorithm on an FPGA and comparing it with GPUs and CPUs.

 

The whole algorithm has roughly 10 kernels, and I spent a considerable amount of time writing the kernels in C++ so that I can map and analyse the performance of each kernel individually on each platform (while the rest runs on the host CPU). My individual kernels are ready now, and I can successfully run them on a GPU, a CPU and an FPGA (emulator so far!) with OpenCL on the servers my university has given me access to.

 

  • The FPGA board on the server is a Nallatech 385 with a Stratix V D5 FPGA, which is what I have compiled my kernels for.

  • Running "aocl version" on the server I get (I know it's old and have asked my network administrator to upgrade it; if they don't, I might have to stick with this): aocl 15.0.0.145 (Altera SDK for OpenCL, Version 15.0.0 Build 145, Copyright (C) 2015 Altera Corporation).

  • Querying the platform info in my host code I get: OpenCL 1.0 Altera SDK for OpenCL, Version 15.0.

 

 

I have a few questions and problems, and I would be extremely grateful if someone could help me with them.

 

1) Using the emulator, I timed my kernels for the FPGA. The FPGA's performance is really far behind the GPU and CPU, and since my kernels are NDRange, I realised that the emulator does not execute the work-items in parallel and so cannot give me correct timings. I therefore tried executing on the actual FPGA to get correct timings. However, I get the following error:

 

PLL Error: Read invalid pll setting for 0.000000 MHz. Make sure read access too acl_kernel_clk is functioning and the post-quartus-script succeeded 

PLL: Invalid settings f=0 m=0 n=0 k=0 c0=0 c1=0 r=0 cp=0 div=0 

acl_hal_mmd.c:766:assert failure: Failed to read PLL config

host: acl_hal_mmd.c:766: l_try_device: Assertion `0' failed.

 

2) My supervisor told me to change the kernels from NDRange to single work-item (since the code was originally written for a GPU with OpenCL). Having done that (by declaring for loops in the kernel itself), I got better timing performance in the emulator, but still very far from the GPU and CPU. Unlike the NDRange case, is the single work-item timing I am getting via the emulator correct? What is the best way to avoid hours of kernel compilation time while still getting a reasonable timing estimate?

 

3) I am interested in knowing what my NDRange performance will be on an actual FPGA and how it differs from my single work-item results. Is changing the kernel from NDRange to single work-item the right approach? If not, what are your suggestions? Should I expect better timing performance on the FPGA compared to the GPU and CPU if I apply the optimisation methods in the guides provided?

 

Thanks for your help in advance.
Altera_Forum
Honored Contributor II

1) Don't waste your time with Altera's emulator unless you want to use it for debugging functionality when using the channels extension. For debugging other cases, just run your code on a standard CPU/GPU. Apart from that, Altera's emulator is purely functional, and the run time you get under the emulator has nothing to do with the actual run time on the FPGA. Kernels getting faster or slower in the emulator does not mean they will get faster or slower on the actual FPGA, either.

 

The errors you are getting are most likely caused by an incorrect setup of the board; you should definitely update to the latest versions of Nallatech's BSP, Quartus and AOC, and make sure you are using the same BSP for compilation as the one used on the machine with the FPGA. Nallatech also sometimes releases firmware updates for their boards, which must be applied. Finally, whoever is responsible for setting up the board must read Nallatech's documents and make sure all steps have been done correctly, and then test the board with "aocl diagnose" before running any actual kernels on it.

 

2) As mentioned above, timing results from the emulator mean nothing. Unfortunately, there is no way to get correct timing (or even an estimate of it) without placing and routing the kernel (if only I had a nickel for every time I have told Altera that providing a cycle-accurate emulator should be at the top of their list of priorities). You can, however, use Altera's compilation report and area report (which have been significantly improved in v16.0 and v16.1) to get some idea of how to improve your kernels for better performance; you must fully read Altera's Programming Guide and Best Practices Guide for OpenCL to understand how to interpret these reports.

 

3) Based on my experience, using single work-item is the correct approach in 80% of cases. For kernels containing un-pipelinable loops (e.g. nested loops with variable exit conditions, illustrated below) or kernels whose memory accesses are random or non-consecutive, NDRange will probably work better. Determining which kernel type to use takes a lot of experience; there is no fixed formula for it.
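As an illustration of what I mean by a variable exit condition, here is a made-up sketch (nothing to do with your application) that sums variable-length rows of a sparse matrix. The inner loop's trip count depends on the data, which is exactly the kind of loop the compiler struggles to pipeline in a single work-item kernel:

__kernel void row_sum(__global const int * restrict row_start,
                      __global const int * restrict row_length,
                      __global const float * restrict data,
                      __global float * restrict out,
                      const int n_rows)
{
    for (int i = 0; i < n_rows; i++) {
        float acc = 0.0f;
        // the inner trip count is data-dependent, so there is no fixed
        // exit condition the compiler can build a clean pipeline around
        for (int j = 0; j < row_length[i]; j++) {
            acc += data[row_start[i] + j];
        }
        out[i] = acc;
    }
}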

 

Using Altera's optimization techniques will definitely help, but probably not enough to get comparable results to a proper CPU and GPU; you will likely need to re-design your algorithm for the specific architecture of FPGAs to get comparable performance. 

 

Regarding the CPU and GPU comparison, if you are comparing against a proper CPU and GPU (rather than extremely old or under-powered ones, which a lot of people unfortunately do), you can very likely expect better performance than the CPU, but not the GPU. It is very hard to beat modern GPUs with current FPGAs, due to the extremely low off-chip memory performance of the latter.

 

Since you are doing academic work, you should consider reading the relevant related work; there are a lot of recent papers on OpenCL for FPGAs. Consider searching for "OpenCL Altera" or "OpenCL FPGA" on Google Scholar.

This paper in particular has some examples of the performance difference between single work-item and NDRange kernels at different optimization levels, and also a comparison with a CPU and GPU of the same age as the Stratix V FPGA:

 

http://dl.acm.org/citation.cfm?id=3014951
Altera_Forum
Honored Contributor II

Thanks very much for your great and helpful response. I really appreciate it.

Altera_Forum
Honored Contributor II

Thanks again for your help.  

 

I have now executed one of my kernels on an actual FPGA. However, I do not get any speed-up or slow-down when I change my kernel from NDRange to single work-item (which is not what I was expecting, at least for this simple kernel). The kernel I am executing on the FPGA contains the following:

 

uint2 pixel = (uint2) (get_global_id(0),get_global_id(1)); 

 

depth= ... 

 

and to change it to a single work-item kernel I rewrote it this way (also replacing clEnqueueNDRangeKernel with clEnqueueTask in the host):

 

for (uint pixel_y = 0; pixel_y < 240; pixel_y++) {
    for (uint pixel_x = 0; pixel_x < 320; pixel_x++) {
        depth = ....
    }
}

 

Is this the correct way of changing to a single work-item kernel (I did not find a proper method anywhere)? If not, how should I do it, and what is your suggestion for improving the execution time of this kernel? How about a more complex kernel in my application, like the one below? Should I change it to single work-item (as above) and follow the optimization report, or follow the guide on "how to improve NDRange kernels"?

 

 

const uint2 pos = (uint2) (get_global_id(0), get_global_id(1));
const uint2 size = (uint2) (get_global_size(0), get_global_size(1));

const float center = in[pos.x + size.x * pos.y];

if (center == 0) {
    out[pos.x + size.x * pos.y] = 0;
    return;
}

for (int i = -r; i <= r; ++i) {
    for (int j = -r; j <= r; ++j) {
        const uint2 curPos = (uint2) (clamp(pos.x + i, 0u, size.x - 1), clamp(pos.y + j, 0u, size.y - 1));
        const float curPix = in[curPos.x + curPos.y * size.x];
        if (curPix > 0) {
            sum += factor;
        }
    }
}

out[pos.x + size.x * pos.y] = t / sum;

 

 

The reason I am asking is that, for the moment, I have to stick with the current old version of AOCL. I want to get a feel for what to consider and follow the correct optimization path from the start, without having to wait a few hours for each method and approach to compile before I can see the timing results.

 

Thank you very much.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I have now executed one of my kernels on an actual FPGA. However, I do not get any speed-up or slow-down when I change my kernel from NDRange to single work-item (which is not what I was expecting, at least for this simple kernel). The kernel I am executing on the FPGA contains the following:

--- Quote End ---  

 

This isn't necessarily surprising. If the kernel is simple and straightforward, NDRange and single work-item will perform very similarly. 

 

 

--- Quote Start ---  

Is this the correct way of changing to a single work-item kernel (I did not find a proper method anywhere)? If not, how should I do it, and what is your suggestion for improving the execution time of this kernel?

--- Quote End ---  

 

Wrapping the NDRange kernel in for loops over the work-group dimensions is certainly the correct way to convert NDRange to single work-item; still, an NDRange kernel typically has multiple barriers that are used to ensure local memory consistency. These barriers are not needed in single work-item, and it is very likely that you would be able to combine the regions above and below a barrier into one loop in single work-item. I personally prefer to start from a baseline sequential implementation when creating single work-item kernels, rather than converting an existing NDRange kernel to single work-item and manually merging all the loops. Assuming that the innermost loop is fully pipelined in this case (an initiation interval (II) of one reported by the compilation report), the most obvious optimization would be to partially unroll the innermost loop using #pragma unroll <factor>.
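For example, partial unrolling of your single work-item loop nest would look roughly like the sketch below. The kernel name, arguments and loop body are placeholders (I obviously don't have your real "depth" computation); the point is only where the pragma goes and that the unroll factor is something you tune against the area report:

__kernel void depth_single_wi(__global const float * restrict in,
                              __global float * restrict out)
{
    for (uint pixel_y = 0; pixel_y < 240; pixel_y++) {
        #pragma unroll 4   // process 4 pixels per iteration; tune the factor
        for (uint pixel_x = 0; pixel_x < 320; pixel_x++) {
            const uint idx = pixel_y * 320 + pixel_x;
            out[idx] = in[idx];   // placeholder for the real "depth = ..." computation
        }
    }
}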

 

 

--- Quote Start ---  

How about a more complex kernel in my application, like the one below? Should I change it to single work-item (as above) and follow the optimization report, or follow the guide on "how to improve NDRange kernels"?

--- Quote End ---  

 

I personally start from single work-item, see how far I can get and how well I can achieve full pipelining for the loops in the kernel, and if my attempts were not successful, I switch to NDRange. The compilation report for single work-item helps considerably, while the report for NDRange is pretty much useless. The area report is much more useful for optimizing NDRange kernels, but you are not going to get the necessary info from the report generated by Quartus 14.0.

 

For your specific code, you probably need to use the shift-register-based optimization for floating-point reduction on the "sum += factor" operation. Check Altera's documents for how to implement this optimization. Assuming that this optimization gets you an II of one for both of the for loops, you should then start unrolling the loop on j to achieve the best performance. There will be some parameter tuning involved, which needs timing after full kernel compilation to determine which value is best.
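The shift-register pattern looks roughly like the generic sketch below (adapted to a plain array reduction rather than your actual kernel; the kernel name, arguments and SR_DEPTH value are assumptions, and the right depth depends on the latency the compiler reports for the floating-point add):

#define SR_DEPTH 8   // assumed; should cover the latency of the float add

__kernel void reduce_shift_reg(__global const float * restrict in,
                               __global float * restrict result,
                               const int n)
{
    float shift_reg[SR_DEPTH + 1];

    #pragma unroll
    for (int k = 0; k <= SR_DEPTH; k++)
        shift_reg[k] = 0.0f;

    for (int i = 0; i < n; i++) {
        // add the new value to the oldest partial sum...
        shift_reg[SR_DEPTH] = shift_reg[0] + in[i];
        // ...and shift, so no iteration waits on a float add that is still in flight
        #pragma unroll
        for (int k = 0; k < SR_DEPTH; k++)
            shift_reg[k] = shift_reg[k + 1];
    }

    // collapse the partial sums at the end
    float sum = 0.0f;
    #pragma unroll
    for (int k = 0; k < SR_DEPTH; k++)
        sum += shift_reg[k];

    *result = sum;
}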

 

In contrast, if the compiler reports that some loops cannot be pipelined due to variable exit conditions, then you should probably stick to NDRange and use SIMD or num_compute_units to achieve higher performance. This would be after you apply the basic optimizations like using restrict or reqd_work_group_size.
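On an NDRange kernel, those attributes look roughly like this (the kernel name and body are placeholders, and the work-group size, SIMD width and number of compute units are illustrative values you would tune against your area budget):

__attribute__((reqd_work_group_size(64, 1, 1)))   // fixed work-group size lets the compiler size resources exactly
__attribute__((num_simd_work_items(4)))           // vectorize work-items within a work-group (must divide the work-group size)
__attribute__((num_compute_units(2)))             // replicate the whole kernel pipeline
__kernel void ndrange_example(__global const float * restrict in,
                              __global float * restrict out)
{
    const size_t gid = get_global_id(0);
    out[gid] = in[gid] * 2.0f;   // placeholder body
}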

 

I strongly recommend fully reading and understanding Altera's OpenCL documents before experimenting with the compiler.