About branch optimization

Altera_Forum · ‎03-13-2014

Hi,

I'm developing OpenCL kernels for both an Altera FPGA and an Intel GPU.

According to the Intel GPU optimization guide, a branch should be written as follows to improve performance:

bool comparison;
comparison = /* calculation of the condition */;
if (comparison) {
    /* do something */
}

However, when I run this optimized kernel on the FPGA, the performance gets worse.

I understand that a GPU optimization technique is not expected to help on an FPGA, but I would like to know how this change affects FPGA performance.
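To make the comparison concrete, here is a minimal plain-C sketch of the two forms side by side. It is not OpenCL and not my actual kernel: the function names clamp_direct and clamp_hoisted, the clamping operation, and the loop standing in for work-items are all assumptions chosen for illustration. It only shows that the two forms compute the same result; it says nothing about how a GPU or FPGA compiler maps either one to hardware.

```c
#include <stdbool.h>
#include <stddef.h>

/* Direct form: the condition is written inside the if.
   The loop index i stands in for a work-item id (assumption). */
static void clamp_direct(float *data, size_t n, float limit) {
    for (size_t i = 0; i < n; i++) {
        if (data[i] > limit) {
            data[i] = limit;
        }
    }
}

/* Hoisted form: the condition is first stored in a bool,
   then the bool is tested, as in the pattern quoted above. */
static void clamp_hoisted(float *data, size_t n, float limit) {
    for (size_t i = 0; i < n; i++) {
        bool comparison;
        comparison = data[i] > limit;
        if (comparison) {
            data[i] = limit;
        }
    }
}
```

Both functions leave the array in the same state, so any performance difference between the two forms comes from how the compiler handles them, not from a change in what is computed.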

Any help will be appreciated!

Altera_Forum · ‎03-13-2014

Would you please provide the original code?

Altera_Forum · ‎03-13-2014

Thanks for your reply.

The original code would be very long, since I replaced every "if" with the "optimal code" pattern. And if this turns out to be a problem specific to my code, can I assume that, in theory, this kind of change should not affect the FPGA's performance?

Altera_Forum · ‎03-13-2014

Without seeing the kernel I can only think of a few reasons for the performance degradation, but they involve very low-level details that I don't think you'll benefit from. So I'll just say that you do not need this particular optimization when targeting the Altera device.

I'm not sure why that is an optimization on a GPU, but I'm guessing it is related to branch prediction. GPU optimizations are often unnecessary when targeting the FPGA, because the hardware is generated to fit the kernel, rather than the other way around, where you try to make your kernel fit the underlying architecture of GPUs and other ASICs. In general, my recommendation is this: any time you work with a kernel that has been optimized for a GPU and you recognize these optimizations, try undoing them when you target the FPGA, since optimizations for one device may not necessarily help on another device (the same is true if you port a kernel from an FPGA to a GPU).

Altera_Forum · ‎03-14-2014

I think it may be best to describe what you've observed with a simple example:

bool comparison;

comparison = get_global_id(0) % 2 == 0 || get_global_id(0) % 2 == 1; // this is always true

if (comparison) {...} // conditional 1 and statement 1

if (get_global_id(0) % 2 == 0 || get_global_id(0) % 2 == 1) {...} // conditional 2 and statement 2

The two conditional tests above are not necessarily equivalent in how they execute, because of possible early exits: if the global id is even, you can skip the second logical test and immediately execute statement 2, while the odd global ids have to perform both logical tests before executing statement 2. This may cause branch divergence on GPUs, given their SIMD architecture, and reduce performance by half in the above example. The two statements seem like they would perform equally well on FPGAs. Are you using any kernel attributes, like num_simd_work_items?
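The early-exit point can be illustrated with a small plain-C sketch. This is not OpenCL and not taken from any vendor guide: the helpers is_even and is_odd, the tests_run counter, and the two condition_* functions are hypothetical names introduced only to count how many sub-tests get evaluated. How a particular GPU or FPGA compiler actually lowers either version is not shown here and may differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Counts the sub-tests evaluated by the most recent condition_* call
   (illustration only). */
static int tests_run = 0;

static bool is_even(size_t gid) { tests_run++; return gid % 2 == 0; }
static bool is_odd(size_t gid)  { tests_run++; return gid % 2 == 1; }

/* Logical OR short-circuits: is_odd() is skipped when is_even() is true,
   so even and odd ids do different amounts of work. */
static bool condition_short_circuit(size_t gid) {
    tests_run = 0;
    return is_even(gid) || is_odd(gid);
}

/* Both sub-tests are evaluated unconditionally before being combined,
   so every id does the same amount of work. */
static bool condition_uniform(size_t gid) {
    tests_run = 0;
    bool even = is_even(gid);
    bool odd = is_odd(gid);
    return even | odd;
}
```

With the short-circuit version an even id runs one sub-test and an odd id runs two, which is the kind of per-work-item difference that can show up as divergence on SIMD hardware. The uniform version always runs two.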

Altera_Forum · ‎03-14-2014

Thanks for your reply.

Actually, I'm not trying to optimize performance for a particular device. I'm just trying to see the effect of applying the GPU optimization on the FPGA card.