Optimization for complex algorithms

Altera_Forum · ‎09-03-2017

Hi everyone. I'm working on a project where I have to port a "big" piece of code to an FPGA using OpenCL.

The previous version of the code fit on the FPGA, but the current version is more complex and I can't get it to fit.

I am trying to optimize the code as much as I can. I'll share some of the strategies I am using:

No loop unrolling
FastPow instead of pow (the power math function)
Large if-statement conditions evaluated outside the statements and, where possible, outside loops
Other branching/if-statement optimizations found by trial and error
Possibly all variables in global memory
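
As an illustration of the third strategy, here is a minimal sketch in plain C (standing in for kernel code). The function name, its parameters, and the condition itself are hypothetical and not taken from the actual project; the point is only that a condition depending solely on the arguments can be evaluated once, before the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: scale every element and optionally add an offset.
   The condition depends only on the arguments, so it is evaluated once
   before the loop instead of once per iteration. */
void scale_hoisted(const float *in, float *out, size_t n,
                   float gain, float offset, int mode, float threshold)
{
    assert(in != NULL && out != NULL);

    /* Loop-invariant condition, computed outside the loop. */
    const int add_offset = (mode == 1) && (gain > threshold);

    /* Fold the branch into a value so the loop body contains no branch. */
    const float bias = add_offset ? offset : 0.0f;

    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * gain + bias;
    }
}
```

Whether this saves area in a given design still has to be checked against the compiler's area report.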

Now I have run out of ideas and don't know what to do next. I started at 900% logic utilization and brought it down gradually. This is my current result:

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               ; Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ; 143%                      ;
; ALUTs                                  ; 69%                       ;
; Dedicated logic registers              ; 77%                       ;
; Memory blocks                          ; 153%                      ;
; DSP blocks                             ; 100%                      ;
+----------------------------------------+---------------------------+

What does "Memory blocks" refer to?

Altera_Forum · ‎09-03-2017

Are you targeting Stratix V or Arria 10? Since the DSPs in Stratix V do not natively support floating-point operations, these operations consume a lot of on-chip memory and logic. That could explain part of your problem. You can alleviate this by adding --fpc and --fp-relaxed to the kernel compilation options, which can greatly reduce area utilization at the cost of some accuracy.

If you are using Arria 10, do not rely on the compiler's memory and logic estimates; in my experience they can be off by as much as 50%. Your kernel might actually fit as it is if you are targeting Arria 10.

Make sure that any mathematical operation performed only once per kernel invocation is done on the host, with the result passed to the kernel as an argument.
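
As a minimal sketch of this idea, the following plain C stands in for both the host code and the kernel. All names are hypothetical, and the normalization is only an example of a value that stays constant for a whole kernel invocation: the division is done once on the host, so the kernel contains only a subtract and a multiply.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the device kernel. It receives the precomputed reciprocal
   as a plain argument, so no divider is needed inside the loop. */
void normalize_kernel(const float *in, float *out, size_t n,
                      float min_value, float inv_range)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = (in[i] - min_value) * inv_range;
    }
}

/* Stand-in for the host code. The division runs once on the CPU and its
   result is handed to the kernel. */
void host_launch(const float *in, float *out, size_t n,
                 float min_value, float max_value)
{
    assert(max_value > min_value);
    const float inv_range = 1.0f / (max_value - min_value);
    normalize_kernel(in, out, n, min_value, inv_range);
}
```

In real host code the precomputed value would be passed with clSetKernelArg rather than a direct function call.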

If all else fails, you can still split your computation into two or more kernels and run them sequentially, reconfiguring the FPGA between kernels. As long as your run time is long enough, the reconfiguration overhead will be negligible.

Regarding "Memory Blocks", that shows the number of Block RAMs on the FPGA with at least one used port. Each block has two ports and to support multiple parallel accesses to the same on-chip buffers, the compiler has to replicate such buffers onto multiple Block RAMs to obtain enough ports for supporting all the accesses without stalling. Because of this, in many cases, you will actually run out of ports (Memory blocks) sooner than you run out of memory space (Memory bits).
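
The arithmetic behind running out of blocks before bits can be sketched with a toy model in C. This is deliberately simplified and is not the compiler's actual allocation algorithm: it assumes one buffer replica per group of ports_per_block accesses, and the 20480-bit block size used in the usage note is an assumption about the device's block RAM.

```c
#include <assert.h>

/* Ceiling division for unsigned integers; b must be non-zero. */
unsigned long ceil_div(unsigned long a, unsigned long b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Blocks needed if only storage capacity mattered. */
unsigned long blocks_for_capacity(unsigned long buffer_bits,
                                  unsigned long block_bits)
{
    return ceil_div(buffer_bits, block_bits);
}

/* Toy model: the buffer is copied once per group of ports_per_block
   parallel accesses, and every copy needs its full set of blocks. */
unsigned long blocks_with_replication(unsigned long buffer_bits,
                                      unsigned long block_bits,
                                      unsigned long accesses,
                                      unsigned long ports_per_block)
{
    unsigned long replicas = ceil_div(accesses, ports_per_block);
    if (replicas == 0) {
        replicas = 1; /* an unused buffer still occupies one copy */
    }
    return replicas * blocks_for_capacity(buffer_bits, block_bits);
}
```

Under these assumptions, a 65536-bit buffer needs 4 blocks for capacity alone, but 16 blocks once 8 parallel accesses have to be served by two-port blocks.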

Altera_Forum · ‎09-03-2017

I am targeting Stratix V. Are you sure it doesn't natively support floating-point operations?

I have tried adding those two options when compiling, but the estimated values did not change. Should I just let it compile and see?

Also, does what you said about Block RAMs mean that reducing the number of buffers will save ports?

Thanks for your help

Altera_Forum · ‎09-04-2017

Yes, I am sure. The DSPs in Stratix V only natively support integer operations. For floating-point multiplication, large parts of the shifting and rounding have to be performed outside the DSP, which consumes a significant amount of logic and memory. Note that only floating-point multiplication uses DSPs on Stratix V (along with logic and memory), while every other floating-point operation (including addition) uses only logic and memory. Adding --fp-relaxed might or might not make a difference, but on Stratix V, adding --fpc should make a very noticeable one. The difference will also show up in the report; if you do not see it there, the switch has not been applied correctly. Assuming that you are using a makefile, have you made sure the switches are actually being passed to the aoc command?

Since you are using Stratix V, the high memory and logic utilization you are seeing comes from the floating-point operations, and even with --fpc you might not be able to fit the design. Switching to fixed-point as outlined in Altera's documents could also help.

Reducing the number of buffers, and also the number of accesses to each buffer, can help reduce memory utilization. Assuming that you are using an NDRange implementation, the compiler report gives detailed information about which buffers are implemented using memory blocks, how many accesses exist to each buffer, and how many times each buffer has been replicated to support these accesses. I am not sure whether this information still exists in the latest versions of the compiler, though; I haven't used NDRange in a long time.
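
To give a rough idea of what fixed-point arithmetic looks like, here is a generic Q16.16 sketch in C. It is not the specific method from Altera's documents; the type name, helper names, and the choice of 16 fractional bits are assumptions made for illustration. Addition is a plain integer add, and multiplication is an integer multiply followed by a shift.

```c
#include <assert.h>
#include <stdint.h>

/* Q16.16: 16 integer bits and 16 fractional bits in a 32-bit integer. */
typedef int32_t q16_16;
#define Q_FRAC_BITS 16

/* Convert from float; the value must fit in the 16-bit integer part. */
q16_16 q_from_float(float x)
{
    assert(x > -32768.0f && x < 32768.0f);
    return (q16_16)(x * (float)(1 << Q_FRAC_BITS));
}

float q_to_float(q16_16 x)
{
    return (float)x / (float)(1 << Q_FRAC_BITS);
}

/* Addition needs no rescaling (overflow is not checked here). */
q16_16 q_add(q16_16 a, q16_16 b)
{
    return a + b;
}

/* Multiply in 64 bits, then shift back to 16 fractional bits.
   Right-shifting a negative value is implementation-defined in C;
   common compilers perform an arithmetic shift. */
q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) >> Q_FRAC_BITS);
}
```

For example, q_mul(q_from_float(1.5f), q_from_float(2.25f)) converts back to 3.375 with q_to_float. The range and precision actually required depend on the algorithm and have to be checked separately.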

Altera_Forum · ‎09-04-2017

What about the multi-kernel approach? Would it save space on the FPGA, given that I would have to reprogram the FPGA during execution? Also, what about partial reconfiguration? I have just read about it, but I can't find anything online about how to use it.

Thanks for your valuable help

Altera_Forum · ‎09-05-2017

The multi-kernel approach could certainly help, as long as you can logically split your computation into separate sequential sections. I have not tried this myself yet, but you can put the different kernels in separate files and compile them individually. Then, in the host code, you load the first kernel image, compute, leave the output in device memory, load the second kernel, use the output of the first kernel as its input, and generate the final output. You might need some extra buffer management in this case; e.g. you might need to free the input buffer of the first kernel after its execution has finished, to free up space for the output buffer of the second kernel.

Altera_Forum · ‎09-05-2017

OK, so I removed something from the code and now it fits. However, execution times are now longer than I would like.

I was wondering: do work-groups run in parallel, or is that not guaranteed? I am running with a global size of 256 and a local size of 1.

It looks like the execution time increases linearly as I increase the global size. Am I doing something wrong?

Altera_Forum · ‎09-05-2017

Work-group parallelism will only be guaranteed if you use num_compute_units to replicate your kernel pipeline. With one compute unit, there can still be some limited degree of work-group pipelining depending on how many barriers you have in the kernel, but there will be no guaranteed parallelism.

If you are fine with a local size of one, i.e. you do not use any local-memory-based optimization, you might as well use the single work-item kernel type and simply wrap your computation in a for loop from 0 to global_work_size - 1. At least in that case you get guaranteed pipelining, with an initiation interval that depends on loop-carried dependencies and is reported by the compiler.
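
The transformation can be sketched in plain C as a stand-in for OpenCL kernel code. The function names and the per-item computation are hypothetical; in a real NDRange kernel the index would come from get_global_id(0), and the single work-item version would be launched exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-item computation shared by both styles. */
float work_item_body(const float *in, size_t gid)
{
    return in[gid] * in[gid] + 1.0f;
}

/* NDRange style, modelled in C: the runtime would invoke this once per
   work-item, with gid supplied by get_global_id(0). */
void ndrange_style(const float *in, float *out, size_t gid)
{
    out[gid] = work_item_body(in, gid);
}

/* Single work-item style: one invocation that loops from 0 to
   global_work_size - 1, which the compiler can pipeline as a loop. */
void single_work_item_style(const float *in, float *out,
                            size_t global_work_size)
{
    assert(in != NULL && out != NULL);
    for (size_t gid = 0; gid < global_work_size; ++gid) {
        out[gid] = work_item_body(in, gid);
    }
}
```

Both styles produce the same output; the difference lies in how the compiler schedules the work.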

I do not understand the inner workings of the scheduler well enough to tell you why your run time is increasing linearly with global_work_size. Assuming that all work-items from all work-groups are pipelined one after another with a small initiation interval, run time will still increase with global_work_size, but each additional work-item should add only a few cycles rather than the full latency of the pipeline. I would expect run time to grow in proportion to the full per-work-item latency only if execution is fully sequential.
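
A first-order cycle-count model makes the difference concrete. This is a simplification that ignores memory stalls and scheduler behaviour; the function names and the numbers in the usage note are made up for illustration. Here the initiation interval is the number of cycles between successive work-items entering the pipeline.

```c
#include <assert.h>

/* Pipelined execution: the first work-item takes the full pipeline
   latency, and each further one completes ii cycles after the previous. */
unsigned long pipelined_cycles(unsigned long n, unsigned long latency,
                               unsigned long ii)
{
    assert(ii >= 1);
    if (n == 0) {
        return 0;
    }
    return latency + (n - 1) * ii;
}

/* Fully sequential execution: every work-item pays the full latency. */
unsigned long sequential_cycles(unsigned long n, unsigned long latency)
{
    return n * latency;
}
```

With a hypothetical latency of 100 cycles and an initiation interval of 1, going from 256 to 512 work-items raises the pipelined estimate from 355 to 611 cycles, while the sequential estimate doubles from 25600 to 51200. Run time that scales in direct proportion to the global size therefore points toward sequential execution.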