Re: What does Performance saturation mean when we Increase SIMD size?

MAsla5 · ‎02-13-2020

Hi,

I am accelerating my application on an Altera FPGA. When I go to SIMD 32, the resource usage drops instead of increasing. I read somewhere that this is performance saturation. My question is: how can I prove it? Where can I find the answer to this question? Can I find it somewhere in the report?

Thank you.

MEIYAN_L_Intel · ‎02-14-2020

Hi,

When you say that the resources drop instead of increasing with SIMD 32, do you mean that you are increasing the number of SIMD work-items to 32?

Thanks

MAsla5 · ‎02-14-2020

Hi,

Yes, you're right. Usually, if the usage of some resource exceeds one hundred percent, offline compilation is terminated. But what happens in the case of SIMD 32?

Thanks!

MAsla5 · ‎02-14-2020

Hi,

When I look at the reports, only two memory banks are created. In the case of 16, I can see 16 memory banks in the reports.

Is this a memory-bound issue? If so, what is it? Please guide me in this matter.

Thanks!

MEIYAN_L_Intel · ‎02-14-2020

Hi,

May I have the kernel code and report file for further investigation?

Thanks

MAsla5 · ‎02-14-2020

Hi ,

You can see the same behavior with Intel's matrix multiplication design example. I have checked with that as well.

Thanks!

MAsla5 · ‎02-14-2020

Hi,

Here you can find the attached file.

Thanks!

HRZ · ‎02-14-2020

That is because the compiler does not support SIMD sizes above 16. If you choose such a SIMD size, it will automatically revert to a SIMD size of 1, and hence resource utilization will decrease. There should be a warning about this in the compilation log, or at least there was one before. A lot of the important warnings have been removed in the newer versions of the compiler; I hope this one is still there.

Of course, there is zero logical reason to have any restriction on SIMD size for FPGAs since, unlike GPUs, FPGAs do not have a fixed architecture; however, it has been this way since the very first version of the compiler and will probably never change.

MAsla5 · ‎02-14-2020

Hi,

Thanks for your answer. So can we say this is more or less a memory-bound issue? When the SIMD size gets bigger, resource usage increases, and in normal cases, when the usage of some resource exceeds one hundred percent, offline compilation terminates with a Quartus error. But my question is: in the SIMD case, if this is a memory-bound issue, why does it reduce the resource usage instead of terminating the offline compilation?

Thank you .

HRZ · ‎02-15-2020

Not really. This has nothing to do with memory bandwidth; it is an artificial compiler limitation. The following compiler warning is generated when compiling your kernel:

Compiler Warning: Kernel Vectorization: requested number of SIMD work items is larger than  ... cannot vectorize efficiently beyond OpenCL widest vector type.

If you write the kernel using the Single Work-item model and use an unroll size of 32, which would have a similar effect to using a SIMD size of 32 in an NDRange kernel, the kernel will compile just fine and the area usage will keep increasing as you increase the unroll factor. Depending on your kernel and FPGA size, you might not be able to fit the design with literally any SIMD size (even 1), or you might still be able to fit it with a hypothetical SIMD size of 32 or more. The compiler cannot know whether your design will fit without placing and routing it; hence, it will not terminate the compilation if some resource is expected to be overutilized. Note that the area utilization numbers you get from the "-report" switch are estimates, and the final area utilization could be more or less than that.
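
To make the comparison concrete, here is a minimal sketch of the two forms for a simple vector add. The kernel names, the vector-add body and the work-group size of 64 are assumptions made up for illustration; they are not taken from the attached kernel. The small #ifndef shim is only there so that the single work-item version also compiles as plain C for a quick functional check on a host.

```c
/* Sketch only: kernel names, the work-group size of 64 and the
   vector-add body are hypothetical, chosen for illustration. */
#ifndef __OPENCL_VERSION__
/* Plain-C shim: an OpenCL compiler defines __OPENCL_VERSION__ and
   skips this; a host C compiler sees empty qualifiers instead. */
#define __kernel
#define __global
#endif

#ifdef __OPENCL_VERSION__
/* NDRange form. 16 is the largest SIMD size the compiler accepts
   according to this thread; the work-group size must be divisible
   by the SIMD size. */
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(16)))
__kernel void vec_add_ndrange(__global const float *restrict a,
                              __global const float *restrict b,
                              __global float *restrict c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
#endif

/* Single work-item form: one loop over all elements, with the
   unroll factor playing the role of the SIMD size. */
__kernel void vec_add_swi(__global const float *restrict a,
                          __global const float *restrict b,
                          __global float *restrict c,
                          int n)
{
    #pragma unroll 32
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

In the NDRange form the parallelism comes from the num_simd_work_items attribute, while in the single work-item form it comes from the unroll pragma on the loop, so the factor of 32 can be requested directly.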

Memory bandwidth depends on a lot of factors, only one of which is SIMD/unroll size. You can find a comprehensive analysis of memory performance on Intel FPGAs in the following document:

https://arxiv.org/abs/1910.06726

MEIYAN_L_Intel · ‎02-20-2020

Hi,

For your information, chapter 7.3.1 of https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf describes the limitations of implementing the num_simd_work_items attribute.

Thanks