I have a Stratix 10 design that is based around an IP core generated using Intel's HLS. The core does some simple floating-point operations and by itself uses very few resources (1 DSP, a few hundred flops, etc.).
This core sits inside a generate statement like this:
for (i = 0; i < SOMEBIGNUMBER; i = i + 1) begin : gen_cores
    myhlscore u0 (inputs, outputs);
end
The design works and is proven in simulation and in hardware.
The problem comes when I try to increase the value of SOMEBIGNUMBER. Despite there being adequate resources, using values above 200 or so makes the synthesis tool run out of memory.
I cannot alleviate this easily by adding more memory. I already tried synthesizing on a computer with 256 GB of memory and 200 GB of swap space, and Quartus ate it all up before dying.
I'm using a .ip file from HLS right now. I'm wondering if there is some way to pre-synthesize the module and keep the results, or whether there is some way I need to write the generate statement so that it caches less. Perhaps there are some synthesis settings I can change?
We tried using a design partition, but the elaboration stage still exceeds the 120 GB of memory.
Thank you for posting in the Intel community forum; I hope all is well.
If I understand the situation correctly, I would say the looping might be the cause here.
Hence I would recommend looking into pipelining the loops, which will enable parallelism.
Here is some explanation of the concept, and I would also recommend looking into the guide on writing loops in HLS.
Hope that clarifies.
Apologies for the confusion. If I understand correctly, what has been implemented is a for loop to trigger the IP cores that perform the simple floating-point operations. If the for loop count is increased to a big number, memory is exhausted.
May I ask what value of SOMEBIGNUMBER is causing the issue?
Also, are you able to share how the floating-point operations are written?
Hope to hear from you soon.
Thanks for getting back to me. Sorry for the slow reply.
I'm not sure what you mean by 'trigger'. The Verilog generate-for loop instantiates parallel instances of the IP core. If the for loop is large, then the memory usage during compilation is untenable.
SOMEBIGNUMBER is approximately 200. I cannot share the exact HLS code, but it is essentially a cumulative sum across 128 inputs. Here is some pseudocode:
float myHlsCore(16bit integer stream_in)
static float runSum = 0;
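For readers following the thread, the pseudocode above can be fleshed out roughly as below. This is only a sketch in plain C++, not the actual HLS source (which was not shared): the window counter, the reset after 128 samples, and the name kWindow are assumptions made for illustration, and the Intel HLS stream and component types are left out so the snippet compiles with any C++ compiler.

```cpp
#include <cstdint>

// Hypothetical window length, taken from "128 inputs" in the post.
constexpr int kWindow = 128;

// Plain C++ stand-in for the HLS component. Real HLS code would use the
// Intel HLS stream/component types, which are omitted here.
float myHlsCore(std::int16_t stream_in) {
    static float runSum = 0.0f;  // running sum, persists across calls
    static int count = 0;        // assumed: samples seen in the current window

    if (count == kWindow) {      // assumed: start over after 128 samples
        runSum = 0.0f;
        count = 0;
    }
    runSum += static_cast<float>(stream_in);  // one floating-point add per sample
    ++count;
    return runSum;
}
```

Each call consumes one 16-bit sample and returns the sum so far, which matches the "1 DSP, a few hundred flops" footprint described earlier: a single floating-point accumulator per instance.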
We have partially solved the issue by creating a very large swap file on the system (~500GB), but this is not a realistic solution as memory access is extremely slow on the swap.
Now the compilation process fails, saying the design cannot be routed, despite resource usage being less than 60%. It is a slightly different issue, but I think they are related, and I do not think that the swap file is a good solution to the original problem either.
Greetings, and apologies for the delay in responding.
We tried testing a simple floating-point matrix in HLS together with the Quartus compilation.
However, we did not notice the increase in resources.
Hence our guess is that there might be some resource usage elsewhere in the Quartus design, so we would suggest sharing more about the Qsys design you have.
Hope to hear from you soon.
Greetings. As we have not received any further clarification on what was provided, we will assume the challenge is resolved. For new queries, please feel free to open a new thread and we will be right with you, or let us know if the challenge is still open and we will get back to you as soon as convenient. Pleasure having you here.