Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™
Intel HLS - high load / store latency

Mopplikus
Beginner

Hello,

I am benchmarking a few common HLS tools and am having some issues with Intel HLS. I implemented a simple histogram to test the tool; however, the test-fpga report shows unusually high latency for load / store operations, which raises my II far above what I expected.

According to the report, a load / store operation takes 31 cycles. This leads me to believe that, the way I wrote the histogram, the tool does not use the embedded memory on the board (which I would expect to have a 1-cycle load / store latency, given that I expect this circuit to run in the 200-300 MHz range). What do I need to specify to make it use the on-board memory, or otherwise reduce the load / store latency?

Below is the C++ code I'm synthesizing. Note that the goal is to pre-initialize the RAM with the inputs to the histogram function; the function should then iterate over them.

#include <HLS/hls.h>
#include <stdio.h>
#include <iostream>

#define N 100

using namespace ihc;

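// Weighted histogram: for each of the first n inputs, add weight[i]
// to the bin selected by feature[i].
component void histogram(
    int feature[],
    float weight[],
    float hist[],
    int n
)
{
    int i;
    for(i = 0; i < n; i++)
    {
        int m = feature[i];     // bin index
        float wt = weight[i];   // amount to add
        float x = hist[m];      // load current bin value
        hist[m] = x + wt;       // store updated bin value
    }
}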

int main()
{
    hls_memory hls_singlepump int feature[N];
    hls_memory hls_singlepump float weight[N];
    hls_memory hls_singlepump float hist[N];

    int i;
    for(i = 0; i < N; i++)
    {
        feature[i] = i + 1;
        weight[i] = (float) (2 * i);
        hist[i] = 0.0f;
    }

    histogram(feature, weight, hist, N - 1);

    bool failed = false;
    for(i = 0; i < N; i++)
    {
        float val = hist[i];

        if(i == 0)
        {
            if(val != 0.0)
            {
                failed = true;
                break;
            }
        }
        else
        {
            if(val != (float) ((i - 1) * 2))
            {
                failed = true;
                break;
            }
        }
        
    }

    if(failed)
    {
        printf("FAILED");
    }
    else
    {
        printf("PASSED");
    }

    return 0;
}

For reference, I'm synthesizing on the default Arria 10 board.

Also, if you have any tips on improving my code or some standard practices which I'm unaware of, I'll gladly take them.

Thanks in advance.

Mopplikus
Beginner

Update: I found a solution to my problem. It turns out that I need to initialize the on-board memory inside a component. When the arrays are initialized in the main function, the tool probably assumes that the loads and stores go through the board's I/O instead of the embedded memory. Moving the initialization inside the component solved the issue and brought the latency down to the expected 1 cycle.
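
For anyone finding this later, here is a rough sketch of what "moving the initialization inside the component" could look like. It is not the exact code from my project. The component name `histogram_local`, its `query` argument and return value, the `run_testbench` helper, and the `USE_INTEL_HLS` build flag are all made up for illustration; only `component`, `hls_memory` and `hls_singlepump` come from the original code. The resulting latency should still be checked in the report.

```cpp
// Sketch only. USE_INTEL_HLS is a hypothetical flag (e.g. pass -DUSE_INTEL_HLS
// when building with i++). Without it, the HLS keywords are stubbed out so the
// file also builds with a plain C++ compiler.
#ifdef USE_INTEL_HLS
#include <HLS/hls.h>
#else
#define component
#define hls_memory
#define hls_singlepump
#endif
#include <cstdio>

constexpr int N = 100;

// Hypothetical component: all three arrays are local to the component.
// `query` and the return value are assumptions, added so a bin can be read back.
component float histogram_local(int query)
{
    hls_memory hls_singlepump int feature[N];
    hls_memory hls_singlepump float weight[N];
    hls_memory hls_singlepump float hist[N];

    // Same initialization as the original main(), now inside the component.
    for (int i = 0; i < N; i++)
    {
        feature[i] = i + 1;
        weight[i] = (float)(2 * i);
        hist[i] = 0.0f;
    }

    // Same accumulation as the original component, over the first N - 1 inputs.
    for (int i = 0; i < N - 1; i++)
    {
        int m = feature[i];
        hist[m] = hist[m] + weight[i];
    }

    return hist[query];
}

// Testbench-style check mirroring the original main(): returns 0 on success.
int run_testbench()
{
    for (int i = 0; i < N; i++)
    {
        float expected = (i == 0) ? 0.0f : (float)((i - 1) * 2);
        if (histogram_local(i) != expected)
        {
            printf("FAILED\n");
            return 1;
        }
    }
    printf("PASSED\n");
    return 0;
}
```

Calling `run_testbench()` from `main()` reproduces the pass/fail check from the first post. Returning a single bin per call is only there to keep the sketch small; a real design would need a proper way to get the inputs in and the results out.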

hareesh
Employee

Hi,

Thank you for posting here. I have seen your last post. It looks like you have found the solution, so if you don't have any further issues, I will close this case. Please confirm.


Thanks,


Mopplikus
Beginner

Hello,

Yes, you may close this thread.

K.R.

hareesh
Employee

Thank you for the confirmation.

If you want to reopen this case, please log in to ‘https://supporttickets.intel.com’, view the details of the desired request, and post a response within the next 15 days to allow me to continue to support you.

