Re: Stable argument doesn't work in simulation

DorianL · ‎06-10-2024

Hi everyone,

I have an issue when I try to run my oneapi kernel by passing my arguments with "stable annotated_arg". I try to use a "for" loop with those "stable" arguments as variable in simulation but it is very slow and doesn't work very well whereas when i use classic "int" declared in the kernel without using an argument variable I don't have this issue in the "for" loop and the simulation work fine and fast. Do you have an idea of what could be the issue ? Thank you !

DorianL

aikeu · ‎06-12-2024

Hi DorianL,

Can try to check out the optimization reference from the guide below:

https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/optimization-guide/2023-1/loops.html

Thanks.

Regards,

Aik Eu

DorianL · ‎06-14-2024

Hi @aikeu,

Thank you for your reply, I already checked the documentation and I think the issue is that my inner loop is dynamic because it is an input of my IP (the number of column of the picture to process). I tried to solve the issue by bounding the loop with a random number like it is made in the example of the documentation, but my kernel doesn't accept the rand() function because it's an external SYCL function. Do you have an idea of how I could make my kernel work ?

Thank you !

DorianL

BoonBengT_Intel · ‎06-17-2024

Hi @DorianL,

Greetings, just to further understand the issues you mention that you have a inner loop issues.

Would you be able to share the code example that you have for the kernels and also what are the error/issues that you are seeing? And is it correct to assume that the error happens on the compilation?

If you can provide the compilation command it would also be very helpful.

That would better help us to understand the situation.

Thanks,

Regards

BB

DorianL · ‎06-19-2024

Hi @BoonBengT_Intel,

Thank your for your answer, here is my code with my kernel. My issue is about the inner loop (line 49) in the kernel that process pixels should have a II of 1 wich is not the case in simulation. I have rdy signals that are not continually high whereas it is indicated in the report that the II should be 1. The issue is in my opinion about the dynamic trip count of this loop because the stop condition of my "for" loop is an input of the kernel and I think that the compiler is considering it as a dynamic input. How can I change that to be a fixed input ?

Here is the code of my kernel :

template <typename flux_in, typename flux_tempo>
struct travail_sur_voisinage {

    sycl::ext::oneapi::experimental::annotated_arg<
       int , decltype(sycl::ext::oneapi::experimental::properties{
               stable})>
       taille_h;

     sycl::ext::oneapi::experimental::annotated_arg<
       int , decltype(sycl::ext::oneapi::experimental::properties{
               stable})>
       taille_v;
  
    auto get(sycl::ext::oneapi::experimental::properties_tag) {
        return sycl::ext::oneapi::experimental::properties{
            
                streaming_interface<>};
  }

  void operator()() const {
    //Compteurs ligne pixel

	//Entree Sortie
  [[intel::fpga_register]]
	unsigned int pixel_a_traiter;

  [[intel::fpga_register]]
	unsigned int pixel_a_envoyer;

  
 
  [[intel::fpga_register]]
	unsigned int pixel_apres_traitement;

	//Ligne a retard
  [[intel::fpga_memory("BLOCK_RAM")]]
	unsigned int line_buffer[5][NB_COLONNE_MAX];

	//Voisinnage
  [[intel::fpga_register]]
	unsigned int fenetre[5][5];

    [[intel::initiation_interval(1)]]
    for (int num_lig = 0; num_lig < taille_v + 2; num_lig++)
    {
      
      [[intel::initiation_interval(1)]]
      //[[intel::speculated_iterations(0)]]
    	for (int num_col = 0; (num_col < taille_h + 2); num_col++)
    	{

        if (num_lig < taille_v && num_col < taille_h)
        {

          pixel_a_traiter = flux_in::read();

          //Gestion ligne a retard
  
          fpga_tools::UnrolledLoop<0,4>([&](auto l)
          {
            line_buffer[l][num_col] = line_buffer[l + 1][num_col];
          });
          line_buffer[4][num_col] = pixel_a_traiter;
          
          //Fin gestion ligne a retard

          //Fenetre video glissante

          fpga_tools::UnrolledLoop<0,5>([&](auto li)
          {
              // #pragma unroll

            fpga_tools::UnrolledLoop<0,4>([&](auto co)
            {
              fenetre[li][co] = fenetre[li][co + 1];
            });
            fenetre[li][4] = line_buffer[li][num_col]; 
          });
          //Fin Fenetre video glissante
        }
      pixel_apres_traitement = traitement_5x5(fenetre);
    


      if ((num_lig >= 2) && (num_col >= 2))
      {
        pixel_a_envoyer = 0;
        
        if (((num_lig >= 4) && (num_lig < taille_v) && (num_col >= 4) && (num_col < taille_h)))
        {
          pixel_a_envoyer = pixel_apres_traitement;
        }
        flux_tempo::write(pixel_a_envoyer);
      }
       
  }
		}
	}
      
};

the compilation command is :

tp3_video.fpga_sim: kernel_sim.o
icpx -fsycl -fintelfpga -Xsclock=400MHz -Xsoptimize=latency  -Xssimulation -Xsghdl=0 -Xstarget=Agilex7 -Xsv $^ -o $@ 


kernel_sim.o: src/tp3_video.cpp
	icpx -fsycl  -O3 -g -std=c++17 -Wall -I include -v -fintelfpga -Xsprofile -Xssimulation -DFPGA_SIMULATOR -o $@ -c src/tp3_video.cpp

Have a great day,

DorianL

aikeu · ‎06-18-2024

Hi DorianL,

I not sure it is directly related to your application request or not based on your previous feedback but can check out the method below on getting random number:

https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Quick-random-number-on-SYCL-device/td-p/1262724

Thanks.

Regards,

Aik Eu

DorianL · ‎06-19-2024

Hi @aikeu,

Thank you for your help, I'll check that now !

DorianL

aikeu · ‎06-23-2024

Hi DorianL,

I will close the thread if no further follow up question.

Thanks.

Regards,

Aik Eu

aikeu · ‎06-25-2024

Hi DorianL,

As we do not receive any response from you on the previous question/reply/answer that we have provided. Please login to ‘https://supporttickets.intel.com’, view details of the desire request, and post a feed/response within the next 15 days to allow me to continue to support you. After 15 days, this thread will be transitioned to community support. The community users will be able to help you on your follow-up questions.

Thanks.

Regards,

Aik Eu

DorianL · ‎06-26-2024

Hi @aikeu ,

I was waiting for an answer following my last post with my code and compilation commands as asked by @BoonBengT_Intel in a previous post. Could we wait an answer from him before closing the thread ? Thank you !

DorianL

BoonBengT_Intel · ‎07-01-2024

Hi @DorianL,

Appreciate the hold, based on the explanation above on the first part when you are facing challenges in working with the loop's optimization, the Loop Analysis Report provides a good insight on the loop structure that you have written and would give some ideas on where the bottleneck is, more details on the report can be found in the following link below:
- https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/developer-guide/2024-1/loop-analysis.html

On the other hand, when you are trying the inner loop optimization and facing another issues when trying to implement the external function, you can use the 'std::default_random_engine' instead in c++.
More details of the implementation of inner loop optimization could be found in our git repo below:
- https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/DesignPatterns/optimize_inner_loop

Hope that clarifies
Regards
BB

DorianL · ‎07-04-2024

Hi @BoonBengT_Intel,

Thank you for the information but the issue is that in the kernel code I sent before the variable "taille_v" and "taille_h" are input of my IP and it increases the latency of the loop compared to when both variable are defined and not input anymore. Is this an issue that you have ever faced ? Thank you !

DorianL

whitepau · ‎07-09-2024

Hi @DorianL

I think that the piece of documentation you referred to is a bit confusing: https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/developer-guide/current/optimize-inner-loop-throughput.html

The recommendation in that document is to bound your loop with a constant, not a random number. The rand() function is just used to indicate a value that is not known at compile time, and is equivalent to your taille_h and taille_v kernel arguments.

Your line 49 loop could be written like this:

for (int num_col = 0; (num_col < 64) && (num_col < taille_h + 2); num_col++)

This code looks like you are trying to create a local filter. We have a convolution 2d reference design that may be helpful to you: https://github.com/oneapi-src/oneAPI-samples/tree/development/DirectProgramming/C%2B%2BSYCL_FPGA/ReferenceDesigns/convolution2d

Importantly, this design replaces your loop nest with a single loop. Instead of

for (int row = 0; row < num_rows; row++) {
    for(int col = 0; col < num_cols; col++) {
        <...>
    }
}

the convolution design uses a while(1) control loop, and updates the current row/column variables depending on sideband signals that are inserted when the input image is read. This also makes for a design that is able to recover from defective packets (e.g. when a camera is unexpectedly unplugged).

If you prefer the simplicity of a double loop nest and don't need to recover from stream interruptions, you can coalesce the loop nest with the loop_coalesce attribute.

DorianL · ‎07-11-2024

Hi @whitepau,

Thank you for your answer, I tried to bound the loop with a constant number and it doesn't change anything on the simulation, the read_rdy signal of the pipe is sometimes low whereas it is always high when i use a constant size of image instead of kernel arguments.

The issue is still the same the inner loop is very slow when I compare between kernel argument and constant value.

I tried to coalesce the loop nest but I have worse results in simulation than when I don't use coalesce loops.

I already tried to use the convolution_2d github example and I can't use a picture with higher dimension than the original example (test_0.bmp,test_1.bmp...). Is this possible to use other pictures ?

Thank you,

DorianL

whitepau · ‎07-11-2024

Your screenshot mixes the ready/data signals from flux_in with the valid signal from flux_out.

It almost looks like the compiler was not able to schedule your loop with II=1. You can upload your report.zip so we can look at it.
https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/developer-guide/current/review-the-report-html-file.html

You may want to consider the max_reinvocation_delay attribute (code sample)

You can replace the images but make sure they have the same filenames as the originals. (test_0.bmp, test_1.bmp. etc.). I think all test images must be the same size too, since the testbench passes a sequence of images but only sets the image dimensions once.

DorianL · ‎07-12-2024

Hi @whitepau,

Thank you for your answer, I was just talking about the read_rdy from flux_in wich is not constant. I just uploaded the report with this post if you can find something wrong in it, but suprisingly the compiler can reach a II=1 whereas it is clearly not the case in simulation.

I tried with the max_reinvocation_delay pragma but it doesn't change anything on it and it is still slow.

Thank you for the information about the convolution_2d example, I will try to replace images and keep you in touch !

DorianL

whitepau · ‎07-16-2024

Thanks for sharing the report, @DorianL .

It looks like the loop at line 137 was pipelined with II=1, but it was constrained to serial execution.

This means that this outer loop is effectively un-pipelined. This doesn't explain the gaps you are seeing in the simulation waveform though.

I also see that you are getting a memory system with lots of arbitration:

After some experimenting, I discovered that the warning about the variable 'fenetre' is a bit of a red herring here. I would expect for a line buffer like you are describing to have a memory system with multiple banks, and each bank having a dedicated load/store unit (LSU). From the image above, we can see that the memory system is not efficiently selecting banks. I tried using the bank_bits attribute to constrain this, but it appears the compiler is ignoring this attribute now.

I was able to get the compiler to partition your 2d array by swapping the dimensions (transposing) so that the dimension to be split into banks (i.e. accessed simultaneously by different unrolled loop iterations) was in the least significant place. This appears to result in the desired memory system (don't forget to swap the accesses too!!)

OLD:

// Ligne a retard
[[intel::fpga_memory("BLOCK_RAM")]]  // memory
unsigned int line_buffer[8][NB_COLONNE_MAX];

NEW:

// Ligne a retard
[[intel::fpga_memory("BLOCK_RAM")]]  // memory
unsigned int line_buffer[NB_COLONNE_MAX][8];

* Note that I changed the dimension from 5 to 8: the compiler complains if you try to create a memory system with a non-power-of-2 number of banks. Changing to 8 is ok because the compiler sees that the extra 3 banks aren't used and it optimizes them away.

The new memory system looks a lot better now:

The sim looks a lot better too:

I think i know how to solve these 2-cycle dips but I'm still waiting for the test to finish.

I suspect it's a side-effect of using a loop nest instead of using a while(1) loop to iterate across image pixels.

DorianL · ‎07-17-2024

Hi @whitepau,

Thank you for your work and answer, everything is working good now ! I never thought it could be a memory issue on that, I will check the kernel memory viewer more often now. The simulation is perfect now and the rdy signal is as it should be. Have a great day !

DorianL

whitepau · ‎07-16-2024

wait... no those blips are intentional: in your inner loop you have a conditional statement intentionally causes two stalls at the end of each row of pixels.

for (int num_col = 0; (num_col < NB_COLONNE_MAX) && (num_col < taille_h + 2); num_col++) {
    if (num_lig < taille_v && num_col < taille_h) {
          pixel_a_traiter = flux_in::read();

          <...>
    }
}

whitepau · ‎07-17-2024

glad it's working @DorianL

When i was looking for an answer for you, I saw that we don't actually have a code sample demonstrating any of these things. Would it be alright if we used the code you sent in as the basis for a sample to demonstrate kernel memory optimization?

Also, please mark my post as the solution to close the thread

Stable argument doesn't work in simulation

oneAPI (SW Development| BSP | IP Integration)