Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™
677 Discussions

Deadlock while filling pipe for simulation

Ric2
Beginner
670 Views

Hi, following best practice as descibed here we run into trouble with stalled writes. 

 

When you create a testbench for a oneAPI kernel that you intend to compile as an IP core, write all your data to the host pipe before invoking the kernel.

 

There is a reproducer attached based on streaming data interface example.

As reducing the amount of data to process for simulation seems an easy workaround, this will not work if the design is more complex or bigger and the interface width is larger. Both conditions reduce the number of data samples before the buffer access stalls.

For our relevant design this means a too small size of simulateable data to get reasonable simulation results.

 

I have following questions:

1) Is this intented behaviour?

2) Is there a parameter to increase the accepted number of samples for simulation pipes?

3) Will this solution solve the simulation issue too?

 

Thanks for any feedback!

Ric.

 

oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308), Ubuntu 22.04.4 LTS

0 Kudos
12 Replies
Ric2
Beginner
629 Views

Having modified the write command to be non-blocking I notice that the command still blocks (the if condition is never satisfied):

      bool success = false;
      InPixelPipe::write(q, in_beat, success);
      if (!success){
        sleep(1);
        std::cerr << i << "### STALL ###\n";  
      } 

Documentation reads this:

Non-blocking writes add a bool argument in both host and device APIs that is passed by reference and returns true in this argument if the write was successful, and false if it was unsuccessful.

On the host:
// attempt non-blocking write from host to pipe until successful
while (!success) MyPipeInstance::write(q, data_element, success);

Remember, it's while running simulation. In emulation, the pipe is never stalling.

Btw, with i++ HLS we didn't run in this problem, even with very large simulation data sets.

 

Any advice? Thanks!

0 Kudos
Kevin_Xu_Intel
Employee
556 Views

Hello,

I was able to reproduce your issue, I think there is a buffer limit to host pipe in simulation, and since the host testbench is writing all of its data into the buffer before processing it, it stalled when the limit was hit. 

 

I found that a work around for this is to perform the host pipe read and write in separated threads. 

 

Attached design use pthread for threading the host testbench, and it should not effecting the IP you are working on.

 

Thanks,

0 Kudos
Ric2
Beginner
520 Views

Hi Kevin,

thanks for testing and proposing a work around. Unfortunatelly it helps for half of the problem only. The buffer depth limitation can be addressed this way.

However, as soon as we had implemented the changes we run into trouble again seeing this error massage sporadically:

 

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Aborted (core dumped)

 

Digging into the issue we found that the non-blocking pipe read causes the error as soon as an empty pipe is accessed to read.

To reproduce just generate e.g. 9 write samples and try to read 10.

With the print modification from below the output should look like this:

 

Running on device: Intel(R) FPGA Emulation Device
Input: 0         9
Input Done!
try read: 0      Output: 0       0      success: 1       
try read: 1      Output: 1       1      success: 1       
try read: 2      Output: 2       2      success: 1       
try read: 3      Output: 3       3      success: 1       
try read: 4      Output: 4       4      success: 1       
try read: 5      Output: 5       5      success: 1       
try read: 6      Output: 6       6      success: 1       
try read: 7      Output: 7       7      success: 1       
try read: 8      Output: 8       8      success: 1       
try read: 9      terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Aborted (core dumped)
while (!read_success){ 
    std::cerr << "try read: " << i << "\t "; 
    out_beat = OutPixelPipe::read(*(out_args->q), read_success);
    std::cerr << "Output: " << i << "\t " << out_beat.data << "\tsuccess: " << (int)read_success << "\t " << std::endl;
} 

 

Remember my second post here where some odd behaviour was reported for the non-blocking write command too.

 

Thanks for your help!

Ric.

 

0 Kudos
Kevin_Xu_Intel
Employee
502 Views

Hi Ric,

 

I did some testing, it dose looks like in emulation, a non-blocking read of a empty pipe causes crash. I will look into this problem. 

 

Dose the crash happens to you in simulation? I was not able to reproduce the crash in simulation.

 

Thanks,

 

-Kevin

0 Kudos
Ric2
Beginner
411 Views

Hi Kevin,

I guess there are race conditions. I sporadically see issues in simulation too:

 

.../build$ ./streaming.fpga_sim
Running on device: SimulatorDevice : Multi-process Simulator (aclmsim0)
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Enqueue process failed. -59 (PI_ERROR_INVALID_OPERATION)
Aborted (core dumped)

 

Please check the host pipe write command in your code. Modify the core to not accept data at it's input for a longer time (backpressure condition). Then the host will throw an error when the pipe is full but the host write is not handling this properly.

To reproduce, just modify line 57:

 

57   // while (!end_of_packet) {
57    while (end_of_packet) {
58      // Read in next pixel

 

As the reproducer is a bit of artificial, I see this issue in our production design too.

I hope this helps you to find the root cause of these problems!

Regards, Ric.

 

0 Kudos
Kevin_Xu_Intel
Employee
386 Views

Hi Ric,

 

I was able to reproduce this error. But i think it might be due to different reason. 

 

If we change the while loop condition to be never to enter it, the compiler would optimized it way.

 

If I change the while condition to:

    // while (!end_of_packet) {
    for (int i = 0; i < 1; i ++){

 To allows one packet to go though, the error goes away.

Do you have a different reproducer for the sporadically write issue?

 

Thanks,

 

-Kevin

 

0 Kudos
Ric2
Beginner
320 Views

Hi Kevin, you are right, the very simple reproducer did not show the effect. Same for the simple i++ loop as this will end the component without having consumed more data from the input pipe at a later time.

The updated reproducer is consuming input data with a large gap. I designed the gap to be large enougth to show that:
1) TB is pushing data to the input pipe until the related tread stalls as the write command is not correctly signalling FULL (the WR STALL print is never shown).
2) Simulation continuosly generates output data, restarts consuming input data after the gap but comes to a dead end as the input pipe is corrupted.
3) The final error comes from vsim but I guess the root cause is in the TB WR handling.

WR ... Input: 17000      Output: 51      Diff: 16949
WR ... Input: 18000      Output: 60      Diff: 17940
^CExiting simulation due to Interrupt
vsimk: src/hls_cosim_ipc_socket.cpp:133: virtual void IPCSocket::send(const void*, int): Assertion `0 && "send() failed"' failed.

Expected behavioural is that the TB WR thread hold on as long as WR pipe is full but restarts pushing data to it when the kernel has consumed some.

Regards, Ric.

Ric2_0-1718357943500.png

 

0 Kudos
Kevin_Xu_Intel
Employee
240 Views

 

 

Hi Ric,

I have simulated you reproducer a several times, here are what i found:

  1. I see your issue number (1) where when a non-blocking pipe write is writing into a pipe that has hit the implicit capacity, the write failed signal isn't being raised. I will ask around if this is a known behaviors.
  2. I don't really see issue (2). The data seems to be processed in burst for me (about every ~17000) but it eventually reach the end (see the output). The simulation dose took about 1 hour with ModelSim pro. So if you are simulating with a starter edition, it is going to take very long time to simulate (may appears to be stalled).
  3. "vsimk: src/hls_cosim_ipc_socket.cpp:133: virtual void IPCSocket::send(const void*, int): Assertion `0 && "send() failed"' failed" I believe this is the expected error when stopping simulation with ctrl+C.

 

 

 

Simulation output (only shown the begining and the end)

Running on device: SimulatorDevice : Multi-process Simulator (aclmsim0)
WR ... Input: 0	 Output: 0	 Diff: 0
RD STALL WORKS! 

WR ... Input: 1000	 Output: 0	 Diff: 1000
RD ... Input: 1274	 Output: 0	 Diff: 1274
WR ... Input: 2000	 Output: 41	 Diff: 1959
WR ... Input: 3000	 Output: 99	 Diff: 2901
WR ... Input: 4000	 Output: 156	 Diff: 3844
WR ... Input: 5000	 Output: 222	 Diff: 4778
WR ... Input: 6000	 Output: 292	 Diff: 5708
WR ... Input: 7000	 Output: 366	 Diff: 6634
WR ... Input: 8000	 Output: 438	 Diff: 7562
WR ... Input: 9000	 Output: 513	 Diff: 8487
WR ... Input: 10000	 Output: 588	 Diff: 9412
WR ... Input: 11000	 Output: 662	 Diff: 10338
WR ... Input: 12000	 Output: 735	 Diff: 11265
WR ... Input: 13000	 Output: 809	 Diff: 12191
WR ... Input: 14000	 Output: 882	 Diff: 13118
WR ... Input: 15000	 Output: 955	 Diff: 14045
RD ... Input: 15623	 Output: 1000	 Diff: 14623
WR ... Input: 16000	 Output: 1029	 Diff: 14971
WR ... Input: 17000	 Output: 1097	 Diff: 15903
WR ... Input: 18000	 Output: 1159	 Diff: 16841

... ...

WR ... Input: 982000	 Output: 952124	 Diff: 29876
RD ... Input: 982875	 Output: 953000	 Diff: 29875
WR ... Input: 983000	 Output: 953126	 Diff: 29874
RD ... Input: 983859	 Output: 954000	 Diff: 29859
WR ... Input: 984000	 Output: 954142	 Diff: 29858
RD ... Input: 984858	 Output: 955000	 Diff: 29858
WR ... Input: 985000	 Output: 955143	 Diff: 29857
RD ... Input: 985857	 Output: 956000	 Diff: 29857
WR ... Input: 986000	 Output: 956147	 Diff: 29853
RD ... Input: 986853	 Output: 957000	 Diff: 29853
WR ... Input: 987000	 Output: 957148	 Diff: 29852
RD ... Input: 987852	 Output: 958000	 Diff: 29852
WR ... Input: 988000	 Output: 958149	 Diff: 29851
RD ... Input: 988852	 Output: 959000	 Diff: 29852
WR ... Input: 989000	 Output: 959150	 Diff: 29849
RD ... Input: 989835	 Output: 960000	 Diff: 29835
WR ... Input: 990000	 Output: 960166	 Diff: 29834
RD ... Input: 990822	 Output: 961000	 Diff: 29822
WR ... Input: 991000	 Output: 961178	 Diff: 29822
RD ... Input: 991823	 Output: 962000	 Diff: 29823
WR ... Input: 992000	 Output: 962178	 Diff: 29822
RD ... Input: 992823	 Output: 963000	 Diff: 29823
WR ... Input: 993000	 Output: 963179	 Diff: 29821
RD ... Input: 993823	 Output: 964000	 Diff: 29823
WR ... Input: 994000	 Output: 964177	 Diff: 29823
RD ... Input: 994822	 Output: 965000	 Diff: 29822
WR ... Input: 995000	 Output: 965179	 Diff: 29821
RD ... Input: 995810	 Output: 966000	 Diff: 29810
WR ... Input: 996000	 Output: 966199	 Diff: 29801
RD ... Input: 996802	 Output: 967000	 Diff: 29802
WR ... Input: 997000	 Output: 967198	 Diff: 29802
RD ... Input: 997804	 Output: 968000	 Diff: 29804
WR ... Input: 998000	 Output: 968197	 Diff: 29803
RD ... Input: 998804	 Output: 969000	 Diff: 29804
WR ... Input: 999000	 Output: 969197	 Diff: 29803
RD ... Input: 999807	 Output: 970000	 Diff: 29807
Input Done!
RD ... Input: 1000000	 Output: 971000	 Diff: 29000
RD ... Input: 1000000	 Output: 972000	 Diff: 28000
RD ... Input: 1000000	 Output: 973000	 Diff: 27000
RD ... Input: 1000000	 Output: 974000	 Diff: 26000
RD ... Input: 1000000	 Output: 975000	 Diff: 25000
RD ... Input: 1000000	 Output: 976000	 Diff: 24000
RD ... Input: 1000000	 Output: 977000	 Diff: 23000
RD ... Input: 1000000	 Output: 978000	 Diff: 22000
RD ... Input: 1000000	 Output: 979000	 Diff: 21000
RD ... Input: 1000000	 Output: 980000	 Diff: 20000
RD ... Input: 1000000	 Output: 981000	 Diff: 19000
RD ... Input: 1000000	 Output: 982000	 Diff: 18000
RD ... Input: 1000000	 Output: 983000	 Diff: 17000
RD ... Input: 1000000	 Output: 984000	 Diff: 16000
RD ... Input: 1000000	 Output: 985000	 Diff: 15000
RD ... Input: 1000000	 Output: 986000	 Diff: 14000
RD ... Input: 1000000	 Output: 987000	 Diff: 13000
RD ... Input: 1000000	 Output: 988000	 Diff: 12000
RD ... Input: 1000000	 Output: 989000	 Diff: 11000
RD ... Input: 1000000	 Output: 990000	 Diff: 10000
RD ... Input: 1000000	 Output: 991000	 Diff: 9000
RD ... Input: 1000000	 Output: 992000	 Diff: 8000
RD ... Input: 1000000	 Output: 993000	 Diff: 7000
RD ... Input: 1000000	 Output: 994000	 Diff: 6000
RD ... Input: 1000000	 Output: 995000	 Diff: 5000
RD ... Input: 1000000	 Output: 996000	 Diff: 4000
RD ... Input: 1000000	 Output: 997000	 Diff: 3000
RD ... Input: 1000000	 Output: 998000	 Diff: 2000
RD ... Input: 1000000	 Output: 999000	 Diff: 1000
Output Done!
FAILED

 

0 Kudos
Kevin_Xu_Intel
Employee
231 Views

File for the entire output attached here.

0 Kudos
Ric2
Beginner
220 Views

Hi,
thanks for having a look at the issues!
CPU speed will have influence on the issue generation (RAM size might be too), so for your machine the settings for reproduction might be different. When I set the gap number e.g. 5000 it runs through w/o issues on my machine too. So on you machine you need to set it higher, try 30000.

 

b.t.w. the full error I see is:

vsimk: src/hls_cosim_ipc_socket.cpp:133: virtual void IPCSocket::send(const void*, int): Assertion `0 && "send() failed"' failed.
# Attempting stack trace sig  6
# Signal caught: signo [0]
# vsim_stacktrace.vstf written
# Current time Mon Jun 17 16:48:02 2024
# Program = vsim
# Id = "2023.2"
# Version = "2023.04"
# Date = "Apr 11 2023"
# Platform = "linux_x86_64"
# Signature = 016838926890ae993a152c369dc4c1be
# --> START OF USERCODE
# 0    0x00007ffff78969fc: 'pthread_kill + 0x000000000000012c' in '/usr/lib/x86_64-linux-gnu/libc.so.6'
# 1    0x00007ffff7842476: 'raise + 0x0000000000000016' in '/usr/lib/x86_64-linux-gnu/libc.so.6'
# 2    0x00007ffff78287f3: 'abort + 0x00000000000000d3' in '/usr/lib/x86_64-linux-gnu/libc.so.6'
# <-- END OF USERCODE
# 3    0x00007ffff782871b: '<unknown (@0x7ffff782871b)>'
# 4    0x00007ffff7839e96: '<unknown (@0x7ffff7839e96)>'
# 5    0x00007ffff2818321: '<unknown (@0x7ffff2818321)>'
# --> START OF USERCODE
# 6    0x00007ffff2806c1b: 'SimulatorInterface::send_host_channel(void*, void*, bool*, bool*, unsigned int*) + 0x000000000000015b' in '/data1/intel/oneapi/compiler/2024.1/opt/oclfpga/host/linux64/lib/libaoc_cosim_msim.so'
# <-- END OF USERCODE
# 7    0x00007feff1fc4f1f: '../../ip/mpsim/dpic_Threshold/aoc_sim_component_dpi_controller_10/sim/aoc_sim_stream_sink_dpi_bfm.sv:38'
# 8    0x00007feff1fc65ef: '../../ip/mpsim/dpic_Threshold/aoc_sim_component_dpi_controller_10/sim/aoc_sim_stream_sink_dpi_bfm.sv:70'
# 9    0x00007feff1fc77ce: '../../ip/mpsim/dpic_Threshold/aoc_sim_component_dpi_controller_10/sim/aoc_sim_stream_sink_dpi_bfm.sv:185'
# 10   0x00000000023e53b2: '<unknown (@0x23e53b2)>'
# 11   0x00000000004df304: '<unknown (@0x4df304)>'
# 12   0x000000000074da63: '<unknown (@0x74da63)>'
# 13   0x0000000000ca58ad: '<unknown (@0xca58ad)>'
# 14   0x0000000000caabd0: '<unknown (@0xcaabd0)>'
# 15   0x0000000000cac54e: '<unknown (@0xcac54e)>'
# 16   0x0000000000f9bd2d: '<unknown (@0xf9bd2d)>'
# 17   0x000000000287a82d: '<unknown (@0x287a82d)>'
# 18   0x000000000287ec86: '<unknown (@0x287ec86)>'
# 19   0x0000000002880371: '<unknown (@0x2880371)>'
# 20   0x00000000028806d6: '<unknown (@0x28806d6)>'
# 21   0x0000000002881df3: '<unknown (@0x2881df3)>'
# 22   0x00000000028825b1: '<unknown (@0x28825b1)>'
# 23   0x0000000000c6c700: '<unknown (@0xc6c700)>'
# 24   0x0000000000c6e315: '<unknown (@0xc6e315)>'
# End of Stack Trace

0 Kudos
Kevin_Xu_Intel
Employee
168 Views

Hello, I was able to reproduce the simulation hang with a 60000 gap. I suspecting this has something to do with your issue (1), where the handling of reaching the buffer limit isn't done properly.  I will ask around to see if there is a work around.

0 Kudos
Kevin_Xu_Intel
Employee
70 Views

Hi Ric,

 

Sorry this took so long. I think the root of the problem we encountered with all of our workaround that uses pthread is that multithreading isn't fully supported in FPGA flow with oneAPI as it is mention in this release note.

 

I would suggest performing pipe read and write in the same loop:

q.single_task<Threshold>(ThresholdKernel{});

    // Check that output pixels are below the threshold
    bool passed = true;
    for (int i = 0; i < (width * height); ++i) {
      // Write to input pipe
      bool start_of_packet = (i == 0);
      bool end_of_packet = (i == ((width * height) - 1));
      StreamingBeatT in_beat(i, start_of_packet, end_of_packet);
      InPixelPipe::write(q, in_beat);
      
      if(i%1000 == 0)
        std::cerr << "Input: " << i << "\t " << (width * height) << std::endl; 
      
      // Read from output pipe
      StreamingBeatT out_beat = OutPixelPipe::read(q);
      passed &= (out_beat.data <= kThreshold);
      if(i%1000 == 0)
        std::cerr << "Output: " << i << "\t " << out_beat.data << std::endl;
    }

 Let me know if this could work for you,

 

Thanks,

0 Kudos
Reply