- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi, following best practice as descibed here we run into trouble with stalled writes.
When you create a testbench for a oneAPI kernel that you intend to compile as an IP core, write all your data to the host pipe before invoking the kernel.
There is a reproducer attached based on streaming data interface example.
As reducing the amount of data to process for simulation seems an easy workaround, this will not work if the design is more complex or bigger and the interface width is larger. Both conditions reduce the number of data samples before the buffer access stalls.
For our relevant design this means a too small size of simulateable data to get reasonable simulation results.
I have following questions:
1) Is this intented behaviour?
2) Is there a parameter to increase the accepted number of samples for simulation pipes?
3) Will this solution solve the simulation issue too?
Thanks for any feedback!
Ric.
oneAPI DPC++/C++ Compiler 2024.1.0 (2024.1.0.20240308), Ubuntu 22.04.4 LTS
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Having modified the write command to be non-blocking I notice that the command still blocks (the if condition is never satisfied):
bool success = false;
InPixelPipe::write(q, in_beat, success);
if (!success){
sleep(1);
std::cerr << i << "### STALL ###\n";
}
Documentation reads this:
Non-blocking writes add a bool argument in both host and device APIs that is passed by reference and returns true in this argument if the write was successful, and false if it was unsuccessful.
On the host:
// attempt non-blocking write from host to pipe until successful
while (!success) MyPipeInstance::write(q, data_element, success);
Remember, it's while running simulation. In emulation, the pipe is never stalling.
Btw, with i++ HLS we didn't run in this problem, even with very large simulation data sets.
Any advice? Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I was able to reproduce your issue, I think there is a buffer limit to host pipe in simulation, and since the host testbench is writing all of its data into the buffer before processing it, it stalled when the limit was hit.
I found that a work around for this is to perform the host pipe read and write in separated threads.
Attached design use pthread for threading the host testbench, and it should not effecting the IP you are working on.
Thanks,
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Kevin,
thanks for testing and proposing a work around. Unfortunatelly it helps for half of the problem only. The buffer depth limitation can be addressed this way.
However, as soon as we had implemented the changes we run into trouble again seeing this error massage sporadically:
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Aborted (core dumped)
Digging into the issue we found that the non-blocking pipe read causes the error as soon as an empty pipe is accessed to read.
To reproduce just generate e.g. 9 write samples and try to read 10.
With the print modification from below the output should look like this:
Running on device: Intel(R) FPGA Emulation Device
Input: 0 9
Input Done!
try read: 0 Output: 0 0 success: 1
try read: 1 Output: 1 1 success: 1
try read: 2 Output: 2 2 success: 1
try read: 3 Output: 3 3 success: 1
try read: 4 Output: 4 4 success: 1
try read: 5 Output: 5 5 success: 1
try read: 6 Output: 6 6 success: 1
try read: 7 Output: 7 7 success: 1
try read: 8 Output: 8 8 success: 1
try read: 9 terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Aborted (core dumped)
while (!read_success){
std::cerr << "try read: " << i << "\t ";
out_beat = OutPixelPipe::read(*(out_args->q), read_success);
std::cerr << "Output: " << i << "\t " << out_beat.data << "\tsuccess: " << (int)read_success << "\t " << std::endl;
}
Remember my second post here where some odd behaviour was reported for the non-blocking write command too.
Thanks for your help!
Ric.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Ric,
I did some testing, it dose looks like in emulation, a non-blocking read of a empty pipe causes crash. I will look into this problem.
Dose the crash happens to you in simulation? I was not able to reproduce the crash in simulation.
Thanks,
-Kevin
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Kevin,
I guess there are race conditions. I sporadically see issues in simulation too:
.../build$ ./streaming.fpga_sim
Running on device: SimulatorDevice : Multi-process Simulator (aclmsim0)
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): Enqueue process failed. -59 (PI_ERROR_INVALID_OPERATION)
Aborted (core dumped)
Please check the host pipe write command in your code. Modify the core to not accept data at it's input for a longer time (backpressure condition). Then the host will throw an error when the pipe is full but the host write is not handling this properly.
To reproduce, just modify line 57:
57 // while (!end_of_packet) {
57 while (end_of_packet) {
58 // Read in next pixel
As the reproducer is a bit of artificial, I see this issue in our production design too.
I hope this helps you to find the root cause of these problems!
Regards, Ric.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Ric,
I was able to reproduce this error. But i think it might be due to different reason.
If we change the while loop condition to be never to enter it, the compiler would optimized it way.
If I change the while condition to:
// while (!end_of_packet) {
for (int i = 0; i < 1; i ++){
To allows one packet to go though, the error goes away.
Do you have a different reproducer for the sporadically write issue?
Thanks,
-Kevin
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Kevin, you are right, the very simple reproducer did not show the effect. Same for the simple i++ loop as this will end the component without having consumed more data from the input pipe at a later time.
The updated reproducer is consuming input data with a large gap. I designed the gap to be large enougth to show that:
1) TB is pushing data to the input pipe until the related tread stalls as the write command is not correctly signalling FULL (the WR STALL print is never shown).
2) Simulation continuosly generates output data, restarts consuming input data after the gap but comes to a dead end as the input pipe is corrupted.
3) The final error comes from vsim but I guess the root cause is in the TB WR handling.
WR ... Input: 17000 Output: 51 Diff: 16949
WR ... Input: 18000 Output: 60 Diff: 17940
^CExiting simulation due to Interrupt
vsimk: src/hls_cosim_ipc_socket.cpp:133: virtual void IPCSocket::send(const void*, int): Assertion `0 && "send() failed"' failed.
Expected behavioural is that the TB WR thread hold on as long as WR pipe is full but restarts pushing data to it when the kernel has consumed some.
Regards, Ric.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Ric,
I have simulated you reproducer a several times, here are what i found:
- I see your issue number (1) where when a non-blocking pipe write is writing into a pipe that has hit the implicit capacity, the write failed signal isn't being raised. I will ask around if this is a known behaviors.
- I don't really see issue (2). The data seems to be processed in burst for me (about every ~17000) but it eventually reach the end (see the output). The simulation dose took about 1 hour with ModelSim pro. So if you are simulating with a starter edition, it is going to take very long time to simulate (may appears to be stalled).
- "vsimk: src/hls_cosim_ipc_socket.cpp:133: virtual void IPCSocket::send(const void*, int): Assertion `0 && "send() failed"' failed" I believe this is the expected error when stopping simulation with ctrl+C.
Simulation output (only shown the begining and the end)
Running on device: SimulatorDevice : Multi-process Simulator (aclmsim0)
WR ... Input: 0 Output: 0 Diff: 0
RD STALL WORKS!
WR ... Input: 1000 Output: 0 Diff: 1000
RD ... Input: 1274 Output: 0 Diff: 1274
WR ... Input: 2000 Output: 41 Diff: 1959
WR ... Input: 3000 Output: 99 Diff: 2901
WR ... Input: 4000 Output: 156 Diff: 3844
WR ... Input: 5000 Output: 222 Diff: 4778
WR ... Input: 6000 Output: 292 Diff: 5708
WR ... Input: 7000 Output: 366 Diff: 6634
WR ... Input: 8000 Output: 438 Diff: 7562
WR ... Input: 9000 Output: 513 Diff: 8487
WR ... Input: 10000 Output: 588 Diff: 9412
WR ... Input: 11000 Output: 662 Diff: 10338
WR ... Input: 12000 Output: 735 Diff: 11265
WR ... Input: 13000 Output: 809 Diff: 12191
WR ... Input: 14000 Output: 882 Diff: 13118
WR ... Input: 15000 Output: 955 Diff: 14045
RD ... Input: 15623 Output: 1000 Diff: 14623
WR ... Input: 16000 Output: 1029 Diff: 14971
WR ... Input: 17000 Output: 1097 Diff: 15903
WR ... Input: 18000 Output: 1159 Diff: 16841
... ...
WR ... Input: 982000 Output: 952124 Diff: 29876
RD ... Input: 982875 Output: 953000 Diff: 29875
WR ... Input: 983000 Output: 953126 Diff: 29874
RD ... Input: 983859 Output: 954000 Diff: 29859
WR ... Input: 984000 Output: 954142 Diff: 29858
RD ... Input: 984858 Output: 955000 Diff: 29858
WR ... Input: 985000 Output: 955143 Diff: 29857
RD ... Input: 985857 Output: 956000 Diff: 29857
WR ... Input: 986000 Output: 956147 Diff: 29853
RD ... Input: 986853 Output: 957000 Diff: 29853
WR ... Input: 987000 Output: 957148 Diff: 29852
RD ... Input: 987852 Output: 958000 Diff: 29852
WR ... Input: 988000 Output: 958149 Diff: 29851
RD ... Input: 988852 Output: 959000 Diff: 29852
WR ... Input: 989000 Output: 959150 Diff: 29849
RD ... Input: 989835 Output: 960000 Diff: 29835
WR ... Input: 990000 Output: 960166 Diff: 29834
RD ... Input: 990822 Output: 961000 Diff: 29822
WR ... Input: 991000 Output: 961178 Diff: 29822
RD ... Input: 991823 Output: 962000 Diff: 29823
WR ... Input: 992000 Output: 962178 Diff: 29822
RD ... Input: 992823 Output: 963000 Diff: 29823
WR ... Input: 993000 Output: 963179 Diff: 29821
RD ... Input: 993823 Output: 964000 Diff: 29823
WR ... Input: 994000 Output: 964177 Diff: 29823
RD ... Input: 994822 Output: 965000 Diff: 29822
WR ... Input: 995000 Output: 965179 Diff: 29821
RD ... Input: 995810 Output: 966000 Diff: 29810
WR ... Input: 996000 Output: 966199 Diff: 29801
RD ... Input: 996802 Output: 967000 Diff: 29802
WR ... Input: 997000 Output: 967198 Diff: 29802
RD ... Input: 997804 Output: 968000 Diff: 29804
WR ... Input: 998000 Output: 968197 Diff: 29803
RD ... Input: 998804 Output: 969000 Diff: 29804
WR ... Input: 999000 Output: 969197 Diff: 29803
RD ... Input: 999807 Output: 970000 Diff: 29807
Input Done!
RD ... Input: 1000000 Output: 971000 Diff: 29000
RD ... Input: 1000000 Output: 972000 Diff: 28000
RD ... Input: 1000000 Output: 973000 Diff: 27000
RD ... Input: 1000000 Output: 974000 Diff: 26000
RD ... Input: 1000000 Output: 975000 Diff: 25000
RD ... Input: 1000000 Output: 976000 Diff: 24000
RD ... Input: 1000000 Output: 977000 Diff: 23000
RD ... Input: 1000000 Output: 978000 Diff: 22000
RD ... Input: 1000000 Output: 979000 Diff: 21000
RD ... Input: 1000000 Output: 980000 Diff: 20000
RD ... Input: 1000000 Output: 981000 Diff: 19000
RD ... Input: 1000000 Output: 982000 Diff: 18000
RD ... Input: 1000000 Output: 983000 Diff: 17000
RD ... Input: 1000000 Output: 984000 Diff: 16000
RD ... Input: 1000000 Output: 985000 Diff: 15000
RD ... Input: 1000000 Output: 986000 Diff: 14000
RD ... Input: 1000000 Output: 987000 Diff: 13000
RD ... Input: 1000000 Output: 988000 Diff: 12000
RD ... Input: 1000000 Output: 989000 Diff: 11000
RD ... Input: 1000000 Output: 990000 Diff: 10000
RD ... Input: 1000000 Output: 991000 Diff: 9000
RD ... Input: 1000000 Output: 992000 Diff: 8000
RD ... Input: 1000000 Output: 993000 Diff: 7000
RD ... Input: 1000000 Output: 994000 Diff: 6000
RD ... Input: 1000000 Output: 995000 Diff: 5000
RD ... Input: 1000000 Output: 996000 Diff: 4000
RD ... Input: 1000000 Output: 997000 Diff: 3000
RD ... Input: 1000000 Output: 998000 Diff: 2000
RD ... Input: 1000000 Output: 999000 Diff: 1000
Output Done!
FAILED
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
File for the entire output attached here.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
thanks for having a look at the issues!
CPU speed will have influence on the issue generation (RAM size might be too), so for your machine the settings for reproduction might be different. When I set the gap number e.g. 5000 it runs through w/o issues on my machine too. So on you machine you need to set it higher, try 30000.
b.t.w. the full error I see is:
vsimk: src/hls_cosim_ipc_socket.cpp:133: virtual void IPCSocket::send(const void*, int): Assertion `0 && "send() failed"' failed.
# Attempting stack trace sig 6
# Signal caught: signo [0]
# vsim_stacktrace.vstf written
# Current time Mon Jun 17 16:48:02 2024
# Program = vsim
# Id = "2023.2"
# Version = "2023.04"
# Date = "Apr 11 2023"
# Platform = "linux_x86_64"
# Signature = 016838926890ae993a152c369dc4c1be
# --> START OF USERCODE
# 0 0x00007ffff78969fc: 'pthread_kill + 0x000000000000012c' in '/usr/lib/x86_64-linux-gnu/libc.so.6'
# 1 0x00007ffff7842476: 'raise + 0x0000000000000016' in '/usr/lib/x86_64-linux-gnu/libc.so.6'
# 2 0x00007ffff78287f3: 'abort + 0x00000000000000d3' in '/usr/lib/x86_64-linux-gnu/libc.so.6'
# <-- END OF USERCODE
# 3 0x00007ffff782871b: '<unknown (@0x7ffff782871b)>'
# 4 0x00007ffff7839e96: '<unknown (@0x7ffff7839e96)>'
# 5 0x00007ffff2818321: '<unknown (@0x7ffff2818321)>'
# --> START OF USERCODE
# 6 0x00007ffff2806c1b: 'SimulatorInterface::send_host_channel(void*, void*, bool*, bool*, unsigned int*) + 0x000000000000015b' in '/data1/intel/oneapi/compiler/2024.1/opt/oclfpga/host/linux64/lib/libaoc_cosim_msim.so'
# <-- END OF USERCODE
# 7 0x00007feff1fc4f1f: '../../ip/mpsim/dpic_Threshold/aoc_sim_component_dpi_controller_10/sim/aoc_sim_stream_sink_dpi_bfm.sv:38'
# 8 0x00007feff1fc65ef: '../../ip/mpsim/dpic_Threshold/aoc_sim_component_dpi_controller_10/sim/aoc_sim_stream_sink_dpi_bfm.sv:70'
# 9 0x00007feff1fc77ce: '../../ip/mpsim/dpic_Threshold/aoc_sim_component_dpi_controller_10/sim/aoc_sim_stream_sink_dpi_bfm.sv:185'
# 10 0x00000000023e53b2: '<unknown (@0x23e53b2)>'
# 11 0x00000000004df304: '<unknown (@0x4df304)>'
# 12 0x000000000074da63: '<unknown (@0x74da63)>'
# 13 0x0000000000ca58ad: '<unknown (@0xca58ad)>'
# 14 0x0000000000caabd0: '<unknown (@0xcaabd0)>'
# 15 0x0000000000cac54e: '<unknown (@0xcac54e)>'
# 16 0x0000000000f9bd2d: '<unknown (@0xf9bd2d)>'
# 17 0x000000000287a82d: '<unknown (@0x287a82d)>'
# 18 0x000000000287ec86: '<unknown (@0x287ec86)>'
# 19 0x0000000002880371: '<unknown (@0x2880371)>'
# 20 0x00000000028806d6: '<unknown (@0x28806d6)>'
# 21 0x0000000002881df3: '<unknown (@0x2881df3)>'
# 22 0x00000000028825b1: '<unknown (@0x28825b1)>'
# 23 0x0000000000c6c700: '<unknown (@0xc6c700)>'
# 24 0x0000000000c6e315: '<unknown (@0xc6e315)>'
# End of Stack Trace
o
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello, I was able to reproduce the simulation hang with a 60000 gap. I suspecting this has something to do with your issue (1), where the handling of reaching the buffer limit isn't done properly. I will ask around to see if there is a work around.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Ric,
Sorry this took so long. I think the root of the problem we encountered with all of our workaround that uses pthread is that multithreading isn't fully supported in FPGA flow with oneAPI as it is mention in this release note.
I would suggest performing pipe read and write in the same loop:
q.single_task<Threshold>(ThresholdKernel{});
// Check that output pixels are below the threshold
bool passed = true;
for (int i = 0; i < (width * height); ++i) {
// Write to input pipe
bool start_of_packet = (i == 0);
bool end_of_packet = (i == ((width * height) - 1));
StreamingBeatT in_beat(i, start_of_packet, end_of_packet);
InPixelPipe::write(q, in_beat);
if(i%1000 == 0)
std::cerr << "Input: " << i << "\t " << (width * height) << std::endl;
// Read from output pipe
StreamingBeatT out_beat = OutPixelPipe::read(q);
passed &= (out_beat.data <= kThreshold);
if(i%1000 == 0)
std::cerr << "Output: " << i << "\t " << out_beat.data << std::endl;
}
Let me know if this could work for you,
Thanks,
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Kevin_Xu_Intel,
Thanks for your support and helping in issue!
Hi Ric2,
I now transition this thread to community support. If you have a new question, Please login to ‘https://supporttickets.intel.com/s/?language=en_US’, view details of the desire request, and post a feed/response within the next 15 days to allow me to continue to support you. After 15 days, this thread will be transitioned to community support. The community users will be able to help you on your follow-up questions.
Thanks.
Regards,
Aik Eu

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page