
OpenCL multi-thread error: HAL Kern Error: Read failed from addr 20

jackgreen

Platform: Terasic DE10-Nano (Cyclone V SoC)

Software version: OpenCL SDK 18.1, MMD version 14.1

Problem description: I am developing a multi-threaded application (only one thread enqueues the NDRange and executes the FPGA kernel) and I get this error:

HAL Kern Error: Read failed from addr 20, read -1237012464 expected 4                                             
HAL Kern Error: Read failed from addr 20, read -1237012464 expected 4                                             
HAL Kern Error: Write failed to addr 1000 with value 0, wrote -1237012464 expected 4                              
HAL Kern Error: Write failed to addr 20 with value be80a2dc, wrote -1237012464 expected 4 

Sometimes the program is killed, and sometimes it is not.

I suspected that multi-threading was somehow the cause, so I tested Intel's official example, multithread_vector_operation (attached).

A similar error appears, but with only the read failure:

HAL Kern Error: Read failed from addr 20, read -1234415600 expected 4

I used gdb to debug this and found that SIG44 was received while two threads were in clWaitForEvents().

I noticed that someone had already asked a similar question and mentioned that it may be due to an obsolete MMD library. However, all the c5soc RTEs I have seen ship with MMD 14.1.

Can anyone shed some light on this problem, please?

Thanks in advance!

jackgreen

I am not sure how many active users are still on this forum. For what it's worth, I want to share how I solved this problem, in the hope that it helps others who run into it.

 

The problem does indeed lie in the MMD library. All current versions of the c5soc BSPs and RTEs use MMD version 14.1. The newest MMD is version 18.1, which I found in the Intel FPGA SDK for OpenCL Pro Edition 20.4. It seems this version was only just released.

 

In MMD 18.1, a comment in the source file describes a fix for this bug in 14.1:

// global variables used for handling multi-devices and its helper functions
// Use a DeviceMapManager to manage a heap-allocated map for storing device information
// instead of using a static global map because of a segmentation fault which occurs in
// the following situation:
// 1) Host program contains a global variable which calls clReleaseContext in its destructor.
// When the program ends the global goes out of scope and the destructor is called.
// 2) clReleaseContext calls a function in the MMD library which modifies the static global map in
// the MMD library.
// In this situation it was discovered that the destructor of the static global map is called before
// the destructor of the global in the host program, thus resulting in a segmentation fault when
// clReleaseContext calls a function that modifies the internal map after it has been destroyed.
// Using a heap-allocated map avoids this issue as the lifetime of the map persists until it is
// deleted or the process is completely terminated.

So once I removed my global variables, everything worked.
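For anyone who wants to see the pattern the MMD comment describes, here is a minimal sketch of it. This is not the actual MMD source: DeviceInfo, HostContext, and the member functions of DeviceMapManager are hypothetical names I made up for illustration, and the sketch simply never deletes the map, which may differ from what the real library does. The idea is that a map reached through a heap pointer has no static destructor, so a host-side object can still modify it from its own destructor during program exit.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the per-device record kept by the library.
struct DeviceInfo {
    std::string name;
};

// Hypothetical manager (illustrative only, not the real MMD class).
class DeviceMapManager {
public:
    // The map is allocated on the heap on first use and intentionally never
    // deleted, so no destructor runs for it during static destruction.
    static std::map<int, DeviceInfo>& map() {
        static std::map<int, DeviceInfo>* instance =
            new std::map<int, DeviceInfo>();
        return *instance;
    }

    static void add(int handle, const std::string& name) {
        map()[handle] = DeviceInfo{name};
    }

    // Returns true if an entry was removed. Because the map outlives every
    // static object, this stays valid even when called at program exit.
    static bool remove(int handle) {
        return map().erase(handle) == 1;
    }

    static std::size_t size() { return map().size(); }
};

// Hypothetical host-side object whose destructor touches the library state,
// standing in for a global that calls clReleaseContext in its destructor.
struct HostContext {
    int handle;
    explicit HostContext(int h) : handle(h) {
        DeviceMapManager::add(h, "device");
    }
    ~HostContext() { DeviceMapManager::remove(handle); }
};
```

With a plain static global std::map in place of the heap pointer, a global HostContext could be destroyed after the map, and its destructor would then touch a destroyed object. If you are stuck on MMD 14.1, the equivalent workaround on the host side is to create such objects inside main() (or release them explicitly before main() returns) rather than at global scope.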

AnilErinch_A_Intel

Hi,

Thanks for sharing your solution with the community.

Thanks and Regards

Anil

