
Senfter__Thomas · ‎02-05-2019

Hello

We have a hard-to-find problem with a Python script in which a CNN for license plate recognition runs on the Inference Engine. We are not sure what causes it, but it started when we upgraded from the alpha version of the Inference Engine to OpenVino 2018R5 (along with some code changes and a TensorFlow upgrade), so a connection to the Inference Engine is possible.

Problem description:

We have a Python script that runs 3 different CNNs using the Inference Engine from OpenVino 2018R5 on images from Ethernet cameras, which are retrieved with OpenCV VideoCapture. In addition, ZMQ is used to pass results to other programs. The hardware is either an Intel NUC7BNH, a NUC7DNH or a NUC8BEH (no freeze has been observed on the NUC8 so far). The OS is Ubuntu 16.04 with either the patched kernel 4.7.0.intel.r5.0 or kernel 4.15.0-15-generic (freezes happen less frequently with kernel 4.15). Several instances of the script run in separate Docker containers, alongside programs in other Docker containers.

What happens is that Linux freezes randomly after some time (sometimes after a few minutes, sometimes after a few hours, while two of them have now been running for many days without a problem). When it freezes, ACPI shutdown does not work, the screen freezes and even the Magic SysRq keys have no effect. A strange side effect is that a lot of network traffic is generated (so much that the network goes down and no PC on the switch can communicate). The logs (kern.log, syslog) show nothing unusual.

If anyone has observed a similar problem or has an idea what could cause this behavior, please let me know.

Greetings,

Thomas

Senfter__Thomas · ‎09-18-2019

Hi,

Sorry for not answering for some time. We thought "intel_idle.max_cstate=1" on kernel 4.19 had fixed the problem; however, we recently observed it again.
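For anyone who wants to confirm that a parameter such as "intel_idle.max_cstate=1" is actually active on a test machine, here is a minimal Python sketch. It assumes the parameters of the running kernel can be read from /proc/cmdline; the function names are only illustrative and not part of our script.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None.

    Simplification: quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def has_param(cmdline: str, name: str, value: str) -> bool:
    """True if `name=value` appears on the given command line."""
    return parse_cmdline(cmdline).get(name) == value


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the command line the running kernel was booted with."""
    return Path(path).read_text()
```

On a running system, has_param(read_cmdline(), "intel_idle.max_cstate", "1") would then tell whether the parameter was applied at boot.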

If you want to look into the problem again, we prepared an image of the whole 120GB SSD that reproduces the freeze consistently on a NUC7i3DNH and a NUC7i5DNH within an hour. We also observed a freeze on a NUC7i7BNH with this image; however, it took longer to freeze there.

Here you can find the image: https://drive.google.com/file/d/1gGz-92hfzjaDLK1IC0kQYchIuUimonSv/view?usp=sharing
We flashed the image on a 120GB SSD with: gunzip -c ZombieDisk.iso.gz | dd of=/dev/sda

username: alpr
password: dev

To start the "Zombie"-Creation just execute "./startthezombie.sh" in the home directory. You don't need network or anything else for this.

The attached zip archive contains our BIOS settings (images of the settings and a profile file, which you may be able to load).

If you have any questions, feel free to ask.

Greetings,
Thomas

Senfter__Thomas · ‎11-07-2019

Hi,

At the moment we are testing with the "acpi=off" kernel parameter. So far (half a week on 4 NUCs) we have had no zombie. This looks promising, but other kernel parameters achieved the same and the systems zombied later.

We also found another (probably related) problem: processes using OpenCL get stuck in kernel space. In this case the system stays responsive for some time as long as it is not doing anything. However, it may become unresponsive (e.g. when starting more containers; maybe it is triggered by a lot of disk reads/writes?), with a high load (CPU usage is low, so maybe the load is created by disk usage). Attached is the syslog output, where you can see where the processes stopped.
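To spot such stuck processes while the system is still responsive, something like the following Python sketch could be used. It assumes the stuck processes show up in state "D" (uninterruptible sleep) in /proc/<pid>/stat, which we have not confirmed for every case; the helper names are hypothetical.

```python
import os


def parse_stat(stat_line: str):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' instead of on whitespace.
    """
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]


def uninterruptible_processes(proc_root: str = "/proc"):
    """List (pid, comm) for processes currently in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

A process that stays in this list over several scans is a candidate for being stuck; a high load with low CPU usage is consistent with many processes in state "D", since Linux counts them towards the load average.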

Regards,

Thomas

Cobb__Tom · ‎12-05-2019

Hi Thomas,

Has the problem been fixed by acpi=off? I have a NUC8i5BE that is freezing once a week when being used as a standard desktop machine (no OpenCL applications). memtest86 shows no faulty RAM, and the problem seems to occur more often when using the graphics card rather than just sitting idle.

Thanks,

Tom

Senfter__Thomas · ‎02-13-2020

Hi Tom,

We didn't observe any "zombie" with acpi=off while testing on 3 NUC7s with kernel "4.7.0.intel.r5.0" for more than 3 months. We are not sure whether it really fixes the problem, but at least it definitely makes it less likely to happen.

Cheers,

Thomas

Senfter__Thomas · ‎03-03-2020

Hi,

Finally we were able to capture the network traffic. We could not capture anything with our PCs, but when capturing with a NUC (directly connected to the "Zombie-NUC" or with some, but not all, switches in between) we got the attached traffic dump (zipped because of file extension restrictions).

The "Zombie-NUC" continuously sends pause frames (https://en.wikipedia.org/wiki/Ethernet_flow_control#Pause_frame). Here is a rather old thread with the same or a similar bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=709616
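For reference, a pause frame can be recognized in a raw capture with a few lines of Python. This is only a sketch based on the standard frame layout (MAC control EtherType 0x8808, opcode 0x0001, followed by a 16-bit pause time); it assumes an untagged frame that starts at the destination MAC, and the function name is made up for illustration.

```python
import struct

MAC_CONTROL_ETHERTYPE = 0x8808  # EtherType of MAC control frames
PAUSE_OPCODE = 0x0001           # MAC control opcode for PAUSE


def parse_pause_frame(frame: bytes):
    """Return the pause time in quanta if `frame` is a PAUSE frame, else None.

    Layout assumed: 6 bytes destination MAC, 6 bytes source MAC, then
    EtherType, opcode and pause time as big-endian 16-bit values.
    The destination address is not checked here.
    """
    if len(frame) < 18:
        return None  # too short to hold the fields we need
    ethertype, opcode, quanta = struct.unpack("!HHH", frame[12:18])
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PAUSE_OPCODE:
        return None
    return quanta
```

One quantum is 512 bit times, and a sender that keeps repeating such frames keeps asking its link partner to stop transmitting.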

So it seems like a bug in the network driver, but the question remains why our old implementation with the old Inference Engine does not trigger this problem and the newer Inference Engine does, and also why intel_idle.max_cstate=1 and acpi=off make the problem much less likely to occur. Is there a problem with the power supply?

We also got the same problem with damaged RAM (memtest86 found a lot of errors), where a system crash (triggered with a Magic SysRq key) caused the problem, but damaged RAM can cause all kinds of random problems.

So I guess this problem is not directly related to OpenVino and we should ask for help somewhere else.

Thanks,
Thomas

Senfter__Thomas · ‎03-10-2020

Hi,

The "Zombie" can be split into 2 problems.

1) The freeze:
The kernel freezes and does not react to anything. We still do not know what causes this problem, but it seems to be related to intense GPU usage.
Setting the kernel parameter "intel_idle.max_cstate=1" reduces the likelihood of this problem, and "acpi=off" fixes it or at least makes it very unlikely (but then no hyperthreading is available).
However, setting "acpi=ht" (which disables all ACPI stuff except what is necessary for hyperthreading) also leads to freezes. So maybe the freeze is related to hyperthreading? We have no CPUs without hyperthreading available, so we cannot check this assumption.

2) The broken network caused by the pause frames:
We think the problem here is that the Ethernet device still receives messages while the kernel is frozen, and once the device's buffer is full it automatically sends pause frames under the default configuration. This assumption is supported by the fact that it takes a few seconds after plugging in the network cable for the network to go down.
We were able to fix this part of the problem by disabling flow control (see https://help.ubuntu.com/community/UbuntuLTSP/FlowControl).
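As an illustration of the flow control change, here is a minimal Python sketch. It assumes the setting is changed with ethtool's "-A" (pause) option and that the driver allows it; the interface name (for example "eth0") and the function names are placeholders, not taken from our setup.

```python
import subprocess


def pause_off_command(interface: str):
    """Build the ethtool call that turns off pause autonegotiation, RX and TX pause."""
    return ["ethtool", "-A", interface, "autoneg", "off", "rx", "off", "tx", "off"]


def disable_flow_control(interface: str) -> None:
    """Run the command; needs root privileges and an installed ethtool.

    Raises CalledProcessError if ethtool reports a failure.
    """
    subprocess.run(pause_off_command(interface), check=True)
```

A setting made this way does not survive a reboot, so it would have to be reapplied at startup (for example from the network configuration).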

So the current situation for us is a "Zombie" in quarantine ;)

Linux freezes when running Python script using the Inference Engine