Possible memory corruption bug in async inference API (MYRIAD and HDDL plugins)

EdBordin · 04-18-2022

I was pulled away from investigating this for a while so I am continuing from the thread I started here: https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Possible-memory-corruption-and-or-race-condition-bug-in-async/m-p/1374322

I upgraded to openvino_2022.1.0.643 before continuing my investigation. The bug still seems to persist with both the MYRIAD and HDDL plugins.

Responding to the last couple of things Peh_Intel suggested to work around this:

1. "I’ve validated that running the test script with HDDL plugin (8 MYRIAD), the output results are smooth."

Did you leave it set to this?:

num_inference_requests = 4

I believe that if you set this to something larger than 8, you may be able to reproduce the bug on the HDDL plugin with your Vision Accelerator Design card too. I now have a UP AI Core X (1 VPU, mPCIe form factor) that I can use with the HDDL plugin, and I get the same behaviour as I got with the NCS 2 (I only changed the plugin in the test script to HDDL). This is on a different host from the other two I tested on before.

2. "Surprisingly, when enlarging the model input shape, the output results are also smooth. However, changing the model input shape may significantly affect its accuracy."

This classifier model architecture doesn't work with dynamic input sizes, and we would rather not make it larger at training time. I have also found that changing the input size and the model depth when I train the model affects whether I can trigger this error reliably. That doesn't really help me figure out how to avoid the error. It can also take many iterations before I catch the error, which makes it hard to be confident in the model before deploying it.
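Because the failure is intermittent, one way to build confidence before deploying is a soak test that feeds the same input many times and records which iterations return NaN or drift from a known-good output. The following is only a minimal sketch: `infer` is a hypothetical callable standing in for whatever wraps the async request pool, and the function name, tolerance and iteration count are assumptions, not part of OpenVINO.

```python
import math
from typing import Callable, List, Sequence


def count_bad_outputs(
    infer: Callable[[Sequence[float]], List[float]],  # hypothetical device wrapper
    sample: Sequence[float],
    reference: Sequence[float],
    iterations: int = 1000,
    tolerance: float = 1e-2,
) -> List[int]:
    """Return the iteration indices whose output is NaN or differs from the reference."""
    bad = []
    for i in range(iterations):
        output = infer(sample)
        # NaN must be checked separately: comparisons against NaN are always False.
        has_nan = any(math.isnan(v) for v in output)
        mismatch = len(output) != len(reference) or any(
            abs(o - r) > tolerance for o, r in zip(output, reference)
        )
        if has_nan or mismatch:
            bad.append(i)
    return bad
```

Running this with the reference taken from a configuration believed to be good (for example a single inference request) gives a failure count per N iterations, which is easier to compare across settings than waiting for a single error to show up.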

I noticed there are a few bugfix commits related to VPU since 2022.1 was released, so I might build OpenVINO from source as a next step, just in case this has already been fixed. Any further help would be much appreciated.

EdBordin · 04-20-2022

A few extra observations since I posted this:

1. Building the latest code on the master branch from source and using the MYRIAD plugin with the NCS 2 behaves the same (the HDDL plugin doesn't appear to be open source, so I couldn't test anything new there).

2. With TRACE logging enabled, I noticed a possible pattern. The cases that fail all seem to show this log:

[Trace  ][VPU][GraphCompiler]             Try to use CMX for HW inputs
[Trace  ][VPU][GraphCompiler]                 Try use CMX for Data [audio_in@FP16@adjust-strides]
[Trace  ][VPU][GraphCompiler]                     Allocation result : OK

The cases that run normally seem to show this log:

[Trace  ][VPU][GraphCompiler]             Try to use CMX for HW inputs
[Trace  ][VPU][GraphCompiler]                 Try use CMX for Data [audio_in@FP16@adjust-strides]
[Trace  ][VPU][GraphCompiler]                     Allocation result : DATA_FAILED

Trace logging was enabled by adding:

core.set_config(config={"LOG_LEVEL": "LOG_TRACE"}, device_name="MYRIAD")

I am not certain this is actually a general pattern; it could just be that I have not found a counterexample yet. But it would make sense if making the input 2x larger stopped the compiler from fitting the input layer into the CMX memory slice and so worked around a possible bug.
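To check this pattern across many runs instead of reading logs by eye, the allocation result for each data object could be extracted from a captured trace. This is a rough sketch that assumes the trace output has been saved as text and that each "Allocation result" line follows its "Try use CMX for Data [...]" line, as in the excerpts above; the function name is made up for illustration.

```python
import re
from typing import Dict, Iterable

DATA_RE = re.compile(r"Try use CMX for Data \[(?P<name>[^\]]+)\]")
RESULT_RE = re.compile(r"Allocation result\s*:\s*(?P<result>\w+)")


def cmx_allocation_results(lines: Iterable[str]) -> Dict[str, str]:
    """Map each data name to the next allocation result logged after it."""
    results = {}
    pending = None  # data name still waiting for its "Allocation result" line
    for line in lines:
        data = DATA_RE.search(line)
        if data:
            pending = data.group("name")
            continue
        result = RESULT_RE.search(line)
        if result and pending is not None:
            results[pending] = result.group("result")
            pending = None
    return results
```

Pairing the returned dictionary with whether that run failed would show quickly whether a run with `DATA_FAILED` ever fails, which is the counterexample that would rule the theory out.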

3. If I override MYRIAD_THROUGHPUT_STREAMS then the error seems to stop happening!

core.set_config(config={"MYRIAD_THROUGHPUT_STREAMS":"1"}, device_name="MYRIAD")

I might be happy with this as a workaround for now, but I still need to test whether it impacts performance significantly. This page says the default number of streams is 2. If my theory about the CMX slices has any merit, then perhaps the bug is a race condition between two streams contending for a CMX slice? Obviously someone at Intel is in a better position than I am to say whether that is plausible.

Peh_Intel · 04-20-2022

Hi EdBordin,

Thanks for sharing your findings with us.

I will pass this on to our development team for a better explanation of the behaviour. It might take some time, and I will get back to you once I receive feedback.

For your information, the HDDL plugin is not available in the open-source OpenVINO™ toolkit.

Regards,

Peh

EdBordin · 04-20-2022

Thanks for passing this on!

I ran a rough benchmark over 100 batches of 24 (currently our "worst case" batch size) and found that reducing the number of streams to 1 improves things slightly over the original workaround of only having 1 inference request, but it still takes about 28% longer on average than leaving things at the recommended settings. Adding more requests does not seem to make a difference when there is only one stream. So the workaround is still not ideal, but it is a slight improvement in throughput.

benchmark_batches: 100, time_secs: 70.3432936668396, batch_size: 24, num_inference_requests: 4, num_streams: 2, input_length: 10752
benchmark_batches: 100, time_secs: 105.7116174697876, batch_size: 24, num_inference_requests: 1, num_streams: 2, input_length: 10752
benchmark_batches: 100, time_secs: 89.37134289741516, batch_size: 24, num_inference_requests: 4, num_streams: 1, input_length: 10752
benchmark_batches: 100, time_secs: 90.13488292694092, batch_size: 24, num_inference_requests: 1, num_streams: 1, input_length: 10752
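The relative cost of each configuration can be read off these lines with a little arithmetic. A small sketch, assuming the first line (4 requests, 2 streams) is the baseline; the helper names are illustrative only.

```python
from typing import Dict, List

RAW_RESULTS = """\
benchmark_batches: 100, time_secs: 70.3432936668396, batch_size: 24, num_inference_requests: 4, num_streams: 2, input_length: 10752
benchmark_batches: 100, time_secs: 105.7116174697876, batch_size: 24, num_inference_requests: 1, num_streams: 2, input_length: 10752
benchmark_batches: 100, time_secs: 89.37134289741516, batch_size: 24, num_inference_requests: 4, num_streams: 1, input_length: 10752
benchmark_batches: 100, time_secs: 90.13488292694092, batch_size: 24, num_inference_requests: 1, num_streams: 1, input_length: 10752
"""


def parse_result(line: str) -> Dict[str, float]:
    """Turn one 'key: value, key: value' benchmark line into a dict of floats."""
    fields = {}
    for item in line.split(","):
        key, value = item.split(":")
        fields[key.strip()] = float(value)
    return fields


def slowdown_vs_baseline(raw: str, baseline_index: int = 0) -> List[float]:
    """Percent extra wall time of each run relative to the baseline run."""
    runs = [parse_result(line) for line in raw.splitlines() if line.strip()]
    base = runs[baseline_index]["time_secs"]
    return [100.0 * (run["time_secs"] / base - 1.0) for run in runs]


if __name__ == "__main__":
    for pct in slowdown_vs_baseline(RAW_RESULTS):
        print(f"{pct:+.1f}%")
```

On the figures above this works out to roughly 50% extra time for one request with two streams, and about 27% to 28% extra for the two single-stream runs, in line with the "about 28%" quoted.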

Edit: since I already had the build set up, I tried patching src/plugins/intel_myriad/graph_transformer/src/middleend/passes/adjust_data_location.cpp to never use CMX for the layers listed in the logs. Unfortunately, this does not stop the error from occurring, so the pattern in the logs may have just been a coincidence.

Hari_B_Intel · 12-08-2022

Hi EdBordin

We apologize for the delay in our response. The development team has not been able to reproduce the problem, and the NaN issue is not observed in the latest release, OpenVINO 2022.2. Are you still having this issue, and do you require any help?

Thank you

Hari_B_Intel · 12-14-2022

Hi EdBordin

Thank you for reporting this issue. We have submitted this to engineering for further investigation, but they cannot commit to a fix at this time and cannot give us a timeframe for a fix.

Since we cannot guarantee that we will receive a response from engineering within a certain period of time, we recommend closing this case.

We recommend you watch the release notes to stay up-to-date on the latest bug fixes. Sign up for product marketing emails to stay informed of the latest releases.

Thank you