Hi,
I'm getting a crash after a successful graph compilation in mvNCProfile:
```
root@myncsdocker:/inout# mvNCProfile -in input_node_2 -on training_2/concat -s 12 -is 512 480 mynetwork.pb
mvNCProfile v02.00, Copyright @ Intel Corporation 2017
shape: [1, 480, 512, 3]
res.shape: (1, 29, 31, 5)
TensorFlow output shape: (29, 31, 5)
/usr/local/bin/ncsdk/Controllers/FileIO.py:65: UserWarning: You are using a large type. Consider reducing your data sizes for best performance
Blob generated
USB: Transferring Data…
*** Error in `python3': malloc(): memory corruption: 0x000000000b0ba4d0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f02a363c7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8213e)[0x7f02a364713e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f02a3649184]
python3(PyObject_Malloc+0x157)[0x5d4597]
python3(PyBytes_FromStringAndSize+0x3f)[0x5c632f]
python3[0x4e9d28]
python3(_PyObject_GenericGetAttrWithDict+0x11d)[0x5941dd]
python3(PyEval_EvalFrameEx+0x44d)[0x536bcd]
python3(PyEval_EvalFrameEx+0x4b14)[0x53b294]
python3(PyEval_EvalFrameEx+0x4b14)[0x53b294]
python3[0x53fc97]
python3(PyEval_EvalFrameEx+0x50bf)[0x53b83f]
python3[0x53fc97]
python3(PyEval_EvalCode+0x1f)[0x5409bf]
python3[0x60cb42]
python3(PyRun_FileExFlags+0x9a)[0x60efea]
python3(PyRun_SimpleFileExFlags+0x1bc)[0x60f7dc]
python3(Py_Main+0x456)[0x640256]
python3(main+0xe1)[0x4d0001]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f02a35e5830]
python3(_start+0x29)[0x5d6999]
```
Full log at https://pastebin.com/0PKpSMZ6
Additional info: the Docker image was built from the GitHub repo (HEAD -> ncsdk2, tag: v2.08.01.02, origin/ncsdk2).
Even more info…
It turns out that the error varies seemingly randomly. It switches between:
*** Error in `python3': malloc(): memory corruption: 0x000000000b0ba4d0 ***
*** Error in `python3': double free or corruption (!prev): 0x000000000287e410 ***
*** Error in `python3': free(): invalid next size (normal): 0x0000000008dc9880 ***
So, a general memory overwrite problem.
Any hints and tips for how I can solve this problem? Can I enable more verbose debug output?
// Karl-Anders
I tracked it down to the line `myriad_output, userobj = fifoOut.read_elem()` in Controllers/MiscIO.py, if this helps shed any light on what is happening.
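For completeness, a minimal harness along the following lines (a sketch based on my reading of the NCSDK 2 mvncapi; the graph file name `mynetwork.graph` and the zero-filled input are placeholders) should show whether read_elem() also returns a truncated buffer when the profiler is taken out of the loop:

```python
# Sketch: load the compiled graph directly and check how many elements
# Fifo.read_elem() hands back. Assumes the NCSDK 2 mvncapi; the graph
# file name is a placeholder for a blob produced by mvNCCompile.
import numpy as np
from mvnc import mvncapi

device = mvncapi.Device(mvncapi.enumerate_devices()[0])
device.open()

with open('mynetwork.graph', 'rb') as f:
    graph_buffer = f.read()

graph = mvncapi.Graph('mynetwork')
fifo_in, fifo_out = graph.allocate_with_fifos(device, graph_buffer)

# Dummy input matching the reported input shape [1, 480, 512, 3].
input_tensor = np.zeros((1, 480, 512, 3), dtype=np.float32)
graph.queue_inference_with_fifo_elem(fifo_in, fifo_out, input_tensor, None)

output, user_obj = fifo_out.read_elem()
# Expected: 29 * 31 * 5 = 4495 elements for a (29, 31, 5) output.
print('elements returned:', output.size, 'dtype:', output.dtype)

fifo_in.destroy()
fifo_out.destroy()
graph.destroy()
device.close()
device.destroy()
```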
Sorry to keep spamming, but I now have a "final" observation about this:
Tracing further into read_elem, the first thing read is the "elementsize"; elementsize.value equals 1798, and that is the size the tensor "string" gets allocated with.
Now, 1798 does _not_ match the expected size of the output tensor:
...
res.shape: (1, 29, 31, 5)
TensorFlow output shape: (29, 31, 5)
...
1798 happens to be 29 x 31 x sizeof(FP16), so the dimension of size 5 has been lost somewhere.
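Spelling out the arithmetic (FP16 is 2 bytes per value):

```python
import numpy as np

fp16_bytes = np.dtype(np.float16).itemsize   # 2 bytes per FP16 value
expected = 29 * 31 * 5 * fp16_bytes          # 8990 bytes for a (29, 31, 5) output
observed = 29 * 31 * 1 * fp16_bytes          # 1798 bytes, what read_elem reports
print(expected, observed)                    # 8990 1798
```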
@Tome_at_Intel, perhaps you can point me in the right direction?
Thanks!
// Karl-Anders
Hi!
@Tome_at_Intel, sorry to reference you directly, but perhaps you can point me in the right direction?
mvNCProfile says:
...
shape: [1, 480, 512, 3]
res.shape: (1, 29, 31, 5)
TensorFlow output shape: (29, 31, 5)
...
…which is the expected output size. However, after the "USB: Transferring Data..." printout, I added some debug prints in the code talking to the hardware, and the line `myriad_output, userobj = fifoOut.read_elem()` returns the wrong amount of data: 1798 bytes. That corresponds to 29 x 31 x sizeof(FP16), so the output has gone from 5 values per "pixel" to 1 value.
Is there any way I can reverse engineer the graph file itself to figure out if it's _that_ one that misbehaves?
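In case it is useful to anyone looking at this: the NCSDK 2 API appears to expose the output tensor descriptors of a loaded graph, which should show what shape the blob itself thinks it produces. A sketch, untested; GraphOption.RO_OUTPUT_TENSOR_DESCRIPTORS and the generic attribute dump are assumptions on my part, and `mynetwork.graph` is again a placeholder:

```python
# Sketch: ask the loaded graph what it believes its output tensor looks
# like. If the channel count reported here is 1 instead of 5, the blob
# itself is at fault. Assumes the NCSDK 2 mvncapi and that
# GraphOption.RO_OUTPUT_TENSOR_DESCRIPTORS exists; attribute names on
# the descriptor objects may differ, so everything is dumped generically.
from mvnc import mvncapi

device = mvncapi.Device(mvncapi.enumerate_devices()[0])
device.open()

with open('mynetwork.graph', 'rb') as f:
    graph = mvncapi.Graph('mynetwork')
    fifo_in, fifo_out = graph.allocate_with_fifos(device, f.read())

for desc in graph.get_option(mvncapi.GraphOption.RO_OUTPUT_TENSOR_DESCRIPTORS):
    print({name: getattr(desc, name) for name in dir(desc)
           if not name.startswith('_')})

fifo_in.destroy()
fifo_out.destroy()
graph.destroy()
device.close()
device.destroy()
```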
Cheers, and thanks in advance!
// Karl-Anders