TensorFlow: NaN outputs after training n steps

idata · ‎01-30-2018

Hi!

I've been trying out the NCS, and so far I've managed to run inference on a network that tries to predict a speaker's emotion from a spectrogram of voice samples.

Now I'm trying to train and compile a network that estimates the locations of a person's joints in a picture.

So far the training appears to be mildly successful: the loss decreases noticeably.

The problem is the outputs of the NCS.

At first I wasn't even able to compile the network, because FusedBatchNorm required a convolutional layer before batch normalization.

After that was solved, the mean and variance variables for batch normalization weren't defined.

Once I found out that the is_training flag must be set appropriately (True for training, False for testing), the network actually compiled.

The problem is the outputs: they are always nan.

At one point I was able to get almost exactly the same results when training, testing and running inference on the NCS. However, I had mistakenly set the is_training flag of the batch normalization layers to False for both training and testing. Although that network finally worked on the NCS, the results were poor, apparently because some neurons die and some outputs are 0, which is wrong.

So the whole problem with this network architecture revolves around batch normalization and the is_training flag.
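For readers unfamiliar with the flag, the difference it makes can be sketched in plain NumPy. This is a minimal illustration and not the DenseNet code from this thread: the function name batch_norm, the momentum and eps defaults, and the random input data are all assumptions. In training mode the layer normalizes with the statistics of the current batch and updates its moving averages; in inference mode it only reads the stored moving averages.

```python
import numpy as np

def batch_norm(x, gamma, beta, moving_mean, moving_var,
               is_training, momentum=0.99, eps=1e-3):
    """Hypothetical batch norm over x of shape (batch, features).

    Returns the normalized output and the (possibly updated) moving statistics.
    """
    if is_training:
        # Training: use the statistics of this batch and update the moving averages.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:
        # Inference: use the stored moving averages and leave them untouched.
        mean, var = moving_mean, moving_var
    y = gamma * (x - mean) / np.sqrt(var + eps) + beta
    return y, moving_mean, moving_var

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(64, 4))   # invented activations
gamma, beta = np.ones(4), np.zeros(4)
mm, mv = np.zeros(4), np.ones(4)                   # freshly initialized moving statistics

y_train, mm_new, mv_new = batch_norm(x, gamma, beta, mm, mv, is_training=True)
y_infer, mm_same, mv_same = batch_norm(x, gamma, beta, mm, mv, is_training=False)

print("training mode, output mean: ", y_train.mean())
print("inference mode, output mean:", y_infer.mean())
```

If is_training stays False during training as well, the moving averages are never updated. Under the assumptions of this sketch the layer then normalizes with its initial statistics and passes its input through almost unchanged, which would be consistent with a network that runs on the NCS but performs poorly. The thread does not confirm that this is what happened.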

The code, the compiled graph and some testing images are located here: link

The author of the code from which mine is derived: link

I compiled the network with:

mvNCCompile graph.meta -w graph -is 200 200 -on 'output/Relu' -o GRAPH

One thing to note: I have observed that after very brief training (around 30 iterations), the outputs of the network are in fact not nan, but numbers (ranging from 0 to 200). After another 30 iterations of training, the outputs become nan.

What could be the problem? The mean and variance values are in fact stored in the graph and I can see them with a modified TensorFlowParser.py. Is it a problem in the last fully connected layer?

The fully connected layer adds a ReLU activation function by default, hence the name of the output node.
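One hypothesis worth testing, which nothing in this thread confirms: the NCS computes in half precision (FP16), whose largest finite value is 65504. If an intermediate activation grows past that limit as training progresses, it becomes inf on the device while staying finite in TensorFlow's float32, and a later inf - inf yields nan. A final ReLU does not remove the nan. The NumPy sketch below only illustrates that arithmetic; the activation values are invented.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)   # largest finite half-precision value

# Invented intermediate activations: fine in float32, too large for float16.
acts32 = np.array([200.0, 70000.0, -70000.0], dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):
    acts16 = acts32.astype(np.float16)       # the two large values overflow to +inf / -inf
    total32 = acts32.sum()                   # float32 keeps the finite result
    total16 = acts16.sum()                   # inf + (-inf) is nan
    relu16 = np.maximum(total16, np.float16(0))   # ReLU propagates nan

print("float16 max:", FP16_MAX)
print("float32 sum:", total32)
print("float16 sum:", total16)
print("after ReLU: ", relu16)
```

A way to check this on the real network would be to print the minimum and maximum of each layer's output in TensorFlow for the 20-iteration and 60-iteration checkpoints and see whether any value approaches the FP16 limit.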

Thank you for your time!

idata · ‎01-31-2018

@jooseph Can you try running mvNCCheck on both graph files (the one with about 30 iterations and the one that gives you nans) and paste the log here? Thanks.

idata · ‎02-01-2018

Hey! Here is the output.

Training for 20 iterations:

mvNCCheck -s 12 graph.meta -on 'output/Relu'

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 6 45.03
2) 10 43.75
3) 0 39.06
4) 15 38.38
5) 16 37.84
Expected: (1, 18)
1) 6 0.30425176
2) 10 0.23531708
3) 0 0.21833332
4) 16 0.20731162
5) 11 0.20570508
Obtained values
Obtained Min Pixel Accuracy: 14700.653076171875% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 9536.827087402344% (max allowed=1%), Fail
Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 10091.091104800238% (max allowed=1%), Fail
Obtained Global Sum Difference: 522.287353515625

Training for 40 iterations:

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 17 nan
2) 16 nan
3) 1 nan
4) 2 nan
5) 3 nan
Expected: (1, 18)
1) 16 1.1342309
2) 6 1.0005219
3) 9 0.92981833
4) 1 0.91985106
5) 4 0.89293176
/home/joosep/.local/lib/python3.5/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims)
/usr/local/bin/ncsdk/Controllers/Metrics.py:75: RuntimeWarning: invalid value encountered in greater
  diff)) / total_values * 100
Obtained values
Obtained Min Pixel Accuracy: nan% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: nan% (max allowed=1%), Fail
Obtained Percentage of wrong values: 0.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: nan% (max allowed=1%), Fail
Obtained Global Sum Difference: nan

Markdown formatting makes the output look wonky.

idata · ‎02-02-2018

@jooseph Have you checked out the TensorFlow Network Compliance Guide at https://movidius.github.io/ncsdk/tf_compile_guidance.html? This is the quick and dirty rundown:

1. Train your network. After it finishes, you will get an index file, a meta file and a weights file.

2. Create a copy of your training code and remove all the training-related code from it. (Please see the TF Compliance Guide for more details.)

3. Remove all placeholders except for the input tensor.

4. Add a line to restore the previous session from your trained model.

5. Run the new script. It should finish immediately and create another set of files that should be NCS-friendly.

idata · ‎02-03-2018

Yes, I had the restoring part interwoven in the start.py file.

For completeness' sake I've created a new script that creates an NCS-friendly version of the graph.

Here is the code:

import tensorflow as tf
from densenet import DenseNet

sess = tf.Session()
batch_size = 1
# The input placeholder is the only placeholder left in the inference graph.
images_source = tf.placeholder(tf.float32, shape=[batch_size, 200, 200, 3], name='input')
dense_blocks = 5
growth_rate = 12  # filters
# is_training=False is meant to put batch normalization into inference mode.
model = DenseNet(images_source, dense_blocks, growth_rate, None, is_training=False)
y = tf.nn.relu(model.model, name='output')

# Restore the trained weights, then save the graph again without any training ops.
saver = tf.train.Saver(tf.global_variables())
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
saver.restore(sess, "./training/training")
saver.save(sess, "./ncsdk/graph")

Output after 20 iterations:

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result:  (1, 1, 18)
1) 0 0.2537
2) 4 0.2373
3) 7 0.2251
4) 16 0.2223
5) 9 0.2076
Expected:  (1, 18)
1) 10 0.586495
2) 14 0.58181506
3) 13 0.52246326
4) 2 0.50935996
5) 16 0.4961934
------------------------------------------------------------
 Obtained values 
------------------------------------------------------------
 Obtained Min Pixel Accuracy: 83.80004167556763% (max allowed=2%), Fail
 Obtained Average Pixel Accuracy: 46.76774442195892% (max allowed=1%), Fail
 Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
 Obtained Pixel-wise L2 error: 50.356128171269596% (max allowed=1%), Fail
 Obtained Global Sum Difference: 4.937228679656982
------------------------------------------------------------

Output after 40 iterations:

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result:  (1, 1, 18)
1) 0 103.56
2) 9 69.44
3) 14 66.7
4) 8 66.25
5) 17 61.62
Expected:  (1, 18)
1) 14 0.6836831
2) 10 0.65763766
3) 6 0.6033226
4) 16 0.6018172
5) 2 0.58865637
------------------------------------------------------------
 Obtained values 
------------------------------------------------------------
 Obtained Min Pixel Accuracy: 15063.591003417969% (max allowed=2%), Fail
 Obtained Average Pixel Accuracy: 7573.438262939453% (max allowed=1%), Fail
 Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
 Obtained Pixel-wise L2 error: 8141.7633292979035% (max allowed=1%), Fail
 Obtained Global Sum Difference: 932.0097045898438
------------------------------------------------------------

After 60:

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result:  (1, 1, 18)
1) 0 103.56
2) 9 69.44
3) 14 66.7
4) 8 66.25
5) 17 61.62
Expected:  (1, 18)
1) 14 0.6836831
2) 10 0.65763766
3) 6 0.6033226
4) 16 0.6018172
5) 2 0.58865637
------------------------------------------------------------
 Obtained values 
------------------------------------------------------------
 Obtained Min Pixel Accuracy: 15063.591003417969% (max allowed=2%), Fail
 Obtained Average Pixel Accuracy: 7573.438262939453% (max allowed=1%), Fail
 Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
 Obtained Pixel-wise L2 error: 8141.7633292979035% (max allowed=1%), Fail
 Obtained Global Sum Difference: 932.0097045898438
------------------------------------------------------------

idata · ‎02-03-2018

Here is the correct output after 60 iterations (the previous one was accidentally a copy of the 40-iteration output):

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result:  (1, 1, 18)
1) 17 nan
2) 16 nan
3) 1 nan
4) 2 nan
5) 3 nan
Expected:  (1, 18)
1) 6 0.5064525
2) 10 0.49627382
3) 14 0.4664461
4) 12 0.45869178
5) 2 0.42014557
/home/joosep/.local/lib/python3.5/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims)
/usr/local/bin/ncsdk/Controllers/Metrics.py:75: RuntimeWarning: invalid value encountered in greater
  diff)) / total_values * 100
------------------------------------------------------------
 Obtained values 
------------------------------------------------------------
 Obtained Min Pixel Accuracy: nan% (max allowed=2%), Fail
 Obtained Average Pixel Accuracy: nan% (max allowed=1%), Fail
 Obtained Percentage of wrong values: 0.0% (max allowed=0%), Fail
 Obtained Pixel-wise L2 error: nan% (max allowed=1%), Fail
 Obtained Global Sum Difference: nan
------------------------------------------------------------
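Note that the nan runs report "Percentage of wrong values: 0.0%". The warning "invalid value encountered in greater" hints at why: any comparison against nan evaluates to False, so a count of elements whose error exceeds a threshold comes out as zero. The sketch below reproduces that effect. The helpers percent_wrong and has_only_finite_values are hypothetical and are not the actual code in Metrics.py; the numbers are borrowed from the logs above purely as sample data.

```python
import numpy as np

def percent_wrong(result, expected, tolerance=0.02):
    """Hypothetical metric: percentage of elements whose absolute error
    exceeds tolerance * max(|expected|)."""
    result = np.asarray(result, dtype=np.float64).ravel()
    expected = np.asarray(expected, dtype=np.float64).ravel()
    diff = np.abs(result - expected)
    threshold = tolerance * np.abs(expected).max()
    with np.errstate(invalid="ignore"):
        wrong = np.greater(diff, threshold)   # nan > threshold evaluates to False
    return float(wrong.sum()) / diff.size * 100

def has_only_finite_values(result):
    """Guard to run before trusting any error percentage."""
    return bool(np.all(np.isfinite(result)))

expected = np.array([0.5064525, 0.49627382, 0.4664461])
nan_output = np.full(3, np.nan)
large_output = np.array([103.56, 69.44, 66.7])

pct_nan = percent_wrong(nan_output, expected)
pct_large = percent_wrong(large_output, expected)

print("nan output:  ", pct_nan, "% wrong, finite:", has_only_finite_values(nan_output))
print("large output:", pct_large, "% wrong, finite:", has_only_finite_values(large_output))
```

In other words, a 0.0% figure next to nan values should be read as "not computable", not as "no wrong values".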

idata · ‎02-08-2018

I am running into a similar situation, except that I am not getting NaN; instead, mvNCCheck reports a mismatch between the TensorFlow output and the NCS output.

I followed exactly the steps mentioned by @Tom_at_Intel above. My network involves stacked depth_wise_conv blocks (MobileNet-like).

I tried to apply the TensorFlow optimization tools (freeze_graph.py, optimize_for_inference.py) with no luck. I cannot go further.

This is a link to the TensorFlow checkpoint

Note

With the same code but a different architecture (a small, plain VGG-style convnet) I did not get this problem, so it is very strange. Is there a problem with handling depthwise convolutions?

mvNCCheck Output

Cmd

mvNCCheck network.meta -in input -o Output -is 224 224 -cs 0,1,2

Output

    USB: Transferring Data...
    USB: Myriad Execution Finished
    USB: Myriad Connection Closing.
    USB: Myriad Connection Closed.
    Result:  (1, 1, 200)
    1) 23 0.014969
    2) 69 0.013
    3) 159 0.012276
    4) 199 0.011681
    5) 147 0.010986
    Expected:  (1, 200)
    1) 69 0.014389
    2) 23 0.013445
    3) 159 0.0127474
    4) 80 0.0114835
    5) 119 0.0103248
    ------------------------------------------------------------
     Obtained values 
    ------------------------------------------------------------
     Obtained Min Pixel Accuracy: 25.895529985427856% (max allowed=2%), Fail
     Obtained Average Pixel Accuracy: 5.970443412661552% (max allowed=1%), Fail
     Obtained Percentage of wrong values: 75.5% (max allowed=0%), Fail
     Obtained Pixel-wise L2 error: 7.878222508418345% (max allowed=1%), Fail
     Obtained Global Sum Difference: 0.17181694507598877
    ------------------------------------------------------------

idata · ‎02-09-2018

@ahmed.ezzat Thanks for bringing this to our attention. We actually have support for depth-wise convolutions, as listed at https://github.com/movidius/ncsdk/releases/tag/v1.12.00.01, as well as support for various MobileNet variants.

However, there is an erratum: depth-wise convolutions may not work if the channel multiplier is greater than 1. This doesn't seem to be the case with your network.
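To make the term concrete: in a depthwise convolution each input channel is filtered on its own, and the channel multiplier is the integer number of filters applied per input channel, so the output has in_channels * multiplier channels. That is a different quantity from MobileNet's width multiplier (values such as 0.75 or 1.0), which scales the number of filters per layer. The NumPy sketch below is an illustration only, with valid padding and stride 1; the function depthwise_conv2d here is a hypothetical helper, not the TensorFlow or NCSDK implementation.

```python
import numpy as np

def depthwise_conv2d(x, kernel):
    """Hypothetical depthwise convolution (valid padding, stride 1).

    x:      (H, W, C) input
    kernel: (kh, kw, C, M) filters, where M is the channel multiplier
    returns (H - kh + 1, W - kw + 1, C * M)
    """
    h, w, c = x.shape
    kh, kw, kc, m = kernel.shape
    if kc != c:
        raise ValueError("kernel and input disagree on the number of channels")
    out = np.zeros((h - kh + 1, w - kw + 1, c * m), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]                    # (kh, kw, C)
            # Each input channel is combined only with its own M filters.
            per_channel = np.einsum("abc,abcm->cm", patch, kernel)
            out[i, j, :] = per_channel.reshape(c * m)
    return out

x = np.ones((5, 5, 3), dtype=np.float32)
out_m1 = depthwise_conv2d(x, np.ones((3, 3, 3, 1), dtype=np.float32))
out_m2 = depthwise_conv2d(x, np.ones((3, 3, 3, 2), dtype=np.float32))

print("multiplier 1:", out_m1.shape)
print("multiplier 2:", out_m2.shape)
```

With a multiplier of 1 the channel count is unchanged; only a multiplier of 2 or more would fall under the erratum described above.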

idata · ‎02-09-2018

Yes, I use 0.75 and 1. So, what do you think?

idata · ‎02-09-2018

@ahmed.ezzat We don't have verified support for the DenseNet architecture yet, so this may be the cause of the problem. I don't see any obvious issues with your network.