Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Tensorflow: NaN outputs after training n steps.

idata
Employee

Hi!

 

I've been trying out the NCS, and so far I've managed to run inference on a network that predicts a speaker's emotion from a spectrogram of their voice.

 

Now I'm trying to train and compile a network that estimates the locations of a person's joints in an image.

 

Thus far, it appears that the training is mildly successful and the loss decreases noticeably during training.

 

The problem is the outputs of the NCS.

 

At first I wasn't even able to compile the network, because FusedBatchNorm required a convolutional layer directly before the batch normalization.

 

After this was solved, the mean and variance variables weren't defined for batch normalization.

 

After I found out that the is_training flag must be set appropriately (to match training or testing), the network actually compiled.
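For illustration, here is a minimal sketch of what I mean by the is_training flag, using the contrib/slim batch_norm layer (this is not my actual network, just the pattern):

import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 200, 200, 3], name='input')
# A convolution directly before the batch norm, as the compiler expects
net = tf.contrib.layers.conv2d(x, 12, 3, activation_fn=None)
# is_training=True: normalize with batch statistics and update the moving averages.
# is_training=False: normalize with the stored moving_mean / moving_variance,
# which is what the graph built for the NCS should use.
net = tf.contrib.layers.batch_norm(net, is_training=False)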

 

The problem is with the outputs: they are always NaN.

 

I was able to get almost exactly the same results when training, testing and running inference on the NCS. However, I had mistakenly set the is_training flag of the batch normalization layer to False during both training and testing. Although that network did finally run on the NCS, the results were poor: apparently some neurons die, and some outputs are stuck at 0, which is wrong.

 

So the whole problem with this network architecture revolves around batch normalization and the is_training flag.

 

The code, the compiled graph and some testing images are located here: link

 

The author of the code from which mine is derived: link

 

I compiled the network with mvNCCompile graph.meta -w graph -is 200 200 -on 'output/Relu' -o GRAPH

 

One thing to note: I have observed that after very brief training (around 30 iterations), the outputs of the network are in fact not NaN, but numbers (ranging from 0 to 200). After another 30 iterations of training, the outputs become NaN.

 

What could be the problem? The mean and variance values are in fact stored in the graph and I can see them with a modified TensorFlowParser.py. Is it a problem in the last fully connected layer?
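To double-check that the stored values really are finite, the checkpoint can be inspected directly with something like this (a quick sketch; the checkpoint prefix is a placeholder for wherever the training files are saved):

import numpy as np
import tensorflow as tf

# Placeholder checkpoint prefix; point this at the trained model files
reader = tf.train.NewCheckpointReader('./training/training')
for name in reader.get_variable_to_shape_map():
    values = reader.get_tensor(name)
    # Flag any float variable (weights, moving_mean, moving_variance, ...) containing NaN/Inf
    if values.dtype.kind == 'f' and not np.all(np.isfinite(values)):
        print('non-finite values in', name)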

 

The fully connected layer adds a ReLU activation by default, hence the name of the output node.
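As a minimal illustration of why the node ends up named 'output/Relu' (assuming the contrib/slim fully_connected layer, whose activation_fn defaults to tf.nn.relu):

import tensorflow as tf

features = tf.placeholder(tf.float32, [1, 256], name='features')
# activation_fn defaults to tf.nn.relu, so the last op in the 'output' scope is a Relu
joints = tf.contrib.layers.fully_connected(features, 18, scope='output')
print(joints.op.name)  # typically 'output/Relu'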

 

Thank you for your time!

idata
Employee

@jooseph Can you try running mvNCCheck on both graph files (the one with about 30 iterations and the one that gives you NaNs) and paste the logs here? Thanks.

idata
Employee

Hey! Here is the output.

 

Training for 20 iterations:

mvNCCheck -s 12 graph.meta -on 'output/Relu'

USB: Transferring Data…
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 6 45.03
2) 10 43.75
3) 0 39.06
4) 15 38.38
5) 16 37.84
Expected: (1, 18)
1) 6 0.30425176
2) 10 0.23531708
3) 0 0.21833332
4) 16 0.20731162
5) 11 0.20570508
Obtained values
Obtained Min Pixel Accuracy: 14700.653076171875% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 9536.827087402344% (max allowed=1%), Fail
Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 10091.091104800238% (max allowed=1%), Fail
Obtained Global Sum Difference: 522.287353515625

Training for 40 iterations:

USB: Transferring Data…
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 17 nan
2) 16 nan
3) 1 nan
4) 2 nan
5) 3 nan
Expected: (1, 18)
1) 16 1.1342309
2) 6 1.0005219
3) 9 0.92981833
4) 1 0.91985106
5) 4 0.89293176
/home/joosep/.local/lib/python3.5/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims)
/usr/local/bin/ncsdk/Controllers/Metrics.py:75: RuntimeWarning: invalid value encountered in greater
  diff)) / total_values * 100
Obtained values
Obtained Min Pixel Accuracy: nan% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: nan% (max allowed=1%), Fail
Obtained Percentage of wrong values: 0.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: nan% (max allowed=1%), Fail
Obtained Global Sum Difference: nan

 

Markdown formatting makes the output look wonky.

idata
Employee

@jooseph Have you checked out the Tensorflow Network Compliance Guide at https://movidius.github.io/ncsdk/tf_compile_guidance.html? This is the quick-and-dirty rundown:

  1. Train your network. After it finishes, you will get an index file, a meta file and a weights file.
  2. Create a copy of your training code and remove all the training-related code from it. (Please see the TF Compliance Guide for more details.)
  3. Remove all placeholders except for the input tensor.
  4. Add a line to restore the previous session from your trained model.
  5. Run the new script; it should finish immediately and create another set of files that should be NCS-friendly.
idata
Employee

Yes, I had the restoring part interwoven into the start.py file.

 

For completeness' sake, I've created a new script to produce an NCS-friendly version of the graph.

 

Here is the code:

 

import glob
import numpy as np
import tensorflow as tf
import sys
import os
from utils import next_batch, load_images, load_labels
from densenet import DenseNet

sess = tf.Session()

batch_size = 1
images_source = tf.placeholder(tf.float32, shape=[batch_size, 200, 200, 3], name='input')

dense_blocks = 5
growth_rate = 12  # filters

model = DenseNet(images_source, dense_blocks, growth_rate, None, is_training=False)
y = tf.nn.relu(model.model, name='output')

saver = tf.train.Saver(tf.global_variables())
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
saver.restore(sess, "./training/training")
saver.save(sess, "./ncsdk/graph")

 

Output after 20 iterations:

 

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 0 0.2537
2) 4 0.2373
3) 7 0.2251
4) 16 0.2223
5) 9 0.2076
Expected: (1, 18)
1) 10 0.586495
2) 14 0.58181506
3) 13 0.52246326
4) 2 0.50935996
5) 16 0.4961934
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: 83.80004167556763% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 46.76774442195892% (max allowed=1%), Fail
Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 50.356128171269596% (max allowed=1%), Fail
Obtained Global Sum Difference: 4.937228679656982
------------------------------------------------------------

 

Output after 40 iterations:

 

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 0 103.56
2) 9 69.44
3) 14 66.7
4) 8 66.25
5) 17 61.62
Expected: (1, 18)
1) 14 0.6836831
2) 10 0.65763766
3) 6 0.6033226
4) 16 0.6018172
5) 2 0.58865637
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: 15063.591003417969% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 7573.438262939453% (max allowed=1%), Fail
Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 8141.7633292979035% (max allowed=1%), Fail
Obtained Global Sum Difference: 932.0097045898438
------------------------------------------------------------

 

After 60:

 

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 0 103.56
2) 9 69.44
3) 14 66.7
4) 8 66.25
5) 17 61.62
Expected: (1, 18)
1) 14 0.6836831
2) 10 0.65763766
3) 6 0.6033226
4) 16 0.6018172
5) 2 0.58865637
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: 15063.591003417969% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 7573.438262939453% (max allowed=1%), Fail
Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 8141.7633292979035% (max allowed=1%), Fail
Obtained Global Sum Difference: 932.0097045898438
------------------------------------------------------------
idata
Employee

Correct output after 60 iterations (the last one was accidentally the same as the 40-iteration output):

 

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 17 nan
2) 16 nan
3) 1 nan
4) 2 nan
5) 3 nan
Expected: (1, 18)
1) 6 0.5064525
2) 10 0.49627382
3) 14 0.4664461
4) 12 0.45869178
5) 2 0.42014557
/home/joosep/.local/lib/python3.5/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims)
/usr/local/bin/ncsdk/Controllers/Metrics.py:75: RuntimeWarning: invalid value encountered in greater
  diff)) / total_values * 100
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: nan% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: nan% (max allowed=1%), Fail
Obtained Percentage of wrong values: 0.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: nan% (max allowed=1%), Fail
Obtained Global Sum Difference: nan
------------------------------------------------------------
idata
Employee

I am running into a similar situation, except I am not getting NaNs; instead, mvNCCheck reports a mismatch between the TensorFlow output and the NCS output.

 

I followed exactly the steps mentioned by @Tom_at_Intel above. My network involves stacked depthwise convolution blocks (MobileNet-like).

 

I tried applying the TensorFlow optimization tools (freeze_graph.py, optimize_for_inference.py) with no luck, and I cannot get any further.
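For reference, the freezing step can also be done directly in Python instead of via the scripts; a rough sketch (the file names, checkpoint prefix and the 'Output' node name are placeholders, not my exact setup):

import tensorflow as tf

with tf.Session() as sess:
    # Restore the trained graph and weights from the checkpoint
    saver = tf.train.import_meta_graph('network.meta')
    saver.restore(sess, 'network')
    # Fold the variables into constants so that only the inference graph remains
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ['Output'])
    with tf.gfile.GFile('network_frozen.pb', 'wb') as f:
        f.write(frozen.SerializeToString())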

 

Here is a link to the TensorFlow checkpoint.

 

Note

 

With the same code but a different architecture (a small, plain VGG-style convnet) I did not get this problem, which is very strange. Is there a problem with handling depthwise convolutions?

 

mvNCCheck Output

 

Cmd

 

mvNCCheck network.meta -in input -o Output -is 224 224 -cs 0,1,2

 

Output

 

USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 200)
1) 23 0.014969
2) 69 0.013
3) 159 0.012276
4) 199 0.011681
5) 147 0.010986
Expected: (1, 200)
1) 69 0.014389
2) 23 0.013445
3) 159 0.0127474
4) 80 0.0114835
5) 119 0.0103248
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: 25.895529985427856% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 5.970443412661552% (max allowed=1%), Fail
Obtained Percentage of wrong values: 75.5% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 7.878222508418345% (max allowed=1%), Fail
Obtained Global Sum Difference: 0.17181694507598877
------------------------------------------------------------
idata
Employee

@ahmed.ezzat Thanks for bringing this to our attention. We do have support for depthwise convolutions, as listed at https://github.com/movidius/ncsdk/releases/tag/v1.12.00.01, as well as support for various MobileNet variants.

 

However, there is an erratum where depthwise convolutions may not work if the channel multiplier is greater than 1. That doesn't seem to be the case with your network.
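For clarity, the channel multiplier is the last dimension of the depthwise filter; a minimal sketch of where it appears:

import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 224, 224, 32])
# Depthwise filter shape is [height, width, in_channels, channel_multiplier];
# the erratum only applies when channel_multiplier > 1. With 1, as here, each
# input channel maps to exactly one output channel.
dw = tf.get_variable('depthwise_filter', [3, 3, 32, 1])
y = tf.nn.depthwise_conv2d(x, dw, strides=[1, 1, 1, 1], padding='SAME')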

idata
Employee

Yes, I use 0.75 and 1. So, what do you think?

idata
Employee

@ahmed.ezzat We don't have verified support for the DenseNet architecture yet, so this may be the cause of the problem. I don't see any obvious issues with your network.
