Hi!
I've been trying out the NCS, and so far I've managed to run inference on a network that predicts a speaker's emotion from a spectrogram of their voice samples.
Now I'm trying to train and compile a network that tries to estimate the locations of a human's joints in a picture.
Thus far, it appears that the training is mildly successful and the loss decreases noticeably during training.
The problem is the outputs of the NCS.
At first I wasn't even able to compile the network, because FusedBatchNorm requires a convolution layer directly before the batch normalization.
After that was solved, the mean and variance variables weren't defined for batch normalization.
After I found out that the is_training flag must be set appropriately (corresponding to either training or testing), the network actually compiled.
The problem is with the outputs: they are always NaN.
I was able to get almost exactly the same results when training, testing, and running inference on the NCS. However, I had mistakenly set the is_training flag of the batch normalization layer to False during both training and testing. Although that version of the network finally worked on the NCS, the results were poor: apparently some neurons die and some outputs are 0, which is wrong.
So the whole problem with this architecture revolves around batch normalization and the is_training flag.
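To illustrate why the is_training flag matters so much here, a minimal NumPy sketch of the idea behind batch normalization (not the TensorFlow implementation): in training mode the layer normalizes with the current batch's statistics, while in inference mode it uses the moving averages accumulated during training. With the flag set wrongly, the layer normalizes with statistics that don't match the data:

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, gamma, beta, is_training, eps=1e-3):
    """Simplified batch normalization illustrating the is_training flag.

    Training mode normalizes with the current batch's statistics;
    inference mode uses the moving averages stored during training.
    """
    if is_training:
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        mean, var = moving_mean, moving_var
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# A batch whose statistics differ strongly from the stored moving averages:
x = np.array([[10.0], [12.0], [14.0]])
gamma, beta = 1.0, 0.0
moving_mean, moving_var = np.array([0.0]), np.array([1.0])

train_out = batch_norm(x, moving_mean, moving_var, gamma, beta, is_training=True)
infer_out = batch_norm(x, moving_mean, moving_var, gamma, beta, is_training=False)
# With is_training=True the batch is centered around 0; with False, the
# stale moving averages (mean 0, var 1) leave the outputs around 10-14.
```

In real TF code the moving averages are only updated while is_training is True, which is why training with the flag stuck at False produces a graph whose stored statistics never match the data.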
The code, the compiled graph and some testing images are located here: link
The author of the code from which mine is derived: link
I compiled the network with mvNCCompile graph.meta -w graph -is 200 200 -on 'output/Relu' -o GRAPH
One thing to note: I have observed that after very brief training (around 30 iterations), the outputs of the network are in fact not NaN but actual numbers (ranging from 0 to 200). After another 30 iterations of training, the outputs become NaN.
What could be the problem? The mean and variance values are in fact stored in the graph, and I can see them with a modified TensorFlowParser.py. Is it a problem in the last fully connected layer?
The fully connected layer adds a ReLU activation function by default, hence the output node name.
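For what it's worth, one way to narrow down where the NaNs first appear is to fetch the intermediate activations (e.g. with sess.run on each named tensor) and scan them in graph order. A small sketch with hypothetical layer names and hand-made activations:

```python
import numpy as np

def first_nan_layer(activations):
    """Return the name of the first layer whose activations contain NaN.

    `activations` is an ordered sequence of (layer_name, np.ndarray) pairs,
    e.g. collected by running sess.run() on each intermediate tensor.
    """
    for name, act in activations:
        if np.isnan(act).any():
            return name
    return None

# Hypothetical activations: the fully connected layer overflows first.
acts = [
    ("conv1/Relu", np.ones((1, 8))),                # healthy
    ("fc/BiasAdd", np.array([[np.inf - np.inf]])),  # inf - inf -> NaN
    ("output/Relu", np.array([[np.nan]])),          # NaN propagates onward
]
print(first_nan_layer(acts))  # -> fc/BiasAdd
```

The first layer that reports NaN is the one to inspect; everything downstream just propagates it.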
Thank you for your time!
@jooseph Can you try running mvNCCheck on both graph files (the one with about 30 iterations and the one that gives you nans) and paste the log here? Thanks.
Hey! Here is the output.
training for 20 iterations
mvNCCheck -s 12 graph.meta -on 'output/Relu'
USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 6 45.03
2) 10 43.75
3) 0 39.06
4) 15 38.38
5) 16 37.84
Expected: (1, 18)
1) 6 0.30425176
2) 10 0.23531708
3) 0 0.21833332
4) 16 0.20731162
5) 11 0.20570508
Obtained values
Obtained Min Pixel Accuracy: 14700.653076171875% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 9536.827087402344% (max allowed=1%), Fail
Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 10091.091104800238% (max allowed=1%), Fail
Obtained Global Sum Difference: 522.287353515625
training for 40 iterations
USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 17 nan
2) 16 nan
3) 1 nan
4) 2 nan
5) 3 nan
Expected: (1, 18)
1) 16 1.1342309
2) 6 1.0005219
3) 9 0.92981833
4) 1 0.91985106
5) 4 0.89293176
/home/joosep/.local/lib/python3.5/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
return umr_maximum(a, axis, None, out, keepdims)
/usr/local/bin/ncsdk/Controllers/Metrics.py:75: RuntimeWarning: invalid value encountered in greater
diff)) / total_values * 100
Obtained values
Obtained Min Pixel Accuracy: nan% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: nan% (max allowed=1%), Fail
Obtained Percentage of wrong values: 0.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: nan% (max allowed=1%), Fail
Obtained Global Sum Difference: nan
Markdown formatting makes the output look wonky.
@jooseph Have you checked out the Tensorflow Network Compliance Guide at https://movidius.github.io/ncsdk/tf_compile_guidance.html? This is the quick and dirty run down:
- Train your network. After it finishes, you will get an index file, a meta file and a weights file.
- Create a copy of your training code and remove all the training related code from it. (Please see TF Compliance Guide for more details.)
- Remove all placeholders except for the input tensor.
- Add a line to restore the previous session from your trained model.
- Run the new script; it should finish immediately and create another set of files that should be NCS-friendly.
Yes, I had the restore step interwoven in the start.py file.
For completeness' sake, I've created a new script that produces an NCS-friendly version of the graph.
Here is the code:
import tensorflow as tf
from densenet import DenseNet

sess = tf.Session()

# Rebuild the graph in inference mode: single-image batch, no training ops.
batch_size = 1
images_source = tf.placeholder(tf.float32, shape=[batch_size, 200, 200, 3], name='input')
dense_blocks = 5
growth_rate = 12  # filters per dense layer
model = DenseNet(images_source, dense_blocks, growth_rate, None, is_training=False)
y = tf.nn.relu(model.model, name='output')

# Restore the trained weights and re-save the inference-only graph for the NCS.
saver = tf.train.Saver(tf.global_variables())
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
saver.restore(sess, "./training/training")
saver.save(sess, "./ncsdk/graph")
Output after 20 iterations:
USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 0 0.2537
2) 4 0.2373
3) 7 0.2251
4) 16 0.2223
5) 9 0.2076
Expected: (1, 18)
1) 10 0.586495
2) 14 0.58181506
3) 13 0.52246326
4) 2 0.50935996
5) 16 0.4961934
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: 83.80004167556763% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 46.76774442195892% (max allowed=1%), Fail
Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 50.356128171269596% (max allowed=1%), Fail
Obtained Global Sum Difference: 4.937228679656982
------------------------------------------------------------
Output after 40 iterations:
USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 0 103.56
2) 9 69.44
3) 14 66.7
4) 8 66.25
5) 17 61.62
Expected: (1, 18)
1) 14 0.6836831
2) 10 0.65763766
3) 6 0.6033226
4) 16 0.6018172
5) 2 0.58865637
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: 15063.591003417969% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 7573.438262939453% (max allowed=1%), Fail
Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 8141.7633292979035% (max allowed=1%), Fail
Obtained Global Sum Difference: 932.0097045898438
------------------------------------------------------------
After 60:
USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 0 103.56
2) 9 69.44
3) 14 66.7
4) 8 66.25
5) 17 61.62
Expected: (1, 18)
1) 14 0.6836831
2) 10 0.65763766
3) 6 0.6033226
4) 16 0.6018172
5) 2 0.58865637
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: 15063.591003417969% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 7573.438262939453% (max allowed=1%), Fail
Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 8141.7633292979035% (max allowed=1%), Fail
Obtained Global Sum Difference: 932.0097045898438
------------------------------------------------------------
Correct output after 60 iterations (the previous one was accidentally the same as the 40-iteration output):
USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 18)
1) 17 nan
2) 16 nan
3) 1 nan
4) 2 nan
5) 3 nan
Expected: (1, 18)
1) 6 0.5064525
2) 10 0.49627382
3) 14 0.4664461
4) 12 0.45869178
5) 2 0.42014557
/home/joosep/.local/lib/python3.5/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
return umr_maximum(a, axis, None, out, keepdims)
/usr/local/bin/ncsdk/Controllers/Metrics.py:75: RuntimeWarning: invalid value encountered in greater
diff)) / total_values * 100
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: nan% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: nan% (max allowed=1%), Fail
Obtained Percentage of wrong values: 0.0% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: nan% (max allowed=1%), Fail
Obtained Global Sum Difference: nan
------------------------------------------------------------
I am running into a similar situation, except I am not getting NaN; instead, mvNCCheck reports a mismatch between the TensorFlow output and the NCS output.
I followed exactly the steps mentioned by @Tom_at_Intel above. My network consists of stacked depthwise convolution blocks (MobileNet-like).
I tried applying the TensorFlow optimization tools (freeze_graph.py, optimize_for_inference.py) with no luck. I cannot get any further.
This is a link to tensorflow checkpoint
Note
With the same code but a different architecture (a small plain convnet, VGG-style) I did not get this problem, which is very strange. Is there a problem with handling depthwise convolutions?
mvNCCheck Output
Cmd
mvNCCheck network.meta -in input -o Output -is 224 224 -cs 0,1,2
Output
USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1, 1, 200)
1) 23 0.014969
2) 69 0.013
3) 159 0.012276
4) 199 0.011681
5) 147 0.010986
Expected: (1, 200)
1) 69 0.014389
2) 23 0.013445
3) 159 0.0127474
4) 80 0.0114835
5) 119 0.0103248
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: 25.895529985427856% (max allowed=2%), Fail
Obtained Average Pixel Accuracy: 5.970443412661552% (max allowed=1%), Fail
Obtained Percentage of wrong values: 75.5% (max allowed=0%), Fail
Obtained Pixel-wise L2 error: 7.878222508418345% (max allowed=1%), Fail
Obtained Global Sum Difference: 0.17181694507598877
------------------------------------------------------------
@ahmed.ezzat Thanks for bringing this to our attention. We do have support for depthwise convolutions, as listed at https://github.com/movidius/ncsdk/releases/tag/v1.12.00.01, as well as support for various MobileNet variants.
However, there is an erratum where depthwise convolutions may not work if the channel multiplier is greater than 1. This doesn't seem to be the case with your network.
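For readers unfamiliar with the term: in a depthwise convolution, each input channel is convolved with its own set of channel_multiplier filters, so the output has in_channels * channel_multiplier channels. A minimal NumPy sketch of a "valid", stride-1 depthwise convolution (a naive loop for illustration, not the TF implementation):

```python
import numpy as np

def depthwise_conv2d(x, filters):
    """Naive 'valid', stride-1 depthwise convolution on NHWC input.

    filters has shape (kh, kw, in_channels, channel_multiplier); each
    input channel is convolved with its own channel_multiplier filters,
    giving in_channels * channel_multiplier output channels.
    """
    n, h, w, c = x.shape
    kh, kw, _, m = filters.shape
    out = np.zeros((n, h - kh + 1, w - kw + 1, c * m))
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = x[:, i:i + kh, j:j + kw, :]       # (n, kh, kw, c)
            # Multiply each channel by its own filters and sum the window.
            prod = patch[..., None] * filters         # (n, kh, kw, c, m)
            out[:, i, j, :] = prod.sum(axis=(1, 2)).reshape(n, c * m)
    return out

x = np.random.rand(1, 5, 5, 3)
f = np.random.rand(3, 3, 3, 1)  # channel_multiplier = 1, as in MobileNet
y = depthwise_conv2d(x, f)
print(y.shape)  # (1, 3, 3, 3): output channels = in_channels * multiplier
```

With a channel multiplier of 1 (the common MobileNet case) the output channel count equals the input channel count, which is the configuration the erratum above does not affect.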
Yes, I use 0.75 and 1. So, what do you think?
@ahmed.ezzat We don't have verified support for the DenseNet architecture yet, so that may be the cause of the problem. I don't see any other obvious issues with your network.