Hello,
I recently purchased an Intel Movidius Neural Compute Stick 2 and I've managed to install OpenVINO on my Raspberry Pi following the instructions provided on the forum (https://software.intel.com/en-us/articles/OpenVINO-Install-RaspberryPI). What I'm trying to do now is convert my Keras model to a supported format so I can run it on the Movidius stick. First of all, is it possible to run a neural model that doesn't take an image as input?
Thank you in advance.
---
Hello Fotis,
> First of all, is it possible to run a neural model that doesn't take an image as an input?
OpenVINO supports this. "Other-than-image" input has worked fine in my products on both CPU and GPU devices, but I'm not sure if I've tried it on NCS2. I'll try tomorrow and update (I just have a build issue I need to resolve first before I can test on NCS2).
Cheers,
Nikos
---
Hi Nikos,
First of all thanks for the prompt reply.
The thing is that I have a Keras model for audio signal processing and I want to run it on my NCS2, connected to a Raspberry Pi. I have successfully installed OpenVINO on the Raspberry Pi (according to the instructions provided on the forum), so what I'm trying to do now is convert the Keras model so it can run on the NCS2. From what I understand, the model conversion is not possible on the Raspberry Pi itself, but even on an Ubuntu machine I'm still not sure how to convert an "other-than-image input" model.
Cheers,
Fotis
UPDATE: I managed to convert the model with the mo_tf.py script, but now I'm not sure how to run it on the NCS2. After converting the model to the format needed for the NCS2 (.bin and .xml files) I tried to load it on the Raspberry Pi, but when I run net = IENetwork(model="tf_model.bin", weights="tf_model.xml") I get the following error:
RuntimeError: Error reading network: input must have dimensions
---
Have you used the --input_shape parameter of mo_tf.py?
BTW, computer_vision_sdk_2018.5.445/deployment_tools/documentation/_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html has some examples of how to convert networks for speech.
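For a 1-D signal it could be something like this (the shape here is just an illustration, use your model's real input dimensions):

python3 ./mo_tf.py --input_model tf_model.pb --input_shape [1,1024] --data_type FP16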
---
Also please make sure you set the input/output precision correctly, for example:

input_data->setPrecision(Precision::U8);
input_data->setLayout(Layout::NCHW);  // ?
There are a few options:

/**
 * @enum Layout
 * @brief Layouts that the inference engine supports
 */
enum Layout : uint8_t {
    ANY = 0,        // "any" layout
    // I/O data layouts
    NCHW = 1,
    NHWC = 2,
    NCDHW = 3,
    NDHWC = 4,
    // weight layouts
    OIHW = 64,
    // bias layouts
    C = 96,
    // Single image layout (for mean image)
    CHW = 128,
    // 2D
    HW = 192,
    NC = 193,
    CN = 194,
    BLOCKED = 200,
};
---
First of all, yes: I wasn't able to convert the model at all without setting the input shape. Thanks, I'll check the examples for speech models. Regarding the precision parameters, where do I define these?
---
> Regarding the precision parameters, where do I define these?
In the inference application. For example, in the case of C++, see line 619 of
computer_vision_sdk_2018.5.445/deployment_tools/inference_engine/samples/speech_sample/main.cpp
where inputPrecision and layout are set:
/** configure input precision if model loaded from IR **/
for (auto &item : inputInfo) {
    Precision inputPrecision = Precision::FP32;  // specify Precision::I16 to provide quantized inputs
    item.second->setPrecision(inputPrecision);
    item.second->getInputData()->layout = NC;    // row major layout
}
nikos
---
Hi Niko,
The thing is that I can't even load the model to specify these parameters or change anything. I'm also using Python, so I'm trying to figure out what's going on, since there are no examples for speech. I tried many different things during the model conversion, but I'm still getting the "input must have dimensions" error when I try to load the model.
Thanks
EDIT: I finally got it to work after specifying certain parameters for the conversion. I just need to figure out how to run the model now; I'll post if I have any more issues.
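My plan for running it, pieced together from the Python samples, is roughly the following (untested yet, and the file names are just placeholders):

from openvino.inference_engine import IENetwork, IEPlugin
import numpy as np

# note: model= takes the .xml and weights= takes the .bin
net = IENetwork(model="tf_model.xml", weights="tf_model.bin")
plugin = IEPlugin(device="MYRIAD")
exec_net = plugin.load(network=net)

input_blob = next(iter(net.inputs))
out_blob = next(iter(net.outputs))

# dummy input with the network's input shape (assuming the inputs dict
# exposes .shape in this API version); the real audio frame goes here
data = np.zeros(net.inputs[input_blob].shape, dtype=np.float32)
res = exec_net.infer(inputs={input_blob: data})
output = res[out_blob]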
---
Hello again,
So, I was able to convert the model and run it on the NCS2 on the Raspberry Pi, but so far I'm getting noisy outputs and I don't know the cause. First of all, I used data type FP16 to convert the model and run it on the NCS2 (it wasn't possible with FP32), but I've noticed that the output has 'float32' dtype whatever the input data type is. How can I check the input/output precision of the converted model in Python?
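The only thing I've found to poke at so far is something like this, but I'm not sure it's the right way (and the attributes may differ between API versions):

from openvino.inference_engine import IENetwork

net = IENetwork(model="tf_model.xml", weights="tf_model.bin")
for name, info in net.inputs.items():
    print("input: ", name, info.precision, info.shape)
for name, info in net.outputs.items():
    print("output:", name, info.precision, info.shape)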
---
Hello Foti,
It may be better to validate FP32 on the CPU device first and then move to NCS2 FP16 (-d MYRIAD); there would be fewer deltas and it would be easier to track discrepancies.
Sorry, I'm not sure about the Python API and input/output precision or validation options. My end-to-end workflow uses C++ and offers the flexibility to adjust precision and also to validate and compare results against my reference implementation. I'm sure the Python API allows all this, but I've never used it :-)
Cheers,
nikos
---
Thanks again Niko!
I'm not able to test FP32 on the Raspberry Pi because there is no CPU plugin for it right now (only MYRIAD). I've checked the input/output layers of the FP16 model and they are FP16 precision; however, the output that I'm getting from the NCS2 is in 'float32' format. I'm also still getting a kind of periodic noise in the output...
However, I found that there is a number of unsupported layers (Const) reported by the MYRIAD plugin. What should I do in this case?
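(For reference, I listed the unsupported layers with roughly the following, borrowed from the samples; I'm not sure it's the intended way for MYRIAD:)

from openvino.inference_engine import IENetwork, IEPlugin

net = IENetwork(model="tf_model.xml", weights="tf_model.bin")
plugin = IEPlugin(device="MYRIAD")
supported = plugin.get_supported_layers(net)
unsupported = [l for l in net.layers.keys() if l not in supported]
print(unsupported)  # this is where the Const layers show up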
---
Hi Foti,
> not able to test the FP32 on a raspberry pi because there is no plugin for CPU right now (only for MYRIAD).
Perhaps you could try on an x86 Linux or even Windows platform. Validating FP32 on CPU is essential in your case before moving to the Pi and NCS, for a number of reasons.
> they are FP16 precision, however the output that I am getting from the NCS2 is in 'float32'
That's not an issue. I also get FP32 out from FP16 inference. Again, this becomes irrelevant in the case of FP32 validation.
> However, I found that there is a number of unsupported layers (Const) by the plugin for MYRIAD. What should I do in this case?
I think this is the most important issue. Then again, CPU FP32 may have no such issues; you need to test CPU FP32 and see if the layers are supported there.
FWIW, in my experience a smoother validation and dev workflow is:
Native -> CPU FP32 -> run validation app
CPU FP32 -> GPU FP16 -> validate FP16
GPU FP16 -> NCS FP16 -> validate on NCS
It is slower but makes it easier to track issues.
If all that fails, then compare results layer by layer as done in another post (https://software.intel.com/en-us/forums/computer-vision/topic/801760 by Nikolaev, Viktor).
Cheers,
Nikos
---
Hi Niko,
Yesterday I tried running the FP32 model on an Ubuntu machine using the CPU, but I got a "buffer overrun" error (when I tried to load the network to the plugin). I looked around for a solution but didn't find anything. I guess I'll try the layer-by-layer comparison to see what happens. Thanks!
Cheers,
Fotis
---
Sorry, it's hard to see what the issue is without more information on the Model Optimizer parameters or the workflow in general. Are you converting frozen or non-frozen TensorFlow models, or using Caffe or something else?
If Caffe, supported layers are listed in https://software.intel.com/en-us/articles/OpenVINO-Using-Caffe#caffe-supported-layers
TensorFlow supported layers are in https://software.intel.com/en-us/articles/OpenVINO-Using-TensorFlow#tensorflow-supported-layers
> Keras model
I can see Keras, so I'm assuming you are on the TensorFlow backend and freeze to a .pb.
For TF custom layers, if needed, there is good documentation on how to offload them, but I'm not sure it would make sense in terms of performance in the case of Pi+NCS. Some info in https://software.intel.com/en-us/articles/OpenVINO-ModelOptimizer#Tensorflow-models-with-custom-layers
Some ideas from DeepSpeech may help in case more mo_tf.py parameters are needed. Sometimes it is not straightforward to convert TF to IR, and it could be the case here that you just need one more parameter and the problem will be solved.
https://software.intel.com/en-us/articles/OpenVINO-Using-tensorflow (also see section: Supported Layers and the Mapping to Intermediate Representation Layers)
To generate the DeepSpeech Intermediate Representation (IR), provide the TensorFlow DeepSpeech model to the Model Optimizer with these parameters:
python3 ./mo_tf.py --input_model path_to_model/output_graph.pb \
    --freeze_placeholder_with_value input_lengths->[16] \
    --input input_node,previous_state_h/read,previous_state_c/read \
    --input_shape [1,16,19,26],[1,2048],[1,2048] \
    --output raw_logits,lstm_fused_cell/Gather,lstm_fused_cell/Gather_1
nikos
---
Do you think the "buffer overrun" error that I got for the FP32 model could be caused by an incorrect conversion?
To go into more detail, I'm converting a Keras model to an IR representation and I tried it both with a frozen and a non-frozen model. I'm specifying the input layer name and size and the output layer name in the conversion command (as shown in the example), but I'll experiment a bit with the parameters tomorrow to see if that makes a difference.
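In case it matters, the freezing step looks more or less like this (file names are placeholders):

import tensorflow as tf
from keras import backend as K
from keras.models import load_model

model = load_model("model.h5")  # placeholder file name
sess = K.get_session()
# bake the variables into constants so mo_tf.py gets a frozen graph
frozen = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(),
    [out.op.name for out in model.outputs])
tf.train.write_graph(frozen, ".", "tf_model.pb", as_text=False)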
Fotis
---
> could be caused because of an incorrect conversion?
Yes, assuming you have no unsupported layers, I think it is possible that a conversion parameter issue is causing the inference engine buffer issue when loading weights. Coincidentally, I also got the same error two weeks ago and fixed it, but I don't remember the exact cause, poor short-term memory :-) I think it was related to input shape or NCHW vs. NHWC, but that was with 2D images, not the 1D case.
nikos
---
Hello again Niko,
I tried changing all the different parameters during the conversion, but I still get a buffer overrun error when I try to run the FP32 model on the CPU.
Additionally, I changed the 'ReLU' layers in my Keras model and now most of the unsupported layers in the FP16 model for MYRIAD are gone, but I still get the input layer reported as unsupported, and the same noisy output. I was wondering what the correct representation of the input layer for a model on MYRIAD is, because it's weird that the input layer is unsupported.
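The swap in the model definition was essentially this (layer sizes here are placeholders, not my real ones):

from keras.layers import Input, Conv1D, LeakyReLU

x = Input(shape=(1024, 1))  # placeholder input shape
# before: activation fused into the conv layer
# x = Conv1D(64, 31, padding='same', activation='relu')(x)
# after: a separate LeakyReLU layer, which converted cleanly for me
x = Conv1D(64, 31, padding='same')(x)
x = LeakyReLU(alpha=0.3)(x)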
I also tried to convert and try out the DeepSpeech model mentioned above, but when I do the conversion I get the following error:
[ ERROR ] -------------------------------------------------
[ ERROR ] ----------------- INTERNAL ERROR ----------------
[ ERROR ] Unexpected exception happened.
[ ERROR ] Please contact Model Optimizer developers and forward the following information:
[ ERROR ] Exception occurred during running replacer "None (<class 'extensions.front.tf.BlockLSTM.BlockLSTM'>)": 7
[ ERROR ] Traceback (most recent call last):
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/mo/utils/class_registration.py", line 114, in apply_replacements
replacer.find_and_replace_pattern(graph)
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/mo/front/common/replacement.py", line 125, in find_and_replace_pattern
apply_pattern(graph, action=self.replace_sub_graph, **self.pattern())
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/mo/middle/pattern_match.py", line 95, in apply_pattern
action(graph, match)
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/mo/front/common/replacement.py", line 189, in replace_sub_graph
self.replace_output_edges(graph, self.gen_output_edges_match(node, self.replace_op(graph, node)))
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/extensions/front/tf/BlockLSTM.py", line 84, in replace_op
[graph.remove_edge(node.in_node(p).id, node.id) for p, input_data in node.in_nodes().items() if p in [5, 6, 7]]
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/extensions/front/tf/BlockLSTM.py", line 84, in <listcomp>
[graph.remove_edge(node.in_node(p).id, node.id) for p, input_data in node.in_nodes().items() if p in [5, 6, 7]]
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/mo/graph/graph.py", line 329, in in_node
return self.in_nodes(control_flow=control_flow)[key]
KeyError: 7
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/mo/main.py", line 325, in main
return driver(argv)
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/mo/main.py", line 267, in driver
mean_scale_values=mean_scale)
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/mo/pipeline/tf.py", line 248, in tf2nx
class_registration.apply_replacements(graph, class_registration.ClassType.FRONT_REPLACER)
File "/opt/intel/computer_vision_sdk_2018.5.445/deployment_tools/model_optimizer/mo/utils/class_registration.py", line 127, in apply_replacements
)) from err
Exception: Exception occurred during running replacer "None (<class 'extensions.front.tf.BlockLSTM.BlockLSTM'>)": 7
[ ERROR ] ---------------- END OF BUG REPORT --------------
[ ERROR ] -------------------------------------------------
EDIT: I finally managed to (maybe) get a correct output from the converted model using the MYRIAD plugin. I used the "--disable_nhwc_to_nchw" parameter in the conversion and now I don't see the noisy output. However, I now get a new list of unsupported layers, and most importantly the IR model suddenly got really slow (around 190 ms per iteration). What could be the cause? Also, if I compare the two xml files (before and after adding "--disable_nhwc_to_nchw") I see different dimensions for each layer.
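The conversion command now looks roughly like this (names and shape are placeholders):

python3 ./mo_tf.py --input_model tf_model.pb --input_shape [1,1024,1] --data_type FP16 --disable_nhwc_to_nchw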
---
Hello Foti,
Good find with the --disable_nhwc_to_nchw parameter. Just for the record, were you now able to run on CPU FP32 and get valid results?
> However, I now get a new list of unsupported layers
Are you using LSTM? I'm not sure if it is supported or validated yet for MYRIAD. I will ask this question in my old post (https://software.intel.com/en-us/forums/computer-vision/topic/755432).
Based on 2018 R5 release notes:
New Features in the 2018 R5 include:
Extends neural network support to include LSTM (long short-term memory) from ONNX*, TensorFlow* & MXNet* frameworks, & 3D convolutional-based networks in preview mode (CPU-only) to support additional, new use cases beyond computer vision.
> and the most important part is that the IR model suddenly got really slow (it takes around 190 ms for 1 iteration).
For that you may want to use the profiler that reports ms per layer, to get a better idea of what slows down the execution. Of course, functionality first; that's a much higher priority.
nikos
---
Hi Niko,
No, even with the --disable_nhwc_to_nchw parameter the model doesn't work on the CPU. I tried every possible parameter, but I'm still getting the "cannot create internal buffer. buffer can be overrun" error, so I don't know how to proceed with this.
The unsupported layers are again of type "Const" but originate from the conv layers of the original model. What I did before to remove the unsupported layers was to train the model with a LeakyReLU instead, but now I don't really know how to substitute the convolution layers.
The thing is that the model now works on MYRIAD (I'll verify the output tomorrow, but at first glance I think it produces a correct output), but it is really slow. How could I at least find the cause of this?
---
Try to get performance counts (us per layer) using get_perf_counts():
perf_counts = infer_request_handle.get_perf_counts()
log.info("Performance counters:")
print("{:<70} {:<15} {:<15} {:<15} {:<10}".format('name', 'layer_type', 'exec_type', 'status', 'real_time, us'))
for layer, stats in perf_counts.items():
    print("{:<70} {:<15} {:<15} {:<15} {:<10}".format(layer, stats['layer_type'], stats['exec_type'], stats['status'], stats['real_time']))
Some examples in
grep perf ./computer_vision_sdk/deployment_tools/inference_engine/samples/python_samples/*
or check the Python API docs if more information is needed on performance counters.
cheers
nikos
---
nikos wrote:
> try to get performance counts (us per layer) using get_perf_counts.
Hi Niko,
This is what I got with the performance counters:
[ INFO ] Performance counters:

name  layer_type  exec_type  status  real_time, us
LeakyReLU_  ReLU  LeakyRelu  EXECUTED  80
LeakyReLU_1178  ReLU  LeakyRelu  EXECUTED  60
LeakyReLU_1179  ReLU  LeakyRelu  EXECUTED  51
LeakyReLU_1180  ReLU  LeakyRelu  EXECUTED  71
LeakyReLU_1181  ReLU  LeakyRelu  EXECUTED  31
LeakyReLU_1182  ReLU  LeakyRelu  EXECUTED  60
LeakyReLU_1183  ReLU  LeakyRelu  EXECUTED  31
LeakyReLU_1184  ReLU  LeakyRelu  EXECUTED  50
LeakyReLU_1185  ReLU  LeakyRelu  EXECUTED  35
LeakyReLU_1186  ReLU  LeakyRelu  EXECUTED  36
LeakyReLU_1187  ReLU  LeakyRelu  EXECUTED  48
LeakyReLU_1188  ReLU  LeakyRelu  EXECUTED  51
LeakyReLU_1189  ReLU  LeakyRelu  EXECUTED  46
Receive-Tensor  Receive-Tensor  Receive-Tensor  EXECUTED  0
main_input_noisy@FP16  <Extra>  Convert_f32f16  EXECUTED  54
model_1/G_gtlayer/add  Eltwise  Sum  EXECUTED  53
model_1/G_gtlayer/add/Broadcast/  Tile  Tile  EXECUTED  41465
model_1/G_gtlayer/add/Broadcast/Reshape/After  Reshape  Reshape  EXECUTED  2647
model_1/G_gtlayer/add/Broadcast/Reshape/Before  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/G_gtlayer/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  369
model_1/G_gtlayer/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  23
model_1/G_gtlayer/convolution/Conv2D/Permute_1124  Permute  Permute  EXECUTED  56
model_1/G_gtlayer/convolution/ExpandDims  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/G_gtlayer/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/concatenate_1/concat@0@compact  Concat  Copy  EXECUTED  23
model_1/concatenate_1/concat@1@compact  Concat  Copy  EXECUTED  10
model_1/concatenate_2/concat@0@compact  Concat  Copy  EXECUTED  22
model_1/concatenate_2/concat@1@compact  Concat  Copy  EXECUTED  10
model_1/concatenate_3/concat@0@compact  Concat  Copy  EXECUTED  22
model_1/concatenate_3/concat@1@compact  Concat  Copy  EXECUTED  11
model_1/concatenate_4/concat@0@compact  Concat  Copy  EXECUTED  23
model_1/concatenate_4/concat@1@compact  Concat  Copy  EXECUTED  11
model_1/concatenate_5/concat@0@compact  Concat  Copy  EXECUTED  29
model_1/concatenate_5/concat@1@compact  Concat  Copy  EXECUTED  16
model_1/concatenate_6/concat@0@compact  Concat  Copy  EXECUTED  35
model_1/concatenate_6/concat@1@compact  Concat  Copy  EXECUTED  26
model_1/conv1d_1/add  Eltwise  Sum  EXECUTED  39
model_1/conv1d_1/add/Broadcast/  Tile  Tile  EXECUTED  20753
model_1/conv1d_1/add/Broadcast/Reshape/After  Reshape  Reshape  EXECUTED  1325
model_1/conv1d_1/add/Broadcast/Reshape/Before  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_1/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  200
model_1/conv1d_1/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  50
model_1/conv1d_1/convolution/Conv2D/Permute_1128  Permute  Permute  EXECUTED  44
model_1/conv1d_1/convolution/ExpandDims  Reshape  Reshape  EXECUTED  25
model_1/conv1d_1/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_2/add  Eltwise  Sum  EXECUTED  37
model_1/conv1d_2/add/Broadcast/  Tile  Tile  EXECUTED  20769
model_1/conv1d_2/add/Broadcast/Reshape/After  Reshape  Reshape  EXECUTED  683
model_1/conv1d_2/add/Broadcast/Reshape/Before  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_2/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  179
model_1/conv1d_2/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  49
model_1/conv1d_2/convolution/Conv2D/Permute_1132  Permute  Permute  EXECUTED  54
model_1/conv1d_2/convolution/ExpandDims  Reshape  Reshape  EXECUTED  24
model_1/conv1d_2/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_3/add  Eltwise  Sum  EXECUTED  36
model_1/conv1d_3/add/Broadcast/  Tile  Tile  EXECUTED  10383
model_1/conv1d_3/add/Broadcast/Reshape/After  Reshape  Reshape  EXECUTED  348
model_1/conv1d_3/add/Broadcast/Reshape/Before  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_3/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  191
model_1/conv1d_3/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  45
model_1/conv1d_3/convolution/Conv2D/Permute_1136  Permute  Permute  EXECUTED  38
model_1/conv1d_3/convolution/ExpandDims  Reshape  Reshape  EXECUTED  23
model_1/conv1d_3/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_4/add  Eltwise  Sum  EXECUTED  37
model_1/conv1d_4/add/Broadcast/  Tile  Tile  EXECUTED  10384
model_1/conv1d_4/add/Broadcast/Reshape/After  Reshape  Reshape  EXECUTED  185
model_1/conv1d_4/add/Broadcast/Reshape/Before  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_4/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  218
model_1/conv1d_4/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  42
model_1/conv1d_4/convolution/Conv2D/Permute_1140  Permute  Permute  EXECUTED  38
model_1/conv1d_4/convolution/ExpandDims  Reshape  Reshape  EXECUTED  22
model_1/conv1d_4/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_5/add  Eltwise  Sum  EXECUTED  38
model_1/conv1d_5/add/Broadcast/  Tile  Tile  EXECUTED  5202
model_1/conv1d_5/add/Broadcast/Reshape/After  Reshape  Reshape  EXECUTED  99
model_1/conv1d_5/add/Broadcast/Reshape/Before  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_5/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  209
model_1/conv1d_5/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  42
model_1/conv1d_5/convolution/Conv2D/Permute_1144  Permute  Permute  EXECUTED  38
model_1/conv1d_5/convolution/ExpandDims  Reshape  Reshape  EXECUTED  23
model_1/conv1d_5/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_6/add  Eltwise  Sum  EXECUTED  37
model_1/conv1d_6/add/Broadcast/  Tile  Tile  EXECUTED  5213
model_1/conv1d_6/add/Broadcast/Reshape/After  Reshape  Reshape  EXECUTED  56
model_1/conv1d_6/add/Broadcast/Reshape/Before  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_6/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  189
model_1/conv1d_6/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  41
model_1/conv1d_6/convolution/Conv2D/Permute_1148  Permute  Permute  EXECUTED  38
model_1/conv1d_6/convolution/ExpandDims  Reshape  Reshape  EXECUTED  24
model_1/conv1d_6/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv2d_transpose_1/BiasAdd  ScaleShift  ScaleShift  EXECUTED  37
model_1/conv2d_transpose_1/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  610
model_1/conv2d_transpose_1/conv2d_transpose/Permute_  Permute  Permute  EXECUTED  57
model_1/conv2d_transpose_1/conv2d_transpose/Permute_1152  Permute  Permute  EXECUTED  53
model_1/conv2d_transpose_2/BiasAdd  ScaleShift  ScaleShift  EXECUTED  35
model_1/conv2d_transpose_2/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  1251
model_1/conv2d_transpose_2/conv2d_transpose/Permute_  Permute  Permute  EXECUTED  70
model_1/conv2d_transpose_2/conv2d_transpose/Permute_1156  Permute  Permute  EXECUTED  61
model_1/conv2d_transpose_3/BiasAdd  ScaleShift  ScaleShift  EXECUTED  35
model_1/conv2d_transpose_3/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  1392
model_1/conv2d_transpose_3/conv2d_transpose/Permute_  Permute  Permute  EXECUTED  97
model_1/conv2d_transpose_3/conv2d_transpose/Permute_1160  Permute  Permute  EXECUTED  62
model_1/conv2d_transpose_4/BiasAdd  ScaleShift  ScaleShift  EXECUTED  38
model_1/conv2d_transpose_4/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  1373
model_1/conv2d_transpose_4/conv2d_transpose/Permute_  Permute  Permute  EXECUTED  97
model_1/conv2d_transpose_4/conv2d_transpose/Permute_1164  Permute  Permute  EXECUTED  89
model_1/conv2d_transpose_5/BiasAdd  ScaleShift  ScaleShift  EXECUTED  37
model_1/conv2d_transpose_5/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  2680
model_1/conv2d_transpose_5/conv2d_transpose/Permute_  Permute  Permute  EXECUTED  151
model_1/conv2d_transpose_5/conv2d_transpose/Permute_1168  Permute  Permute  EXECUTED  69
model_1/conv2d_transpose_6/BiasAdd  ScaleShift  ScaleShift  EXECUTED  40
model_1/conv2d_transpose_6/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  5247
model_1/conv2d_transpose_6/conv2d_transpose/Permute_  Permute  Permute  EXECUTED  103
model_1/conv2d_transpose_6/conv2d_transpose/Permute_1172  Permute  Permute  EXECUTED  98
model_1/conv2d_transpose_7/BiasAdd  Power  Power  EXECUTED  40
model_1/conv2d_transpose_7/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  10310
model_1/conv2d_transpose_7/conv2d_transpose/Permute_  Permute  Permute  EXECUTED  159
model_1/conv2d_transpose_7/conv2d_transpose/Permute_1176  Permute  Permute  EXECUTED  22
model_1/g_output/Reshape  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/g_output/Reshape@FP16  <Extra>  Convert_f16f32  EXECUTED  32
model_1/reshape_1/Reshape  Reshape  Reshape  EXECUTED  62
model_1/reshape_10/Reshape  Reshape  Reshape  EXECUTED  1336
model_1/reshape_11/Reshape  Reshape  Reshape  EXECUTED  232
model_1/reshape_12/Reshape  Reshape  Reshape  EXECUTED  2659
model_1/reshape_13/Reshape  Reshape  Reshape  EXECUTED  293
model_1/reshape_2/Reshape  Reshape  Reshape  EXECUTED  107
model_1/reshape_3/Reshape  Reshape  Reshape  EXECUTED  127
model_1/reshape_4/Reshape  Reshape  Reshape  EXECUTED  189
model_1/reshape_5/Reshape  Reshape  Reshape  EXECUTED  241
model_1/reshape_6/Reshape  Reshape  Reshape  EXECUTED  351
model_1/reshape_7/Reshape  Reshape  Reshape  EXECUTED  373
model_1/reshape_8/Reshape  Reshape  Reshape  EXECUTED  693
model_1/reshape_9/Reshape  Reshape  Reshape  EXECUTED  388
And this is what I get from the converted model if I don't use the --disable_nhwc_to_nchw flag:
[ INFO ] Performance counters:

name  layer_type  exec_type  status  real_time, us
LeakyReLU_  ReLU  LeakyRelu  EXECUTED  64
LeakyReLU_1129  ReLU  LeakyRelu  EXECUTED  44
LeakyReLU_1130  ReLU  LeakyRelu  EXECUTED  28
LeakyReLU_1131  ReLU  LeakyRelu  EXECUTED  80
LeakyReLU_1132  ReLU  LeakyRelu  EXECUTED  60
LeakyReLU_1133  ReLU  LeakyRelu  EXECUTED  67
LeakyReLU_1134  ReLU  LeakyRelu  EXECUTED  33
LeakyReLU_1135  ReLU  LeakyRelu  EXECUTED  28
LeakyReLU_1136  ReLU  LeakyRelu  EXECUTED  50
LeakyReLU_1137  ReLU  LeakyRelu  EXECUTED  44
LeakyReLU_1138  ReLU  LeakyRelu  EXECUTED  44
LeakyReLU_1139  ReLU  LeakyRelu  EXECUTED  51
LeakyReLU_1140  ReLU  LeakyRelu  EXECUTED  33
Receive-Tensor  Receive-Tensor  Receive-Tensor  EXECUTED  0
main_input_noisy@FP16  <Extra>  Convert_f32f16  EXECUTED  56
model_1/G_gtlayer/add  ScaleShift  ScaleShift  EXECUTED  83
model_1/G_gtlayer/convolution/Conv2D  Convolution  Conv  EXECUTED  576
model_1/G_gtlayer/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  106
model_1/G_gtlayer/convolution/ExpandDims  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/G_gtlayer/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/concatenate_1/concat@0@compact  Concat  Copy  EXECUTED  24
model_1/concatenate_1/concat@1@compact  Concat  Copy  EXECUTED  10
model_1/concatenate_2/concat@0@compact  Concat  Copy  EXECUTED  22
model_1/concatenate_2/concat@1@compact  Concat  Copy  EXECUTED  11
model_1/concatenate_3/concat@0@compact  Concat  Copy  EXECUTED  23
model_1/concatenate_3/concat@1@compact  Concat  Copy  EXECUTED  10
model_1/concatenate_4/concat@0@compact  Concat  Copy  EXECUTED  24
model_1/concatenate_4/concat@1@compact  Concat  Copy  EXECUTED  12
model_1/concatenate_5/concat@0@compact  Concat  Copy  EXECUTED  28
model_1/concatenate_5/concat@1@compact  Concat  Copy  EXECUTED  20
model_1/concatenate_6/concat@0@compact  Concat  Copy  EXECUTED  35
model_1/concatenate_6/concat@1@compact  Concat  Copy  EXECUTED  26
model_1/conv1d_1/add  ScaleShift  ScaleShift  EXECUTED  57
model_1/conv1d_1/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  209
model_1/conv1d_1/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  43
model_1/conv1d_1/convolution/ExpandDims  Reshape  Reshape  EXECUTED  322
model_1/conv1d_1/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_2/add  ScaleShift  ScaleShift  EXECUTED  54
model_1/conv1d_2/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  181
model_1/conv1d_2/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  38
model_1/conv1d_2/convolution/ExpandDims  Reshape  Reshape  EXECUTED  182
model_1/conv1d_2/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_3/add  ScaleShift  ScaleShift  EXECUTED  46
model_1/conv1d_3/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  198
model_1/conv1d_3/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  37
model_1/conv1d_3/convolution/ExpandDims  Reshape  Reshape  EXECUTED  249
model_1/conv1d_3/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_4/add  ScaleShift  ScaleShift  EXECUTED  46
model_1/conv1d_4/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  213
model_1/conv1d_4/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  36
model_1/conv1d_4/convolution/ExpandDims  Reshape  Reshape  EXECUTED  209
model_1/conv1d_4/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_5/add  ScaleShift  ScaleShift  EXECUTED  42
model_1/conv1d_5/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  214
model_1/conv1d_5/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  36
model_1/conv1d_5/convolution/ExpandDims  Reshape  Reshape  EXECUTED  195
model_1/conv1d_5/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv1d_6/add  ScaleShift  ScaleShift  EXECUTED  40
model_1/conv1d_6/convolution/Conv2D  Convolution  Im2ColConvolution  EXECUTED  192
model_1/conv1d_6/convolution/Conv2D/Permute_  Permute  Permute  EXECUTED  36
model_1/conv1d_6/convolution/ExpandDims  Reshape  Reshape  EXECUTED  106
model_1/conv1d_6/convolution/Squeeze  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/conv2d_transpose_1/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  612
model_1/conv2d_transpose_1/conv2d_transpose@biases  Deconvolution  Bias  EXECUTED  41
model_1/conv2d_transpose_2/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  1258
model_1/conv2d_transpose_2/conv2d_transpose@biases  Deconvolution  Bias  EXECUTED  41
model_1/conv2d_transpose_3/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  1400
model_1/conv2d_transpose_3/conv2d_transpose@biases  Deconvolution  Bias  EXECUTED  42
model_1/conv2d_transpose_4/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  1381
model_1/conv2d_transpose_4/conv2d_transpose@biases  Deconvolution  Bias  EXECUTED  42
model_1/conv2d_transpose_5/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  2683
model_1/conv2d_transpose_5/conv2d_transpose@biases  Deconvolution  Bias  EXECUTED  42
model_1/conv2d_transpose_6/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  5253
model_1/conv2d_transpose_6/conv2d_transpose@biases  Deconvolution  Bias  EXECUTED  44
model_1/conv2d_transpose_7/conv2d_transpose  Deconvolution  Deconvolution  EXECUTED  10313
model_1/conv2d_transpose_7/conv2d_transpose@biases  Deconvolution  Bias  EXECUTED  46
model_1/g_output/Reshape  Reshape  Reshape  OPTIMIZED_OUT  0
model_1/g_output/Reshape@FP16  <Extra>  Convert_f16f32  EXECUTED  27
model_1/reshape_1/Reshape  Reshape  Reshape  EXECUTED  60
model_1/reshape_10/Reshape  Reshape  Reshape  EXECUTED  1340
model_1/reshape_10/Reshape/Permute_  Permute  Permute  EXECUTED  63
model_1/reshape_11/Reshape  Reshape  Reshape  EXECUTED  396
model_1/reshape_12/Reshape  Reshape  Reshape  EXECUTED  2644
model_1/reshape_12/Reshape/Permute_  Permute  Permute  EXECUTED  93
model_1/reshape_13/Reshape  Reshape  Reshape  EXECUTED  605
model_1/reshape_2/Reshape  Reshape  Reshape  EXECUTED  103
model_1/reshape_2/Reshape/Permute_  Permute  Permute  EXECUTED  41
model_1/reshape_3/Reshape  Reshape  Reshape  EXECUTED  108
model_1/reshape_4/Reshape  Reshape  Reshape  EXECUTED  191
model_1/reshape_4/Reshape/Permute_  Permute  Permute  EXECUTED  55
model_1/reshape_5/Reshape  Reshape  Reshape  EXECUTED  201
model_1/reshape_6/Reshape  Reshape  Reshape  EXECUTED  352
model_1/reshape_6/Reshape/Permute_  Permute  Permute  EXECUTED  55
model_1/reshape_7/Reshape  Reshape  Reshape  EXECUTED  378
model_1/reshape_8/Reshape  Reshape  Reshape  EXECUTED  708
model_1/reshape_8/Reshape/Permute_  Permute  Permute  EXECUTED  83
model_1/reshape_9/Reshape  Reshape  Reshape  EXECUTED  485
Sorry for the long post... What I observe is that, apart from all the times being higher in the first case, there are also these Broadcast (Tile) layers added that are really slow to execute.
EDIT: I re-trained the model without a bias in the weights and now it works like a charm, with a reasonable time per iteration... It seems the bias was the constant operation that was producing all the unsupported layers. However, the input layer still remains in the list of unsupported layers for some reason, although it doesn't seem to affect the output. I'm not sure whether it affects performance, though: right now each iteration takes about 50 ms for an input/output feature vector of size 1024, which is reasonable but not especially fast.
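In Keras terms the change was essentially this (again with placeholder layer sizes):

from keras.layers import Input, Conv1D

x = Input(shape=(1024, 1))  # placeholder input shape
# dropping the bias removed the Const layers in my case
x = Conv1D(64, 31, padding='same', use_bias=False)(x)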
