<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Dear Shubha, in Intel® Distribution of OpenVINO™ Toolkit</title>
    <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148850#M11973</link>
    <description>&lt;P&gt;Dear Shubha,&lt;/P&gt;&lt;P&gt;Finally my custom operation&amp;nbsp;is working at 100%!&lt;/P&gt;&lt;P&gt;I think dot product can't be done that way (in this context)&amp;nbsp;because the weights and inputs are stored in a long one-dimensional vector, so&amp;nbsp; what I did was implementing&amp;nbsp;the mask to obtain&amp;nbsp;the weight indexes. Here is the complete code for the previously mentioned function:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;StatusCode execute(std::vector&amp;lt;Blob::Ptr&amp;gt;&amp;amp; inputs, std::vector&amp;lt;Blob::Ptr&amp;gt;&amp;amp; outputs,
                       ResponseDesc *resp) noexcept override {
        
        if (inputs.size() != 1 || outputs.empty()) {
            if (resp) {
                std::string errorMsg = "Incorrect number of input or output edges!";
                errorMsg.copy(resp-&amp;gt;msg, sizeof(resp-&amp;gt;msg) - 1);
            }
            return GENERAL_ERROR;
        }

        const float* src = inputs[0]-&amp;gt;buffer();
        const float* scl = weights-&amp;gt;buffer();
        float* dst = outputs[0]-&amp;gt;buffer();

        SizeVector in_dims = inputs[0]-&amp;gt;getTensorDesc().getDims();
        SizeVector out_dims = outputs[0]-&amp;gt;getTensorDesc().getDims();

        const int in_neurons = static_cast&amp;lt;int&amp;gt;(in_dims[1]);
        const int out_neurons = static_cast&amp;lt;int&amp;gt;(out_dims[1]);

        for (int n = 0; n &amp;lt; out_neurons; n++) {
            float accum = 0.0f;
            for (int i = 0; i &amp;lt; in_neurons; i++) {
                accum += src[i] * scl[i*out_neurons + n];
            }
            dst[n] = accum;
        }
        return OK;
    }&lt;/PRE&gt;

&lt;P&gt;However, now I have another problem. After running the benchmark script, I saw that my custom function (dot) is the bottleneck of the application:&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;CONV3-32: [28x28x32]   memory: 28*28*32 = 25K   weights: (3*3*32)*32 = 9K   nr_operations: 7M   time [s]: 0.049000

FC1: [1x1x512]   memory: 576   weights: 3*3*64*512 = 295K   nr_operations: 295K   time [s]: 0.592000&lt;/PRE&gt;

&lt;P&gt;You can see from the results above that a convolutional layer with 7M MACs (Multiply-Accumulate operations) runs in 49 ms, while a fully connected layer (implemented with my custom dot operation) runs in 592 ms (12x slower). This should not occur, because the number of calculations (and therefore the processing time) for the convolutional layer is much higher than for the fully connected one.&lt;/P&gt;
&lt;P&gt;I think my next step will be to take a look at the Advanced Vector Extensions (in this case AVX2).&lt;/P&gt;</description>
    <pubDate>Mon, 09 Sep 2019 14:30:51 GMT</pubDate>
    <dc:creator>Gouveia__César</dc:creator>
    <dc:date>2019-09-09T14:30:51Z</dc:date>
    <item>
      <title>Not able to generate openvino IR for simple model (mnist) using mxnet</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148841#M11964</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;I tried to convert a simple MXNet model (for MNIST) to&amp;nbsp;an optimized&amp;nbsp;Intermediate Representation (IR) using the openVINO toolkit. I use&amp;nbsp;the following command to convert:&lt;/P&gt;&lt;P&gt;python mo_mxnet.py --input_model test_model\mnist_cnn-0000.params --input_shape (1,1,28,28) --reverse_input_channels&lt;/P&gt;&lt;P&gt;But when I try to run it it shows the following error:&lt;/P&gt;&lt;P&gt;[ ERROR ] &amp;nbsp;Unexpected exception happened during extracting attributes for node dense_1/kernel1.&lt;BR /&gt;Original exception message: Operation 'dot' not supported. Please register it as custom op.&lt;/P&gt;&lt;P&gt;The toolkit fully supports the dense layers for MXNet?&amp;nbsp;or is there something I'm doing wrong? Attached is the network definition file.&lt;/P&gt;</description>
      <pubDate>Wed, 21 Aug 2019 14:41:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148841#M11964</guid>
      <dc:creator>Gouveia__César</dc:creator>
      <dc:date>2019-08-21T14:41:05Z</dc:date>
    </item>
    <item>
      <title>Dear Gouveia, César,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148842#M11965</link>
      <description>&lt;P&gt;Dear&amp;nbsp;Gouveia, César,&lt;/P&gt;&lt;P&gt;While I don't have your full debug log (obtained by --log_level DEBUG) If this is indeed coming from Model Optimizer:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;Original exception message: Operation 'dot' not supported. Please register it as custom op.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;It means that dot.py is not found under here:&lt;/P&gt;&lt;P&gt;C:\Program Files (x86)\IntelSWTools\openvino_2019.2.242\deployment_tools\model_optimizer\mo\ops&lt;/P&gt;&lt;P&gt;or even under here:&lt;/P&gt;&lt;P&gt;C:\Program Files (x86)\IntelSWTools\openvino_2019.2.242\deployment_tools\model_optimizer\extensions\ops&lt;/P&gt;&lt;P&gt;You can add one though by creating something like "dot.py" in one of those locations. Is dot a "dot product" operation ? There maybe a way to modify the model to avoid the dot operation.&lt;/P&gt;&lt;P&gt;My guess is that the complaint is about not finding&amp;nbsp;&lt;A href="https://mxnet.incubator.apache.org/api/scala/ndarray.html#dot-product"&gt;NDArray Dot&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Please attach your debug log to this forum ticket so that I can investigate further.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Shubha&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 22 Aug 2019 20:59:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148842#M11965</guid>
      <dc:creator>Shubha_R_Intel</dc:creator>
      <dc:date>2019-08-22T20:59:17Z</dc:date>
    </item>
    <item>
      <title>Hi Shubha,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148843#M11966</link>
      <description>&lt;P&gt;Hi Shubha,&lt;/P&gt;&lt;P&gt;First of all I apologize for the delayed response. I also want to thank&amp;nbsp;you&amp;nbsp;for your&amp;nbsp;answer&amp;nbsp;and for the attention&amp;nbsp;that&amp;nbsp;is being&amp;nbsp;given to this&amp;nbsp;matter.&lt;/P&gt;&lt;P&gt;I did some research and I found that the two&amp;nbsp;mxnet&amp;nbsp;core packages are &lt;STRONG&gt;NDArray&lt;/STRONG&gt; and &lt;STRONG&gt;Symbol&lt;/STRONG&gt;.&amp;nbsp;The symbol package uses the &lt;A href="https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.FullyConnected"&gt;FullyConnected&lt;/A&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;operation, and the&amp;nbsp;NDArray package uses the &lt;A href="https://beta.mxnet.io/api/ndarray/_autogen/mxnet.ndarray.dot.html"&gt;dot&lt;/A&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;operation. This dot operation is not listed under the &lt;A href="https://docs.openvinotoolkit.org/latest/_docs_MO_DG_prepare_model_Supported_Frameworks_Layers.html#mxnet_supported_symbols_and_the_mapping_to_the_intermediate_representation_layers"&gt;supported operations&lt;/A&gt; for mxnet&amp;nbsp;by openVINO&amp;nbsp;and as you said there is no&amp;nbsp;dot.py&amp;nbsp;under&amp;nbsp;&lt;STRONG&gt;\model_optimizer\mo\ops&lt;/STRONG&gt; and&amp;nbsp;&lt;STRONG&gt;\model_optimizer\extensions\ops&lt;/STRONG&gt;. Yes, the documentation says that the dot operation is a dot product of two arrays.&lt;/P&gt;&lt;P&gt;Yes I have two options either try modifying the network definition&amp;nbsp;and replacing dot operations with fully connected operations, or create a dot.py file as you said and as described in &lt;A href="https://docs.openvinotoolkit.org/latest/_docs_MO_DG_prepare_model_customize_model_optimizer_Extending_MXNet_Model_Optimizer_with_New_Primitives.html"&gt;here&lt;/A&gt;&amp;nbsp;and &lt;A href="https://software.intel.com/en-us/forums/computer-vision/topic/805980"&gt;here&lt;/A&gt;. 
I don´t know how to create this dot.py file&amp;nbsp;and this is what I am researching now.&amp;nbsp;Any tips&amp;nbsp;you could provide&amp;nbsp;would be helpful!&lt;/P&gt;&lt;P&gt;I attached my full log (--log_level DEBUG as you mentioned).&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;César.&lt;/P&gt;&lt;P&gt;EDIT: Apparently there is a &lt;A href="https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.dot"&gt;dot&lt;/A&gt; operation in the symbol mxnet API too.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2019 09:20:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148843#M11966</guid>
      <dc:creator>Gouveia__César</dc:creator>
      <dc:date>2019-08-23T09:20:00Z</dc:date>
    </item>
    <item>
      <title>Dear Gouveia, César,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148844#M11967</link>
      <description>&lt;P&gt;Dear&amp;nbsp;Gouveia, César,&lt;/P&gt;&lt;P&gt;I certainly understand your situation. Please read the following&amp;nbsp;&lt;A href="https://software.intel.com/en-us/forums/computer-vision/topic/805980"&gt;IDZ Custom Layers Post&lt;/A&gt;&amp;nbsp;where I give detailed information on how to build custom layers for OpenVino. The &lt;A href="https://github.com/opencv/dldt"&gt;dldt github&lt;/A&gt;&amp;nbsp;links it points to are 2018 but you can find the same links in the 2019 R2 repo.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please also take a look at&amp;nbsp;&lt;A href="https://github.com/david-drew/OpenVINO-Custom-Layers"&gt;The OpenVino Custom layer Tutorial&lt;/A&gt;&lt;/P&gt;&lt;P&gt;As for how to build an "dot.py", the best advice I can give you is to study existing ops in those directories I pointed you to earlier. They are all Python code. There is no easy answer. We unfortunately don't have documentation on these.&lt;/P&gt;&lt;P&gt;Hope it helps,&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Shubha&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2019 22:04:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148844#M11967</guid>
      <dc:creator>Shubha_R_Intel</dc:creator>
      <dc:date>2019-08-27T22:04:34Z</dc:date>
    </item>
    <item>
      <title>Hi Shubha,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148845#M11968</link>
      <description>&lt;P&gt;Hi&amp;nbsp;Shubha,&lt;/P&gt;&lt;P&gt;In the meanwhile, I made some progresses. I'm going to divide my post&amp;nbsp;in two phases:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;1º phase:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I can now generate the IR model, using a dot_ext.py and a dot.py. However an error message appears when I try to&amp;nbsp;generate&amp;nbsp;the model with&amp;nbsp;reverse&amp;nbsp;input channels (RGB to BGR) using&amp;nbsp;the&amp;nbsp;--reverse_input_channels flag (without this flag works fine):&lt;/P&gt;&lt;BLOCKQUOTE&gt;
&lt;PRE class="brush:bash; class-name:dark;"&gt;[ ERROR ]  Reverse input channels are not applied -- appropriate convolutions were not found

[ SUCCESS ] Generated IR model.
[ SUCCESS ] XML file: C:\Program Files (x86)\IntelSWTools\openvino_2019.2.242\.\mnist_cnn-0000.xml
[ SUCCESS ] BIN file: C:\Program Files (x86)\IntelSWTools\openvino_2019.2.242\.\mnist_cnn-0000.bin
[ SUCCESS ] Total execution time: 1.65 seconds.&lt;/PRE&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;My question is: does this error appear because mxnet already uses the BGR channel order, or because I'm doing something wrong while generating the model? I need to clarify this point because of this note:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;P&gt;**NOTE!**: By default, Inference Engine samples and demos expect input with BGR channels order. If you trained your model to work with RGB order, you need to manually rearrange the default channels order in the sample or demo application or reconvert your model using the Model Optimizer tool with `--reverse_input_channels` argument specified.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
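For reference, here is a minimal sketch (not from the sample code; the function name is hypothetical) of what "manually rearrange the default channels order" means for an interleaved 3-channel image buffer. One hedged observation: an MNIST input is grayscale (a single channel), so there is nothing to reverse, and that may be exactly why Model Optimizer reports that no appropriate convolutions were found.

```cpp
#include <cstddef>
#include <utility>

// Reversing input channels means swapping the R and B values of an
// interleaved (HWC) 3-channel image. Model Optimizer normally folds
// this swap into the first convolution's weights; a grayscale MNIST
// tensor has one channel and no such convolution, so there is
// nothing to reverse.
void reverse_channels(float* img, std::size_t pixels) {
    for (std::size_t p = 0; p < pixels; ++p) {
        std::swap(img[p * 3 + 0], img[p * 3 + 2]);  // R <-> B
    }
}
```

If that reading is right, the warning is expected for this model and the generated IR itself is fine.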
&lt;P&gt;Attached to this post are my dot.py, dot_ext.py, and my full log (--log_level DEBUG) for this particular error. Can you please check whether the Python files (dot.py and dot_ext.py) are correct, and the reason for this error?&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Phase 2:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;In this phase I tried to execute the model with the custom layer (using the C++ sample). build_samples_msvc.bat builds correctly using my ext_dot.cpp; however, the inference stops and doesn't show any ERROR message, which is strange:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;PRE class="brush:bash; class-name:dark;"&gt;[ INFO ] InferenceEngine:
        API version ............ 2.0
        Build .................. 27579
        Description ....... API
[ INFO ] Parsing input parameters
[ INFO ] Parsing input parameters
[ INFO ] Files were added: 1
[ INFO ]     deployment_tools\pics\digit_8.bmp
[ INFO ] Creating Inference Engine
[ INFO ] CPU Extension loaded: C:\Users\cesar.gouveia\Documents\Intel\OpenVINO\inference_engine_samples_build\intel64\Release\cpu_extension.dll
        CPU
        MKLDNNPlugin version ......... 2.0
        Build ........... 27579

[ INFO ] Loading network files
[ INFO ] Preparing input blobs
[ WARNING ] Image is resized from (280, 280) to (28, 28)
[ INFO ] Batch size is 1
[ INFO ] Loading model to the device
[ INFO ] Create infer request
[ INFO ] Start inference (10 asynchronous executions)&lt;/PRE&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I tried to look for a debug flag to provide more information but without success:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;PRE class="brush:bash; class-name:dark;"&gt;%USERPROFILE%\Documents\Intel\OpenVINO\inference_engine_samples_build\intel64\Release\classification_sample_async.exe -h
[ INFO ] InferenceEngine:
        API version ............ 2.0
        Build .................. 27579
        Description ....... API
[ INFO ] Parsing input parameters

classification_sample_async [OPTION]
Options:

    -h                      Print a usage message.
    -i "&amp;lt;path&amp;gt;"            Required. Path to a folder with images or path to an image files: a .ubyte file for LeNet and a .bmp file for the other networks.
    -m "&amp;lt;path&amp;gt;"            Required. Path to an .xml file with a trained model.
      -l "&amp;lt;absolute_path&amp;gt;"  Required for CPU custom layers. Absolute path to a shared library with the kernels implementation
          Or
      -c "&amp;lt;absolute_path&amp;gt;"  Required for GPU custom kernels. Absolute path to the .xml file with kernels description
    -d "&amp;lt;device&amp;gt;"          Optional. Specify the target device to infer on (the list of available devices is shown below). Default value is CPU. Sample will look for a suitable plugin for device specified.
    -nt "&amp;lt;integer&amp;gt;"        Optional. Number of top results. Default value is 10.
    -p_msg                  Optional. Enables messages from a plugin

Available target devices:  CPU  GNA  GPU  HDDL
&lt;/PRE&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Attached are my ext_dot.cpp and the log of my build (build.log). Can you please verify these files too?&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;César.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Aug 2019 13:38:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148845#M11968</guid>
      <dc:creator>Gouveia__César</dc:creator>
      <dc:date>2019-08-28T13:38:46Z</dc:date>
    </item>
    <item>
      <title>Dearest Gouveia, César,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148846#M11969</link>
      <description>&lt;P&gt;Dearest&amp;nbsp;Gouveia, César,&lt;/P&gt;&lt;P&gt;I am not ignoring you ! I promise to take a look.&lt;/P&gt;&lt;P&gt;Shubha&lt;/P&gt;</description>
      <pubDate>Wed, 28 Aug 2019 22:14:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148846#M11969</guid>
      <dc:creator>Shubha_R_Intel</dc:creator>
      <dc:date>2019-08-28T22:14:36Z</dc:date>
    </item>
    <item>
      <title>Dear Shubha,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148847#M11970</link>
      <description>&lt;P&gt;Dear Shubha,&lt;/P&gt;&lt;P&gt;Thank&amp;nbsp;you&amp;nbsp;very&amp;nbsp;much,&amp;nbsp;for&amp;nbsp;your&amp;nbsp;willingness to&amp;nbsp;answer my&amp;nbsp;questions.&lt;/P&gt;&lt;P&gt;I&amp;nbsp;look&amp;nbsp;forward to your&amp;nbsp;answer,&lt;/P&gt;&lt;P&gt;César.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 29 Aug 2019 09:03:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148847#M11970</guid>
      <dc:creator>Gouveia__César</dc:creator>
      <dc:date>2019-08-29T09:03:12Z</dc:date>
    </item>
    <item>
      <title>Hi Shubha,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148848#M11971</link>
      <description>&lt;P&gt;Hi Shubha,&lt;/P&gt;&lt;P&gt;I have made some&amp;nbsp;progresses and I think I am very close to the solution!&amp;nbsp;Inference now runs&amp;nbsp;without crashing and I'm able to output a prediction value, however this value is not correct and the model does not predict correctly, which makes me&amp;nbsp;think that there is still something wrong implemented in the dot operation.&lt;/P&gt;&lt;P&gt;I checked if both weights and outputs relative to the dot1 operation/layer&amp;nbsp;were equal/similar to the weights and output values&amp;nbsp;produced by the dense1 keras, by importing slog.hpp which enables info message printing. The weights are being read correctly, however the dot1&amp;nbsp;output values are different from the ones being produced by keras, is this a normal? Should the values be similar? Or the inference engine does some "out of the box" optimizations?&amp;nbsp;Below are&amp;nbsp;the output values ​​calculated by keras (dense_1)&amp;nbsp;and those calculated by the inference engine on openVINO using the IR model (dot1):&lt;/P&gt;&lt;BLOCKQUOTE&gt;
&lt;PRE class="brush:bash; class-name:dark;"&gt;Keras 10 highest output values for the Layer dense_1 array:

[4.927633  4.5581803 4.235109  4.0994024 4.0133104 3.9622984 3.7825406
 3.7398129 3.7224526 3.5236738]
&lt;/PRE&gt;
&lt;/BLOCKQUOTE&gt;
&lt;BLOCKQUOTE&gt;
&lt;PRE class="brush:bash; class-name:dark;"&gt;OpenVINO 10 highest output values for the dot1 operation/layer:

Top 10 results:

Image deployment_tools\data\digit_8.png

classid probability
------- -----------
38      1.4662519
62      1.3287330
51      1.3258586
49      1.3234890
4       1.0815108
76      0.9692022
100     0.9581612
61      0.8927912
83      0.8498729
78      0.8213191
&lt;/PRE&gt;
&lt;/BLOCKQUOTE&gt;
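One way to make this comparison systematic (a hypothetical helper, not part of the sample) is to print the largest absolute difference between the custom layer's output and the reference framework's output. Differences at the level of float rounding are normal; differences this large usually point at an indexing bug rather than engine optimizations:

```cpp
#include <cmath>
#include <cstddef>

// Largest absolute difference between a custom-layer output and a
// reference (e.g. Keras) output over n elements. Values above ~1e-4
// for a single dense layer suggest wrong weight indexing, not
// accumulated rounding error.
float max_abs_diff(const float* a, const float* b, std::size_t n) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float d = std::fabs(a[i] - b[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```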
&lt;P&gt;Attached&amp;nbsp;is my current&amp;nbsp;version of ext_dot.cpp. Can you please check it?&lt;/P&gt;
&lt;P&gt;I look forward&amp;nbsp;to your reply.&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;César.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Sep 2019 11:06:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148848#M11971</guid>
      <dc:creator>Gouveia__César</dc:creator>
      <dc:date>2019-09-04T11:06:00Z</dc:date>
    </item>
    <item>
      <title>Dear Gouveia, César,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148849#M11972</link>
      <description>&lt;P&gt;Dear&amp;nbsp;Gouveia, César,&lt;/P&gt;&lt;P&gt;I am glad that you got so far on your own. But indeed, your results look way off Keras's numbers. I apologize that I haven't gotten back to you sooner but I'm sure you can understand, I'm super busy with other customers.&lt;/P&gt;&lt;P&gt;I'm looking at your code in ext_dot.cpp and it doesn't look correct for dot product.&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;for(size_t n = 0; n &amp;lt; out_neurons; n++){
            float accum = 0.0;
            for(size_t i = 0; i &amp;lt; in_neurons; i++){
                accum += src[i] * scl[n*in_neurons + i];
            }
            dst[n] = accum;
        }&lt;/PRE&gt;

&lt;P&gt;Dot product is each row element multiplied by the corresponding column element, added together. Here are a couple of ways to do that in C++:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://rosettacode.org/wiki/Dot_product#C.2B.2B"&gt;https://rosettacode.org/wiki/Dot_product#C.2B.2B&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.sanfoundry.com/cpp-program-calculate-dot-product-two-matrices/"&gt;https://www.sanfoundry.com/cpp-program-calculate-dot-product-two-matrices/&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Your loops look different from this one (which is more readable and looks correct to me):&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;for (i = 0; i &amp;lt; m; i++)
    {
        C[i] = 0;
        for (j = 0; j &amp;lt; n; j++)
            C[i] += A[i][j] * B[i][j];

    }&lt;/PRE&gt;

&lt;P&gt;Something you can certainly do though is debug and step through your code (maybe you are already doing that).&lt;/P&gt;
&lt;P&gt;Hope it helps.&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;Shubha&lt;/P&gt;</description>
      <pubDate>Thu, 05 Sep 2019 20:32:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148849#M11972</guid>
      <dc:creator>Shubha_R_Intel</dc:creator>
      <dc:date>2019-09-05T20:32:10Z</dc:date>
    </item>
    <item>
      <title>Dear Shubha,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148850#M11973</link>
      <description>&lt;P&gt;Dear Shubha,&lt;/P&gt;&lt;P&gt;Finally my custom operation&amp;nbsp;is working at 100%!&lt;/P&gt;&lt;P&gt;I think dot product can't be done that way (in this context)&amp;nbsp;because the weights and inputs are stored in a long one-dimensional vector, so&amp;nbsp; what I did was implementing&amp;nbsp;the mask to obtain&amp;nbsp;the weight indexes. Here is the complete code for the previously mentioned function:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;StatusCode execute(std::vector&amp;lt;Blob::Ptr&amp;gt;&amp;amp; inputs, std::vector&amp;lt;Blob::Ptr&amp;gt;&amp;amp; outputs,
                       ResponseDesc *resp) noexcept override {
        
        if (inputs.size() != 1 || outputs.empty()) {
            if (resp) {
                std::string errorMsg = "Incorrect number of input or output edges!";
                errorMsg.copy(resp-&amp;gt;msg, sizeof(resp-&amp;gt;msg) - 1);
            }
            return GENERAL_ERROR;
        }

        const float* src = inputs[0]-&amp;gt;buffer();
        const float* scl = weights-&amp;gt;buffer();
        float* dst = outputs[0]-&amp;gt;buffer();

        SizeVector in_dims = inputs[0]-&amp;gt;getTensorDesc().getDims();
        SizeVector out_dims = outputs[0]-&amp;gt;getTensorDesc().getDims();

        const int in_neurons = static_cast&amp;lt;int&amp;gt;(in_dims[1]);
        const int out_neurons = static_cast&amp;lt;int&amp;gt;(out_dims[1]);

        for (int n = 0; n &amp;lt; out_neurons; n++) {
            float accum = 0.0f;
            for (int i = 0; i &amp;lt; in_neurons; i++) {
                accum += src[i] * scl[i*out_neurons + n];
            }
            dst[n] = accum;
        }
        return OK;
    }&lt;/PRE&gt;

&lt;P&gt;However, now I have another problem. After running the benchmark script, I saw that my custom function (dot) is the bottleneck of the application:&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;CONV3-32: [28x28x32]   memory: 28*28*32 = 25K   weights: (3*3*32)*32 = 9K   nr_operations: 7M   time [s]: 0.049000

FC1: [1x1x512]   memory: 576   weights: 3*3*64*512 = 295K   nr_operations: 295K   time [s]: 0.592000&lt;/PRE&gt;

&lt;P&gt;You can see from the results above that a convolutional layer with 7M MACs (Multiply-Accumulate operations) runs in 49 ms, while a fully connected layer (implemented with my custom dot operation) runs in 592 ms (12x slower). This should not occur, because the number of calculations (and therefore the processing time) for the convolutional layer is much higher than for the fully connected one.&lt;/P&gt;
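One likely cause, sketched below as an assumption (function names are hypothetical): the masked read scl[i*out_neurons + n] strides through memory by out_neurons floats per step, touching a new cache line on almost every iteration. Transposing the weights once at load time makes the hot inner loop read both arrays sequentially:

```cpp
#include <cstddef>
#include <vector>

// scl[i * out_neurons + n] is a strided (cache-hostile) access.
// Transposing once to t[n * in_neurons + i] makes each output
// neuron's weights contiguous.
std::vector<float> transpose_weights(const float* scl,
                                     int in_neurons, int out_neurons) {
    std::vector<float> t(static_cast<std::size_t>(in_neurons) * out_neurons);
    for (int i = 0; i < in_neurons; ++i)
        for (int n = 0; n < out_neurons; ++n)
            t[static_cast<std::size_t>(n) * in_neurons + i] =
                scl[static_cast<std::size_t>(i) * out_neurons + n];
    return t;
}

// Same dot product as before, now unit-stride on both arrays.
void dense_forward(const float* src, const float* t, float* dst,
                   int in_neurons, int out_neurons) {
    for (int n = 0; n < out_neurons; ++n) {
        float accum = 0.0f;
        const float* row = &t[static_cast<std::size_t>(n) * in_neurons];
        for (int i = 0; i < in_neurons; ++i)
            accum += src[i] * row[i];
        dst[n] = accum;
    }
}
```

The transpose is paid once per network load, while the strided reads in the original loop are paid on every inference.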
&lt;P&gt;I think my next step will be to take a look at the Advanced Vector Extensions (in this case AVX2).&lt;/P&gt;</description>
      <pubDate>Mon, 09 Sep 2019 14:30:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148850#M11973</guid>
      <dc:creator>Gouveia__César</dc:creator>
      <dc:date>2019-09-09T14:30:51Z</dc:date>
    </item>
    <item>
      <title>Dear Gouveia, César,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148851#M11974</link>
      <description>&lt;P&gt;Dear&amp;nbsp;Gouveia, César,&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;I think dot product can't be done that way (in this context)&amp;nbsp;because the weights and inputs are stored in a long one-dimensional vector, so&amp;nbsp; what I did was implementing&amp;nbsp;the mask to obtain&amp;nbsp;the weight indexes.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;OK I missed that part. Good on you that you finally got it working ! Congrats ! Take a look at the implementation of&amp;nbsp;&lt;A href="https://github.com/opencv/dldt/blob/2019/inference-engine/src/extension/ext_argmax.cpp"&gt;ext_argmax.cpp&lt;/A&gt;&amp;nbsp;.&amp;nbsp; You will see this in the header file section :&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#include &amp;lt;ie_parallel.hpp&amp;gt;
#if defined(HAVE_SSE) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
#include &amp;lt;immintrin.h&amp;gt;
#endif&lt;/PRE&gt;

&lt;P&gt;You will see extensive use of AVX as well as stuff like parallel_for.&lt;/P&gt;
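For illustration only (this plain std::thread sketch is not the Inference Engine API; the helper name is hypothetical), a parallel_for-style helper splits an index range into contiguous chunks and runs each chunk on its own thread:

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Splits [0, n) into contiguous chunks, one per hardware thread,
// and runs body(i) for every index -- the same shape as the
// parallel_for helpers used by the bundled CPU extension kernels.
void parallel_for_sketch(int n, const std::function<void(int)>& body) {
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    int chunk = (n + static_cast<int>(workers) - 1) / static_cast<int>(workers);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        int lo = static_cast<int>(w) * chunk;
        int hi = std::min(n, lo + chunk);
        if (lo >= hi) break;
        pool.emplace_back([lo, hi, &body] {
            for (int i = lo; i < hi; ++i) body(i);
        });
    }
    for (std::thread& t : pool) t.join();
}
```

For a dense layer, the body would compute one output neuron per index, so neurons are processed concurrently and independently.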
&lt;P&gt;Hope it helps,&lt;/P&gt;
&lt;P&gt;Shubha&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 09 Sep 2019 17:09:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148851#M11974</guid>
      <dc:creator>Shubha_R_Intel</dc:creator>
      <dc:date>2019-09-09T17:09:15Z</dc:date>
    </item>
    <item>
      <title>Hi again Shubha,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148852#M11975</link>
      <description>&lt;P&gt;Hi again&amp;nbsp;Shubha,&lt;/P&gt;&lt;P&gt;I have made the&amp;nbsp;following&amp;nbsp;code to perform dot product using the&amp;nbsp;AVX vectors to speed up the operation:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;const float* src = inputs[0]-&amp;gt;buffer();
const float* scl = weights-&amp;gt;buffer();
float* dst = outputs[0]-&amp;gt;buffer();

SizeVector in_dims = inputs[0]-&amp;gt;getTensorDesc().getDims();
SizeVector out_dims = outputs[0]-&amp;gt;getTensorDesc().getDims();

const int in_neurons = static_cast&amp;lt;int&amp;gt;(in_dims[1]);
const int out_neurons = static_cast&amp;lt;int&amp;gt;(out_dims[1]);    

for(int n = 0; n &amp;lt; out_neurons; n++){
    float accum = 0.0f;
    float temp[4] = {0,0,0,0};
    float *p = temp;

    __m128 in, ws, dp;

    for(int i = 0; i &amp;lt; in_neurons; i+=4){

        // read and save the weights correctly by applying the mask
        temp[0] = scl[(i+0)*out_neurons + n];
        temp[1] = scl[(i+1)*out_neurons + n];
        temp[2] = scl[(i+2)*out_neurons + n];
        temp[3] = scl[(i+3)*out_neurons + n];

        // load input neurons sequentially (loadu: these buffers are
        // not guaranteed to be 16-byte aligned)
        in = _mm_loadu_ps(&amp;amp;src[i]);

        // load weights
        ws = _mm_loadu_ps(p);

        // dot product of the four lanes
        dp = _mm_dp_ps(in, ws, 0xff);

        // accumulate the scalar lane
        accum += dp.m128_f32[0];
    }
    // save the final result
    dst[n] = accum;
}&lt;/PRE&gt;

&lt;P&gt;It works, but the speedup is far from what I expected. As you can see below, a convolutional layer with 24x more operations than my custom dot product layer still takes less time. This makes no sense, so there should be much more room for improvement. What are my major mistakes when trying to use AVX?&lt;/P&gt;
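Two candidate mistakes, offered as a hedged sketch rather than a profiled diagnosis: the four strided scalar reads into temp[] each iteration dominate the loop, and _mm_dp_ps performs a full horizontal reduction every four elements. The usual restructuring keeps four running partial sums and reduces once per output neuron; it is shown below with plain arrays (so it compiles anywhere) and assumes the weights for one output neuron are already stored contiguously. This is exactly the pattern _mm_mul_ps/_mm_add_ps would implement:

```cpp
// Four independent partial sums stand in for one SIMD register's
// lanes; the horizontal reduction happens once per output neuron
// instead of once per 4 inputs. Assumes `row` holds this output
// neuron's weights contiguously, so there is no per-iteration gather.
float dot_lanes(const float* src, const float* row, int n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        lane[0] += src[i + 0] * row[i + 0];
        lane[1] += src[i + 1] * row[i + 1];
        lane[2] += src[i + 2] * row[i + 2];
        lane[3] += src[i + 3] * row[i + 3];
    }
    float accum = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < n; ++i) accum += src[i] * row[i];  // scalar tail
    return accum;
}
```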

&lt;PRE class="brush:bash; class-name:dark;"&gt;**Convolutional Convolutional Layer Fully Optimized (AVX)**
Layer: CONV3-32 
Input: 28x28x32 = 25K   
Weights: (3*3*32)*32 = 9K   
Number of MACs: 3*3*27*27*32*32 = 7M    
Execution Time on OpenVINO framework: 0.049 ms

**My Custom Dot Product Layer Far From Optimized (AVX)**
Layer: FC
Inputs: 1x1x512
Outputs: 576    
Weights: 3*3*64*512 = 295K  
Number of MACs: 295K    
Execution Time on OpenVINO framework: 0.197 ms
&lt;/PRE&gt;</description>
      <pubDate>Tue, 24 Sep 2019 16:07:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148852#M11975</guid>
      <dc:creator>Gouveia__César</dc:creator>
      <dc:date>2019-09-24T16:07:26Z</dc:date>
    </item>
    <item>
      <title>Dear Gouveia, César,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148853#M11976</link>
      <description>&lt;P&gt;Dear Gouveia, César,&lt;/P&gt;&lt;P&gt;Please have a look at &lt;A href="https://github.com/opencv/dldt/blob/2019/inference-engine/src/extension/ext_topk.cpp"&gt;ext_topk.cpp&lt;/A&gt;. While I didn't study your code deeply, I don't see a SIMD (Single Instruction, Multiple Data) approach. How do I know this? I just see a regular for loop; I'd expect to see parallel_for2d. For instance, if you study &lt;A href="https://github.com/opencv/dldt/blob/2019/inference-engine/src/extension/ext_topk.cpp"&gt;ext_topk.cpp&lt;/A&gt;, you will see something like this:&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#if defined(HAVE_SSE) || defined(HAVE_AVX2) || defined(HAVE_AVX512F)
        parallel_for2d(before_num, after_num / block_size, [&amp;amp;](int i0, int ib1) {&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope it helps,&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;Shubha&lt;/P&gt;</description>
      <pubDate>Mon, 30 Sep 2019 20:47:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148853#M11976</guid>
      <dc:creator>Shubha_R_Intel</dc:creator>
      <dc:date>2019-09-30T20:47:09Z</dc:date>
    </item>
    <item>
      <title>Dear Shubha,</title>
      <link>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148854#M11977</link>
      <description>&lt;P&gt;Dear Shubha,&lt;/P&gt;&lt;P&gt;After transposing the weights matrix and applying AVX intrinsics I was able to optimize my code to a decent execution time! Here is the final code, which uses the transposed weights matrix but is shown without the AVX intrinsics and the parallel_for to keep it simple.&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;StatusCode execute(std::vector&amp;lt;Blob::Ptr&amp;gt;&amp;amp; inputs, std::vector&amp;lt;Blob::Ptr&amp;gt;&amp;amp; outputs,
                       ResponseDesc *resp) noexcept override {
        
    if (inputs.size() != 1 || outputs.empty()) {
        if (resp) {
            std::string errorMsg = "Incorrect number of input or output edges!";
            errorMsg.copy(resp-&amp;gt;msg, sizeof(resp-&amp;gt;msg) - 1);
        }
        return GENERAL_ERROR;
    }

    const float* src = inputs[0]-&amp;gt;buffer();
    const float* scl = weights-&amp;gt;buffer();
    float* dst = outputs[0]-&amp;gt;buffer();

    SizeVector in_dims = inputs[0]-&amp;gt;getTensorDesc().getDims();
    SizeVector out_dims = outputs[0]-&amp;gt;getTensorDesc().getDims();

    const int in_neurons = static_cast&amp;lt;int&amp;gt;(in_dims[1]);
    const int out_neurons = static_cast&amp;lt;int&amp;gt;(out_dims[1]);

    for (int n = 0; n &amp;lt; out_neurons; n++) {
        float accum = 0.0f;
        for (int i = 0; i &amp;lt; in_neurons; i++) {
            accum += src[i] * scl[n*in_neurons + i];
        }
        dst[n] = accum;
    }
    return OK;
}&lt;/PRE&gt;

&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;César.&lt;/P&gt;</description>
      <pubDate>Mon, 14 Oct 2019 10:46:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/Not-able-to-generate-openvino-IR-for-simple-model-mnist-using/m-p/1148854#M11977</guid>
      <dc:creator>Gouveia__César</dc:creator>
      <dc:date>2019-10-14T10:46:00Z</dc:date>
    </item>
  </channel>
</rss>

