hua__wei
Beginner

Problems when converting models with more than three channels

I trained an MTCNN model with MXNet and converted it to OpenVINO. Problems happened when I converted LNet, which has 15 input channels.

The input shape of LNet is [1, 15, 24, 24], so I used the command below to convert the model:

python3 mo_mxnet.py --input_model lnet.params --mean_values [127.5,127.5,127.5,127.5,127.5,127.5,127.5,127.5,127.5,127.5,127.5,127.5,127.5,127.5,127.5] --scale_values [128,128,128,128,128,128,128,128,128,128,128,128,128,128,128] --input_shape [1,15,24,24]

The conversion succeeds, but when I feed in the same image, the result of the OpenVINO LNet model is different from the MXNet LNet model.

I'm not sure the conversion command above is right.
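One way to sanity-check the command is to reproduce by hand what the per-channel mean/scale options are expected to do. The sketch below is an assumption about the intended arithmetic, (x - mean) / scale applied per channel, not a statement about Model Optimizer's internals:

```python
import numpy as np

# Per-channel values matching the command line above (15 channels).
mean_values = np.full(15, 127.5, dtype=np.float32)
scale_values = np.full(15, 128.0, dtype=np.float32)

def preprocess(image):
    """image: float32 array of shape (1, 15, 24, 24) in NCHW layout."""
    # Reshape so mean/scale broadcast over the H and W dimensions.
    m = mean_values.reshape(1, 15, 1, 1)
    s = scale_values.reshape(1, 15, 1, 1)
    return (image - m) / s

# A uniform mid-gray input should map to all zeros after centering.
raw = np.full((1, 15, 24, 24), 127.5, dtype=np.float32)
out = preprocess(raw)
print(out.min(), out.max())  # 0.0 0.0
```

If feeding `preprocess(raw)` to a model converted *without* `--mean_values`/`--scale_values` gives the same result as feeding `raw` to a model converted *with* them, the command is doing what you expect.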

 

However, when I used the command below to convert PNet/RNet/ONet in MTCNN, the results were fine:

python3 mo_mxnet.py --input_model pnet.params --mean_values [127.5,127.5,127.5] --scale_values [127.8,127.8,127.8] --input_shape [1,3,200,200]

 

 

Shubha_R_Intel
Employee

Dear hua__wei,

Please tell me more about your model which requires 15 channels; I'm not sure what you mean. OpenVINO's Inference Engine expects a layout of NCHW, where N = batch size, C = number of channels, H = height, W = width. The value of C is 3 if your input is RGB or BGR, or 1 if, for instance, your input image is black and white.

Can you kindly clarify your intent?

Thanks,

Shubha

 

hua__wei
Beginner

Dear Shubha R,

1. This model is the fourth model of MTCNN, called LNet. It has 15 channels because it stacks five images as input: the left eye, right eye, nose, left mouth corner, and right mouth corner. LNet was trained on all five images together, so 5 images × 3 channels = 15 channels. Each image is BGR, so the input channel order is [b,g,r,b,g,r,b,g,r,b,g,r,b,g,r].

The input of LNet is [1,15,24,24]: batch size 1, 15 channels, image height 24, image width 24. The model can be downloaded from https://github.com/YYuanAnyVision/mxnet_mtcnn_face_detection/tree/master/model (det4-0001.params).
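To illustrate the stacking described above, the sketch below builds a [1,15,24,24] tensor from five BGR crops. The crops here are random placeholders standing in for the real eye/nose/mouth patches:

```python
import numpy as np

# Five hypothetical BGR crops (left eye, right eye, nose, left mouth
# corner, right mouth corner), each in HWC layout as OpenCV would load them.
crops = [np.random.randint(0, 256, (24, 24, 3), dtype=np.uint8)
         for _ in range(5)]

# Convert each crop to CHW, then concatenate along the channel axis,
# producing the [b,g,r,b,g,r,...] channel order described above.
chw = [c.transpose(2, 0, 1) for c in crops]           # each (3, 24, 24)
stacked = np.concatenate(chw, axis=0)[np.newaxis]     # (1, 15, 24, 24)
print(stacked.shape)  # (1, 15, 24, 24)
```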

2. I also tried converting LNet with the command below:

python3 mo_mxnet.py --input_model det4-0001.params --input_shape [1,15,24,24]

I then subtracted the mean values and divided by the scale values on the input image myself before feeding it to LNet; however, the result is still different from the original LNet in MXNet.

3. I ran two experiments. In the first, I converted the model without the --mean_values and --scale_values parameters, subtracted the mean values and divided by the scale values on the input image myself, and then fed the result to the model. In the second, I converted the model with --mean_values and --scale_values and fed the original image to the model.

Why are the results of these two experiments different?
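One frequent cause of such a mismatch is preprocessing being applied twice: once manually and once again by the normalization folded into the converted model. A small sketch of the effect (assuming the intended normalization is (x - mean) / scale; the function name is illustrative):

```python
import numpy as np

def normalize(x, mean=127.5, scale=128.0):
    # The centering/scaling that both pipelines are meant to apply once.
    return (x - mean) / scale

x = np.random.randint(0, 256, (1, 15, 24, 24)).astype(np.float32)

once = normalize(x)       # preprocessing applied exactly once
twice = normalize(once)   # accidentally applied twice (manual + embedded)

# Double application shifts values from roughly [-1, 1] to around -1,
# so the two results differ substantially:
print(np.abs(once - twice).max())
```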

 

Thank you,

Wei Hua

Shubha_R_Intel
Employee

Dear Wei Hua,

For 2) above, when you say the result is different from the original MXNet model, what do you mean? How are the results different, and how are you comparing them?
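A simple, reproducible way to compare the two outputs is the element-wise maximum absolute difference. The helper below is a sketch (the function name and tolerance are illustrative, and the arrays stand in for real inference results):

```python
import numpy as np

def compare_outputs(mxnet_out, openvino_out, atol=1e-4):
    """Return (max absolute difference, whether the outputs agree)."""
    a = np.asarray(mxnet_out, dtype=np.float32).ravel()
    b = np.asarray(openvino_out, dtype=np.float32).ravel()
    max_abs = float(np.abs(a - b).max())
    return max_abs, max_abs <= atol

# Hypothetical outputs standing in for the two models' predictions:
a = np.array([0.11, 0.52, 0.93])
b = np.array([0.11, 0.52, 0.93])
print(compare_outputs(a, b))  # (0.0, True)
```

Reporting this number makes "different" concrete: a difference of 1e-5 usually just reflects floating-point variation between frameworks, while a difference of 0.5 points to a preprocessing or conversion problem.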

For 3), I agree this is strange, as long as you are passing the correct --input_shape to Model Optimizer in each case. I assume that when you do this:

"subtract the mean value and scale value of the input image, and then feed the result to the model"

--input_shape does not change.

I found an interesting article, how-to-manually-scale-image-pixel-data-for-deep-learning, which discusses Global Centering versus Local Centering.

Looking at the help for Model Optimizer, we see:

--mean_values: Mean values to be used for the input image per channel.

--scale_values: Scale values to be used for the input image per channel.

The "per channel" verbiage implies that Local Centering is performed by Model Optimizer.

Finally, another open question: does Model Optimizer perform centering (subtracting the mean) before normalization (scaling), or after? To tell you the truth, I'm not sure. My guess is that centering happens first.
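The order matters quite a bit, as a quick numpy check shows. This is purely an illustration of the two orderings, not documentation of what Model Optimizer actually does:

```python
import numpy as np

x = np.array([0.0, 127.5, 255.0], dtype=np.float32)
mean, scale = 127.5, 128.0

# Centering first maps the input to roughly [-1, 1]:
center_then_scale = (x - mean) / scale

# Scaling first (then subtracting the same mean) lands near -127.5,
# a completely different range:
scale_then_center = x / scale - mean

print(center_then_scale)
print(scale_then_center)
```

If your manual preprocessing uses one ordering and the converted model embeds the other, every output will be wrong in a consistent way.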

There is also --scale: "All input values coming from original network inputs will be divided by this value." --scale seems more Global than Local, since "per channel" is not mentioned.

Finally, if you look at our C++ samples, you will see code like the following. It's important to set the precision correctly for your input data.

/** Specifying the precision and layout of input data provided by the user.
 *  This should be called before load of the network to the plugin **/
inputInfoItem.second->setPrecision(Precision::U8);
inputInfoItem.second->setLayout(Layout::NCHW);

std::vector<std::shared_ptr<unsigned char>> imagesData;
for (auto & i : imageNames) {
    FormatReader::ReaderPtr reader(i.c_str());
    if (reader.get() == nullptr) {
        slog::warn << "Image " + i + " cannot be read!" << slog::endl;
        continue;
    }
    /** Store image data **/
    std::shared_ptr<unsigned char> data(
            reader->getData(inputInfoItem.second->getTensorDesc().getDims()[3],
                            inputInfoItem.second->getTensorDesc().getDims()[2]));
    if (data.get() != nullptr) {
        imagesData.push_back(data);
    }
}

 

My point is that the way you interpreted --mean_values and --scale_values and the way Model Optimizer interpreted them are likely different, which would explain the inconsistent results you're seeing.

Hope it helps,

Thanks,

Shubha
