Hello all,
I recently implemented the convolution primitives of the Intel MKL library as described in the example included with the library. Everything is fine and dandy on my laptop with an i5-3210M. However, when I tried to run the code on the big machine with an Intel(R) Xeon(R) CPU E5-2650 v3, I ran into some bugs/problems.
For outputs whose channel count is a multiple of 8, the order of the output values is wrong. This is either a mistake on my side (probably in the compile options) or, in the worst case, a bug in MKL. I wrote a short test program, similar to the example file, that implements a standard forward convolution:
```cpp
#include <iostream>
#include <vector>
#include "mkl_dnn.h"

using namespace std;

#define dimension (4)

int main() {
    dnnPrimitiveAttributes_t attributes;
    dnnPrimitive_t conv_prim = NULL;
    float* resConv1[dnnResourceNumber] = {0};

    size_t batch_num = 1;
    bool use_bias = false;
    size_t xinp = 4, yinp = 4, xout = 4, yout = 4,
           inpchannels = 1, outchannels = 8, xfilt = 3, yfilt = 3;

    // Sizes and strides of the plain (NCHW-style) user buffers.
    size_t outputSize[dimension]    = { xout, yout, outchannels, batch_num };
    size_t outputStrides[dimension] = { 1, xout, xout * yout, xout * yout * outchannels };
    size_t inputSize[dimension]     = { xinp, yinp, inpchannels, batch_num };
    size_t inputStrides[dimension]  = { 1, xinp, xinp * yinp, xinp * yinp * inpchannels };
    size_t filterSize[dimension]    = { xfilt, yfilt, inpchannels, outchannels };
    size_t filterStrides[dimension] = { 1, xfilt, xfilt * yfilt, xfilt * yfilt * inpchannels };
    size_t biasSize[1]    = { outputSize[2] };
    size_t biasStrides[1] = { outputStrides[2] };
    size_t convolutionStride[dimension - 2] = { 1, 1 };
    // Offset of -1 per side gives "same" padding for a 3x3 filter on a 4x4 image.
    int inputOffset[dimension - 2] = {
        -(int)(outputSize[0] / 2) - (int)(filterSize[0] / 2) + (int)(inputSize[0] / 2),
        -(int)(outputSize[0] / 2) - (int)(filterSize[0] / 2) + (int)(inputSize[0] / 2)
    };

    dnnLayout_t lt_conv1_input = NULL, lt_conv1_filt = NULL,
                lt_conv1_bias = NULL, lt_conv1_output = NULL;

    if (dnnPrimitiveAttributesCreate_F32(&attributes) != E_SUCCESS) {
        std::cout << "error" << std::endl;
    }

    // Create the forward convolution primitive.
    dnnError_t err;
    if (use_bias) {
        err = dnnConvolutionCreateForwardBias_F32(&conv_prim, attributes,
                  dnnAlgorithmConvolutionDirect, dimension, inputSize, outputSize,
                  filterSize, convolutionStride, inputOffset, dnnBorderZeros);
    } else {
        err = dnnConvolutionCreateForward_F32(&conv_prim, attributes,
                  dnnAlgorithmConvolutionDirect, dimension, inputSize, outputSize,
                  filterSize, convolutionStride, inputOffset, dnnBorderZeros);
    }
    if (err != E_SUCCESS) {
        switch (err) {
        case E_INCORRECT_INPUT_PARAMETER:
            std::cout << "incorrect input parameter while creating the convolution" << std::endl;
            break;
        default:
            std::cout << "error while creating convolution" << std::endl;
        }
    }

    // Layouts the primitive expects for its resources.
    dnnLayoutCreateFromPrimitive_F32(&lt_conv1_input, conv_prim, dnnResourceSrc);
    dnnLayoutCreateFromPrimitive_F32(&lt_conv1_filt, conv_prim, dnnResourceFilter);
    if (use_bias) {
        dnnLayoutCreateFromPrimitive_F32(&lt_conv1_bias, conv_prim, dnnResourceBias);
    }
    dnnLayoutCreateFromPrimitive_F32(&lt_conv1_output, conv_prim, dnnResourceDst);

    // Plain buffers filled with 1s.
    std::vector<float> input(xinp * yinp * inpchannels, 1.0f);
    std::vector<float> output(xout * yout * outchannels, 1.0f);
    std::vector<float> filter(xfilt * yfilt * inpchannels * outchannels, 1.0f);
    std::vector<float> bias(outchannels, 1.0f);

    resConv1[dnnResourceSrc]    = &input[0];
    resConv1[dnnResourceFilter] = &filter[0];
    if (use_bias) resConv1[dnnResourceBias] = &bias[0];
    resConv1[dnnResourceDst]    = &output[0];

    // Run the forward convolution.
    dnnError_t err_exe = dnnExecute_F32(conv_prim, (void**)resConv1);
    if (err_exe != E_SUCCESS) {
        std::cout << "Error while forward propagation in convolutional layer" << std::endl;
        if (err_exe == E_MEMORY_ERROR)              std::cout << "Memory Error" << std::endl;
        if (err_exe == E_UNIMPLEMENTED)             std::cout << "Unimplemented" << std::endl;
        if (err_exe == E_UNSUPPORTED_DIMENSION)     std::cout << "Unsupported dimension" << std::endl;
        if (err_exe == E_INCORRECT_INPUT_PARAMETER) std::cout << "Incorrect input parameter" << std::endl;
    }

    // Print the output buffer, which I expect in plain NCHW order.
    std::cout << "output" << std::endl;
    for (size_t i = 0; i < output.size(); i++) {
        std::cout << output[i] << " ";
    }
    std::cout << std::endl;
    return 0;
}
```
The desired output for a 4x4 image of 1s convolved with 3x3 filters of 1s (zero-padded, so corners see 4 ones, edges 6, and the center 9), repeated for each of the 8 output channels, is:
4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4 4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4 4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4 4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4 4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4 4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4 4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4 4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4
This is also what my mobile CPU gives me when I run the code. However, on the big machine I get
4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 6 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 6 6 6 6 6 6 6 6 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 4 4 4 4 4 4 4
which contains the right values, but not in the right order. However, when I change the number of output channels so that it is not a multiple of 8, the code runs fine even on the Xeon CPU. This might be due to MKL switching to a different, slower algorithm, as explained in this post:
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/761063
Does anybody have an explanation or even a fix for this issue? Is this known behaviour on Xeon CPUs, or a bug in the software? I don't necessarily want to switch to the open-source implementation, since that would mean a week of re-implementation and testing.
For compilation I used the following link and include options for both systems:
```
-L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
-I${MKLROOT}/include -I${MKLROOT}/../lib/intel64_lin
```
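For completeness, the full compile command on both machines looks roughly like this (assuming g++; the source file name is just a placeholder):
```
g++ -std=c++11 conv_test.cpp -o conv_test \
    -I${MKLROOT}/include -I${MKLROOT}/../lib/intel64_lin \
    -L${MKLROOT}/lib/intel64 \
    -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
```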
Any help would be appreciated.
Hi Lennart,
Thank you for reporting this. The result is by design.
The implementation actually depends on the machine type, the convolution shape, and so on, so the output layout can differ between machines (or even between configurations on the same machine). For the case where the number of input channels (ic) is 1 and the number of output channels (oc) is divisible by the SIMD width (8 for AVX2), the function calls optimized code that produces output in a SIMD-friendly format blocked by channels, nChw8c, instead of the plain NCHW format. Here n is the batch size, C is the number of channel blocks, h is the spatial height, and w is the spatial width.
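Roughly speaking, an element (n, c, h, w) ends up at the following offsets in the two formats. This is only a sketch of the blocked layout (assuming C is divisible by 8; the helper names are just for illustration), not code taken from MKL:
```cpp
#include <cstddef>

// Offset of element (n, c, h, w) in the plain NCHW layout.
inline size_t offset_nchw(size_t n, size_t c, size_t h, size_t w,
                          size_t C, size_t H, size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

// Offset of the same element in the blocked nChw8c layout: channels are split
// into blocks of 8 and the position inside a block becomes the innermost index.
inline size_t offset_nChw8c(size_t n, size_t c, size_t h, size_t w,
                            size_t C, size_t H, size_t W) {
    const size_t B = 8;  // SIMD width for AVX2
    return (((n * (C / B) + c / B) * H + h) * W + w) * B + (c % B);
}
```
This matches the reordered output you observed: for each spatial position, the 8 channel values are stored next to each other.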
There is some explanation of the data layouts and the common programming model in this article: https://software.intel.com/en-us/articles/introducing-dnn-primitives-in-intelr-mkl
If you always want the plain format, you need to call a conversion (reorder) on the output.
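For example, here is a minimal sketch of such a conversion, reusing the names from your test program (lt_conv1_output, conv_prim, resConv1, outputSize, outputStrides, output) and omitting error checks; I have not run it against your exact build:
```cpp
// Plain NCHW user layout for the destination, using the same sizes/strides as the test program.
dnnLayout_t lt_plain_output = NULL;
dnnLayoutCreate_F32(&lt_plain_output, dimension, outputSize, outputStrides);

dnnPrimitive_t cv_to_plain = NULL;
void* conv_out_internal = NULL;

// Only needed when the primitive's preferred layout differs from the plain one.
if (!dnnLayoutCompare_F32(lt_plain_output, lt_conv1_output)) {
    // Let MKL allocate the destination in its internal (possibly blocked) layout
    // and point the convolution at that buffer instead of &output[0].
    dnnAllocateBuffer_F32(&conv_out_internal, lt_conv1_output);
    resConv1[dnnResourceDst] = (float*)conv_out_internal;
    dnnConversionCreate_F32(&cv_to_plain, lt_conv1_output, lt_plain_output);
}

dnnExecute_F32(conv_prim, (void**)resConv1);

// Reorder the internal layout back into the plain NCHW vector before printing it.
if (cv_to_plain != NULL) {
    dnnConversionExecute_F32(cv_to_plain, conv_out_internal, &output[0]);
    dnnDelete_F32(cv_to_plain);
    dnnReleaseBuffer_F32(conv_out_internal);
}
dnnLayoutDelete_F32(lt_plain_output);
```
The same pattern (create a user layout, compare it with the primitive's layout, and convert only if they differ) applies to the input and filter resources as well.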
In any case, we recommend trying MKL-DNN instead of the MKL NN primitives, as it offers better functionality and performance.
Best Regards,
Ying