Intel® oneAPI Math Kernel Library

Tensorflow performance w/ MKL

Liu__Chao — Sun, 06 May 2018 06:58:58 GMT

Hi,

I am trying to use tensorflow-1.8.0 compiled with MKL-2018.2.199 enabled. I use it to run MobileNet image classification and object detection models. I compared the performance with and without MKL, and in most cases the MKL build is much slower. I am posting here to find out whether I did something wrong or whether this is what I should expect.

All of the following numbers were collected by running the corresponding inference models on an i7-5557U CPU. I also ran the tests on other CPUs and got similar results. NOTE: the times in 1-4 are per 16 frames; the times in 5-6 are per 320x180 frame.

1. Mobilenet_v2_1_4_224: w/ MKL 1463 ms, w/o MKL 2486 ms (this is good)

2. Mobilenet_v2_1_0_96: w/ MKL 481 ms, w/o MKL 276 ms (~74% slower!)

3. Mobilenet_v1_1_0_224_quant: w/ MKL 903 ms, w/o MKL 664 ms (~36% slower)

4. Mobilenet_v1_1_0_128_quant: w/ MKL 469 ms, w/o MKL 233 ms (~2x the time)

5. ssd_mobilenet_v1_coco: w/ MKL 142 ms, w/o MKL 116 ms

6. ssd_mobilenet_v2_coco: w/ MKL 212 ms, w/o MKL 130 ms
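
The relative slowdown for each model can be computed directly from the timings listed above. The following is a small sketch of that arithmetic; the dictionary and the function name `mkl_ratio` are illustrative only and are not part of the original benchmark code.

```python
# Timings copied from the list above: (with MKL, without MKL), in milliseconds.
timings_ms = {
    "Mobilenet_v2_1_4_224": (1463, 2486),
    "Mobilenet_v2_1_0_96": (481, 276),
    "Mobilenet_v1_1_0_224_quant": (903, 664),
    "Mobilenet_v1_1_0_128_quant": (469, 233),
    "ssd_mobilenet_v1_coco": (142, 116),
    "ssd_mobilenet_v2_coco": (212, 130),
}


def mkl_ratio(with_mkl_ms, without_mkl_ms):
    """Return MKL time divided by non-MKL time; a value above 1 means MKL is slower."""
    return with_mkl_ms / without_mkl_ms


ratios = {name: mkl_ratio(w, wo) for name, (w, wo) in timings_ms.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the non-MKL time")
```

Only the first model comes out below 1.0 (about 0.59x); the other five range from roughly 1.22x to 2.01x.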

I compiled tensorflow with "-DINTEL_MKL -DINTEL_MKL_ML -DEIGEN_USE_MKL_ALL -DMKL_DIRECT_CALL -march=native -mtune=native".

You can find the code here. The benchmark data is here and here.

Ying_H_Intel — Mon, 07 May 2018 07:03:35 GMT

Hi Yjl,

Thank you for the report. As I understand it, most of the tests show performance issues with MKL. Let's focus on three of them, for example 1, 3, and 6.

1. Could you please run export MKL_VERBOSE=1, run the tests, and copy the output here?
2. Likewise, run export MKLDNN_VERBOSE=1, run the tests, and copy the output here.
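
One way to collect both verbose logs in a single run is to set the two variables for a child process and redirect its output to a file. This is only a sketch: the helper names `verbose_env` and `run_with_verbose` and the log path are hypothetical, and the command to run is whatever benchmark binary you are testing.

```python
import os
import subprocess


def verbose_env(base_env=None):
    """Return a copy of the environment with both verbose switches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_VERBOSE"] = "1"
    env["MKLDNN_VERBOSE"] = "1"
    return env


def run_with_verbose(cmd, log_path):
    """Run cmd with verbose logging on and write stdout and stderr to log_path."""
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, env=verbose_env(),
                                stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

For example, `run_with_verbose(["./my_benchmark"], "mkl_verbose.log")` would leave the combined output in `mkl_verbose.log`, where `./my_benchmark` stands in for the actual test program.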

Then let's consider the build process and configuration.

1. How do you compile tensorflow and run the benchmark? Could you elaborate on the steps of your tensorflow build? For example, I suppose you are using the GNU GCC compiler, right?

https://software.intel.com/en-us/articles/intel-optimized-tensorflow-installation-guide
https://ai.intel.com/tensorflow-optimizations-intel-xeon-scalable-processor/

2. How do you run the benchmark? For example, how do you configure threading?
https://www.tensorflow.org/performance/performance_guide

Best Regards

Ying

Liu__Chao — Mon, 07 May 2018 23:42:16 GMT

Hi Ying, thanks for the quick response, appreciate that.

I used one thread to run the models:
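// SessionOptions is declared in tensorflow/core/public/session_options.h.
// Limit both the intra-op and inter-op thread pools to a single thread.
tensorflow::SessionOptions sess_opts;
sess_opts.config.set_intra_op_parallelism_threads(1);
sess_opts.config.set_inter_op_parallelism_threads(1);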

I used GCC to compile tensorflow. I used tensorflow/contrib/makefile with some modifications, mainly adding these defines: "-DINTEL_MKL -DINTEL_MKL_ML -DEIGEN_USE_MKL_ALL -DMKL_DIRECT_CALL -DEIGEN_DONT_PARALLELIZE".
I didn't use bazel because I need a static library.

I just realized that -DINTEL_MKL_ML makes tensorflow use the MKL implementation instead of MKL-DNN, which tensorflow is supposed to use. I removed it and got much worse performance on Mobilenet_v2_1_4_224 (I think this is mainly caused by _MklFusedBatchNorm; the MKL-DNN version is far too slow).

Anyway, I ran tensorflow's benchmark_model to get the logs for you:
- benchmark_model --graph=testdata/mobilenet_v1_1.0_224_quant_frozen.pb --show_flops --input_layer=input --input_layer_type=float --input_layer_shape=1,224,224,3 --output_layer=MobilenetV1/Predictions/Reshape_1 --num_threads=1
- benchmark_model --graph=testdata/mobilenet_v2_1.4_224_frozen.pb --show_flops --input_layer=input --input_layer_type=float --input_layer_shape=1,224,224,3 --output_layer=MobilenetV2/Predictions/Reshape_1 --num_threads=1
- benchmark_model --graph=testdata/ssd_mobilenet_v2_coco_2018_03_29_frozen.pb --show_flops --input_layer=image_tensor --input_layer_type=uint8 --input_layer_shape=1,1920,1080,3 --output_layer=num_detections,detection_classes,detection_scores,detection_boxes --num_threads=1

It looks to me like the main culprit is the Conv2D op (replaced by _MklConv2D and _MklConv2DWithBias when MKL is used?):

Model                        Conv2D       _MklConv2D   _MklConv2DWithBias
mobilenet_v1_1.0_224_quant   19.303 ms    24.379 ms    7.905 ms
mobilenet_v2_1.4_224         24.969 ms    -            41.942 ms
ssd_mobilenet_v2_coco        108.692 ms   48.872 ms    143.936 ms

Liu__Chao — Tue, 08 May 2018 19:10:55 GMT

More information:
I don't think this is related to how I built tensorflow. I ran
bazel run --config=mkl --config=opt --config=monolithic //tensorflow/tools/benchmark:benchmark_model
and got similar results. Interestingly,
OMP_NUM_THREADS=1 bazel-bin/tensorflow/tools/benchmark/benchmark_model
is two times faster than
bazel-bin/tensorflow/tools/benchmark/benchmark_model

Again, all these tests were run on an i7-5557U CPU.
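
A comparison like the one above (default OpenMP thread count versus OMP_NUM_THREADS=1) can be scripted so both runs use the same command line. The sketch below is an assumption about how one might do that; `omp_env`, `time_command`, and `compare_omp_threads` are hypothetical helper names. It measures wall-clock time for the whole process, including startup and graph loading, so the per-run timings printed by benchmark_model itself remain the better number to report.

```python
import os
import subprocess
import time


def omp_env(omp_threads=None, base_env=None):
    """Copy the environment, clearing OMP_NUM_THREADS or setting it explicitly."""
    env = dict(os.environ if base_env is None else base_env)
    env.pop("OMP_NUM_THREADS", None)  # start from the OpenMP runtime default
    if omp_threads is not None:
        env["OMP_NUM_THREADS"] = str(omp_threads)
    return env


def time_command(cmd, omp_threads=None):
    """Return the wall-clock seconds taken by one run of cmd."""
    start = time.perf_counter()
    subprocess.run(cmd, env=omp_env(omp_threads), check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def compare_omp_threads(cmd):
    """Time cmd once with the default thread count and once with a single thread."""
    return {
        "default": time_command(cmd),
        "omp_1": time_command(cmd, omp_threads=1),
    }
```

Calling `compare_omp_threads` with the benchmark_model command line as a list of arguments returns the two timings in seconds.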

Ying_H_Intel — Wed, 09 May 2018 01:32:48 GMT

Hi Yjl,

Thank you for the details. From a quick review, it seems the build mainly had MKL-DNN enabled and used 1 thread. We will look into it.

Could you also please submit your issue to https://github.com/intel/mkl-dnn/issues? Our developers can check the problem there directly with a ready environment.

Best Regards,
Ying

Liu__Chao — Wed, 09 May 2018 23:00:28 GMT

Filed https://github.com/intel/mkl-dnn/issues/234.

More findings:

1. _MklFusedBatchNorm is slower than FusedBatchNorm: 102 ms vs 88 ms

2. _MklConv2DWithBias is slower than Conv2D + BiasAdd: 42 ms vs 25 + 3 ms

3. MKL introduces several extra operations that are fairly expensive, such as _MklInputConversion and _MklToTf
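
To put the first two findings in percentage terms, the per-op overhead can be worked out from the quoted timings. This is just the arithmetic, with `slowdown_pct` as an illustrative helper name; it does not account for the extra conversion ops mentioned in item 3, whose timings are not given here.

```python
def slowdown_pct(mkl_ms, baseline_ms):
    """Return how much slower the MKL op is than the baseline, in percent."""
    return (mkl_ms / baseline_ms - 1.0) * 100.0


# _MklFusedBatchNorm (102 ms) vs FusedBatchNorm (88 ms)
batch_norm_pct = slowdown_pct(102, 88)

# _MklConv2DWithBias (42 ms) vs Conv2D + BiasAdd (25 ms + 3 ms)
conv_bias_pct = slowdown_pct(42, 25 + 3)

print(f"FusedBatchNorm: {batch_norm_pct:.1f}% slower with MKL")
print(f"Conv2D + BiasAdd: {conv_bias_pct:.1f}% slower with MKL")
```

This gives roughly 16% for the batch norm op and 50% for the fused convolution.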