Liu__Chao
Beginner
347 Views

Tensorflow performance w/ MKL

Hi,

I am trying to use tensorflow-1.8.0 compiled with MKL-2018.2.199 enabled. I use it to run mobilenet image classification and object detection models. I compared the performance with and without MKL. In most cases the MKL build is much slower. I am posting here to find out whether I did something wrong or whether this is what I should expect.

All of the following numbers were collected by running the corresponding inference models on an i7-5557U CPU. I also ran the tests on other CPUs and got similar results. NOTE: the times in 1-4 are per 16 frames; the times in 5-6 are per 320x180 frame.

                                  w/ MKL     w/o MKL
1. Mobilenet_v2_1_4_224           1463 ms    2486 ms   (this is good)
2. Mobilenet_v2_1_0_96             481 ms     276 ms   (~1.7x the time)
3. Mobilenet_v1_1_0_224_quant      903 ms     664 ms   (~1.4x the time)
4. Mobilenet_v1_1_0_128_quant      469 ms     233 ms   (~2x the time)
5. ssd_mobilenet_v1_coco           142 ms     116 ms
6. ssd_mobilenet_v2_coco           212 ms     130 ms
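
The relative cost of the MKL build can be worked out directly from the timings in this list. Below is a minimal Python sketch that uses only the numbers posted above; the `timings` dictionary and the `slowdown` helper are names made up for illustration, not part of any tool used in this thread.

```python
# Timings copied from the list above, in ms: (with MKL, without MKL).
timings = {
    "Mobilenet_v2_1_4_224": (1463, 2486),
    "Mobilenet_v2_1_0_96": (481, 276),
    "Mobilenet_v1_1_0_224_quant": (903, 664),
    "Mobilenet_v1_1_0_128_quant": (469, 233),
    "ssd_mobilenet_v1_coco": (142, 116),
    "ssd_mobilenet_v2_coco": (212, 130),
}


def slowdown(with_mkl_ms, without_mkl_ms):
    """Time with MKL divided by time without MKL (> 1 means MKL is slower)."""
    return with_mkl_ms / without_mkl_ms


for model, (mkl_ms, plain_ms) in timings.items():
    print(f"{model:28s} {slowdown(mkl_ms, plain_ms):.2f}x")
```

A ratio below 1 means the MKL build is faster, which in this data set only happens for Mobilenet_v2_1_4_224.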

I used "-DINTEL_MKL -DINTEL_MKL_ML -DEIGEN_USE_MKL_ALL -DMKL_DIRECT_CALL -march=native -mtune=native" to compile tensorflow.

You can find the code here. The benchmark data is here and here.

5 Replies
Ying_H_Intel
Employee

Hi Yjl,

Thank you for the report. As I understand it, most of the tests show performance issues with MKL. Let's focus on three of them, for example 1, 3, and 6.

1. Could you please run export MKL_VERBOSE=1, then run them and copy the results here?
2. Then run export MKLDNN_VERBOSE=1, run them again, and copy the results here.
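
If it is easier than exporting the variables by hand, the two runs can be scripted. This is only a rough Python sketch: the `run_with_verbose` helper is a hypothetical name, and the command you pass in (for example the path to your benchmark binary and its flags) is whatever you already use locally.

```python
import os
import subprocess
import sys


def run_with_verbose(cmd, var):
    """Run cmd with the environment variable `var` set to "1".

    Returns everything the process wrote to stdout and stderr as one string,
    so the log can be pasted into a reply.
    """
    env = dict(os.environ)
    env[var] = "1"
    result = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

Calling it once with "MKL_VERBOSE" and once with "MKLDNN_VERBOSE" for the same command gives the two logs requested above.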

Then let's consider the build process and configuration.

1. How do you compile tensorflow? Could you elaborate on the steps of your tensorflow build? For example, I suppose you are using the GNU GCC compiler, right?

https://software.intel.com/en-us/articles/intel-optimized-tensorflow-installation-guide
https://ai.intel.com/tensorflow-optimizations-intel-xeon-scalable-processor/

2. How do you run the benchmark? For example, how do you configure the threading settings?
https://www.tensorflow.org/performance/performance_guide

Best Regards,

Ying

Liu__Chao
Beginner

Hi Ying, thanks for the quick response, appreciate that.

I used one thread to run the models:        
        tensorflow::SessionOptions sess_opts;
        sess_opts.config.set_intra_op_parallelism_threads(1);
        sess_opts.config.set_inter_op_parallelism_threads(1);

I used GCC to compile tensorflow. I used tensorflow/contrib/makefile with some modifications, mainly adding these defines "-DINTEL_MKL -DINTEL_MKL_ML -DEIGEN_USE_MKL_ALL -DMKL_DIRECT_CAL -DEIGEN_DONT_PARALLELIZE"
I didn't use bazel because I need a static library.

I just realized that -DINTEL_MKL_ML causes it to use the implementation from mkl instead of mkldnn, which tensorflow is supposed to use. I removed it and got much worse performance on Mobilenet_v2_1_4_22 (I think it's mainly caused by _MklFusedBatchNorm; the mkldnn version is far too slow).

Anyway, I ran tensorflow benchmark_model to get the logs for you:
benchmark_model --graph=testdata/mobilenet_v1_1.0_224_quant_frozen.pb --show_flops --input_layer=input --input_layer_type=float --input_layer_shape=1,224,224,3 --output_layer=MobilenetV1/Predictions/Reshape_1 --num_threads=1
benchmark_model --graph=testdata/mobilenet_v2_1.4_224_frozen.pb --show_flops --input_layer=input --input_layer_type=float --input_layer_shape=1,224,224,3 --output_layer=MobilenetV2/Predictions/Reshape_1 --num_threads=1
benchmark_model --graph=testdata/ssd_mobilenet_v2_coco_2018_03_29_frozen.pb --show_flops --input_layer=image_tensor --input_layer_type=uint8 --input_layer_shape=1,1920,1080,3 --output_layer=num_detections,detection_classes,detection_scores,detection_boxes --num_threads=1

It looks to me like the main culprit is the Conv2D op (replaced by _MklConv2D and _MklConv2DWithBias when using MKL?)
                              Conv2D        _MklConv2D    _MklConv2DWithBias
mobilenet_v1_1.0_224_quant     19.303 ms     24.379 ms       7.905 ms
mobilenet_v2_1.4_224           24.969 ms     -              41.942 ms
ssd_mobilenet_v2_coco         108.692 ms     48.872 ms     143.936 ms

Liu__Chao
Beginner

More information: 
I don't think it's related to how I built tensorflow. I ran
bazel run --config=mkl --config=opt --config=monolithic //tensorflow/tools/benchmark:benchmark_model
and got similar results. The interesting thing is that

OMP_NUM_THREADS=1 bazel-bin/tensorflow/tools/benchmark/benchmark_model
is two times faster than
bazel-bin/tensorflow/tools/benchmark/benchmark_model

Again, all these tests were run on an i7-5557U CPU.
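
For anyone who wants to reproduce the OMP_NUM_THREADS comparison, here is a rough Python sketch. The `time_command` and `compare_omp` helpers are hypothetical names, and the command is whatever benchmark invocation you use. Note that this measures the wall-clock time of the whole process, including startup and graph loading, so the per-run timings printed by benchmark_model itself are the more precise numbers.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, extra_env=None):
    """Run cmd once and return its wall-clock duration in seconds."""
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    subprocess.run(
        cmd,
        env=env,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def compare_omp(cmd):
    """Time cmd with the inherited environment, then with OMP_NUM_THREADS=1."""
    default_s = time_command(cmd)
    single_s = time_command(cmd, {"OMP_NUM_THREADS": "1"})
    return default_s, single_s
```

Passing the benchmark_model command line as a list of arguments to `compare_omp` returns the two durations, which can then be compared directly.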

Ying_H_Intel
Employee

Hi Yjl,

Thank you for the details. From a quick review, it seems the build was mainly mkl-dnn enabled and used 1 thread. We will look into it.

And could you please submit your issue to https://github.com/intel/mkl-dnn/issues , where our developers can check the problem directly in a ready environment?

Best Regards,
Ying

Liu__Chao
Beginner

Filed https://github.com/intel/mkl-dnn/issues/234.

More findings:

1. _MklFusedBatchNorm is slower than FusedBatchNorm: 102 ms vs 88 ms

2. _MklConv2DWithBias is slower than Conv2D + BiasAdd: 42 ms vs 25 + 3 ms

3. MKL introduces several extra operations that are pretty expensive, such as _MklInputConversion and _MklToTf
