Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

MKL DNN performance

Daniel_H_
Beginner

I am seeing very poor performance with dnnConvolutionCreateForwardBias_F32.

My naive C++ implementation of a forward convolution is 4X faster than doing the same thing with MKL DNN. I was expecting it to be the other way around.

Visual Studio 2013
Use Intel MKL: Sequential
Release x64

I am testing on an Intel Core i7-4712HQ CPU.
I launch my test program with Ctrl+F5 from Visual Studio.

I get the following processing times:
naive took 3335 ms
mkl took 12012 ms

What have I missed?
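For context, the benchmark source is not included in the thread; a naive direct forward convolution with bias of the kind described above usually looks roughly like the sketch below, where the layout, the names, and the absence of padding are illustrative assumptions rather than the poster's actual code.

```cpp
#include <cstddef>
#include <vector>

// Naive direct forward convolution with bias, single image, CHW layout,
// unit stride, no padding. Purely illustrative; sizes and names are assumed.
void naive_conv_forward(const std::vector<float>& src,   // C  x H  x W
                        const std::vector<float>& filt,  // OC x C  x KH x KW
                        const std::vector<float>& bias,  // OC
                        std::vector<float>& dst,         // OC x OH x OW
                        std::size_t C, std::size_t H, std::size_t W,
                        std::size_t OC, std::size_t KH, std::size_t KW) {
    const std::size_t OH = H - KH + 1, OW = W - KW + 1;
    for (std::size_t oc = 0; oc < OC; ++oc)
        for (std::size_t oh = 0; oh < OH; ++oh)
            for (std::size_t ow = 0; ow < OW; ++ow) {
                float acc = bias[oc];
                for (std::size_t c = 0; c < C; ++c)
                    for (std::size_t kh = 0; kh < KH; ++kh)
                        for (std::size_t kw = 0; kw < KW; ++kw)
                            acc += src[(c * H + oh + kh) * W + ow + kw] *
                                   filt[((oc * C + c) * KH + kh) * KW + kw];
                dst[(oc * OH + oh) * OW + ow] = acc;
            }
}
```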

 

17 Replies
Daniel_H_
Beginner

I have improved the example code slightly, but the performance problem remains.
 

Zhen_Z_Intel
Employee

Dear customer,

Thanks for posting the issue to us. We will investigate it.

Best regards,
Fiona

Daniel_H_
Beginner

Do you have a time estimate for when I can expect some feedback?

 

Zhen_Z_Intel
Employee

Dear customer,

We are already investigating this issue. I will post the resolution once I get an update from the development team. Sorry to keep you waiting; I hope you understand. Thanks.

Best regards,
Fiona

 

Keren_Z_
Beginner

@Fiona Z. 

We came across a similar problem.

We have two machines: E5-2670v2 and KNL-7250.

MKL2017 shows high performance on KNL-7250 but very poor performance on E5-2670v2.

After some inspection, we figured out that the doit_fwd_par_avx512_mic function is called on MIC, but on the CPU parallel_RefDirectConv_Fwd is invoked, which I guess has not been optimized.

I hope the above information could help you.

Zhen_Z_Intel
Employee

Daniel H. wrote:

Do you have a time estimate for when I can expect some feedback?

 

Actually, your test matrix isn't well suited for performance testing. You are using an X*1*N input, which has N layers, but each layer is just a vector of length X. The filter for each layer and each group is also of size N, and N is only 15, so each call only computes on a vector of length 15. I am afraid DNN may not be needed here: DNN is intended for large computations per call, and it also carries extra overhead because the MKL kernel reorganizes the matrices. DNN performance would be better with a large 2D or 3D input, and the filter should not be so small either. It is also recommended to use parallel computation with multiple threads.

Best regards,
Fiona
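As a concrete illustration of the "large 2D or 3D input" case suggested above, a minimal forward-convolution-with-bias setup through the same C API could look like the sketch below. Every size and name here is an assumption for illustration (the thread's benchmark code is not posted), and the buffer-layout conversions used in Intel's official MKL DNN samples are omitted for brevity.

```cpp
#include <vector>
#include <mkl_dnn.h>

int main() {
    // Sizes are listed from the innermost dimension: {width, height, channels, batch}.
    // All of these values are illustrative assumptions.
    size_t srcSize[4]    = {224, 224, 3, 1};
    size_t dstSize[4]    = {224, 224, 64, 1};   // 3x3 kernel with padding 1 keeps 224x224
    size_t filterSize[4] = {3, 3, 3, 64};
    size_t strides[2]    = {1, 1};
    int    offset[2]     = {-1, -1};            // zero-padding of 1 on each spatial side

    dnnPrimitive_t conv = nullptr;
    if (dnnConvolutionCreateForwardBias_F32(&conv, nullptr,
            dnnAlgorithmConvolutionDirect, 4, srcSize, dstSize,
            filterSize, strides, offset, dnnBorderZeros) != E_SUCCESS)
        return 1;

    // Plain buffers for brevity; a real program should create the layouts the primitive
    // expects (dnnLayoutCreateFromPrimitive_F32) and convert user data into them.
    std::vector<float> src(224 * 224 * 3), filt(3 * 3 * 3 * 64), bias(64), dst(224 * 224 * 64);
    void* res[dnnResourceNumber] = {};
    res[dnnResourceSrc]    = src.data();
    res[dnnResourceFilter] = filt.data();
    res[dnnResourceBias]   = bias.data();
    res[dnnResourceDst]    = dst.data();

    dnnError_t err = dnnExecute_F32(conv, res);  // one forward pass
    dnnDelete_F32(conv);
    return err == E_SUCCESS ? 0 : 1;
}
```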

Zhen_Z_Intel
Employee

Keren Z. wrote:

@Fiona Z. 

We came across a similar problem.

We have two machines: E5-2670v2 and KNL-7250.

MKL2017 shows high performance on KNL-7250 but very poor performance on E5-2670v2.

After some inspection, we figured out that the doit_fwd_par_avx512_mic function is called on MIC, but on the CPU parallel_RefDirectConv_Fwd is invoked, which I guess has not been optimized.

I hope the above information could help you.

The highest instruction set supported by the E5-2670v2 is only AVX, and it has 10 cores, while KNL supports AVX-512 and has 68 cores. Even when you use the same function, with such different hardware it is quite normal for KNL to be faster than the Xeon E5-2670v2.

Daniel_H_
Beginner

Fiona Z. (Intel) wrote:

Actually, your test matrix isn't well suited for performance testing. You are using an X*1*N input, which has N layers, but each layer is just a vector of length X. The filter for each layer and each group is also of size N, and N is only 15, so each call only computes on a vector of length 15.

I don't think that is the problem. You can easily test it by setting N to a higher value.

Here are the results for some tests I did on my setup:

N = 15: Naive took 1029 ms, MKL took 4078 ms.
N = 25: Naive took 1742 ms, MKL took 6565 ms.
N = 35: Naive took 2521 ms, MKL took 9195 ms.
N = 45: Naive took 3294 ms, MKL took 11898 ms.
N = 499: Naive took 32899 ms, MKL took 125688 ms.

The naive implementation is consistently 4X faster.

/ Daniel

Keren_Z_
Beginner

Fiona Z. (Intel) wrote:

The highest instruction set supported by the E5-2670v2 is only AVX, and it has 10 cores, while KNL supports AVX-512 and has 68 cores. Even when you use the same function, with such different hardware it is quite normal for KNL to be faster than the Xeon E5-2670v2.

Thanks for your kind reply!

Here I am not comparing E5-2670v2 vs KNL. I mean that the relative performance on E5-2670v2 is extremely low: running the `s_score_example` you provide, it utilizes only 1%-2% of peak compute. On KNL it reaches above 20% of peak GFLOPS, and after adjusting the batch size it can reach over 85% of peak GFLOPS.

To figure out the difference, I used the `perf` tool to inspect what happens internally. On E5-2670v2, it invokes the `parallel_RefDirectConv_Fwd` function, which I guess is not optimized for the AVX instruction set. The same problem also occurs on E5-2680v3.

Ying_H_Intel
Employee

Hi Keren, Daniel,

Thank you for the reports. We can reproduce the issue. Our developers are working to optimize all problem sizes on an ongoing basis, and a new version is targeted to be available within two weeks. We will notify you when it is ready; then let's test the performance again.

Best Regards,

Ying

Zhen_Z_Intel
Employee

Dear customer,

This issue has been fixed in MKL2017u2, and I have verified that the performance has improved. Please upgrade to the latest version and check it again. Thank you for your post.

Best regards,
Fiona

Keren_Z_
Beginner

Fiona Z.

Thanks for your quick response!

I have a question about using the DNN API. How should inputOffset be set up for padding? For instance, the second layer of AlexNet has padding size 2; should I set inputOffset[2] = {-2, -2} and increase inputSize from 27 to 31?

Keren_Z_
Beginner

An additional question regarding performance. Yes, I have seen the performance boost (58% of peak on my AVX machine), but do you have any official benchmark reports?

Vadim_P_Intel
Employee

Hi Keren,

Input size defines the size of the physical spatial domain and is directly related to the amount of memory allocated for the tensor, so if you have an image of size 27x27, the input size should be 27x27. Offset defines the boundaries where the kernel is applied relative to the actual spatial domain and affects the output size. For the second layer in AlexNet the kernel size is 5x5, and padding is used to get an output of 27x27, the same as the input. To accomplish this in Intel MKL you need to set the size to 27x27 and use an offset of {-2,-2}.
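As a sketch of that conv2 configuration through the same C API, the sizes and offset would be set up as below; the batch size is assumed, and AlexNet's channel grouping is ignored here for simplicity.

```cpp
#include <cstddef>
#include <mkl_dnn.h>

int main() {
    const size_t batch = 1;                       // assumed batch size
    // {width, height, channels, batch}: the spatial size stays 27x27, not 31x31.
    size_t srcSize[4]    = {27, 27, 96, batch};   // AlexNet conv2 input
    size_t dstSize[4]    = {27, 27, 256, batch};  // (27 + 2*2 - 5)/1 + 1 = 27
    size_t filterSize[4] = {5, 5, 96, 256};       // 5x5 kernel (grouping ignored)
    size_t strides[2]    = {1, 1};
    int    inputOffset[2] = {-2, -2};             // padding of 2 expressed as a negative offset

    dnnPrimitive_t conv = nullptr;
    dnnError_t err = dnnConvolutionCreateForwardBias_F32(
        &conv, nullptr, dnnAlgorithmConvolutionDirect, 4,
        srcSize, dstSize, filterSize, strides, inputOffset, dnnBorderZeros);

    if (err == E_SUCCESS) dnnDelete_F32(conv);
    return err == E_SUCCESS ? 0 : 1;
}
```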

Vadim_P_Intel
Employee

Official benchmark reports are available for Intel-optimized Caffe. You can also use the DeepBench benchmark to measure the performance of convolutions on your system of interest.

Keren_Z_
Beginner

Vadim Pirogov (Intel) wrote:

Official benchmark reports are available for Intel-optimized Caffe. You can also use the DeepBench benchmark to measure the performance of convolutions on your system of interest.

Thanks!

These links are really helpful!

Daniel_H_
Beginner

Fiona Z. (Intel) wrote:

This issue has been fixed in MKL2017u2, and I have verified that the performance has improved. Please upgrade to the latest version and check it again. Thank you for your post.

Great!

MKL is now 8X faster than the naive implementation on my computer.

/ Daniel

 
