Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

MKL DNN performance

Daniel_H_
Beginner
4,468 Views

I am seeing very poor performance with dnnConvolutionCreateForwardBias_F32.

My naive C++ implementation of a forward convolution is 4X faster than when I try to do the same thing with MKL DNN. I was expecting it to be the other way around.
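For readers following along, a naive forward convolution with bias of the kind described here can be sketched as below. This is not the original test program, which is not reproduced in the thread. The function name `naiveConvForwardBias`, the row-major layouts, unit stride, and the absence of padding are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the original test program.
// Assumed row-major layouts:
//   src    [inC][inH][inW]
//   filter [outC][inC][kH][kW]
//   bias   [outC]
//   result [outC][outH][outW], with outH = inH - kH + 1, outW = inW - kW + 1
// Stride is 1 and there is no padding. The filter is not flipped, as is
// usual in deep-learning code.
std::vector<float> naiveConvForwardBias(const std::vector<float>& src,
                                        const std::vector<float>& filter,
                                        const std::vector<float>& bias,
                                        std::size_t inC, std::size_t inH, std::size_t inW,
                                        std::size_t outC, std::size_t kH, std::size_t kW) {
    assert(kH >= 1 && kW >= 1 && kH <= inH && kW <= inW);
    assert(src.size() == inC * inH * inW);
    assert(filter.size() == outC * inC * kH * kW);
    assert(bias.size() == outC);

    const std::size_t outH = inH - kH + 1;
    const std::size_t outW = inW - kW + 1;
    std::vector<float> dst(outC * outH * outW);

    for (std::size_t oc = 0; oc < outC; ++oc) {
        for (std::size_t oy = 0; oy < outH; ++oy) {
            for (std::size_t ox = 0; ox < outW; ++ox) {
                float acc = bias[oc];  // start each output from its bias
                for (std::size_t ic = 0; ic < inC; ++ic) {
                    for (std::size_t ky = 0; ky < kH; ++ky) {
                        for (std::size_t kx = 0; kx < kW; ++kx) {
                            const float s = src[(ic * inH + oy + ky) * inW + ox + kx];
                            const float f = filter[((oc * inC + ic) * kH + ky) * kW + kx];
                            acc += s * f;
                        }
                    }
                }
                dst[(oc * outH + oy) * outW + ox] = acc;
            }
        }
    }
    return dst;
}
```

Timing a loop nest like this against the MKL DNN primitive on identical tensors is the comparison being reported below.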

Visual Studio 2013
Use Intel MKL: Sequential
Release x64

I am testing on an Intel Core i7-4712HQ CPU.
I launch my test program with Ctrl+F5 from Visual Studio.

I get the following processing times:
naive took 3335 ms
mkl took 12012 ms

What have I missed?

 

1 Solution
Zhen_Z_Intel
Employee

Dear customer,

This issue has been fixed in MKL2017u2, and I have verified that performance has improved. Please upgrade to the latest version and check it. Thank you for posting.

Best regards,
Fiona

17 Replies
Daniel_H_
Beginner

I have improved the example code slightly, but the performance problem remains.
 

Zhen_Z_Intel
Employee

Dear customer,

Thank you for reporting this issue. We will investigate it.

Best regards,
Fiona

Daniel_H_
Beginner

Do you have a time estimate for when I can expect some feedback?

 

Zhen_Z_Intel
Employee

Dear customer,

We have already investigated this issue. I will post the resolution once I get an update from the development team. Sorry to keep you waiting; I hope you understand. Thanks.

Best regards,
Fiona

 

Keren_Z_
Beginner

@Fiona Z. 

We came across a similar problem.

We have two machines: E5-2670v2 and KNL-7250.

MKL2017 performs well on the KNL-7250 but very poorly on the E5-2670v2.

After some inspection, we found that the doit_fwd_par_avx512_mic function is called on MIC, whereas on the CPU parallel_RefDirectConv_Fwd is invoked, which I guess has not been optimized.

I hope the above information could help you.

Zhen_Z_Intel
Employee

Daniel H. wrote:

Do you have a time estimate for when I can expect some feedback?

 

Actually, your test matrix isn't good for performance testing. You are using an X*1*N input, which has N layers, but each layer is really just a vector of length X. The filter for each layer and each group is actually of size N, and N is only 15. So each computation works on a vector of length only 15... I am afraid DNN may not be needed here, because DNN is intended for large computations. In addition, DNN incurs extra overhead because the MKL kernels reorganize the matrix layout. DNN performance should be better with a large 2D or 3D input and a filter that is not so small. It is also recommended to run the calculation in parallel with multiple threads.

Best regards,
Fiona

Zhen_Z_Intel
Employee

Keren Z. wrote:

@Fiona Z. 

We came across a similar problem.

We have two machines: E5-2670v2 and KNL-7250.

MKL2017 performs well on the KNL-7250 but very poorly on the E5-2670v2.

After some inspection, we found that the doit_fwd_par_avx512_mic function is called on MIC, whereas on the CPU parallel_RefDirectConv_Fwd is invoked, which I guess has not been optimized.

I hope the above information could help you.

The E5-2670v2 supports only AVX and has 10 cores, whereas KNL supports AVX-512 and has 68 cores. Even though you are calling the same function, the hardware is different, so it is normal for KNL to be faster than the Xeon E5-2670v2.

Daniel_H_
Beginner

Fiona Z. (Intel) wrote:

Actually, your test matrix isn't good for performance testing. You are using an X*1*N input, which has N layers, but each layer is really just a vector of length X. The filter for each layer and each group is actually of size N, and N is only 15. So each computation works on a vector of length only 15...

I don't think that is the problem. You can easily test it by setting N to a higher value.

Here are the results for some tests I did on my setup:

N = 15: Naive took 1029ms, MKL took 4078ms.

N = 25: Naive took 1742ms, MKL took 6565ms.

N = 35: Naive took 2521ms, MKL took 9195ms.

N = 45: Naive took 3294ms, MKL took 11898ms.

N = 499: Naive took 32899ms, MKL took 125688ms.

The naive implementation is consistently 4X faster.
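The ratio can be checked directly from the numbers above. The sketch below only restates the reported timings; the names `Timing`, `reportedTimings`, and `slowdown` are made up for illustration. Dividing each MKL time by the matching naive time gives a slowdown between roughly 3.6X and 4.0X for every N, so the gap does not shrink as N grows.

```cpp
#include <cassert>
#include <vector>

// One measurement from the post: problem size N and both wall-clock times.
struct Timing {
    int n;
    double naiveMs;
    double mklMs;
};

// Timings copied from the list above (hypothetical helper name).
inline std::vector<Timing> reportedTimings() {
    return {
        {15, 1029.0, 4078.0},
        {25, 1742.0, 6565.0},
        {35, 2521.0, 9195.0},
        {45, 3294.0, 11898.0},
        {499, 32899.0, 125688.0},
    };
}

// How many times slower MKL was than the naive code for one measurement.
inline double slowdown(const Timing& t) {
    assert(t.naiveMs > 0.0);
    return t.mklMs / t.naiveMs;
}
```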

/ Daniel

Keren_Z_
Beginner

Fiona Z. (Intel) wrote:


Keren Z. wrote:

 

@Fiona Z. 

We came across a similar problem.

We have two machines: E5-2670v2 and KNL-7250.

MKL2017 performs well on the KNL-7250 but very poorly on the E5-2670v2.

After some inspection, we found that the doit_fwd_par_avx512_mic function is called on MIC, whereas on the CPU parallel_RefDirectConv_Fwd is invoked, which I guess has not been optimized.

I hope the above information could help you.

 

 

The E5-2670v2 supports only AVX and has 10 cores, whereas KNL supports AVX-512 and has 68 cores. Even though you are calling the same function, the hardware is different, so it is normal for KNL to be faster than the Xeon E5-2670v2.

Thanks for your kind reply!

I am not comparing the E5-2670v2 against KNL here. My point is that performance on the E5-2670v2 is extremely low relative to its own peak: running the `s_score_example` that you provide, it uses only 1%-2% of peak compute. On KNL, the same test reaches above 20% of peak GFLOPS, and over 85% after adjusting the batch size.

To figure out the difference, I used the `perf` tool to inspect what happens internally. On the E5-2670v2, it invokes the `parallel_RefDirectConv_Fwd` function, which I guess is not optimized for the AVX instruction set. The same problem also occurs on the E5-2680v3.
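To make the "percent of peak" figures concrete, here is a minimal sketch of how such a number is usually derived. The function names are hypothetical. The hardware figures in the usage note are assumptions, not measurements from this thread: 10 cores (as stated earlier in the thread), a 2.5 GHz base clock, and 16 single-precision FLOPs per cycle per core (one 8-wide AVX add plus one 8-wide AVX multiply, no FMA).

```cpp
#include <cassert>

// Theoretical peak in GFLOPS = cores * clock (GHz) * FLOPs per cycle per core.
inline double peakGflops(int cores, double ghz, int flopsPerCycle) {
    assert(cores > 0 && ghz > 0.0 && flopsPerCycle > 0);
    return cores * ghz * flopsPerCycle;
}

// Fraction of the theoretical peak that a measured rate achieves.
inline double utilization(double achievedGflops, double peak) {
    assert(peak > 0.0);
    return achievedGflops / peak;
}
```

Under those assumptions, `peakGflops(10, 2.5, 16)` gives 400 GFLOPS for the E5-2670v2, so 1%-2% utilization would correspond to only about 4 to 8 GFLOPS.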

Ying_H_Intel
Moderator

Hi Keren, Daniel,

Thank you for the reports. We can reproduce the issue. Our developers aim to keep optimizing for all kinds of sizes. A new version is targeted to be available in two weeks. We will notify you when it is ready, and then we can test the performance again.

Best Regards,

Ying

Zhen_Z_Intel
Employee

Dear customer,

This issue has been fixed in MKL2017u2, and I have verified that performance has improved. Please upgrade to the latest version and check it. Thank you for posting.

Best regards,
Fiona

Keren_Z_
Beginner

Fiona Z.

Thanks for your quick response!

I have a question about using the DNN API. How should I set inputOffset for padding? For instance, the second layer of AlexNet has a padding size of 2. Should I set inputOffset[2] = {-2, -2} and increase the inputSize from 27 to 31?

Keren_Z_
Beginner

An additional question regarding performance. Yes, I have seen the performance boost: 58% of peak performance on my AVX machine. But do you have any official benchmark reports?

Vadim_P_Intel
Employee

Hi Keren,

Input size defines the size of the physical spatial domain and is directly related to the amount of memory allocated for the tensor. So if you have an image of size 27x27, the input size should be 27x27. Offset defines the boundaries where the kernel is applied relative to the actual spatial domain, and it affects the output size. In the second layer of AlexNet, the kernel size is 5x5 and padding is used to get an output of 27x27, the same as the input. To accomplish this in Intel MKL, set the size to 27x27 and use an offset of {-2,-2}.
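The arithmetic behind this can be sketched as follows. The helper `convOutputSize` is a hypothetical name and does not call Intel MKL; it assumes symmetric padding equal to the negated offset. Stride 1 in the usage note is also an assumption, chosen because it is consistent with the 27x27 output described above.

```cpp
#include <cassert>
#include <cstddef>

// Output length along one spatial axis for a convolution whose first kernel
// application starts at `offset` (a negative offset means padding before the
// data). Assumes the same amount of padding on both sides.
inline std::size_t convOutputSize(std::size_t inputSize, std::size_t kernelSize,
                                  std::size_t stride, int offset) {
    assert(offset <= 0 && stride >= 1 && kernelSize >= 1);
    const std::size_t pad = static_cast<std::size_t>(-offset);
    assert(inputSize + 2 * pad >= kernelSize);
    return (inputSize + 2 * pad - kernelSize) / stride + 1;
}
```

With these assumptions, `convOutputSize(27, 5, 1, -2)` is 27, matching the layer described above, while `convOutputSize(27, 5, 1, 0)` is 23. Growing the input to 31 with no offset also yields 27, but that would describe a physically larger 31x31 tensor, which is why the size stays 27x27 and the padding is expressed through the offset.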

Vadim_P_Intel
Employee

Official benchmark reports are available for Intel optimized Caffe. You can also use the DeepBench benchmark to measure the performance of convolutions on your system of interest.

Keren_Z_
Beginner

Vadim Pirogov (Intel) wrote:

Official benchmark reports are available for Intel optimized Caffe. You can also use the DeepBench benchmark to measure the performance of convolutions on your system of interest.

Thanks!

These links are really helpful!

Daniel_H_
Beginner

Fiona Z. (Intel) wrote:

This issue has been fixed in MKL2017u2, and I have verified that performance has improved. Please upgrade to the latest version and check it. Thank you for posting.

Great!

MKL is now 8X faster than the naive implementation on my computer.

/ Daniel

 
