Solved: vzMul performance

ImpPhil · ‎12-10-2022

I've just discovered MKL and have created a few benchmarks in C#/.NET7 comparing vector operations in C# using a simple loop with the equivalent operation using MKL.

Using 8700k, .NET 7 Visual Studio 17.4.2

Nuget package "intelmkl.redist.win-x64" Version="2022.2.1.19754"

400 element vector

Initially things looked good:

multiply double: vsMul 2.5* faster

Cos: 3* faster

Asin: 6* faster

but then:

multiple complex: 1/2 speed

I then found this post: // https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Sequential-VML-performance/m-p/780873#M1461

and assumed that was the answer. However testing with performance mode only achieved speed parity with a C# loop for complex multiplication.

Should I expect vzMul to run faster than the equivalent for loop in C#? If not why not?

ShanmukhS_Intel · ‎02-07-2023

Hi Philip,

A gentle reminder:

Has the information provided helped? Could you please let us know if we could close this case on our end?

Best Regards,

Shanmukh.SS

View solution in original post

ShanmukhS_Intel · ‎12-12-2022

Hi Phillip,

Thanks for posting on Intel communities.

Based on our existing solution, we need to install extra components which takes a bit of time to check and try reproducing at our end. Post-installation, we will update you on the findings.

Best Regards,

Shanmukh.SS

ImpPhil · ‎12-12-2022

Ok, no problem. You need VS 17.4.2, with probably a basic workload.

I've put my benchmark code here; PhilPJL/MklBenchmark (github.com)

In general I notice that performance is better for larger vectors.

Trig functions are faster than a C# loop and much faster for large vectors >10k.

Multiply double is faster for medium vectors say 400 and possibly slower for small ~40

Multiply complex is slower unless using performance mode or large vectors.

ShanmukhS_Intel · ‎12-20-2022

Hi Philip,

We have set up the environment. However, we were unable to build the source code successfully, which might be due to a mismatch in environment support i.e. .NET 6.0 VS17.2.2 vs target sdk .NET 7.0 VS17.4.2. We will get back to you soon with an update.

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎12-20-2022

Hi Philip,

We have set up the environment. However, we were unable to build the source code successfully, which might be due to a mismatch in environment support i.e. .NET 6.0 VS17.2.2 vs target sdk .NET 7.0 VS17.4.2. We will get back to you soon with an update.

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎12-27-2022

Hi Philip Lee,

As the latest MKL NuGet package is available, Could you please update to the latest NuGet package file(2023.0.0.25930) and let us know if the issue persists?

Best Regards,

Shanmukh.SS

ImpPhil · ‎12-27-2022

Hi, yes it does persist with a vector size of 400. I've updated the repo with the changes.

Also high vs low accuracy makes little difference.

I think your version numbers don't conform to semantic versioning standards and are confusing Visual Studio

ShanmukhS_Intel · ‎01-03-2023

Hi Philip,

Thanks a lot for the update.

Could you please let us know if we could merge the below 2 requests into this Case if it's feasible to you? (as the other 2 are too with a similar code base) So that you could follow up on the issue better?

Best Regards,

Shanmukh.SS

ImpPhil · ‎01-03-2023

Ok go ahead (although they are separate issues in my view).

ShanmukhS_Intel · ‎01-10-2023

Hi Philip,

We are discussing your issue internally. We will get back to you with an update soon!

Regrets for the delay in response.

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎01-17-2023

Hi Philip,

Should I expect vzMul to run faster than the equivalent for loop in C#?

>>The VM implementation is usually more accurate than compiler implementations of this operation (compiler implementations are often 'naive', and may cause spurious exceptions). The additional accuracy in the VM implementation comes with the price of somewhat lower performance, at least for certain vector lengths.

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎01-23-2023

Hi Philip,

A gentle reminder:

We have heard back from you for a while. Could you please let us know if you need any other information or let us know if we could close this case at our end?

Best Regards,

Shanmukh.SS

ImpPhil · ‎01-24-2023

This is the implementation from disassembling C#:

public static Complex operator *(Complex left, Complex right)
{
// Multiplication: (a + bi)(c + di) = (ac -bd) + (bc + ad)i
double result_realpart = (left.m_real * right.m_real) - (left.m_imaginary * right.m_imaginary);
double result_imaginarypart = (left.m_imaginary * right.m_real) + (left.m_real * right.m_imaginary);
return new Complex(result_realpart, result_imaginarypart);
}

Are you saying that your C code does something in addition to that?

ShanmukhS_Intel · ‎01-31-2023

Hi Philip,

HA/LA zmul is much more accurate than compiler implementation, especially in cases close to cancellation. Moreover, zmul properly reports all overflow, underflow, and error domain cases. This is not the issue, it is the MKL VM feature.
Use VM EP accuracy flavor or compiler's complex multiplication when full accuracy is not required for maximum performance.

There would be a significant performance difference between C# for loop and oneMKL. However, When the problem size increased to ~10^5, vzmul started to be the faster one. We should expect MKL vzmul to be faster, especially vzmul uses all the cores by default.

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎02-07-2023

Hi Philip,

A gentle reminder:

Has the information provided helped? Could you please let us know if we could close this case on our end?

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎02-14-2023

Hi Philip,

Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Best Regards,

Shanmukh.SS