Intel® oneAPI Math Kernel Library

Beginner
165 Views
Hi all,

I am new to MKL.
I want to know whether MKL can accelerate a large matrix addition.
It takes 20 ms now, but my application needs it to finish in 2 ms!

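short* a = new short[4000*3000]();  // value-initialized so the reads below are defined
short* b = new short[4000*3000]();
short* c = new short[4000*3000];

clock_t cstart, cend;
double spend;
cstart = clock();

// Repeat the whole pass 1000 times so the time per pass can be averaged.
for (int i = 0; i < 1000; ++i) {
    for (int x = 0; x < 4000*3000; ++x) {
        short value = a[x] * 3000 + b[x];
        c[x] = value;
    }
}

cend = clock();
// Total time in ms divided by the 1000 passes: average ms per pass.
spend = ((double)(cend - cstart)) / (double)CLOCKS_PER_SEC * 1000 / 1000;
printf("spend: %f (ms)\n", spend);

return 0;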

Thanks!

superZZ
5 Replies
Black Belt
Your "need" bumps against some hard limits. Let us assume that your code is fully vectorized and that each element is loaded, its new value computed, and the result stored in one CPU cycle (in reality, it will be slower). To meet your need, 12 × 10^9 CPU cycles would have to complete in 2 ms, which would require a CPU running at 6 THz.

Of course any decent compiler could note that your outer loop performs an invariant calculation and that it would be sufficient to go through it only once. In that case, you would only need 6 GHz, and splitting the work amongst a number of cores could make a single loop feasible.

Silly examples can lead to silly conclusions. Try to figure out why your computer is so fast and what it is doing that it can complete the calculation in 20 ms, assuming that it is not running at 600 GHz.
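
The clock rates quoted in this reply follow from dividing the number of element updates by the time budget, under the stated assumption of one element per cycle. A worked version, where the symbols N (elements per pass), R (passes), t (time budget) and f (required clock rate) are introduced here only for illustration:

```latex
\begin{align*}
N &= 4000 \times 3000 = 12 \times 10^{6}, \qquad R = 1000, \qquad f = \frac{\text{element updates}}{t} \\
\text{all passes in 2 ms:}\quad f &= \frac{N R}{t} = \frac{12 \times 10^{9}}{2 \times 10^{-3}\,\mathrm{s}} = 6 \times 10^{12}\,\mathrm{Hz} = 6\,\mathrm{THz} \\
\text{one pass in 2 ms:}\quad f &= \frac{N}{t} = \frac{12 \times 10^{6}}{2 \times 10^{-3}\,\mathrm{s}} = 6 \times 10^{9}\,\mathrm{Hz} = 6\,\mathrm{GHz} \\
\text{all passes in 20 ms:}\quad f &= \frac{N R}{t} = \frac{12 \times 10^{9}}{20 \times 10^{-3}\,\mathrm{s}} = 6 \times 10^{11}\,\mathrm{Hz} = 600\,\mathrm{GHz}
\end{align*}
```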
Beginner
Dear mecej4,

Sorry, I did not express myself clearly.

I ran it 1000 times. Each run took 20 ms, so 1000 runs took 20 ms × 1000 = 20 s.

Thanks

superZZ
Moderator
Even for the 2 ms case you expect, MKL VML cannot help you.
To meet your need, this calculation would have to run at < 0.2 CPE (clocks per element).
Please see the link to the performance and accuracy data for the VML Mul function. The best performance for single precision is 0.57 CPE.
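
The reply does not say which clock frequency the CPE budget assumes, so the following is only a sketch of how such a budget could be estimated. The helper name `cycles_per_element_budget` and its parameters are hypothetical and are not part of MKL.

```cpp
#include <cstddef>

// Hypothetical helper (not an MKL function): the number of clock cycles
// available per element when n elements must be processed within time_s
// seconds on a core running at clock_hz.
double cycles_per_element_budget(double clock_hz, double time_s, std::size_t n) {
    return clock_hz * time_s / static_cast<double>(n);
}
```

For example, assuming a 3 GHz core, `cycles_per_element_budget(3.0e9, 2.0e-3, 4000u * 3000u)` gives a budget of 0.5 cycles per element, which is already less than the 0.57 CPE quoted for single-precision VML Mul.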
Valued Contributor II

Hi,

I don't think that MKL is the right way to go in this case. You should rely on your own optimizations.

Several optimization techniques could be used. They are as follows:

- Unroll Loops +
- SSE +++
- OpenMP +++
- Unroll Loops + SSE ++++
- Unroll Loops + SSE + OpenMP +++++

Complexity is marked by '+' signs:

- '+' simple
- '+++++' more complex
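
As a rough illustration of the "SSE + OpenMP" combination from the list, here is a minimal sketch. It assumes an x86 CPU with SSE2 and a compiler with OpenMP enabled; without OpenMP the pragma is simply ignored and the loop runs on one thread. The function name `scale_add_i16` and the parameter `k` are hypothetical. It mirrors the arithmetic in the question (a * 3000 + b on 16-bit values), with results wrapping to 16 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: c[i] = a[i] * k + b[i] on 16-bit integers,
// keeping only the low 16 bits of each result (wraparound).
void scale_add_i16(const std::int16_t* a, const std::int16_t* b,
                   std::int16_t* c, std::ptrdiff_t n, std::int16_t k) {
    const __m128i vk = _mm_set1_epi16(k);        // k broadcast to 8 lanes
    const std::ptrdiff_t vec_end = n - (n % 8);  // largest multiple of 8 <= n

    // Each iteration handles 8 elements; OpenMP splits iterations over threads.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < vec_end; i += 8) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        __m128i vc = _mm_add_epi16(_mm_mullo_epi16(va, vk), vb);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(c + i), vc);
    }

    // Scalar tail for the last n % 8 elements.
    for (std::ptrdiff_t i = vec_end; i < n; ++i) {
        c[i] = static_cast<std::int16_t>(a[i] * k + b[i]);
    }
}
```

Manual loop unrolling is not shown here. Whether it helps on top of the vector loop depends on the compiler and the platform, so it would need to be measured.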

It is not clear to me what your platform is, and the final decision depends on it.

For example, I don't rely on SSE or OpenMP for any embedded platforms.

Also, I wouldn't try to depend on a compiler's optimization. If an algorithm is NOT optimized enough,
for example 1,000 code lines instead of 250, a compiler's optimization won't improve performance significantly.

Let me know if you're interested in more details. I have similar requirements for
linear algebra algorithms on my project.

Best regards,
Sergey

Employee

Just note that the data type is short, whereas MKL routines mainly address floating-point types (real, complex, etc.). So MKL is not the right way.
As Sergey suggested:
1) optimize manually
2) let the Intel compiler optimize it, for example through vectorization
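
For option 2, a loop written like the one below is usually a good candidate for compiler auto-vectorization when optimization is enabled. This is only a sketch: the function name `add_i16` is hypothetical, and `__restrict` is a compiler extension (accepted by GCC, Clang, MSVC and the Intel compilers) rather than standard C++. It is used here to promise that the three arrays do not overlap.

```cpp
#include <cstddef>

// Hypothetical helper: element-wise c[i] = a[i] + b[i] on 16-bit integers.
// The simple counted loop with non-overlapping arrays leaves the compiler
// free to turn it into vector instructions.
void add_i16(const short* __restrict a, const short* __restrict b,
             short* __restrict c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = static_cast<short>(a[i] + b[i]);
    }
}
```

Whether the loop is actually vectorized is best confirmed with the compiler's own optimization report or by timing it.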