
I am new to MKL.

I want to know whether MKL can accelerate the addition of large matrices.

It takes 20 ms now, but my application needs it to finish in 2 ms!

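short* a = new short[4000*3000];
short* b = new short[4000*3000];
short* c = new short[4000*3000];
clock_t cstart, cend;
double spend;
cstart = clock();
for (int i = 0; i < 1000; ++i) {            // repeat the addition 1000 times
    for (int x = 0; x < 4000*3000; ++x) {
        short value = a[x] + b[x];          // presumably the element-wise sum; the subscripts are missing from the post
        c[x] = value;
    }
}
cend = clock();
// total time in ms divided by 1000 passes = ms per pass
spend = ((double)(cend - cstart)) / (double)CLOCKS_PER_SEC * 1000 / 1000;
printf("spend: %f (ms)\n", spend);
return 0;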

Thanks!

superZZ


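Your loops perform 4000 × 3000 × 1000 = 12 × 10^9 additions. At one addition per cycle, that is 12 × 10^9 CPU cycles to be completed in 2 ms, which would need a CPU to run at 6 THz.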

Of course any decent compiler could note that your outer loop performs an invariant calculation and that it would be sufficient to go through it only once. In that case, you would only need 6 GHz, and splitting the work amongst a number of cores could make a single loop feasible.

Silly examples can lead to silly conclusions. Try to figure out why your computer is so fast and what it is doing that it can complete the calculation in 20 ms, assuming that it is not running at 600 GHz.


Sorry, I was not clear.

I ran the addition 1000 times. Each pass took 20 ms, so 1000 passes took 20 ms × 1000 = 20 s.

Thanks

superZZ


Hi,

I don't think MKL is the right way to go in this case. You should rely on your own optimizations.

So, several optimization techniques could be used. They are as follows:

- Unroll loops (+)

- SSE (+++)

- OpenMP (+++)

- Unroll loops + SSE (++++)

- Unroll loops + SSE + OpenMP (+++++)

Complexity is marked with '+' signs:

- '+' simple

- '+++++' more complex

It is not clear to me what your platform is, and the final decision depends on it.

For example, I don't rely on SSE or OpenMP for any embedded platforms.

Also, I wouldn't depend on the compiler's optimization. If an algorithm is not optimized enough, for example 1,000 lines of code instead of 250, the compiler's optimization won't improve performance significantly.

Let me know if you're interested in more details. I have similar requirements for linear algebra algorithms on my project.

Best regards,

Sergey


Just a note: the data type is short, whereas MKL routines mainly address floating-point types (real, complex, etc.), so MKL is not the right tool here.

As Sergey suggested, you can:

1) optimize manually

2) let the Intel compiler optimize the loop, for example through vectorization

3) or try IPP, for example ippiAdd_16u/16s_
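For option 2, the loop can be written so that a vectorizing compiler has an easy job. The following is a minimal sketch under some assumptions: `add_short` is a hypothetical helper (not an MKL or IPP routine), and `__restrict` is a compiler extension accepted by GCC, Clang, MSVC and the Intel compiler rather than standard C++.

```cpp
#include <cstddef>

// Hypothetical helper: element-wise c[x] = a[x] + b[x].
// __restrict promises the compiler that the three arrays do not overlap,
// which removes one common obstacle to auto-vectorization.
void add_short(const short* __restrict a, const short* __restrict b,
               short* __restrict c, std::size_t n) {
    for (std::size_t x = 0; x < n; ++x) {
        c[x] = static_cast<short>(a[x] + b[x]);
    }
}
```

Vectorization only happens in an optimized build. With GCC, for instance, `-O3 -fopt-info-vec` reports which loops were vectorized, so you can confirm the compiler did what you hoped.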

Best Regards.

Ying
