- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hello,

I created some computations using SSE, AVX and FMA to test the computation time (in assembly language). The codes I made were in 32-bits as well as in 64-bits.

After a DLL with functions was created, I used Matlab to test it. The computation provides Matrix Vector multiplication in both versions - M*V and V*M. The test Matrix size was 4096x4096.

However, the results of the tests surprised me a little bit. I don't see almost any performance boost of AVX/FMA. Even the 64-programming seems to be almost the same.

Why is that?

* Codes with results are in attachment.

** Just ignore the notes, they're in my native language (result sheet note: word ZLAVA means V*M, SPRAVA means M*V).

Thanks for every explanation.

- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

You appear to be performing repeated adds to the same register, so that your performance is limited by the latency of the adds (or fma). One might expect AVX code using separate adds and multiplies to be faster than the others, provided that you are using the cache efficiently. In that regard, the early AVX CPUs, up through Ivy Bridge, were notorious for not delivering a gain for AVX over SSE unless the data were local to L1, due to the 128-bit wide path from L2 to L1. I think a 40% speedup going from SSE to AVX-256 is typical of newer CPUs with compiler vectorized code.

You may want to compare your code with a professionally built library, such as MKL, or with code generated by Intel compilers. In these cases you will expect to see riffling, where the inner loop adds alternately to various registers. With cache tiling, this could make peak performance approach the throughput rating of the instructions, at the expense of a longer reduction tree after the loop.

Register usage latency limited code like this is popular when people like to show a higher threaded parallel speedup without as much effort to overcome memory bandwidth limitations.

I haven't seen a method to persuade gnu compilers to riffle dot product reductions, so in that case your asm code should not be inferior to the compiler generated code.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Thank you for you answer Tim, I appreciate that.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

**Matlab**is a "monster" software and I wouldn't use it for benchmarking. I did a quick review of

**Results.pdf**and you need to provide more details on data set sizes. Also, performance evaluations with sizes less than

**4096**elements for a row are not too good. So, increase sizes of your data sets. It means, you need to do a

**Stress test**( not just a test ) with data sets that fit

**All**the RAM ( but without using Virtual memory ). >>...You may want to compare your code with a professionally built library, such as MKL, or with code generated by Intel compilers... On a

**KNL**system a

**Classic Matrix Multiplication**algorithm ( sources compiled with ICC v16 ) outperforms

**CBLAS sgemm**of

**Intel MKL**for matrix sizes less than 2048x2048. Take a look at

**Performance of Classic Matrix Multiplication algorithm on a KNL system**article: https://software.intel.com/en-us/articles/performance-of-classic-matrix-multiplication-algorithm-on-...

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi Sergey.

thank you for your reply! Well, I didn't know that Matlab is "bad" for the testing, but it's part of my thesis so I have no other choice. Fortunately, I can see a big difference in terms of computation time, and even without optimizing the code according to Tim's suggestion. I changed the Matrix to be 32768x32768 and the results are now almost twice as good for AVX/FMA than for SSE.

Thank you once again for very helpful reply!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

@Tim @Sergey

Guys, do you think it's a bad idea to repeat the computation in more cycles and then compute an average time? I mean, for example, to create a for loop of, let's say,100 repeats of a function call and then divide the overall computation time by 100. What do you think?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

The strategy for timing an application (or function/loop therein) depends entirely on the application. If the application is only going to have 1 region and enter it only once, then maybe a handful of test runs would be appropriate. These tests to be run from a batch .bat or script .sh so as to include application load time as well as invalidation of initial process virtual memory.

However, if on the other hand the application has a main loop that is run many times, and this loop contains the outermost parallel region, then for code optimization tuning purposes, you'd place an iteration loop around such code, and then produce a list of runtimes per iteration. The number of iterations you make may depend on what else your system is doing during your test runs. When it is relatively idle, then maybe 3 iterations are sufficient, if system is periodically busy, then you may need to take more samples. While you can take the average of the runs, it may be better to look at:

a) first iteration (this includes thread pool instantiation, first touch of virtual memory, cache initialization)

b) fastest iteration (likely the time expected without interference by other processes/services running on the system)

c) average time excluding first iteration

You should also average the first iterations of several runs. These will be your baseline performance.

As you make changes to your code, you can then compare the new code against your baseline.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page