SSE/AVX/FMA Unexpected Test Results

Samuel_Š_ · ‎05-10-2017

Hello,

I created some computations using SSE, AVX and FMA to test the computation time (in assembly language). The codes I made were in 32-bits as well as in 64-bits.

After a DLL with functions was created, I used Matlab to test it. The computation provides Matrix Vector multiplication in both versions - M*V and V*M. The test Matrix size was 4096x4096.

However, the results of the tests surprised me a little bit. I don't see almost any performance boost of AVX/FMA. Even the 64-programming seems to be almost the same.

Why is that?

* Codes with results are in attachment.
** Just ignore the notes, they're in my native language (result sheet note: word ZLAVA means V*M, SPRAVA means M*V).

Thanks for every explanation.

TimP · ‎05-10-2017

You appear to be performing repeated adds to the same register, so that your performance is limited by the latency of the adds (or fma). One might expect AVX code using separate adds and multiplies to be faster than the others, provided that you are using the cache efficiently. In that regard, the early AVX CPUs, up through Ivy Bridge, were notorious for not delivering a gain for AVX over SSE unless the data were local to L1, due to the 128-bit wide path from L2 to L1. I think a 40% speedup going from SSE to AVX-256 is typical of newer CPUs with compiler vectorized code.

You may want to compare your code with a professionally built library, such as MKL, or with code generated by Intel compilers. In these cases you will expect to see riffling, where the inner loop adds alternately to various registers. With cache tiling, this could make peak performance approach the throughput rating of the instructions, at the expense of a longer reduction tree after the loop.

Register usage latency limited code like this is popular when people like to show a higher threaded parallel speedup without as much effort to overcome memory bandwidth limitations.

I haven't seen a method to persuade gnu compilers to riffle dot product reductions, so in that case your asm code should not be inferior to the compiler generated code.

Samuel_Š_ · ‎05-10-2017

Thank you for you answer Tim, I appreciate that.

SergeyKostrov · ‎05-12-2017

>>...After a DLL with functions was created, I used Matlab to test it. Matlab is a "monster" software and I wouldn't use it for benchmarking. I did a quick review of Results.pdf and you need to provide more details on data set sizes. Also, performance evaluations with sizes less than 4096 elements for a row are not too good. So, increase sizes of your data sets. It means, you need to do a Stress test ( not just a test ) with data sets that fit All the RAM ( but without using Virtual memory ). >>...You may want to compare your code with a professionally built library, such as MKL, or with code generated by Intel compilers... On a KNL system a Classic Matrix Multiplication algorithm ( sources compiled with ICC v16 ) outperforms CBLAS sgemm of Intel MKL for matrix sizes less than 2048x2048. Take a look at Performance of Classic Matrix Multiplication algorithm on a KNL system article: https://software.intel.com/en-us/articles/performance-of-classic-matrix-multiplication-algorithm-on-intel-xeon-phi-processor-system

TimP · ‎05-12-2017

Sergei makes an interesting point in that the massive unrolling and block transposition tactics in mkl for knl require a fairly large matrix to become competitive. At the time I was working at intel on knc the Mkl team responded on the weak performance for the case of several moderate sized problems running in parallel. For host "big" cpus mkl has always been advertised to be efficient for any case large enough to take advantage of multiple threads.

SergeyKostrov · ‎05-12-2017

>>For host "big" cpus mkl has always been advertised to be efficient for any case large enough to take advantage of multiple threads. MKL functions for Matrix Multiplication ( MM ) have some initial overheads and that is why for sizes less than 2048x2048 a trivial Classic MM outperforms MKL's cblas_sgemm. This is because Classic MM starts processing with almost no delays ( no any CPU and ISA verifications, memory transformations, etc, except for OpenMP threads initialization ).

Samuel_Š_ · ‎05-13-2017

Hi Sergey.

thank you for your reply! Well, I didn't know that Matlab is "bad" for the testing, but it's part of my thesis so I have no other choice. Fortunately, I can see a big difference in terms of computation time, and even without optimizing the code according to Tim's suggestion. I changed the Matrix to be 32768x32768 and the results are now almost twice as good for AVX/FMA than for SSE.

Thank you once again for very helpful reply!

Samuel_Š_ · ‎05-14-2017

@Tim @Sergey

Guys, do you think it's a bad idea to repeat the computation in more cycles and then compute an average time? I mean, for example, to create a for loop of, let's say,100 repeats of a function call and then divide the overall computation time by 100. What do you think?

TimP · ‎05-14-2017

Matlab may invoke mkl in which case it should perform nearly as well as when called from compiled code.

SergeyKostrov · ‎05-15-2017

>>...do you think it's a bad idea to repeat the computation in more cycles and then compute an average time? I mean, for example, >>to create a for loop of, let's say,100 repeats of a function call and then divide the overall computation time by 100. What do you think? This is a right way to obtain more consistent performance results and take into account that 1st iteration could be considered as a warm up. >>...100 repeats of a function call... In general, a number of repeats should be dependent on a data set size. That is, for small data sets it could be Ns, and for large data sets it could be Nl, where Ns > Nl. I also try to boost a priority of processing to Above Normal or High from Normal ( a boost to Real Time priority is Not recommended ).

jimdempseyatthecove · ‎05-17-2017

The strategy for timing an application (or function/loop therein) depends entirely on the application. If the application is only going to have 1 region and enter it only once, then maybe a handful of test runs would be appropriate. These tests to be run from a batch .bat or script .sh so as to include application load time as well as invalidation of initial process virtual memory.

However, if on the other hand the application has a main loop that is run many times, and this loop contains the outermost parallel region, then for code optimization tuning purposes, you'd place an iteration loop around such code, and then produce a list of runtimes per iteration. The number of iterations you make may depend on what else your system is doing during your test runs. When it is relatively idle, then maybe 3 iterations are sufficient, if system is periodically busy, then you may need to take more samples. While you can take the average of the runs, it may be better to look at:

a) first iteration (this includes thread pool instantiation, first touch of virtual memory, cache initialization)
b) fastest iteration (likely the time expected without interference by other processes/services running on the system)
c) average time excluding first iteration

You should also average the first iterations of several runs. These will be your baseline performance.

As you make changes to your code, you can then compare the new code against your baseline.

Jim Dempsey

SergeyKostrov · ‎05-17-2017

>>...When it is relatively idle, then maybe 3 iterations are sufficient... For small data sets that fit into L3 cache I usually use from 16 to 32 iterations to get consistent results and to reduce non-deterministic effects of OS task scheduler.