vminpd and vmulpd do run concurrently on Haswell and earlier CPUs

Bartek_S_ · ‎04-07-2016

Update:

The problem was in a totally different place, but would only manifest itself on one of my computers.
Nevertheless, it is interesting that matrix-matrix multiplication which uses minimum as reduction operation and plus as pointwise operation cannot be implemented as efficiently as normal plus-reduction times-pointwise product. On the other hand, min-reduction and mul-pointwise can be as fast as the normal matrix product (on CPUs without FMAs). This observation could be used to speed up some graph algorithms, e.g. Floyd-Warshall.
The second point applies to both SSE and AVX operations.

I was modifying ulmBLAS (BLIS) dgemm to work with min-add instead of add-mul operations (tropical matrix multiplication). The performance was about a half of MKL on Sandy Bridge, which led me to discovering that vminpd and vaddpd cannot be executed simultaneously, as opposed to vaddpd and vmulpd operations. So I switched to min-mul matrix product (subtropical) so that I can work with parallel vmulpd and vminpd.

The code is available at: https://github.com/siudej/Ricci/blob/master/cpp/kernel_BLIS_avx.c ;

Compared to normal dgemm each (v)addpd is replaced with (v)minpd. Compiler: g++-5 from Homebrew on Mac. Inline assembly in c, compiled to Python module using spicy.weave.

This worked on Sandy Bridge, and dgemm was now only a fraction slower than MKL. I was able to get the same speedup in min-mul (compared to min-add) when using SSE operations, even on Haswell (SSE kernel is also available in the same GitHub project).

However, my 2014 Macbook Pro with Haswell seems to schedule vminpd and vmulpd in series, resulting in the same performance as vminpd and vaddpd. Is this an expected behavior? Is there a way to optimize the inline assembly to get the parallel execution to work on Haswell?

Thank you for your help,

Bartek

Bernard · ‎04-08-2016

Are you sure that problem is related to those two machine instructions? Have you profiled your code with Intel VTune in order to find major hotspots, moreover detailed analysis of Front-End and Back-End stalls should have been done in order to get clear as much as it possible for code slowdown.

Modern CPU as Haswell is operating in out of order execution so you do not have any control on the instruction ordering inside the CPU front-end. Compiler usually will reorder instruction in order to partially exploit Instruction Level Parallelism , but it is entirely up to CPU scheduler to choose appropriate order.

I think that you should check if there is an interdependency between those two instructions when output from one instruction is an input to the other instruction, then in such scenario (inside CPU) decoded micro-op(s) will have to wait for their results.

Bartek_S_ · ‎04-08-2016

Dear Iliya,

I wish I knew how to do all that checking you suggest. This might be a good opportunity to try to learn something new.

I know the instructions get reordered to optimize the code, and that it is not my choice to do that. Nevertheless, interdependency would not change if I merely replace every mul with add, or every add with min, which is what I was doing. Register numbers do not change at all in the instructions, just their names. And min should be faster than mul.

On AVX CPU the code with vaddpd and vminpd is half as fast as the same code with add replaced with mul. Furthermore, the same code with add-mul operations is as fast as the one with just min-mul operations. Most likely because min and add run on the same p1 port.

On AVX2 CPU there is FMA which I cannot use with minimum operation. However I would expect the code with vmulpd and vminpd to be faster then the one with vaddpd and vminpd. Instead, the speed is the same, and slightly slower than much older (generally slower) AVX machine. The speed is also the same as the SSE encoded kernel with mul and min, which is twice as fast as add and min SSE kernel. Again confirming that add-min is generally a slow combination.

On the same AVX2 machine the regular add-mul kernel runs 4 times faster than min-mul kernel, even without using FMAs. That is really surprising, giving that on AVX CPU these combinations were equally fast.

I wish I could explain this better.

Bartek

Bernard · ‎04-08-2016

As I understand you are comparing following pairs of instructions: vaddpd--vminpd with vmulpd--vminpd. I have found it very interesting that vmulpd-vminpd is faster than vaddpd-vminpd. Initially it seems counterintuitive because vmulpd executes during 5 cycles when vaddpd consumes 3 cycles only. I suppose that on Haswell there are two multiplier units devoted to vmulpd computation so the throughput is only 0.5 cycle. I think that Port1 can indeed be wired to the same unit which is performing addition and computing minimum (maybe by subtracting operands a and b) it is hard to tell without knowing internal layout.

Regarding Intel VTune profiler you should definitely give it try. You can download trial version and run Basic Hotspot Analysis initially playing attention to major code hotspot(s).

You may find interesting following the link below.

https://software.intel.com/en-us/articles/processor-specific-performance-analysis-papers

Bartek_S_ · ‎04-08-2016

Yes, that is what I am doing. Until a week ago, I had no experience with assembly at all, so intricate relations between instructions are definitely surprising to me. I needed my simulations to run faster, so I decided to dive into inline assembly. Exciting new world.

The (v)minpd-(v)mulpd is indeed faster than vminpd-vaddpd, on AVX CPUs (pre-Haswell) and using SSE3 instructions (any CPU). It is in fact as fast as (v)addpd-(v)mulpd, so it can be as fast as MKL matrix multiply.

Weird things happen on Haswell, where the compared options suddenly run the same slower speed.

I will give VTune a try. And I will try to come up with a smaller code sample showing the differences.

Bartek_S_ · ‎04-08-2016

Well, this is embarrassing. After isolating the assembly in a completely separate project I discovered that mul-add and mul-min combinations are equally fast. And nearly twice as fast as add-min combination. As they should be with just one FADD (on port 0) in Haswell (and all earlier CPUs) which also handles comparisons.

So a few hours later I discovered a small bug in my Python code that managed to not propagate on other machines, but it did on my Haswell computer.

Just to summarize:

Tropical matrix product (min-add) is slow compared to subtropical (min-mul). This means that e.g. shortest paths in a graph can be found faster if weights are first exponentiated and later multiplied instead of added in the Floyd-Warshall algorithm (cf. https://en.wikipedia.org/wiki/Min-plus_matrix_multiplication). ;
There is nothing wrong with Haswell. I have edited the title to not cause any more confusion.