I was modifying ulmBLAS (BLIS) dgemm to work with min-add instead of add-mul operations (tropical matrix multiplication). The performance was about a half of MKL on Sandy Bridge, which led me to discovering that vminpd and vaddpd cannot be executed simultaneously, as opposed to vaddpd and vmulpd operations. So I switched to min-mul matrix product (subtropical) so that I can work with parallel vmulpd and vminpd.
The code is available at: https://github.com/siudej/Ricci/blob/master/cpp/kernel_BLIS_avx.c ;
Compared to normal dgemm each (v)addpd is replaced with (v)minpd. Compiler: g++-5 from Homebrew on Mac. Inline assembly in c, compiled to Python module using spicy.weave.
This worked on Sandy Bridge, and dgemm was now only a fraction slower than MKL. I was able to get the same speedup in min-mul (compared to min-add) when using SSE operations, even on Haswell (SSE kernel is also available in the same GitHub project).
However, my 2014 Macbook Pro with Haswell seems to schedule vminpd and vmulpd in series, resulting in the same performance as vminpd and vaddpd. Is this an expected behavior? Is there a way to optimize the inline assembly to get the parallel execution to work on Haswell?
Thank you for your help,
Are you sure that problem is related to those two machine instructions? Have you profiled your code with Intel VTune in order to find major hotspots, moreover detailed analysis of Front-End and Back-End stalls should have been done in order to get clear as much as it possible for code slowdown.
Modern CPU as Haswell is operating in out of order execution so you do not have any control on the instruction ordering inside the CPU front-end. Compiler usually will reorder instruction in order to partially exploit Instruction Level Parallelism , but it is entirely up to CPU scheduler to choose appropriate order.
I think that you should check if there is an interdependency between those two instructions when output from one instruction is an input to the other instruction, then in such scenario (inside CPU) decoded micro-op(s) will have to wait for their results.
I wish I knew how to do all that checking you suggest. This might be a good opportunity to try to learn something new.
I know the instructions get reordered to optimize the code, and that it is not my choice to do that. Nevertheless, interdependency would not change if I merely replace every mul with add, or every add with min, which is what I was doing. Register numbers do not change at all in the instructions, just their names. And min should be faster than mul.
As I understand you are comparing following pairs of instructions: vaddpd--vminpd with vmulpd--vminpd. I have found it very interesting that vmulpd-vminpd is faster than vaddpd-vminpd. Initially it seems counterintuitive because vmulpd executes during 5 cycles when vaddpd consumes 3 cycles only. I suppose that on Haswell there are two multiplier units devoted to vmulpd computation so the throughput is only 0.5 cycle. I think that Port1 can indeed be wired to the same unit which is performing addition and computing minimum (maybe by subtracting operands a and b) it is hard to tell without knowing internal layout.
Regarding Intel VTune profiler you should definitely give it try. You can download trial version and run Basic Hotspot Analysis initially playing attention to major code hotspot(s).
You may find interesting following the link below.
Yes, that is what I am doing. Until a week ago, I had no experience with assembly at all, so intricate relations between instructions are definitely surprising to me. I needed my simulations to run faster, so I decided to dive into inline assembly. Exciting new world.
The (v)minpd-(v)mulpd is indeed faster than vminpd-vaddpd, on AVX CPUs (pre-Haswell) and using SSE3 instructions (any CPU). It is in fact as fast as (v)addpd-(v)mulpd, so it can be as fast as MKL matrix multiply.
Weird things happen on Haswell, where the compared options suddenly run the same slower speed.
I will give VTune a try. And I will try to come up with a smaller code sample showing the differences.
Well, this is embarrassing. After isolating the assembly in a completely separate project I discovered that mul-add and mul-min combinations are equally fast. And nearly twice as fast as add-min combination. As they should be with just one FADD (on port 0) in Haswell (and all earlier CPUs) which also handles comparisons.
So a few hours later I discovered a small bug in my Python code that managed to not propagate on other machines, but it did on my Haswell computer.
Just to summarize: