Here is a simple question. I'm new to IPP and I'm trying to understand how to use it for solving the following problem:
A += B*C + D*E + F*G + ...
A, B, C, D, E, F, G, ... are all matrices of the same size, * represents standard matrix multiply. The sizes of the matrices are small, typically between 3x3 and 35x35.
IPP provides a routine - I'm looking at the ippmMul_mama_64f function - that would operate on two source arrays of matrices, in our case: [B, D, F] and [C, E, G], producing, as far as I understood, three output matrices A1, A2, and A3, storing the results of B*C, D*E, and F*G, respectively. Now I have two correlated questions:
- In my problem, there's a single output matrix A. Is there a function in IPP, or a safe way of using ippmMul_mama_64f, such that the results are *accumulated* in a single output matrix A, rather than in three different matrices A1, A2, and A3?
- If this is not possible, how do I best combine the three temporaries A1, A2, and A3?
Ah, incidentally: is there any document I can look at that compares the performance of IPP MX to hand-crafted implementations? I've done a bit of research and I couldn't find much.
Unfortunately, there are no special functions for multiplication with accumulation.
To accumulate the results you can use ippmAdd_mama_32f, adding pairs of temporary matrices.
Your task is to calculate A += B*C + D*E + F*G + H*I + ...
step0. Call ippmMul_mama_32f to get N temporaries T0_0, T1_0, T2_0, ... where T0_0 = B*C, T1_0 = D*E, T2_0 = F*G
step1. Call ippmAdd_mama_32f to get N/2 temporaries T0_1, T1_1, ... where T0_1 = T0_0 + T1_0, T1_1 = T2_0 + T3_0
step2. Call ippmAdd_mama_32f to get N/4 temporaries T0_2, T1_2, ... where T0_2 = T0_1 + T1_1, T1_2 = T2_1 + T3_1, and so on until a single matrix remains, which is then added to A.
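The pairwise reduction in the steps above can be sketched in plain C. This is only an illustration of the reduction order, not IPP code: `mat_add` and `tree_reduce` are hypothetical helpers standing in for the array-add call, and the sketch assumes the temporaries are stored back to back as contiguous floats. It also handles an odd number of temporaries by carrying the leftover matrix to the next level, which the steps above do not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Element-wise dst = a + b for one matrix of `elems` floats.
   Hypothetical plain-C stand-in for one matrix of an array add; dst may alias a. */
static void mat_add(float *dst, const float *a, const float *b, size_t elems)
{
    for (size_t i = 0; i < elems; ++i)
        dst[i] = a[i] + b[i];
}

/* Pairwise (tree) reduction over `count` temporaries stored back to back in t,
   each of `elems` floats. The buffer is overwritten; on return the sum of all
   temporaries is in t[0 .. elems-1]. */
void tree_reduce(float *t, size_t count, size_t elems)
{
    assert(count > 0);
    while (count > 1) {
        size_t half = count / 2;
        /* Level step: slot k receives the sum of slots 2k and 2k+1. */
        for (size_t k = 0; k < half; ++k)
            mat_add(t + k * elems,
                    t + (2 * k) * elems,
                    t + (2 * k + 1) * elems,
                    elems);
        /* Odd count: carry the unpaired last matrix to the next level. */
        if (count % 2 != 0) {
            memcpy(t + half * elems, t + (count - 1) * elems,
                   elems * sizeof(float));
            ++half;
        }
        count = half;
    }
}
```

After `tree_reduce` returns, one more element-wise addition of the first matrix in the buffer into A completes the `A += ...` update.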
Both ippmMul_mama_32f and ippmAdd_mama_32f have SSSE3 optimizations and work fast enough for sizes up to 6x6. If you develop hand-written code, you can remove the overhead of the function calls and of writing the temporary matrices to memory.
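As a rough illustration of what such hand-written code could look like, here is a minimal plain-C sketch that accumulates every product directly into A, so no temporary matrices are written. The function name `accumulate_products`, the row-major back-to-back layout, and the float element type are assumptions made for the example; it is unoptimized scalar code, not a tuned kernel.

```c
#include <assert.h>
#include <stddef.h>

/* A += sum over k of B[k] * C[k].
   Assumed layout: every matrix is n x n, row-major, and the `count` matrices
   of b (and of c) are stored back to back. a must not overlap b or c. */
void accumulate_products(float *a, const float *b, const float *c,
                         size_t count, size_t n)
{
    assert(n > 0);
    for (size_t k = 0; k < count; ++k) {
        const float *bk = b + k * n * n;
        const float *ck = c + k * n * n;
        for (size_t i = 0; i < n; ++i) {
            for (size_t p = 0; p < n; ++p) {
                /* i-p-j loop order keeps the inner loop contiguous in a and ck. */
                const float bip = bk[i * n + p];
                for (size_t j = 0; j < n; ++j)
                    a[i * n + j] += bip * ck[p * n + j];
            }
        }
    }
}
```

Whether this beats the library calls depends on the compiler, the matrix size, and the target CPU, so it is worth measuring both.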
Our internal tests show a 1.18x speedup for ippmAdd_mama_32f and a 2.44x speedup for ippmMul_mama_32f on an SNB (Sandy Bridge) CPU for 5x5 matrices. But you can get performance result if select additional performance test application during process of installation of IPP at your system.
Thank you for your interest in IPP.
Thanks for this, really appreciated.
I agree with you that I'll have to check whether the tree-like ipp-based sum is more efficient than hand-written code. In any case, the cost of the sum *should* be less than that of the actual matrix multiplies, although I'm not entirely sure given the small size of the involved operands.
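To put rough numbers on that intuition, here is a small C sketch that counts floating-point operations only. The helper names are made up for the example, and the count deliberately ignores call overhead and memory traffic, which may well dominate for operands this small.

```c
#include <assert.h>

/* Flops for one n x n times n x n product:
   n*n output elements, each needing n multiplies and n-1 adds. */
unsigned long mul_flops(unsigned long n)
{
    assert(n > 0);
    return n * n * (2 * n - 1);
}

/* Flops for one element-wise n x n addition. */
unsigned long add_flops(unsigned long n)
{
    return n * n;
}

/* Accumulating K products into A takes K additions in total (K-1 to combine
   the temporaries, 1 to add into A), so the add/multiply flop ratio does not
   depend on K and simplifies to 1 / (2n - 1). */
double add_to_mul_ratio(unsigned long n)
{
    return (double)add_flops(n) / (double)mul_flops(n);
}
```

By this count the additions cost about 20% of the multiplies for 3x3 matrices (9 versus 45 flops per product) and about 1.4% for 35x35 (1225 versus 84525).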
A few more straight questions:
- Do you have exhaustive plots/tables comparing the speedup (or slowdown) of IPP over hand-written code for a range of matrix sizes? I think these would be useful for the IPP community.
- Can you further elaborate your sentence "But you can get performance result if select additional performance test application during process of installation of IPP at your system."?
- What should I expect, performance-wise, for non-square matrices, say 14x10 or 8x6?
You may say I could answer these questions by trying IPP myself; the problem is that I'm thinking about writing a code generator that outputs IPP calls, which is not trivial in my case, and I'd like to understand whether IPP is really the answer to my requirements, although I believe I'll never know until I try it :)
If you are thinking about a code generator based on ippMX calls, I think it's not a good idea: the ippMX domain has been in sustaining mode for a long time and is not being developed or optimized for new IA. For a code generator, I think it's better to take a look at the Spiral tool: http://www.spiralgen.com/whatwedo/spiraltool/