I guess we solved it.
It turns out that one of the input arrays (T_primary) was completely zero.
The C compiler made use of the fact that those zeros were raised to a certain power.
powf(T_primary,(2.0f*path_ratio))
We think this let it skip most of the expensive math.
The C++ compiler was not making use of this.
It is worth noting that the C compiler skipped that complicated math only when the powf call was multiplied by another expression.
I.e.
powf(T_primary,some algebra)
was slower (about 0.26 arbitrary time units) than
powf(T_primary,some algebra)*expf(some other algebra).
The latter took about 0.11 arbitrary time units.
For the C++ compiler there was no difference. Both took about 0.26 arbitrary time units.
Now that we have reverted to mostly nonzero data for T_primary, things have changed completely.
C++ is now faster than C, as it should be, since it makes more efficient calls to SVML:
call ___svml_powf4 (four single-precision values per call) should be faster than call ___svml_pow2 (two double-precision values per call), right?
call ___svml_expf4 should be faster than call ___svml_exp2 for the same reason, right?
Our loops take 0.40 arbitrary time units when compiled as C++ and 0.62 arbitrary time units when compiled as C source.