Large difference in speed between C and C++ code when vectorizing loops with transcendental functions

ploeger__lennert · ‎04-20-2012

My worker code is a loop with a number of computations, including transcendental functions, and it is vectorized (using #pragma ivdep).

I use the Intel Composer XE 2011 SP1 within MS Visual Studio 2010 Ultimate.
It is essentially C code, since it uses nothing C++ specific.
If I change the file name extension from .cpp to .c it becomes twice as fast.

I checked the asm code and it turns out that the fast asm code uses
call ___svml_pow2 instead of call ___svml_powf4
and
call ___svml_exp2 instead of call ___svml_expf4
Could that explain the speed difference?
Why does it call different svml modules when the file name extension is changed from .cpp to .c?

TimP · ‎04-20-2012

You have changed your source code from float to double data types. Maybe you have some conditional compilation there to make the switch according to language. In the C code you would presumably have called powf() and expf() explicitly or used tgmath.h; for C++, that could bring about this change.

JenniferJ · ‎04-20-2012

Please attach the relevant code snippets so we can try to duplicate the issue. If you have a small test case, that would be great.

Also, which compiler options did you use?

Thanks,
Jennifer

ploeger__lennert · ‎04-21-2012

I did not change the source code, just the file name extension.
I also did not change the includes.
I am not aware of any conditional compilation.
I do have powf and expf in my code.
I was just using #include
not tgmath.h.

ploeger__lennert · ‎04-23-2012

I have also checked the preprocessor output, i.e. the .i file.
The fast version has
(float)pow((double)(T_primary), (double)((2.0f*path_ratio)))
The slow version has
powf(T_primary,(2.0f*path_ratio))

ploeger__lennert · ‎04-23-2012

Changing expf to exp and powf to pow in the source code makes the .c version slightly slower, but seems to have no effect on the speed of the .cpp version.
The .c version still outperforms the .cpp version, but by somewhat less than a factor of 2.

ploeger__lennert · ‎04-24-2012

I guess we solved it.
It turns out that one of the input arrays (T_primary) was completely zero.
The C compiler made use of the fact that those zeros were raised to a certain power.
powf(T_primary,(2.0f*path_ratio))
In this way, it could skip most of the difficult math, we think.

The C++ compiler was not making use of this.

It is worth noting that the C compiler skipped that complicated math only if the call was next to another expression.
I.e.
powf(T_primary,some algebra)
was slower (about 0.26 arbitrary time units) than
powf(T_primary,some algebra)*expf(some other algebra).
The latter took about 0.11 arbitrary time units.
For the C++ compiler there was no difference. Both took about 0.26 arbitrary time units.

Now that we reverted to mostly nonzero data for T_primary, things have changed completely.
C++ is faster than C, as it should be, since it makes more efficient calls to svml:

call ___svml_powf4 should be faster than call ___svml_pow2, right?
call ___svml_expf4 should be faster than call ___svml_exp2, right?
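
A rough way to reason about this, assuming the suffixes mean what they appear to mean (f4 for four floats per call, 2 for two doubles per call) and that the code targets 128-bit SSE registers: each float call handles twice as many elements as each double call. The macro SSE_REGISTER_BYTES and the helper lanes_per_vector below are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SSE_REGISTER_BYTES 16  /* assumed 128-bit vector register */

/* Number of elements of the given size that fit in one vector register. */
size_t lanes_per_vector(size_t element_size)
{
    assert(element_size > 0 && SSE_REGISTER_BYTES % element_size == 0);
    return SSE_REGISTER_BYTES / element_size;
}
```

With 4-byte floats and 8-byte doubles this gives 4 and 2 lanes respectively, so a loop over n elements needs about n/4 float calls versus n/2 double calls. The cost per call may also differ, which this sketch does not measure.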

Our loops take 0.40 arbitrary time units when compiled as C++ and 0.62 arbitrary time units when compiled as C source.

Georg_Z_Intel · ‎04-26-2012

Hello,

I've tried to reproduce this problem but I'm not seeing those effects. It's an interesting observation that we'd like to understand.
Hence, like Jennifer, I'd like to ask whether it would be possible to condense the problem into a simple Visual Studio project so we can take a closer look.

In my opinion it's unlikely that you would see such differences from changing only the suffix. It's true that it makes the compiler switch to a different interpretation of the language standard, but that does not explain what you're seeing.
Also, I don't think the values of the array can have an impact, provided they are not known at compile time. If they were, the compiler would not need to make the calls at all; it would calculate the values at compile time.

Thank you in advance & best regards,

Georg Zitzlsberger
