As it says in the title. ICC 17 generates:
```asm
my_copysign_1(float):
        movss   xmm1, DWORD PTR .L_2il0floatpacket.1[rip]       #4.12
        movss   xmm2, DWORD PTR .L_2il0floatpacket.0[rip]       #4.12
        andps   xmm0, xmm2                                      #4.12
        andnps  xmm2, xmm1                                      #4.12
        orps    xmm0, xmm2                                      #4.12
        ret                                                     #4.12
.L_2il0floatpacket.0:
        .long   0x80000000
.L_2il0floatpacket.1:
        .long   0x3f800000
```
A much better result is generated by e.g. GCC:
```asm
my_copysign_1(float):
        andps   xmm0, XMMWORD PTR .LC1[rip]
        orps    xmm0, XMMWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1065353216
        .long   0
        .long   0
        .long   0
.LC1:
        .long   2147483648
        .long   0
        .long   0
        .long   0
```
Even if you don't like the extra space used (and you should, because my profiles show it's faster), the actual operations (`andps`, `andnps`, and `orps`) can still be reduced by one instruction.
- Tags:
- C/C++
- Development Tools
- Intel® C++ Compiler
- Intel® Parallel Studio XE
- Intel® System Studio
- Optimization
- Parallel Computing
- Vectorization
Hi Ian,
Can you provide us a test case to take a look at?
Thanks,
Viet Hoang
Er, the testcase would be literally:
```cpp
#include <cmath>

float my_copysign_1(float x) {
    return std::copysignf(1.0f, x);
}
```
See it live here.
I saw the asm you mentioned; however, the execution times are the same between the two compilers. If you can show that ICC is actually slower than GCC, I can submit a performance bug to our developers.
Thanks,
Viet Hoang
```shell
vahoang@orcsle139:/tmp$ rm a.out && g++ t.cpp -O2 -c && g++ main.cpp -O2 -c && g++ main.o t.o -O2 && time ./a.out
f is:1.67772e+07

real    0m2.260s
user    0m2.258s
sys     0m0.001s
vahoang@orcsle139:/tmp$ rm a.out && icpc t.cpp -O2 -c && icpc main.cpp -O2 -c && icpc main.o t.o -O2 && time ./a.out
f is:1.67772e+07

real    0m2.260s
user    0m2.256s
sys     0m0.003s
vahoang@orcsle139:/tmp$ cat t.cpp
#include <iostream>
#include <cmath>
float my_copysign_1(float x);
using namespace std;
float my_copysign_1(float x) {
    return copysignf(1.0f,x);
}
vahoang@orcsle139:/tmp$ cat main.cpp
#include <iostream>
#include <cmath>
float my_copysign_1(float x);
using namespace std;
int main () {
    int i = 0;
    int MAX = 1000000000;
    float f ;
    for (i = 0; i < MAX ; i++)
        f = my_copysign_1( 2.0f) + f ;
    cout << "f is:" << f << endl;
    return 0;
}
```
The difference here is dwarfed by higher-order effects. When I run your benchmark 64 times (using `perf` from `linux-tools-generic`, with elevated priority to reduce scheduling overhead):

```shell
sudo chrt -f 99 perf stat -r 64 -d ./test-g++
sudo chrt -f 99 perf stat -r 64 -d ./test-icpc
```
I get:
g++: 3060.461358 ms ± 0.02%
icpc: 3062.379933 ms ± 0.02%
This is statistically significant to better than 99.99% certainty.
The versions involved are:
icpc (ICC) 18.0.0 20170811
g++ (GCC) 7.1.0
The difference isn't larger presumably because of superscalar execution, and perhaps hardware effects associated with the function call or latency stalls; I really don't know. But the above test shows that, for whatever reason, the Intel compiler's code is slower.
It should also, again, be obvious from first principles that this is the case: the Intel compiler emits twice as many instructions, and its sequence is a strict superset of GCC's. Independently of that, the code can be shortened by at least the `andnps` instruction while keeping an identical memory-access pattern for the literals.