Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

BTC family intrinsics - code generation needs to be improved

gpseek
Beginner
911 Views
[cpp]
// BT.cpp : Test bit intrinsics
#include <intrin.h> // header name was lost in the post; <intrin.h> is assumed, since it declares these intrinsics

#if defined(_M_X64)
#pragma intrinsic(_bittestandcomplement64)
#else
#pragma intrinsic(_bittestandcomplement)
#endif

int global_static_integer;

// "Function style": takes the mask by value and returns the modified copy.
__inline int btc_func_style(int mask, int i)
{
    _bittestandcomplement((long*) &mask, i);
    return mask;
}

// "Memory style": modifies the mask in place through a pointer.
__inline void btc_mem_style(int* mask, int i)
{
    _bittestandcomplement((long*) mask, i);
}

__declspec(noinline) void test_mem_style(int i)
{
    btc_mem_style(&global_static_integer, i);
}

__declspec(noinline) void test_func_style(int i)
{
    global_static_integer = btc_func_style(global_static_integer, i);
}

int main(int argc, char* argv[])
{
    test_mem_style(argc + 1);
    test_func_style(argc);
    return global_static_integer;
}
[/cpp]
Here is the output of these 2 simple test functions:
[cpp]
; mark_description "Intel C++ Compiler for applications running on IA-32, Version 12.1.3.300 Build 20120130";

?test_mem_style@@YAXH@Z PROC NEAR PRIVATE
; parameter 1(i): eax
        sub       esp, 12                                       ;27.1
        mov       edx, OFFSET FLAT: ?global_static_integer@@3HA ;28.3
        btc       DWORD PTR [edx], eax                          ;28.3
        setb      al                                            ;28.3
        add       esp, 12                                       ;29.1
        ret                                                     ;29.1
?test_mem_style@@YAXH@Z ENDP

?test_func_style@@YAXH@Z PROC NEAR PRIVATE
; parameter 1(i): eax
        sub       esp, 12                                       ;33.1
        mov       ecx, DWORD PTR [?global_static_integer@@3HA]  ;34.3
        lea       edx, DWORD PTR [esp]                          ;34.3
        mov       DWORD PTR [esp], ecx                          ;34.3
        btc       DWORD PTR [edx], eax                          ;34.3
        setb      al                                            ;34.3
        mov       eax, DWORD PTR [esp]                          ;34.3
        mov       DWORD PTR [?global_static_integer@@3HA], eax  ;34.3
        add       esp, 12                                       ;35.1
$LN51:
        ret                                                     ;35.1
?test_func_style@@YAXH@Z ENDP
[/cpp]
As you can easily see, the generated code is far from optimal. The compiler creates a few useless stack memory reads and writes, stack pointer adjustments, and a setb instruction for no obvious reason. Could somebody forward this to the compiler development group as an improvement request? Thanks!
The same problem exists for BTR and BTS intrinsics too. And the same issue has also been confirmed on x64.
0 Kudos
16 Replies
JenniferJ
Moderator

What are the compile options used?
Which OS is used, and which gcc or VC version?

Thanks,
Jennifer

gpseek
Beginner
I tested O2, O3 and Ox in Visual Studio 2005. I think the compiler will do the same thing with different setups.
I can also test VS 2010. However, I don't think it can make any difference at all.
gpseek
Beginner

I just ran the test again with the newest version available at this moment to see whether these have been improved.

Unfortunately, there have been no improvements on these at all for almost a year. The generated code is the same.

Could Jennifer help me cry about this again:) Thanks!

Compiler and options:

Intel(R) C++ Compiler for applications running on IA-32, Version 13.1.0.149 Build 20130118

; mark_description "-c -Qvc10 -Qlocation,link,$(VCInstallDir)\\bin -Zi -nologo -W3 -O2 -Oi -Qipo -Qftz- -D __INTEL_COMPILER=1310";

; mark_description " -D WIN32 -D NDEBUG -D _CONSOLE -D _UNICODE -D UNICODE -EHs -EHc -MD -GS -Gy -fp:precise -Zc:wchar_t -Zc:for";

; mark_description "Scope -Qansi-alias -YuStdAfx.h -FpRelease\\bt.pch -FA -FaRelease\\ -FoRelease\\ -FdRelease\\vc100.pdb -Gd -T";

; mark_description "P";

Marián__VooDooMan__M
New Contributor II

OT: I am very sorry to make an off-topic post, but I just want to ask how come gpseek can use

#pragma intrinsic(_bittestandcomplement64)

while I need to use:

#if defined(_MSC_VER) && !defined(__INTEL_COMPILER)
#   pragma intrinsic(abs)
#endif

and the same for others too (memset, memcpy, etc.), since otherwise I get a warning from ICC (or maybe an error; IIRC, I don't recall exactly why I disabled these MS-specific pragmas in the past).

JenniferJ
Moderator

I wasn't aware of your earlier post with the compiler options. But I've got it now, and am checking on it.

Jennifer

JenniferJ
Moderator

gpseek wrote:

Unfortunately, there have been no improvements on these at all for almost a year. The generated code is the same.

From our compiler engineer: the reason for this is that the 2nd parameter of "_bittestandcomplement()" is not an immediate number; otherwise it would be optimized.
The memory forms of these “bit test” instructions do not perform well. The best approach is to use the following to get better performance:

// Make sure n is an unsigned int, not a signed int
x[n / 32] |= (1 << (n % 32));

Is this work-around working for you?

Jennifer
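
A minimal sketch of the shift-and-mask approach suggested above, extended to all three operations. The helper names `set_bit`, `reset_bit` and `complement_bit` are hypothetical and do not come from any Intel or Microsoft header. The sketch assumes 32-bit words and an unsigned index, so that `n / 32` and `n % 32` can reduce to a shift and a mask. Note that `|=` sets a bit (the BTS case); the BTC case discussed in this thread would use `^=` instead.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers (not part of <intrin.h>): operate on a bitmap
// stored as an array of 32-bit words, indexed by an unsigned bit number.

// Set bit n (corresponds to what BTS does to the destination).
inline void set_bit(std::uint32_t* x, std::size_t n)
{
    x[n / 32] |= (std::uint32_t{1} << (n % 32));
}

// Clear bit n (corresponds to BTR).
inline void reset_bit(std::uint32_t* x, std::size_t n)
{
    x[n / 32] &= ~(std::uint32_t{1} << (n % 32));
}

// Invert bit n (corresponds to BTC).
inline void complement_bit(std::uint32_t* x, std::size_t n)
{
    x[n / 32] ^= (std::uint32_t{1} << (n % 32));
}
```

Using an unsigned constant for the shifted `1` also avoids shifting into the sign bit of a signed `int` when `n % 32` is 31.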

gpseek
Beginner

Jennifer J. (Intel) wrote:

From our compiler engineer: the reason for this is that the 2nd parameter of "_bittestandcomplement()" is not an immediate number; otherwise it would be optimized.
The memory forms of these “bit test” instructions do not perform well. The best approach is to use the following to get better performance:

// Make sure n is an unsigned int, not a signed int
x[n / 32] |= (1 << (n % 32));

Is this work-around working for you?

Jennifer

Jennifer,

This can hardly be called a workaround.

 1. Using unsigned won't help at all, because these intrinsic functions are not declared with unsigned parameters. I believe that in this case it is MS, not Intel, that is to blame. Intel just used the MS definitions to keep compatibility:

unsigned char _bittestandcomplement(long *a, long b);

unsigned char _bittestandcomplement64(__int64 *a, __int64 b);

Refer to http://msdn.microsoft.com/en-us/library/zbdxdb11(v=vs.90).aspx

 2. Your engineer's example is beside the point and outside the scope of the problem. What it does is what everybody does now: avoid the BT family instructions! I believe this is no longer best practice, because BT family instructions have been really fast since Core 2 Duo. But the compiler is still not good at generating these instructions. What I am asking for is to use these fast BTs in some cases. Using BTs can lessen port competition, for example, and thus create further optimization opportunities.

 

SergeyKostrov
Valued Contributor II
>>...
>>unsigned char _bittestandcomplement( long *a, long b );
>>unsigned char _bittestandcomplement64( __int64 *a, __int64 b );
>>...

You can't blame Intel for declarations of these intrinsic functions because they are Microsoft specific and take a look at a Copyright note in intrin.h header file. In Intel specific headers for intrinsic functions they are not declared at all.

In order to make your statements about effectiveness of code generation more valuable ( and fair! ) you need to provide examples of code generation with different C++ compilers.

I hope that your comment will be taken into account by Intel software engineers. Thanks.
gpseek
Beginner

Sergey Kostrov wrote:

In order to make your statements about effectiveness of code generation more valuable ( and fair! ) you need to provide examples of code generation with different C++ compilers.

I hope that your comment will be taken into account by Intel software engineers. Thanks.

I don't think code generation examples from other compilers are really needed for my case at all.

I'm saying that optimizing the generation of BT family instructions helps the Intel C/C++ compiler and, more importantly, the Intel Core platform. Is this still not enough?

The last time I checked with the MS compiler, the BT code was bad/slow too. However, this should not be an excuse for the Intel compiler at all. If nothing has changed since I last checked the AMD platform, AMD's BTs are not as fast as Intel's BTs have been since the C2D introduction. However, these fast BTs are mostly wasted simply because of lousy compilers, both MS's and Intel's included.

 

SergeyKostrov
Valued Contributor II
>>The last time I checked with the MS compiler, the BT code was bad/slow too. However, this should not be an excuse for the Intel
>>compiler at all. If nothing has changed since I last checked the AMD platform, AMD's BTs are not as fast as Intel's BTs have been
>>since the C2D introduction. However, these fast BTs are mostly wasted simply because of lousy compilers, both MS's and Intel's included.

It is really interesting to know.
Bernard
Valued Contributor I

>>>The compiler creates a few useless stack memory reads and writes, stack pointer adjustments, and a setb instruction for no obvious reason>>>

I agree that the stack reads/writes are useless, but the setb instruction could be related to the btc instruction and inserted automatically by the compiler to capture CF. Maybe the use of setb is hardcoded by the compiler designers?

gpseek
Beginner

iliyapolak wrote:

>>>The compiler creates a few useless stack memory reads and writes, stack pointer adjustments, and a setb instruction for no obvious reason>>>

I agree that the stack reads/writes are useless, but the setb instruction could be related to the btc instruction and inserted automatically by the compiler to capture CF. Maybe the use of setb is hardcoded by the compiler designers?

The generated setb instructions are useless here for these calls.

I think I know why they chose to generate a setb instruction there in the first place. Take a look at the _bittestandcomplement declaration on the MS site. You will see that _bittestandcomplement returns "the bit at the position specified". I think MS overdid the job. _bittestandcomplement does not really need such a return value in most of the cases you can think of. In my test cases shown in the original post, the return value is not used at all.

However, the compiler is not sophisticated enough to know that if there is no reference to the return value, it need not bother to calculate it.

That is why you see the redundant setb there.
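
The return-value behaviour described above can be sketched portably. This is only an illustration of the documented semantics (return the old bit, then invert it), not the intrinsic itself; the name `bit_test_and_complement` and the two callers are hypothetical, and the sketch assumes `0 <= b < 32`. The `setb` in the listing corresponds to materialising the returned bit from the carry flag, which is dead work when the caller discards the result.

```cpp
// Hypothetical portable stand-in for the documented behaviour of
// _bittestandcomplement: return the previous value of bit b, then invert it.
// Assumption: 0 <= b < 32; behaviour outside that range is not modelled.
inline unsigned char bit_test_and_complement(long* a, long b)
{
    const unsigned long mask = 1UL << static_cast<unsigned long>(b);
    const unsigned long old  = static_cast<unsigned long>(*a);
    *a = static_cast<long>(old ^ mask);
    return static_cast<unsigned char>((old & mask) != 0);
}

// Caller that ignores the returned bit, like the test functions in the
// original post: only the toggled value matters here.
inline long toggle_ignoring_result(long value, long b)
{
    (void)bit_test_and_complement(&value, b);
    return value;
}

// Caller that actually consumes the returned bit: here the old bit
// (the carry flag after BTC) is genuinely needed.
inline bool toggle_and_report(long& value, long b)
{
    return bit_test_and_complement(&value, b) != 0;
}
```

In the first caller an optimizer that inlines the helper is free to drop the computation of the old bit entirely; only the second caller needs it.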

 

Bernard
Valued Contributor I

>>>The generated setb instructions are useless here for these calls>>>

Yes I agree with you on setb instruction.

gpseek
Beginner

bump up:)

Any update on this? Thanks

gpseek
Beginner

Any update on this?! Thanks!

Matthew_Oliver
Beginner

I too would be interested in the compiler being updated to properly optimize the BT family of intrinsics. If both inputs are in registers, then this intrinsic should be able to generate a single BTR etc. instruction and just leave it at that. The way it is currently handled makes the whole thing far slower than it needs to be.

Also, given that these instructions are actually being used by Intel in their Embree RT code, I would have thought that this sort of thing would have been fixed, as it's also affecting them.
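
A sketch of the register-only case described above, where both the value and the bit index are plain values rather than memory operands. The function names `btr_value`, `bts_value` and `btc_value` are hypothetical. Whether a given compiler lowers each expression to a single BTR/BTS/BTC is an assumption that has to be checked in the generated assembly (for example with the -FA option used earlier in this thread). Masking the index with 31 mirrors how the register form of these instructions treats its bit offset for a 32-bit operand.

```cpp
#include <cstdint>

// Hypothetical value-in, value-out helpers: no memory operand is involved,
// so a compiler is free to keep everything in registers.

// Clear bit i of x (candidate for a register-form BTR).
inline std::uint32_t btr_value(std::uint32_t x, std::uint32_t i)
{
    return x & ~(std::uint32_t{1} << (i & 31u));
}

// Set bit i of x (candidate for a register-form BTS).
inline std::uint32_t bts_value(std::uint32_t x, std::uint32_t i)
{
    return x | (std::uint32_t{1} << (i & 31u));
}

// Invert bit i of x (candidate for a register-form BTC).
inline std::uint32_t btc_value(std::uint32_t x, std::uint32_t i)
{
    return x ^ (std::uint32_t{1} << (i & 31u));
}
```

Because the helpers take and return values, they match the "function style" test from the original post without forcing the mask through the stack.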
