Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.
7956 Discussions

Bit test intrinsics functions wanted!

gpseek
Beginner
2,415 Views

BT, BTS and BTC instructions are fast again in Core 2.smiley [:-)] Could your compiler guys impleament those bit test intrinsics? I think BT instruction should be impleamented at least.

The intrinsics benifits are quite obvious. Suppose we want to test if bit i is set in an integerbitmap, we usually do this in C/C++:

if (bitmap & (1 << i))

........

The problems of the above C test are

1. more intructions genereated and

2. register cl is needed, thus increasingregister pressure. And moreregister swap/save instructions often neededbecause rcx/ecx is often used as an function parameter.

Another plus for _bit_test(integer, index) is that it reduces code size.

One additional suggestion to the compiler optimization:

Sometimes(not always) bitmap & (1 << i) should be compiledas a BT instruction.

Thanks!

0 Kudos
31 Replies
JenniferJ
Moderator
755 Views
We have added more support for new intrinsics of VS 2005. This feature is in the nextversion of 10.0 product. I'll post the release news here when available.
0 Kudos
JenniferJ
Moderator
755 Views

Hello,

A new compiler 10.1 has been just released yesterday. You should be able to download it from the Registration center now.

This 10.1 compiler supports the following new intrinsics that might be interested to you. The definitions of those intrinsics are in the VS2005 header file:

_BitScanForward

_BitScanReverse

_bittest

_bittestandcomplement

_bittestandreset

_bittestandset

_BitScanForward64

_BitScanReverse64

_bittest64

_bittestandcomplement64

_bittestandreset64

_bittestandset64

0 Kudos
levicki
Valued Contributor I
755 Views

Why there are no short names and why the capitalization is not consistent?

0 Kudos
JenniferJ
Moderator
755 Views
Note that those intrinsics are defined by VS2005. ICL supports them now in 10.1 release.
0 Kudos
gpseek
Beginner
755 Views

Jennifer, Thank you a lot to get them implemented those intrinsicssmiley [:-)].

I dont't think the long name is a big deal here though it is a liitle bit hard to remember the long names. I can use inline functions to wrap them up to the way I like. The good thing is one does not need to add #if stuff todifferenciate Intel orMS build.

However, more optimization work needs to be done with those intrinsics to make them useful.

Here is a simple and quick test:

// BT.cpp : Test bt intrinsics

//

#include

inline

int bt(const long m, int i)

{

return _bittest(&m, i);

}

int

main(int argc, char* argv[])

{

return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;

}

The asm output from Intel C/C++ 10.1 comipler for the x64 build (32 bit output is basically the same thing):

; -- Machine type EFI2
; mark_description "Intel C++ Compiler for applications running on Intel 64, Version 10.1 Build 20070913 %s";
.686P
.387
OPTION DOTNAME
ASSUME CS:FLAT,DS:FLAT,SS:FLAT
;ident "/manifestdependency:"type='win32' name='Microsoft.VC80.CRT' version='8.0.50727.762' processorArchitecture='amd64' publicKeyToken='1fc8b3b9a1e18e3b'""
_TEXT SEGMENT DWORD PUBLIC FLAT 'CODE'
; COMDAT main
TXTST0:
; -- Begin main
; mark_begin;
IF @Version GE 800
.MMX
ELSEIF @Version GE 612
.MMX
MMWORD TEXTEQU 
ENDIF
IF @Version GE 800
.XMM
ELSEIF @Version GE 614
.XMM
XMMWORD TEXTEQU 
ENDIF
ALIGN 2
PUBLIC main
main PROC NEAR
; parameter 1: ecx
; parameter 2: rdx
$B1$1: ; Preds $B1$0
;;; {
sub rsp, 40 ;12.1
mov DWORD PTR [rsp+48], ecx ;12.1
mov ecx, 3 ;12.1
call __intel_new_proc_init ;12.1
; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$5: ; Preds $B1$1
stmxcsr DWORD PTR [rsp+32] ;12.1
or DWORD PTR [rsp+32], 32832 ;12.1
ldmxcsr DWORD PTR [rsp+32] ;12.1
;;; return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;
lea rdx, QWORD PTR [rsp+48] ;13.9
mov eax, 24 ;13.9
bt DWORD PTR [rdx], eax ;13.9
setb al ;13.9
; LOE rbx rbp rsi rdi r12 r13 r14 r15 eax xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$2: ; Preds $B1$5
movzx eax, al ;13.9
xor ecx, ecx ;13.46
mov edx, -1 ;13.46
cmp eax, 0 ;13.46
cmovne ecx, edx ;13.46
mov eax, ecx ;13.46
add rsp, 40 ;13.46
ret ;13.46
ALIGN 2
; LOE
; mark_end;
main ENDP
; COMDAT xdata
xdata SEGMENT
$unwind$main DD 010401H
DD 04204H
xdata ENDS
; COMDAT pdata
pdata SEGMENT
$pdata$main DD @imagerel(main#)
DD @imagerel(main#+76)
DD @imagerel($unwind$main#)
pdata ENDS
;main ENDS
_TEXT ENDS
_DATA SEGMENT DWORD PUBLIC FLAT 'DATA'
_DATA ENDS
; -- End main
_DATA SEGMENT DWORD PUBLIC FLAT 'DATA'
_DATA ENDS
EXTRN __intel_new_proc_init:PROC
END
As you see it, it is not optimized at all, which defeats the purpose of the intrinsic function.
_bittest only uses memory form of bt instructions, even the index is literal constatan and within length of an integer.
lea rdx, QWORD PTR [rsp+48]
mov eax, 24
setb al 
movzx eax, al
The above instructions are not needed at all if bt reg reg/const is chosen here and a conditional jump/set/move would the job nicely right after the bt intruction which changes the flag.Sad smiley [:(]
Microsoft VC compiler do
es the same thing:

; Listing generated by Microsoft Optimizing Compiler Version 14.00.50727.762

include listing.inc

INCLUDELIB OLDNAMES

PUBLIC main

; Function compile flags: /Ogtpy

; File c:documents and settingsdzhaomy documentsvisual studio 2005projectsttt.cpp

; COMDAT main

_TEXT SEGMENT

argc$ = 8

argv$ = 16

main PROC ; COMDAT

; 13 : return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;

00000 48 8d 44 24 08 lea rax, QWORD PTR argc$[rsp]

00005 89 4c 24 08 mov DWORD PTR [rsp+8], ecx

00009 0f ba 20 18 bt DWORD PTR [rax], 24

0000d 0f 92 c0 setb al

00010 f6 d8 neg al

00012 1b c0 sbb eax, eax

; 14 : }

00014 c3 ret 0

main ENDP

_TEXT ENDS

END

0 Kudos
JenniferJ
Moderator
755 Views

You're right. More improvements will be in the next major release and icl will support some bitscan and bittest intrinsics without memory reference. Sorry no schedule on next release.

Example is below.

Source code - a.c
unsigned char _bittest(long *, long);
unsigned char my_bittest(long num)
{
return _bittest(&num, 15);
}

With the next version icl, you'll get something like below:
my_bittest PROC
; parameter 1: ecx
$B1$1:: ; Preds $B1$0
mov edx, 15 ;25.12
xor eax, eax ;25.12
mov r8d, 1 ;25.12
bt ecx, edx ;25.12
cmovc eax, r8d ;25.12
ret ;25.12

0 Kudos
levicki
Valued Contributor I
755 Views

Jennifer, Instead of:

mov	edx, 15
bt	ecx, edx

If 15 is a constant known at compile time it would be better just:

bt	ecx, 15

Note that you can also save some bytes in long mode by using edx instead of r8d:

	xor	eax, eax
	mov	edx, 1
	bt	ecx, 15
	cmovc	eax, edx
	ret

Or how about this:

	xor	eax, eax
	bt	ecx, 15
	rcl	eax, 1	; this shifts the carry in giving 0 or 1 in eax
	ret

Seems a lot shorter, no? If there are no false dependencies on the flags (from partial access) it would probably also run faster.

0 Kudos
gpseek
Beginner
755 Views
I ran the same test again today and wanted to report the great results fromIntel C++ Compiler 12.0.2.154:

// BT.cpp : Test bt intrinsics
#include

inline int bt(const long m, int i)
{
return _bittest(&m, i);
}

int main(int argc, char* argv[])
{
return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;
}


Intel asm output snip:

xor eax, eax
mov ecx, -1
bt edx, 24 ;18.2
cmovc eax, ecx

The IntelASM outputis simply great!

now, compare it to Microsoft C++ Compiler Version 16.00.31118.01 (shipped with VS10) asm code snip:

lea eax, DWORD PTR _argc$[ebp]
bt DWORD PTR [eax], 24
setb al
movzx eax, al
neg eax
sbb eax, eax

Great work! Thanks everybody!

0 Kudos
jimdempseyatthecove
Honored Contributor III
755 Views
How about

xor eax,eax
bt edx,24
adc eax,eax

Jim Dempsey
0 Kudos
gpseek
Beginner
755 Views
How about

xor eax,eax
bt edx,24
adc eax,eax

Jim Dempsey

That is better. However, it is not really bit test specific. You're addressing a more general problem: How to handle conditional jump more efficiently.

The compiler's conditional handling handling is OKbut not perfect.
Anyways,I'm already pleased by the bit test code generation: it avoids memory whenever possible.

0 Kudos
jimdempseyatthecove
Honored Contributor III
755 Views
The above has three attractive features

1) it removes one instruction
2) it removes on immediate value
3) works on processors that do not support conditional move

Jim Dempsey
0 Kudos
Reply