- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
BT, BTS and BTC instructions are fast again in Core 2. Could your compiler guys impleament those bit test intrinsics? I think BT instruction should be impleamented at least.
The intrinsics benifits are quite obvious. Suppose we want to test if bit i is set in an integerbitmap, we usually do this in C/C++:
if (bitmap & (1 << i))
........
The problems of the above C test are
1. more intructions genereated and
2. register cl is needed, thus increasingregister pressure. And moreregister swap/save instructions often neededbecause rcx/ecx is often used as an function parameter.
Another plus for _bit_test(integer, index) is that it reduces code size.
One additional suggestion to the compiler optimization:
Sometimes(not always) bitmap & (1 << i) should be compiledas a BT instruction.
Thanks!
Link Copied
- « Previous
-
- 1
- 2
- Next »
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
A new compiler 10.1 has been just released yesterday. You should be able to download it from the Registration center now.
This 10.1 compiler supports the following new intrinsics that might be interested to you. The definitions of those intrinsics are in the VS2005 header file:
_BitScanForward
_BitScanReverse
_bittest
_bittestandcomplement
_bittestandreset
_bittestandset
_BitScanForward64
_BitScanReverse64
_bittest64
_bittestandcomplement64
_bittestandreset64
_bittestandset64
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Why there are no short names and why the capitalization is not consistent?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jennifer, Thank you a lot to get them implemented those intrinsics.
I dont't think the long name is a big deal here though it is a liitle bit hard to remember the long names. I can use inline functions to wrap them up to the way I like. The good thing is one does not need to add #if stuff todifferenciate Intel orMS build.
However, more optimization work needs to be done with those intrinsics to make them useful.
Here is a simple and quick test:
// BT.cpp : Test bt intrinsics
//
#include
inline
int bt(const long m, int i){
return _bittest(&m, i);}
int
main(int argc, char* argv[]){
return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;}
The asm output from Intel C/C++ 10.1 comipler for the x64 build (32 bit output is basically the same thing):
; -- Machine type EFI2
; mark_description "Intel C++ Compiler for applications running on Intel 64, Version 10.1 Build 20070913 %s";
.686P
.387
OPTION DOTNAME
ASSUME CS:FLAT,DS:FLAT,SS:FLAT
;ident "/manifestdependency:"type='win32' name='Microsoft.VC80.CRT' version='8.0.50727.762' processorArchitecture='amd64' publicKeyToken='1fc8b3b9a1e18e3b'""
_TEXT SEGMENT DWORD PUBLIC FLAT 'CODE'
; COMDAT main
TXTST0:
; -- Begin main
; mark_begin;
IF @Version GE 800
.MMX
ELSEIF @Version GE 612
.MMX
MMWORD TEXTEQU
ENDIF
IF @Version GE 800
.XMM
ELSEIF @Version GE 614
.XMM
XMMWORD TEXTEQU
ENDIF
ALIGN 2
PUBLIC main
main PROC NEAR
; parameter 1: ecx
; parameter 2: rdx
$B1$1: ; Preds $B1$0
;;; {
sub rsp, 40 ;12.1
mov DWORD PTR [rsp+48], ecx ;12.1
mov ecx, 3 ;12.1
call __intel_new_proc_init ;12.1
; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$5: ; Preds $B1$1
stmxcsr DWORD PTR [rsp+32] ;12.1
or DWORD PTR [rsp+32], 32832 ;12.1
ldmxcsr DWORD PTR [rsp+32] ;12.1
;;; return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;
lea rdx, QWORD PTR [rsp+48] ;13.9
mov eax, 24 ;13.9
bt DWORD PTR [rdx], eax ;13.9
setb al ;13.9
; LOE rbx rbp rsi rdi r12 r13 r14 r15 eax xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$2: ; Preds $B1$5
movzx eax, al ;13.9
xor ecx, ecx ;13.46
mov edx, -1 ;13.46
cmp eax, 0 ;13.46
cmovne ecx, edx ;13.46
mov eax, ecx ;13.46
add rsp, 40 ;13.46
ret ;13.46
ALIGN 2
; LOE
; mark_end;
main ENDP
; COMDAT xdata
xdata SEGMENT
$unwind$main DD 010401H
DD 04204H
xdata ENDS
; COMDAT pdata
pdata SEGMENT
$pdata$main DD @imagerel(main#)
DD @imagerel(main#+76)
DD @imagerel($unwind$main#)
pdata ENDS
;main ENDS
_TEXT ENDS
_DATA SEGMENT DWORD PUBLIC FLAT 'DATA'
_DATA ENDS
; -- End main
_DATA SEGMENT DWORD PUBLIC FLAT 'DATA'
_DATA ENDS
EXTRN __intel_new_proc_init:PROC
END
As you see it, it is not optimized at all, which defeats the purpose of the intrinsic function.
_bittest only uses memory form of bt instructions, even the index is literal constatan and within length of an integer.
lea rdx, QWORD PTR [rsp+48]
mov eax, 24
setb al
movzx eax, al
The above instructions are not needed at all if bt reg reg/const is chosen here and a conditional jump/set/move would the job nicely right after the bt intruction which changes the flag.
Microsoft VC compiler do es the same thing:
; Listing generated by Microsoft Optimizing Compiler Version 14.00.50727.762
include listing.inc
INCLUDELIB OLDNAMES
PUBLIC main
; Function compile flags: /Ogtpy
; File c:documents and settingsdzhaomy documentsvisual studio 2005projectsttt.cpp
; COMDAT main
_TEXT SEGMENT
argc$ = 8
argv$ = 16
main PROC ; COMDAT
; 13 : return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;
00000 48 8d 44 24 08 lea rax, QWORD PTR argc$[rsp]
00005 89 4c 24 08 mov DWORD PTR [rsp+8], ecx
00009 0f ba 20 18 bt DWORD PTR [rax], 24
0000d 0f 92 c0 setb al
00010 f6 d8 neg al
00012 1b c0 sbb eax, eax
; 14 : }
00014 c3 ret 0
main ENDP
_TEXT ENDS
END
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You're right. More improvements will be in the next major release and icl will support some bitscan and bittest intrinsics without memory reference. Sorry no schedule on next release.
Example is below.
Source code - a.c
unsigned char _bittest(long *, long);
unsigned char my_bittest(long num)
{
return _bittest(&num, 15);
}
With the next version icl, you'll get something like below:
my_bittest PROC
; parameter 1: ecx
$B1$1:: ; Preds $B1$0
mov edx, 15 ;25.12
xor eax, eax ;25.12
mov r8d, 1 ;25.12
bt ecx, edx ;25.12
cmovc eax, r8d ;25.12
ret ;25.12
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jennifer, Instead of:
mov edx, 15 bt ecx, edx
If 15 is a constant known at compile time it would be better just:
bt ecx, 15
Note that you can also save some bytes in long mode by using edx instead of r8d:
xor eax, eax mov edx, 1 bt ecx, 15 cmovc eax, edx ret
Or how about this:
xor eax, eax bt ecx, 15 rcl eax, 1 ; this shifts the carry in giving 0 or 1 in eax ret
Seems a lot shorter, no? If there are no false dependencies on the flags (from partial access) it would probably also run faster.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
// BT.cpp : Test bt intrinsics
#include
inline int bt(const long m, int i)
{
return _bittest(&m, i);
}
int main(int argc, char* argv[])
{
return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;
}
Intel asm output snip:
xor eax, eax
mov ecx, -1
bt edx, 24 ;18.2
cmovc eax, ecx
The IntelASM outputis simply great!
now, compare it to Microsoft C++ Compiler Version 16.00.31118.01 (shipped with VS10) asm code snip:
lea eax, DWORD PTR _argc$[ebp]
bt DWORD PTR [eax], 24
setb al
movzx eax, al
neg eax
sbb eax, eax
Great work! Thanks everybody!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
xor eax,eax
bt edx,24
adc eax,eax
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
xor eax,eax
bt edx,24
adc eax,eax
Jim Dempsey
That is better. However, it is not really bit test specific. You're addressing a more general problem: How to handle conditional jump more efficiently.
The compiler's conditional handling handling is OKbut not perfect.
Anyways,I'm already pleased by the bit test code generation: it avoids memory whenever possible.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
1) it removes one instruction
2) it removes on immediate value
3) works on processors that do not support conditional move
Jim Dempsey
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page
- « Previous
-
- 1
- 2
- Next »