- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I would like to inmplenet the code by SSE/AVX:
A is a array pointer, which may be float or int with size n:
B is a integer array pointer, to save the index of A which element is greater than zero.
int ith = 0;
for(int i = 0; i< n; i++){
if(A > 0)
B[ith++] = i;
}/*for i*/
as you know, the bottleneck is the cashe exchange, A and B could mapping the same cashe line, that make slow.
how should I do to implement this ?
thank you.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
as far as i know, it's right only on parallel execution "world". I any case, it's possible only for some tail and head bytes of arrays
on sequential execution "world" you should pay attention to associativity of the cache
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[cpp]Vc::MemoryI don't think this will give you a speedup. But at least it's somewhat possible to do this with SSE/AVX. What is really missing (LRBni apparently has it), is a compressing store. I.e. you have a vector, a mask and a store address. Instead of storing at it's usual location in memory (as maskmovdqu does) you'd need an instruction that aligns the masked values to the left and stores them consecutively at the store address. Or well, scatters...A; for(int i = 0; i < A.vectorsCount(); ++i) { foreach_bit(int j, A.vector(i) > 0) { B[bpos++] = i * float_v::Size + j; } }[/cpp]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You can achieve a similar effect like a compressing store with a permute or shuffle instruction. The only problem is the shuffle control mask. However, this can be retrieved from a look-up table as there are only 16 possible entries. First, you convert your sign bits from the comparison resultto an integer using _mm_movemask_ps (movmskps).This integer is then the indexfor your lookup table. popcnt or a second look-up table can tell you, how much you need to advance bpos.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Suppose given n is the multipler of 4 (size of integer).
Basic idea to use pcmpgtd, movmskpd and then bt.
//////////////////////////////////////////////////////////////////////
;===================================================================
; int search_array_sse2(int *arr_dst, int *arr_src, const int n)
;===================================================================
search_array_sse2:
push ebx
push esi
push edi
%define STACK_SIZE 12
mov edi, [esp+STACK_SIZE+4] ; dst
mov esi, [esp+STACK_SIZE+8] ; src
mov ecx, [esp+STACK_SIZE+12] ; n
xor ebx, ebx ; turn_idx
pxor xmm7, xmm7
.loops:
movdqa xmm0, [esi]
pcmpgtd xmm0, xmm7
; movmskps - get the index of positive number
movmskps edx, xmm0
bt edx, 0
jc .carry_bit0
.try_bt1:
bt edx, 1
jc .carry_bit1
.try_bt2:
bt edx, 2
jc .carry_bit2
.try_bt3:
bt edx, 3
jc .carry_bit3
.end_bt:
lea esi, [esi+16]
add ebx, 4
cmp ebx, ecx
jb near .loops
jmp .post_proc
.carry_bit0:
mov [edi], ebx
lea edi, [edi+4]
jmp .try_bt1
.carry_bit1:
mov eax, ebx
inc eax
mov [edi], eax
lea edi, [edi+4]
jmp .try_bt2
.carry_bit2:
mov eax, ebx
add eax, 2
mov [edi], eax
lea edi, [edi+4]
jmp .try_bt3
.carry_bit3:
mov eax, ebx
add eax, 3
mov [edi], eax
lea edi, [edi+4]
jmp .end_bt
.post_proc:
; count_positive_num
mov esi, [esp+STACK_SIZE+4] ; dst base
mov eax, edi
sub eax, esi
sar eax, 2
%undef STACK_SIZE
pop edi
pop esi
pop ebx
ret
//////////////////////////////////////////////////////////////////////
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Anyway, I would study this , thank you.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page