Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software Development Technologies
- Intel® ISA Extensions
- how to use SSE/AVX to find the array index for element > 0 ?

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Highlighted
##

Hi:

I would like to inmplenet the code by SSE/AVX:

A is a array pointer, which may be float or int with size n:

B is a integer array pointer, to save the index of A which element is greater than zero.

int ith = 0;

for(int i = 0; i< n; i++){

if(A* > 0) *

B[ith++] = i;

}/*for i*/

as you know, the bottleneck is the cashe exchange, A and B could mapping the same cashe line, that make slow.

how should I do to implement this ?

thank you.

IDZ_A_Intel

Employee

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

02-07-2011
09:08 AM

86 Views

how to use SSE/AVX to find the array index for element > 0 ?

I would like to inmplenet the code by SSE/AVX:

A is a array pointer, which may be float or int with size n:

B is a integer array pointer, to save the index of A which element is greater than zero.

int ith = 0;

for(int i = 0; i< n; i++){

if(A

B[ith++] = i;

}/*for i*/

as you know, the bottleneck is the cashe exchange, A and B could mapping the same cashe line, that make slow.

how should I do to implement this ?

thank you.

5 Replies

Highlighted
##

>> A and B could mapping the same cashe line, that make slow

as far as i know, it's right only on parallel execution "world". I any case, it's possible only for some tail and head bytes of arrays

on sequential execution "world" you should pay attention to associativity of the cache

Ilnar

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

02-07-2011
10:36 AM

86 Views

as far as i know, it's right only on parallel execution "world". I any case, it's possible only for some tail and head bytes of arrays

on sequential execution "world" you should pay attention to associativity of the cache

Highlighted
##

Matthias_Kretz

New Contributor I

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

02-08-2011
02:13 AM

86 Views

[cpp]Vc::MemoryI don't think this will give you a speedup. But at least it's somewhat possible to do this with SSE/AVX. What is really missing (LRBni apparently has it), is a compressing store. I.e. you have a vector, a mask and a store address. Instead of storing at it's usual location in memory (as maskmovdqu does) you'd need an instruction that aligns the masked values to the left and stores them consecutively at the store address. Or well, scatters...A; for(int i = 0; i < A.vectorsCount(); ++i) { foreach_bit(int j, A.vector(i) > 0) { B[bpos++] = i * float_v::Size + j; } }[/cpp]

Highlighted
##

Thomas_W_Intel

Employee

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

02-18-2011
12:39 PM

86 Views

Highlighted
##

This is the implementation with NASM compileras below.

Suppose given n is the multipler of 4 (size of integer).

Basic idea to use pcmpgtd, movmskpd and then bt.

//////////////////////////////////////////////////////////////////////

//////////////////////////////////////////////////////////////////////

denbianh

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-01-2011
06:56 PM

86 Views

Suppose given n is the multipler of 4 (size of integer).

Basic idea to use pcmpgtd, movmskpd and then bt.

//////////////////////////////////////////////////////////////////////

;===================================================================

; int search_array_sse2(int *arr_dst, int *arr_src, const int n)

;===================================================================

search_array_sse2:

push ebx

push esi

push edi

%define STACK_SIZE 12

mov edi, [esp+STACK_SIZE+4] ; dst

mov esi, [esp+STACK_SIZE+8] ; src

mov ecx, [esp+STACK_SIZE+12] ; n

xor ebx, ebx ; turn_idx

pxor xmm7, xmm7

.loops:

movdqa xmm0, [esi]

pcmpgtd xmm0, xmm7

; movmskps - get the index of positive number

movmskps edx, xmm0

bt edx, 0

jc .carry_bit0

.try_bt1:

bt edx, 1

jc .carry_bit1

.try_bt2:

bt edx, 2

jc .carry_bit2

.try_bt3:

bt edx, 3

jc .carry_bit3

.end_bt:

lea esi, [esi+16]

add ebx, 4

cmp ebx, ecx

jb near .loops

jmp .post_proc

.carry_bit0:

mov [edi], ebx

lea edi, [edi+4]

jmp .try_bt1

.carry_bit1:

mov eax, ebx

inc eax

mov [edi], eax

lea edi, [edi+4]

jmp .try_bt2

.carry_bit2:

mov eax, ebx

add eax, 2

mov [edi], eax

lea edi, [edi+4]

jmp .try_bt3

.carry_bit3:

mov eax, ebx

add eax, 3

mov [edi], eax

lea edi, [edi+4]

jmp .end_bt

.post_proc:

; count_positive_num

mov esi, [esp+STACK_SIZE+4] ; dst base

mov eax, edi

sub eax, esi

sar eax, 2

%undef STACK_SIZE

pop edi

pop esi

pop ebx

ret

//////////////////////////////////////////////////////////////////////

Highlighted
##

Thank you lots, though I have given up using SSE to optimizing this code.

Anyway, I would study this , thank you.

IDZ_A_Intel

Employee

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-28-2011
09:31 PM

86 Views

Anyway, I would study this , thank you.

For more complete information about compiler optimizations, see our Optimization Notice.