Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.
Welcome to the Intel Community. If you get an answer you like, please mark it as an Accepted Solution to help others. Thank you!
For the latest information on Intel’s response to the Log4j/Log4Shell vulnerability, please see Intel-SA-00646
1052 Discussions

how to use SSE/AVX to find the array index for element > 0 ?


I would like to inmplenet the code by SSE/AVX:

A is a array pointer, which may be float or int with size n:

B is a integer array pointer, to save the index of A which element is greater than zero.

int ith = 0;

for(int i = 0; i< n; i++){

if(A > 0)
B[ith++] = i;

}/*for i*/

as you know, the bottleneck is the cashe exchange, A and B could mapping the same cashe line, that make slow.

how should I do to implement this ?

thank you.
0 Kudos
5 Replies
>> A and B could mapping the same cashe line, that make slow
as far as i know, it's right only on parallel execution "world". I any case, it's possible only for some tail and head bytes of arrays
on sequential execution "world" you should pay attention to associativity of the cache
New Contributor I
[cpp]Vc::Memory A;

for(int i = 0; i < A.vectorsCount(); ++i) {
  foreach_bit(int j, A.vector(i) > 0) {
    B[bpos++] = i * float_v::Size + j;
I don't think this will give you a speedup. But at least it's somewhat possible to do this with SSE/AVX. What is really missing (LRBni apparently has it), is a compressing store. I.e. you have a vector, a mask and a store address. Instead of storing at it's usual location in memory (as maskmovdqu does) you'd need an instruction that aligns the masked values to the left and stores them consecutively at the store address. Or well, scatters...

You can achieve a similar effect like a compressing store with a permute or shuffle instruction. The only problem is the shuffle control mask. However, this can be retrieved from a look-up table as there are only 16 possible entries. First, you convert your sign bits from the comparison resultto an integer using _mm_movemask_ps (movmskps).This integer is then the indexfor your lookup table. popcnt or a second look-up table can tell you, how much you need to advance bpos.

This is the implementation with NASM compileras below.
Suppose given n is the multipler of 4 (size of integer).
Basic idea to use pcmpgtd, movmskpd and then bt.



; int search_array_sse2(int *arr_dst, int *arr_src, const int n)



push ebx

push esi

push edi

%define STACK_SIZE 12

mov edi, [esp+STACK_SIZE+4] ; dst

mov esi, [esp+STACK_SIZE+8] ; src

mov ecx, [esp+STACK_SIZE+12] ; n

xor ebx, ebx ; turn_idx

pxor xmm7, xmm7


movdqa xmm0, [esi]

pcmpgtd xmm0, xmm7

; movmskps - get the index of positive number

movmskps edx, xmm0

bt edx, 0

jc .carry_bit0


bt edx, 1

jc .carry_bit1


bt edx, 2

jc .carry_bit2


bt edx, 3

jc .carry_bit3


lea esi, [esi+16]

add ebx, 4

cmp ebx, ecx

jb near .loops

jmp .post_proc


mov [edi], ebx

lea edi, [edi+4]

jmp .try_bt1


mov eax, ebx

inc eax

mov [edi], eax

lea edi, [edi+4]

jmp .try_bt2


mov eax, ebx

add eax, 2

mov [edi], eax

lea edi, [edi+4]

jmp .try_bt3


mov eax, ebx

add eax, 3

mov [edi], eax

lea edi, [edi+4]

jmp .end_bt


; count_positive_num

mov esi, [esp+STACK_SIZE+4] ; dst base

mov eax, edi

sub eax, esi

sar eax, 2


pop edi

pop esi

pop ebx


Thank you lots, though I have given up using SSE to optimizing this code.

Anyway, I would study this , thank you.