
AVX best performance min function with unsigned char

Hi everybody and thanks for your help!

I have this piece of code:

unsigned char A,B,C;

// init A,B,C with mm_malloc, 64 bit aligned 

       C = fminf(255,255-(A*B));

A, B, and C are 8-bit data types, so with AVX vectorization I should get 16 operations per clock cycle, but fminf works with the 32-bit float data type, so I get only 8 operations per clock cycle. I see that the Intel intrinsics include a min function for the u8 data type.

I tried to translate the loop into intrinsics, but I cannot find load and multiply functions for the packed u8 data type (epu8).

How can I obtain the maximum performance in this loop?


Best regards



2 Replies
Black Belt

Did you check for vectorization with std::min()? fminf(), as you indicate, implies promotion to the float data type, along with handling of NaN and Inf operands, for which icc doesn't offer shortcuts as gcc does. OTOH, Intel C++ offers a range of vectorization possibilities with std::min(), and, with recent versions of Intel C, even with comparisons written out with the ?: operator.
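A minimal sketch of the std::min() form, for illustration only. It assumes the arrays are pointers indexed per element and that the intent is to clamp the product at 255 before subtracting; the question states neither, and the helper name invert_saturated_product is hypothetical. Whether a given compiler vectorizes this loop still has to be checked in its optimization report.

```cpp
#include <algorithm>  // std::min
#include <cassert>
#include <cstddef>

// Hypothetical helper: c[i] = 255 - min(a[i] * b[i], 255), computed in int.
// ASSUMPTION: the product is meant to saturate at 255 (not stated in the question).
void invert_saturated_product(const unsigned char* a, const unsigned char* b,
                              unsigned char* c, std::size_t n) {
    assert(n == 0 || (a && b && c));
    for (std::size_t i = 0; i < n; ++i) {
        int product = a[i] * b[i];             // operands promoted to int, at most 65025
        int clamped = std::min(product, 255);  // integer min, no conversion to float
        c[i] = static_cast<unsigned char>(255 - clamped);  // always in 0..255
    }
}
```

Because everything stays in integer arithmetic, there is no NaN or Inf handling for the compiler to preserve.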

As you suggested, evaluation of your expression probably has to be done in promotion to a more suitable data type such as signed int as implied in your code; your compiler's translation of C code ought to be a good indication.

The max and min operations are notorious for requiring different source code to optimize with each of the popular compilers.

Black Belt

I assume the omission of the "*" on the type of A,B and C was a typographical error. This said, the use of fminf is ambiguous in addition to non-optimal.

The arguments to fminf are converted to float (the unsigned char operands are first promoted to int for the arithmetic), so the expression can return negative numbers when A*B exceeds 255. Though you did not state this, I think that you may intend for the product to saturate to 255.
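To make the negative result concrete, here is a small sketch that evaluates the expression from the question for a single pair of values. The function names question_expression and check_worked_values are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <math.h>  // fminf

// The expression from the question, evaluated for one element.
// a * b and 255 - (a * b) are computed as int, then converted to float.
float question_expression(unsigned char a, unsigned char b) {
    return fminf(255, 255 - (a * b));
}

// Worked values, written as assertions so the arithmetic is explicit.
void check_worked_values() {
    assert(question_expression(10, 10) == 155.0f);   // 255 - 100
    assert(question_expression(20, 20) == -145.0f);  // 255 - 400, below zero
}
```

Since 255 - (A*B) can never exceed 255, the fminf call never changes the value. Converting a negative float such as -145 to unsigned char is undefined behavior in C and C++, so the stored result cannot be relied on.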

*** Untested code:

unsigned char *A,*B,*C;

// init A,B,C with mm_malloc, 64 bit aligned 
#pragma vector aligned
for(int j=0;j<size;j++)
{
  short temp; // *** inside scope of for
  temp = ((short)A[j])*((short)B[j]); // hopefully compiler generates _mm256_cvtepu8_epi16
  temp = temp < (short)0 ? (short)256 : temp; // protect against product producing negative
  C[j] = temp > (short)255 ? (unsigned char)255 : (unsigned char)255-(unsigned char)temp;
}

Note, the above code is designed to assist the compiler in identifying vectorization opportunities. Placing temp inside the scope of the for declares that temp is disposable on exit from that scope (the last value need not be retained). This also makes it easier for the compiler to repurpose the scalar temp into a vector temp. The loop (hopefully) now promotes vectors of unsigned chars to vectors of shorts, produces the product, and replaces the product if it overflows the maximum value of a short. It then produces the final result with overflow protection.

I have not tested the above with full optimizations; that is for you to do. I see no reason why good compiler optimization could not fully vectorize the above.

Note, the seemingly excessive use of (cast) is intended to inhibit the compiler from the unnecessary promotion of the operations to int.
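One caveat that can be checked directly: under the C and C++ integer promotion rules, operands narrower than int are promoted before arithmetic, so the product of two shorts still has type int. The casts therefore express intent, and whether the vectorizer uses 16-bit lanes is left to the optimizer. A minimal sketch, with a hypothetical helper name:

```cpp
#include <cassert>
#include <type_traits>

// After integer promotion, short * short has type int.
static_assert(std::is_same<decltype(short(1) * short(1)), int>::value,
              "short * short is computed as int");

// Hypothetical helper: the product from the loop, before it is narrowed to short.
int product_as_int(unsigned char a, unsigned char b) {
    int p = static_cast<short>(a) * static_cast<short>(b);
    assert(p >= 0 && p <= 255 * 255);  // never negative while it is still an int
    return p;
}
```

A negative value can only appear once a product above 32767 is narrowed to short, which is the case the temp < 0 test in the loop guards against.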

If the compiler does not vectorize the above satisfactorily, then consider using the intrinsics.
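A minimal sketch of what the intrinsics route could look like, not a tuned implementation. It assumes the result wanted is 255 - min(A*B, 255), so an overflowing product yields 0; note that the untested loop above stores 255 in that case, so pick whichever matches your intent. It uses 128-bit SSE2 (16 bytes per iteration), which every x86-64 CPU provides; an AVX2 version would use the 256-bit counterparts of the same operations. The helper name is hypothetical. There is no 8-bit multiply, so the bytes are widened to 16-bit lanes first.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2_SKETCH 1
#endif

// Hypothetical helper: c[i] = 255 - min(a[i] * b[i], 255).
// ASSUMPTION: the product saturates at 255 (not stated in the question).
void invert_saturated_product_simd(const unsigned char* a, const unsigned char* b,
                                   unsigned char* c, std::size_t n) {
    assert(n == 0 || (a && b && c));
    std::size_t i = 0;
#ifdef HAVE_SSE2_SKETCH
    const __m128i zero = _mm_setzero_si128();
    const __m128i max8 = _mm_set1_epi16(255);
    for (; i + 16 <= n; i += 16) {
        // Unaligned loads keep the sketch safe for any pointer.
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Zero-extend 16 bytes into two vectors of eight 16-bit lanes.
        __m128i a_lo = _mm_unpacklo_epi8(va, zero);
        __m128i a_hi = _mm_unpackhi_epi8(va, zero);
        __m128i b_lo = _mm_unpacklo_epi8(vb, zero);
        __m128i b_hi = _mm_unpackhi_epi8(vb, zero);
        // 255 * 255 = 65025 fits in an unsigned 16-bit lane.
        __m128i p_lo = _mm_mullo_epi16(a_lo, b_lo);
        __m128i p_hi = _mm_mullo_epi16(a_hi, b_hi);
        // Unsigned saturating subtract gives max(255 - product, 0).
        __m128i r_lo = _mm_subs_epu16(max8, p_lo);
        __m128i r_hi = _mm_subs_epu16(max8, p_hi);
        // Results are in 0..255, so narrowing back to bytes loses nothing.
        __m128i r = _mm_packus_epi16(r_lo, r_hi);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(c + i), r);
    }
#endif
    for (; i < n; ++i) {  // scalar tail, and the fallback when SSE2 is unavailable
        int product = a[i] * b[i];
        c[i] = static_cast<unsigned char>(product > 255 ? 0 : 255 - product);
    }
}
```

The saturating subtract replaces the separate min and subtract steps, so the min function from the question is not needed at all under this reading.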

Jim Dempsey