Intel® ISA Extensions

Looking for an efficient way to convert an aligned float (32-bit) buffer to an aligned short (16-bit) buffer

I wrote a C version and an AVX version of a routine that converts an aligned buffer of 1920*1280*3 elements from float to short.
The AVX implementation is 3 times slower than the C code.

Here is the AVX code for float2short:

for (int i = numOfElems; i; --i, pOut += 3, pIn1 += 24, pIn2 += 24, pIn3 += 24)
{
    // Convert 8 floats per vector to 32-bit integers (rounding follows the current MXCSR mode).
    __m256i intVec1 = _mm256_cvtps_epi32(_mm256_load_ps(pIn1));
    __m256i intVec2 = _mm256_cvtps_epi32(_mm256_load_ps(pIn2));
    __m256i intVec3 = _mm256_cvtps_epi32(_mm256_load_ps(pIn3));

    // Split each 256-bit vector into its 128-bit halves, then pack the
    // 2 x 4 int32 values into 8 int16 values with signed saturation.
    __m128i intVec1L = _mm256_extractf128_si256(intVec1, 0);
    __m128i intVec1H = _mm256_extractf128_si256(intVec1, 1);
    pOut[0] = _mm_packs_epi32(intVec1L, intVec1H);

    __m128i intVec2L = _mm256_extractf128_si256(intVec2, 0);
    __m128i intVec2H = _mm256_extractf128_si256(intVec2, 1);
    pOut[1] = _mm_packs_epi32(intVec2L, intVec2H);

    __m128i intVec3L = _mm256_extractf128_si256(intVec3, 0);
    __m128i intVec3H = _mm256_extractf128_si256(intVec3, 1);
    pOut[2] = _mm_packs_epi32(intVec3L, intVec3H);
}

As you can see, the main loop is unrolled to handle three vectors per iteration, which gives a factor of 3 speedup; without the unrolling, the C code is 9 times faster than the AVX code.
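For readers following along, here is a rough sketch of how the loop count relates to the buffer size. The post does not show how numOfElems is computed, so this assumes each iteration consumes 24 consecutive floats (three 8-float loads, with pIn2 and pIn3 offset by 8 and 16 floats from pIn1). The helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one loop iteration consumes 3 loads x 8 floats = 24 floats. */
enum { FLOATS_PER_ITER = 24 };

/* Buffer size stated in the post: 1920*1280*3 elements. */
static size_t total_floats(void)
{
    return (size_t)1920 * 1280 * 3;
}

/* Number of loop iterations; the posted loop has no tail handling,
   so the element count must be a multiple of 24. */
static size_t loop_iterations(size_t n_floats)
{
    assert(n_floats % FLOATS_PER_ITER == 0);
    return n_floats / FLOATS_PER_ITER;
}

/* Bytes read (float input) and written (short output) per full pass. */
static size_t input_bytes(size_t n_floats)  { return n_floats * sizeof(float); }
static size_t output_bytes(size_t n_floats) { return n_floats * sizeof(short); }
```

Under these assumptions the buffer holds 7,372,800 floats, so the loop would run 307,200 times.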

1 Reply
This looks like an algorithm a decent compiler can auto-vectorize. Have you checked whether it did?
I can't see a problem in the code you posted. I recommend you take a look at the generated asm; probably something stupid is happening somewhere.
Finally, I recommend using IACA (Intel Architecture Code Analyzer) to annotate the asm.
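To check the auto-vectorization point, a plain scalar loop like the one below could serve as a baseline. This is only a sketch: the original C code was not posted, so the function names and the clamp-then-cast behavior are assumptions. Note that the cast truncates toward zero, whereas _mm256_cvtps_epi32 rounds according to the current MXCSR mode, so results can differ by one for fractional inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the int16_t range, then truncate toward zero.
   A NaN fails the first comparison and therefore maps to -32768. */
static inline int16_t float_to_short_sat(float x)
{
    x = (x > -32768.0f) ? x : -32768.0f;
    x = (x < 32767.0f) ? x : 32767.0f;
    return (int16_t)x;
}

/* Hypothetical scalar baseline: restrict tells the compiler the buffers
   do not overlap, which makes the loop easier to vectorize. */
void float2short_scalar(const float *restrict in, int16_t *restrict out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = float_to_short_sat(in[i]);
}
```

With GCC, building this with something like `-O3 -mavx -fopt-info-vec` should report whether the loop was vectorized; comparing that asm against the asm of the intrinsics loop is one way to see where the two versions diverge.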