Looking for an efficient way to convert an aligned float (32-bit) buffer to an aligned short (16-bit) buffer

gilgil · 07-05-2011

I wrote C code and AVX code to convert an aligned buffer of size 1920*1280*3 from float to short.
The AVX implementation is 3 times slower than the C code.

Here is the AVX code for the float2short:

for (int i = numOfElems; i; --i, pOut += 3, pIn1 += 24, pIn2 += 24, pIn3 += 24)
{
    __m256i intVec1 = _mm256_cvtps_epi32(_mm256_load_ps(pIn1));
    __m256i intVec2 = _mm256_cvtps_epi32(_mm256_load_ps(pIn2));
    __m256i intVec3 = _mm256_cvtps_epi32(_mm256_load_ps(pIn3));

    __m128i intVec1L = _mm256_extractf128_si256(intVec1, 0);
    __m128i intVec1H = _mm256_extractf128_si256(intVec1, 1);
    pOut[0] = _mm_packs_epi32(intVec1L, intVec1H);

    __m128i intVec2L = _mm256_extractf128_si256(intVec2, 0);
    __m128i intVec2H = _mm256_extractf128_si256(intVec2, 1);
    pOut[1] = _mm_packs_epi32(intVec2L, intVec2H);

    __m128i intVec3L = _mm256_extractf128_si256(intVec3, 0);
    __m128i intVec3H = _mm256_extractf128_si256(intVec3, 1);
    pOut[2] = _mm_packs_epi32(intVec3L, intVec3H);
}

As you can see, the main loop is unrolled, which gives a factor-of-3 speedup (without unrolling, the C code is 9 times faster than the AVX code!).

Matthias_Kretz · 07-05-2011

This looks like an algorithm a decent compiler can auto-vectorize. Have you checked whether it did?
I can't see a problem in the code you posted. I recommend you take a look at the generated asm. Probably something stupid is happening somewhere.
Finally, I recommend running IACA (Intel Architecture Code Analyzer) on it to annotate the asm a bit.
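
For reference, here is a minimal sketch of the kind of scalar loop a compiler may be able to auto-vectorize. The original C code was not posted, so this is an assumption about its shape, not the poster's actual code: the function name float_to_short and its parameters are hypothetical. The restrict qualifiers tell the compiler the buffers do not overlap, and the clamp mimics the signed saturation that _mm_packs_epi32 performs. Note one difference from the AVX version: a C cast truncates toward zero, whereas _mm256_cvtps_epi32 rounds according to the current rounding mode (round-to-nearest by default). Inputs are assumed to be finite (no NaN).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): convert n floats to 16-bit
 * integers with signed saturation. Assumes finite inputs and
 * non-overlapping buffers. */
void float_to_short(const float *restrict in, int16_t *restrict out, size_t n)
{
    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n; ++i) {
        float v = in[i];

        /* Clamp to the int16_t range so the cast below is always defined. */
        if (v > 32767.0f)  v = 32767.0f;
        if (v < -32768.0f) v = -32768.0f;

        /* The cast truncates toward zero (unlike cvtps, which rounds). */
        out[i] = (int16_t)v;
    }
}
```

Whether the loop really gets vectorized depends on the compiler and its flags, so it is still worth checking the generated asm, as suggested above.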