The AVX implementation is 3 times slower than the C code.
Here is the AVX code for the float-to-short conversion (float2short):
for (int i = numOfElems; i; --i, pOut += 3, pIn1 += 24, pIn2 += 24, pIn3 += 24)
{
    // Convert 3 x 8 floats to 32-bit ints (rounding per the current MXCSR mode)
    __m256i intVec1 = _mm256_cvtps_epi32(_mm256_load_ps(pIn1));
    __m256i intVec2 = _mm256_cvtps_epi32(_mm256_load_ps(pIn2));
    __m256i intVec3 = _mm256_cvtps_epi32(_mm256_load_ps(pIn3));

    // Split each 256-bit vector into its low and high 128-bit halves,
    // then pack 2 x 4 int32 into 8 int16 with signed saturation
    __m128i intVec1L = _mm256_extractf128_si256(intVec1, 0);
    __m128i intVec1H = _mm256_extractf128_si256(intVec1, 1);
    pOut[0] = _mm_packs_epi32(intVec1L, intVec1H);

    __m128i intVec2L = _mm256_extractf128_si256(intVec2, 0);
    __m128i intVec2H = _mm256_extractf128_si256(intVec2, 1);
    pOut[1] = _mm_packs_epi32(intVec2L, intVec2H);

    __m128i intVec3L = _mm256_extractf128_si256(intVec3, 0);
    __m128i intVec3H = _mm256_extractf128_si256(intVec3, 1);
    pOut[2] = _mm_packs_epi32(intVec3L, intVec3H);
}
As you can see, the main loop is unrolled, which gives me a factor-of-3 speedup; without the unrolling, the C code is 9 times faster than the AVX version!
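For comparison, a scalar baseline with roughly the same semantics as the AVX loop (round to nearest even, then saturate to 16 bits) might look like the sketch below. The C code being timed is not shown in this thread, so the function names `float_to_short_sat` and `float2short_scalar` are hypothetical, and the handling of NaN is an assumption chosen to mirror what the convert-and-pack sequence produces. Inputs above the int32 range are clamped to 32767 here, whereas the AVX path would yield -32768 for them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: round to nearest even, saturate to int16. */
static int16_t float_to_short_sat(float x)
{
    if (!(x > -32768.0f)) return INT16_MIN;  /* x <= -32768 or NaN */
    if (x >= 32767.0f)    return INT16_MAX;

    int32_t t = (int32_t)x;          /* truncate toward zero */
    float frac = x - (float)t;       /* exact, since |x| < 32768 */

    /* Move one step away from zero when past the halfway point,
       or exactly at it with an odd truncated value (ties to even). */
    if (frac > 0.5f || (frac == 0.5f && (t & 1)))
        t++;
    else if (frac < -0.5f || (frac == -0.5f && (t & 1)))
        t--;

    return (int16_t)t;
}

/* Convert n floats to n saturated 16-bit integers, one element at a time. */
static void float2short_scalar(const float *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = float_to_short_sat(in[i]);
}
```

Running both versions over the same buffer and comparing the outputs is a cheap way to check that the two loops being timed actually do the same work.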
I can't see a problem in the code you posted. I recommend you take a look at the generated asm; probably something stupid is happening somewhere.
I can also recommend IACA (Intel Architecture Code Analyzer), which can annotate the asm for you.