Hi,
I have been doing some basic tests on the IPP signal-processing (ipps) functions, and have found some surprising results.
Summary: a call to ippsAnd_8u_I() appears to take the same execution time as naive looping code; better performance is obtained with calls to the AVX intrinsics.
I have appended some sample code. My timing results from this are:
And explicit, time: 12866755 us
And ipps, time: 12488325 us
And sse(128), time: 10131219 us
And avx(256), time: 5723750 us
so I was rather surprised to find that the ipp function was no better than the basic looping code, and also that my own rather crude usage of avx intrinsic calls was considerably better.
I am using the standalone version of IPP installed with Visual Studio 2015 (Windows 10), from w_ipp_2019.5.281, using a 64-bit release build on an i7-7700K.
Any comments, or can anyone explain what's going on?
Best regards,
Steve.
------------------------------------------------------------------------------------------------------------------------------------------------------------------
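#include <chrono>
#include <cstdint>
#include <iostream>
#include <malloc.h>      // _aligned_malloc, _aligned_free (MSVC)
#include <immintrin.h>   // SSE2 / AVX2 intrinsics
#include <ipps.h>        // ippsAnd_8u_I

using namespace std;
using namespace std::chrono;

void test()
{
    size_t nData = 100000000;
    size_t nRep = 1000;
    size_t align = 32;   // 32-byte alignment, needed by the aligned AVX loads
    uint8_t *pData = static_cast<uint8_t*>(_aligned_malloc(nData, align));
    // new (std::nothrow) uint8_t[nData];
    uint8_t *pDst = static_cast<uint8_t*>(_aligned_malloc(nData, align));
    // new (std::nothrow) uint8_t[nData];
    if (pData == nullptr || pDst == nullptr)
    {
        cerr << "could not allocate memory" << endl;
        return;
    }
    // explicit &=
    {
        time_point<steady_clock> start_time = steady_clock::now();
        for (size_t n = 0; n < nRep; ++n)
        {
            // element-wise in-place AND over the whole buffer
            for (size_t i = 0; i < nData; ++i)
                pDst[i] &= pData[i];
        }
        cout << "And explicit, time: " << duration_cast<microseconds>(steady_clock::now() - start_time).count()
             << " us " << endl;
    }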
    // ippsAnd_8u_I
    {
        time_point<steady_clock> start_time = steady_clock::now();
        for (size_t n = 0; n < nRep; ++n)
        {
            // the length parameter is an int, so convert explicitly
            ippsAnd_8u_I(pData, pDst, static_cast<int>(nData));
        }
        cout << "And ipps, time: " << duration_cast<microseconds>(steady_clock::now() - start_time).count()
             << " us " << endl;
    }
    // sse (128) version
    {
        time_point<steady_clock> start_time = steady_clock::now();
        for (size_t n = 0; n < nRep; ++n)
        {
            __m128i a, b, c;
            size_t m = nData / 16, offset;   // number of 16-byte blocks
            uint8_t *pT1 = nullptr, *pT2 = nullptr;
            for (size_t i = 0; i < m; ++i)
            {
                // index by the loop counter i so each iteration handles a new block
                offset = i * 16;
                pT1 = pData + offset;
                pT2 = pDst + offset;
                a = _mm_load_si128(reinterpret_cast<__m128i*>(pT1));
                b = _mm_load_si128(reinterpret_cast<__m128i*>(pT2));
                c = _mm_and_si128(a, b);
                _mm_store_si128(reinterpret_cast<__m128i*>(pT2), c);
            }
        }
        cout << "And sse(128), time: " << duration_cast<microseconds>(steady_clock::now() - start_time).count()
             << " us " << endl;
    }
    // avx (256) version
    {
        time_point<steady_clock> start_time = steady_clock::now();
        for (size_t n = 0; n < nRep; ++n)
        {
            __m256i a, b, c;
            size_t m = nData / 32, offset;   // number of 32-byte blocks
            uint8_t *pT1 = nullptr, *pT2 = nullptr;
            for (size_t i = 0; i < m; ++i)
            {
                // index by the loop counter i so each iteration handles a new block
                offset = i * 32;
                pT1 = pData + offset;
                pT2 = pDst + offset;
                a = _mm256_load_si256(reinterpret_cast<__m256i*>(pT1));
                b = _mm256_load_si256(reinterpret_cast<__m256i*>(pT2));
                c = _mm256_and_si256(a, b);
                _mm256_store_si256(reinterpret_cast<__m256i*>(pT2), c);
            }
        }
        cout << "And avx(256), time: " << duration_cast<microseconds>(steady_clock::now() - start_time).count()
             << " us " << endl;
    }
    _aligned_free(pData);
    _aligned_free(pDst);
}
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
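When several variants of the same loop are timed against each other, it can help to confirm first that they all compute the same result. The following is a minimal, portable sketch of such a check. It does not use IPP or intrinsics: an 8-byte uint64_t block stands in for a SIMD register, and all function names (and_scalar, and_blocked, variants_agree) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reference: byte-by-byte in-place AND, dst[i] &= src[i].
void and_scalar(const std::uint8_t* src, std::uint8_t* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] &= src[i];
}

// Blocked variant: 8 bytes per iteration, standing in for a SIMD register.
void and_blocked(const std::uint8_t* src, std::uint8_t* dst, std::size_t n)
{
    const std::size_t width = sizeof(std::uint64_t);
    const std::size_t blocks = n / width;
    for (std::size_t i = 0; i < blocks; ++i)
    {
        const std::size_t offset = i * width;  // advance with the loop counter
        std::uint64_t a, b;
        std::memcpy(&a, src + offset, width);
        std::memcpy(&b, dst + offset, width);
        b &= a;
        std::memcpy(dst + offset, &b, width);
    }
    // Tail bytes that do not fill a whole block.
    for (std::size_t i = blocks * width; i < n; ++i)
        dst[i] &= src[i];
}

// Returns true when both variants produce identical output for n bytes.
bool variants_agree(std::size_t n)
{
    std::vector<std::uint8_t> src(n), ref(n), out(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        src[i] = static_cast<std::uint8_t>(i * 37u + 11u);
        ref[i] = out[i] = static_cast<std::uint8_t>(i * 101u + 3u);
    }
    and_scalar(src.data(), ref.data(), n);
    and_blocked(src.data(), out.data(), n);
    return ref == out;
}
```

Calling variants_agree with a length that is not a multiple of the block width, such as 37, also exercises the tail loop. The same comparison could be applied to the SSE, AVX and ippsAnd_8u_I variants before timing them.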
It might be an optimization problem with this function. Could you add a call to ippsGetLibVersion and give us the output?
Hi,
Thanks for your reply; I have inserted the following in the test program:
const IppLibraryVersion *version = ippsGetLibVersion();
cout << "version->major = " << version->major << endl
     << "version->minor = " << version->minor << endl
     << "version->majorBuild = " << version->majorBuild << endl
     << "version->build = " << version->build << endl
     << "version->targetCpu = " << version->targetCpu << endl
     << "version->Name = " << version->Name << endl
     << "version->Version = " << version->Version << endl
     << "version->BuildDate = " << version->BuildDate << endl;
and the result is as follows:
version->major = 2019
version->minor = 0
version->majorBuild = 5
version->build = -916463777
version->targetCpu = l9
version->Name = ippSP AVX2 (l9)
version->Version = 2019.0.5 (r0xc95fdf5f)
version->BuildDate = Aug 12 2019
From this it looks as though the value of version->build is wrong, since the declaration of this struct in ippbase.h says:
int build; /* e.g. 10, always >= majorBuild */
This condition is not satisfied in my case. Perhaps there was an error in the build?
Steve.
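One observation on the output above: the negative build number has the same 32-bit pattern as the hexadecimal revision shown in the Version string (r0xc95fdf5f), just printed as a signed int. The sketch below only demonstrates that arithmetic. Whether the library intends build to hold a revision identifier is an assumption, and build_as_hex is a hypothetical helper, not an IPP function.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Reinterpret a signed 32-bit build number as unsigned and format it as hex.
std::string build_as_hex(std::int32_t build)
{
    char buf[16];
    const unsigned int bits =
        static_cast<unsigned int>(static_cast<std::uint32_t>(build));
    std::snprintf(buf, sizeof buf, "0x%08x", bits);
    return std::string(buf);
}
```

Passing the reported value, build_as_hex(-916463777), yields 0xc95fdf5f, which matches the revision in version->Version.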
The problem is reproduced on our side and will be escalated. I would also recommend submitting this issue to the Intel Online Service Center to get official support.
