Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Timing/performance problem

miller__steve1
Beginner

Hi,

I have been doing some basic tests on ipps functions, and have found some surprising results: 

Summary: a call to ippsAnd_8u_I() appears to take the same execution time as a naive loop; I get better performance by calling the AVX intrinsics directly.

I have appended some sample code. My timing results from this are: 

And explicit, time: 12866755 us
And ipps, time: 12488325 us
And sse(128), time: 10131219 us
And avx(256), time: 5723750 us

So I was rather surprised to find that the IPP function was no better than the basic looping code, and that my own rather crude use of the AVX intrinsics was considerably better.
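One way to put these timings in context is to convert them into throughput. The sketch below is a rough estimate, not a measurement: it assumes each repetition walks the whole buffer once, and that every element costs two 1-byte reads and one 1-byte write with no cache reuse between repetitions. The function names are hypothetical helpers, and the sizes (nData = 100000000, nRep = 1000) are taken from the sample code.

```cpp
#include <cstdint>

// Elements of pDst updated per second, given the total elapsed time in microseconds.
// Assumption: every repetition processes all nData elements exactly once.
double elementsPerSecond(std::uint64_t nData, std::uint64_t nRep, std::uint64_t elapsedUs)
{
    const double totalElements = static_cast<double>(nData) * static_cast<double>(nRep);
    return totalElements / (static_cast<double>(elapsedUs) * 1e-6);
}

// Estimated memory traffic in GB/s (1 GB = 1e9 bytes).
// Assumption: an in-place AND reads pData[i] and pDst[i] and writes pDst[i],
// so three bytes move per element.
double trafficGBps(std::uint64_t nData, std::uint64_t nRep, std::uint64_t elapsedUs)
{
    return 3.0 * elementsPerSecond(nData, nRep, elapsedUs) / 1e9;
}
```

Under these assumptions, trafficGBps(100000000, 1000, 12866755) gives roughly 23 GB/s for the explicit loop, and the ippsAnd_8u_I timing gives roughly 24 GB/s. If such a figure is close to what the machine's memory can sustain, the loop is limited by memory rather than by the instructions used.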

 

I am using the standalone version of IPP, installed from w_ipp_2019.5.281 with Visual Studio 2015 (Windows 10), in a 64-bit release build on an i7-7700K.

 

Any comments, or can anyone explain what's going on? 

 

Best regards,

 

Steve.

 

------------------------------------------------------------------------------------------------------------------------------------------------------------------

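#include <chrono>
#include <cstdint>
#include <iostream>
#include <immintrin.h>   // SSE2/AVX2 intrinsics
#include <malloc.h>      // _aligned_malloc, _aligned_free (MSVC)
#include <ipps.h>        // ippsAnd_8u_I

using namespace std;
using namespace std::chrono;

void test()
{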

    size_t nData = 100000000;
    size_t nRep = 1000;
    size_t align = 32;
    uint8_t *pData = static_cast<uint8_t*>(_aligned_malloc(nData, align)) ; 
                     // new (std::nothrow) uint8_t[nData];
    uint8_t *pDst = static_cast<uint8_t*>(_aligned_malloc(nData, align)); 
                    //new (std::nothrow) uint8_t[nData];
    if (pData == nullptr || pDst == nullptr)
    {
        cerr << "could not allocate memory" << endl;
        return;
    }

    // explicit &= 
    {
        time_point<steady_clock> start_time = steady_clock::now();
        for (size_t n = 0; n < nRep; ++n)
        {
            for (size_t i = 0; i < nData; ++i)
                pDst[i] &= pData[i];
        }
        cout << "And explicit, time: " << duration_cast<microseconds>(steady_clock::now() - start_time).count()
            << " us " << endl;
    }

    // ippsAnd_8u_I
    {
        time_point<steady_clock> start_time = steady_clock::now();
        for (size_t n = 0; n < nRep; ++n)
        {
            ippsAnd_8u_I(pData, pDst, nData);
        }
        cout << "And ipps, time: " << duration_cast<microseconds>(steady_clock::now() - start_time).count()
            << " us " << endl;
    }

    // sse (128) and version
    {
        time_point<steady_clock> start_time = steady_clock::now();
        for (size_t n = 0; n < nRep; ++n)
        {
            __m128i a, b, c;
            size_t m = nData / 16, offset;
            uint8_t *pT1 = nullptr, *pT2 = nullptr;
            for (size_t i = 0; i < m; ++i)
            {
                offset = i * 16;   // byte offset of the i-th 16-byte block
                pT1 = pData + offset;
                pT2 = pDst + offset;
                a = _mm_load_si128(reinterpret_cast<__m128i*>(pT1));
                b = _mm_load_si128(reinterpret_cast<__m128i*>(pT2));
                c = _mm_and_si128(a, b);
                _mm_store_si128(reinterpret_cast<__m128i*>(pT2), c);
            }
        }
        cout << "And sse(128), time: " << duration_cast<microseconds>(steady_clock::now() - start_time).count()
            << " us " << endl;
    }
    
    // avx (256) version 
    {
        time_point<steady_clock> start_time = steady_clock::now();
        for (size_t n = 0; n < nRep; ++n)
        {
            __m256i a, b, c;
            size_t m = nData / 32, offset;
            uint8_t *pT1 = nullptr, *pT2 = nullptr;
            for (size_t i = 0; i < m; ++i)
            {
                offset = i * 32;   // byte offset of the i-th 32-byte block
                pT1 = pData + offset;
                pT2 = pDst + offset;
                a = _mm256_load_si256(reinterpret_cast<__m256i*>(pT1));
                b = _mm256_load_si256(reinterpret_cast<__m256i*>(pT2));
                c = _mm256_and_si256(a, b);
                _mm256_store_si256(reinterpret_cast<__m256i*>(pT2), c);
            }
        }
        cout << "And avx(256), time: " << duration_cast<microseconds>(steady_clock::now() - start_time).count()
            << " us " << endl;
    }

    _aligned_free(pData);
    _aligned_free(pDst);

}

 

-------------------------------------------------------------------------------------------------------------------------------------------------------------------
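The timings are only comparable if all four variants compute the same result, and the sample never checks that. The sketch below shows one way to add such a check: it compares a blocked in-place AND against the byte-wise loop on a known input pattern. andBlocked is a portable 8-bytes-per-step stand-in for the SSE/AVX loops, not the intrinsic code itself, and all function names here are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reference: the same byte-wise in-place AND as the explicit loop.
void andNaive(const std::uint8_t* src, std::uint8_t* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] &= src[i];
}

// Blocked variant: 8 bytes per step, then a scalar loop for the tail.
// The block offset depends on the loop index i, not on the block count.
void andBlocked(const std::uint8_t* src, std::uint8_t* dst, std::size_t n)
{
    const std::size_t blocks = n / 8;
    for (std::size_t i = 0; i < blocks; ++i)
    {
        const std::size_t offset = i * 8;
        std::uint64_t a, b;
        std::memcpy(&a, src + offset, 8);
        std::memcpy(&b, dst + offset, 8);
        b &= a;
        std::memcpy(dst + offset, &b, 8);
    }
    for (std::size_t i = blocks * 8; i < n; ++i)
        dst[i] &= src[i];
}

// Returns true when both variants produce identical output for n elements.
bool variantsAgree(std::size_t n)
{
    std::vector<std::uint8_t> src(n), ref(n), out(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        // Deterministic, non-trivial fill so that a skipped block is detectable.
        src[i] = static_cast<std::uint8_t>(i * 31u + 7u);
        ref[i] = out[i] = static_cast<std::uint8_t>(i * 17u + 3u);
    }
    andNaive(src.data(), ref.data(), n);
    andBlocked(src.data(), out.data(), n);
    return ref == out;
}
```

Running the same comparison on a small buffer for each timed variant before the benchmark starts would catch an indexing mistake or an unhandled tail before any timing is taken.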

3 Replies
Gennady_F_Intel
Moderator

It might be an optimization problem with this function. Could you add a call to ippsGetLibVersion and give us the output?

miller__steve1
Beginner

Hi,

Thanks for your reply; I have inserted the following in the test program: 

 

const IppLibraryVersion * version = ippsGetLibVersion();
    cout << "version->major = " << version->major << endl
        << "version->minor = " << version->minor << endl
        << "version->majorBuild = " << version->majorBuild << endl
        << "version->build = " << version->build << endl
        << "version->targetCpu = " << version->targetCpu << endl
        << "version->Name = " << version->Name << endl
        << "version->Version = " << version->Version << endl 
        << "version->BuildDate = " << version->BuildDate << endl; 

 

and the result is as follows: 

 

version->major = 2019
version->minor = 0
version->majorBuild = 5
version->build = -916463777
version->targetCpu = l9
version->Name = ippSP AVX2 (l9)
version->Version = 2019.0.5 (r0xc95fdf5f)
version->BuildDate = Aug 12 2019

 

It looks from this that there is an error with the value for version->build, since in the declaration of this struct in ippbase.h, it says:

 

  int    build;                     /* e.g. 10, always >= majorBuild        */

 

so this condition is not satisfied in my case. Perhaps there has been an error with the build?
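The negative number may simply be a 32-bit revision value printed through a signed int: reinterpreted as unsigned, -916463777 is 0xc95fdf5f, which matches the r0xc95fdf5f shown in version->Version. This is an observation about the printed numbers only, not a statement about how IPP fills in the field. A minimal sketch of the conversion, where buildAsHex is a hypothetical helper and not part of IPP:

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Formats a signed 32-bit build value as the unsigned hex number with the same bit pattern.
std::string buildAsHex(int build)
{
    // Conversion to an unsigned type is defined modulo 2^32, so the bit pattern is preserved.
    const std::uint32_t bits = static_cast<std::uint32_t>(build);
    std::ostringstream oss;
    oss << "0x" << std::hex << std::setw(8) << std::setfill('0') << bits;
    return oss.str();
}
```

Calling buildAsHex(version->build) in the test program would print the value in the same form as the revision in the Version string.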

 

Steve. 

 

 

Gennady_F_Intel
Moderator

The problem is reproduced on our side and will be escalated. I would also recommend that you submit this issue to the Intel Online Service Center to get official support.
