AF
Beginner
118 Views

IPP - Performance Issue

Hi,

I'm using IPP, dynamically linked, with Clang 11.0.0.

Hardware:

  •   Processor Name: 6-Core Intel Core i5
  •   Processor Speed: 3 GHz
  •   Number of Processors: 1
  •   Total Number of Cores: 6
  •   L2 Cache (per Core): 256 KB
  •   L3 Cache: 9 MB

I might be missing something, but should this be slower than doing the same thing with simple std:: functions?

If yes, how can I make it faster?

 

#include <chrono>
#include <cstddef>
#include <iostream>

#include <ipp.h>

constexpr std::size_t kAmount = 12000;

struct Vect3DArray
{
    Ipp64f* x_;
    Ipp64f* y_;
    Ipp64f* z_;

    explicit Vect3DArray(int size)
    {
        // ippsMalloc_64f takes a length in elements, not bytes
        x_ = ippsMalloc_64f(size);
        y_ = ippsMalloc_64f(size);
        z_ = ippsMalloc_64f(size);
    }

    // memory from ippsMalloc_64f is released with ippsFree
    ~Vect3DArray() { ippsFree(x_); ippsFree(y_); ippsFree(z_); }
};

int main() {
    const int n = static_cast<int>(kAmount);
    Vect3DArray vectArray(n);
    Vect3DArray dstVectArray(n);
    Ipp64f* sums = ippsMalloc_64f(n);
    for (std::size_t i = 0; i < kAmount; ++i) {
        // fill element i; start values at i + 1 so the in-place
        // divisions below never divide by zero
        vectArray.x_[i] = (i + 1) * 2.5;
        vectArray.y_[i] = (i + 1) * 3.3;
        vectArray.z_[i] = (i + 1) * 4.7;
    }

    auto start = std::chrono::high_resolution_clock::now();

    ippsMul_64f(vectArray.x_, vectArray.x_, dstVectArray.x_, n);
    ippsMul_64f(vectArray.y_, vectArray.y_, dstVectArray.y_, n);
    ippsMul_64f(vectArray.z_, vectArray.z_, dstVectArray.z_, n);

    ippsAdd_64f(dstVectArray.x_, dstVectArray.y_, sums, n);
    ippsAdd_64f(sums, dstVectArray.z_, sums, n);  // z^2, not z; otherwise the third ippsMul is dead work
    ippsSqr_64f_I(sums, n);

    ippsDiv_64f_I(sums, vectArray.x_, n);
    ippsDiv_64f_I(sums, vectArray.y_, n);
    ippsDiv_64f_I(sums, vectArray.z_, n);

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "#" << duration << std::endl;

    ippsFree(sums);
}

 

4 Replies
Adriaan_van_Os
New Contributor I

What is the value of kAmount? Do the vectors fit in the L1 cache? If not, try to do the various operations on chunks that do fit in the L1 cache rather than on the whole vector.

Regards,

Adriaan van Os

AF
Beginner

Hi, thanks for the reply.

The vectors don't fit in L1, as kAmount was 12000 in my tests.

 

What kind of improvement can I expect to obtain in the best-case scenario?

 

Alexandre F.

Adriaan_van_Os
New Contributor I

Well, that depends on a lot of factors, the most important being cache usage. And it could be that the system (CPU) is doing some background work, etcetera.

Note that the first call to IPP may be "very slow" (like 1 millisecond) due to library initialization. So, keep that call out of the timing.

Based on limited tests I did, the speed improvement with Float32 is typically 3x (that number is probably better on a CPU with bigger vector registers, like AVX-512). With Float64 the speed improvement is disappointing (typically up to 50%, or at most 100%). In my limited tests, some ipps Float64 functions were slower than their vDSP (https://developer.apple.com/documentation/accelerate/vdsp?language=objc) counterparts. Again, that may be better on a CPU with bigger vector registers, like AVX-512.

In my opinion, with Float64, it pays more to make your code threaded (I mean explicitly with pthreads, not semi-automatically with OpenMP). But then it depends on how stupid (sorry) the thread synchronisation is. Use "lock-free" synchronisation, never critical sections; they spoil everything.

Regards,

Adriaan van Os

Adriaan_van_Os
New Contributor I

Also note that Clang has two built-in vectorizers (https://www.llvm.org/docs/Vectorizers.html) that you can turn on and off for comparison.
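For comparison builds, something like the following should work (a sketch; `bench.cpp` is a placeholder for your benchmark source, and the flags are Clang's loop- and SLP-vectorizer switches plus its optimization-remark options):

```shell
# Baseline: both vectorizers enabled (the default at -O2 and above)
clang++ -O2 -std=c++17 bench.cpp -o bench_vec

# Same code with the loop and SLP vectorizers switched off
clang++ -O2 -std=c++17 -fno-vectorize -fno-slp-vectorize bench.cpp -o bench_novec

# Ask Clang to report which loops it vectorized, and why it skipped others
clang++ -O2 -std=c++17 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c bench.cpp
```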

Regards,

Adriaan van Os