Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

IPP - Performance Issue

AF
Beginner
520 Views

Hi,

I'm using IPP, dynamically linked, with clang 11.0.0.

Hardware:

  •   Processor Name: 6-Core Intel Core i5
  •   Processor Speed: 3 GHz
  •   Number of Processors: 1
  •   Total Number of Cores: 6
  •   L2 Cache (per Core): 256 KB
  •   L3 Cache: 9 MB

I might be missing something, but should this be slower than the same computation written with plain std:: functions?

If yes, how can I make it faster?

 

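#include <chrono>
#include <cstddef>
#include <iostream>
#include <ipp.h>

struct Vect3DArray
{
    Ipp64f* x_;
    Ipp64f* y_;
    Ipp64f* z_;

    // ippsMalloc_64f takes an element count, not a byte count.
    explicit Vect3DArray(int size)
    {
        x_ = ippsMalloc_64f(size);
        y_ = ippsMalloc_64f(size);
        z_ = ippsMalloc_64f(size);
    }

    // Memory obtained from ippsMalloc_* is released with ippsFree.
    ~Vect3DArray() { ippsFree(x_); ippsFree(y_); ippsFree(z_); }
};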

int main() {
    Vect3DArray vectArray(kAmount);
    Vect3DArray dstVectArray(kAmount);
    Ipp64f* sums = ippsMalloc_64f(static_cast<int>(kAmount));
    for (std::size_t i = 1; i < kAmount; ++i) {
        vectArray.x_[i] = i * 2.5;
        vectArray.y_[i] = i * 3.3;
        vectArray.z_[i] = i * 4.7;
    }

    auto start = std::chrono::high_resolution_clock::now();

    ippsMul_64f(vectArray.x_, vectArray.x_, dstVectArray.x_, static_cast<int>(kAmount));
    ippsMul_64f(vectArray.y_, vectArray.y_, dstVectArray.y_, static_cast<int>(kAmount));
    ippsMul_64f(vectArray.z_, vectArray.z_, dstVectArray.z_, static_cast<int>(kAmount));

    ippsAdd_64f(dstVectArray.x_, dstVectArray.y_, sums, kAmount);
    ippsAdd_64f(sums, vectArray.z_, sums, kAmount);
    ippsSqr_64f_I(sums, kAmount);

    ippsDiv_64f_I(sums, vectArray.x_, kAmount);
    ippsDiv_64f_I(sums, vectArray.y_, kAmount);
    ippsDiv_64f_I(sums, vectArray.z_, kAmount);

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "#" << duration << std::endl;

    ippsFree(sums);
}

 

4 Replies
Adriaan_van_Os
New Contributor I

What is the value of kAmount? Do the vectors fit in L1 cache? If not, try to do the various operations on chunks that do fit in L1 cache rather than on the whole vector.
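
The chunking idea can be sketched roughly as follows. This is only an illustration under stated assumptions: plain loops stand in for the IPP calls so the sketch compiles without the library, the goal is assumed to be the Euclidean norm of each vector, and `blockedNorm` and the block size `kBlock` are hypothetical names and values, not anything from IPP.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical block size: 1024 doubles = 8 KB per array, so a handful of
// arrays fit in a typical 32 KB L1 data cache. Tune it by measurement.
constexpr std::size_t kBlock = 1024;

// Computes norm[i] = sqrt(x[i]^2 + y[i]^2 + z[i]^2) one block at a time.
// Each inner loop stands in for one IPP call (ippsMul_64f, ippsAdd_64f, ...)
// applied to the sub-range [begin, end) instead of the whole array.
void blockedNorm(const std::vector<double>& x, const std::vector<double>& y,
                 const std::vector<double>& z, std::vector<double>& norm)
{
    assert(x.size() == y.size() && y.size() == z.size());
    norm.resize(x.size());
    double tmp[kBlock];  // scratch buffer reused by every block
    for (std::size_t begin = 0; begin < x.size(); begin += kBlock) {
        const std::size_t end = std::min(begin + kBlock, x.size());
        for (std::size_t i = begin; i < end; ++i) tmp[i - begin] = x[i] * x[i];
        for (std::size_t i = begin; i < end; ++i) tmp[i - begin] += y[i] * y[i];
        for (std::size_t i = begin; i < end; ++i) tmp[i - begin] += z[i] * z[i];
        for (std::size_t i = begin; i < end; ++i) norm[i] = std::sqrt(tmp[i - begin]);
    }
}
```

With the IPP functions, the same structure would pass `ptr + begin` and a length of `end - begin` to each call, so that every pass over a block finds its data still in cache.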

Regards,

Adriaan van Os

AF
Beginner

Hi, thanks for the reply.

The vectors don't fit in L1, as kAmount was 12000 in my tests.

 

What kind of improvement can I expect in the best-case scenario?

 

Alexandre F.

Adriaan_van_Os
New Contributor I

Well, that depends on a lot of factors, most importantly cache usage. It could also be that the system (CPU) is doing some background work, etcetera.

Note that the first call to IPP may be "very slow" (like 1 millisecond) due to library initialization. So, keep that call out of the timing.
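
One way to keep the first call out of the timing is sketched below. It is a minimal illustration, not a benchmarking framework: `timeBest` is a hypothetical helper, and taking the fastest of several runs is an assumption about how to reduce noise from background work.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Runs `work` once untimed (warm-up), then `repeats` times timed, and
// returns the fastest timed run in microseconds.
template <typename F>
std::int64_t timeBest(F&& work, int repeats)
{
    assert(repeats > 0);
    work();  // warm-up: absorbs one-time costs such as library initialization
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto end = std::chrono::steady_clock::now();
        const auto us =
            std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        best = std::min<std::int64_t>(best, us);
    }
    return best;
}
```

The IPP sequence from the question would go inside the callable, for example `timeBest([&] { /* ippsMul_64f(...) and the other calls */ }, 100)`.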

Based on the limited tests I did, the speed improvement with Float32 is typically 3x (that number is probably better on a CPU with bigger vector registers, such as one with AVX-512). With Float64 the speed improvement is disappointing (typically up to 50%, or at most 100%). In my limited tests, some ipps Float64 functions were slower than their vDSP (https://developer.apple.com/documentation/accelerate/vdsp?language=objc) counterparts. Again, that may be better on a CPU with bigger vector registers, such as one with AVX-512.

In my opinion, with Float64 it pays more to make your code threaded (I mean explicitly with pthreads, not semi-automatically with OpenMP). But then it depends on how stupid (sorry) the thread synchronisation is. Use "lock-free" synchronisation, never critical sections; they spoil everything.
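
A rough sketch of explicit threading without locks is given below. It uses `std::thread` rather than raw pthreads (a substitution made here for portability), and `parallelSquare` is a hypothetical helper. Each worker writes only its own slice, so the final `join()` is the only synchronisation. For arrays as small as 12000 elements, thread start-up may well cost more than the work itself, so this is worth measuring before adopting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Squares src into dst using up to `numThreads` workers. The slices are
// disjoint, so no mutexes or atomics are needed.
void parallelSquare(const std::vector<double>& src, std::vector<double>& dst,
                    unsigned numThreads)
{
    assert(numThreads > 0);
    dst.resize(src.size());
    const std::size_t chunk = (src.size() + numThreads - 1) / numThreads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t begin = std::min(src.size(), t * chunk);
        const std::size_t end = std::min(src.size(), begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back([&src, &dst, begin, end] {
            // A per-slice IPP call such as ippsMul_64f could replace this loop.
            for (std::size_t i = begin; i < end; ++i) dst[i] = src[i] * src[i];
        });
    }
    for (auto& w : workers) w.join();  // single synchronisation point
}
```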

Regards,

Adriaan van Os

Adriaan_van_Os
New Contributor I

Also note that Clang has two built-in vectorizers (https://www.llvm.org/docs/Vectorizers.html), which you can switch on and off for comparison.
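
For such a comparison, a plain C++ baseline might look like the sketch below. It assumes the goal is the Euclidean norm of each vector; `normBaseline` and the file name `norm.cpp` are hypothetical. The flags in the comments are the ones described in the LLVM vectorizer documentation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Build twice with Clang and compare the timings:
//   clang++ -O2 norm.cpp                                    (vectorizers on)
//   clang++ -O2 -fno-vectorize -fno-slp-vectorize norm.cpp  (vectorizers off)
// Adding -Rpass=loop-vectorize reports which loops were vectorized.
void normBaseline(const double* x, const double* y, const double* z,
                  double* out, std::size_t n)
{
    assert(x && y && z && out);
    // Per-loop hint for Clang; the guard keeps other compilers quiet.
#if defined(__clang__)
#pragma clang loop vectorize(enable)
#endif
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::sqrt(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
    }
}
```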

Regards,

Adriaan van Os

 
