Multi Threading Performance in Multiplication of 2 Arrays / Images - Intel IPP

Royi · 05-02-2016

I'm using Intel IPP for multiplication of 2 Images (Arrays).
I'm using Intel IPP 8.2 which comes with Intel Composer 2015 Update 6.

I created a simple function to multiply two large images (the whole project is attached, see below).
I wanted to see the gains using Intel IPP Multi Threaded Library.

Here is the simple project (I also attached the complete Visual Studio project):

#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include <ctime>
#include <iostream>

using namespace std;

const int height = 6000;
const int width  = 6000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};

int main()
{
    IppiSize size = {width, height};

    double start = clock();

    for (int i = 0; i < 200; i++)
        ippiMul_32f_C1R(mInput_image, 6000 * 4, mInput_image, 6000 * 4, mOutput_image, 6000 * 4, size); 

    double end = clock();
    double duration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);

    cout << duration << endl;
    cin.get();

    return 0;
}

I compiled this project once using Intel IPP Single Threaded and once using Intel IPP Multi Threaded.

I tried different array sizes, and in all of them the Multi Threaded version yields no gain (sometimes it is even slower).

I wonder why there is no gain from multi threading in this task.
I know Intel IPP uses AVX, so I thought maybe the task becomes memory bound?

I tried another approach: using OpenMP manually to run the Intel IPP Single Threaded implementation on several threads.
This is the code:

#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include <ctime>
#include <iostream>

using namespace std;

#include <omp.h>

const int height = 5000;
const int width  = 5000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};

int main()
{
    IppiSize size = {width, height};

    double start = clock();

    IppiSize blockSize = {width, height / 4};

    const int NUM_BLOCK = 4;
    omp_set_num_threads(NUM_BLOCK);

    Ipp32f*  in;
    Ipp32f*  out;

    //  ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);

    #pragma omp parallel            \
    shared(mInput_image, mOutput_image, blockSize) \
    private(in, out)
    {
        int id   = omp_get_thread_num();
        int step = blockSize.width * blockSize.height * id;
        in       = mInput_image  + step;
        out      = mOutput_image + step;
        ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, blockSize);
    }

    double end = clock();
    double duration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);

    cout << duration << endl;
    cin.get();

    return 0;
}

The results were the same: again, no performance gain.

Is there a way to benefit from multi threading in this kind of task?
How can I verify whether a task becomes memory bound, so that there is no benefit in parallelizing it? Is there any benefit to parallelizing the multiplication of 2 arrays on a CPU with AVX?

The computer I tried it on is based on a Core i7 4770K (Haswell).

Here is a link to the Project in Visual Studio 2013.

Thank You.

Royi · 05-03-2016

Anyone?

By the way, it happens with the boxFilter implementation as well.
What's the point of the Multi Threaded implementation if there are no gains?

Thank You.

Jonghak_K_Intel · 05-08-2016

Hi Royi,

I assume you have read that IPP's internal multithreading has been deprecated.

Your OpenMP example works fine on my machine and shows the benefit of using multiple threads. (I changed it a bit to match your first example: a 'for' loop was added for 200 iterations.)

You can change "NUM_BLOCK = 4;" to see how the number of threads affects the results.

#include "stdafx.h"
#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include <ctime>
#include <iostream>

using namespace std;

#include <omp.h>

const int height = 5000;
const int width = 5000;
Ipp32f mInput_image[1 * width * height];
Ipp32f mOutput_image[1 * width * height] = { 0 };

int main()
{
    IppiSize size = { width, height };

    double start = clock();

    const int NUM_BLOCK = 4;

    IppiSize blockSize = { width, height / NUM_BLOCK };

    omp_set_num_threads(NUM_BLOCK);

    Ipp32f*  in;
    Ipp32f*  out;

    //  ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);
    int i;
    for (i = 0; i < 200; i++) {
        // Each thread multiplies its own horizontal strip of the image.
        #pragma omp parallel            \
        shared(mInput_image, mOutput_image, blockSize) \
        private(in, out)
        {
            int id = omp_get_thread_num();
            int step = blockSize.width * blockSize.height * id;
            in = mInput_image + step;
            out = mOutput_image + step;
            ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, blockSize);
        }
    }

    double end = clock();
    double duration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);

    cout << duration << endl;
    cin.get();

    return 0;
}
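
When NUM_BLOCK is changed, note that `height / NUM_BLOCK` truncates: with 5000 rows and 3 blocks, 2 rows would be left unprocessed. The listing also assumes that the runtime really provides NUM_BLOCK threads. The following is a minimal sketch of a split that covers every row and loops over blocks instead of thread ids. The names `RowBlock`, `rowsForBlock` and `squareInBlocks` are hypothetical, and the plain inner loop only stands in for the `ippiMul_32f_C1R` call so that the example has no IPP dependency.

```cpp
#include <cstddef>

// Hypothetical type: a contiguous range of image rows.
struct RowBlock {
    int firstRow;
    int rowCount;
};

// Split `height` rows into `numBlocks` contiguous blocks. The first
// `height % numBlocks` blocks get one extra row, so no row is dropped.
RowBlock rowsForBlock(int height, int numBlocks, int id)
{
    const int base  = height / numBlocks;
    const int extra = height % numBlocks;
    const int first = id * base + (id < extra ? id : extra);
    const int count = base + (id < extra ? 1 : 0);
    return {first, count};
}

// Square `src` into `dst`, one block per loop iteration. With OpenMP
// enabled the blocks are distributed over the available threads;
// without it the pragma is ignored and the loop runs serially.
void squareInBlocks(const float* src, float* dst,
                    int width, int height, int numBlocks)
{
    #pragma omp parallel for
    for (int b = 0; b < numBlocks; ++b) {
        const RowBlock rb = rowsForBlock(height, numBlocks, b);
        const std::size_t offset = static_cast<std::size_t>(rb.firstRow) * width;
        const std::size_t count  = static_cast<std::size_t>(rb.rowCount) * width;
        // Stand-in for the per-block IPP multiply.
        for (std::size_t i = 0; i < count; ++i)
            dst[offset + i] = src[offset + i] * src[offset + i];
    }
}
```

Because the loop is over blocks, every block is processed even if fewer threads than blocks are available.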

Royi · 05-09-2016

Hi Jon,

Yes, we are aware that the Multi Threaded libraries are deprecated.
I wish you would reverse that decision; many of us need low level Multi Threaded functions.

What run time gain do you get?
Are you on a fast desktop CPU?
Could you try the same using filterBox?

On our Haswell we see zero gain.
We'll try your code again to verify.

Thank You.

Jonghak_K_Intel · 05-09-2016

Hi Royi,

With "NUM_BLOCK = 1" I get about 5.5xx.

With "NUM_BLOCK = 2" it is 2.6xx.

With "NUM_BLOCK = 4" it is 1.3xx.

It is a Haswell with 4 logical CPUs.

Could you let me know what filterBox is?

Royi · 05-09-2016

Hi Jon,

First, what's the image size you're using?
Could you use 4000 x 4000 or bigger?

Moreover, I don't understand your gains.
Is it the fastest for "NUM_BLOCK = 1"?
This means the slowest is "NUM_BLOCK = 4"?

The box filter is Intel IPP's "FilterBox" function - https://software.intel.com/en-us/node/504124.

Thank You.

Jonghak_K_Intel · 05-09-2016

Hi Royi,

Sorry if I made it confusing; the numbers I posted were the outputs of your example.

So I meant that with NUM_BLOCK = 1 I got about 5.5xx as the output, and with NUM_BLOCK = 4 I got about 1.3xx.

This means NUM_BLOCK = 4 showed the fastest performance.

I also used a 5000 x 5000 image in 200 loops, as written in the code above.

Thanks !

Royi · 05-09-2016

Hi Jon,

Could you try the code in its original form?
I suspect the loop makes the data available in cache.

Could you try without the loop you added?

Thank You.

Jonghak_K_Intel · 05-09-2016

I don't believe the cache matters here if multithreading is what you are worried about.

If the cache behaves the same in all of those experiments, fetching data from the cache wouldn't make any difference.
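
For scale, the working set can be compared with the cache size by simple arithmetic. The sketch below assumes an 8 MB last-level cache, as on the Core i7 4770K mentioned earlier in the thread; the cache size of the machine used for these measurements is not stated, and the function names are hypothetical.

```cpp
#include <cstddef>

// Bytes touched per pass: one input image plus one output image of
// 32-bit floats (both sources are the same buffer in the example).
constexpr std::size_t workingSetBytes(std::size_t width, std::size_t height)
{
    return 2 * width * height * sizeof(float);
}

// True if the whole working set could stay resident in a cache of
// `cacheBytes` bytes (ignoring everything else competing for it).
constexpr bool fitsInCache(std::size_t workingSet, std::size_t cacheBytes)
{
    return workingSet <= cacheBytes;
}

// Assumption: 8 MB last-level cache.
constexpr std::size_t kAssumedCacheBytes = 8u * 1024u * 1024u;
```

For a 5000 x 5000 image, `workingSetBytes` gives 200,000,000 bytes, far more than the assumed 8 MB, so under that assumption repeating the loop would not keep the whole image in cache between iterations.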

I did test with 'for (i = 0; i < 1; i++)', and the results were:

with 4 threads -> 0.016, with 2 -> 0.031, with 1 -> 0.063.

Did you try it yourself?