OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Some benchmarking experience with Phi: OpenCL vs OpenMP 2D Image Convolution

David_O_
Innovator
993 Views

 


The performance of Xeon Phi benchmarked with 2D convolution in OpenCL seems much better than an OpenMP implementation, even with compiler-enabled vectorization. The OpenMP version was run in Phi native mode, and timing measured only the computation part (the for-loop). For the OpenCL implementation, timing likewise covered only kernel computation: no data transfer included. The OpenMP version was tested with 2, 4, 60, 120, and 240 threads; 240 threads gave the best performance with a balanced thread-affinity setting. But OpenCL was around 17x faster than even the 240-thread OpenMP baseline with pragma-enabled vectorization in the source code. Input image sizes ranged from 1024x1024 up to 16384x16384, and filter sizes from 3x3 up to 17x17. In all runs, OpenCL was better than OpenMP. Is this an expected speedup for OpenCL? Seems too good to be true.

0 Kudos
7 Replies
David_O_
Innovator

EDIT:

Compilation (openmp)

icc Convolve.cpp -fopenmp -mmic -O3 -vec-report1 -o conv.mic
Convolve.cpp(24): (col. 17) remark: LOOP WAS VECTORIZED


Source (Convolve.cpp):

   

void Convolution_Threaded(float * pInput, float * pFilter, float * pOutput,
              const int nInWidth, const int nWidth, const int nHeight,
              const int nFilterWidth, const int nNumThreads)
    {
        #pragma omp parallel for num_threads(nNumThreads)
        for (int yOut = 0; yOut < nHeight; yOut++)
        {
            const int yInTopLeft = yOut;
    
            for (int xOut = 0; xOut < nWidth; xOut++)
            {
                const int xInTopLeft = xOut;
    
                float sum = 0;
                for (int r = 0; r < nFilterWidth; r++)
                {
                    const int idxFtmp = r * nFilterWidth;
    
                    const int yIn = yInTopLeft + r;
                    const int idxIntmp = yIn * nInWidth + xInTopLeft;
    
                    #pragma ivdep          // tell the compiler to ignore assumed data dependencies
                    #pragma vector aligned // assert that all data accessed in the loop is properly aligned
                    for (int c = 0; c < nFilterWidth; c++)
                    {
                        const int idxF  = idxFtmp  + c;
                        const int idxIn = idxIntmp + c;    
                        sum += pFilter[idxF]*pInput[idxIn];
                    }
                }
    
                const int idxOut = yOut * nWidth + xOut;
                pOutput[idxOut] = sum;
            }
        }
    }


Result of OpenMP (in comparison with OpenCL):

              image        filter   exec time (ms)
    OpenMP    2048x2048    3x3      23.4
    OpenCL    2048x2048    3x3       1.04*

*Raw kernel execution time. Data transfer time over the PCI bus is not included.

Dmitry_K_Intel
Employee
Hi David,

OpenMP: was it the first executed pragma omp parallel in your program? If so, it may include worker-creation time. Run some simple omp parallel for with the same number of workers before the measurement, to ensure the workers already exist.

OpenCL: how did you measure? Try the host time-difference method, with the NDRange of interest surrounded with clFinish.
Dave_O_
Beginner

  **Sorry for the authoring error on my side. Please see the comment that follows.

Dave_O_
Beginner

"OpenMP: was it the first executed pragma omp parallel in your program? If so, it may include worker-creation time. Run some simple omp parallel for with the same number of workers before the measurement, to ensure the workers already exist."

     -> The timing starts before the omp pragma, so it includes worker creation. The entire loop nest is wrapped in a function, and the timing surrounds the call:

        startTiming;
        Conv();
        stopTiming;

void Conv(float * pInput, float * pFilter, float * pOutput,
          const int nInWidth, const int nWidth, const int nHeight,
          const int nFilterWidth, const int nNumThreads)
{
    #pragma omp parallel for num_threads(nNumThreads)
    for (int yOut = 0; yOut < nHeight; yOut++)
    {
        const int yInTopLeft = yOut;

        for (int xOut = 0; xOut < nWidth; xOut++)
        {
            const int xInTopLeft = xOut;

            float sum = 0;
            for (int r = 0; r < nFilterWidth; r++)
            {
                const int idxFtmp = r * nFilterWidth;

                const int yIn = yInTopLeft + r;
                const int idxIntmp = yIn * nInWidth + xInTopLeft;

                #pragma ivdep          // tell the compiler to ignore assumed data dependencies
                #pragma vector aligned // assert that all data accessed in the loop is properly aligned
                for (int c = 0; c < nFilterWidth; c++)
                {
                    const int idxF  = idxFtmp  + c;
                    const int idxIn = idxIntmp + c;    
                    sum += pFilter[idxF]*pInput[idxIn];
                }
            } 

            const int idxOut = yOut * nWidth + xOut;
            pOutput[idxOut] = sum;
        } 
    } 
}

 

"OpenCL: how did you measure? Try the host time-difference method, with the NDRange of interest surrounded with clFinish."

     -> Used host-side timing:

        startTiming;
        clEnqueueNDRangeKernel();
        clFinish();
        stopTiming;

Dave_O_
Beginner

PS: the number of iterations (runs) is set high enough, e.g. 25 or 50, and the timings are averaged to get the execution time. This should take care of the thread warm-up that occurs in the first iteration.

Dmitry_K_Intel
Employee
Do you run OpenMP and OpenCL in the same process? It is known that they hurt each other. But you are running OpenMP natively on the Phi, so you cannot be mixing them... I'm giving up :).
Dave_O_
Beginner

Lol! OK. The Phi is indeed running in native mode, and does not mix with the separate OpenCL process. :)

Thanks
 
