OpenCL* for CPU

optimize kernel for vector addition

Fu_J_Intel
Employee
688 Views

I have two float vectors, each with 1024*1024*8 elements, and I want to add them. My first kernel, vec_add_1(), runs with Gx=1024*1024*8 and Lx=0.

__kernel void vec_add_1(__global const float* in1, __global const float* in2, __global float* out) {

    int i=get_global_id(0);

    out[i]=in1[i]+in2[i];

}

Kernel vec_add_1() takes about 10 ms.

To reduce scheduling overhead, I created a second kernel, vec_add_2(), in which each work-item adds four elements. vec_add_2() runs with Gx=1024*1024*8/4 and Lx=0.

__kernel void vec_add_2(__global const float* in1, __global const float* in2, __global float* out) {

    int i=get_global_id(0);

    int j=(i<<2);

    out[j]=in1[j]+in2[j];
    out[j+1]=in1[j+1]+in2[j+1];
    out[j+2]=in1[j+2]+in2[j+2];
    out[j+3]=in1[j+3]+in2[j+3];
}

However, I got two quite different results:

- Run in a Code Builder session, vec_add_2() takes ~13 ms, which is slower than vec_add_1().

- Run from my own host code, vec_add_2() takes ~7 ms, which is faster than vec_add_1().

So my questions are:

- Why does vec_add_2() give such different timings with and without a Code Builder session?

- Is vec_add_2() actually better optimized than vec_add_1()?

 

thanks,

Jeffrey

3 Replies
Robert_I_Intel
Employee

Hi Jeffrey,

I think it would be a good idea to look at my presentations on Optimizing Simple OpenCL Kernels here: https://software.intel.com/en-us/articles/optimizing-simple-opencl-kernels. The code is also available there to download and experiment with. Basically, your current approach is not optimal; you are better off switching to the float4 data type.

Fu_J_Intel
Employee

Hi Robert,

I changed my kernel to use float4 as below

__kernel void vec_add_3(__global const float4* in1, __global const float4* in2, __global float4* out)
{
    int i=get_global_id(0);

    out[i]=in1[i]+in2[i];
}

Gx is correspondingly changed to one quarter of the original vector size (the original size is 1024*1024*8, so Gx for vec_add_3() is 1024*1024*8/4). Then I ran "run analysis" in a Code Builder session. Surprisingly, the execution time is still ~10 ms, with almost no improvement.
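As a quick check of the arithmetic above, here is a minimal C sketch of how Gx relates to the number of floats each work-item handles. The helper name global_size is hypothetical and not part of any OpenCL API; it only computes the value that would be passed as the global work size when the kernel is enqueued. It assumes the vector length must divide evenly, because the kernels in this thread do no bounds checking.

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API): number of work-items (Gx) needed
   when each work-item handles `floats_per_item` floats.
   Returns 0 when n_floats is not an exact multiple, since the kernels in
   this thread have no bounds check for a partial tail. */
static size_t global_size(size_t n_floats, size_t floats_per_item)
{
    if (floats_per_item == 0 || n_floats % floats_per_item != 0)
        return 0;
    return n_floats / floats_per_item;
}
```

For the vectors here, global_size(1024*1024*8, 4) gives 2097152 work-items for the float4 kernel, compared with 8388608 for the scalar vec_add_1().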

Robert_I_Intel
Employee

Hi Jeffrey,

A couple more things to try:

1. Try to combine what you are doing in vec_add_2 with float4.

2. Try using float8 or even float16.

3. Try a combination of vec_add_2 and float4 and/or float8.

The basic problem is that for simple kernels like this you need to pack much more compute onto each hardware thread.
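One way the combination in suggestions 1 and 3 might look is sketched below: a float4 kernel in which each work-item handles four consecutive float4 values (16 floats), so Gx would be the vector length divided by 16. The kernel name vec_add_4x4 and the factor of four are assumptions for illustration, and no timing is claimed for it. The #ifndef block is only a host-side shim (a GCC/Clang vector extension plus a stand-in get_global_id) so the same source can be compiled as plain C for a correctness check; an OpenCL compiler defines __OPENCL_VERSION__ and skips it.

```c
/* Host-only shim (an assumption, for testing): lets this file also compile
   as plain C with GCC or Clang. Skipped entirely by an OpenCL compiler. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
typedef float float4 __attribute__((vector_size(16)));
static size_t fake_global_id;            /* set by the host-side check */
static size_t get_global_id(unsigned dim) { (void)dim; return fake_global_id; }
#endif

/* Hypothetical kernel: each work-item adds 4 consecutive float4 values
   (16 floats), so Gx = number_of_floats / 16. No bounds check is done,
   so the vector length is assumed to be a multiple of 16. */
__kernel void vec_add_4x4(__global const float4* in1,
                          __global const float4* in2,
                          __global float4* out)
{
    size_t j = get_global_id(0) * 4;     /* first float4 of this work-item */

    out[j]     = in1[j]     + in2[j];
    out[j + 1] = in1[j + 1] + in2[j + 1];
    out[j + 2] = in1[j + 2] + in2[j + 2];
    out[j + 3] = in1[j + 3] + in2[j + 3];
}
```

For the 1024*1024*8-element vectors in this thread, that would mean Gx = 1024*1024*8/16 = 524288. Swapping float4 for float8 would follow the same pattern with the divisor doubled.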
