topic gemm sample program is black out in OpenCL* for CPU

naoki_o_ — Sun, 15 Mar 2015 02:15:49 GMT

I tried the General Matrix Multiply (GEMM) sample (https://software.intel.com/en-us/articles/gemm) on my PC.

But if the matrix size is larger than 2048 x 2048, the program stops and the screen blacks out.

The screen comes back after a few seconds.

I think the program stops at clEnqueueNDRangeKernel.

According to my calculations, there is no problem with the memory size, work-item size, work-group size, etc.
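The memory part of that calculation can be checked with a few lines of C. This is only a sketch: it assumes the sample allocates three square single-precision matrices (A, B and C) and nothing else of significant size, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total bytes for three n x n float matrices
 * (A, B and C). Assumes no other large allocations. */
static uint64_t gemm_buffer_bytes(uint64_t n)
{
    assert(n <= (1u << 30));  /* keeps 3 * n * n * 4 within 64 bits */
    return 3u * n * n * (uint64_t)sizeof(float);
}

/* Convert a byte count to MiB for easier reading. */
static double bytes_to_mib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

Under these assumptions, 2048 x 2048 needs 48 MiB in total and 3968 x 3968 needs about 180 MiB, which is well within 4 GB. Each individual buffer also has to fit the device limit reported as CL_DEVICE_MAX_MEM_ALLOC_SIZE.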

PC specifications:

Core i5-4440

Memory: 4 GB

Windows 7 64-bit

Why can't I increase the matrix size?

Robert_I_Intel — Mon, 16 Mar 2015 17:03:55 GMT

Hi Naoki,

Could you please specify the following:

1. The version of the graphics driver you are using

2. The exact parameters you pass to the program when you experience the blackout

Thanks!

Lingyi_K_Intel — Tue, 17 Mar 2015 02:28:39 GMT

Actually, our GEMM sample is not fully optimized for our GPU. The nn/nt kernels and the default tiling parameters are CPU-optimized, so running the default matrix size (3968x3968) with the default tiling parameters is very slow on the GPU.

Try running kernel nn with --tile-size-N 4. It reduces private memory usage and improves performance on the GPU.

Thanks, Lingyi

naoki_o_ — Tue, 17 Mar 2015 04:45:13 GMT

The graphics driver version is 10.18.10.3907.

I used the default parameters and changed only the matrix size.

I tried running kernel nn with --tile-size-N 4, but I got the same result. I think there is enough memory.

Robert_I_Intel — Tue, 17 Mar 2015 15:52:06 GMT

Dear Naoki,

It appears your driver version is fairly outdated. Please update to 10.18.10.4061 or above: https://downloadcenter.intel.com/search?keyword=3rd+generation+intel+processors

Let me know whether that worked for you.

naoki_o_ — Wed, 18 Mar 2015 02:31:25 GMT

Thank you for pointing it out.

My GPU is HD Graphics 4600, so I installed version 10.18.10.4156:

https://downloadcenter.intel.com/download/24785/Intel-Iris-and-HD-Graphics-Driver-for-Windows-7-8-8-1-64-bit

But it did not help.

Lingyi_K_Intel — Wed, 18 Mar 2015 06:09:47 GMT

On my HD 4400 GPU, 3968x3968 takes over 26 sec to finish with the default tile-size-N = 128. During the execution, the system seems to be "hanging".

With tile-size-N = 4, it takes 9 sec to finish. I'm not sure about the behavior on HD 4600, but it should be able to finish...

Again, the performance of this kernel is not that good on GPU. We have done some optimization work for GEMM on GPU, and it performs much better than the sample.
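To put the two timings above in perspective, they can be converted to throughput. The sketch below uses the usual 2*n^3 operation count for a square GEMM (one multiply and one add per inner-loop step, ignoring the O(n^2) alpha/beta scaling). The function names are illustrative and not part of the sample.

```c
#include <assert.h>
#include <stddef.h>

/* Floating-point operations for an n x n x n GEMM: 2 * n^3. */
static double gemm_flops(size_t n)
{
    double d = (double)n;
    return 2.0 * d * d * d;
}

/* Throughput in GFLOP/s for one GEMM that took `seconds` to run. */
static double gemm_gflops(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return gemm_flops(n) / seconds / 1e9;
}
```

With the numbers quoted for HD 4400, 26 sec corresponds to roughly 4.8 GFLOP/s and 9 sec to roughly 13.9 GFLOP/s. Either way, a single kernel launch keeps the GPU busy for several seconds.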

naoki_o_ — Wed, 18 Mar 2015 10:14:39 GMT

Ummm....

When I start 3968x3968, the screen blacks out after a few seconds and returns to normal a few seconds later.

But the program has stopped.

This sample repeats the calculation 10 times.

Sometimes the calculation finishes once or twice. In that case, one calculation takes about 8 seconds.

However, it never reaches the end of the loop.

When the CPU is selected, it runs normally.

naoki_o_ — Wed, 18 Mar 2015 10:23:53 GMT

I took a video of the operation.

Lingyi_K_Intel — Thu, 19 Mar 2015 02:30:51 GMT

You could always add "-cl-mad-enable" to make multiply+add operations faster, and also try "-cl-fast-relaxed-math" for further performance gains, though you may lose control of numeric accuracy. Pass them in the build program options.

The kernel below performs much better than the current nn/nt kernels on GPU, and needs the following changes in the host code.

1. Update the kernel args and add "-DSIZE=3968" to the build program options.

2. Update the global size to {3968 / 4, 3968 / 8}, because each work-item calculates an 8x4 block of the output.

3. Update the verification code to check the output in row-major order.

// C := alpha*A*B + beta*C
// A is in row-major form
// B is in row-major form
// C is in row-major form
// SIZE (the matrix dimension) is defined at build time with -DSIZE=...
// Each work-item computes an 8-row x 4-column block of C.
__kernel void gemm_8x4(__global const float4 * restrict A,
                       __global const float4 * restrict B,
                       __global float4 * restrict C,
                       float alpha,
                       float beta)
{
    int x = get_global_id(0);   // float4 column of the output block
    int y = get_global_id(1);   // 8-row band of the output block

    float4 sum[8] = {(float4)(0)};
    float4 tempA;
    float4 tempB[4];

    for (uint i = 0; i < SIZE; i = i + 4)
    {
        // Load rows i..i+3 of B at float4 column x.
        for (uint k = 0; k < 4; k++)
        {
            tempB[k] = B[x + (i + k) * (SIZE / 4)];
        }

        // Add the contribution of columns i..i+3 of A to each of the 8 output rows.
        for (uint k = 0; k < 8; k++)
        {
            tempA = A[i / 4 + ((y * 8) + k) * (SIZE / 4)];
            sum[k] += tempB[0] * tempA.x;
            sum[k] += tempB[1] * tempA.y;
            sum[k] += tempB[2] * tempA.z;
            sum[k] += tempB[3] * tempA.w;
        }
        // improve cache accesses
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    for (uint i = 0; i < 8; i++)
    {
        C[x + (y * 8 + i) * (SIZE / 4)] = alpha * sum[i] + beta * C[x + (y * 8 + i) * (SIZE / 4)];
    }
}
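The three host-side changes listed before the kernel could look roughly like the helpers below. This is a sketch, not the sample's actual host code: the function names are hypothetical, the resulting values would still have to be passed to clBuildProgram and clEnqueueNDRangeKernel by the caller, and the reference routine is a plain triple loop intended only for checking results on small inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Global size for gemm_8x4: each work-item writes one float4 (4 columns)
 * for 8 consecutive rows. Returns -1 if n is not a multiple of 8. */
static int gemm_8x4_global_size(size_t n, size_t global[2])
{
    if (n == 0 || n % 8 != 0)
        return -1;
    global[0] = n / 4;
    global[1] = n / 8;
    return 0;
}

/* Build option string that defines SIZE for the kernel.
 * Returns -1 if the buffer is too small. */
static int gemm_8x4_build_options(size_t n, char *buf, size_t len)
{
    int written = snprintf(buf, len, "-DSIZE=%zu", n);
    return (written > 0 && (size_t)written < len) ? 0 : -1;
}

/* Row-major CPU reference: C := alpha*A*B + beta*C for n x n matrices. */
static void gemm_reference(size_t n, float alpha, const float *A,
                           const float *B, float beta, float *C)
{
    assert(A && B && C);
    for (size_t r = 0; r < n; r++) {
        for (size_t c = 0; c < n; c++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += A[r * n + k] * B[k * n + c];
            C[r * n + c] = alpha * acc + beta * C[r * n + c];
        }
    }
}
```

For n = 3968 the global size comes out as {992, 496}. When comparing GPU output against the reference, a small tolerance is advisable, especially if "-cl-fast-relaxed-math" is enabled.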

BTW, another optimization is to use local memory. In each work-group, load a sub-matrix of the input matrix into local memory, synchronize, and then read the data from local memory to do the calculation. This should give further performance gains.

Please give it a try and let me know the result. Thanks, Lingyi

naoki_o_ — Mon, 23 Mar 2015 08:59:15 GMT

I've tried it, but the behavior was the same.

If a program isn't optimized, can that cause the program to stop?

Lingyi_K_Intel — Fri, 27 Mar 2015 05:35:45 GMT

If the workload causes a graphics driver reset, then the execution would stop. In your case, do you see a message that the graphics driver stopped responding and recovered?

naoki_o_ — Fri, 27 Mar 2015 10:32:24 GMT

Yes.

After recovery, a message says that the graphics driver stopped responding and recovered.

Lingyi_K_Intel — Tue, 31 Mar 2015 02:30:28 GMT

OK, this must be the problem. The GPU stops and resets when the workload is too large.

You could try a longer TDR time in regedit (HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\GraphicsDrivers\TdrDelay) to see if a larger delay lets your case finish.
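A rough estimate shows why the failures start somewhere above 2048 x 2048. The sketch below assumes that GEMM run time scales with n^3 at constant throughput, that one kernel launch must finish within the TDR delay, and that the delay is at the documented Windows default of 2 seconds. The 8-second figure is the per-run time reported earlier in the thread for 3968x3968. All function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define TDR_DEFAULT_DELAY_SECONDS 2u  /* Windows default for TdrDelay */

/* Scale a measured GEMM time to another matrix size, assuming the work
 * grows with n^3 and throughput stays constant (an approximation). */
static double scale_gemm_seconds(double measured_seconds,
                                 size_t measured_n, size_t target_n)
{
    assert(measured_n > 0 && measured_seconds > 0.0);
    double ratio = (double)target_n / (double)measured_n;
    return measured_seconds * ratio * ratio * ratio;
}

/* Nonzero if a single kernel launch would outlast the TDR delay. */
static int exceeds_tdr(double kernel_seconds, unsigned tdr_delay_seconds)
{
    return kernel_seconds > (double)tdr_delay_seconds;
}

/* TdrDelay candidate: kernel time times a safety factor, rounded up
 * to whole seconds. */
static unsigned suggested_tdr_delay(double kernel_seconds, double safety_factor)
{
    double padded = kernel_seconds * safety_factor;
    unsigned whole = (unsigned)padded;
    return (padded > (double)whole) ? whole + 1u : whole;
}
```

Scaling 8 seconds at 3968 down to 2048 gives about 1.1 seconds, just under the 2-second default, while 3968 is far over it. A TdrDelay of around 16 seconds (a safety factor of 2) would be a reasonable first value to try under these assumptions. A reboot is normally required before a changed TdrDelay takes effect.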

naoki_o_ — Fri, 03 Apr 2015 03:49:56 GMT

It went well.

The problem was resolved.

Thank you very much!