OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

GEMM sample program blacks out the screen

naoki_o_
Beginner

I tried the General Matrix Multiply (GEMM) sample (https://software.intel.com/en-us/articles/gemm) on my PC.

But if the matrix size is larger than 2048 x 2048, the program stops and the screen blacks out.

The screen comes back after a few seconds, but the program has stopped.

I think the program stops at clEnqueueNDRangeKernel.

According to my calculations, there is no problem with the memory size, work-item size, work-group size, etc.

PC specifications:

CPU: Core i5-4440

Memory: 4 GB

OS: Windows 7 64-bit

Why can't I increase the matrix size?

Robert_I_Intel
Employee

Hi Naoki,

Could you please specify the following:

1. The version of the graphics driver you are using

2. The exact parameters you pass to the program when you experience the blackout

Thanks!

Lingyi_K_Intel
Employee

Actually, our GEMM sample is not fully optimized for our GPU. The nn/nt kernels and the default tiling parameters are CPU-optimized, so running the default matrix size (3968x3968) with the default tiling parameters is very slow on the GPU.

Try running the nn kernel with --tile-size-N 4. It reduces private memory usage and improves performance on the GPU.

Thanks, Lingyi

naoki_o_
Beginner

The graphics driver version is 10.18.10.3907.

I tried the default parameters and also changed the matrix size.

I tried running the nn kernel with --tile-size-N 4, but I got the same result. I think there is enough memory.

Robert_I_Intel
Employee

Dear Naoki,

It appears your driver version is fairly outdated. Please update to 10.18.10.4061 or above: https://downloadcenter.intel.com/search?keyword=3rd+generation+intel+processors

Let me know whether that worked for you.

naoki_o_
Beginner

Thank you for pointing it out.

This GPU is HD Graphics 4600, so I installed version 10.18.10.4156:

https://downloadcenter.intel.com/download/24785/Intel-Iris-and-HD-Graphics-Driver-for-Windows-7-8-8-...

But it did not improve things.

Lingyi_K_Intel
Employee

On my HD 4400 GPU, 3968x3968 takes over 26 sec to finish with the default tile-size-N = 128. During the execution, the system seems to be "hanging".

With tile-size-N = 4, it takes 9 sec to finish. I'm not sure about the behavior on HD 4600, but it should be able to finish...

Again, the performance of this kernel is not that good on the GPU. We have done some optimization work for GEMM on the GPU, and the result performs much better than the sample.
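
To put those timings in perspective, here is a minimal sketch of the usual throughput arithmetic. It assumes the quoted times are for a single 3968x3968 multiplication and uses the conventional 2*n^3 operation count; the function names are hypothetical and are not part of the sample.

```c
#include <stddef.h>

/* Floating-point operations in one n x n GEMM, using the conventional
 * 2*n^3 count (one multiply and one add per inner-product term).
 * The alpha/beta scaling is ignored. */
double gemm_flop_count(size_t n)
{
    double d = (double)n;
    return 2.0 * d * d * d;
}

/* Effective throughput in GFLOP/s for a run that took `seconds`. */
double gemm_gflops(size_t n, double seconds)
{
    return gemm_flop_count(n) / seconds / 1e9;
}
```

Under those assumptions, n = 3968 is about 125 GFLOP of work, so 26 sec corresponds to roughly 4.8 GFLOP/s and 9 sec to roughly 13.9 GFLOP/s.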

naoki_o_
Beginner

Ummm....

When I start 3968x3968, the screen blacks out after a few seconds and returns to normal a few seconds later.

But the program has stopped.

This sample repeats the calculation 10 times.

Sometimes the calculation finishes once or twice. In that case, one calculation takes about 8 seconds.

However, it never reaches the end of the loop.

When the CPU is selected, it runs normally.

naoki_o_
Beginner

I took a video of the operation.

Lingyi_K_Intel
Employee

You could always add "-cl-mad-enable" to make multiply+add operations faster, and also try "-cl-fast-relaxed-math" for further performance gains, although you may lose control of numeric accuracy. Pass them in the program build options.

The kernel below performs much better than the current nn/nt kernels on the GPU. It needs the following changes in the host code:

1. Update the kernel args and add "-DSIZE=3968" to the program build options.

2. Update the global size to {3968 / 4, 3968 / 8}, because each work-item calculates an 8x4 block of data.

3. Update the verification code to check the output in row-major order.

// C := alpha*A*B + beta*C
// A is in row-major form
// B is in row-major form
// C is in row-major form
// SIZE (the matrix dimension) is supplied at build time, e.g. -DSIZE=3968
__kernel void gemm_8x4(__global const float4 * restrict A,
                       __global const float4 * restrict B,
                       __global float4 * restrict C,
                       float alpha,
                       float beta)
{
    int x = get_global_id(0);   // float4 column index
    int y = get_global_id(1);   // index of the 8-row block

    float4 sum[8] = {(float4)(0)};
    float4 tempA;
    float4 tempB[4];

    for (uint i = 0; i < SIZE; i = i + 4)
    {
        // load rows i..i+3 of B for this float4 column
        for (uint k = 0; k < 4; k++)
        {
            tempB[k] = B[x + (i + k) * (SIZE / 4)];
        }

        // accumulate columns i..i+3 of A into each of the 8 output rows
        for (uint k = 0; k < 8; k++)
        {
            tempA = A[i / 4 + ((y * 8) + k) * (SIZE / 4)];
            sum[k] += tempB[0] * tempA.x;
            sum[k] += tempB[1] * tempA.y;
            sum[k] += tempB[2] * tempA.z;
            sum[k] += tempB[3] * tempA.w;
        }
        //improve cache accesses
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    for (uint i = 0; i < 8; i++)
    {
        C[x + (y * 8 + i) * (SIZE / 4)] = alpha * sum[i] + beta * C[x + (y * 8 + i) * (SIZE / 4)];
    }
}
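
The three host-side changes listed above can be sketched in plain C roughly as follows. This is only an illustration under stated assumptions: the helper names are hypothetical and not taken from the sample, the options string would be passed as the `options` argument of `clBuildProgram`, and the size array as `global_work_size` of `clEnqueueNDRangeKernel`.

```c
#include <stddef.h>
#include <stdio.h>

/* Step 1: build-options string carrying the matrix size.
 * Returns 0 on success, -1 if buf is too small. */
int gemm_8x4_build_options(size_t n, char *buf, size_t buf_len)
{
    int written = snprintf(buf, buf_len, "-DSIZE=%zu -cl-mad-enable", n);
    return (written > 0 && (size_t)written < buf_len) ? 0 : -1;
}

/* Step 2: each work-item writes one float4 (4 columns) for 8 rows,
 * so the grid is (n/4) x (n/8). The kernel has no edge handling,
 * so n must be a multiple of 8. */
int gemm_8x4_global_size(size_t n, size_t global[2])
{
    if (n == 0 || n % 8 != 0)
        return -1;
    global[0] = n / 4;
    global[1] = n / 8;
    return 0;
}

/* Step 3: CPU reference, C := alpha*A*B + beta*C, all n x n row-major. */
void gemm_ref_row_major(size_t n, float alpha, const float *A,
                        const float *B, float beta, float *C)
{
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < n; ++col) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += A[row * n + k] * B[k * n + col];
            C[row * n + col] = alpha * acc + beta * C[row * n + col];
        }
    }
}

/* Returns the index of the first element that differs by more than a
 * relative tolerance, or -1 if the buffers agree. */
long first_mismatch(const float *got, const float *want, size_t count,
                    float rel_tol)
{
    for (size_t i = 0; i < count; ++i) {
        float diff = got[i] - want[i];
        float mag = want[i] < 0.0f ? -want[i] : want[i];
        if (diff < 0.0f)
            diff = -diff;
        if (mag < 1.0f)
            mag = 1.0f;
        if (diff > rel_tol * mag)
            return (long)i;
    }
    return -1;
}
```

For n = 3968 the global size comes out as {992, 496}. The tolerance passed to `first_mismatch` is a judgment call, since "-cl-fast-relaxed-math" can change rounding on the device.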

BTW, another optimization is to use local memory. In each work-group, load a sub-matrix of each input matrix into local memory, synchronize, and then read the data from local memory for the calculation. This should give further performance gains.

Please give it a try and let me know the result. Thanks, Lingyi
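
The local-memory idea mentioned above follows a load, synchronize, compute pattern. The sketch below is a plain C analogue of that pattern, not an OpenCL kernel and not Intel's optimized code: the small arrays stand in for `__local` tiles, the comments mark where a kernel would call `barrier`, and `TILE` and `gemm_tiled` are hypothetical names.

```c
#include <stddef.h>

#define TILE 4 /* hypothetical tile edge; on a GPU it would match the work-group size */

/* Computes C := alpha*A*B + beta*C for n x n row-major matrices.
 * Returns -1 if n is not a multiple of TILE (no edge handling), else 0. */
int gemm_tiled(size_t n, float alpha, const float *A, const float *B,
               float beta, float *C)
{
    float tileA[TILE][TILE]; /* stands in for a __local tile of A */
    float tileB[TILE][TILE]; /* stands in for a __local tile of B */
    float acc[TILE][TILE];   /* one accumulator per simulated work-item */

    if (n == 0 || n % TILE != 0)
        return -1;

    /* Each (br, bc) pair plays the role of one work-group. */
    for (size_t br = 0; br < n; br += TILE) {
        for (size_t bc = 0; bc < n; bc += TILE) {
            for (size_t r = 0; r < TILE; ++r)
                for (size_t c = 0; c < TILE; ++c)
                    acc[r][c] = 0.0f;

            for (size_t bk = 0; bk < n; bk += TILE) {
                /* Load phase: every work-item copies one element per tile. */
                for (size_t r = 0; r < TILE; ++r) {
                    for (size_t c = 0; c < TILE; ++c) {
                        tileA[r][c] = A[(br + r) * n + (bk + c)];
                        tileB[r][c] = B[(bk + r) * n + (bc + c)];
                    }
                }
                /* In a kernel: barrier(CLK_LOCAL_MEM_FENCE) here. */

                /* Compute phase: all reads come from the tiles. */
                for (size_t r = 0; r < TILE; ++r)
                    for (size_t c = 0; c < TILE; ++c)
                        for (size_t k = 0; k < TILE; ++k)
                            acc[r][c] += tileA[r][k] * tileB[k][c];
                /* In a kernel: a second barrier here, so no tile is
                 * overwritten while another work-item still reads it. */
            }

            for (size_t r = 0; r < TILE; ++r) {
                for (size_t c = 0; c < TILE; ++c) {
                    size_t idx = (br + r) * n + (bc + c);
                    C[idx] = alpha * acc[r][c] + beta * C[idx];
                }
            }
        }
    }
    return 0;
}
```

The benefit on a GPU comes from each element of A and B being fetched from global memory once per work-group instead of once per work-item; the C version only shows the data flow.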

naoki_o_
Beginner
I've tried it, but the behavior was the same.

If a program isn't optimized, can that cause the program to stop?
Lingyi_K_Intel
Employee

If the workload causes a graphics driver reset, then the execution will stop. Do you see a message that the graphics driver stopped responding and recovered?

naoki_o_
Beginner

Yes.

After recovery, a message says that the graphics driver stopped responding and has recovered.

Lingyi_K_Intel
Employee

OK, this must be the problem: the GPU stops/resets under a large workload.

You could try a longer TDR (Timeout Detection and Recovery) time in regedit (HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\GraphicsDrivers\TdrDelay) to see whether a larger delay lets your case finish.
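
As a rough illustration of that registry change, a .reg fragment might look like the one below. The 60-second value is an assumption chosen for illustration and is not a figure from this thread. The fragment uses the CurrentControlSet form of the key, which is the path Microsoft documents; ControlSet001 is normally the control set it points to. TdrDelay is a DWORD in seconds with a Windows default of 2, and a reboot is needed for a new value to take effect.

```
Windows Registry Editor Version 5.00

; Assumption: 0x3c = 60 seconds, an illustrative value only.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c
```

A longer delay also means a genuinely hung GPU takes longer to recover, so it is probably best treated as a development-machine setting.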

naoki_o_
Beginner

It went well.

The problem was resolved.

Thank you very much!