I tried the General Matrix Multiply (GEMM) sample (https://software.intel.com/en-us/articles/gemm) on my PC.
But if the matrix size is larger than 2048 x 2048, the program stops and the screen blacks out.
The display comes back after a few seconds.
I think the program stops at the EnqueueNDRange call.
According to my calculations, the memory size, work-item size, work-group size, etc. are all within limits.
PC specifications:
Core i5-4440
Memory: 4 GB
Windows 7 64-bit
Why can't I increase the matrix size?
Hi Naoki,
Could you please specify the following:
1. The version of the graphics driver you are using
2. The exact parameters you pass to the program when you experience the blackout
Thanks!
Actually, our GEMM sample is not fully optimized for our GPU. The nn/nt kernels and the default tiling parameters are CPU-optimized, so running the default matrix size (3968x3968) with the default tiling parameters is very slow on the GPU.
Try running kernel nn with --tile-size-N 4. It reduces private memory usage and improves performance on the GPU.
Thanks, Lingyi
Dear Naoki,
It appears your driver version is fairly outdated. Please update to 10.18.10.4061 or above: https://downloadcenter.intel.com/search?keyword=3rd+generation+intel+processors
Let me know whether that worked for you.
Thank you for pointing it out.
This GPU is HD Graphics 4600, so I installed version 10.18.10.4156.
But it did not help.
On my HD 4400 GPU, 3968x3968 takes over 26 sec to finish with the default tile-size-N = 128. During the execution, the system seems to be "hanging".
With tile-size-N = 4, it takes 9 sec to finish. I'm not sure about the behavior on HD 4600, but it should be able to finish...
Again, the performance of this kernel is not that good on GPU. We have done some optimization work for GEMM on GPU, and the result performs much better than the sample.
Ummm....
When I start 3968x3968, the screen blacks out after a few seconds and returns to normal a few seconds later.
But the program has stopped.
This sample repeats the calculation 10 times.
Sometimes the calculation finishes one or two times. In those cases, one calculation takes about 8 seconds.
However, it never reaches the end of the loop.
When the CPU is selected, it runs normally.
You could always add "-cl-mad-enable" to make multiply+add operations faster, and also try "-cl-fast-relaxed-math" for further performance gains, although you may lose control of numeric accuracy. Pass them in the build program options.
The kernel below performs much better than the current nn/nt kernels on GPU, and needs the following changes in the host code.
1. Update the kernel args and add "-DSIZE=3968" to the build program options.
2. Update the global size to {3968 / 4, 3968 / 8}, because each work item calculates an 8x4 block of the output.
3. Update the verification code to check the output in row-major order.
// C := alpha*A*B + beta*C
// A is in row-major form
// B is in row-major form
// C is in row-major form
__kernel void gemm_8x4(__global const float4 * restrict A,
                       __global const float4 * restrict B,
                       __global float4 * restrict C,
                       float alpha, float beta)
{
    int x = get_global_id(0);  // float4 column index of C
    int y = get_global_id(1);  // index of the 8-row block of C
    float4 sum[8] = {(float4)(0)};
    float4 tempA;
    float4 tempB[4];

    for (uint i = 0; i < SIZE; i = i + 4)
    {
        // load four rows of B for this float4 column
        for (uint k = 0; k < 4; k++)
        {
            tempB[k] = B[x + (i + k) * (SIZE / 4)];
        }
        // accumulate into the eight output rows
        for (uint k = 0; k < 8; k++)
        {
            tempA = A[i / 4 + ((y * 8) + k) * (SIZE / 4)];
            sum[k] += tempB[0] * tempA.x;
            sum[k] += tempB[1] * tempA.y;
            sum[k] += tempB[2] * tempA.z;
            sum[k] += tempB[3] * tempA.w;
        }
        // improve cache accesses
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    for (uint i = 0; i < 8; i++)
    {
        C[x + (y * 8 + i) * (SIZE / 4)] = alpha * sum[i] + beta * C[x + (y * 8 + i) * (SIZE / 4)];
    }
}
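The three host-side changes listed above could look roughly like the following. This is only a sketch in plain C with no OpenCL calls: the helper names gemm_build_options, gemm_global_size and gemm_reference are made up for illustration and are not part of the GEMM sample, and the option string simply combines "-DSIZE" with "-cl-mad-enable" as suggested in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the build options for a size x size problem.
   Returns the character count reported by snprintf. */
int gemm_build_options(char *buf, size_t len, unsigned size)
{
    return snprintf(buf, len, "-DSIZE=%u -cl-mad-enable", size);
}

/* Hypothetical helper: each work item produces 8 rows x 4 columns
   (one float4 per row), so dimension 0 covers size / 4 float4 columns
   and dimension 1 covers size / 8 row blocks. */
void gemm_global_size(unsigned size, size_t global[2])
{
    assert(size % 8 == 0);  /* also implies size % 4 == 0 */
    global[0] = size / 4;
    global[1] = size / 8;
}

/* Row-major CPU reference for verification: C := alpha*A*B + beta*C,
   with A, B and C all n x n. */
void gemm_reference(const float *A, const float *B, float *C,
                    unsigned n, float alpha, float beta)
{
    for (unsigned row = 0; row < n; ++row) {
        for (unsigned col = 0; col < n; ++col) {
            float acc = 0.0f;
            for (unsigned k = 0; k < n; ++k)
                acc += A[(size_t)row * n + k] * B[(size_t)k * n + col];
            C[(size_t)row * n + col] =
                alpha * acc + beta * C[(size_t)row * n + col];
        }
    }
}
```

In the host code, the option string would go to the program build call and the two global sizes to the NDRange enqueue. The reference result can then be compared element by element with the buffer read back from the GPU, with some tolerance if "-cl-fast-relaxed-math" is enabled.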
BTW, another optimization is to use local memory. In each work group, load a sub-matrix of the input matrix into local memory, synchronize, and then read the data from local memory to do the calculation. With this approach we should get further performance gains.
Please give it a try and let me know the result. Thanks, Lingyi
If this case causes a graphics driver reset, execution stops. Do you see a message that the graphics driver stopped responding and recovered?
Yes.
After recovery, a message is displayed saying that the graphics driver stopped responding and recovered.
OK, this must be the problem: the GPU stops/resets under a large workload.
You could try a longer TDR time in regedit (HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\GraphicsDrivers\TdrDelay) to see whether a larger delay lets your case finish.
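As a rough sanity check of why the size matters, the measured time can be scaled and compared with the watchdog timeout. The sketch below assumes that the GEMM kernel time grows with the cube of the matrix dimension, that the whole calculation runs as a single GPU submission, and that TdrDelay is at the Windows default of 2 seconds. The function names are made up for illustration, and the 8-second figure is the per-calculation time reported earlier in this thread for 3968x3968.

```c
#include <assert.h>
#include <stdio.h>

/* Windows default TDR timeout in seconds (used when TdrDelay is not set). */
#define DEFAULT_TDR_DELAY_S 2.0

/* Hypothetical helper: scale a measured kernel time from ref_n to n,
   assuming the cost grows with the cube of the matrix dimension. */
double estimate_kernel_seconds(unsigned n, unsigned ref_n, double ref_seconds)
{
    assert(ref_n > 0);
    double ratio = (double)n / (double)ref_n;
    return ref_seconds * ratio * ratio * ratio;
}

/* Returns 1 when the estimated kernel time exceeds the watchdog timeout. */
int exceeds_tdr(double kernel_seconds, double tdr_delay_seconds)
{
    return kernel_seconds > tdr_delay_seconds;
}

/* Prints the estimate for one matrix size against the default timeout. */
void report(unsigned n, unsigned ref_n, double ref_seconds)
{
    double t = estimate_kernel_seconds(n, ref_n, ref_seconds);
    printf("N=%u: about %.1f s, %s\n", n, t,
           exceeds_tdr(t, DEFAULT_TDR_DELAY_S) ? "likely reset" : "likely fine");
}
```

Calling report(2048, 3968, 8.0) and report(3968, 3968, 8.0) puts the smaller size under the default timeout and the larger one well over it, which would be consistent with the resets reported in this thread. Treat this as an estimate only: the real behavior depends on the driver and on how the work is submitted.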
It went well.
The problem was resolved.
Thank you very much!