
Radix Sort - OpenGL Compute Shader

Oleg_A_1
Beginner

Hello, I have implemented a radix sort as an OpenGL compute shader:
https://github.com/cNoNim/radix-sort

But I have some problems with Intel GPA. The same code is in this branch:
https://github.com/cNoNim/radix-sort/tree/simple

I have an AMD GPU and the algorithm works perfectly on it, but it fails on NVIDIA and Intel.
I have already posted a question on NVIDIA DevTalk:
https://devtalk.nvidia.com/default/topic/916998/cuda-programming-and-performance/radix-sort-opengl-compute-shader/

In the branch above I use a key-only sort of an increasing sequence of unsigned integers.
On Intel I get:

OpenGL 4.3.0 - Build 10.18.14.4332
        Intel
        Intel(R) HD Graphics 4600
count   67108864 elapsed   33442256 ticks 3.34422560 sec speed   20067086 per sec        - FAILED
count   33554432 elapsed   30997096 ticks 3.09970960 sec speed   10825024 per sec        - FAILED
count   16777216 elapsed   15963538 ticks 1.59635380 sec speed   10509710 per sec        - FAILED
count    8388608 elapsed    7868773 ticks 0.78687730 sec speed   10660630 per sec        - FAILED
count    4194304 elapsed    3936232 ticks 0.39362320 sec speed   10655632 per sec        - FAILED
count    2097152 elapsed    2028931 ticks 0.20289310 sec speed   10336241 per sec        - FAILED
count    1048576 elapsed    1044249 ticks 0.10442490 sec speed   10041436 per sec        - FAILED
count     524288 elapsed     536672 ticks 0.05366720 sec speed    9769244 per sec        - FAILED
count     262144 elapsed     282782 ticks 0.02827820 sec speed    9270179 per sec        - FAILED
count     131072 elapsed     157309 ticks 0.01573090 sec speed    8332136 per sec        - FAILED
count      65536 elapsed     101779 ticks 0.01017790 sec speed    6439049 per sec        - FAILED
count      32768 elapsed      74790 ticks 0.00747900 sec speed    4381334 per sec        - FAILED
count      16384 elapsed      61738 ticks 0.00617380 sec speed    2653795 per sec        - FAILED
count       8192 elapsed      64297 ticks 0.00642970 sec speed    1274087 per sec        - FAILED
count       4096 elapsed      57811 ticks 0.00578110 sec speed     708515 per sec        - FAILED
count       2048 elapsed      98717 ticks 0.00987170 sec speed     207461 per sec        - FAILED
count       1024 elapsed      52636 ticks 0.00526360 sec speed     194543 per sec        - FAILED

Reference result from the AMD GPU:

OpenGL 4.5.13399 Compatibility Profile Context 16.201.1151.1007
        ATI Technologies Inc.
        AMD Radeon HD 6700M Series
count   67108864 elapsed   49687237 ticks 4.96872370 sec speed   13506257 per sec        - PASSED
count   33554432 elapsed   29207774 ticks 2.92077740 sec speed   11488185 per sec        - PASSED
count   16777216 elapsed   14705172 ticks 1.47051720 sec speed   11409057 per sec        - PASSED
count    8388608 elapsed    7428293 ticks 0.74282930 sec speed   11292780 per sec        - PASSED
count    4194304 elapsed    3587719 ticks 0.35877190 sec speed   11690726 per sec        - PASSED
count    2097152 elapsed    1815771 ticks 0.18157710 sec speed   11549650 per sec        - PASSED
count    1048576 elapsed     934891 ticks 0.09348910 sec speed   11216024 per sec        - PASSED
count     524288 elapsed     631452 ticks 0.06314520 sec speed    8302895 per sec        - PASSED
count     262144 elapsed     266753 ticks 0.02667530 sec speed    9827218 per sec        - PASSED
count     131072 elapsed     142823 ticks 0.01428230 sec speed    9177233 per sec        - PASSED
count      65536 elapsed      92056 ticks 0.00920560 sec speed    7119144 per sec        - PASSED
count      32768 elapsed      66577 ticks 0.00665770 sec speed    4921819 per sec        - PASSED
count      16384 elapsed      51747 ticks 0.00517470 sec speed    3166173 per sec        - PASSED
count       8192 elapsed      47519 ticks 0.00475190 sec speed    1723942 per sec        - PASSED
count       4096 elapsed      42577 ticks 0.00425770 sec speed     962021 per sec        - PASSED
count       2048 elapsed      40735 ticks 0.00407350 sec speed     502761 per sec        - PASSED
count       1024 elapsed      41904 ticks 0.00419040 sec speed     244368 per sec        - PASSED
COMPLETE

Can somebody explain or help?

Why do I get this behavior, and what is wrong in my code?
When I debug the algorithm on a simple case with an array of 1024 elements, I get a wrong intermediate result, but not on the first pass of the radix sort.
Sometimes I get a correct result and the test PASSES for 1024 elements, but for other array sizes it fails, and if I run several tests like the ones above I always get FAILED.

Seth_S_Intel
Employee

Hi Oleg, 

This is most likely a question for the Intel GPU driver team. Let me move this thread to their forum.

When we get to the point where your shader is passing and you want to analyze it with GPA, I can help then.

Best,

Seth

Oleg_A_1
Beginner

Hi Seth,

Thanks for the response. Maybe I can explain some places in the code if that is needed.

I use Visual Studio 2015 Community Edition to build the project, and the code uses some macros for embedding the GLSL code.

That may complicate debugging, but I can split the code out if needed.

I also have another question. On the AMD GPU the algorithm works perfectly with BARRIER defined as

#define BARRIER groupMemoryBarrier()

But on NVIDIA that leads to failures every time, so I tried defining BARRIER as

#define BARRIER groupMemoryBarrier(); barrier()

I don't understand why the first definition is not enough.
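
To show what I mean, here is a stripped-down illustration, not the real shader from the repository (the shader body, names, and sizes are just placeholders, embedded as a string the same way the project embeds its GLSL), of where BARRIER sits between writing and reading shared memory:

// Illustration only: a tiny compute shader showing the pattern in which
// BARRIER has to separate the shared-memory writes from the reads.
static const char* barrier_illustration = R"(
#version 430
layout(local_size_x = 256) in;
layout(std430, binding = 0) buffer Keys { uint keys[]; };

shared uint temp[256];

#define BARRIER groupMemoryBarrier(); barrier()

void main() {
    uint lid = gl_LocalInvocationID.x;
    temp[lid] = keys[gl_GlobalInvocationID.x];
    // Every invocation must see the writes above before any invocation reads
    // a neighbour's slot.
    BARRIER;
    keys[gl_GlobalInvocationID.x] = temp[lid ^ 1u];
}
)";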

I also tried placing glMemoryBarrier() between the compute shader dispatches, but on AMD the code works without it.
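
Concretely, on the host side I mean something like this (a minimal sketch with placeholder program handles and workgroup counts, not the exact code from the repository):

#include <GL/glew.h>

// Sketch only: two passes with an explicit glMemoryBarrier between them, so
// SSBO writes from the first dispatch are visible to reads in the second.
void dispatch_two_passes(GLuint histogram_program, GLuint scatter_program, GLuint groups)
{
    glUseProgram(histogram_program);
    glDispatchCompute(groups, 1, 1);

    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);   // the barrier I tried adding

    glUseProgram(scatter_program);
    glDispatchCompute(groups, 1, 1);
}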

If you fix or explain the driver behavior, maybe you can also check these places.

Regards,

Oleg

Oleg_A_1
Beginner

I can tell that it sometimes works on the master branch.
The master branch uses a key-value sort with signed integer keys.
In the master branch I have also tried implementing the algorithm in C++ AMP for comparison.

OpenGL 4.3.0 - Build 10.18.14.4332
        Intel
        Intel(R) HD Graphics 4600
count    1048576 elapsed    2304529 ticks  0.2304529 sec speed    4550066 per sec - FAILED
count     524288 elapsed     605071 ticks  0.0605071 sec speed    8664900 per sec - FAILED
count     262144 elapsed     335384 ticks  0.0335384 sec speed    7816234 per sec - FAILED
count     131072 elapsed     197828 ticks  0.0197828 sec speed    6625553 per sec - PASSED
count      65536 elapsed     130822 ticks  0.0130822 sec speed    5009554 per sec - PASSED
count      32768 elapsed      92115 ticks  0.0092115 sec speed    3557292 per sec - PASSED
count      16384 elapsed      75177 ticks  0.0075177 sec speed    2179389 per sec - PASSED
count       8192 elapsed      73296 ticks  0.0073296 sec speed    1117659 per sec - PASSED
count       4096 elapsed      66101 ticks  0.0066101 sec speed     619657 per sec - PASSED
count       2048 elapsed      56784 ticks  0.0056784 sec speed     360664 per sec - PASSED
count       1024 elapsed      52732 ticks  0.0052732 sec speed     194189 per sec - PASSED

 

Michael_C_Intel2
Employee

Hi Oleg,

We were able to reproduce the issue on 5th and 6th Generation Core processors with the latest 15.40 driver. We also saw the issue with NVIDIA's latest driver (364.47).

We tested a few options: compiler (IGC/USC), SIMD mode, disabling compiler optimizations, disabling low precision; none of them had an effect on the issue, as it still failed. We think it could be a synchronization problem; the compute program uses the groupMemoryBarrier() and barrier() functions for synchronization. You are using 4 compute programs, so checking intermediate results before going to the next dispatch may help.
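
For example, during debugging you could read each intermediate buffer back after its dispatch and validate it on the CPU before issuing the next one. A rough sketch (the function, buffer, and size names are placeholders, not from your repository):

#include <GL/glew.h>
#include <vector>

// Read an intermediate SSBO back to the CPU so its contents can be checked
// before the next pass runs. "ssbo" and "count" stand for whichever buffer
// the previous dispatch just wrote.
std::vector<GLuint> read_back(GLuint ssbo, size_t count)
{
    // Make the shader writes visible to the glGetBufferSubData call below.
    glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);

    std::vector<GLuint> data(count);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0,
                       count * sizeof(GLuint), data.data());
    return data;
}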

-Michael 

 

Oleg_A_1
Beginner

Hi Michael,

Thanks for the response.
I understand that checking intermediate results can help, and I have tried dumping them.
I can tell that the issue shows up as random errors in some blocks of data, and the errors are not necessarily on the first step of the algorithm.
Unfortunately, I don't have enough tools to debug GLSL code, and dumping large data sets is very slow.

I can implement the algorithm in OpenCL or something else, if that would help get to the root of the problem.
Can you suggest a better way, maybe?

Regards
Oleg Ageev
 

Michael_C_Intel2
Employee

Hi Oleg,

The observed problems are random errors in blocks of data.

To root cause the problem, you should divide it into smaller parts:

  1. Test only the data flow (read/write) between compute shader dispatches; this can help identify synchronization issues or other memory-related problems (invalid layout, size, bindings, etc.) and isolate them from the sorting algorithm (see the sketch after this list).
  2. If the data flow is correct, check the sorting algorithm.
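
As a starting point for step 1, something along these lines can show whether plain reads and writes between dispatches behave correctly. This is a sketch only; the pass-through shader, buffer names, and sizes are placeholders rather than code from your repository:

#include <GL/glew.h>
#include <cstdio>
#include <vector>

// Data-flow-only test: a trivial pass-through compute shader copies one SSBO
// to another. If the read-back does not match the input, the problem is in
// layout/size/bindings or synchronization rather than in the sorting code.
static const char* copy_src = R"(
#version 430
layout(local_size_x = 256) in;
layout(std430, binding = 0) readonly  buffer In  { uint src[]; };
layout(std430, binding = 1) writeonly buffer Out { uint dst[]; };
void main() { dst[gl_GlobalInvocationID.x] = src[gl_GlobalInvocationID.x]; }
)";

bool test_data_flow(GLuint count)                  // count: a multiple of 256
{
    GLuint shader = glCreateShader(GL_COMPUTE_SHADER);
    glShaderSource(shader, 1, &copy_src, nullptr);
    glCompileShader(shader);
    GLuint program = glCreateProgram();
    glAttachShader(program, shader);
    glLinkProgram(program);

    std::vector<GLuint> input(count), output(count, 0);
    for (GLuint i = 0; i < count; ++i) input[i] = i;

    GLuint buffers[2];
    glGenBuffers(2, buffers);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffers[0]);
    glBufferData(GL_SHADER_STORAGE_BUFFER, count * sizeof(GLuint), input.data(), GL_STATIC_DRAW);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffers[1]);
    glBufferData(GL_SHADER_STORAGE_BUFFER, count * sizeof(GLuint), nullptr, GL_STATIC_READ);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffers[0]);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, buffers[1]);

    glUseProgram(program);
    glDispatchCompute(count / 256, 1, 1);
    glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);

    glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffers[1]);
    glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, count * sizeof(GLuint), output.data());

    bool ok = (input == output);
    std::printf("data flow %s\n", ok ? "PASSED" : "FAILED");
    return ok;
}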

-Michael
