Hello, I have implemented radix sort in an OpenGL compute shader:
https://github.com/cNoNim/radix-sort
But on Intel GPUs I have some problems. The same code from this branch:
https://github.com/cNoNim/radix-sort/tree/simple
I have an AMD GPU and the algorithm works perfectly on it, but it fails on NVidia and Intel. I have already posted this question on the NVidia DevTalk forum:
https://devtalk.nvidia.com/default/topic/916998/cuda-programming-and-performance/radix-sort-opengl-compute-shader/
In the branch above I use a key-only sort of an increasing sequence of unsigned integers, and I get:
OpenGL 4.3.0 - Build 10.18.14.4332 Intel Intel(R) HD Graphics 4600
count 67108864 elapsed 33442256 ticks 3.34422560 sec speed 20067086 per sec - FAILED
count 33554432 elapsed 30997096 ticks 3.09970960 sec speed 10825024 per sec - FAILED
count 16777216 elapsed 15963538 ticks 1.59635380 sec speed 10509710 per sec - FAILED
count 8388608 elapsed 7868773 ticks 0.78687730 sec speed 10660630 per sec - FAILED
count 4194304 elapsed 3936232 ticks 0.39362320 sec speed 10655632 per sec - FAILED
count 2097152 elapsed 2028931 ticks 0.20289310 sec speed 10336241 per sec - FAILED
count 1048576 elapsed 1044249 ticks 0.10442490 sec speed 10041436 per sec - FAILED
count 524288 elapsed 536672 ticks 0.05366720 sec speed 9769244 per sec - FAILED
count 262144 elapsed 282782 ticks 0.02827820 sec speed 9270179 per sec - FAILED
count 131072 elapsed 157309 ticks 0.01573090 sec speed 8332136 per sec - FAILED
count 65536 elapsed 101779 ticks 0.01017790 sec speed 6439049 per sec - FAILED
count 32768 elapsed 74790 ticks 0.00747900 sec speed 4381334 per sec - FAILED
count 16384 elapsed 61738 ticks 0.00617380 sec speed 2653795 per sec - FAILED
count 8192 elapsed 64297 ticks 0.00642970 sec speed 1274087 per sec - FAILED
count 4096 elapsed 57811 ticks 0.00578110 sec speed 708515 per sec - FAILED
count 2048 elapsed 98717 ticks 0.00987170 sec speed 207461 per sec - FAILED
count 1024 elapsed 52636 ticks 0.00526360 sec speed 194543 per sec - FAILED
Reference result from the AMD GPU:
OpenGL 4.5.13399 Compatibility Profile Context 16.201.1151.1007 ATI Technologies Inc. AMD Radeon HD 6700M Series
count 67108864 elapsed 49687237 ticks 4.96872370 sec speed 13506257 per sec - PASSED
count 33554432 elapsed 29207774 ticks 2.92077740 sec speed 11488185 per sec - PASSED
count 16777216 elapsed 14705172 ticks 1.47051720 sec speed 11409057 per sec - PASSED
count 8388608 elapsed 7428293 ticks 0.74282930 sec speed 11292780 per sec - PASSED
count 4194304 elapsed 3587719 ticks 0.35877190 sec speed 11690726 per sec - PASSED
count 2097152 elapsed 1815771 ticks 0.18157710 sec speed 11549650 per sec - PASSED
count 1048576 elapsed 934891 ticks 0.09348910 sec speed 11216024 per sec - PASSED
count 524288 elapsed 631452 ticks 0.06314520 sec speed 8302895 per sec - PASSED
count 262144 elapsed 266753 ticks 0.02667530 sec speed 9827218 per sec - PASSED
count 131072 elapsed 142823 ticks 0.01428230 sec speed 9177233 per sec - PASSED
count 65536 elapsed 92056 ticks 0.00920560 sec speed 7119144 per sec - PASSED
count 32768 elapsed 66577 ticks 0.00665770 sec speed 4921819 per sec - PASSED
count 16384 elapsed 51747 ticks 0.00517470 sec speed 3166173 per sec - PASSED
count 8192 elapsed 47519 ticks 0.00475190 sec speed 1723942 per sec - PASSED
count 4096 elapsed 42577 ticks 0.00425770 sec speed 962021 per sec - PASSED
count 2048 elapsed 40735 ticks 0.00407350 sec speed 502761 per sec - PASSED
count 1024 elapsed 41904 ticks 0.00419040 sec speed 244368 per sec - PASSED
COMPLETE
Can somebody explain or help?
Why do I get this behavior, and what is wrong in my code?
When I debug the algorithm on a simple case with a 1024-element array, I get a wrong intermediate result, but not on the first stage of the radix sort.
Sometimes I get a correct result and the test PASSES for 1024 elements, but for other array sizes I get failures, and if I run several tests like the above I get FAILED every time.
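To clarify the structure, one digit pass of the sort looks roughly like the sketch below. This is a simplified, hypothetical illustration of only the per-workgroup histogram dispatch; the buffer names and layouts are assumptions, not the exact shader from the repository:

```glsl
// Simplified sketch of the histogram dispatch for one 4-bit digit pass.
// Names and layouts are illustrative, not the repository's exact code.
#version 430
layout(local_size_x = 256) in;

layout(std430, binding = 0) readonly buffer Keys { uint keys[]; };
layout(std430, binding = 1) buffer Histogram { uint hist[]; };

uniform uint shift; // 0, 4, 8, ..., 28 across the passes

shared uint localHist[16];

void main() {
    uint lid = gl_LocalInvocationID.x;
    if (lid < 16u) localHist[lid] = 0u;
    barrier(); // every invocation must see the zeroed shared histogram

    uint digit = (keys[gl_GlobalInvocationID.x] >> shift) & 0xFu;
    atomicAdd(localHist[digit], 1u);
    barrier(); // all counting done before the flush below

    if (lid < 16u) atomicAdd(hist[lid], localHist[lid]);
}
```

Separate dispatches then prefix-sum hist[] and scatter the keys, so each dispatch must see the previous one's writes.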
Hi Oleg,
This is most likely a question for the Intel GPU driver team, so let me move this thread to their forum.
When we get to the point where your shader is passing and you want to analyze it with GPA, I can help then.
Best,
Seth
Hi Seth,
Thanks for the response. Maybe I can explain some places in the code if that would help.
I use Visual Studio 2015 Community Edition to build the project, and the code has some macros for embedding the GLSL source. That may complicate debugging, but I can split the code apart if needed.
I also have another question. On the AMD GPU the algorithm works perfectly with BARRIER defined as
#define BARRIER groupMemoryBarrier()
but on NVidia this leads to failures every time. So I tried defining BARRIER as
#define BARRIER groupMemoryBarrier(); barrier()
but I don't understand why the first definition is not enough.
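My understanding (which may be wrong) is that groupMemoryBarrier() only orders memory accesses, while barrier() actually makes the invocations of a workgroup wait for each other. A place where that would matter in a shared-memory scan, sketched as hypothetical code rather than my exact shader:

```glsl
shared uint temp[256];

void scanStep(uint lid, uint offset) {
    uint v = (lid >= offset) ? temp[lid - offset] : 0u;
    // groupMemoryBarrier() alone only orders memory operations; it does not
    // make invocations wait here, so one invocation could overwrite temp[]
    // before another has read it. barrier() synchronizes execution of the
    // whole workgroup (and, for compute shaders, also orders shared-memory
    // accesses across it).
    barrier();
    temp[lid] = temp[lid] + v;
    barrier();
}
```

My guess is that the missing barrier() goes unnoticed on AMD because the invocations of one wavefront run in lockstep, but I am not sure.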
I also tried placing glMemoryBarrier between the compute shader dispatches, but on AMD the code works without it.
If you fix or explain the driver behavior, maybe you can also check these places.
Regards,
Oleg
I can tell that it sometimes works on the master branch.
The master branch uses a key-value sort with signed integer keys.
In the master branch I have also tried implementing the algorithm in C++ AMP for comparison.
OpenGL 4.3.0 - Build 10.18.14.4332 Intel Intel(R) HD Graphics 4600
count 1048576 elapsed 2304529 ticks 0.2304529 sec speed 4550066 per sec - FAILED
count 524288 elapsed 605071 ticks 0.0605071 sec speed 8664900 per sec - FAILED
count 262144 elapsed 335384 ticks 0.0335384 sec speed 7816234 per sec - FAILED
count 131072 elapsed 197828 ticks 0.0197828 sec speed 6625553 per sec - PASSED
count 65536 elapsed 130822 ticks 0.0130822 sec speed 5009554 per sec - PASSED
count 32768 elapsed 92115 ticks 0.0092115 sec speed 3557292 per sec - PASSED
count 16384 elapsed 75177 ticks 0.0075177 sec speed 2179389 per sec - PASSED
count 8192 elapsed 73296 ticks 0.0073296 sec speed 1117659 per sec - PASSED
count 4096 elapsed 66101 ticks 0.0066101 sec speed 619657 per sec - PASSED
count 2048 elapsed 56784 ticks 0.0056784 sec speed 360664 per sec - PASSED
count 1024 elapsed 52732 ticks 0.0052732 sec speed 194189 per sec - PASSED
Hi Oleg,
We were able to reproduce the issue on 5th and 6th Generation Core processors with the latest 15.40 driver. We also saw the issue with NVidia's latest drivers (364.47).
We tested a few options: compiler (IGC/USC), SIMD mode, disabling compiler optimizations, disabling low precision. None of them had an effect on the issue, as it still failed. We think it could be a synchronization problem; the compute program uses the groupMemoryBarrier() and barrier() functions for synchronization. You are using 4 compute programs, so perhaps checking intermediate results before going to the next dispatch may help.
-Michael
Hi Michael,
Thanks for the response.
I understand that checking intermediate results can help, and I have tried dumping them.
I can tell that the issue produces random errors in some blocks of data, and the errors are not necessarily in the first step of the algorithm.
Finally, I don't have good enough tools to debug GLSL code, and dumping large data sets is very slow.
I can implement the algorithm in OpenCL or something else if that would help find the root of the problem.
Can you suggest a better approach?
Regards
Oleg Ageev
Hi Oleg,
The observed problems are random errors in blocks of data.
To root-cause the problem, you should divide it into smaller parts:
- Test only the data flow (read/write) between compute shader dispatches. This can help identify synchronization issues or other memory-related problems (invalid layout, size, bindings, etc.) and isolate them from the sorting algorithm.
- If the data flow is correct, check the sorting algorithm.
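For example, a minimal pass-through shader (an illustrative sketch, not code from your repository) can temporarily replace each sorting stage to validate only the data flow:

```glsl
// Minimal pass-through dispatch: copies the input buffer to the output
// unchanged. Substituting this for each sorting stage checks bindings,
// buffer sizes, and inter-dispatch synchronization separately from the
// sorting logic.
#version 430
layout(local_size_x = 256) in;

layout(std430, binding = 0) readonly buffer Src { uint src[]; };
layout(std430, binding = 1) writeonly buffer Dst { uint dst[]; };

void main() {
    uint i = gl_GlobalInvocationID.x;
    if (i < uint(src.length()))
        dst[i] = src[i];
}
```

On the host side, a glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) between the dispatches ensures each one sees the previous one's writes.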
-Michael
