Does the crash happen when compiling the kernel or when executing it?
I looked at the link you gave and didn't find anything in the downloads tab.
Where should we look for the sources?
Thanks, Shiri
Your input is very important and we are investigating it.
I will return to you with our findings.
I investigated on rev41.
Well, the issue here is not the __local memory but a memory overrun during the write at this line:
line 175, clppScan.cl: blockSums[bid] = localBuffer[localBufferFullSize-1];
Looking at the host code, I saw that you allocate a memory buffer that is too small for the algorithm.
One of the issues is the buffer size calculation (line 219, clppScan.cpp). You use a work-group size of 128 there, while passing a local size of 64 (line 72) to NDRange. This makes the number of work-groups greater than the size of the allocated buffer, and as a result you get a memory overrun.
After that change the first NDRange passed. I added clFinish() after it, but then the next NDRange failed, for the same reason: the intermediate buffer size doesn't match the number of work-groups. You should probably decrease the global size in the next pass.
Regards,
Evgeny
You're right, I missed these errors. First, I was using the wrong size for my buffers, and second, I forgot to put the work-size into the loop!
So now I have no more crashes, but I am still unable to get the correct values :-P
I hope that one day we will have a debugger, for Visual Studio for example :-p
A debugger would probably help.
For now I can only advise you to use printf().
But not with the Intel SDK :-(
I don't understand how I can get correct results with AMD but not with Intel, if the error is on my side!
Do you have any advice about this? Or maybe there is an issue with the Intel SDK?
Also, it would be interesting to use the AVX instructions. I have seen that some people use float4 to scan; maybe it would be great to have a version that is aware of the new AVX instruction set?
I suggest you check the event status after the NDRange completes. Is it SUCCESS, or does it report an error?
Do you have C reference code that validates the result?
Thanks,
Evgeny
Sounds great,
Could you please add this validation code to the project and report the validation result?
Thanks,
Evgeny
The validation code is already in the benchmark.cpp class. It is just a very simple scan in C++, and then I compare the two results!
The results are wrong with the Intel SDK!!
The C++ code does not use the same algorithm; it just produces the same result.
So I think that there is some error in the Intel SDK, because the NVIDIA GPU and the AMD GPU+CPU give me the right results. And sometimes even Intel gives me the right result. So the results with the Intel SDK are RANDOM!!! That is not normal!!!!
I know that you have experts in scan algorithms at Intel (Intel Parallel primitives), so maybe they can help you? It is just an idea!!
Right?
From your question it sounds like you're utilising local memory in your algorithm. Local memory has some surprising properties according to the OpenCL spec, and tends to not behave the same between different vendor implementations.
What you could hunt for is a missing barrier() built-in call. For the sake of debugging, add a barrier(CLK_LOCAL_MEM_FENCE) after every write to local memory and before every read. It will hurt performance, but it will add some predictability to the kernel behaviour. If the SDK works 100% okay after this modification, consider that one of these barriers may actually have been required (remember: a read after a write is not guaranteed to be up to date if you don't issue a mem_fence or a barrier).
If this doesn't solve the problem, you could try an additional debug step of running everything in a single work-group, by defining the local size as equal to the global size. You could also try disabling the vectorization module by using the vec_type_hint attribute (see the optimization guide for more details on this). If after all of these steps the SDK's behaviour is still unpredictable, please try to create a reproduction and send it to us.
Thanks,
Doron Singer
So, I have isolated the problem.
The scan is done in 2 phases. The second phase is the "uniform addition"; a simple way to write it is:
uint gid = get_global_id(0) * 2;
const uint blockId = get_group_id(0);
output[gid] += blockSums[blockId];
output[gid+1] += blockSums[blockId];
This version works, but it is slow. So I tried the following:
uint gid = get_global_id(0) * 2;
const uint tid = get_local_id(0);
const uint blockId = get_group_id(0);
__local T localBuffer[1];
if (tid < 1)
    localBuffer[0] = blockSums[blockId];
barrier(CLK_LOCAL_MEM_FENCE);
if (gid < outputSize)
    output[gid] += localBuffer[0];
gid++;
if (gid < outputSize)
    output[gid] += localBuffer[0];
And then I have the problem! The local buffer is initialized only by the work-item with tid < 1, BUT it is also visible to the other work-items of the group, so after the barrier every work-item should read the same value.
So, for me, the bug is in the Intel SDK: the "__local T localBuffer[1];" is not shared between all the work-items! That is the difference between "__local int b[1];" and "int b[1];".
Do you agree that it is an SDK bug?
Thanks for your help
Krys
Could you please add a validation phase to your code that compares against native C/C++?
Please report the validation result on program termination.
Thanks,
Evgeny
I was able to reproduce the issue.
The fix will be published in the next public release.
Thank you again,
Evgeny