OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Where does the limit of 1024 items per work-group come from?

sschuberth
Beginner
1,886 Views
Hi,

I'm just curious, is there a particular reason why the maximum work-group size on my Core i7 920 is 1024? I currently cannot think of any hardware-specific limit that explains that exact number.
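For reference, this is roughly how I query the limit (a minimal sketch; the device and kernel handles are placeholders and error checking is omitted):

    size_t dev_max = 0, krn_max = 0;
    /* Device-wide upper bound on work-group size (reports 1024 on my Core i7 920). */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(dev_max), &dev_max, NULL);
    /* Per-kernel limit, which may be lower than the device-wide one. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(krn_max), &krn_max, NULL);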

More concretely, I've read in "Writing Optimal OpenCL Code with the Intel OpenCL SDK" that it is suggested to use 64-128 bytes per work-group (for kernels without barriers). However, in my case 4096 bytes per work-group (1024 work-items operating on floats) leads to the best performance, and I'm trying to explain why (it's a simple filtered backprojection, with all output voxels being calculated independently).

Any hints as to what could explain the better performance with 4096 bytes instead of the suggested 64-128 bytes? What factors might have an influence?

Thanks!
3 Replies
Maxim_S_Intel
Employee
1,885 Views
Hi,
I believe that having larger work-groups is beneficial since it reduces overhead. But it might also affect load balancing, which is performed at the granularity of work-groups. That means you still need a sufficient number of work-groups to keep all your CPU cores busy, as explicitly stated in the Performance Guide.
Also, I believe that the amount of private (per-work-item) memory limits the maximum number of work-items that the runtime can keep in flight between barriers, hence the limitation on work-group size.

If you believe you need even larger work-groups, then you can emulate them by processing several elements per work-item, i.e. with a simple for-loop in your kernel code.
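For illustration, a minimal sketch of that pattern (the kernel body and ELEMS_PER_ITEM are purely illustrative, not your backprojection code):

    /* Each work-item processes ELEMS_PER_ITEM consecutive elements, so the
       NDRange needs ELEMS_PER_ITEM times fewer work-items overall. */
    #define ELEMS_PER_ITEM 4

    __kernel void scale(__global const float *in, __global float *out, int n)
    {
        int base = get_global_id(0) * ELEMS_PER_ITEM;
        for (int i = 0; i < ELEMS_PER_ITEM; ++i) {
            int idx = base + i;
            if (idx < n)              /* guard the last, possibly partial chunk */
                out[idx] = in[idx] * 2.0f;
        }
    }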

Please note that the Performance Guide was updated with the next SDK release (available here: http://software.intel.com/en-us/articles/intel-opencl-sdk/). Now it uses the notion of work-group size only (not memory footprint), which is easier. It also advises relying on the runtime to determine the proper work-group size:

"...Generally, the recommended work-group size for kernels without a barrier instruction is 64-128. For kernels with a barrier instruction, the recommended work-group size is 32-64. However, some experimentation is advised. We also recommend letting the OpenCL implementation to automatically determine the optimal work-group size for a given kernel. Simply pass NULL for a pointer to the local work size when calling clEnqueueNDRangeKernel..."

sschuberth
Beginner
1,886 Views
First of all, thanks for the reply. (Unfortunately, line-wise quoting does not seem to be possible with this forum software, so I'll do it manually.)

"I believe that having larger work-groups is beneficial since it reduces overhead."

What exactly is the overhead with smaller and thus more work-groups? I assume the OpenCL runtime processes one work-group at a time per compute unit and works like this: use a global work size of 256 x 256 x 256 and a local work size of 64 x 4 x 4. That partitions the global work into 16384 work-groups. On my Core i7 920 with 8 compute units, 8 of the total 16384 work-groups are processed in parallel. When a compute unit is finished with its current work-group, it picks the next one from the remaining work-groups. That is, once the kernel is launched, there is no ongoing thread creation / destruction, which would involve a bigger overhead if there were smaller and thus more work-groups to process. Are you saying that advancing from one work-group to the next within a thread / compute unit involves a (noticeable) overhead?
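Concretely, the launch I have in mind looks roughly like this (placeholder handles, no error checking):

    /* 256*256*256 global work-items split into groups of 64*4*4 = 1024,
       giving 16777216 / 1024 = 16384 work-groups. */
    size_t global[3] = { 256, 256, 256 };
    size_t local[3]  = { 64, 4, 4 };
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                                        global, local, 0, NULL, NULL);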

"It means that you still need to have sufficient number of work-groups to keep all your CPU cores busy as explicitly stated in the PerfGuide."

I know, but in my example, there are plenty of work-groups to process, so even with work-groups of maximum size there are enough work-groups to keep all 8 compute units busy all the time.

"Also I believe that amount of private (per work-item) memory limits maximum number of work-items that run-time can keep in the flight between barriers, hence the limitation on work-group size."

In your opinion, what hardware specifics of the Core i7 limit the amount of private memory? The cache line size?

"If you believe you need even larger work-groups [...]"

I don't need them. I was just wondering why I achieve the best performance with a work-group size that is much larger than the recommended one. Maybe in my case, for a quite trivial kernel, the number of instructions per work-group is also important: I probably need to use such large work-groups in order to get into the range of the recommended 10000 - 100000 instructions per work-group.

As a side note, I was reading an earlier version of the "Writing Optimal OpenCL Code with the Intel OpenCL SDK" document which speaks of "64 - 128 bytes". The current version makes more sense to me, speaking of the number of work-items instead of bytes they process / "contain".
Maxim_S_Intel
Employee
1,886 Views
Hello!
This is still maxxxim, but now with my Intel hat on :).
Indeed, maintaining a work-group carries larger overhead than maintaining a work-item, so fewer, larger work-groups mean less total overhead.

"In your opinion, what hardware specifics of the Core i7 limit the amount of private memory? The cache line size?"

Private variables/memory use the stack; that is the answer.

"Maybe in my case, for a quite trivial kernel, the number of instructions per work-group is also important: I probably need to use such large work-groups in order to get into the range of the recommended 10000 - 100000 instructions per work-group."

Exactly!