I tried to work with subdevices on the CPU, assuming they would help me control which cores the kernels run on. To be more precise, I have some host threads that need to give performance guarantees, and I want to be sure that the OpenCL kernels do not run on the same cores. For the host threads I can set the CPU affinity. For the OpenCL kernels, however, I cannot. I thought that subdevices could solve that problem in some way.
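For context, host-side pinning on Linux can be sketched as below. This is a minimal sketch, assuming glibc's `sched_getaffinity`/`sched_setaffinity`; the helper names are hypothetical and this is not the code from the attachment. It only restricts the calling thread and says nothing about where the OpenCL runtime places its own worker threads.

```cpp
#include <sched.h>   // sched_getaffinity, sched_setaffinity, cpu_set_t (Linux/glibc)
#include <cassert>

// Lowest CPU index the calling thread is currently allowed to run on, or -1 on error.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) return cpu;
    return -1;
}

// Restrict the calling thread (pid 0 means "caller") to a single CPU.
// Returns true on success.
bool pin_calling_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Number of CPUs in the calling thread's affinity mask, or -1 on error.
int allowed_cpu_count() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    return CPU_COUNT(&set);
}
```

A latency-critical host thread would call `pin_calling_thread(cpu)` once at startup. The open question in this thread is how to keep the OpenCL workers off that same core.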
While digging into this topic, I came across some peculiarities:
1. I figured out that the Intel OpenCL runtime creates one thread per CPU core, each with a specific CPU affinity set. This can be seen in gdb or htop. It is, however, strange that the affinity of those threads is not constant for the whole runtime (i.e. it is reset from time to time).
2. I also noticed that some of those threads seem to be set to the same affinity. This may also be an artifact of the refresh rate in htop, which could hide the moment the affinity changes.
4. Subdevices cannot be created by affinity domain (CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN), although clGetDeviceInfo(...CL_DEVICE_PARTITION_PROPERTIES..) lists it as supported.
5. When subdevices are created using CL_DEVICE_PARTITION_EQUALLY, the number of utilized cores seems to be one less than specified (i.e. partitioning equally into subdevices of 4 compute units each actually uses only 3 CPU cores).
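As a baseline for item 5, the OpenCL 1.2 rule for CL_DEVICE_PARTITION_EQUALLY is that the device is split into as many subdevices of n compute units as fit, and any remaining compute units stay unused. A minimal sketch of that arithmetic follows. The helper and struct names are hypothetical, and the figure of 8 compute units for the i7 2600K (4 cores with hyper-threading) is an assumption, not something taken from the attached output.

```cpp
#include <cassert>

struct EqualPartition {
    unsigned sub_devices;   // number of subdevices the rule yields
    unsigned unused_units;  // compute units left over and not used
};

// Expected outcome of CL_DEVICE_PARTITION_EQUALLY:
// as many subdevices of `units_per_sub` compute units as fit; the remainder is unused.
EqualPartition expected_equal_partition(unsigned max_units, unsigned units_per_sub) {
    assert(units_per_sub > 0 && units_per_sub <= max_units);
    return { max_units / units_per_sub, max_units % units_per_sub };
}
```

Under that assumption, `expected_equal_partition(8, 4)` describes two subdevices with no leftover units. The corresponding property list passed to clCreateSubDevices would be `{ CL_DEVICE_PARTITION_EQUALLY, 4, 0 }`. This covers only how many subdevices to expect, not which cores the runtime actually uses for each.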
- Is it possible to set the cpu affinity per subdevice or to a running OpenCL kernel?
- Can you reproduce and explain the behavior above? Does it make sense that two running kernels share one CPU core, even though they are running on separate subdevices?
- Linux 3.7.6-1-ARCH #1 SMP PREEMPT x86_64 GNU/Linux
- Intel Core i7 2600K
- Intel OpenCL SDK 2013 Build 56860
Attachments: A minimal running code example, some additional information to my system setup.
We have found a bug in thread affinitization - both in root device and sub-devices. We have fixed the root device problem for the next release (gold). For the sub-device problem we have filed a high-priority bug.
Probably related: on a Windows platform and a dual E5-2687W platform we see for a given load the following timing when hyper threading is turned on:
1 thread: 2870.352 ms; 2 thread: 2497.295 ms; 4 thread: 961.296 ms; 8 thread: 540.111 ms; 16 thread: 337.201 ms; 32 thread: 212.293 ms
The odd behaviour here is that 2 threads take about as much time as 1 thread, and that beyond that we get fairly good parallelization.
With hyper threading turned off (in machine BIOS) we see the following performance:
1 thread: 2858.425 ms; 2 thread: 1526.758 ms; 4 thread: 812.663 ms; 8 thread: 454.19 ms; 16 thread: 251.211 ms
where we also see very good scaling between 1 and 2 threads, and better performance for 4, 8 and 16 threads as well. We observed this behaviour both when specifying USE_CL_DEVICE_PARTITION_EQUALLY and when specifying USE_CL_PARTITION_BY_COUNTS. It seems that each pair of CPU threads is always scheduled on the same real core, instead of preferring to use free cores. We want to use only a subset of threads, as we intend to use the other CPU resources for other purposes.
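To put numbers on the scaling, speedup and parallel efficiency can be computed from the timings above. This is a minimal sketch; the function names are only illustrative.

```cpp
#include <cassert>

// Speedup of an n-thread run relative to the 1-thread run.
double speedup(double t1_ms, double tn_ms) {
    assert(t1_ms > 0.0 && tn_ms > 0.0);
    return t1_ms / tn_ms;
}

// Parallel efficiency: speedup divided by thread count (1.0 would be ideal).
double efficiency(double t1_ms, double tn_ms, int threads) {
    assert(threads > 0);
    return speedup(t1_ms, tn_ms) / threads;
}
```

For the 2-thread runs this gives a speedup of roughly 1.15 with hyper-threading on (2870.352 / 2497.295) versus roughly 1.87 with it off (2858.425 / 1526.758). That is consistent with the two threads sharing one physical core in the first case.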
Is there a way to get better balancing of the OpenCL threads over the available real cores while keeping hyper-threading enabled?
I think I have a similar problem, with the difference that querying the device does not report NUMA affinity as supported.
I installed Intel OpenCL version "intel_sdk_for_ocl_applications_xe_2013_r2_sdk_126.96.36.19985_x64" on an HPC server with a dual socket Xeon i5-2650, Xeon Phi coprocessor, 64GB host memory and Red Hat Enterprise Server 6.4.
I would like to use device fission with OpenCL to get around the NUMA issue. Unfortunately, the device query does not report CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN as supported on the CPU. The output is attached in a file.
My questions are:
1. Is CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN generally not supported on specific Intel CPUs?
2. Would AMD APP SDK be able to utilize CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN on Intel CPUs?
3. (a bit off-topic) I found 3 clinfo implementations on the internet, but none of them produces output as detailed as in the initial post. Where could I get a "proper" clinfo version?
Any comments regarding experiences with NUMA affinity are welcome.