OpenCL* for CPU

NUMA effects with OpenCL


Hi Guys,

Recently I have been working with OpenCL on a dual-socket Intel machine (X5650). How can I control NUMA effects with OpenCL? Is there an API for it, or is this handled by the runtime and hidden from the application?




Hello Jianbin,

You can try the following:

  1. Allocate the memory yourself, using something like libnuma to ensure it is all allocated on a single socket.
    Make sure to align the memory to the size of the OpenCL data type you intend to use.
  2. Create memory objects with CL_MEM_USE_HOST_PTR to wrap these allocations.
  3. Use clCreateSubDevices to create sub-devices representing the different NUMA nodes. The current version of the SDK doesn't support partitioning by CL_DEVICE_AFFINITY_DOMAIN_NUMA, but you can use the Intel extension CL_DEVICE_PARTITION_BY_NAMES_INTEL to define for yourself which cores are assigned to which sub-device. Read more about it here:

That should allow you to enqueue kernels on a single socket using the appropriate sub-device ID, and you can ensure each kernel operates on memory objects allocated on physical pages from that node.
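For anyone following along, the three steps above could be sketched roughly as below. This is only an illustration under some assumptions, not code from the SDK: the helper names (`build_by_names_props`, `is_aligned_to`, `wrap_node_buffer`, `make_subdevice`) and the `SKETCH_WITH_OPENCL` macro are made up for this post, and the mapping of core numbers to sockets is machine-specific and has to be checked on your own system. The OpenCL and libnuma calls are compiled only when `SKETCH_WITH_OPENCL` is defined, so the property-list helper can be tried without the SDK installed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef SKETCH_WITH_OPENCL   /* hypothetical switch: enable the real API calls */
#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <numa.h>
#endif

/* Values from CL/cl_ext.h (cl_intel_device_partition_by_names), repeated
   here only so the sketch compiles without the OpenCL headers. */
#ifndef CL_DEVICE_PARTITION_BY_NAMES_INTEL
#define CL_DEVICE_PARTITION_BY_NAMES_INTEL   0x4052
#define CL_PARTITION_BY_NAMES_LIST_END_INTEL -1
#endif

/* cl_device_partition_property is an intptr_t, so use that type directly. */
typedef intptr_t partition_prop;

/* Step 1 check: is the pointer aligned to the size of the element type
   (for example 16 bytes for float4)? Returns 1 if aligned, 0 otherwise. */
int is_aligned_to(const void *p, size_t type_size)
{
    assert(type_size != 0);
    return ((uintptr_t)p % type_size) == 0;
}

/* Step 3: fill props with
     { BY_NAMES, first_core, ..., first_core + core_count - 1, LIST_END, 0 }.
   Returns the number of entries written, or 0 if capacity is too small.
   ASSUMPTION: the cores of one socket are numbered contiguously. */
size_t build_by_names_props(partition_prop *props, size_t capacity,
                            unsigned first_core, unsigned core_count)
{
    size_t needed = (size_t)core_count + 3;
    size_t n = 0;
    unsigned i;

    if (capacity < needed)
        return 0;
    props[n++] = CL_DEVICE_PARTITION_BY_NAMES_INTEL;
    for (i = 0; i < core_count; i++)
        props[n++] = (partition_prop)(first_core + i);
    props[n++] = CL_PARTITION_BY_NAMES_LIST_END_INTEL;
    props[n++] = 0;   /* terminates the property list itself */
    return n;
}

#ifdef SKETCH_WITH_OPENCL
/* Steps 1 and 2: allocate on one NUMA node with libnuma, then wrap the
   allocation in a buffer. The caller releases the buffer and then calls
   numa_free(*host_out, bytes). */
cl_mem wrap_node_buffer(cl_context ctx, int node, size_t bytes,
                        void **host_out, cl_int *err)
{
    void *host = numa_alloc_onnode(bytes, node);
    *host_out = host;
    if (host == NULL) {
        *err = CL_OUT_OF_HOST_MEMORY;
        return NULL;
    }
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          bytes, host, err);
}

/* Step 3: create one sub-device from a property list built above. */
cl_device_id make_subdevice(cl_device_id cpu, const partition_prop *props,
                            cl_int *err)
{
    cl_device_id sub = NULL;
    cl_uint returned = 0;

    *err = clCreateSubDevices(cpu, props, 1, &sub, &returned);
    return (*err == CL_SUCCESS) ? sub : NULL;
}
#endif
```

On a machine where cores 0 to 5 happen to sit on socket 0 (an assumption; something like `numactl --hardware` will show the real layout), `build_by_names_props(props, 16, 0, 6)` would produce the list for the first sub-device. A command queue created on that sub-device would then be given kernels whose buffers came from `wrap_node_buffer` with the matching node number.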

As an aside, the reason there isn't a more straightforward way to go about this is that our testing showed a relatively low return on investment: the performance impact was negligible thanks to the Intel QuickPath Interconnect technology.

If you try this and find a case where this has a significant impact, please let us know.





Doron Singer (Intel) wrote:

If you try this and find a case where this has a significant impact, please let us know.

Reductions!  As I reported here:

I haven't tested it on other bandwidth-bound applications, but I think it's generally applicable. Thank you, Doron.

