OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

Local memory optimization on CPU

I was curious about the following question:
What is the actual penalty of using local memory on a CPU? The document available on the Intel OpenCL webpage (325696-001US) states that there is "moderate" overhead. That was not very precise, so I wanted to test it.

I have written multiple versions of matrix-by-matrix multiplication kernels using different approaches to local memory optimization, and it turned out that for matrices of size 1024x1024 the versions using local memory are almost twice as fast as the unoptimized one. How can this be explained?
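For reference, the local-memory versions follow the usual tiling pattern, roughly like the sketch below (this is illustrative only, not the exact kernels from the attachment; the tile size and names are placeholders, and it assumes N is a multiple of TILE with a TILE x TILE work-group):

```c
/* OpenCL C device code (illustrative sketch).
 * Each work-group stages one TILE x TILE block of A and B in local
 * memory, then every work-item in the group reuses those compact
 * blocks instead of re-reading global memory. */
#define TILE 16

__kernel void matmul_local(__global const float *A,
                           __global const float *B,
                           __global float *C,
                           const int N)
{
    __local float Asub[TILE][TILE];
    __local float Bsub[TILE][TILE];

    const int row  = get_global_id(1);
    const int col  = get_global_id(0);
    const int lrow = get_local_id(1);
    const int lcol = get_local_id(0);

    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        /* Cooperative load: consecutive work-items load consecutive
         * elements, so each tile lands in a compact region. */
        Asub[lrow][lcol] = A[row * N + t * TILE + lcol];
        Bsub[lrow][lcol] = B[(t * TILE + lrow) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE; ++k)
            acc += Asub[lrow][k] * Bsub[k][lcol];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * N + col] = acc;
}
```

The unoptimized version is the same computation without the `__local` buffers, reading `A` and `B` directly from global memory in the inner loop.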

The results of the 1024x1024 matrix multiplication for the CPU, OpenMP, and the different kernels are available in the attachment. These were executed on an Intel Core i5-2500K CPU.

So again, my question is:

How is it possible that the local memory optimization increases the performance of a matrix multiplication kernel on a CPU?
2 Replies
It's not clear whether your question is about the CPU or the GPU, so I'll try to answer both.
On a GPU, local memory is typically very fast on-chip memory, similar to a CPU cache but controlled explicitly, so it clearly brings a performance benefit.
On a CPU, local memory is just a region of physical memory (DDR), so at first glance it should not give the same kind of performance improvement as on a GPU. But we must take into account that when a work-group uses a local buffer, all of its work-items store their data in a compact region of memory. When that data is reused by the work-items of a single work-group, the cache hit rate is high and the data is processed from cache at a very fast rate. Without local memory there is no such locality, especially when loading elements from a single column (because there is a large stride between consecutive elements).
Thank you.

I think that answers my question. I was asking about the CPU, because on the GPU it is simple and I know how it works.

I suspected that the cache hit rate influences the performance on the CPU, but without access to the internals of the Intel OpenCL implementation I was not sure.
Another question is why the results for the Intel OpenCL platform are so good (better than one would expect from a 4-core CPU), but I suspect that is due to some internal optimizations of the generated code.