Local memory optimization on CPU

t4deusz · ‎05-29-2012

I was curious about the following question:
What is the actual penalty of using local memory optimization on a CPU? The document available on the Intel OpenCL webpage (325696-001US) states that there is a "moderate" overhead. That is not very precise, so I wanted to measure it.

I wrote several matrix-matrix multiplication kernels that use different approaches to local memory optimization. For matrices of size 1024x1024, the kernels that use local memory turned out to be almost twice as fast as the ones without the optimization. How can this be explained?

The results for 1024x1024 matrix multiplication for CPU, OpenMP, and the different kernels are available in the attachment. They were run on an Intel i5 2500k CPU.

So again, my question is:

How is it possible that using local memory optimization in a matrix multiplication kernel increases performance on the CPU?

EvgeniyPeshkov · ‎06-21-2012

It's not clear whether your question is about the CPU or the GPU, so I'll try to answer both.

On a GPU, local memory is usually very fast on-chip memory, similar to a CPU cache but controlled explicitly, so it is clear why it improves performance.

On a CPU, local memory is just a region of ordinary physical memory (DDR), so at first glance there is no reason to expect the kind of improvement seen on a GPU. But a work-group that uses a local memory buffer stores its data in a compact region of memory. When the work-items of that work-group reuse the data, most accesses hit the cache, and the data is processed from cache at a very high rate. Without local memory the cache hit rate is worse, especially when loading the elements of a single column, because consecutive elements are separated by a large stride.
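The effect described above can be imitated in plain C, without any OpenCL runtime. This is only a minimal sketch, not the kernels from the attachment: `matmul_naive`, `matmul_tiled` and `TILE` are hypothetical names, and the small arrays `tileA` and `tileB` merely stand in for `__local` buffers. It assumes row-major square matrices whose size is a multiple of `TILE`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 16  /* assumed work-group edge; one 16x16 float tile is 1 KB */

/* Naive version: for each output element the inner loop walks down a
   column of B, so consecutive reads are n * sizeof(float) bytes apart. */
void matmul_naive(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

/* Tiled version: each (bi, bj) block plays the role of one work-group.
   Tiles of A and B are first copied into small contiguous buffers and
   then reused TILE times each from there. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    float tileA[TILE][TILE];
    float tileB[TILE][TILE];
    float acc[TILE][TILE];

    assert(n % TILE == 0);  /* keeps the sketch free of edge handling */

    for (size_t bi = 0; bi < n; bi += TILE)
        for (size_t bj = 0; bj < n; bj += TILE) {
            memset(acc, 0, sizeof acc);
            for (size_t bk = 0; bk < n; bk += TILE) {
                /* "Load into local memory": strided reads happen once. */
                for (size_t r = 0; r < TILE; ++r)
                    for (size_t c = 0; c < TILE; ++c) {
                        tileA[r][c] = A[(bi + r) * n + (bk + c)];
                        tileB[r][c] = B[(bk + r) * n + (bj + c)];
                    }
                /* Compute on the compact copies only. */
                for (size_t r = 0; r < TILE; ++r)
                    for (size_t c = 0; c < TILE; ++c) {
                        float s = acc[r][c];
                        for (size_t k = 0; k < TILE; ++k)
                            s += tileA[r][k] * tileB[k][c];
                        acc[r][c] = s;
                    }
            }
            /* Write the finished block back to C. */
            for (size_t r = 0; r < TILE; ++r)
                for (size_t c = 0; c < TILE; ++c)
                    C[(bi + r) * n + (bj + c)] = acc[r][c];
        }
}
```

For 1024x1024 floats one row is 4096 bytes, so in the naive loop every step down a column of B lands on a different cache line (assuming 64-byte lines). In the tiled loop the three 1 KB buffers are small enough to stay in a first-level cache, and every element copied into a tile is read `TILE` times before it is replaced.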

t4deusz · ‎07-02-2012

Thank you.

I think that answers my question. I was asking about the CPU, because the GPU case is simple and I know how it works.

I suspected that the cache hit rate influences performance on the CPU, but without access to the internals of the Intel OpenCL implementation I could not be sure.
Another issue is why the results for the Intel OpenCL platform are so good (better than they should be on a 4-core CPU), but I suspect that is due to some internal code optimizations.