Hello,
I am trying to write OpenCL kernels that stress the memory bandwidth of my Intel MIC card in order to understand program behavior. I started by looking at the STREAM benchmark, which is written in OpenMP. Following the instructions here (https://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad), I was able to get 133.675 GB/s for triad, which is a reasonable number given that my 3120A card has a peak bandwidth of 240 GB/s. The 5110P gets 171 GB/s, and its peak is 352 GB/s, as published by Intel in the link above.
I am trying to replicate the same results using OpenCL. However, the numbers I am getting are significantly lower. The maximum I was able to sustain on the MIC card was 32.7 GB/s, which is almost one fourth of what I can get with the OpenMP code. I also ran STREAM on a Xeon E5-2620 to get a sense of how different the numbers might be, and found the difference there was fairly marginal.
Is there any OpenCL-specific optimization related to memory bandwidth? I get the best results from the OpenMP code when using 56 threads on the MIC, so I am running the OpenCL code with a global work size of 56*32. Is that the wrong way to go?
Is there an available implementation of the STREAM Benchmark on OpenCL?
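For reference, a minimal STREAM-triad kernel in OpenCL C might look like the sketch below (one element per work-item; the buffer names and scalar are illustrative, not from any particular STREAM port):

```c
/* triad.cl -- the host chooses the NDRange, e.g. global size
 * equal to the number of array elements. On devices where double
 * precision needs the cl_khr_fp64 extension, enable it first:
 * #pragma OPENCL EXTENSION cl_khr_fp64 : enable
 */
__kernel void triad(__global const double *b,
                    __global const double *c,
                    __global double *a,
                    const double scalar)
{
    size_t i = get_global_id(0);
    a[i] = b[i] + scalar * c[i];
}
```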
Thank you,
I don't recall having come across an OpenCL implementation of STREAM, but it has not been high on my priority list. There is a CUDA version that runs quite well, though CUDA requires far fewer lines of code than OpenCL does.
There are a few things to watch out for in an OpenCL version:
- The OpenCL compilation process might not generate streaming (non-temporal) stores. This can cut performance by ~25%.
- On a machine with high memory bandwidth and small memory (like Xeon Phi), you have to be careful of the overheads of thread coordination at the beginning and end of parallel sections.
- With OpenMP the overhead of a "#pragma omp parallel for" is something like 20 microseconds for 240 threads, and the corresponding overhead of an OpenMP barrier is about 10 microseconds for 240 threads.
- STREAM requires that each array be at least 4x the size of the aggregate cache, so 4 x 30 MiB = 120 MiB, or approximately 15 million double-precision elements, but I have found that for good performance the arrays need to be at least 10x this large. I think that thread coordination overhead is a significant contributor to this sensitivity to array size.
I should probably build (or acquire) offload versions of STREAM using the Intel offload syntax and using the OpenMP version 4 offload syntax. This is not a programming model that I use much myself, but it is a reasonable choice for heterogeneous systems.