Hello,
I am trying to write OpenCL kernels that stress the memory bandwidth of my Intel MIC card in order to understand program behavior. I started by looking at the STREAM benchmark, which is written in OpenMP. Following the instructions here (https://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad), I was able to get 133.675 GB/s for triad, which is a reasonable number given that my 3120A card has a peak bandwidth of 240 GB/s. The 5110P gets 171 GB/s, and its peak is 352 GB/s, as published by Intel in the link above.
I am trying to replicate the same results using OpenCL. However, the numbers I am getting are significantly lower. The maximum I was able to sustain on the MIC card was 32.7 GB/s, which is almost one fourth of what I can get with the OpenMP code. I also ran STREAM on a Xeon E5-2620 to get a sense of how different the numbers might be, and found the difference there was fairly marginal.
Is there any OpenCL-specific optimization related to memory bandwidth? I get the best results from the OpenMP code when using 56 threads on the MIC, so I am running the OpenCL code with a global work size of 56*32. Is that the wrong way to go?
Is there an available implementation of the STREAM Benchmark on OpenCL?
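For reference, a minimal STREAM-triad kernel in OpenCL C might look like the sketch below (one element per work-item; the buffer names and scalar are illustrative, not from any particular STREAM port):

```c
/* triad.cl -- the host chooses the NDRange, e.g. global size
 * equal to the number of array elements. On devices where double
 * precision needs the cl_khr_fp64 extension, enable it first:
 * #pragma OPENCL EXTENSION cl_khr_fp64 : enable
 */
__kernel void triad(__global const double *b,
                    __global const double *c,
                    __global double *a,
                    const double scalar)
{
    size_t i = get_global_id(0);
    a[i] = b[i] + scalar * c[i];
}
```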
Thank you,
I don't recall having come across an OpenCL implementation of STREAM, but it has not been high on my priority list. There is a CUDA version that runs quite well, though CUDA requires far fewer lines of code than OpenCL does.
There are a few things to watch out for in an OpenCL version:
- The OpenCL compilation process might not generate streaming (non-temporal) stores. This can cut performance by ~25%.
- On a machine with high memory bandwidth and small memory (like Xeon Phi), you have to be careful of the overheads of thread coordination at the beginning and end of parallel sections.
- With OpenMP the overhead of a "#pragma omp parallel for" is something like 20 microseconds for 240 threads, and the corresponding overhead of an OpenMP barrier is about 10 microseconds for 240 threads.
- STREAM requires that each array be at least 4x the size of the aggregate cache, so 4 x 30 MiB = 120 MiB, or approximately 15 million double-precision elements, but I have found that for good performance the arrays need to be at least 10x this large. I think that thread coordination overhead is a significant contributor to this sensitivity to array size.
I should probably build (or acquire) offload versions of STREAM using the Intel offload syntax and using the OpenMP version 4 offload syntax. This is not a programming model that I use much myself, but it is a reasonable choice for heterogeneous systems.