I'm trying to reach maximum (or at least high) memory bandwidth with a STREAM-like benchmark based on OpenCL. The best I have been able to achieve is about 35GB/s. With the same benchmark on an Nvidia Titan and an AMD W9000 I get close to peak performance.
Has anybody implemented a STREAM-like benchmark for Intel MIC using OpenCL and seen good performance?
Thanks, Sebastian
Just as an update, the kernel code I used can be found here: https://github.com/sschaetz/aura/blob/a72fbf56470c553794f0d20da1354d31c7a925be/bench/peak.cc (kernels peak_copy, peak_scale etc).
Sebastian,
Things are pretty quiet here so I won't be able to get you an answer until next week.
--
Taylor
Sebastian,
Thanks for your question.
I've looked at the code and noticed that the memory accesses are strided with a large stride. Xeon Phi performs best with a consecutive memory access pattern.
What local and global sizes did you use in your measurements?
A more efficient approach for Xeon Phi would be:
Please use a big local size (the maximum supported is 8K). Please make sure to create enough work-groups (at the bare minimum, as many as there are compute units).
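As a rough illustration of the sizing advice above, here is a minimal host-side sketch in plain C. The helper `plan_launch` and the struct `launch_plan` are hypothetical names, not part of OpenCL or aura. The sketch assumes a one-dimensional launch and simply halves the local size until it divides the global size and leaves at least one work-group per compute unit. The real limits would come from `clGetDeviceInfo` with `CL_DEVICE_MAX_WORK_GROUP_SIZE` and `CL_DEVICE_MAX_COMPUTE_UNITS`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the result of choosing a launch configuration. */
typedef struct {
    size_t local_size;  /* work-items per work-group */
    size_t num_groups;  /* work-groups in the whole launch */
} launch_plan;

/*
 * Start from the largest allowed local size and halve it until
 *   (a) it divides the global size n evenly, and
 *   (b) there are at least as many work-groups as compute units.
 * Falls back to local_size == 1 if no larger size satisfies both.
 */
launch_plan plan_launch(size_t n, size_t compute_units, size_t max_local)
{
    assert(n > 0 && compute_units > 0 && max_local > 0);

    size_t local = max_local;
    while (local > 1 && (n % local != 0 || n / local < compute_units)) {
        local /= 2;
    }

    launch_plan plan = { local, n / local };
    return plan;
}
```

For example, with 1024x1024 work-items, a maximum local size of 8192, and an assumed 240 compute units (an illustrative number, not a measured one), this picks a local size of 4096 and 256 work-groups. Whether that is actually the fastest configuration still has to be measured.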
Please update here with your findings.
Arik
Thanks for your answers. I tested a few new things. I now get about 100GB/s using the following kernel, which uses a blocking trick:
AURA_KERNEL void peak_copy(AURA_GLOBAL float * dst, AURA_GLOBAL float * src) {
    const int bsize = 32;
    const int mult = 64;
    int id = (get_mesh_id() / bsize) * bsize * mult + get_mesh_id() % bsize;
    for (int32_t i = 0; i < mult; i++) {
        dst[id + i * bsize] = src[id + i * bsize];
    }
}
I launch 1024x1024 threads with a work group size between 16 and 1024.
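To make the index math of the kernel above easier to follow, here is a small plain C sketch that emulates it on the host. It assumes `get_mesh_id()` returns a flat, zero-based global work-item id. `block_index` and `covers_each_element_once` are hypothetical helper names introduced only for this illustration. The sketch shows that 32 neighbouring work-items touch 32 neighbouring floats in every loop iteration, and that each element is written exactly once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { BSIZE = 32, MULT = 64 };  /* same constants as in the kernel */

/* Element touched by work-item gid in loop iteration i. */
size_t block_index(size_t gid, size_t i)
{
    return (gid / BSIZE) * BSIZE * MULT + gid % BSIZE + i * BSIZE;
}

/*
 * Returns 1 if num_items work-items together touch every element of
 * [0, num_items * MULT) exactly once, 0 otherwise.
 * num_items must be a non-zero multiple of BSIZE.
 */
int covers_each_element_once(size_t num_items)
{
    assert(num_items > 0 && num_items % BSIZE == 0);

    size_t total = num_items * MULT;
    unsigned char *hits = calloc(total, 1);
    if (hits == NULL) {
        return 0;
    }

    int ok = 1;
    for (size_t gid = 0; gid < num_items; gid++) {
        for (size_t i = 0; i < MULT; i++) {
            size_t idx = block_index(gid, i);
            if (idx >= total || hits[idx] != 0) {
                ok = 0;        /* out of range or touched twice */
            } else {
                hits[idx] = 1;
            }
        }
    }

    free(hits);
    return ok;
}
```

Under these assumptions, each group of 32 work-items walks a contiguous chunk of 32 * 64 = 2048 floats, 32 consecutive floats at a time.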
Arik Narkis, with your approach I get about 80GB/s. Is there an OpenCL kernel somewhere that gets peak bandwidth? I'd really like to start from something like that, I'm not getting anywhere currently.
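For reference when comparing the numbers in this thread, here is a minimal sketch of how a copy bandwidth figure is commonly derived. It assumes the usual STREAM convention for copy (one read plus one write per element) and 1 GB = 1e9 bytes. The thread does not say how the figures above were computed, so this is an assumption, and `copy_bandwidth_gbs` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bandwidth of a float copy in GB/s (1 GB = 1e9 bytes), counting one
 * read and one write per element.
 */
double copy_bandwidth_gbs(size_t n_elems, double seconds)
{
    assert(seconds > 0.0);
    double bytes = 2.0 * (double)n_elems * (double)sizeof(float);
    return bytes / seconds / 1e9;
}
```

If 1024x1024 work-items each copy 64 floats, that is 67,108,864 elements or 536,870,912 bytes moved in total, so 100GB/s would correspond to a kernel time of roughly 5.4 ms.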