OpenCL and Bandwidth

Sebastian_S_ — Wed, 27 Nov 2013 14:17:33 GMT

I'm trying to reach high memory bandwidth on Intel MIC with a STREAM-like benchmark written in OpenCL. The best I have been able to achieve is about 35 GB/s. With the same benchmark on an Nvidia Titan and an AMD W9000 I get close to peak performance.

Has anybody implemented a STREAM-like benchmark for Intel MIC using OpenCL and seen good performance?

Thanks, Sebastian

Sebastian_S_ — Thu, 28 Nov 2013 08:00:20 GMT

Just as an update, the kernel code I used can be found here: https://github.com/sschaetz/aura/blob/a72fbf56470c553794f0d20da1354d31c7a925be/bench/peak.cc (kernels peak_copy, peak_scale, etc.).

TaylorIoTKidd — Sat, 30 Nov 2013 01:47:11 GMT

Sebastian,

Things are pretty quiet here so I won't be able to get you an answer until next week.

--
Taylor

Arik_N_Intel — Sun, 01 Dec 2013 11:39:57 GMT

Sebastian,

Thanks for your question.

I've looked at the code and noticed that the memory accesses use a large stride. Xeon Phi performs best with a consecutive memory access pattern, where neighbouring work-items touch neighbouring addresses.

Which local and global sizes did you use in your measurements?

A more efficient approach for Xeon Phi would be:

AURA_KERNEL void peak_copy(AURA_GLOBAL float * dst, AURA_GLOBAL float * src) {
    int id = get_global_id(0); // can be extended for multiple dimensions
    dst[id] = src[id];
}

Please use a large local size (the maximum supported is 8K). Please also make sure to create enough work-groups (at the bare minimum, as many as there are compute units).
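
As a rough illustration of the sizing advice above, here is a host-side C++ sketch that picks the largest local size which still leaves at least one work-group per compute unit. The helper names `num_work_groups` and `pick_local_size` are made up for this example and are not part of OpenCL or aura. In a real program the two limits would presumably come from `clGetDeviceInfo` with `CL_DEVICE_MAX_WORK_GROUP_SIZE` and `CL_DEVICE_MAX_COMPUTE_UNITS`.

```cpp
#include <cstddef>

// Hypothetical helper: number of work-groups in a 1-D launch.
// Assumes global_size is a multiple of local_size, as OpenCL 1.x requires.
inline std::size_t num_work_groups(std::size_t global_size, std::size_t local_size) {
    return global_size / local_size;
}

// Hypothetical helper: largest local size <= max_local that divides
// global_size and still yields at least `compute_units` work-groups.
// Returns 0 if no such local size exists.
inline std::size_t pick_local_size(std::size_t global_size,
                                   std::size_t max_local,
                                   std::size_t compute_units) {
    for (std::size_t local = max_local; local >= 1; --local) {
        if (global_size % local != 0) continue;  // must divide evenly
        if (num_work_groups(global_size, local) >= compute_units) return local;
    }
    return 0;
}
```

For example, with a global size of 1024 * 1024, a maximum local size of 8192, and an assumed 240 compute units (an illustrative number, not a queried one), `pick_local_size` returns 4096, which gives 256 work-groups.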

Please update here with your findings.

Arik

Sebastian_S_ — Sun, 01 Dec 2013 16:42:38 GMT

Thanks for your answers. I tested a few new things. I now get about 100 GB/s using the following kernel, which uses a blocking trick:

AURA_KERNEL void peak_copy(AURA_GLOBAL float * dst, AURA_GLOBAL float * src) {
    const int bsize = 32;
    const int mult = 64;
    int id = (get_mesh_id() / bsize)*bsize*mult + get_mesh_id() % bsize;
    for(int32_t i=0; i<mult; i++) {
        dst[id + i * bsize] = src[id + i * bsize];
    }
}

I launch 1024x1024 threads with a work group size between 16 and 1024.
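
To make the index arithmetic of the kernel above easier to check, here is a small host-side C++ sketch that replays it on the CPU. It assumes that `get_mesh_id()` returns a flat, zero-based global work-item index, which is not confirmed in this thread. The function names are invented for the illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Parameters copied from the kernel.
constexpr int kBsize = 32;
constexpr int kMult = 64;

// First element touched by work-item `gid` (mirrors the `id` computation).
// Assumption: `gid` stands in for get_mesh_id() as a flat global index.
inline std::size_t block_base(std::size_t gid) {
    return (gid / kBsize) * kBsize * kMult + gid % kBsize;
}

// Element touched by work-item `gid` in loop iteration `i`.
inline std::size_t block_index(std::size_t gid, int i) {
    return block_base(gid) + static_cast<std::size_t>(i) * kBsize;
}

// True if `num_items` work-items touch every element of
// [0, num_items * kMult) exactly once.
inline bool covers_exactly_once(std::size_t num_items) {
    std::vector<std::uint8_t> hits(num_items * kMult, 0);
    for (std::size_t gid = 0; gid < num_items; ++gid) {
        for (int i = 0; i < kMult; ++i) {
            const std::size_t idx = block_index(gid, i);
            if (idx >= hits.size() || hits[idx] != 0) return false;
            hits[idx] = 1;
        }
    }
    return true;
}

// Bytes moved by one copy pass: one read plus one write per float.
inline std::size_t copy_bytes(std::size_t num_items) {
    return 2 * num_items * kMult * sizeof(float);
}
```

Under that assumption, neighbouring work-items within a block of 32 read neighbouring floats in every loop iteration, and each block of 32 work-items walks its own contiguous range of 32 * 64 floats. For the 1024x1024 launch, `copy_bytes(1024 * 1024)` gives 536870912 bytes (512 MiB) per pass, which is the number to divide by the measured kernel time to get a bandwidth figure.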

Arik, with your approach I get about 80 GB/s. Is there an OpenCL kernel somewhere that reaches peak bandwidth? I'd really like to start from something like that, because I'm not getting anywhere at the moment.