OpenCL and Bandwidth

Sebastian_S_
Beginner
519 Views

I'm trying to reach high memory bandwidth with a STREAM-like benchmark written in OpenCL. The best I have been able to achieve is about 35 GB/s. With the same benchmark on an Nvidia Titan and an AMD W9000 I get close to peak performance.
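For reference, a STREAM-style copy is conventionally scored by counting one read and one write per element. The following is a minimal C sketch of that arithmetic, not the code from the benchmark itself; the function name `copy_bandwidth_gbs` is hypothetical, and it assumes `float` elements and 1 GB = 1e9 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) of a copy kernel that
   reads n floats and writes n floats in `seconds`.
   Hypothetical helper for illustration only. */
double copy_bandwidth_gbs(size_t n, double seconds)
{
    assert(seconds > 0.0);
    /* One load plus one store per element. */
    double bytes = 2.0 * (double)n * (double)sizeof(float);
    return bytes / seconds / 1e9;
}
```

The time passed in should cover only kernel execution (for example from OpenCL profiling events), not host-to-device transfers.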

Has anybody implemented a STREAM-like benchmark for Intel MIC using OpenCL and seen good performance?

Thanks, Sebastian

4 Replies
Sebastian_S_
Beginner

Just as an update, the kernel code I used can be found here: https://github.com/sschaetz/aura/blob/a72fbf56470c553794f0d20da1354d31c7a925be/bench/peak.cc (kernels peak_copy, peak_scale etc).

TaylorIoTKidd
New Contributor I

Sebastian,

Things are pretty quiet here so I won't be able to get you an answer until next week.

--
Taylor
 

Arik_N_Intel
Employee

Sebastian,

Thanks for your question.

I've looked at the code and noticed that the memory accesses are strided with a large stride. Xeon Phi performs best with a consecutive memory access pattern.

What local and global sizes did you use in your measurements?

A more efficient approach for Xeon Phi would be:

AURA_KERNEL void peak_copy(AURA_GLOBAL float * dst, AURA_GLOBAL float * src) {
  int id = get_global_id(0);   // can be extended for multiple dimensions
  dst[id] = src[id];
}

Please use a large local size (the maximum supported is 8K), and make sure you create enough work-groups (at the bare minimum, as many as there are compute units).
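One way to turn that advice into launch parameters is sketched below in plain C. This is only an illustration under assumptions: `launch_cfg` and `pick_launch` are hypothetical names, and the limits would normally come from `clGetDeviceInfo` with `CL_DEVICE_MAX_WORK_GROUP_SIZE` and `CL_DEVICE_MAX_COMPUTE_UNITS`. It starts from the largest local size and halves it until there is at least one work-group per compute unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of OpenCL or aura. */
typedef struct {
    size_t global;  /* total work-items, a multiple of local */
    size_t local;   /* work-group size */
    size_t groups;  /* number of work-groups */
} launch_cfg;

/* Pick a large local size while keeping at least `compute_units`
   work-groups for an n-element, one-element-per-work-item kernel. */
launch_cfg pick_launch(size_t n, size_t max_local, size_t compute_units)
{
    assert(n > 0 && max_local > 0 && compute_units > 0);

    launch_cfg cfg;
    cfg.local = max_local;
    cfg.groups = (n + cfg.local - 1) / cfg.local;

    /* Halve the work-group size until every compute unit gets a group. */
    while (cfg.local > 1 && cfg.groups < compute_units) {
        cfg.local /= 2;
        cfg.groups = (n + cfg.local - 1) / cfg.local;
    }

    /* OpenCL 1.x requires global % local == 0, so round up. If global
       ends up larger than n, the kernel needs an `if (id < n)` guard,
       which the kernel above does not have. */
    cfg.global = cfg.groups * cfg.local;
    return cfg;
}
```

For instance, with 1024*1024 elements, a maximum local size of 8192 and an assumed 240 compute units, this yields a local size of 4096 and 256 work-groups. Whether that is the fastest configuration still has to be measured.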

Please update here with your findings.

Arik

Sebastian_S_
Beginner

Thanks for your answers. I tested a few new things. I now get about 100 GB/s using the following kernel, which uses a block trick:

AURA_KERNEL void peak_copy(AURA_GLOBAL float * dst, AURA_GLOBAL float * src) {
  const int bsize = 32;
  const int mult = 64;
  int id = (get_mesh_id() / bsize)*bsize*mult + get_mesh_id() % bsize;
  for(int32_t i=0; i<mult; i++) {
    dst[id + i * bsize] = src[id + i * bsize];
  }
}

I launch 1024x1024 threads with a work group size between 16 and 1024.
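The index arithmetic of this kernel can be checked on the host. The C sketch below mirrors it under the assumption that `get_mesh_id()` returns a flat, zero-based work-item index; `block_index` and `covers_exactly_once` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BSIZE ((size_t)32)  /* same value as bsize in the kernel */
#define MULT  ((size_t)64)  /* same value as mult in the kernel */

/* Element touched by work-item `gid` in loop iteration `i`,
   mirroring the kernel's index computation. */
size_t block_index(size_t gid, size_t i)
{
    assert(i < MULT);
    size_t base = (gid / BSIZE) * BSIZE * MULT + gid % BSIZE;
    return base + i * BSIZE;
}

/* Returns 1 if `threads` work-items touch each of threads*MULT elements
   exactly once. `hits` must be a zeroed buffer of threads*MULT bytes. */
int covers_exactly_once(size_t threads, unsigned char *hits)
{
    assert(threads % BSIZE == 0);
    size_t total = threads * MULT;
    for (size_t gid = 0; gid < threads; gid++) {
        for (size_t i = 0; i < MULT; i++) {
            size_t idx = block_index(gid, i);
            if (idx >= total || hits[idx] != 0)
                return 0;  /* out of range or touched twice */
            hits[idx] = 1;
        }
    }
    return 1;
}
```

Under that assumption, 32 consecutive work-items access 32 consecutive floats (128 bytes) in every loop iteration, and each group of 32 work-items owns a contiguous chunk of 32*64 = 2048 floats. If 1024x1024 is the total work-item count, each buffer holds 2^26 floats, or 256 MiB.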

Arik Narkis, with your approach I get about 80 GB/s. Is there an OpenCL kernel somewhere that reaches peak bandwidth? I'd really like to start from something like that; I'm not getting anywhere at the moment.
