I'm trying to get maximum/high memory bandwidth with a STREAM-like benchmark based on OpenCL. The maximum I am able to achieve on the MIC is about 35 GB/s. With the same benchmark on an Nvidia Titan and an AMD W9000 I get close to peak performance.
Has anybody implemented a STREAM-like benchmark for Intel MIC using OpenCL and seen good performance?
Thanks, Sebastian
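For reference, the bandwidth figure a STREAM-style copy reports is usually computed from the fact that each element is read once and written once. The sketch below is my own illustration; the function name and the GB (1e9 bytes) convention are assumptions, not taken from the benchmark code:

```c
#include <stddef.h>

/* STREAM-style copy accounting: each of the n floats is read once and
 * written once, so bytes moved = 2 * n * sizeof(float). Dividing by the
 * elapsed time and 1e9 gives GB/s. (Illustrative sketch, my naming.) */
static double copy_bandwidth_gbs(size_t n, double seconds) {
    return 2.0 * (double)n * sizeof(float) / seconds / 1e9;
}
```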
Just as an update, the kernel code I used can be found here: https://github.com/sschaetz/aura/blob/a72fbf56470c553794f0d20da1354d31c7a925be/bench/peak.cc (kernels peak_copy, peak_scale etc).
Sebastian,
Things are pretty quiet here so I won't be able to get you an answer until next week.
--
Taylor
Sebastian,
Thanks for your question.
I've looked at the code and noticed that the memory accesses are strided with a big stride. Xeon Phi performs best with a consecutive memory access pattern.
What local and global sizes did you use in your measurements?
A more efficient approach for Xeon Phi would be:
Please use a big local size (the maximum supported is 8K), and please make sure to create enough work groups (at the bare minimum, the number of compute units).
Please update here with your findings.
Arik
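For illustration, the consecutive access pattern described above can be modeled on the host like this (a hypothetical sketch of the pattern, not Intel's code; the loop index stands in for the OpenCL global work-item ID):

```c
#include <stddef.h>

/* Host-side model of a consecutive-access copy kernel: "work-item" gid
 * copies element gid, so adjacent work-items read and write adjacent
 * floats. Sketch only -- names and structure are assumptions. */
static void copy_consecutive(float *dst, const float *src, size_t n) {
    for (size_t gid = 0; gid < n; ++gid)  /* stands in for get_global_id(0) */
        dst[gid] = src[gid];
}
```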
Thanks for your answers. I tested a few new things. I now get about 100 GB/s using the following kernel, which utilizes a block trick:
AURA_KERNEL void peak_copy(AURA_GLOBAL float * dst, AURA_GLOBAL float * src) {
    const int bsize = 32;
    const int mult = 64;
    int id = (get_mesh_id() / bsize) * bsize * mult + get_mesh_id() % bsize;
    for (int32_t i = 0; i < mult; i++) {
        dst[id + i * bsize] = src[id + i * bsize];
    }
}
I launch 1024x1024 threads with a work group size between 16 and 1024.
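As a sanity check on the index arithmetic in that kernel, here is a serial host-side model (my sketch, not part of the benchmark): a group of bsize work-items jointly covers one contiguous block of bsize*mult floats, each element exactly once.

```c
#include <stddef.h>

enum { BSIZE = 32, MULT = 64 };

/* Serial model of the peak_copy index mapping: "work-item" gid copies
 * MULT floats at stride BSIZE; BSIZE work-items together cover one
 * contiguous block of BSIZE*MULT floats. nthreads must be a multiple
 * of BSIZE. (Host-side sketch for checking coverage, not device code.) */
static void peak_copy_model(float *dst, const float *src, int nthreads) {
    for (int gid = 0; gid < nthreads; ++gid) {
        int id = (gid / BSIZE) * BSIZE * MULT + gid % BSIZE;
        for (int i = 0; i < MULT; ++i)
            dst[id + i * BSIZE] = src[id + i * BSIZE];
    }
}
```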
Arik Narkis, with your approach I get about 80 GB/s. Is there an OpenCL kernel somewhere that gets peak bandwidth? I'd really like to start from something like that; I'm not getting anywhere currently.
