Hi all, I ran into some very strange behavior. I ran the following code:
t0 = get_time();
clEnqueueWriteBuffer(queue, mem, CL_FALSE, 0, 1.8*GB, host, 0, NULL, NULL);
printf("%lf secs", get_time() - t0);
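For context, get_time() in the snippet is a wall-clock timer that returns seconds as a double. Its body is not shown here, so the following is only a minimal sketch of such a helper, assuming POSIX clock_gettime with CLOCK_MONOTONIC (the actual helper may differ):

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict C modes */
#endif
#include <time.h>

/* Hypothetical implementation of get_time(): wall-clock seconds as a double.
   Assumes a POSIX system; CLOCK_MONOTONIC is not affected by clock adjustments. */
double get_time(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

With a timer like this, the printed value is the host-side time spent inside the clEnqueueWriteBuffer call itself, not the time the transfer takes to complete.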
The evaluation system has 4 Intel Xeon Phi 5110P coprocessors, with Intel OpenCL runtime 14.2 and MPSS 3.4.2.
When I ran the code using MPI, that is, with 4 MPI tasks, each task showed about 0.0000x secs.
But when I ran the code using threads, such as 4 OpenMP threads, it showed about 5 secs, even though it is only enqueuing a non-blocking command.
Does anyone have any idea why?