How does VTune compute bandwidth?

HUIZHAN_Y_ · ‎06-25-2015

Hi, I am analyzing the simulated annealing program (canneal) from PARSEC. The program often accesses elem data randomly, so it performs poorly. I added a prefetch instruction for elem, and I am glad to see that the time of the multithreaded parallel region dropped from 31 seconds to 15 seconds, which is a good result. I currently prefetch the data one iteration ahead, and I hoped to get more improvement, but adjusting the prefetch parameter did not give a much better result. So I suspected that prefetching one iteration ahead already uses up all the bandwidth. I then ran the VTune bandwidth analysis on the prefetching version and found that the bandwidth increased only slightly, from 3.004 GB/s to 3.268 GB/s (for a single package). This result does not look right to me. Adding prefetching does not increase the amount of data loaded, and the time is halved from 31 s to 15 s; since bandwidth should equal DATA_SIZE/TIME, the bandwidth should have doubled. Does anyone have ideas about this result, or know how VTune computes bandwidth?
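For readers who want to see what "prefetching one iteration ahead" looks like, here is a minimal sketch. It is not the canneal code: the function name gather_sum, the index array idx, and the tuning parameter dist are all hypothetical, and it assumes a compiler that provides the GCC/Clang builtin __builtin_prefetch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sum data[idx[i]] for i in [0, n), prefetching the element that will be
 * needed `dist` iterations later. `dist` is a hypothetical tuning knob;
 * dist = 1 corresponds to prefetching one iteration ahead.
 */
double gather_sum(const double *data, const size_t *idx, size_t n, size_t dist)
{
    assert(data != NULL && idx != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n) {
            /* GCC/Clang builtin: rw = 0 (read), locality = 1 (low reuse). */
            __builtin_prefetch(&data[idx[i + dist]], 0, 1);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch only hints to the hardware and does not change the result, so different values of dist can be compared purely on run time.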

I ran an experiment with the following program:

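#include <stdio.h>
#include <unistd.h>

/* 100M doubles per array (800 MB each); static data this large
   may need -mcmodel=medium with GCC on x86-64. */
#define SIZE (100*1024*1024)

double a[SIZE];
double b[SIZE];
double c[SIZE];

int
main()
{
    int i;

    // initialization
    for (i = 0; i < SIZE; i++) {
        a[i] = 0.;
        b[i] = 0.;
        c[i] = 0.;
    }
    sleep(2);

    // contiguous access
    for (i = 0; i < SIZE; i++) {
        a[i] = b[i] + c[i];
    }
    sleep(2);

    // strided access: every 8th double
    for (i = 0; i < SIZE; i += 8) {
        a[i] = b[i] + c[i];
    }
    sleep(2);

    return 0;
}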

VTune reports a bandwidth of 4.16 GB/s for the contiguous-access part, which runs for 0.7 s; my own estimate is 4.46 GB/s (100M*4*8/0.7). For the strided access, VTune reports 8.3 GB/s with a running time of 0.4 s; my estimate is 7.8 GB/s (100M*4*8/0.4). So my method of computing bandwidth seems to be almost right.
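The estimate above can be written as a small helper. This is only a sketch: the function name estimated_gib_per_s is hypothetical, and the factor of 4 "streams" is taken directly from the formula in the post (one plausible reading, stated here as an assumption, is two reads for b and c plus a read-for-ownership and a write-back for a). The result is in GiB/s, which is the unit that reproduces the 4.46 and 7.8 figures.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Estimated bandwidth in GiB/s for a loop that moves
 * elements * bytes_per_element * streams bytes in `seconds`.
 * `streams` is the assumed number of memory transfers per element.
 */
double estimated_gib_per_s(size_t elements, size_t bytes_per_element,
                           unsigned streams, double seconds)
{
    assert(seconds > 0.0);
    double bytes = (double)elements * (double)bytes_per_element * (double)streams;
    return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

Calling it with 100*1024*1024 elements, 8 bytes, 4 streams and 0.7 s gives about 4.46; with 0.4 s it gives about 7.81. Using the same byte count for the strided loop assumes 64-byte cache lines, so that a stride of 8 doubles still touches every line.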

HUIZHAN_Y_ · ‎06-25-2015

I think I understand where the problem is. DATA_SIZE may have changed after adding prefetching. With prefetching, the cache may hold more useful data, so data locality improves. Without prefetching, the same data has to be loaded from memory multiple times, while with prefetching some of it can be reused from the cache. So the improved data locality leads to better performance, while the memory bandwidth does not increase significantly.
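A quick way to check this idea is to turn the relation around and compute the traffic implied by each measurement, DATA_SIZE = BANDWIDTH * TIME. The sketch below uses a hypothetical helper name, implied_traffic_gb, and assumes the reported bandwidth is an average over the same interval as the measured time, which may not hold exactly for the VTune numbers.

```c
#include <assert.h>

/* Data volume in GB implied by an average bandwidth (GB/s)
   sustained for a given time (s). */
double implied_traffic_gb(double avg_gb_per_s, double seconds)
{
    assert(avg_gb_per_s >= 0.0 && seconds >= 0.0);
    return avg_gb_per_s * seconds;
}
```

With the figures from the first post, 3.004 GB/s over 31 s gives roughly 93 GB, and 3.268 GB/s over 15 s gives roughly 49 GB. Under the stated assumption, the prefetching version moved about half as much data, which is consistent with a change in DATA_SIZE rather than a doubling of bandwidth.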