Software Archive
Read-only legacy content

about the memory bandwidth

Mian_L_
Beginner

Hi all,

Regarding memory bandwidth, I have noticed that the STREAM benchmark can achieve > 100 GB/s. However, after looking at the code and doing some experiments myself, I found it is very tricky to reach such high bandwidth. I think there are two major tricks:

1. It uses static global arrays rather than dynamically allocated arrays; when I use dynamically allocated arrays, the bandwidth is much lower.
2. The data is touched once before the real measurement. I know this is meant to remove some overhead, but if the data is not touched first, the bandwidth of the first scan is very low.

Finally, it uses OpenMP; when I use pthreads to run the same experiment, the bandwidth is really low (~3 GB/s). I have attached the OpenMP and pthread source code. Correct me if I am wrong about this experiment. Thanks very much!
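To illustrate what I mean, here is a minimal sketch of the pattern I am describing (this is not my attached code, and the array length is arbitrary): dynamically allocated arrays, a parallel first-touch pass, then the timed copy.

/* Minimal sketch (not the attached benchmark): dynamically allocated
 * arrays with a parallel first-touch pass before the timed copy loop. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (64 * 1024 * 1024)   /* arbitrary array length */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;

    /* First touch: maps the pages and warms up the OpenMP thread pool. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++) a[i] = b[i];          /* timed copy */
    double t1 = omp_get_wtime();

    /* read b + write a = 2 * N * 8 bytes moved */
    printf("copy: %.1f GB/s\n", 2.0 * N * sizeof(double) / (t1 - t0) / 1e9);
    free(a); free(b);
    return 0;
}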

6 Replies
Evgueni_P_Intel
Employee

Hi Mian L., following the advice in http://software.intel.com/en-us/forums/topic/382760, one should be able to reach ~150 GB/s or more. Your benchmark differs only in that it runs natively on the Phi instead of in an offload section on the Xeon. Thanks, Evgueni.

Mian_L_
Beginner

Hi Evgueni,

Thanks. Since I am using pthreads, my case is different from that article. I have tried the method described there and it achieves the normal bandwidth, but not in my pthread program.

Evgueni_P_Intel
Employee

pthreads should be affinitized using sched_setaffinity, declared in sched.h.
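For example, a minimal sketch of pinning each worker thread with sched_setaffinity (the core numbering below is just an assumption, not Phi-specific advice):

/* Sketch: pin each pthread to one logical CPU with sched_setaffinity. */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                       /* hypothetical thread count */

static void *worker(void *arg)
{
    int cpu = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)  /* pid 0 = calling thread */
        perror("sched_setaffinity");
    /* ... the bandwidth kernel would run here, now bound to `cpu` ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int cpus[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        cpus[i] = i;                     /* core numbering is an assumption */
        pthread_create(&tid[i], NULL, worker, &cpus[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}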

Charles_C_Intel1
Employee

Touching the memory before first use ensures it is mapped when you need it. This mapping can take quite a while. Since you are benchmarking memory bandwidth, I propose that it is fair to make sure all the pages are mapped and available before you try to benchmark copying data between them. :-) Statically allocated arrays are probably mapped at program startup. Dynamic arrays need to be touched after allocation to be mapped, either by an initialization loop or by first program use (remember, Linux doesn't actually map a page until you access it, which is why you can't rely on the return status of malloc() to tell whether an allocation will really succeed).
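A quick way to see this cost (an illustration only, not taken from your attached code): time a first write pass over a freshly malloc'd buffer against a second pass over the same buffer.

/* Illustration: the first write pass over freshly malloc'd memory pays
 * for page mapping; the second pass over the same pages does not. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t bytes = (size_t)1 << 30;            /* 1 GiB, arbitrary */
    char *p = malloc(bytes);
    if (!p) return 1;

    double t0 = now(); memset(p, 1, bytes); double t1 = now();
    double t2 = now(); memset(p, 2, bytes); double t3 = now();

    printf("first pass : %.2f GB/s (includes page faults)\n",
           bytes / (t1 - t0) / 1e9);
    printf("second pass: %.2f GB/s (pages already mapped)\n",
           bytes / (t3 - t2) / 1e9);
    free(p);
    return 0;
}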

Other complicating factors in your code: the memory initialization loop in the OpenMP code also serves to start up the OpenMP thread pool, so you don't benchmark the cost of thread creation and OpenMP runtime startup when you run test_omp. Your pthread code, however, is timing thread creation in addition to the work done by the workers. Maybe start the threads and have them wait on a lightweight sync object until they are all created, then start timing and give them a "go" signal?
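One way to implement such a "go" signal (a sketch only; a pthread barrier is just one lightweight option among several):

/* Sketch: create all worker threads first, let everyone rendezvous on a
 * barrier, and take the start timestamp only once the barrier releases. */
#include <pthread.h>

#define NTHREADS 4                          /* hypothetical thread count */

static pthread_barrier_t start_barrier;

static void *worker(void *arg)
{
    /* per-thread setup: affinity, buffers, etc. */
    pthread_barrier_wait(&start_barrier);   /* wait for the "go" signal */
    /* ... timed bandwidth kernel ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    /* main thread also joins the barrier, hence NTHREADS + 1 */
    pthread_barrier_init(&start_barrier, NULL, NTHREADS + 1);

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    /* arriving here is the "go" signal; record the start time
     * immediately after the barrier releases */
    pthread_barrier_wait(&start_barrier);

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}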

Mian_L_
Beginner

Hi Charles, thanks for the explanation. It is clearer to me now. However, I wonder what exactly "mapping" means here. Does it mean mapping physical addresses to virtual addresses? Thanks very much.


robert-reed
Valued Contributor II

The Intel Xeon Phi coprocessor and its host both use "virtual memory," an address space unique to each process whose addresses are "mapped" to physical pages when those pages actually exist in memory. Thus, malloc can return a range of virtual addresses that currently have no physical memory attached to them. Accessing them on Linux forces the page mapping if a free page is available, but that can take some time.
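As an illustration (a sketch, not taken from the attached code), the effect can be made visible by comparing the process's minor page-fault count before and after touching a freshly malloc'd region:

/* Sketch: malloc alone doesn't map pages; touching them triggers
 * minor page faults, which getrusage() can count. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    size_t bytes = 256UL * 1024 * 1024;     /* 256 MiB, arbitrary */
    char *p = malloc(bytes);
    if (!p) return 1;

    long before = minor_faults();
    memset(p, 0, bytes);                    /* first touch maps the pages */
    long after = minor_faults();

    printf("minor faults during first touch: %ld\n", after - before);
    free(p);
    return 0;
}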
