I am trying to measure the bus utilization on a Xeon 5400 machine, which has a 1333 MHz FSB and DDR2-667, while doing a simple memory copy with 8 threads (the machine has 2 processors with 4 cores each). The throughput of memcpy (from one large chunk of memory to another) is 3000 MB/s.
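For reference, a multi-threaded memory-copy bandwidth test of the kind described can be sketched as below. This is a minimal sketch, not the poster's actual benchmark: it assumes `ctypes.memmove` as the copy primitive (ctypes calls release the GIL, so the threads copy in parallel), and the buffer size, round count, and thread count are illustrative choices, not values from the original post.

```python
import ctypes
import threading
import time

def copy_worker(dst, src, nbytes, rounds):
    # Repeatedly copy one large chunk of memory to another.
    for _ in range(rounds):
        ctypes.memmove(dst, src, nbytes)

def measure_copy_mbps(nthreads=8, nbytes=64 * 1024 * 1024, rounds=4):
    # One private source/destination pair per thread, so all threads
    # generate memory traffic concurrently.
    bufs = [(ctypes.create_string_buffer(nbytes),
             ctypes.create_string_buffer(nbytes)) for _ in range(nthreads)]
    threads = [threading.Thread(target=copy_worker,
                                args=(dst, src, nbytes, rounds))
               for dst, src in bufs]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - t0
    total_bytes = nthreads * rounds * nbytes
    return total_bytes / elapsed / 1e6  # aggregate MB/s copied

if __name__ == "__main__":
    print(f"aggregate copy throughput: {measure_copy_mbps():.0f} MB/s")
```

Note that a serious bandwidth test would pin threads to cores and touch the buffers first to fault the pages in; this sketch omits both.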
I use oprofile to measure BUS_TRANS_ANY.ALL_AGENTS and CPU_CLK_UNHALTED.BUS. As suggested by the Intel Optimization Reference Manual, bus utilization can be measured as BUS_TRANS_ANY.ALL_AGENTS * 2 / CPU_CLK_UNHALTED.BUS * 100. When I do this, I only get 66%.
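The arithmetic of that ratio is straightforward; here is a small sketch of it, with hypothetical event counts (chosen only to reproduce the 66% figure, not taken from any real profile):

```python
def bus_utilization(bus_trans_any_all_agents, cpu_clk_unhalted_bus):
    # Intel formula quoted above: each bus transaction occupies the
    # address bus for 2 bus clocks, hence the factor of 2.
    return bus_trans_any_all_agents * 2 / cpu_clk_unhalted_bus * 100

# Hypothetical counts that would yield the ~66% the poster observed:
print(bus_utilization(330_000_000, 1_000_000_000))  # -> 66.0
```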
If I do a simple memory copy, I should be able to saturate the memory bus, right? Why do I only get 66%? Which part went wrong?
Hello Da, Can you look at the Bus Not Ready Ratio: BUS_BNR_DRV.ALL_AGENTS * 2 / CPU_CLK_UNHALTED.BUS * 100. This equation tells you what percentage of the time the bus was stalled and unable to accept new transactions.
And also look at the Data Bus Utilization: BUS_DRDY_CLOCKS.ALL_AGENTS / CPU_CLK_UNHALTED.BUS * 100.
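These two ratios can be computed the same way as the first. A sketch, again with hypothetical event counts purely to show the arithmetic:

```python
def bus_not_ready_ratio(bus_bnr_drv_all_agents, cpu_clk_unhalted_bus):
    # Percent of time the bus was stalled (driving BNR) and could not
    # accept new transactions; factor of 2 per the Intel formula.
    return bus_bnr_drv_all_agents * 2 / cpu_clk_unhalted_bus * 100

def data_bus_utilization(bus_drdy_clocks_all_agents, cpu_clk_unhalted_bus):
    # Percent of bus clocks during which data was actually being driven
    # (DRDY asserted), i.e. data bus rather than address bus utilization.
    return bus_drdy_clocks_all_agents / cpu_clk_unhalted_bus * 100

# Hypothetical counts, for illustration only:
print(bus_not_ready_ratio(50_000_000, 1_000_000_000))     # -> 10.0
print(data_bus_utilization(400_000_000, 1_000_000_000))   # -> 40.0
```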
The BUS_TRANS_ANY.ALL_AGENTS equation really reports the address bus utilization.
The bus can become too congested to accept more traffic. From my recollection, utilizations of 60% to 70% are very high. You've probably maxed out the bus at this level of utilization.
This is one of the reasons for moving to NUMA memory, integrated memory controllers, and QPI. The QPI links separate the coherency traffic from the memory traffic. Local NUMA memory with an integrated memory controller allows for more efficient memory access with lower latency and higher bandwidth. Pat
Hello Da, The 66% utilization you reported before is typical for bus saturation on FSB-based Core 2 systems. You can see that the address bus in this case is the limiter. For a 2-processor system the FSB handles a lot of coherency traffic between the processors, and there is even more coherency traffic on 4-processor systems. This was one of the main reasons for the demise of FSB-based memory systems. Sorry not to have a better answer for you. Pat