Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Some questions about QPI

Yinchao_Z_
Beginner

Hi,

I am using the pcm tool to monitor QPI performance, and I am confused about a few things.

1. With the QPI frequency set to 9.6 GT/s, what is the maximum bandwidth it can reach? I mean point to point, e.g. from socket 1 to socket 0.

2. I looked at the pcm source code to see how it calculates the QPI utilization, and it shows:

    max_bytes = (double)(double(max_speed) * double(getInvariantTSC(before, after) / double(m->getNumCores())) / double(m->getNominalFrequency()));

Can you tell me what each of these factors means? (My rough reading is sketched below.)

3. In my test results, on a 4-CPU machine the outgoing QPI traffic (in one second) can reach 23 GB at 94% utilization, but on another machine with 16 sockets it is sometimes 24 GB at only 77% utilization. The denominator seems to change all the time?

In my view, the max_bytes for one second should be the max_speed of 19.2 GB/s, but in my tests it was not.
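My rough reading of those factors, with made-up numbers just to check the units (so this may well be wrong), is:

    #include <cstdio>

    int main() {
        // made-up example numbers, just to see what the expression above seems to compute
        double tsc_ticks_all_cores = 16.0 * 2.4e9; // getInvariantTSC(before, after): TSC ticks in the interval, summed over all cores (I think)
        double num_cores           = 16.0;         // m->getNumCores()
        double nominal_hz          = 2.4e9;        // m->getNominalFrequency(): nominal TSC frequency in Hz
        double max_speed           = 19.2e9;       // bytes/s the link can carry at 9.6 GT/s (?)

        double seconds   = (tsc_ticks_all_cores / num_cores) / nominal_hz; // elapsed wall-clock time of the interval
        double max_bytes = max_speed * seconds;                            // the denominator pcm seems to use

        std::printf("interval = %.3f s, max_bytes = %.2f GB\n", seconds, max_bytes / 1e9);
        return 0;
    }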

I am really confused. Help!

Thank you!

 

Yinchao_Z_
Beginner

@John McCalpin

Can you help me?

Thank you very much!

McCalpinJohn
Honored Contributor III

QPI links have an "effective width" for data of 16 bits in each direction, so the peak data bandwidth of a link running at 9.6 GT/s is 19.2 GB/s per direction.

Socket-to-socket connections in 4-socket systems are a single link wide, while many/most 2-socket systems use two links to connect the two sockets.
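Putting rough numbers on that, just as a sanity check (using the same figures as above):

    #include <cstdio>

    int main() {
        // peak *data* bandwidth of one QPI link, per direction
        double transfers_per_sec  = 9.6e9; // 9.6 GT/s
        double bytes_per_transfer = 2.0;   // 16-bit effective data width
        double per_link = transfers_per_sec * bytes_per_transfer; // 19.2e9 bytes/s

        // per socket pair, per direction, for the simple topologies described above
        double four_socket_pair = 1.0 * per_link; // one link between each pair of sockets
        double two_socket_pair  = 2.0 * per_link; // two links between the two sockets

        std::printf("per link:       %4.1f GB/s per direction\n", per_link         / 1e9);
        std::printf("4S socket pair: %4.1f GB/s per direction\n", four_socket_pair / 1e9);
        std::printf("2S socket pair: %4.1f GB/s per direction\n", two_socket_pair  / 1e9);
        return 0;
    }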

This peak bandwidth number is not attainable for a large number of reasons.  Part of the bandwidth is used for requests (loads or snoops) and part is used for responses without data (mostly snoop responses).  Traffic going in both directions will see contention between the requests and data responses.

In many systems it is not possible to get close to the maximum bandwidth for some transaction types, which suggests that the buffers for those transaction types are inadequate. I have only tested this extensively on 2-socket systems -- the results may be completely different on larger systems, since the allocation of buffers is likely to be different. These limitations vary by processor generation and (more recently) by the "snoop mode" of the system.

  • Xeon E5 (v1, Sandy Bridge) was only able to drive ~66% of peak QPI bandwidth for reads, and much less than that for combinations of reads with either writebacks or streaming stores.
  • Xeon E5 v2 (Ivy Bridge) introduced the "Home Agent" snoop mode, which allows much better performance -- up to 80% of peak for reads, improvements of up to 100% for reads+writebacks, and improvements of ~70% for reads+streaming_stores.
  • Xeon E5 v3 (Haswell, using "Home Agent" snoop mode) retains these improvements and adds up to 25% improvement in workloads that are limited by read+streaming_store performance.

I have tested performance of 4-socket systems, but only with Xeon E5 v1 (Sandy Bridge) processors, and the performance was very low -- about 1/2 of the percentage of peak QPI bandwidth that I saw on the 2s systems (and the peak is only 1/2 because only one link is used between each pair of chips).   I don't know if this can be attributed to coherence traffic or if there is another cause.   I have not tested any 4s systems newer than Sandy Bridge (Xeon E5-4650), so I don't know how the performance has changed.....
