DDIO does not reduce Memory Read Bandwidth Despite 100% PCIRdCur Hit Rate

Mark_D_9
New Contributor I

CPU:  Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz

OS:  CentOS Linux release 7.8.2003 (Core)

Kernel:  3.10.0-1127.el7.x86_64

NIC:  Solarflare XtremeScale X2522-25G Adapter (sfc 4.15.6.1004 - Firmware 7.6.2.1006)

OpenOnload:  7.1.0.265

 

I'm running SockPerf in "ping-pong" mode between two (2) servers on the same VLAN connected via an Arista 7124SX with 10GbE. Both hosts are using the same OpenOnload, sfc, and Solarflare NIC versions.

 

On the host in question (details above), I'm running the SockPerf client using the following command (core 3 is on Socket 1):

taskset -c 3 onload -p latency sockperf ping-pong -i 10.3.27.117 -p 5001 --msg-size=256 -t 30 --full-log /tmp/ddio-enabled-pingpong.stats

 

On the other host, I'm running the SockPerf server with the following command:

taskset -c 1 onload -p latency sockperf server -i 10.3.27.117 -p 5001 

 

I'm using the ddio-bench tool to toggle DDIO off/on between tests. With DDIO disabled, DRAM Bandwidth Usage looks like the following:

|---------------------------------------||---------------------------------------|
|--            System DRAM Read Throughput(MB/s):         34.79                --|
|--           System DRAM Write Throughput(MB/s):         43.00                --|
|--             System PMM Read Throughput(MB/s):          0.00                --|
|--            System PMM Write Throughput(MB/s):          0.00                --|
|--                 System Read Throughput(MB/s):         34.79                --|
|--                System Write Throughput(MB/s):         43.00                --|
|--               System Memory Throughput(MB/s):         77.79                --|
|---------------------------------------||---------------------------------------|
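For reference, toggling DDIO per root port is typically done by flipping the Disable_All_Allocating_Flows bit of the IIO's perfctrlsts_0 register, which is presumably what ddio-bench does under the hood. A rough setpci sketch (the bus:device.function below is illustrative, and the 0x180 offset / bit 7 location should be double-checked for your platform):

IIO=17:00.0                                                # PCIe root port above the NIC (illustrative BDF)
val=$(setpci -s $IIO 0x180.L)                              # read perfctrlsts_0
setpci -s $IIO 0x180.L=$(printf '%x' $((0x$val | 0x80)))   # set bit 7 (Disable_All_Allocating_Flows): DDIO off
setpci -s $IIO 0x180.L=$(printf '%x' $((0x$val & ~0x80)))  # clear bit 7: DDIO back on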

 

However, after toggling DDIO on, DRAM bandwidth is nearly eliminated for Writes, but it increases for Reads:

|---------------------------------------||---------------------------------------|
|--            System DRAM Read Throughput(MB/s):         49.21                --|
|--           System DRAM Write Throughput(MB/s):          3.11                --|
|--             System PMM Read Throughput(MB/s):          0.00                --|
|--            System PMM Write Throughput(MB/s):          0.00                --|
|--                 System Read Throughput(MB/s):         49.21                --|
|--                System Write Throughput(MB/s):          3.11                --|
|--               System Memory Throughput(MB/s):         52.32                --|
|---------------------------------------||---------------------------------------|

 

This is despite a 100% cache hit rate for PCIRdCur:

Skt |  PCIRdCur  |  RFO  |  CRd  |  DRd  |  ItoM  |  PRd  |  WiL
 0        0          0       0       0     448        0    1400   (Total)
 0        0          0       0       0       0        0    1400   (Miss)
 0        0          0       0       0     448        0       0   (Hit)
 1       11 K      181 K     0       0     454 K      0     104 K (Total)
 1        0          0       0       0       0        0     104 K (Miss)
 1       11 K      181 K     0       0     454 K      0       0   (Hit)
------------------------------------------------------------------
 *       11 K      181 K     0       0     455 K      0     105 K (Aggregate)

 

I also notice that, based on the PCIRdCur Total vs. the ItoM Total, the PCIe Read rate is nowhere near the PCIe Write rate, despite the fact that this is a ping-pong test - I'd expect the bandwidth to be roughly symmetrical.

 

So, my questions are:

  1. Why doesn't DDIO reduce Memory Read BW when there is a high cache hit rate, the way it does for Memory Write BW?
  2. Why doesn't PCIRdCur match up with ItoM for a ping-pong test, when my understanding is that both PCIe Reads and Writes use the LLC as the primary destination, only going to RAM in the case of a miss?
SergioS_Intel
Moderator

Hello Mark_D_9,

 

 Thank you for contacting Intel Customer Support.

  

In order to assist you properly, could you please provide us with the model of the memory and the server board that you are using?


We look forward to your response.

  

 Best regards,

 Sergio S.

 Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit: https://intel.com/support/serverbios


Mark_D_9
New Contributor I

The server is a Dell PowerEdge R740xd. The RAM is Hynix HMA81GR7CJR8N-XN, with one (1) DIMM per channel installed in each of the six (6) channels per socket.

SergioS_Intel
Moderator

Hello Mark_D_9,

 

We appreciate the additional information. Please allow us to check it, and we will get back to you.

  

 Best regards,

 Sergio S.

 Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit: https://intel.com/support/serverbios


Mark_D_9
New Contributor I

Preliminarily, can anyone at Intel at least confirm whether something looks "off" in these R/W bandwidth results with DDIO enabled?

In other words, shouldn't I expect to see a similar reduction in RAM Read BW as I see in RAM Write BW if the socket IIO Controller is reading directly from and writing directly to the LLC with ~100% cache hit rate?

JoseH_Intel
Moderator

Hello Mark_D_9,


We appreciate your patience with this issue. We have elevated your question to our senior team, and we require the following information for our records:


Could you explain the purpose of this request (personal or business project, product design, etc.)?

Do you see different behavior with older Xeon processors?

Is this being used for purchasing decisions, and if so, how many systems are you interested in?

Is this somehow impacting your existing data center? What applications are you using (commercial, custom?), what kind of workloads are you processing, approximately how many server nodes, etc.?


We look forward to your details.


Regards


Jose A.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit:

https://intel.com/support/serverbios


Mark_D_9
New Contributor I

The purpose of this ticket is to determine why the latency improvement is less than what I'd expect for our trading application (reading UDP multicast feeds and sending TCP messages to an exchange). We originally experienced a very poor ItoM cache hit rate (just over 10%) due to the original Rx/Tx descriptor buffer sizes; after tuning, the hit rate improved to 96+%. However, I didn't notice a difference in PCIe Read BW in our lab setting.
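For context, the descriptor-ring tuning referred to above is the sort of change made with ethtool (a generic sketch - the interface name and ring sizes are placeholders, and Onload-accelerated paths may expose their own tuning knobs):

ethtool -g eth2                  # show current and maximum RX/TX descriptor ring sizes
ethtool -G eth2 rx 512 tx 512    # set new ring sizes (values illustrative)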

 

Since I can't share private info regarding our trading application, I used a well-known benchmark, SockPerf, running a very simple ping-pong test between hosts on the same VLAN, to demonstrate the issue.

 

So, what I'm looking for *first* is a simple confirmation regarding whether DDIO should behave the same for PCIe Reads as it does for PCIe Writes - i.e., reduce memory bandwidth usage once high LLC cache hit rates are achieved for the IIO controller.

 

We're evaluating this for near term purchasing decisions for either Ice Lake upgrades or other CPU offerings targeted at HFT firms.

 

I think this "bandwidth effect" assumption should be a pretty simple thing for the Engineering Team to answer upfront - that only requires that someone knows how DDIO should work. Can we start with that? Can someone tell me what DDIO *should* be doing in this case?

Mark_D_9
New Contributor I

I ran the same test on an identical R740xd with the same OS, BIOS settings, drivers, and Onload version, and it behaves as I would expect; namely, enabling DDIO nearly eliminates both Memory Write *and* Memory Read BW usage.

 

The *only* difference is it has one (1) DIMM populated per socket - Micron 18ASF2G72PDZ-3G2E1. The SUT from the original post has six (6) DIMMs populated per socket with the previously indicated Hynix part number.

JoseH_Intel
Moderator

Hello Mark_D_9,


Thank you very much for your complete update. Let me share this with our senior team. We will get back to you as soon as we have further details.


Regards


Jose A.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit:

https://intel.com/support/serverbios



JoseH_Intel
Moderator

Hello Mark_D_9,


We have the following information.


Why doesn't DDIO reduce Memory Read BW in the instance of a high cache hit rate the way it does for Memory Write BW?

  • From what we can tell from the tools' output, the memory read bandwidth is not being generated by IO transactions. Our guess is that some application on the CPU cores is issuing memory reads which are missing the LLC. This can be confirmed with pcm.x (you already have the tool).
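For example, something along these lines while the ping-pong test is running (the one-second sampling interval is arbitrary):

pcm.x 1    # per-core IPC, L3 hit ratio and L3 misses every second; core-side L3 misses here would account for the DRAM reads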

 

Why does PCIRdCur not match up with ItoM for a ping-pong test, when your understanding is that both PCIe Reads and Writes use the LLC as the primary destination, only going to RAM in the case of a miss?

 

  • The PCIRdCur counter is accurate, so from what we can tell this ping-pong test is not generating as many PCIe reads as writes (in fact almost no PCIe reads; they could be just metadata like descriptors). We would recommend looking at the tool configuration/output some more. From the pcm-pcie.x tool output, IO Reads are only ~700 KB while IO Writes are ~40 MB.
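That is, re-running the PCIe counters with bandwidth estimation while the test is active should make this visible (a sketch - the -B switch asks pcm-pcie to estimate bandwidth from the sampled events):

pcm-pcie.x -B 1.0    # sample the PCIe events every second and print estimated IO read/write bandwidth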

 

So, what I'm looking for *first* is a simple confirmation regarding whether DDIO should behave the same for PCIe Reads as it does for PCIe Writes - i.e., reduce memory bandwidth usage once high LLC cache hit rates are achieved for the IIO controller.

  • Yes, DDIO would allow both writes and reads to be served by the caches instead of DRAM. In fact, even with DDIO disabled, reads would still be served from the caches. But first, we would recommend double-checking the application - it seems like it's not doing what you are expecting.

 

I ran the same test on an identical Dell PowerEdge R740xd with the same OS, BIOS settings, drivers, and Onload version, and it behaves as I would expect; namely, enabling DDIO nearly eliminates both Memory Write *and* Memory Read BW usage.

  • Question on this: was the NIC populated on socket0 or socket1 in the Dell test? 


Jose A.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit:

https://intel.com/support/serverbios



Mark_D_9
New Contributor I

Regarding the last point, both Dell R740xds have the NIC installed on Socket #1, as that is our standard here. Both of the Dells were purchased together, and it was stipulated that one of them would have one (1) 32GB DIMM per socket, while the other one would have all six (6) DIMMs per socket. I ran "omreport chassis biossetup" on both and diff-ed the output to confirm that they have the same BIOS settings. 
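That is, on each host, roughly (filenames are illustrative):

omreport chassis biossetup > $(hostname)-bios.txt    # dump the BIOS settings via Dell OMSA
diff hostA-bios.txt hostB-bios.txt                   # empty output = identical settings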

 

When I run the *exact* same SockPerf ping-pong test on the host with 1-DIMM per socket, I get the following output from pcm-memory.x:

 

|---------------------------------------||---------------------------------------|
|--            System DRAM Read Throughput(MB/s):          4.08                --|
|--           System DRAM Write Throughput(MB/s):          3.96                --|
|--             System PMM Read Throughput(MB/s):          0.00                --|
|--            System PMM Write Throughput(MB/s):          0.00                --|
|--                 System Read Throughput(MB/s):          4.08                --|
|--                System Write Throughput(MB/s):          3.96                --|
|--               System Memory Throughput(MB/s):          8.04                --|
|---------------------------------------||---------------------------------------|
Mark_D_9
New Contributor I

By the way, the application in question is the well-known, open-source tool SockPerf. It's not doing anything weird or awkward - just a standard ping-pong test:

 

https://github.com/Mellanox/sockperf

 

The command-line arguments quoted above are exactly what's being run on both R740xd hosts mentioned - the one on which DDIO appears to be misbehaving, and the one on which it is working as described in the relevant Intel documentation.

JoseH_Intel
Moderator

Hello Mark_D_9,


Thank you for the update. I will get back to you soon.


Regards


Jose A.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit:

https://intel.com/support/serverbios


Mark_D_9
New Contributor I

FYI - SockPerf is also yum-installable on CentOS.
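For example (assuming the package is provided via the EPEL repository on CentOS 7):

yum install -y epel-release    # assumption: sockperf ships in EPEL
yum install -y sockperf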

Mark_D_9
New Contributor I

FYI - the CPU microcode on both Cascade Lake machines is exactly the same: 0x5002f01
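For reference, the loaded revision can be read with, e.g.:

grep -m1 microcode /proc/cpuinfo    # e.g. "microcode   : 0x5002f01"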

Mark_D_9
New Contributor I

Is anyone still looking into this?

JoseH_Intel
Moderator

Hello Mark_D_9,


Could you please tell us if OSB is enabled on this system? There might be an Opportunistic Snoop Broadcast setting in the BIOS.


Also, if possible, please see whether the memory accesses are being generated by prefetchers. For example, run the same experiment with the prefetchers off; they can be disabled with "wrmsr -a 0x1a4 0xf".
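A minimal sequence for that experiment with msr-tools would look roughly like this (MSR 0x1a4 is MISC_FEATURE_CONTROL; bits 0-3 gate the four core prefetchers - uncore prefetchers such as XPT are typically separate, BIOS-level knobs):

modprobe msr                # expose /dev/cpu/*/msr for rdmsr/wrmsr
rdmsr -a 0x1a4              # record the current per-core value first
wrmsr -a 0x1a4 0xf          # disable the four core prefetchers on all cores
wrmsr -a 0x1a4 0x0          # re-enable them once the experiment is done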


Besides that, could you tell us whether you or your company has a Non-Disclosure Agreement with Intel? Further troubleshooting might require an NDA to be in place. Intel does not set up NDAs with individuals. If you need to set up an NDA, we will continue the communication over email due to the nature of the information required.


We will look forward to your updates


Jose A.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit:

https://intel.com/support/serverbios


Mark_D_9
New Contributor I

As I stated a few weeks ago, the BIOS settings between the host with the correct DDIO functionality and the one with the off-kilter DDIO functionality are identical (the BIOS settings are attached for both hosts).

The only difference between the two is that one is configured with 1 DIMM per socket while the other has all 6 DIMMs per socket populated. They both have SNC disabled, which I assume means it defaults to OSB mode.
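As a quick sanity check from the OS side, with SNC disabled a two-socket box should only expose two NUMA nodes:

numactl --hardware | head -1    # expect "available: 2 nodes (0-1)" with SNC off on a 2-socket system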

 

If we need to execute a mutual NDA, please reach out over email at mdawson@whtrading.com so our Legal Team can process the request.

JoseH_Intel
Moderator

Hello Mark_D_9,


Thank you for the logs provided. I just sent you a private message at the address you provided. You can reply with the requested information in order to start the NDA process.


We will look forward to your updates


Jose A.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit:

https://intel.com/support/serverbios


Mark_D_9
New Contributor I

Intel determined that the issue lies with the XPT Prefetcher - disabling it resolves the problem.
