Hi,
The PCM toolset includes a command, pcm-memory.x, which gives a top-level overview of memory read, write, and total bandwidth. What constitutes this? Does it include DMA to/from PCIe devices? CPU core/LLC accesses? Both? Anything else?
Thanks!
-adrian
Interesting question. My bet is that it counts total memory traffic. Can you look at the source code? Does it give you a clue?
Hi,
There's only 64 GB of RAM. It's FreeBSD-10 serving lots of video content, so there are lots and lots of kernel structures ((network) mbuf entries, (vm/cache) vm_page_t entries) churning through memory. Since they're spread all throughout memory at the moment, I'm seeing 10-15% of CPU cycles spent page table walking, and it scales up as we increase traffic.
When I started grouping these allocations to occur in a smaller region, I saw a noticeable (5-7%) drop in CPU utilisation.
Hence why I'd like to drill down further to fully understand what's going on. In my scenario the direct map is mostly 1GB entries, but there are regions of memory that can't be mapped with a 1GB page, so there will be some 2MB and 4KB page table entries. I'd like to find out where the bulk of the current TLB walking is happening.
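For illustration, a minimal C++ sketch of the kind of grouping described above, assuming FreeBSD's MAP_ALIGNED_SUPER mmap flag (which asks the VM to back the region with superpages where possible); the object and arena sizes are illustrative, not the values used on the real server:
```cpp
// One large, superpage-aligned arena with a trivial bump allocator,
// instead of many scattered allocations: fewer distinct pages touched
// means fewer TLB entries and fewer page walks.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

static constexpr size_t kArenaSize = 1UL << 30;  // 1 GB region (illustrative)
static constexpr size_t kObjSize   = 256;        // small kernel-like object (illustrative)

int main()
{
    char *arena = static_cast<char *>(mmap(nullptr, kArenaSize,
                                           PROT_READ | PROT_WRITE,
                                           MAP_ANON | MAP_PRIVATE | MAP_ALIGNED_SUPER,
                                           -1, 0));
    if (arena == MAP_FAILED) { perror("mmap"); return 1; }

    size_t next = 0;
    auto alloc_obj = [&]() -> void * {           // bump allocator: carve objects
        void *p = arena + next;                  // out of the contiguous arena
        next += kObjSize;
        return next <= kArenaSize ? p : nullptr;
    };

    void *first = alloc_obj();
    std::printf("first object at %p\n", first);

    munmap(arena, kArenaSize);
    return 0;
}
```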
Thanks!
-a
Regarding the video content: it is probably stored as Ethernet frames before the NIC driver hands it to the Ethernet card, so unless jumbo frames are used I suppose the standard size will be 1492 bytes per frame. From the perspective of the DTLB, and to reduce the pressure of frequently walking 4KB pages, I think the best option would be to use larger pages.
Can you perhaps configure your networking stack programmatically to use only larger pages?
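For reference, a minimal sketch of checking the relevant FreeBSD knob, assuming vm.pmap.pg_ps_enabled is the sysctl that controls transparent superpage promotion (whether the network stack's allocators actually land on superpages is a separate question this can't force):
```cpp
#include <sys/types.h>
#include <sys/sysctl.h>
#include <cstdio>

int main()
{
    int enabled = 0;
    size_t len = sizeof(enabled);
    // Read the sysctl that gates transparent superpage promotion.
    if (sysctlbyname("vm.pmap.pg_ps_enabled", &enabled, &len, nullptr, 0) != 0) {
        perror("sysctlbyname");
        return 1;
    }
    std::printf("superpages %s\n", enabled ? "enabled" : "disabled");
    return 0;
}
```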
The memory bandwidth metrics in PCM include all traffic: DMA, PCIe, core, etc.
--
Roman
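For what it's worth, the same counters are reachable programmatically; a minimal sketch assuming PCM's C++ API in cpucounters.h (PCM::getInstance, getSystemCounterState, getBytesReadFromMC/getBytesWrittenToMC), with PCM built and on the include path:
```cpp
// Reads the integrated-memory-controller counters that pcm-memory.x
// reports. The MC sees *all* requestors: cores, LLC fills, PCIe DMA, etc.
#include "cpucounters.h"
#include <unistd.h>
#include <iostream>

int main()
{
    PCM *m = PCM::getInstance();
    if (m->program() != PCM::Success) return 1;   // program core/uncore counters

    SystemCounterState before = getSystemCounterState();
    sleep(1);                                     // one-second measurement window
    SystemCounterState after = getSystemCounterState();

    std::cout << "read:  " << getBytesReadFromMC(before, after) / 1e6 << " MB/s\n"
              << "write: " << getBytesWrittenToMC(before, after) / 1e6 << " MB/s\n";

    m->cleanup();
    return 0;
}
```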
Hi!
We're sending large frames to the hardware and using TSO in the TX path, so it's going to be larger frames anyway (up to 64k segments).
Roman - thank you. I'll go digging through the uncore documentation for Sandy Bridge Xeon to see if I can better understand how the counters work.
I'd like to try to differentiate memory bus transactions: memory<->PCIe (and PCIe<->LLC as part of DDIO) versus memory<->core (for instruction/data fetches and data stores). Is this possible?
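One possible angle, not confirmed in this thread: the PCM toolset also ships pcm-pcie.x, which counts PCIe-originated uncore transactions, so subtracting its bandwidth from the pcm-memory.x totals would give a crude lower bound on core-originated traffic. A toy sketch with hypothetical numbers:
```cpp
#include <cstdio>

int main()
{
    double total_mb_s = 12000.0;  // hypothetical pcm-memory.x total
    double pcie_mb_s  = 4500.0;   // hypothetical pcm-pcie.x total

    // Rough lower bound on core-driven memory traffic. DDIO muddies this:
    // inbound PCIe writes that hit the LLC never reach the memory bus at all.
    std::printf("core-side estimate: %.0f MB/s\n", total_mb_s - pcie_mb_s);
    return 0;
}
```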
Thanks!
-a
Thank you Roman.
Hi Adrian!
What do you mean by writing 64k segments?
Thanks in advance
The NIC supports something called TSO - TCP segmentation offload. We hand the NIC 64k of TCP data, and the NIC takes care of breaking it up into MSS-sized chunks and sending them to the client.
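For context, a minimal sketch of that TX path on the FreeBSD side, using sendfile(2) to hand the kernel a 64k chunk in one call (socket and file setup omitted; the function name is illustrative):
```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstdint>
#include <cstdio>

int send_chunk(int file_fd, int sock, off_t offset)
{
    off_t sbytes = 0;
    // Queue 64 KB at once; with TSO enabled, the stack and the NIC cut it
    // into MSS-sized frames, so the host touches the data far less often.
    if (sendfile(file_fd, sock, offset, 64 * 1024, nullptr, &sbytes, 0) != 0) {
        perror("sendfile");
        return -1;
    }
    std::printf("queued %jd bytes\n", (intmax_t)sbytes);
    return 0;
}
```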
-adrian
Thanks Adrian.
Initially I thought that you meant 64k of segments.