I am currently involved in a project investigating the NUMA awareness and CPU affinity of a new file system. My method is to choose a baseline file system that is well known to be NUMA friendly, and then compare its QPI traffic against the new file system's while writing the same amount of data.
I have a question about the relationship between QPI traffic and the count of last-level cache (LLC) misses fulfilled by remote DRAM and remote cache. I assumed the two must be strongly correlated, since QPI traffic should be caused by remote memory accesses, until I saw the following stats from VTune:
This is the stats of baseline file system:
This is the stats from the new file system:
As you can see from the graphs above, the new file system produced far more QPI traffic than the old one (the integral of QPI bandwidth utilization over time should equal the total traffic), yet its LLC miss count fulfilled by remote memory and cache is far lower: 108M for the baseline file system versus 35M for the new one.
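To be explicit about what I mean by "the integral of QPI bandwidth utilization," here is a minimal sketch of recovering total traffic from a sampled bandwidth curve by trapezoidal integration. The sample values below are made up for illustration; they are not the actual VTune data.

```python
# Trapezoidal integration of a sampled bandwidth curve.
# The timestamps and bandwidth values are hypothetical, NOT the VTune data.
def total_traffic_gb(timestamps_s, bandwidth_gbps):
    """Integrate bandwidth (GB/s) over time (s) to get total traffic (GB)."""
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        avg_bw = (bandwidth_gbps[i] + bandwidth_gbps[i - 1]) / 2.0
        total += avg_bw * dt
    return total

ts = [0, 5, 10, 15, 20]          # seconds
bw = [0.0, 2.0, 2.0, 2.0, 0.0]   # GB/s (hypothetical)
print(total_traffic_gb(ts, bw))  # -> 30.0 GB total
```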
This can also be verified by the summary stats from the CLI. The average QPI bandwidth of the baseline file system:
Package Total, GB/sec:Self
And the corresponding number for the new file system:
Package Total, GB/sec:Self
I have tried both enabling and disabling the hardware prefetcher, but the results are about the same.
Please share your thoughts and any help is appreciated.
You should specify the hardware of the system under test. Xeon E5 and Xeon E7 are quite different, and each generation (v1,v2,v3,v4) has a different set of performance counter bugs. Starting with Xeon E5 v2, the QPI snooping mode also has a significant impact on QPI traffic counts (and types of transactions).
It is not clear that there is enough information provided here to compute all of the necessary metrics for comparison, but the QPI values for the "new" filesystem are certainly suspicious. The three values that I see don't appear consistent:
Some of these metrics may be overall QPI traffic and some might be QPI data traffic, but it is still hard to understand how these could all be correct.
It would help to know how much data you requested to be moved to the filesystem and how long the two tests took (though this is often tricky with filesystem tests).
As a general suggestion on methodology, I recommend comparing cases based on raw counts rather than rates whenever possible. You can always divide by the corresponding elapsed times to get rates, but if you start by comparing rates, you can easily get confused about what portion of the change is due to a change in the raw counts and what portion is due to the change in elapsed time.
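A toy example (made-up numbers) of how comparing rates can mislead: two runs can show identical rates even though one did twice the work, because the elapsed time also doubled.

```python
# Hypothetical runs: comparing rates hides whether the count or the time changed.
count_a, time_a = 100e9, 20.0   # events, seconds (made-up numbers)
count_b, time_b = 200e9, 40.0   # twice the events, twice the time

rate_a = count_a / time_a
rate_b = count_b / time_b

print(rate_a == rate_b)     # True: the rates look identical...
print(count_b / count_a)    # 2.0: ...but twice as many events occurred
```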
@John, thanks for the reply.
I ran the test on a node with E7-4890 v2 (Ivy Bridge) processors. The node has 4 CPUs installed, and I picked 8 cores from each NUMA node to perform the test. The hardware prefetcher was disabled by running 'wrmsr 0x1a5 0xf'. The operating system is CentOS 7, and I used a RAM disk as the storage. The 'old' file system is a variation of ext4 and the 'new' file system is ZFS; ZFS checksumming was disabled to make the comparison fair. Writing the same amount of data took 20 s with ext4 and 28 s with ZFS.
Indeed, I should have posted the raw data here. From what I have seen, the raw data from the QPI counters match the bandwidth numbers I posted above. Here is the raw data for ext4:
The raw data for ZFS:
From the raw data above, ZFS produced about 20 times more QPI traffic than ext4 when writing the same amount of data. Part of the reason is that ZFS does nothing NUMA-specific, and it also suffers from a write amplification problem.
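Following John's point about counts versus rates, here is a quick sketch converting the traffic ratio into an average bandwidth ratio using the elapsed times above (assuming the ~20x figure refers to total traffic):

```python
# Converting a total-traffic ratio into an average-bandwidth ratio,
# using the elapsed times stated above (20 s for ext4, 28 s for ZFS).
traffic_ratio = 20.0       # ZFS total QPI traffic / ext4 total QPI traffic
t_ext4, t_zfs = 20.0, 28.0 # seconds

bandwidth_ratio = traffic_ratio * t_ext4 / t_zfs
print(round(bandwidth_ratio, 2))  # -> 14.29: ZFS's average QPI bandwidth ratio
```

So the longer ZFS run means its average bandwidth is "only" about 14x higher even though its total traffic is about 20x higher.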
What I don't understand is that I thought the LLC miss count should track the QPI traffic. As I mentioned earlier, I used a RAM disk for testing, so each actual IO to storage should usually cause a cache miss. Is there any optimization inside the CPU that could bypass the cache for large data transfers?
@John, please let me know if you want more data.
I don't know which counters VTune is using in the reports above, but I would guess that the LLC misses reported are only those due to accesses from the cores. IO DMA traffic will cause QPI traffic without causing core-based LLC misses. A non-NUMA-aware filesystem is likely to perform DMA writes to buffers on the "wrong" socket more often than a NUMA-aware filesystem.
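As a rough sanity check of this gap (assuming a 64-byte cache line and taking the remote-miss counts from the posts above), the QPI data traffic that core LLC misses alone could account for:

```python
# Back-of-the-envelope: QPI data traffic implied by remote LLC misses alone.
# Miss counts are from the posts above; 64 B is the cache-line size assumed.
LINE_BYTES = 64

ext4_remote_misses = 108e6
zfs_remote_misses = 35e6

ext4_miss_traffic_gb = ext4_remote_misses * LINE_BYTES / 1e9
zfs_miss_traffic_gb = zfs_remote_misses * LINE_BYTES / 1e9

print(f"ext4 miss-driven traffic: {ext4_miss_traffic_gb:.1f} GB")  # ~6.9 GB
print(f"ZFS  miss-driven traffic: {zfs_miss_traffic_gb:.1f} GB")   # ~2.2 GB
```

If ZFS generates roughly 20x the QPI traffic of ext4 while its miss-driven lower bound is about 3x smaller, most of the ZFS QPI traffic must come from agents other than the cores (e.g. IO/DMA), consistent with the explanation above.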
The Uncore counters have the ability to count many different types of CBo accesses -- including those associated with different types of IO traffic -- but all of this is very minimally documented, and I don't have the infrastructure required to test any of these traffic types.