Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Configuring cache partitioning; how?

Tim_Day
Beginner

I've just been reading this interesting paper: http://www.cs.berkeley.edu/~hcook/papers/ISCA13_Henry_Cook.pdf "A Hardware Evaluation of Cache Partitioning to Improve Utilization and Energy-Efficiency while Preserving Responsiveness"

The main thing which surprised me in it: "Taking advantage of the partitioning mechanism in the LLC, we change the LLC space allocated to a given application from 0.5MB to 6MB."  (LLC = last level cache, presumably L3).

I had no idea it was possible to do that, but it sounds quite useful. I have a current interest in the impact that a high-scheduler-priority foreground task and a low-scheduler-priority background task running on the same HW can have on each other, and the way they share (or fight for!) cache has some bearing on this... further research led me to the paper above, and I'd really like to see the impact of such partitioning on our own code...

But what's the trick to actually doing it? Googling things like "intel cache partitioning control" and variants involving "mask", "API", etc. finds me nothing useful. If it's per-thread state, does it need OS support? Any pointers gratefully received!

 

3 Replies
TimP
Honored Contributor III

According to that paper, the BIOS was replaced by a special one of unspecified origin, so it's not a practical solution. Effective use of the LLC may depend strongly on BIOS quality and coordination with the OS scheduler even when using a production BIOS which doesn't have the special hooks implied in this paper.

A benchmark such as SPEC CPU normally runs its components individually, with access to the entire cache of both CPUs, and is run multi-threaded, contrary to the assertion of this paper (under the rules, not using OpenMP but relying on auto-parallel compilation). It's not expected to perform well in competition with another application that would cut into its cache allocation. It would take advantage of OpenMP affinity (e.g. OMP_PROC_BIND) to partition data automatically and maintain NUMA memory and cache locality.

A few of those SPEC CPU 2006 benchmarks rely on auto-parallel compilation, with the data initialization loop run under the same scheduling and affinity as the data consumption loops, even though the initialization time is negligible. This is set up to trap the unwary who want to cheat by using the OpenMP directives that are present in the source code but not permitted by the benchmarking rules.
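
As a minimal sketch of why the initialization loop matters (my own illustrative code, not from the SPEC sources): under the usual first-touch page placement, parallelizing the initialization with the same static schedule as the compute loop keeps each thread's pages on its own NUMA node:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 50000000

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double sum = 0.0;

        /* Initialization parallelized with the same static schedule as the
           compute loop: with first-touch policy, each thread's pages are
           allocated on its own NUMA node. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) {
            a[i] = 1.0;
            b[i] = 2.0;
        }

        /* Compute loop: same schedule, so each thread consumes the pages it
           initialized, preserving memory and cache locality. */
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("%f\n", sum);
        free(a);
        free(b);
        return 0;
    }

Compiled with e.g. gcc -fopenmp and run with OMP_PROC_BIND=true (the Intel compilers also accept KMP_AFFINITY), the threads stay bound, so the locality established at initialization persists into the compute loop.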

The performance of a multi-CPU, multi-core platform running multiple single-thread applications might be enhanced under Linux by pinning each application to a suitable set of cores, e.g. with the taskset command, compared with what may happen when such applications are left to the scheduler. The biggest differences show up when an application is suspended and resumed with intervening events that break the scheduler's record of which cores it was running on.
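
For reference, what taskset does from the shell can also be done programmatically with sched_setaffinity(2). A minimal Linux sketch (cores 0-3 are just an assumed example of one package; check the real topology first):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);

        /* Pin this process to cores 0-3 (assumed here to be one package;
           verify against the actual topology, e.g. in /proc/cpuinfo). */
        for (int cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &mask);

        /* pid 0 means the calling process. */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* ... run the cache-sensitive work here ... */
        return 0;
    }

The shell equivalent would be taskset -c 0-3 ./app.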

Simply by pinning the favored application to one CPU and its competitors to the other, you would reserve the entire LLC of one CPU for that application, even if it is allowed to go idle for significant periods. As hinted in the paper, the acceleration of a favored background application is achieved by limiting resources that might otherwise be used effectively by some foreground application.

In large data centers, time-critical applications don't normally share CPUs with others, even on nodes as large as Intel(R) Xeon(R) EP. It's only with very large, expensive NUMA nodes such as HP Superdome or Xeon E7 that one gets interested in schemes for sharing resources among multiple applications while maintaining the cache priority and memory-bank locality of a priority application. The time and effort involved in rigging a mass-produced platform for special application sharing is likely to exceed that of acquiring another machine.

Tim_Day
Beginner

Thanks, lots of good info there; I skimmed over the bit about the BIOS without realizing the significance of it at all! 

Yes, in the absence of any way to control the use of cache resources at finer than ~CPU granularity, thread affinity would certainly seem to be the next best tool available.

McCalpinJohn
Honored Contributor III

The paper clearly says that it is using non-standard hardware:

We use a prototype version of Intel's Sandy Bridge x86 processor that is similar to the commercially available client chip, but with additional hardware support for way-based LLC partitioning.

The acknowledgement section adds:

We would especially like to thank everyone at Intel who made it possible for us to use the cache-partitioning machine in this paper, [...]

In general, it has been found that static cache partitioning degrades throughput.   I first observed this in the development of the IBM POWER4 processor in 2000-2001 (where two cores shared an L2 cache and an L3 cache).   This paper does not disagree, but points out that throughput is not always the most important metric. 

As the paper discusses in section 7, "way-partitioning" is only one approach to static partitioning -- "set partitioning" is also commonly used in research, and can be implemented without extra hardware support by using appropriately specialized page coloring.    From the perspective of analysis, I prefer "set partitioning" because it does not change the associativity when the size is changed.  The "way partitioning" used in the paper necessarily changes both at once, so additional analysis is required to understand how much of the change in cache miss rate is due to capacity changes and how much is due to associativity changes.
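
To make the set-partitioning arithmetic concrete, here is a small sketch using an assumed, hypothetical LLC geometry of 6 MB, 12 ways, and 64-byte lines; the page "color" is just the part of the physical page number that also selects the cache set:

    #include <stdio.h>

    /* Hypothetical LLC geometry: 6 MB, 12-way, 64-byte lines, 4 KB pages. */
    #define CACHE_BYTES (6u * 1024 * 1024)
    #define ASSOC       12u
    #define LINE_BYTES  64u
    #define PAGE_BYTES  4096u

    int main(void)
    {
        unsigned num_sets      = CACHE_BYTES / (ASSOC * LINE_BYTES); /* 8192 */
        unsigned sets_per_page = PAGE_BYTES / LINE_BYTES;            /* 64   */
        unsigned num_colors    = num_sets / sets_per_page;           /* 128  */

        /* The color of a physical page is the low bits of its page number
           that also select the cache set. An OS page allocator that hands a
           task only pages of certain colors confines it to the matching sets,
           partitioning capacity without changing associativity. (This is just
           the mapping arithmetic; applying it requires kernel support, since
           user code cannot choose its physical pages.) */
        unsigned long phys_addr = 0x12345000ul;                      /* example */
        unsigned color = (phys_addr / PAGE_BYTES) % num_colors;

        printf("%u sets, %u colors; page at %#lx has color %u\n",
               num_sets, num_colors, phys_addr, color);
        return 0;
    }

Note that real Intel LLCs also hash addresses across the per-core cache slices, so this is the textbook picture rather than an exact model. Incidentally, if the prototype's LLC were 6 MB and 12-way, each way would hold 0.5 MB, which would line up with the 0.5 MB allocation granularity quoted in the original post.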
