Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

How does the performance counter of a logical cpu collect the prefetcher-related event with hyperthreading enabled?

hiratz
Novice
1,219 Views

Hi,

On one hand, the two logical cpus in a physical core share the same set of 4 prefetchers and the same control MSR. On the other hand, the performance counters are per logical cpu. So when hyperthreading is enabled, for non-prefetcher events, like retired instructions, load accesses, or bandwidth, the performance counter of each logical cpu should get a different value from its sibling. But for prefetcher-related events, like L2_RQSTS.L2_PF_MISS or L2_RQSTS.ALL_PF, should the two logical cpus in one physical core see the same value in their respective performance counters for these events (say L2_RQSTS.L2_PF_MISS)? I did an experiment and found that the numbers read from the two logical cpus in a physical core are very close, but different. I'm not sure whether this difference is just noise or means the two logical cpus really do have different values.

Any ideas are really appreciated!

 

6 Replies
McCalpinJohn
Honored Contributor III

Your concern seems reasonable -- I would not expect the core performance counters to be able to distinguish between L2 hardware prefetches that were triggered by one logical processor vs the other logical processor.  The extreme case would be both logical processors making accesses to the same 4KiB page in shared memory -- both would be interacting with the L2 HW prefetcher, and it would not be possible to unambiguously differentiate between the two.

Don't be surprised if performance counts are not identical -- it is not possible to read multiple performance counters atomically.  The closest you could get with the core performance counters is to re-write the IA32_PERF_GLOBAL_CTRL MSR (MSR 0x38f) to clear the low-order bits that control the programmable performance counters.  I think it unlikely that this is guaranteed to stop the counters "simultaneously" (or if such a concept could even be applied to the performance counter unit), but "freezing" the counters in this way may allow them to be read with slightly less skew than reading them while they are still active.

Note that setting the "AnyThread" bit in the performance counter event select register may not cause the counter in one logical processor to increment "instantly" for operations taking place in the sibling logical processor.  It appears that the counts are combined in batches, rather than on a cycle-by-cycle basis.  (Unfortunately I cannot find the forum thread on this topic, but it involved a user who thought he was seeing non-monotonic behavior on a single core -- what he was actually seeing was the appearance of non-monotonicity when comparing counts that were made on different threads of the same core with the AnyThread bit set.)

hiratz
Novice

Hi John,

Thank you for your valuable points. But I don't think the two logical processors trigger the L1/L2 prefetchers separately. I think a "mixed" access stream from the two logical processors arrives at and triggers the L1/L2 prefetchers. This is how SMT (or hyperthreading) works, right?

Speaking of this, actually another interesting question comes to my mind. Do you think the cache access requests which leave the core contain the logical cpu id? I think so; otherwise the two sibling logical cpus' performance counters could not differentiate their own statistics for many events like "LLC References (mask: 4f, event: 2e)", "LLC Misses (mask: 41, event: 2e)", or "MEM_LOAD_UOPS_RETIRED.L1/L2/L3_HIT/MISS".

Yes, I use the IA32_PERF_GLOBAL_CTRL MSR (MSR 0x38f) as an overall control switch.  For example, I use "wrmsrl_safe_on_cpu(cpu, IA32_PERF_GLOBAL_CTRL, 0);" to "freeze" the performance counters of the local logical cpu. Then I read their values and do something. After that, I clear this control MSR's overflow bits and re-enable the PMU with "wrmsrl_safe_on_cpu(cpu, IA32_PERF_GLOBAL_CTRL, (u64)(IA32_FIXED_CTR_ENABLE|0xf));" (here: IA32_FIXED_CTR_ENABLE = (u64)(((u64) 1 << 32)+((u64) 1 << 33)+((u64) 1 << 34));).

But if you want to disable other logical cpus' performance counters from the LOCAL logical cpu, you first have to send an IPI (Inter-Processor Interrupt) to those cpus to tell them to do so.

About "Any Thread", your observation is interesting and valuable! I didn't notice this phenomenon you mentioned because usually I don't set the "Any Thread" bit when I run benchmarks on two sibling logical cpus. I just added their values together for my use (for convenience). But I did do some tests to observe how pmu works when "Any Thread" is set. It turned out that the pmu of one logical cpu would collect the statistics (say retired instructions) of both two sibling logical cpus as "Intel® 64 and IA-32 Architectures Developer's Manual Vol. 3B" says and the results seems reasonable.

McCalpinJohn
Honored Contributor III

Hiratz wrote:

But I don't think it is the case that two logical processors trigger the L1/L2 prefetchers respectively. I think it should be a "mixed" access stream from two logical processors that arrives at and trigger L1/L2 prefetchers.

I am not sure about the L1 prefetchers -- they are not as important to performance on my codes and there are not many performance counter events that allow one to differentiate between L1 HW prefetches and L1 demand accesses.   The L2 HW prefetchers work on independent 4KiB pages, so if those pages are accessed by different Logical Processors, then there is some degree of independence.  If two Logical Processors access the same 4KiB page, then the behavior of the L2 HW prefetcher is based on the combined access pattern from the two Logical Processors, and it is no longer possible to say that a specific L2 HW prefetch access was "caused" by one Logical Processor or the other.

The L2 HW Prefetchers have strongly dynamic behavior that depends on how "busy" the L2 cache is, so L2 accesses from one Logical Processor will influence the L2 HW Prefetch behavior for memory access streams initiated by the other Logical Processor.

Hiratz wrote:

Do you think the cache access requests which leave the core contain the logical cpu id?

Yes, I think this is the case for demand transactions -- both for the events you mentioned and for the OFFCORE_RESPONSE events.  L2 HW Prefetch events are much more autonomous, so it is not clear that a Logical Processor ID can be unambiguously assigned to these events.

The performance counter results with the "AnyThread" bit set (which Intel is deprecating) do work correctly on systems and events where they are supported -- even if the Logical Processor doing the counting is idle for most of the interval.  The issue I was referring to occurred only at very fine granularity, where counts were accrued immediately/continuously for events that occurred on the same Logical Processor, but were accumulated in larger chunks at many-cycle intervals for events that occurred on the sibling Logical Processor.   This allowed the appearance of non-monotonicity when taking the difference of "local" and "remote" counts.

hiratz
Novice
Right, all requests that arrive at the L2 include ones from both the core and the L1 HW prefetcher (the event L2_RQSTS.ALL_DEMAND_DATA_RD is described as "Counts any demand and L1 HW prefetch data load requests to L2." even though its name contains the word "DEMAND", which is a little confusing).

For multiprogrammed workloads, it is not possible for two sibling logical cpus to access the same 4KiB page. But for multithreaded workloads, it is entirely possible because all threads share the same address space.

McCalpinJohn wrote:

The L2 HW Prefetchers have strongly dynamic behavior that depends on how "busy" the L2 cache is, so L2 accesses from one Logical Processor will influence the L2 HW Prefetch behavior for memory access streams initiated by the other Logical Processor.

By saying that the L2 cache is "busy", do you mean some metric like "the number of accesses per second" or "the number of accesses per 1000 retired instructions"?

Also, if we assume the prefetch request also carries a logical cpu id, it is indeed possible to let the two memory access streams from the two sibling logical cpus trigger (at least) the L2 streamer prefetcher independently (call this method 1). Or it is also possible to let the two memory access streams mix into a single stream that triggers the L2 streamer prefetcher (call this method 2). This depends on how the L2 streamer prefetcher is designed, and I don't know how Intel designed it. But I think it may be a good idea to use method 1 for multiprogrammed workloads and method 2 for multithreaded workloads. Overall, I think method 1 makes more sense than method 2.

McCalpinJohn wrote:

(which Intel is deprecating)

I didn't notice this. Can you show me where I can see that Intel is deprecating the "AnyThread" bit? Thanks. About the phenomenon you mentioned: I think one logical processor needs some latency to get its sibling logical processor's performance data, and the transfer overhead also has to be considered. So it sounds reasonable to let the local logical processor grab its sibling's data in a larger time window (that is, at many-cycle intervals).
McCalpinJohn
Honored Contributor III

I don't think that Intel has disclosed specifics about the dynamic/adaptive behavior of the L2 HW prefetchers, but "busy" likely includes the number of L1 miss/L2 hits currently pending as well as the number of L2 misses currently pending.

L1 miss/L2 hit processing is quite fast, so you have to throw a lot of L1 misses at the L2 in a short period to see this, but it is possible.   The topic is discussed at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/532346, with my most recent updates at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/532346#comment-1916594 and https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/532346#comment-1916640 -- the latter containing a plot that shows that L2 HW prefetch rate decreases when the L1 miss/L2 hit transactions are processed at the highest rate. 

The influence of the number of L2 misses pending on the L2 HW prefetcher behavior is documented explicitly in the Intel Optimization Reference Manual (document 248966, revision 037, July 2017).  Section 2.4.5.4 describes the hardware prefetchers on the Sandy Bridge processors.  The section on the L2 "streamer" prefetcher includes the statement:

"Adjusts dynamically to the number of outstanding requests per core. If there are not many outstanding requests, the streamer prefetches further ahead. If there are many outstanding requests it prefetches to the LLC only and less far ahead.

My experiments have confirmed this general behavior, but it is challenging to observe what is happening in enough detail to formulate a quantitative model.

The future of the "AnyThread" bit is implied in Section 18.2.3.1 "AnyThread Counting and Software Evolution" in Volume 3 of the Intel Architectures Software Developer's Manual (document 325384, revision 064, October 2017).  In addition, the Knights Landing processor only supports the AnyThread bit for the three architectural events that match the fixed-function counter events (documented in the Knights Landing Processor Performance Monitoring Reference Manual - Volume 1: Registers (document 332972).   Finally, Section 19.2 of Volume 3 of the Intel Architectures Software Developer's Manual says that for the Xeon Processor Scalable Family (Skylake Xeon), users should refrain from using the AnyThread bit unless it is specifically listed as an option in the description of an event.  In this case, the AnyThread option is only listed as an option to the two cycle-count architectural events that match the fixed-function counter events, plus INT_MISC_RECOVERY_CYCLES and L1D_PEND_MISS.  While the KNL AnyThread feature restriction may not be particularly relevant, the dramatic reduction in the supported use of AnyThread on the Skylake (and Kaby Lake) cores suggests that AnyThread is unlikely to be widely supported in the future.

hiratz
Novice

Thanks a lot, John. Your answer is really helpful!

I'll read the links and sections you posted and let you know if I have some new findings or thoughts.

Best regards
