Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

The delay for accessing addresses in processes running on the same physical core

yunfeng7854
New Contributor I

Hello, there,

We are exploring the memory dependency prediction feature, and we have observed a rather confusing effect. Assume two processes, A and B, running on the two logical cores of the same physical core. Process A writes to addressA with something like "mov $4, (addressA)", and process B loads from addressB with "mov (addressB), %rax". If bits 2-11 of addressA and addressB are the same, we observe a drastic delay in process B's loads. Can someone kindly explain why there is a dependency here?

We are testing on an i7-6700K CPU, and we do not think this is caused by a cache bank conflict.
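
In case it helps, here is a minimal sketch of the kind of measurement we are doing (not our actual code; it assumes Linux/gcc, that CPUs 0 and 4 are siblings -- check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list -- and the page offset 0x918 is arbitrary):

    /* Minimal sketch: time loads on one hyperthread while the sibling
     * hyperthread stores to an address whose low 12 bits match.
     * Assumptions: Linux, gcc, CPUs 0 and 4 are siblings, offset 0x918
     * is arbitrary.  Build: gcc -O2 -pthread alias.c -o alias */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>

    static volatile uint64_t *addrA, *addrB;  /* bits 11:0 match, pages differ */
    static volatile int stop;

    static void pin(int cpu) {
        cpu_set_t s;
        CPU_ZERO(&s);
        CPU_SET(cpu, &s);
        pthread_setaffinity_np(pthread_self(), sizeof(s), &s);
    }

    static void *storer(void *arg) {          /* plays the role of process A */
        (void)arg;
        pin(0);
        while (!stop)
            *addrA = 4;                       /* "mov $4, (addressA)" */
        return NULL;
    }

    int main(void) {
        char *buf = aligned_alloc(4096, 2 * 4096);
        addrA = (volatile uint64_t *)(buf + 0x918);        /* arbitrary offset */
        addrB = (volatile uint64_t *)(buf + 4096 + 0x918); /* same bits 11:0   */

        pthread_t t;
        pthread_create(&t, NULL, storer, NULL);

        pin(4);                               /* assumed sibling of CPU 0 */
        uint64_t n = 10000000, sum = 0;
        uint64_t t0 = __rdtsc();
        for (uint64_t i = 0; i < n; i++)
            sum += *addrB;                    /* "mov (addressB), %rax" */
        uint64_t t1 = __rdtsc();

        stop = 1;
        pthread_join(t, NULL);
        printf("avg load cost: %.2f cycles (sum=%llu)\n",
               (double)(t1 - t0) / (double)n, (unsigned long long)sum);
        return 0;
    }

Changing the 4096 stride to something that breaks the bits 11:0 match gives the baseline for comparison.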

Thanks,

Wenhao

McCalpinJohn
Honored Contributor III

This is covered under the topic of "Memory Disambiguation" at the end of section 2.4.5.2 in the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-037, July 2017).

yunfeng7854
New Contributor I

Will "memory disambiguation" detect a possible address dependency across logical cores? My concern is that, as far as I can tell from the manual, the store buffer is partitioned between the two logical cores.

It appears to me that a load operation on one logical core needs to snoop the store operations on the other logical core. Is there any insight into this? How is it implemented?

Thanks

McCalpinJohn
Honored Contributor III

I suspect that the "memory disambiguation" mechanisms for controlling reordering of accesses to the L1 cache have the cross-thread impact you are observing because of a subtle feature of the Intel memory ordering model. Specifically, the ordering model requires that writes by a single processor become visible in the same order to all other processors. This includes the "sibling" logical processor(s) in a system employing HyperThreading, and it prohibits some cross-thread snooping optimizations. (Reference: Section 8.2.2 of Volume 3 of the Intel Architectures SW Developer's Manual, document 325384-062, March 2017.)
yunfeng7854
New Contributor I

Thank you, Dr. Bandwidth. The only thing I am not sure about is the "... prohibits some cross-thread snooping optimizations" part of your reply. To prevent possible memory ordering problems, I would guess there has to be some cross-thread snooping? If cross-thread snooping were prohibited, I would not expect a delay for conflicting address accesses.

McCalpinJohn
Honored Contributor III

The two logical processors share the same L1, and the L1 is where memory ordering is enforced (since the store buffers are not snooped by external transactions). The "memory disambiguation" feature allows relaxed ordering of reads when addresses don't collide, but it only compares the low 12 bits of the addresses (the same bits that determine the location in the L1 Data Cache). For a load following a store, a match on all address bits is a true collision, while a match on bits 11:0 only (with the higher bits differing) is a false collision. In either case, the load is delayed until either the store data is available (in the case of a true collision) or the processor can guarantee that the collision was false. (This can be implemented in many different ways -- too large a discussion for today.) So a false conflict in the "memory disambiguation" feature couples the two threads because it operates at the cache level, not at the store buffer level.
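
To make the collision test concrete, here is a small illustration of the predicate described above (the addresses are made up, and the exact comparison the hardware performs is not architecturally documented):

    #include <stdint.h>
    #include <stdio.h>

    /* The disambiguation check described above compares only bits 11:0
     * of the load and store addresses. */
    static int low_bits_match(uintptr_t a, uintptr_t b) {
        return ((a ^ b) & 0xFFF) == 0;
    }

    int main(void) {
        uintptr_t st  = 0x7f0000001918;    /* hypothetical store address      */
        uintptr_t ld1 = 0x7f0000001918;    /* all bits match: true collision  */
        uintptr_t ld2 = 0x7f0000002918;    /* only bits 11:0: false collision */
        printf("ld1: low bits match=%d, same address=%d\n",
               low_bits_match(st, ld1), st == ld1);
        printf("ld2: low bits match=%d, same address=%d\n",
               low_bits_match(st, ld2), st == ld2);
        return 0;
    }

Both loads are delayed the same way until the full-address comparison can be resolved; only the first one actually needs the store's data.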

One of the primary motivations for using store buffers is to avoid external snooping. Self-snooping is required, because a subsequent load to the same address must get the new data, whether the store buffer contents have been committed to the cache or not. HyperThreading is a special case. The logical processors share the same physical core, so it would be easy to snoop the subset of the store buffers used by the "sibling" logical processor. But (in general) this cannot be allowed, because there are too many cases where such snooping could cause a sequence of stores to appear in a different order to the sibling thread (compared to their order of appearance to other cores in the system).
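
As a sketch of why (my own illustration, not anything Intel documents about the implementation): in the classic message-passing litmus test below, the outcome r1==1 with r2==0 is forbidden on x86, including when the reader is the writer's HyperThread sibling. If the sibling were allowed to forward Y out of the writer's store buffer before X became globally visible, exactly that outcome could occur. CPUs 0 and 4 are again assumed to be siblings.

    /* Message-passing litmus test.  On x86 the forbidden count must stay
     * at zero, even across hyperthread siblings.
     * Build: gcc -O2 -pthread litmus.c -o litmus */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define ITERS 1000000

    static atomic_int X, Y, go, done;
    static long forbidden;

    static void pin(int cpu) {
        cpu_set_t s;
        CPU_ZERO(&s);
        CPU_SET(cpu, &s);
        pthread_setaffinity_np(pthread_self(), sizeof(s), &s);
    }

    static void *writer(void *arg) {          /* stores X, then Y */
        (void)arg;
        pin(0);
        for (int i = 0; i < ITERS; i++) {
            while (atomic_load(&go) != i + 1) /* wait for this round */
                ;
            atomic_store_explicit(&X, 1, memory_order_release);
            atomic_store_explicit(&Y, 1, memory_order_release);
            atomic_store(&done, i + 1);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
        pin(4);                               /* assumed sibling of CPU 0 */
        for (int i = 0; i < ITERS; i++) {
            atomic_store(&X, 0);
            atomic_store(&Y, 0);
            atomic_store(&go, i + 1);         /* release the writer */
            int r1 = atomic_load_explicit(&Y, memory_order_acquire);
            int r2 = atomic_load_explicit(&X, memory_order_acquire);
            if (r1 == 1 && r2 == 0)           /* Y visible before X: forbidden */
                forbidden++;
            while (atomic_load(&done) != i + 1)
                ;                             /* writer finished this round */
        }
        pthread_join(t, NULL);
        printf("forbidden (Y==1, X==0) outcomes: %ld of %d\n", forbidden, ITERS);
        return 0;
    }

The release/acquire orderings compile to plain MOVs on x86; they are there only to stop the compiler from reordering the accesses, so the test exercises the hardware ordering.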

yunfeng7854
New Contributor I

If I understand correctly, two statements are made here:

  1. Regarding the second paragraph: self-snooping enables store-to-load forwarding within a thread, which is allowed and necessary to ensure the thread always gets the latest data before that data is committed to the cache. Snooping and forwarding data from the "sibling" logical processor's store buffer are prohibited, because they may cause various ordering issues.
  2. Regarding the first paragraph: because the L1 is where memory ordering is enforced, the "memory disambiguation" mechanism needs to resolve possible address dependencies between the logical processors. This explains why there is a delay when the load address in one logical thread matches bits 11:0 of the store address in the "sibling" logical thread.

It is still hard for me to understand why resolving address dependencies between the logical processors is necessary. After all, when the load instruction retires, the processor will know whether it used outdated data (by comparing with the value in the L1). If it did, the pipeline and all instructions not yet retired can be flushed (given that instructions retire in program order).

In my understanding, the address dependency prediction between logical cores might be an optimization of the above scheme: the earlier the dependency is resolved, the smaller the penalty (not necessarily a full pipeline flush). On the other hand, such an optimization may be easy to implement between logical cores.
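
One way we could probe this empirically (our own idea, not something from the manuals cited above): the Skylake PMU has a MACHINE_CLEARS.MEMORY_ORDERING event that counts pipeline flushes caused by memory ordering conflicts, so running the loading process under "perf stat -e machine_clears.memory_ordering" while the sibling stores to an aliasing address should tell us whether the delay comes from full machine clears or from the shorter disambiguation stalls.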
