Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Intercore communication and Cache Sharing

Tim_G_2
Beginner

What would be the fastest (read: lowest-latency) method for moving several KiB of data from one core to another on the same physical package?

Suppose core 0 writes some data to the L3 cache. My understanding from the SDM Vol. 3, Section 11.4 is that core 0 gains ownership of the cache lines (if it did not already have it), even if they were previously shared. For core 1 to read the newly written data, the cache lines would normally have to be written back to RAM and then fetched by the other core through a cache miss. Is it possible to transfer ownership of the L3 lines from one core to the other, so that core 1 can access the data without waiting for the store to and load from RAM?

In particular, I'm targeting Xeon E7 processors in the v4 series.

 

1 Solution
McCalpinJohn
Honored Contributor III

There are a number of different ways to implement these transactions, and it is difficult to be sure that you understand a particular implementation well enough to correctly predict what it is going to do... Section 11.4 of Volume 3 of the SDM is a very high-level description, and does not contain all of the cache states and transactions that are implemented in specific processors.

In your case the trickiest part is to ensure that the "consumer" thread does not start reading the "data" before the "producer" thread has completed writing it.   Once that is correctly implemented, the "consumer" does not need to "move" anything -- it can simply access the data by loading from the appropriate addresses -- the L3 cache will handle the transfers efficiently.  If the consumer needs to copy data to a different address range, a simple read/write loop or memcpy() call should provide very close to the best possible performance.
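As a concrete illustration (the buffer size, variable names, and the use of C11 release/acquire atomics here are illustrative choices, not something prescribed above), a minimal producer/consumer handoff might look like this: the producer writes the buffer and then sets a flag with a release store; the consumer spins on the flag with an acquire load and then reads the data in place.

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_BYTES 2048                     /* illustrative transfer size */

static uint8_t buffer[BUF_BYTES];          /* shared data region */
static _Alignas(64) atomic_int ready = 0;  /* "flag" variable, kept on its own cache line */

/* Producer: write the data, then publish it with a release store so the
 * flag cannot become visible before the data stores. */
void produce(const uint8_t *src)
{
    for (size_t i = 0; i < BUF_BYTES; i++)
        buffer[i] = src[i];
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: spin until the flag is set (acquire), then read the data in
 * place; no explicit "move" of the cache lines is needed. */
uint64_t consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                  /* a pause instruction could go here */
    uint64_t sum = 0;
    for (size_t i = 0; i < BUF_BYTES; i++)
        sum += buffer[i];
    return sum;
}

If the consumer needs its own copy, replacing the summing loop with a memcpy() into the destination buffer costs about the same, as noted above.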

The L3 cache can deliver data at a sustained rate of about 14 Bytes/cycle on the Xeon E7 v4 processors, so the total time for the consumer to read all the data will be composed of three parts:

  1. A synchronization overhead -- the time required for the producer to notify the consumer that the data is ready to be read. 
    1. This involves a fair number of cache transactions, with timing that depends on a great many factors, including (at least): core frequency, uncore frequency, number of cores on the die, locations of the producer core, the consumer core, and the L3 slice containing the "flag" variable used for synchronization, etc.
    2. I don't have results on a Xeon E7 v4, but I measured a range of 200-300 cycles on a 12-core Xeon E5 v3 (depending on the relative placement of the cores and the L3 slice handling the cache line used for the handoff).
  2. A pipeline startup -- the time required for the consumer to receive the first data elements from the producer after the synchronization tells the consumer that it can start reading the data.
    1. The 200-300 cycle synchronization above corresponds to roughly 2.5 cache-to-cache intervention latencies, so a single intervention (and therefore the arrival of the first data at the consumer) should take about 80-120 cycles.
  3. The bulk transfer
    1. This should average about 14 bytes per cycle for loads of L3-resident data.
    2. E.g., for 2 KiB, I would expect about 150 cycles.

Adding these three parts together (200+80+150 = 430 cycles at the low end, 300+120+150 = 570 cycles at the high end) gives an estimate of 430 to 570 cycles for 2048 Bytes, or about 3.6 Bytes/cycle to 4.8 Bytes/cycle.

Because the transfer is dominated by the overhead here, the efficiency should improve for larger transfer sizes (up to the limit of the L1, when things get more complex).
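As an aside, a rough way to reproduce the kind of synchronization-overhead numbers quoted above is a flag ping-pong test between two threads pinned to specific cores. The sketch below is only illustrative (the iteration count, thread placement, and the use of __rdtsc() are my assumptions, and the TSC counts reference cycles rather than core cycles, so the result may need rescaling); it is not necessarily how the 200-300 cycle figure above was measured.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <x86intrin.h>                     /* __rdtsc() */

#define ITERS 100000

static _Alignas(64) atomic_int flag = 0;   /* the cache line being handed back and forth */

/* Second thread: wait for each odd value, answer with the next even value. */
static void *ponger(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2 * i + 1)
            ;
        atomic_store_explicit(&flag, 2 * i + 2, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    /* Pin the two threads to specific cores (e.g., with taskset or
     * pthread_setaffinity_np) so the core and L3-slice placement is known. */
    pthread_t t;
    pthread_create(&t, NULL, ponger, NULL);

    unsigned long long start = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 2 * i + 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2 * i + 2)
            ;
    }
    unsigned long long end = __rdtsc();
    pthread_join(t, NULL);

    /* Each iteration contains two one-way handoffs. */
    printf("approx. one-way handoff: %.1f reference cycles\n",
           (double)(end - start) / (2.0 * ITERS));
    return 0;
}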

 
