Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

ICX - What is SpecI2M request and how it differs from RFO?

Eiv
Novice
5,610 Views
IRMA's presentation on the Icelake server said:

"Convert RFO to specI2M when memory subsystem is heavily loaded. Reduces mem bandwidth demand on streaming WLs that do full cache line writes (25% efficiency increase)"

So I would like to understand what specI2M is and how it differs from an RFO (Read For Ownership).

McCalpinJohn
Honored Contributor III
5,445 Views

It is unlikely that Intel is going to address this topic in detail -- except perhaps in patent applications (which often do not relate as closely to actual product implementations as one might expect...)

It is very easy to make incorrect assumptions about the relationships between instructions and protocol messages -- especially in Intel processors that have increasingly complex dynamically-adaptive behavior.

The description of the "specI2M" transaction in the HotChips presentation makes it clear that this provides an alternate mechanism for handling RFOs that can be used in a dynamic-adaptive environment to reduce read traffic ("write allocates") when utilization is high.  You are correct that several mechanisms already exist for this case, but the existing mechanisms are "static", while this enables dynamic adaptation.

As an example, one can compile the STREAM benchmark to use non-temporal stores.  This eliminates the write allocate traffic (1/3 of the total traffic for the Copy and Scale kernels, and 1/4 of the total traffic for the Add and Triad kernels) and provides a useful speed boost.  BUT, non-temporal stores have several inconvenient features:

  • Non-temporal stores are weakly ordered and require the binary to include extra fence instructions when switching back to ordinary stores.
  • Non-temporal stores are intended to push data all the way to "memory".  
    • This is not the desired behavior for all loop sizes or all contexts -- if the loop is small and the data could fit into some level of cache, then using streaming stores reduces performance by preventing cache re-use.
    • A compiler can generate multiple versions of loops, but this can be an expensive approach -- versioning can be multiplicative, and there are already too many special cases (based on alignment of each of the variables involved and the length of the loop).
    • The concept of "memory" is becoming less clear, with the addition of layers "beyond" DRAM (e.g., persistent memory), and/or the addition of caching layers between DRAM and the traditional caches (e.g., MCDRAM cache in Xeon Phi x200, L4 DRAM caches in some client chips, potential future HBM caches?). 
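
To make the first bullet concrete, here is a minimal sketch of a STREAM-style Copy loop written with non-temporal stores. It is not taken from the STREAM source: the function name `copy_nt` is hypothetical, the sketch assumes an x86 compiler that defines `__SSE2__`, and it falls back to ordinary stores everywhere else. Note the `_mm_sfence()` after the loop, which is the extra fence mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_loadu_pd, _mm_stream_pd, _mm_sfence */
#endif

/* Hypothetical helper: copy n doubles from src to dst.
 * dst must be 16-byte aligned, which _mm_stream_pd requires. */
void copy_nt(double *dst, const double *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0);
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(&src[i]);  /* ordinary (unaligned) load   */
        _mm_stream_pd(&dst[i], v);          /* non-temporal 16-byte store  */
    }
    /* Non-temporal stores are weakly ordered: fence before any later
     * ordinary stores that other threads may rely on. */
    _mm_sfence();
#endif
    /* Remainder (or the whole array on non-SSE2 builds): ordinary stores. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

The choice between the two store types is fixed when the binary is built, which is the "static" property discussed above: the loop bypasses the caches even when the arrays would have fit in them.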

A dynamic mechanism can (conceptually) make decisions about the generation of write-allocates independently at each level of the cache+memory hierarchy.  At each level, the mechanism might be chosen by some combination of available information, perhaps including queue occupancy at the input and output buffers, some measure of hit rates at the particular level of cache, etc.   
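
As a purely conceptual illustration of such a per-level decision, the sketch below picks a request type from queue occupancy alone. Everything in it is an assumption made for illustration: the names `level_state`, `choose_request` and `load_threshold_pct` are hypothetical, and nothing here describes how Intel (or anyone else) actually implements the policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { REQ_RFO, REQ_OWNERSHIP_NO_DATA } req_type;

/* Hypothetical snapshot of one level of the cache+memory hierarchy. */
typedef struct {
    unsigned queue_occupancy;     /* entries currently in use           */
    unsigned queue_capacity;      /* total entries in the queue         */
    unsigned load_threshold_pct;  /* "heavily loaded" cutoff, in percent */
} level_state;

/* Toy policy: skip the read only for full-line writes, and only when
 * this level looks heavily loaded. */
req_type choose_request(const level_state *s, bool full_line_write)
{
    assert(s != NULL && s->queue_capacity > 0);
    if (!full_line_write)
        return REQ_RFO;  /* a partial write must merge with the old data */
    unsigned pct = 100u * s->queue_occupancy / s->queue_capacity;
    return pct >= s->load_threshold_pct ? REQ_OWNERSHIP_NO_DATA : REQ_RFO;
}
```

A real design would presumably combine several signals (hit rates, input and output buffer occupancy, and so on); a single threshold is used here only to keep the idea visible.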

I certainly recall discussing this type of optimization while I was on the IBM POWER design team in the early 2000's.  We already had an instruction that would allocate and zero a cache line without reading the line from memory, but we wanted this to be something that could happen automatically when it was beneficial to do so.  (I have no recollection about what we decided to do -- I just remember the discussions!)

A similar feature is included in some ARM processors, but they have rearranged their web site and I can't find the reference at the moment.  Paraphrasing from memory: when a store buffer is ready to be written to the L1 Data Cache AND all bytes of a cache line are "valid" in the store buffer AND the corresponding cache line is not present in the L1 Data Cache, the L1 Data cache controller may choose to issue a transaction corresponding to "RFO without data".  I get the impression that the cache controller tracks how often stores match these properties, and only makes the switch to "RFO without read" when the scenario happens "often".   It appears that this mechanism is replicated at each level of the cache because the STREAM benchmark regularly delivers performance that is too high if one assumes that there are write allocates, while inspection of the source code shows that only "ordinary" (allocating) stores are used.
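
The three conditions paraphrased above, plus the "happens often" filter, can be sketched roughly as follows. This is a toy model built only from that paraphrase: the 64-byte line, the per-byte valid mask, the streak counter and `STREAK_THRESHOLD` are all assumptions, not documented ARM behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STREAK_THRESHOLD 4u  /* hypothetical definition of "often" */

/* Hypothetical tracker: consecutive full-line store misses seen so far. */
typedef struct { unsigned full_line_streak; } wa_tracker;

/* byte_valid_mask has one bit per byte of a 64-byte line in the store
 * buffer. Returns true when the controller may request ownership
 * without reading the old contents of the line. */
bool skip_read_for_store(wa_tracker *t, uint64_t byte_valid_mask,
                         bool line_in_l1)
{
    assert(t != NULL);
    if (line_in_l1)
        return false;             /* hit: no ownership request needed   */
    if (byte_valid_mask != UINT64_MAX) {
        t->full_line_streak = 0;  /* partial-line miss breaks the streak */
        return false;             /* old data is needed for the merge    */
    }
    if (t->full_line_streak < STREAK_THRESHOLD)
        t->full_line_streak++;
    return t->full_line_streak >= STREAK_THRESHOLD;
}
```

With this threshold, the first three full-line misses in a row still read the line, and the fourth and later ones do not, until a partial-line miss resets the count.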

10 Replies
Maria_R_Intel
Moderator
5,596 Views

Hello Eiv,


Thank you for posting on the Intel* Community.


The Context Sensing SDK is no longer supported. Intel Customer Service no longer supports inquiries for it, but perhaps fellow community members have the knowledge to jump in and help. We apologize for the inconvenience but it was end-of-life earlier in the year.


Best regards, 

Maria R. 

Intel Customer Support Technician 


Eiv
Novice
5,590 Views

Hi Maria,

 

I'm afraid we aren't talking about the same product. My question is about the Icelake server, which is about to be released at the end of this year.

 

Thanks for replying though,

Eiv

HadiBrais
New Contributor III
5,584 Views

An RFO request reads the cache line and obtains exclusive ownership of the line. However, if every byte of the line is going to be modified, then it's unnecessary to read the line. SpecI2M only obtains ownership and doesn't read the cache line. SpecI2M is a new type of request introduced on ICL-SP, while older generations only supported the I2M request type. SpecI2M will probably be documented in the uncore performance monitoring guide of ICL-SP, which should be released in the next few months.
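
To put a number on the read that is avoided, here is a small back-of-the-envelope model. It is a sketch under simple assumptions (64-byte lines, every array streamed through memory exactly once, no cache reuse), and the function name `memory_traffic_bytes` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u

/* Toy traffic model, not a description of real hardware. When
 * reads_before_write is nonzero, each written line is first read
 * (the RFO case); otherwise ownership is obtained without data. */
uint64_t memory_traffic_bytes(unsigned arrays_read, unsigned arrays_written,
                              uint64_t lines_per_array, int reads_before_write)
{
    assert(lines_per_array > 0);
    uint64_t streams = (uint64_t)arrays_read + arrays_written;
    if (reads_before_write)
        streams += arrays_written;  /* one extra read stream per written array */
    return streams * lines_per_array * LINE_BYTES;
}
```

For a kernel that reads one array and writes one (such as STREAM Copy), the model gives 3 streams with the read and 2 without, so the read is 1/3 of the traffic. For a kernel that reads two arrays and writes one (such as STREAM Triad), it is 1 stream out of 4.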

BTW, this question is more suitable on the "Software Tuning, Performance Optimization & Platform Monitoring" forum.

Eiv
Novice
5,576 Views

Thanks for the answer!

Since this is my first post on the Intel forums, I wasn't aware of which sub-forum was the most suitable for my question. Next time I will definitely use "Software Tuning, Performance Optimization & Platform Monitoring" instead.

Regarding your answer, to be honest I was sure this feature already existed with the introduction of AVX512: when a single instruction writes a full 64B, the core shouldn't read the data; it should just obtain the cache line, store the data, and mark all other copies as invalid.

What is the I2M request type? Is there some reference I can read?

 

HadiBrais
New Contributor III
5,563 Views

I didn't say that obtaining ownership without data is a new feature on ICL-SP (it actually existed long before AVX-512, to implement string instructions efficiently). I said that SpecI2M is a new type of request to support the new feature of converting RFOs to ownership-without-data requests.

Eiv
Novice
5,555 Views

Do you know why we need this new type of request (specI2M) for streaming workloads that do full cache line writes?

I would expect them to use the mechanisms already available (AVX512, string instructions, etc.).

Eiv
Novice
5,426 Views

Thank you for the detailed answer.

IntelSupport
Community Manager
5,500 Views

Hello Eiv,


Please allow us more time to investigate your inquiry. We will post back on this thread as soon as possible.



Best regards,

Maria R.

Intel Customer Support Technician


Maria_R_Intel
Moderator
5,483 Views

Hello Eiv,


To better assist you, we will move this thread to the proper sub-forum. Please expect a response soon.


Best regards, 

Maria R.  

Intel Customer Support Technician 

