On our Xeon Gold 6154, the CLWB instruction behaves exactly like the CLFLUSHOPT instruction, meaning that it also evicts the cache line, which it is not supposed to do.
Thank you very much for contacting the Intel® Communities Team, freeky1337.
In order for me to assist you better, please provide the .txt file generated by the Intel® System Support Utility (https://downloadcenter.intel.com/product/91600/Intel-System-Support-Utility). Furthermore, I would like to know which software you are running when this issue occurs and, if possible, please attach an image of the issue to this thread. To attach a file, click the "Attach" option in the bottom right-hand corner of the response box.
Thanks for your response. I have attached the output of the System Support Utility.
We first encountered the issue when using Intel PMDK. The test iterates over an array using different access patterns (e.g., sequential or uniformly distributed) and updates a 4-byte integer, followed by a CLWB and an SFENCE. Regardless of the flush instruction (CLFLUSHOPT or CLWB) or the array size (1 GB, or 1 MB so that it fits entirely in cache), the performance is the same, which clearly indicates that CLWB always evicts the cache line. The sequential access pattern in particular runs much faster without any CLFLUSHOPT or CLWB, but with either of them it is about twice as slow as the uniform one, apparently because the flush evicts the cache line that the subsequent update needs.
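To make the described test concrete, here is a minimal sketch of its inner loop in C. It assumes x86-64 and GCC or Clang, does not use PMDK, and the names flush_kind, flush_supported and run_updates are hypothetical. Each step updates one 4-byte integer, optionally issues CLWB or CLFLUSHOPT on it, and then an SFENCE. The whole run is timed with __rdtsc(). CPUID is checked first so that the flush instructions are only executed where the CPU supports them.

```c
/* Hypothetical sketch (x86-64, GCC or Clang): update one 4-byte integer,
   optionally write back/flush its cache line, then SFENCE. */
#include <assert.h>
#include <cpuid.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

enum flush_kind { FLUSH_NONE, FLUSH_CLWB, FLUSH_CLFLUSHOPT };

/* CPUID leaf 7, subleaf 0: EBX bit 23 = CLFLUSHOPT, EBX bit 24 = CLWB. */
static int flush_supported(enum flush_kind k) {
    unsigned a, b, c, d;
    if (k == FLUSH_NONE) return 1;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d)) return 0;
    return k == FLUSH_CLWB ? (int)((b >> 24) & 1u) : (int)((b >> 23) & 1u);
}

/* Wrappers compiled for the extension only; call them after flush_supported(). */
__attribute__((target("clwb")))
static void flush_clwb(const void *p) { _mm_clwb(p); }

__attribute__((target("clflushopt")))
static void flush_clflushopt(const void *p) { _mm_clflushopt(p); }

/* Perform `steps` updates on arr[0..n) and return elapsed TSC ticks.
   uniform == 0: sequential indices; uniform != 0: pseudo-random indices. */
static uint64_t run_updates(uint32_t *arr, size_t n, size_t steps,
                            int uniform, enum flush_kind k) {
    assert(arr != NULL && n > 0);
    assert(flush_supported(k));
    uint64_t x = 88172645463325252ULL; /* xorshift64 state */
    uint64_t t0 = __rdtsc();
    for (size_t i = 0; i < steps; i++) {
        size_t idx;
        if (uniform) {
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
            idx = (size_t)(x % n);
        } else {
            idx = i % n;
        }
        arr[idx] += 1;                   /* the 4-byte update */
        if (k == FLUSH_CLWB)
            flush_clwb(&arr[idx]);       /* write back; line may stay cached */
        else if (k == FLUSH_CLFLUSHOPT)
            flush_clflushopt(&arr[idx]); /* write back and invalidate */
        _mm_sfence();                    /* SFENCE after each update, as in the test */
    }
    return __rdtsc() - t0;
}
```

Calling run_updates from a small driver for a 1 MB and a 1 GB array, for both access patterns and all three flush kinds, gives the comparison matrix described above. If CLWB retained the line, the 1 MB sequential CLWB run should land noticeably closer to the FLUSH_NONE run than to the CLFLUSHOPT run.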
The problem is not PMDK-specific: a small benchmark with rdtsc timing also indicates that CLWB behaves no differently from CLFLUSHOPT.
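A more direct check than throughput is to time a single reload of a line immediately after the flush. The sketch below is hypothetical (x86-64, GCC or Clang; the names probe_kind, reload_ticks and median_reload_ticks are illustrative, not from any library). It dirties one 64-byte line, issues CLWB, CLFLUSHOPT or nothing, waits with MFENCE, and times one load between two RDTSCP reads. It then reports the median over many trials to damp interrupt noise.

```c
/* Hypothetical sketch (x86-64, GCC or Clang): time one reload of a cache
   line right after CLWB, CLFLUSHOPT, or no flush. */
#include <assert.h>
#include <cpuid.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

enum probe_kind { PROBE_NONE, PROBE_CLWB, PROBE_CLFLUSHOPT };

/* CPUID leaf 7, subleaf 0: EBX bit 23 = CLFLUSHOPT, EBX bit 24 = CLWB. */
static int probe_supported(enum probe_kind k) {
    unsigned a, b, c, d;
    if (k == PROBE_NONE) return 1;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d)) return 0;
    return k == PROBE_CLWB ? (int)((b >> 24) & 1u) : (int)((b >> 23) & 1u);
}

__attribute__((target("clwb")))
static void probe_clwb(const void *p) { _mm_clwb(p); }

__attribute__((target("clflushopt")))
static void probe_clflushopt(const void *p) { _mm_clflushopt(p); }

/* Dirty the line, optionally flush it, fence, then time a single load. */
static uint64_t reload_ticks(volatile uint64_t *line, enum probe_kind k) {
    unsigned aux;
    *line += 1;                   /* make the line dirty */
    if (k == PROBE_CLWB)
        probe_clwb((const void *)line);
    else if (k == PROBE_CLFLUSHOPT)
        probe_clflushopt((const void *)line);
    _mm_mfence();                 /* order the store and the flush first */
    uint64_t t0 = __rdtscp(&aux);
    _mm_lfence();                 /* keep the load after the first timestamp */
    uint64_t v = *line;           /* the timed reload */
    uint64_t t1 = __rdtscp(&aux); /* RDTSCP waits for the load to complete */
    _mm_lfence();
    (void)v;
    return t1 - t0;
}

/* Median of `trials` reload timings; returns 0 if allocation fails. */
static uint64_t median_reload_ticks(enum probe_kind k, int trials) {
    uint64_t samples[1024];
    assert(trials > 0 && trials <= 1024);
    assert(probe_supported(k));
    uint64_t *line = aligned_alloc(64, 64); /* exactly one 64-byte line */
    if (line == NULL) return 0;
    *line = 0;
    for (int i = 0; i < trials; i++)
        samples[i] = reload_ticks(line, k);
    free(line);
    for (int i = 1; i < trials; i++) {      /* insertion sort */
        uint64_t s = samples[i];
        int j = i - 1;
        while (j >= 0 && samples[j] > s) {
            samples[j + 1] = samples[j];
            j--;
        }
        samples[j + 1] = s;
    }
    return samples[trials / 2];
}
```

Comparing median_reload_ticks for the three kinds (e.g., with 1001 trials each) is the interesting part. If CLWB kept the line cached, its median should be close to the PROBE_NONE case (a cache hit). If it evicts the line, the median should be close to the PROBE_CLFLUSHOPT case. Absolute tick counts depend on the machine, so only the relative comparison is meaningful.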
Thanks in advance,
Thank you very much for your prompt reply, freeky1337.
After reviewing the information you provided, we recommend opening a thread with the Intel® Developer Zone team (https://software.intel.com/en-us/support), who can assist you further with this inquiry.