Here is an observation I have; can you help me explain it?
Setup 1: A process running on core 3 of package 0 constantly writes to shared memory allocated on the local node (node 0, attached to package 0). Another process on core 1 of the same package 0 constantly reads it. The read time I measure is around 70 clock cycles.
Setup 2: A process running on core 2 of package 1 constantly writes to the same shared memory, which is remote to it (still allocated on node 0). Another process on core 1 of package 0, local to node 0, constantly reads it. In this case the read takes about 3 cycles (within statistical error).
Why does the reader incur a smaller penalty reading this shared memory location when the updating process runs on the remote package than when it runs on another core of the local package?
Can we see your test code?
If I were to guess, Setup 1 is reading from RAM, whereas Setup 2 is reading from L1.
This seems to be reversed from what you would expect.
Are you timing reads without regard to whether the memory has actually changed?
If so, the remote writer in Setup 2 would have longer intervals between writes, causing fewer cache-line invalidations on the reader's socket, so most of your timed reads hit the local cache and return the same value multiple times.