Here is an observation I have made. Can you help me explain it?
Setup 1: A process pinned to core 3 on package 0 constantly writes to shared memory allocated on the local node (0), the node attached to that package. Another process on core 1 of the same package (0), also attached to node 0, constantly reads it. The read latency I measure is around 70 clock cycles.
Setup 2: A process pinned to core 2 on package 1 constantly writes to shared memory allocated on the remote node (0). Another process reads it from core 1 on package 0, local to node 0 where the shared memory lives. In this case the reader completes each read in about 3 cycles (within statistical error).
What is the explanation for the reader incurring a lower penalty when the writer runs on a remote package than when the writer runs on another core of the same local package?
My first guess is that you have messed up the test. From those numbers, it looks like case 1 is actually a remote memory access and case 2 is actually a local memory access (maybe even same-core access).
Are you using Linux or Windows? How are you identifying nodes/cores/CPUs? How are you pinning to a node/core/CPU? When you write the memory, are you writing a different value each time? Are you verifying that the reader actually sees the values being written by the writer? If there are no locks, is the reader in case 2 just getting a subset of the written values? Do you know how frequently the writer thread and the reader thread are running? Why are you running these tests?
Just the first 10 minutes of questions...