topic Cache-coherence traffic in Intel® Moderncode for Parallel Architectures

Cache-coherence traffic

aminer10 — Fri, 11 Oct 2013 15:19:21 GMT

Hello,

I have come to an interresting subject, in computer science
we calculate the complexity to give an idea at how good
or bad is the algorithm, it's the same with Locks you have
to do some calculation fir Locks algorithms to give an idea at how good or bad is the Lock is with cache-coherence traffic, so follow with me
please, if you take a look at the source code of my
scalable distributed fair Lock (you can download the source
code at: http://pages.videotron.com/aminer/)

You will read inside the LW_DFLOCK.pas right on the
"procedure TDFLOCK.Enter;" you will read this:

==

if ((FCount3^.FCount3=2) and CAS(FCount2.FCount2,1,0))
then
begin
myobj^.myid:=-1;
break;
end;"

==

If you have noticed if FCount2.FCount2 have changed this
will generate N (N is the number of threads) cache lines misses
and cache lines transfers, but you have to be smart please ,
in a contention scenario since we are looping around:

"if ((FCount3^.FCount3=2) and (FCount1^[myid].FCount1=myobj^.count)) then break;"

and

FCount1^[myid].FCount1 is on the a local cache

that means that the cache-coherence traffic will be reduced to 1
cache misse and 1 cache line tranfer every time FCount1^[myid].FCount1 have changed , so this is better than
the cache-coherence traffic of 1+2+3...N = (N^2+N)/2 or even worse the N+N+N+...N of the spinlock with a backoff or the Ticket spinlock,
other than that my scalable distributed Lock is Fair and it avoids
starvation.

Thank you,
Amine Moulay Ramdane.

I correct some typos...

aminer10 — Fri, 11 Oct 2013 15:22:43 GMT

I correct some typos...

Hello,

I have come to an interresting subject, in computer science
we calculate the complexity to give an idea at how good
or bad is the algorithm, it's the same with Locks you have
to do some calculations with Locks algorithms to give an idea at how good or bad the Lock is with cache-coherence traffic, so follow with me
please, if you take a look at the source code of my
scalable distributed fair Lock (you can download the source
code at: http://pages.videotron.com/aminer/)

You will read inside the LW_DFLOCK.pas right on the
"procedure TDFLOCK.Enter;" you will read this:

==

if ((FCount3^.FCount3=2) and CAS(FCount2.FCount2,1,0))
then
begin
myobj^.myid:=-1;
break;
end;"

==

If you have noticed if FCount2.FCount2 have changed this
will generate N (N is the number of threads) cache lines misses
and cache lines transfers, but you have to be smart please ,
in a contention scenario since we are looping around:

"if ((FCount3^.FCount3=2) and (FCount1^[myid].FCount1=myobj^.count)) then break;"

and

FCount1^[myid].FCount1 is on the a local cache

that means that the cache-coherence traffic will be reduced to 1
cache misse and 1 cache line tranfer every time FCount1^[myid].FCount1 have changed , so this is better than
the cache-coherence traffic of 1+2+3...N = (N^2+N)/2 or even worse the N+N+N+...N of the spinlock with a backoff or the Ticket spinlock,
other than that my scalable distributed Lock is Fair and it avoids
starvation.

Thank you,
Amine Moulay Ramdane.