A new algorithm of a scalable distributed sequential lock

aminer10 · ‎12-19-2014

Scalable distributed sequential lock

Scalable Distributed Sequential lock

Scalable Distributed Sequential lock version 1.11

Author: Amien Moulay Ramdane.

Description:

This scalable distributed sequential lock was invented by Amine Moulay Ramdane, and it combines the characteristics of a distributed reader-writer lock with the characteristics of a sequential lock , so it is a clever hybrid reader-writer lock that is more powerful than the the Dmitry Vyukov's distributed reader-writer mutex , cause the Dmitry Vyukov's distributed reader-writer lock will become slower and slower on the writer side with more and more cores because it transfers too many cache-lines between cores on the writer side, so my invention that is my scalable distributed sequential lock has eliminated this weakness of the Dmitry Vyukov's distributed reader-writer mutex, so that the writers throughput has become faster and very fast, and my scalable distributed sequential lock elminates the weaknesses of the Seqlock (sequential lock) that is "livelock" and "starvation" even when there is more writers, so my scalable distributed sequential lock is a powerful sychronization mechanism that can replace RCU and that can replace Seqlock and that can replace Dmitry Vyukov's distributed reader-writer mutex. And if you look at my algorithm you will notice that on the reader side it looks like a sequential lock, but i have added a variable called fcount6^.fcount6 that allows the readers to switch between a sequential lock mechanism and a distributed reader-writer mechanism, so on the writer side you will notice that on every "number of cores" calls to WLock(), my scalable distributed sequential lock switches to a distributed reader-writer lock mechanism and executes the writer in this mode for a one time and for the rest of the time it will switch to a sequential lock mode , this allows my scalable distributed sequential lock to minimize efficiently cache-line transfers on the writer side, this is why my scalable distributed sequential lock beats Dmitry Vyukov distributed reader-writer mutex and it beats Seqlock.

Please look at the "test.pas" example that i have included and you will notice that the reader side have to be used in a transactional mode with a repeat-until loop, like this:

repeat

t.rlock(id1,id2,id3);

until t.runlock(id1,id2,id3);

it is like a transactional mode of a Seqlock and id1,id2 and id3 must be of type "long" that must be imported from the drwlock library. The writer side is easy to program , please look at the "test.pas" example and you will understand how to use my scalable distributed sequential lock.

My scalable distributed sequential lock can be used from accross processes and threads.

My new invention called distributed sequential lock beats the following algorithm, it is more scalable and faster than the following algorithm of Dmitry Vyukov called Distributed reader-writer mutex ( i have just explained to you why ..):

http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex

And my new invention called distributed \ sequential lock beats also Seqlock (i have just explained to you why ...), read here about it:

http://en.wikipedia.org/wiki/Seqlock

And my new invention called distributed sequential lock can replace RCU cause it is a powerful synchronization mechanism.

I will explain my new algorithm of a scalable distributed sequential lock that is very powerful...

I have arrived at my new algorithm by also reading many PhD papers and by looking first at the weakness of the following algorithm of Dmitry Vyukov's Reader-writer mutex:

http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex

look at the writer side of the Dmitry Vykov's algorithm , it is doing with every rwlock in each corresponding core:

for (i = 0; i != mtx->proc_count; i += 1)

pthread_rwlock_wrlock(&mtx-cell.mtx);

but this render this algorithm inefficient for the writer side , cause

every pthread_rwlock_wrlock() call transfer many cache-lines betwenn cores, so this will generate too much cache line transfers between cores when each writer executes distr_rw_mutex_wrlock() and this will become worse and worse when you add more and more cores, i mean this will tranfer more and more cache-lines between cores with more and more cores... so this will render the writer side throughput slow and this is the weakness of this algorithm...

So this is why i have used a sequential lock mechanism in the write side of my new algorithm so that to minimize efficiently the cache-lines transfers, so if you look at my algorithm, i am doing this inside the WLock() function on the writer side:

if (FCount6^.fcount6 mod GetSystemThreadCount)=0

then FCount5^.fcount5:=1;

FCount6^.fcount6 is a counter and i have applied to it a modulo of GetSystemThreadCount(), and GetSystemThreadCount() will return the number of cores, so every "number of cores" calls to WLock() , my new algorithm switches from a sequential lock to a distributed lock so that to eliminate the "livelock" and "starvation" weaknesses of Seqlock, cause Seqlock is a lockfree mechanism that can starve threads, and that can livelock when there is more writers , cause Seqlock gives preference to writers threads over reader threads, but my new invention of a scalable distributed sequential lock eliminates "livelock" and "starvation" of Seqlock when there is more writers cause it switches to a distributed lock on every "number of cores" calls to the WLock() method , so this is why my new invention beats Seqlock, and my new invention beats also the Dmitry's Reader-writer mutex cause it switches to a distributed lock on every number of cores calls to WLock(), so in the remaining calls it switches to a sequential lock that render the writer side faster cause it generate much less cache-lines tranfers than the Dmitry Vyukov's distributed reader-writer mutex. On the reader side of my algorithm i am using a Sequential lock mechanism. And i think that since my new algorithm of a scalable distributed sequential lock is very powerful, so i think it can even replace RCU.

Here is my proof:

Look at the source code, when Fcount5^.fcount5 equal 0 in the writer side , the reader side will be run in a Seqlock mode, , i don't need to proove Seqlock because my Seqlock algorithm is correct, just compare it to the other Seqlock algorithms and you will notice it, what i need to proof is when Fcount6^.fcount6 modulo "the number of cores" is equal to 0 , that means that we are in a distributed mode, so when we are in a distributed mode, the reader side of my algorithm will enter the distributed reader-writer lock or it will wait for the distributed reader-writer lock to exit, if the reader side enters the distributed reader-writer lock , if Fcount6^.fcount6 has not changed on the RUnlock(), the reader thread will exit with a "true" value, if Fcount6^.fcount6 has changed, the RUnlock() method will catch it and it will rollback with Seqlock mechanism, now if Fcount6^.fcount6 modulo "the number of cores" is equal to 0 on RLock() , and the reader side waits for the writer side to enter the distributed reader-writer lock, the reader side after that will enter the distributed reader-writer lock and if Fcount6^.fcount6 modulo "the number of cores" has changed on RUnlock(), the reader side will rollback in a Seqlock mode, if not, RUnlock() will exit with the value of "true". So i think my algorithm is correct.

About the sequential consistency of my scalable distributed sequential lock, my algorithm works on x86 architecture and i think my algorithm is correct cause look at the source code of the WLock() method, since i am using a Ticket spinlock with a proportional backoff on the writer side, the Ticket spinlock is using a "lock add" assembler instruction to increment a counter in the Enter() method of the ticket spinlock , and this "lock add" assembler instruction is a barrier for stores and loads on x86, and the stores of Fcount6^.fcount6 and FCount5^.fcount5 are not reordred with the stores of the writer section, so the WLock() method is sequential consistent and correct, now look at the WUnlock() , we don't need an "sfence" cause stores on WUnlock() are not reordered with older stores on x86 , so WUnlock() method is sequential consistent and correct, now look at the RLock() method, the loads inside RLock() method are not reordered with the loads of the reader section , and on RUnlock(), the loads of RUnlock() are not reordered with older loads of the reader section , so all in all i think my algorithm is sequential consistent and correct on x86. So be confident cause i have reasoned correctly and i think my algorithm is correct and it is a powerful synchronization mechanism that can replace RCU and that can replace Seqlock cause it beats Seqlock.

You can download my scalable distributed sequential lock

https://sites.google.com/site/aminer68

Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/

Operating Systems: Windows, Mac OSX , Linux on x86.

Required FPC switches: -O3 -Sd -dFPC -dFreePascal

-Sd for delphi mode....

Required Delphi switches: -$H+ -DDelphi

{$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems

{$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems