Community
cancel
Showing results for 
Search instead for 
Did you mean: 
aminer10
Beginner
33 Views

A question about scalability ...


Hello,

I didn't know where to ask this question, so i decided to ask here..

I just read yesterday the following page, look at the the USL
(Universal Law of Computational Scalability) of Dr. Gunther,
he wrote this: ( see http://en.wikipedia.org/wiki/Neil_J._Gunther)

--------------------------

The relative capacity C(N) of a computational platform is given by:

C(N) = N
-------------------
1 + (N - 1) + N (N - 1)

where N represents either the number of physical processors
in the hardware configuration or the number of users driving the
software application. The parameters and represent respectively
the levels of contention (e.g., queueing for shared resources) and
coherency delay (i.e., latency for data to become consistent) in the
system. The parameter also quantifies the retrograde throughput seen
in many stress tests but not accounted for in either Amdahl's law or
event-based simulations.

---------------

His website: http://www.perfdynamics.com/

If you read carefully , you will see that Dr. Gunther is using this
model to predict scalability after he simulates a relatively small
number
of vusers in LoadRunner ( because of licensing costs, it's cost-
effectiveness)
and after that he finds the coefficients of the 2nd-degree polynomial
(quadratic equation) and then transform those coefficients back to the
USL parameters using the = b - a and = a.

And than he is extrapolating with the USL model to higher loads
to predict scalability.

He is also applying the model to webservers with heterogeneous
workloads. like in the following page:

http://perfdynamics.blogspot.com/2009/04/assessing-usl-scalability-wi...

Now my question follows:

Suppose we have obtained a small number of measured load-points
with Loadrunner or others tools, and we calculated the USL equation to
predict scalability of a webserver , how the USL model can predict if
the scalability/performance is limited by the network bandwidth and
not the server ?

Can it give this information ?

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

0 Kudos
42 Replies
aminer10
Beginner
26 Views


An important answer that i wrote (from comp.programming.threads newsgroup)...

On Sep 6, 4:12 pm, Scott Meyers <NeverR...@aristeia.com> wrote:
> As the subject line suggests (I hope), I'm missing something
>really basic about lock-free algorithms. In particular, why
>they can't use locks.


In lock-free algorithms we reduce the duration for wich
locks are held to the miniumum, and we try to keep the
serial part in the Amdahl equation to the *MINIMUM*, this
is good for scalability.

And as we know, there are three ways to reduce lock contention:

1- Reduce the duration for which locks are held
2- Reduce the frequency with which locks are requested
or
3- Replace exclusive locks with coordination mechanisms that
permit greater concurrency.

and since in lock-free algorithms we are reducing the
duration for wich lock are held to the minimum, this
will reduce the chance of lock contention...

When we have lock contention, a factor called mean
waiting time will be added to the Amdhal equation
like this: Speed = 1/(S+mean waiting time+(P/N))
it's the mean waiting time of the threads that are
waiting on the lock , and when we reduce the duration
for wich locks are held this will lower the service time:
hence the waiting time will be smaller and this
will higher the speed in the Amdhal equation.


Look at how my Lock-free ParallelQueue is scaling:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm


If you loOK at this page, i wrote this:

[1] If there is LESS contention THEN the algorithm
will scale better. Due to the fact that S (the serial part)
become smaller with less contention , and as N becomes bigger,
the result - the speed of the program/algorithm... -
of the Amdahl's equation 1/(S+(P/N)) become bigger.


[2] If there is false sharing THEN the algorithm
will not scale well. Due to the fact that S (the serial part)
of the Amdahl's equation 1/(S+(P/N)) will becomes bigger
with false sharing, so, this is not good for scalability.

As you have notice, i am including the mean waiting time
factor as if it were a serial part, cause in Amdhal equation
there is only two parts S(serial part) and P(parallel part),
but the correct equation is: Speed = 1/(S+mean waiting time+(P/N)).


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

On Sep 6, 4:12 pm, Scott Meyers <NeverR...@aristeia.com> wrote:
> As the subject line suggests (I hope), I'm missing something really basic about
> lock-free algorithms. In particular, why they can't use locks.
>
> Wikipedia and others define lock-free algorithms as those where at least one
> thread in a process is guaranteed to make progress, regardless of what the other
> threads are doing. So suppose I bundle an array with a mutex to make a
> "synchronized array," where each operation on the array grabs the mutex, does
> some work on the array (but does not call into code outside the array, i.e.,
> doesn't invoke any code that could itself try to grab the mutex), then releases
> the mutex. This uses a lock, but inside the "synchronized array" critical
> section, even if all the other threads in the process are stalled, the thread in
> the critical section is making progress.
>
> As I said, I'm missing something fundamental. I'd be grateful if somebody would
> point out what it is.
>
> Thanks,
>
> Scott
>
> --
> * C++ and Beyond: Meyers, Sutter, & Alexandrescu, Oct. 24-27 near Seattle
> (http://cppandbeyond.com/)
> * License my training materials for commercial (http://tinyurl.com/yfzvkp9) or
> personal use (http://tinyurl.com/yl5ka5p).

aminer10
Beginner
26 Views



Hello again,


I will add that The N. J. Gunther Universal Scalability Law states

The relative capacity C(N) is a normalized throughput given by:
C(N) = N

1 + (N 1) + N (N 1)

And as you will notice in the following graphs in:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

there is anegative return on investment when > 0 andthe
coherency paramater > 0, thiscauses the throughput
to go retrograde faster in the lock-free RingBuffer(in both the
push and the pop) and in the lock-free flqueue (in the pop side),
but in my lock-free ParallelQueue itkeep scaling very well...



regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/


aminer10
Beginner
26 Views

Scott Meyers wrote:
> I agree, but my question is about the *definition*
> of being lock-free and how that definition precludes
> the use of locks. I already know why lock-free is
> good for scalability. What I don't know is how the
> definition of lock-free prohibits using locks.


I don't understand ? in lock-free algorithms under x86
we use for example a lock instruction in assembler,
and it's in fact a lock that we are using.


It's the lock-free algorithm that is different ,
you retry when the transaction( the lock..) fails,
like in TM.

Regards,
Amine Moulay Ramdane.

aminer10
Beginner
26 Views

On Sep 6, 5:36 pm, Scott Meyers <NeverR...@aristeia.com> wrote:
> On 9/6/2010 2:27 PM, aminer wrote:
>
> > I don't understand ? in lock-free algorithms under x86
>
> "Lock-free algorithm" is a technical term. Its definition is independent of the
> language you are programming in and the hardware you are running on.
>
> Note that the definition of "lock-free" makes no mention of locks at all, so
> it's not something as simple as "lock-free is defined to use no locks". This,
> for example, is Wikipedia's definition
> (http://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom):
>
> An algorithm is lock-free if it satisfies that when the program threads are
> run sufficiently long at least one of the threads make progress (for some
> sensible definition of progress).
>
> How does this definition preclude the use of e.g. a mutex in the algorithm?


You are confusing/mixing mutex with lock, you were speaking
about lock, and as i said before lock-free algorithms does
in fact use locks , like in x86 lock-free algorithms are
using a lock instruction in assembler , in SMR also they
use locks etc. the Mutex also does use locks in its code internally...

But as i said before , it's the lock-free algorithm
that is different , you retry when the transaction( the lock..)
fails, like in TM, and we call that Optimistic locking.


Regards,
Amine Mouly Ramdane.




jimdempseyatthecove
Black Belt
26 Views

In"Lock Free" coding you are only permitted to hold an uninterruptable atomic lock. Example:

LOCK; add dword ptr [eax], 1

In the above sequence (when eax is a legitimate, aligned,address in R/W memory)the LOCK prefex is used to permit Read/Modify/Write instruction to execute in an atomic manner. More important to "Lock Free" - the above instruction cannot be interrupted. Therefore, "Lock Free" means: no lock can be pre-empted.

While a "Lock" is a software lock (typically using LOCK prefix to obtain the lock) and once the lock is obtained, interruptable code sequence follows. The problem with { lock, code, unlock} comes when the system pre-empts the code while holding the lock - all threads waiting for the lock wait for the duration of the preemption (plus remaining code time by thread holding lock).

Jim Dempsey
aminer10
Beginner
26 Views


Hello again,

Look at the following page, Dr Gunther is applying
the USL model to webservers:

http://perfdynamics.blogspot.com/2009/04/assessing-usl-scalability-with-mixed.html


When we are modeling webservers , we have to include
the network&tcp/ip in our newtwork queuig model
(that comprises the queue of of the computer server side) ,
and since the service in the computer server is comprised of
multiple services (when we are using htmls , databases etc.)
the network&tcp/ip queue will not be markovian in the service
side, and we have to model the network&tcpip queue as an M/G/1
and this will complicate the mathematical analytic modeling...

Now as i said before:

Suppose we have obtained a small number of measured load-points
with Loadrunner or others tools, and we calculated the USL equation
to predict scalability of a webserver , how the USL model can predict if
the scalability/performance is limited by the network bandwidth and
not the server ? I think USL can not predict this.

So, i think the best way is to use a webserver stress tool
like http://fwptt.sourceforge.net/

You can even test the webserver with an heterogeneous
workloads by starting multiple fwtpp processes, and
you should increase the number of threads to 5 and after
that to 10 threads, 15 ... and so on until the webserver
applications stops responding propperly(and this will inform
on the maximum number of users that you can have in the same time...)
and as you are stress testing you can even see (by testing/measuring
it) if the network bandwidth is not the bottleneck... and this can
not be done by the USL model.


Regards,
Amine Moulay Ramdane


aminer10
Beginner
26 Views

I wrote:
> When we are modeling webservers , we have to include
> the network&tcp/ip in our network queuig model
> (that comprises the queue of the computer server side) ,
> and since the service in the computer server is comprised of
> multiple services (when we are using htmls , databases etc.)
> the network&tcp/ip queue will not be markovian in the service
> side, and we have to model the network&tcpip queue as an M/G/1
> and this will complicate the mathematical analytic modeling...


We already know that to satisfy a Poisson process we must
have that N(t1)- N(t0), N(t2)- N(t1) etc. that means the
counting increments must be independant.

We have the following relation between
the Poisson law and Exponential law:

the expected value E(X exponential) = 1 / E(X poisson)

and if the arrival is poissonian than the interarricals are exponential..

Now what about a webserver ?

I have read the following paper:

http://docs.google.com/viewer?a=v&q=cache:JFYCp_dSPP4J:citeseerx.ist.psu.edu/viewdoc/download%3Fdoi%...
It says that a simple model with M/M/1/k with FCFS discipline
can predict webserver performance quite well..

Hence, i think we can model our webserver over internet
with 3 queues connected as a Jackson Network like this

An M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue

and we have the following:

Ni: number of jobs in each queue
Ui: utilization of each queue

Ni = Ui / (1-Ui)

Adding all the Ni in each individual queue will give the
average number of jobs in the entire queuing network.

After that we apply the Little formula:

A: network arrival rate
T: average response time

N = A*T => T = N / A

And after that from the mathematical analytical equation
we can simmulate the jackson queuing network that model our webservers...

Now there is still an important question that i have:

Our analytical jackson network model can give us insight
on the webservers behavivior and validate our simulation
with fwptt.. but the difficulty that i find in practice
is that: suppose we have found the right parameters
that we want to choose, like for example the service rate of
the M/M/1 Server Queue , how , from this service rate, can
we buy the right computer ?

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

aminer10
Beginner
26 Views



I wrote:
>We already know that to satisfy a Poisson process we must
>have that N(t1)- N(t0), N(t2)- N(t1) etc.

must be independant.

>that means the
>counting increments must be independant
aminer10
Beginner
26 Views


Hello again,


We say in OR that:

"Understanding the behavior of a system is what Queueing Theory
and Littles Law is all about. But, managing a Queue requires not
just understanding the behavior of a system, but also in-depth
knowledge of how to improve a system - improving both aspects
of Queueing will mean a better, more efficient and cost-effective
system and, more importantly, a much better customer experience."

I wrote before that:

>It says that a simple model with M/M/1/k with FCFS discipline
>can predict webserver performance quite well..

>Hence, i think we can model our webserver over internet
>with 3 queues connected as a Jackson Network like this

>An M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue

>and we have the following:

>Ni: number of jobs in each queue
>Ui: utilization of each queue

>Ni = Ui / (1-Ui)

>Adding all the Ni in each individual queue will give the
>average number of jobs in the entire queuing network.

>After that we apply the Little formula:

>A: network arrival rate
>T: average response time

>N = A*T => T = N / A

>And after that from the mathematical analytical equation
>we can simulate the jackson queuing


As you have noticed , this mathematical model of this jackson network
does in fact take into account theM/M/1 Network queue node ,the
USL model can not do this... and with this performance data
from the mathematical analytical model simulationwecanfor example
validate the performance data of the fwptt stress webserver simulation..

But you have to take into account worst cases and the peak trafficloads...

Let for examplewe have a a webserver hostinghtml pages
and it is receiving 1000000 HTTP operations
per day with an average file size of 10 KB.

What would be the network bandwidth required for this website
considering peak traffic if the peak traffic load from past observations
was four times greater than average loads?

Required bandwidth is solved by the following equation:

HTTP op/sec x average file size or

1000000 HTTP ops per day =1000000/24 = 41,667 op/hour =
41,667/3600= 11.6 HTTP ops/sec

The needed bandwidth is

11.6 HTTP ops/sec X 10 KB/HTTP op = 116 KB/sec = 928 Kbps.

If we assume a protocol overhead of 20% then the actual throughput
required is 928 Kbps X 1.2 = 1,114 Kbps.

However if peak loads as we say before is as much as
4 times greater, the bandwidth required to handle spikes

it would be 4 X 1,114 Kbps = 4.456 Mbps.

So you have to think also about the cost of this line...



Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/



aminer10
Beginner
26 Views


I will add the following:

We have to be smarter than that in OR (operational research)...

As you have noticed i said that:

"As you have noticed , this mathematical model of
this jackson network does in fact take into account
the M/M/1 Network queue node , the USL model can not
do this... and with this performance data from the mathematical
analytical model simulation we can for example validate
the performance data of the fwptt stress webserver simulation.."


and i said that:

"Hence, i think we can model our webserver over internet
with 3 queues connected as a Jackson Network like this
An M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue"

And of course on Capacity Planning for Enterprise Datacenters
and Websites , you can mirror many computer servers and load
balance between them with a software... to make the system much
FASTER, and this will be modeled as a jackson network like this:

An M/M/n Server Queue -> M/M/1 Network queue -> M/M/1 Client queue"


But there is still an important thing , as i have showed before
on my calculations:

"However if peak loads as we say before is as much as
4 times greater, the bandwidth required to handle spikes
it would be 4 X 1,114 Kbps = 4.456 Mbps.

So you have to think also about the cost of this line..."


I think that you have to take into account the knee utilisation
of your M/M/n Servers Queues, if for example the number of computer
servers is 8 and the Knee is 74% that means that in our previous
example the bandwidth must equal to: 174% X 4.456 Mbps = 7.753 Mbps


And we have to take into account the cost of the line ...


So be smarter !

Regards,
Amine Moulay Ramadne.
http://pages.videotron.com/aminer/

aminer10
Beginner
26 Views


I wrote:
> I think that you have to take into account the knee utilisation
> of your M/M/n Servers Queues, if for example the number of computer
> servers is 8 and the Knee is 74% that means that in our previous
> example the bandwidth must equal to: 174% X 4.456 Mbps = 7.753 Mbps


I mean 126% X 4.456 Mbps = 5.614 Mbps.

Cause as you know over this Knee utilisation of 74%
for 8 servers, the curve of the waiting time does grow quickly ..


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

aminer10
Beginner
26 Views


Hello,


I have cleaned all my previous posts , please read again...


I didn't know where to ask this question, so i decided to ask here..


I just read yesterday the following page, look at the the USL
(Universal Law of Computational Scalability) of Dr. Gunther,
he wrote this: ( see http://en.wikipedia.org/wiki/Neil_J._Gunther )

--------------------------


The relative capacity C(N) of a computational platform is given by:


C(N) = N
-------------------
1 + (N - 1) + N (N - 1)


where N represents either the number of physical processors
in the hardware configuration or the number of users driving the
software application. The parameters and represent respectively
the levels of contention (e.g., queueing for shared resources) and
coherency delay (i.e., latency for data to become consistent) in the
system. The parameter also quantifies the retrograde throughput
seen
in many stress tests but not accounted for in either Amdahl's law or
event-based simulations.

----------


His website: http://www.perfdynamics.com/


If you read carefully , you will see that Dr. Gunther is using this
model to predict scalability after he simulates a relatively small
number of vusers in LoadRunner ( because of licensing costs, it's
cost-effective) and after that he finds the coefficients of the
2nd-degree polynomial (quadratic equation) and then transform
those coefficients back to the USL parameters using the = b - a
and = a.


And then he is extrapolating with the USL model to higher loads
to predict scalability.


He is also applying the model to webservers with heterogeneous
workloads. like in the following page:

http://perfdynamics.blogspot.com/2009/04/assessing-usl-scalability-with-mixed.html


Now my question follows:


Suppose we have obtained a small number of measured load-points
with Loadrunner or others tools, and we calculated the USL equation
to predict scalability of a webserver , how the USL model can predict if
the scalability/performance is limited by the network bandwidth and
not the server ? I think USL can not predict this.


When we are modeling webservers , we have to include
the network&tcp/ip in our network queuig model
(that comprises the queue of the computer server side) ,
and since the service in the computer server is comprised of
multiple services (when we are using htmls , databases etc.)
the network&tcp/ip queue will not be markovian in the service
side, and we have to model the network&tcpip queue as an M/G/1
and this will complicate the mathematical analytic modeling...

So, i think the best way is to use a webserver stress tool
like http://fwptt.sourceforge.net/

You can even test the webserver with an heterogeneous
workloads by starting multiple fwtpp processes, and
you should increase the number of threads to 5 and after
that to 10 threads, 15 ... and so on until the webserver
applications stops responding propperly(and this will inform
on the maximum number of users that you can have in the same time...)
and as you are stress testing you can even see (by testing/measuring
it) if the network bandwidth is not the bottleneck... and this can
not be done by the USL model.


We already know that to satisfy a Poisson process we must
have that N(t1)- N(t0), N(t2)- N(t1) etc. must be independant
that means the counting increments must be independant.

We have the following relation between the Poisson law
and Exponential law:


the expected value E(X exponential) = 1 / E(X poisson)


and if the arrival is poissonian then the interarrivals are
exponential..


Now what about a webserver ?


I have read the following paper:


http://docs.google.com/viewer?a=v&q=cache:JFYCp_dSPP4J:citeseerx.ist.psu.edu/viewdoc/download%3Fdoi%...


And it says that a simple model with M/M/1/k with FCFS discipline
can predict webserver performance quite well..

Hence, i think we can model our webserver over internet
with 3 queues connected as a Jackson Network like this


A -> M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue -> A

A: is the arrival rate

and we have the following:


Ni: number of jobs in each queue
Ui: utilization of each queue

Ni = Ui / (1-Ui)


Adding all the Ni in each individual queue will give the
average number of jobs in the entire queuing network.


After that we apply the Little formula:


A: network arrival rate
T: average response time


N = A*T => T = N / A


And after that from the mathematical analytical equation
we can simulate the jackson queuing network that model our
webservers...


Now there is still an important question that i have:


Our analytical jackson network model can give us insight
on the webservers behavivior.. but the difficulty that i find in
practice is that: suppose we have found the right parametters
that we want to choose, like for example the service rate of
the M/M/1 Server Queue , how , from this service rate, can
we buy the right computer that satisfies the service rate
that we want ?


We say in OR that:


"Understanding the behavior of a system is what Queueing Theory
and Littles Law is all about. But, managing a Queue requires not
just understanding the behavior of a system, but also in-depth
knowledge of how to improve a system - improving both aspects
of Queueing will mean a better, more efficient and cost-effective
system and, more importantly, a much better customer experience."


I wrote before that:

---

"It says that a simple model with M/M/1/k with FCFS discipline
can predict webserver performance quite well..
Hence, i think we can model our webserver over internet
with 3 queues connected as a Jackson Network like this
A -> M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue -> A
A: the arrival rate
and we have the following:
Ni: number of jobs in each queue
Ui: utilization of each queue
Ni = Ui / (1-Ui)
Adding all the Ni in each individual queue will give the
average number of jobs in the entire queuing network.
After that we apply the Little formula:
A: network arrival rate
T: average response time
N = A*T => T = N / A
And after that from the mathematical analytical equation
we can simulate the jackson queuing"

--


As you have noticed , this mathematical model of
this jackson network does in fact take into account
the M/M/1 Network queue node , the USL model can not
do this... and with this performance data from the mathematical
analytical model simulation we can for example validate
the performance data of the fwptt stress webserver simulation..

But you have to take into account worst cases and the
peak traffic loads...


Let for example we have a a webserver hosting html pages
and it is receiving 1000000 HTTP operations
per day with an average file size of 10 KB.


What would be the network bandwidth required for this website
considering peak traffic if the peak traffic load from past
observations was four times greater than average loads?


Required bandwidth is solved by the following equation:


HTTP op/sec x average file size or


1000000 HTTP ops per day =1000000/24 = 41,667 op/hour =
41,667/3600= 11.6 HTTP ops/sec


The needed bandwidth is


11.6 HTTP ops/sec X 10 KB/HTTP op = 116 KB/sec = 928 Kbps.


If we assume a protocol overhead of 20% then the actual throughput
required is 928 Kbps X 1.2 = 1,114 Kbps.


However if peak loads as we say before is as much as
4 times greater, the bandwidth required to handle spikes
would be 4 X 1,114 Kbps = 4.456 Mbps.


So you have to think also about the cost of this line...

I will add the following:

As you have noticed i said that:


"As you have noticed , this mathematical model of
this jackson network does in fact take into account
the M/M/1 Network queue node , the USL model can not
do this... and with this performance data from the mathematical
analytical model simulation we can for example validate
the performance data of the fwptt stress webserver simulation.."


and i said that:


"Hence, i think we can model our webserver over internet
with 3 queues connected as a Jackson Network like this
An M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue"


And of course on Capacity Planning for Enterprise Datacenters
and Websites , you can mirror many computer servers and load
balance between them with a software... to make the system much
FASTER, and this will be modeled as a jackson network like this:


A -> M/M/n Server Queue -> M/M/1 Network queue -> M/M/1 Client queue -> A

A: the arrival rate to the system"


But there is still an important thing , as i have showed before
on my calculations:


"However if peak loads as we say before is as much as
4 times greater, the bandwidth required to handle spikes
it would be 4 X 1,114 Kbps = 4.456 Mbps.

So you have to think also about the cost of this line..."


I think that you have also to take into account the knee utilisation
of your M/M/n Servers Queues, if for example the number of computer
servers is 8 and the Knee is 74% that means that in our previous
example the bandwidth must equal to:

126% X 4.456 Mbps = 5.614 Mbps.

Cause as you know, over this Knee of 74% for 8 servers
the curve of the waiting time does grow quickly ..


And we have to take into account the cost of the line ...


So be smarter !


Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/



aminer10
Beginner
26 Views


I wrote:
>Cause as you know, over this Knee of 74% for 8 servers

i mean: above this Knee of 74%...

>the curve of the waiting time does grow quickly



Regards,
Amine Moulay Ramdane.
aminer10
Beginner
26 Views


Hello,


My article about availibility and efficiency is on my website now,
(I have corrected some typos etc.)

Welcome: http://pages.videotron.com/aminer/efficiency_availibility.htm

Regards,
Amine Moulay Ramdane
http://pages.videotron.com/aminer/

aminer10
Beginner
26 Views


Hello,


If you look at benchmarks of my lock-free ParallelQueue :

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

You will notice that under contention lock-free Ringbuffer have a
retrograde throuput and lockfree flqueue also have a retrograde
throuput in the pop() side ...but my lock-free ParallelQueue
(that is using also lock striping) has a good scalability

But if we draw the graph of Amadahl's law it doesn't show any
retrograde throuput...

So , i was thinking more about this, so please follow
with me..

In lock-free algorithms we reduce the duration for wich
locks are held to the miniumum, and we try to keep the
serial part in the Amdahl equation to the *MINIMUM*, this
is good for scalability.

And as we know, there are three ways to reduce lock contention:


1- Reduce the duration for which locks are held
2- Reduce the frequency with which locks are requested
or
3- Replace exclusive locks with coordination mechanisms that
permit greater concurrency.


and since in lock-free algorithms we are reducing the
duration for wich lock are held to the minimum, this
will reduce the chance of lock contention...

So, under contention and in the *worst* case, threads will
be waiting for the locked section to be released , hence, if
we have n threads this imply that the time the threads will
be waiting for the locked section to be released is:

Average time spend inside the locked section * (1 + 2 +3 + 4 ... + n)

And 1+ 2 +3 +4 ... n is an arithmetic serie and is equal to:

(n * ( n + 1)) / 2

and

Limit (n -> infinite) of n * (n + 1) /2 = infinite

So, this mathematical term diverge when n -> infinite

So, i think i can rewrite the scalability equation to something like this,

Amine''s equation is:

Speed = 1 / (S * ( n * ( n + 1)) / 2 ) + (P / N)) (2)

S: serial part
P: parallel part
n: number of threads
N: number of cores

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput as in
lock-free Ringbuffer and lock-free flqueue ...

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/


aminer10
Beginner
26 Views


I correct some typos. ..


Hello,


If you look at benchmarks of my lock-free ParallelQueue :

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

You will notice that under contention lock-free Ringbuffer have a
retrograde throuput and lockfree flqueue also have a retrograde
throuput in the pop() side ...but my lock-free ParallelQueue
(that is using also lock striping) has a good scalability

But if we draw the graph of Amadahl's law it doesn't show any
retrograde throuput...

So , i was thinking more about this, so please follow
with me..

In lock-free algorithms we reduce the duration for wich
locks are held to the miniumum, and we try to keep the
serial part in the Amdahl equation to the *MINIMUM*, this
is good for scalability.

And as we know, there are three ways to reduce lock contention:


1- Reduce the duration for which locks are held
2- Reduce the frequency with which locks are requested
or
3- Replace exclusive locks with coordination mechanisms that
permit greater concurrency.


and since in lock-free algorithms we are reducing the
duration for wich lock are held to the minimum, this
will reduce the chance of lock contention...

So, under contention and in the *worst* case, threads will
be waitingfor the locked section to be released , hence, if we
have n threads this imply that the time the threads will be
waiting for the locked section to be released is:

Time spend inside the locked section * (1 + 2 +3 + 4 ... + n)

And 1+ 2 +3 +4 ... n is an arithmetic serie and is equal to:

(n * ( n + 1)) / 2

and

Limit (n -> infinite) of (n * (n + 1) /2) = infinite

So, this mathematical term diverge when n -> infinite

So, i think i can rewrite the scalability equation to something like this,

Amine''s equation is:

Speed = 1 / (S * ( n * ( n + 1)) / 2 ) + (P / N)) (2)

S: serial part
P: parallel part
n: number of threads
N: number of cores

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput as in
lock-free Ringbuffer and lock-free flqueue ...

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

aminer10
Beginner
26 Views

I correct:

I wrote:
> Time spend inside the locked section * (1 + 2 +3 + 4 ... + n)
>
> And 1+ 2 +3 +4 ... n is an arithmetic serie and is equal to:
>
> (n * ( n + 1)) / 2
>
> and
>
> Limit (n -> infinite) of (n * (n + 1) /2) = infinite
>
> So, this mathematical term diverge when n -> infinite
>
> So, i think i can rewrite the scalability equation to something like
> this,
>
> Amine''s equation is:
>
> Speed = 1 / (S * ( n * ( n + 1)) / 2 ) + (P / N)) (2)


If T is the time spend in the locked section,
and let us have a scenario of four threads, so
under heavy contention and in the worst case,
if the threads are all waiting for the locked-section
(locks,false sharing etc.) the first thread will not
wait, second thread will wait T and third thread
will wait 2 * T and fourth thread will ait 3 * T...

Hence, in my equation we have to replace n by n -1
so the Amahal equation, this will give us:


So, my scalability equation become:

Speed = 1 / (S * ( (n-1) * (n)) / 2 ) + (P / N))

and the Limit (n-> inite) of ((n-1) * (n)) / 2 = infinite

this term diverge..


S: serial part
P: parallel part
n: number of threads
N: number of cores

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput as in
lock-free Ringbuffer and lock-free flqueue ...

Regards,
Amine Moulay Ramdane.http://pages.videotron.com/aminer/


aminer10
Beginner
26 Views


Hello,


I have cleaned and corrected my previous post,
here it is againand tell me what do you think...


If you look at benchmarks of my lock-free ParallelQueue :

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

You will notice that under contention lock-free Ringbuffer have a
retrograde throuput and lockfree flqueue also have a retrograde
throuput in the pop() side ...but my lock-free ParallelQueue
(that is using also lock striping) has a good scalability

But if we draw the graph of Amadahl's law it doesn't show any
retrograde throuput...

So , i was thinking more about this, so please follow
with me..

In lock-free algorithms we reduce the duration for wich
locks are held to the miniumum, and we try to keep the
serial part in the Amdahl equation to the *MINIMUM*, this
is good for scalability.

And as we know, there are three ways to reduce lock contention:


1- Reduce the duration for which locks are held
2- Reduce the frequency with which locks are requested
or
3- Replace exclusive locks with coordination mechanisms that
permit greater concurrency.

and since in lock-free algorithms we are reducing the
duration for wich lock are held to the minimum, this
will reduce the chance of lock contention...

So, under contention and in the *worst* case, threads will
be waiting for the locked section to be released:

If T is the time spend in the locked section,
and let us have a scenario of four threads, so
under heavy contention and in the worst case,
if the threads are all waiting for the locked-section
(locks,false sharing etc.) the first thread will not
wait, second thread will wait T and third thread
will wait 2 * T and fourth thread will wait 3 * T...

And 1+ 2 +3 +4 ... n is an arithmetic serie and is equal to:

(n * ( n + 1)) / 2

and

Limit (n -> infinite) of (n * (n + 1) /2) = infinite

So, this mathematical term diverge when n -> infinite

We have to replace n by (n -1) since n threads will be waiting

T * (n -1)


So the my scalabilityequation will become:

Speed = 1 / (S * ( (n-1) * (n)) / 2 ) + (P / N))

and the Limit (n-> inite) of ((n-1) * (n)) / 2 = infinite

this term diverge..


S: serial part
P: parallel part
n: number of threads
N: number of cores

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput as in
lock-free Ringbuffer and lock-free flqueue ...

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/