Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

A multithreaded demo on quad-core / Core i7 (scaling on manycore processors)

fraggy
Beginner
Hello everybody,

I'm currently working on a multithreaded demo, and I usually test the load-balancing performance on a 2.4 GHz quad-core. My library shows interesting performance from 1 to 4 cores; I use a kind of data-parallel model to dispatch work across multiple cores. Yesterday we tried our demo on a Core i7, and the performance is not so great...

On my quad-core, going from 1 thread to 2 threads we can handle 1.6x more 3D objects on screen; from 1 thread to 3 threads, 2.1x more objects; and from 1 thread to 4 threads, 2.5x more objects (see the test here: http://www.acclib.com/2009/01/load-balancing-23012009.html).
The test is basically a fish tank into which I add spheres until the system runs out of processing power. Each sphere can interact with the fish tank (so it stays inside) and with the other spheres.

On my new Core i7, from 1 to 8 threads we can only handle 4x more objects (a detailed test is available here: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html).
What is going on? Can anybody explain this to me?

I know that the Core i7 is not an octo-core processor, but I expected a bit more...

Vincent. If you want more information on this project, feel free to read www.acclib.com
1 Solution
Dmitry_Vyukov
Valued Contributor I
Quoting - fraggy
In fact, I can handle 4x more spheres with 8 threads on a Core i7; see the videos: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html
I was expecting a bit more (5-6x more spheres), but as you know, hyperthreading is not as performant as a real core :p

Can anybody tell me whether this kind of performance is interesting for Intel? What do you people think about it?



I think your results are OK. You've gained an additional 60% by switching to the i7 (2.5x on a Core 2 Quad vs. 4x on an i7). For the Pentium 4's HT, Intel reported that the maximum a well-suited application can get from HT is some 30% more performance. I haven't yet heard a similar number for the i7, but I believe it must be very similar. So we can roughly say that you get an additional 30% from HT and another 30% from the other architectural improvements.

You want a 6x speedup on the i7; that would be an additional 140% from HT, which is just impossible.


Dmitry_Vyukov
Valued Contributor I
Quoting - fraggy
So I can say that contention on shared data structures (in library code) is reduced to a minimum :p


Well, the fact that it's reduced to the minimum doesn't yet mean that it's reduced enough for linear scaling :)
Sub-linear scaling can be caused by other things (like a limit on total memory throughput), but contention on shared data is the most popular cause, I think.

Dmitry_Vyukov
Valued Contributor I
Quoting - fraggy
So I may have a lot of computable "tasks" in a given zone. But since I have only 1 thread per zone, I can't do any optimization for HT (in the zone part of the architecture)

Nope, you can! For example, say thread 1 and thread 2 are working on one HT-capable core. Thread 1 processes zone 1, and thread 2 processes zone 2. You can schedule the work in the following way:

Thread 1 does all the integer processing on zone 1 while, at the same time, thread 2 does all the floating-point processing on zone 2 (I am assuming that we have user-specified hints as to which tasks are integer-bound and which are floating-point-bound).
Then the threads switch places: thread 1 does all the floating-point processing on zone 1 while thread 2 does all the integer processing on zone 2.

Note that it's not necessary to strictly synchronize the activity of the threads (for example, with a barrier), because this is only an optimization. You just have to "sort" the tasks in the zone based on the hint.
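A minimal C++ sketch of this idea (the Task type and the hint field are hypothetical illustrations, not ACClib's API): each zone sorts its task list by a user-supplied hint, and even/odd zones process their lists in opposite orders, so two HT siblings tend to run integer-heavy and floating-point-heavy work at the same time.

    #include <algorithm>
    #include <vector>

    enum class Hint { Integer, FloatingPoint };

    struct Task {
        Hint hint;
        void (*run)();
    };

    void processZone(std::vector<Task>& tasks, int zoneIndex) {
        // Even zones run integer tasks first; odd zones run FP tasks first,
        // so paired zones on one HT core overlap complementary work.
        const Hint first = (zoneIndex % 2 == 0) ? Hint::Integer
                                                : Hint::FloatingPoint;
        std::stable_partition(tasks.begin(), tasks.end(),
                              [first](const Task& t) { return t.hint == first; });
        for (const Task& t : tasks)
            t.run();  // no barrier: the ordering is only an optimization
    }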

fraggy
Beginner
Quoting - Dmitriy Vyukov
Well, the fact that it's reduced to the minimum doesn't yet mean that it's reduced enough for linear scaling :)
Sub-linear scaling can be caused by other things (like a limit on total memory throughput), but contention on shared data is the most popular cause, I think.

I've done everything humanly possible to make my algorithm linear, but I can only test on 4-core and 4-core-HT processors for now :p
If someone gives me the opportunity to test my library on a real manycore processor (like Larrabee), the performance gain will be linear!

Vincent, faith, faith, faith :p

PS: and if it is linear, then ACClib will probably be the only library capable of such performance (for 3D real-time applications)
fraggy
Beginner
Quoting - Dmitriy Vyukov
Nope, you can! For example, say thread 1 and thread 2 are working on one HT-capable core. Thread 1 processes zone 1, and thread 2 processes zone 2. You can schedule the work in the following way:

Thread 1 does all the integer processing on zone 1 while, at the same time, thread 2 does all the floating-point processing on zone 2 (I am assuming that we have user-specified hints as to which tasks are integer-bound and which are floating-point-bound).
Then the threads switch places: thread 1 does all the floating-point processing on zone 1 while thread 2 does all the integer processing on zone 2.

Note that it's not necessary to strictly synchronize the activity of the threads (for example, with a barrier), because this is only an optimization. You just have to "sort" the tasks in the zone based on the hint.
You are right.
Assuming I can sort tasks, I just have to check that it will not cost determinism in my architecture.

Vincent, still have to test with 12 zones/cores :p
jimdempseyatthecove
Honored Contributor III

Vincent,
You are not testing with 12 cores. You are testing with 4 cores and 12 sub-boxes (with runs using 1, 2, 3, and 4 threads).

1 core:
12 boxes

2 cores:
6 boxes
6 boxes

3 cores:
4 boxes
4 boxes
4 boxes

4 cores:
3 boxes
3 boxes
3 boxes
3 boxes

12 is the lowest number that is evenly divisible by 1, 2, 3, and 4. Thus there is no idle time (although you should expect some 4-core drop-off due to display updates and other system overhead).
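A trivial sketch (not part of the demo) showing why 12 sub-boxes divide evenly among 1 to 4 worker threads:

    #include <cstdio>

    int main() {
        const int boxes = 12;  // least common multiple of 1, 2, 3, 4
        for (int threads = 1; threads <= 4; ++threads)
            std::printf("%d thread(s) -> %d boxes each, %d left over\n",
                        threads, boxes / threads, boxes % threads);
        // 12 % threads == 0 for every count from 1 to 4, so no worker idles.
    }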

Jim
jimdempseyatthecove
Honored Contributor III

Additional hint to attain better performance

The box has 12 zones

|00|01|02|03|04|05|06|07|08|09|10|11|

Zones 00 and 01 share a wall
01 and 02 share a wall
02 and 03 share a wall
...

However, many combinations do not share a wall.

If you use affinity locking and can determine whether cores 0/1 share an L2 (or whether 0/2 do; the determination is dependent on the O/S), and you can force L2-sharing threads to work on adjacent sub-boxes at or near the same time, you might get better performance. Assuming 0/1 share an L2 (and an L1, by the way), the following scheduling may work best.

|00|01|02|03|04|05|06|07|08|09|10|11|
T0 T1 T2 T3 T3 T2 T1 T0 T0 T1 T2 T3
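A minimal sketch of such affinity locking on Linux (the zone lists follow the T0..T3 pattern above; the processing itself is a placeholder, and which cores actually share an L2 is an assumption the O/S must confirm):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    void pinToCore(std::thread& t, int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }

    int main() {
        // Sub-boxes each thread works on, in order (the schedule above).
        const std::vector<std::vector<int>> schedule = {
            {0, 7, 8}, {1, 6, 9}, {2, 5, 10}, {3, 4, 11}};
        std::vector<std::thread> workers;
        for (int core = 0; core < 4; ++core) {
            workers.emplace_back([core, &schedule] {
                for (int box : schedule[core]) {
                    (void)box;  // process sub-box 'box' here
                }
            });
            pinToCore(workers.back(), core);  // core->cache mapping is O/S-dependent
        }
        for (auto& w : workers) w.join();
    }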

Jim Dempsey
jimdempseyatthecove
Honored Contributor III

And a similar pattern with a reduced number of threads:

T0 T1 T2 T2 T1 T0...
T0 T1 T1 T0...
T0 T0...

Jim
fraggy
Beginner
Quoting - jimdempseyatthecove
Additional hint to attain better performance
The box has 12 zones
|00|01|02|03|04|05|06|07|08|09|10|11|
Zones 00 and 01 share a wall
01 and 02 share a wall
02 and 03 share a wall
...
However, many combinations do not share a wall.
If you use affinity locking and can determine whether cores 0/1 share an L2 (or whether 0/2 do; the determination is dependent on the O/S), and you can force L2-sharing threads to work on adjacent sub-boxes at or near the same time, you might get better performance. Assuming 0/1 share an L2 (and an L1, by the way), the following scheduling may work best.
|00|01|02|03|04|05|06|07|08|09|10|11|
T0 T1 T2 T3 T3 T2 T1 T0 T0 T1 T2 T3
I guess that by "wall" you mean the boundary between 2 zones, the limit where a sphere can cross from one zone to another :)
This will be difficult to explain, but even if 2 zones seem to be adjacent (with spheres flying around and traveling from one to the other), they have nothing in common (no shared data).
-> Except for a pointer to a messagebox class (I use a message-passing algorithm to synchronize zones)


fraggy
Beginner
Quoting - jimdempseyatthecove
And a similar pattern with a reduced number of threads:
T0 T1 T2 T2 T1 T0...
T0 T1 T1 T0...
T0 T0...

Why is it so difficult to force thread assignment... But not impossible :p
Usually, the first available thread is used to compute the next most urgent zone in line (threads are slaves, they never rest :p). I think that waiting for the right thread to compute the right zone may be a little risky (for performance).
And if I assign the first available thread to another zone (not the most urgent), this may work (no waiting time), but since my most urgent zone stays the most urgent zone to be done, one of the main algorithms is stuck, and synchronization may be delayed until the right thread becomes available and is assigned to the (still) most urgent zone :)
Synchronization is a key feature; it is why this library may scale on manycore. Forcing threads to specific zones may be dangerous.

Vincent, and as you say, it's OS-dependent, so it's evil :p
jimdempseyatthecove
Honored Contributor III
Quoting - fraggy

Why is it so difficult to force thread assignment... But not impossible :p
Usually, the first available thread is used to compute the next most urgent zone in line (threads are slaves, they never rest :p). I think that waiting for the right thread to compute the right zone may be a little risky (for performance).
And if I assign the first available thread to another zone (not the most urgent), this may work (no waiting time), but since my most urgent zone stays the most urgent zone to be done, one of the main algorithms is stuck, and synchronization may be delayed until the right thread becomes available and is assigned to the (still) most urgent zone :)
Synchronization is a key feature; it is why this library may scale on manycore. Forcing threads to specific zones may be dangerous.

Vincent, and as you say, it's OS-dependent, so it's evil :p

Vincent,

The threads do not wait.

|00|01|02|03|04|05|06|07|08|09|10|11|
T0 T1 T2 T3 T3 T2 T1 T0 T0 T1 T2 T3

You set up a pecking order:

T0's preference is 00, 07, 08, 09, 06, 01, 10, 05, 02, 11, 04, 03
T1's preference is 01, 06, 09, 10, 08, 07, 05, 02, 00, 11, 04, 03
...
In other words:
The first pick in each thread's sequence should sequence together with the other threads' first picks.
The second picks follow in reverse order over the adjacent cell(s) at the interaction level.
Skip any sub-box where computation has already begun.

The sequencing is designed to give a higher probability of "hot in cache" data, which will (may) give you super-linearity.
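A minimal sketch of such a pecking order (C++20; only T0's and T1's preference lists are taken from the post, and the claim flags are an illustrative assumption): each thread claims the first not-yet-started sub-box in its own preference list, so no thread ever waits on a particular box.

    #include <array>
    #include <atomic>

    constexpr int kBoxes = 12;
    std::array<std::atomic_flag, kBoxes> gStarted;  // default-initialized clear (C++20)

    const int kPreference[2][kBoxes] = {
        {0, 7, 8, 9, 6, 1, 10, 5, 2, 11, 4, 3},   // T0, as given above
        {1, 6, 9, 10, 8, 7, 5, 2, 0, 11, 4, 3}};  // T1, as given above

    int claimNextBox(int thread) {
        for (int box : kPreference[thread])
            if (!gStarted[box].test_and_set())  // first thread to set the flag owns it
                return box;                     // skips boxes where computation began
        return -1;  // nothing left this pass
    }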

Jim Dempsey
fraggy
Beginner
Quoting - jimdempseyatthecove
The sequencing is designed to give a higher probability of "hot in cache" data, which will (may) give you super-linearity.
About the "hot in cache" part, 2 zones never share data, even if they are next to each other. So I will probably never experience performance gain with that kind of optimization :)

But, when I test with nbThreads == nbzones, each Thread always work on the same Zone. I never experience cache problem like that :)

Vincent, Still have to test 12 zones (for cache and linearity problems)


jimdempseyatthecove
Honored Contributor III
Quoting - fraggy
About the "hot in cache" part, 2 zones never share data, even if they are next to each other. So I will probably never experience performance gain with that kind of optimization :)

But, when I test with nbThreads == nbzones, each Thread always work on the same Zone. I never experience cache problem like that :)

Vincent, Still have to test 12 zones (for cache and linearity problems)



From watching your video, my understanding is the following.

The particles in the large container have a radius (I think they are all the same, but a real model would accommodate arbitrary radii).

The particles interact with the walls and with each other.

The particles in a partitioned large container should behave the same (or within rounding error) as in the un-partitioned container.

Therefore, particles at or farther than 1r from every wall are within the "inside the box" domain.

Particles less than 1r from one or more walls exist in one to six perimeter domains.

Particles inside the "inside the box" domain can interact with other particles in that domain, as well as with particles within the nearest perimeters visible inside the box.

Particles within a perimeter inside one box can interact with particles inside the adjacent perimeter(s) of adjacent box(es).

Particles can flow from one domain to the other (or bounce as the case may be).

The particles inside the "inside the box" domain can be computed independently of the other threads (no interlocks).

The particles inside the perimeters will interact, and thus the threads may interact. Perimeter locking would be better than InterlockedAdds.

Your model may not be doing the above. But if you want reliable physics, you should be doing something like the above.
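A minimal sketch of that interior/perimeter split, assuming axis-aligned box-shaped zones (the types are illustrative, not the demo's):

    struct Vec3 { double x, y, z; };
    struct Zone { Vec3 lo, hi; };        // axis-aligned bounding box
    struct Sphere { Vec3 c; double r; };

    // True if the sphere is farther than 1r from every wall of the zone:
    // it can then be integrated with no inter-thread interaction.
    bool isInterior(const Sphere& s, const Zone& z) {
        return s.c.x - z.lo.x > s.r && z.hi.x - s.c.x > s.r &&
               s.c.y - z.lo.y > s.r && z.hi.y - s.c.y > s.r &&
               s.c.z - z.lo.z > s.r && z.hi.z - s.c.z > s.r;
    }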

Jim Dempsey
robert-reed
Valued Contributor II
Quoting - fraggy

About the "hot in cache" part, 2 zones never share data, even if they are next to each other. So I will probably never experience performance gain with that kind of optimization :)


But, when I test with nbThreads == nbzones, each Thread always work on the same Zone. I never experience cache problem like that :)

Vincent, Still have to test 12 zones (for cache and linearity problems)

There's still something about this partitioned-zone scheme that I don't understand. How do you handle load balance? All the examples I've seen so far seem to assume the same amount of work in each zone, but normal scenes vary in complexity over the viewing frustum. The zone scheme minimizes contention (assuming the mailbox scheme doesn't cost too much, which also depends on the underlying scene complexity), but it provides no means to adapt to scenes of varying complexity.

Tan-Killer
Beginner


oh good !!
fraggy
Beginner
Quoting - jimdempseyatthecove
Particles within a perimeter inside one box can interact with particles inside the adjacent perimeter(s) of adjacent box(es).

Particles can flow from one domain to the other (or bounce as the case may be).

The particles inside the "inside the box" domain can be computed independently of the other threads (no interlocks).

The particles inside the perimeters will interact, and thus the threads may interact. Perimeter locking would be better than InterlockedAdds.

Your model may not be doing the above. But if you want reliable physics, you should be doing something like the above.

Spheres can have any radius in my model. For instance, you can see blue spheres in my test video; those spheres are 5 times the size of the little ones, and you can have spheres even bigger than a zone (not in the video).
The partitioning you describe may not work when you mix spheres of all radii, from "small" to "bigger than a zone".

But the partitioning used in this architecture still ensures reliable physics and ... no shared data :)
If a sphere exists in multiple zones, one of the zones is the owner (it depends on the center of the sphere) and the other zones get copies (through the asynchronous "message passing"-style messagebox class). Those spheres are linked (not shared...).

Each zone computes and stores results in parallel, and when a "linked" sphere is involved in a collision, a specific algorithm is used to safely synchronize data between zones.
The algorithm uses the messageBox, so this message box is the only shared data.
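A minimal sketch of what such a messagebox could look like (the message type and method names are assumptions; ACClib's actual messageBox class is not shown here). Zones never touch each other's sphere data directly; they only post and poll messages, so this queue is the sole shared structure:

    #include <mutex>
    #include <queue>

    struct SphereUpdate {
        int sphereId;
        double x, y, z;  // new master-state position for the linked copies
    };

    class MessageBox {
        std::mutex m_;
        std::queue<SphereUpdate> q_;
    public:
        void post(const SphereUpdate& msg) {  // called by the owning zone
            std::lock_guard<std::mutex> lock(m_);
            q_.push(msg);
        }
        bool poll(SphereUpdate& out) {        // drained by a linked zone when available
            std::lock_guard<std::mutex> lock(m_);
            if (q_.empty()) return false;
            out = q_.front();
            q_.pop();
            return true;
        }
    };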

Vincent
fraggy
Beginner

Quoting - robert-reed
There's still something about this partitioned-zone scheme that I don't understand. How do you handle load balance? All the examples I've seen so far seem to assume the same amount of work in each zone, but normal scenes vary in complexity over the viewing frustum. The zone scheme minimizes contention (assuming the mailbox scheme doesn't cost too much, which also depends on the underlying scene complexity), but it provides no means to adapt to scenes of varying complexity.

Zones may be created, deleted, moved, and reshaped dynamically, and you can have plenty of them at the same time.
There is one thing that limits the use of multiple zones: each time you add a zone, you add an "interzone" boundary, a place where objects travel from one zone to another. The fewer interzones you have, the better.
When you place your zones in a building, for instance, always try to use existing limits (walls, furniture...) as your boundaries, so objects don't travel too easily from one zone to another...

The video shows a worst-case scenario: the test application has no walls, and every object can travel from one zone to another. The point was to prove that performance can be interesting even in that case...
You can extend the use of ACCLib in a (4/8-core) video game quite easily, and with a dozen (or more if needed) cleverly distributed zones, you may have a very load-balanced system :) As I said before, you can also add zones whenever you need them: if all the "action" happens inside a specific room, you can add zones in the room and balance the computing load.

Vincent


fraggy
Beginner
Quoting - Tan-Killer
oh good !!

Welcome to the discussion !!!
:p

jimdempseyatthecove
Honored Contributor III
Quoting - fraggy
If a sphere exists in multiple zones, one of the zones is the owner (it depends on the center of the sphere) and the other zones get copies (through the asynchronous "message passing"-style messagebox class). Those spheres are linked (not shared...).

Each zone computes and stores results in parallel, and when a "linked" sphere is involved in a collision, a specific algorithm is used to safely synchronize data between zones.
The algorithm uses the messageBox, so this message box is the only shared data.

So each sphere has one master state set of variables (describing the center) and up to 8 sets of encroachment-zone contributions. These can be either delta-velocity or delta-momentum contributors. All next steps can be calculated concurrently. For those linked through a shared zone, there is an additional accumulation pass to derive the new master state. This can be done in parallel by the thread owning the prior center of the sphere.

This will reduce the number of interlocked operations.
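A minimal sketch of this suggestion (the layout is hypothetical, not ACClib's data structures): each encroaching zone writes into its own private delta slot during the step, and the owning thread sums the slots afterwards, so the step itself needs no interlocked operations.

    struct Vec3 { double x, y, z; };

    struct SharedSphere {
        Vec3 velocity;        // master state, owned by one zone
        Vec3 delta[8];        // one private slot per encroaching zone
        int  encroachedZones; // slots in use this step
    };

    // Run by the owner thread after all zones finish the step.
    void accumulate(SharedSphere& s) {
        for (int i = 0; i < s.encroachedZones; ++i) {
            s.velocity.x += s.delta[i].x;
            s.velocity.y += s.delta[i].y;
            s.velocity.z += s.delta[i].z;
            s.delta[i] = {0.0, 0.0, 0.0};  // reset for the next step
        }
    }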

Have you run the 12-partition test on your 4-core system? I am interested to see what the curve looks like going from 3 to 4 cores. I anticipate you will reclaim some performance.

Jim Dempsey
fraggy
Beginner
Quoting - jimdempseyatthecove
So each sphere has one master state set of variables (describing the center) and up to 8 sets of encroachment-zone contributions. These can be either delta-velocity or delta-momentum contributors. All next steps can be calculated concurrently. For those linked through a shared zone, there is an additional accumulation pass to derive the new master state. This can be done in parallel by the thread owning the prior center of the sphere.

This will reduce the number of interlocked operations.
Not sure I understand everything you've just said. If there is a sphere somewhere that can interact in multiple zones, that sphere exists in each zone as a linked copy. One of them is referred to as the master, because its center is in that zone; the others are slaves. If one of them interacts (with anything), the others will get the "message" when available, and the synchronization algorithm begins.
There are no interlocked operations at all.
Except when a message passes from one zone to another.

fraggy
Beginner
Quoting - jimdempseyatthecove
Have you run the 12-partition test on your 4-core system? I am interested to see what the curve looks like going from 3 to 4 cores. I anticipate you will reclaim some performance.

Unfortunately, using 12 zones in this particular test application is not a good idea. For performance scaling I would get a very nice chart; the problem is raw performance...
A high number of zones means a high number of interzones, which means heavy use of my synchronization algorithm.

In short, tests on 12 zones may produce better-looking charts but poor performance (about 3K spheres with 4 threads instead of 5-6K spheres).

Vincent, I'm still stuck with my 1 thread -> 1 zone test conditions...
jimdempseyatthecove
Honored Contributor III

Please run the test with 12 zones; you may be pleasantly surprised. The purpose of the additional zones is to accommodate background activity on the system. Your application will perform more computations; however, I anticipate you will also have more working threads for a longer duration. Your system has a non-zero amount of work to perform while the application is running. Therefore, if this time is X and your zone run time is Y, your 4-core system is unproductive for 3(Y-X). By doubling the number of zones, your system is unproductive for 3(Y/2-X). (You do increase the overhead of the additional zones, so the base Y is slightly different.)
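(For concreteness, plugging illustrative numbers into these expressions, not measured values: with a zone run time of Y = 8 ms and background work of X = 2 ms, 4 zones leave 3(Y - X) = 3(8 - 2) = 18 ms of unproductive core time per pass, while doubling to 8 zones leaves 3(Y/2 - X) = 3(4 - 2) = 6 ms.)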

Jim Dempsey