I'm currently working on a multithreaded demo, and I usually test the load-balancing performance on a 2.4 GHz quad-core. My library shows interesting performance from 1 to 4 cores; I use a kind of data-parallel model to dispatch work onto multiple cores. Since yesterday we have been trying our demo on a Core i7, and the performance is not so great...
On my quad-core, going from 1 thread to 2 threads we can deal with 1.6x more 3D objects on screen; from 1 thread to 3 threads, 2.1x more objects; and from 1 thread to 4 threads, 2.5x more objects (see the test here: http://www.acclib.com/2009/01/load-balancing-23012009.html).
The test is basically a fish tank in which I add spheres until the system runs out of processing power. Each sphere can interact with the fish tank (so they stay inside) and with the other spheres.
On my new Core i7, from 1 to 8 threads we can only deal with 4x more objects (the detailed test is available here: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html).
What is going on? Can anybody explain this to me?
I know that the Core i7 is not an octo-core processor, but I expected a bit more...
Vincent. If you want more information on this project, feel free to read www.acclib.com
I was expecting a bit more (5-6x more spheres), but as you know, hyper-threading is not as performant as real cores :p
Can anybody tell me whether this kind of performance is interesting for Intel? What do you people think about that?
I think your results are OK. You've gained an additional 60% by switching to the i7 (2.5x on the Core 2 Quad vs. 4x on the i7). For the Pentium 4's HT, Intel reported that the maximum a suitable application could get from HT was around 30%. I haven't heard a similar number for the i7 yet, but I believe it must be very similar. So we can roughly say that you get an additional 30% from HT and another 30% from other architectural improvements.
You want a 6x speedup on the i7; that would be an additional 140% from HT, which is just impossible.
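The arithmetic behind these percentages can be checked directly. A quick sketch (the variable names are mine, and the speedup figures are the ones reported in the tests above):

```python
# Rough check of the speedup arithmetic above (illustrative only).

quad_speedup = 2.5   # 1 -> 4 threads on the Core 2 Quad (from the test)
i7_speedup = 4.0     # 1 -> 8 threads on the Core i7 (from the test)

# Relative gain of the i7 over the Core 2 Quad at full thread count:
gain = i7_speedup / quad_speedup - 1.0
print(f"i7 gain over Core 2 Quad: {gain:.0%}")   # 60%

# A 6x speedup on 4 physical cores would require HT alone to add:
target = 6.0
ht_required = target / quad_speedup - 1.0
print(f"HT contribution needed for 6x: {ht_required:.0%}")   # 140%
```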
Well, the fact that it's reduced to the minimum doesn't yet mean that it's enough for linear scaling :)
Sub-linear scaling can be caused by other reasons (like a limit on total memory throughput), but contention on shared data is the most common one, I think.
Nope, you can! For example, threads 1 and 2 are running on one HT-capable core. Thread 1 processes zone 1, and thread 2 processes zone 2. You can schedule the work in the following way:
Thread 1 does all the integer processing on zone 1 while, at the same time, thread 2 does all the floating-point processing on zone 2 (I am assuming we have user-specified hints as to what is integer and what is floating-point).
Then the threads switch places: thread 1 does all the floating-point processing on zone 1 while, at the same time, thread 2 does all the integer processing on zone 2.
Note that it's not necessary to strictly synchronize the threads (for example with a barrier), because this is only an optimization. You just have to "sort" the tasks in each zone based on the hint.
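The alternation described above can be sketched roughly as follows. This is a minimal illustration, not any real library's API: each zone's tasks are pre-sorted by a user-supplied integer/floating-point hint, and the two threads walk the phases in opposite orders, with no barrier between phases.

```python
# Sketch: two threads on one HT core, opposite int/fp phase orders, so while
# thread 1 runs integer work on zone 1, thread 2 runs floating-point work on
# zone 2, and then they swap. All names here are illustrative.
from threading import Thread

def process(zone, tasks_by_kind, schedule):
    results = []
    for kind in schedule:                 # e.g. ("int", "fp") or ("fp", "int")
        for task in tasks_by_kind[kind]:  # tasks were sorted by hint up front
            results.append((zone, kind, task()))
    return results

zone1 = {"int": [lambda: 1 + 2], "fp": [lambda: 0.5 * 3.0]}
zone2 = {"int": [lambda: 4 * 5], "fp": [lambda: 2.0 ** 0.5]}

out1, out2 = [], []
t1 = Thread(target=lambda: out1.extend(process(1, zone1, ("int", "fp"))))
t2 = Thread(target=lambda: out2.extend(process(2, zone2, ("fp", "int"))))
t1.start(); t2.start(); t1.join(); t2.join()
```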
Vincent,
You are not testing with 12 cores. You are testing with 4 cores and 12 sub-boxes (with runs using 1, 2, 3 and 4 threads).
1 core: 12 boxes
2 cores: 6 + 6 boxes
3 cores: 4 + 4 + 4 boxes
4 cores: 3 + 3 + 3 + 3 boxes
12 is the lowest number that is evenly divisible by 1, 2, 3 and 4, thus no idle time (although you should expect some drop-off at 4 cores due to display updates and other system overhead).
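The divisibility argument is easy to verify; a quick sketch (names are mine):

```python
# Why 12 sub-boxes: it is the smallest count that divides evenly among
# 1, 2, 3 and 4 threads, so no thread is left idle with a partial share.
from math import lcm  # Python 3.9+

boxes = lcm(1, 2, 3, 4)
print(boxes)  # 12

for threads in (1, 2, 3, 4):
    per_thread = boxes // threads
    assert per_thread * threads == boxes   # no remainder, no idle time
    print(f"{threads} thread(s): {per_thread} boxes each")
```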
Jim
An additional hint to attain better performance:
The box has 12 zones
|00|01|02|03|04|05|06|07|08|09|10|11|
Zones 00 and 01 share a wall
01 and 02 share a wall
02 and 03 share a wall
...
However many combinations do not share a wall.
If you use affinity locking and can determine whether cores 0/1 share an L2 (it may be 0/2 instead; the determination depends on the O/S), and you can then force L2-sharing threads to work on adjacent sub-boxes at or near the same time, you might get better performance. Assuming 0/1 share an L2 (and an L1, by the way), the following scheduling may work best.
|00|01|02|03|04|05|06|07|08|09|10|11|
T0 T1 T2 T3 T3 T2 T1 T0 T0 T1 T2 T3
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
And a similar pattern with a reduced number of threads:
T0 T1 T2 T2 T1 T0...
T0 T1 T1 T0...
T0 T0...
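The zigzag assignments above (for 4, 3, 2 and 1 threads) can be generated mechanically. A small sketch (the function name is mine, not from any library):

```python
# Generate the zigzag ("boustrophedon") thread assignment shown above:
# walk the zones left to right, reversing the thread order in every other
# group so that cache-sharing threads land on adjacent sub-boxes.

def zigzag(zones, threads):
    out = []
    for i in range(zones):
        group, pos = divmod(i, threads)
        out.append(pos if group % 2 == 0 else threads - 1 - pos)
    return out

print(zigzag(12, 4))  # [0,1,2,3, 3,2,1,0, 0,1,2,3]
print(zigzag(6, 3))   # [0,1,2, 2,1,0]
print(zigzag(4, 2))   # [0,1, 1,0]
```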
Jim
Synchronisation is a key feature; this is why this library may scale on many-core. Forcing threads to specific zones may be dangerous.
Vincent,
The threads do not wait
|00|01|02|03|04|05|06|07|08|09|10|11|
T0 T1 T2 T3 T3 T2 T1 T0 T0 T1 T2 T3
You set up a pecking order
T0's preference is 00, 07, 08, 09, 06, 01, 10, 05, 02, 11, 04, 03
T1's preference is 01, 06, 09, 10, 08, 07, 05, 02, 00, 11, 04, 03
...
IOW:
The first picks in the threads' sequences should all run together.
The second picks run in reverse order over the adjacent cell(s) at the interaction level.
Skip any sub-box where computation has already begun.
The sequencing is designed to give a higher probability of data being "hot in cache", which will (or may) give you super-linearity.
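The "pecking order" scheme can be sketched like this. The preference orders are the ones listed above; the claiming mechanism (a lock-protected set) is an illustrative stand-in for whatever the real library would use:

```python
# Each thread walks its own preference list and claims the first sub-box
# where computation has not yet begun. Threads never wait on each other;
# the order only biases which box they pick up next.
from threading import Lock

claimed = set()
lock = Lock()

def next_box(preferences):
    """Claim and return the first unclaimed box in this thread's order."""
    with lock:
        for box in preferences:
            if box not in claimed:
                claimed.add(box)
                return box
    return None  # all work taken

t0_pref = [0, 7, 8, 9, 6, 1, 10, 5, 2, 11, 4, 3]
t1_pref = [1, 6, 9, 10, 8, 7, 5, 2, 0, 11, 4, 3]

first_t0 = next_box(t0_pref)   # thread 0's first pick
first_t1 = next_box(t1_pref)   # thread 1's first pick
```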
Jim Dempsey
From my understanding after watching your video:
The particles in the large container have a radius (I think they are all the same, but a real model would accommodate arbitrary radii).
The particles interact with the walls and with each other.
The particles in a partitioned large container should behave the same (or within rounding error) as in the un-partitioned container.
Therefore, particles at a distance of 1r or more from every wall are within the "inside the box" domain.
Particles less than 1r from one or more walls exist in one to six perimeter domains.
Particles in the "inside the box" domain can interact with other particles in that domain, as well as with particles within the nearest perimeters visible from inside the box.
Particles within a perimeter of one box can interact with particles inside the adjacent perimeter(s) of adjacent box(es).
Particles can flow from one domain to the other (or bounce, as the case may be).
The particles inside the "inside the box" domain can be computed independently of the other threads (no interlocks).
The particles inside the perimeters will interact, and thus the threads may interact. Perimeter locking would be better than InterlockedAdds.
Your model may not be doing the above, but if you want reliable physics you should be doing something like it.
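The interior/perimeter classification above can be sketched in one dimension. This is illustrative only (a real engine does this in 3-D, giving up to six perimeter domains per box); the function and label names are mine:

```python
# For a 1-D slice of one sub-box: particles at least one radius r from each
# wall are "inside the box" and need no locking; the rest fall in perimeter
# domains shared with neighbouring sub-boxes.

def classify(x, box_lo, box_hi, r):
    """Return 'interior', 'lo_perimeter' or 'hi_perimeter' for position x."""
    if x < box_lo + r:
        return "lo_perimeter"   # may interact with the box below -> needs lock
    if x > box_hi - r:
        return "hi_perimeter"   # may interact with the box above -> needs lock
    return "interior"           # fully owned by this thread, no interlocks

r = 1.0
labels = [classify(x, 0.0, 10.0, r) for x in (0.5, 5.0, 9.8)]
```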
Jim Dempsey
About the "hot in cache" part: two zones never share data, even if they are next to each other, so I will probably never see a performance gain from that kind of optimization :)
There's still something about this partitioned zone scheme that I don't understand. How do you handle load balance? All the examples I've seen so far seem to assume the same amount of work in each zone, but normal scenes vary in complexity over the viewing frustum. The zone scheme minimizes contention (assuming the mailbox scheme doesn't cost too much, something also dependent on the underlying scene complexity) but provides no means to adapt to scenes of varying complexity.
Oh, good!
There is one thing that limits the use of multiple zones: each time you add a zone, you add an inter-zone boundary, a place where objects travel from one zone to another. The fewer inter-zone boundaries you have, the better.
When you place your zones in a building, for instance, always try to use existing limits (walls, furniture...) as your boundaries, so objects don't travel too easily from one zone to another...
So then each sphere has one master state (describing its center) and up to 8 sets of encroachment-zone contributions. These can be either delta-velocity or delta-momentum contributors. All next steps can be calculated concurrently. For spheres linked through a shared zone, there is an additional accumulation pass to derive the new master state; this can be done in parallel by the thread owning the prior center of the sphere.
This will reduce the number of interlocked operations.
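A minimal sketch of that accumulation scheme, with made-up field names: each thread writes only its own zone's delta slot during the step, and the sphere's owning thread sums the deltas in one pass afterwards, so no interlocked adds are needed.

```python
# One master state per sphere plus per-zone delta-velocity contributions.
# During a step, the thread owning zone k writes only slot k; the owner of
# the sphere then folds all deltas into the master state in a single pass.
from dataclasses import dataclass, field

@dataclass
class Sphere:
    velocity: float = 0.0
    # One slot per encroaching zone (up to 8); each written by one thread.
    deltas: dict = field(default_factory=dict)

    def add_delta(self, zone, dv):
        self.deltas[zone] = self.deltas.get(zone, 0.0) + dv

    def accumulate(self):
        """Run by the owning thread after all zone passes complete."""
        self.velocity += sum(self.deltas.values())
        self.deltas.clear()

s = Sphere(velocity=1.0)
s.add_delta(zone=3, dv=0.25)    # written by the thread owning zone 3
s.add_delta(zone=4, dv=-0.10)   # written by the thread owning zone 4
s.accumulate()
```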
Have you run the 12-partition test on your 4-core system? I am interested to see what the curve looks like going from 3 to 4 cores. I anticipate you will reclaim some performance.
Jim Dempsey
Please run the test with 12 zones; you may be pleasantly surprised. The purpose of the additional zones is to accommodate background activity on the system. Your application will perform more computations; however, I anticipate you will also have more working threads for a longer duration. Your system has a non-zero amount of work to perform while the application is running. Therefore, if this time is X and your zone run time is Y, your 4-core system is unproductive for 3(Y - X). By doubling the number of zones, your system is unproductive for only 3(Y/2 - X), although the extra zones add overhead, so the base Y is slightly different.
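The 3(Y - X) arithmetic can be checked with made-up example values (X and Y here are placeholders, not measured figures):

```python
# With background work X and zone run time Y, three of the four cores sit
# idle for (Y - X) each while one core absorbs the background load; halving
# the zone run time (doubling the zone count) shrinks that idle time.

def unproductive(Y, X, cores=4):
    return (cores - 1) * (Y - X)

X = 1.0           # background/system work (example value)
Y = 8.0           # run time of one zone batch (example value)
print(unproductive(Y, X))        # 3*(Y - X)
print(unproductive(Y / 2, X))    # 3*(Y/2 - X), smaller
```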
Jim Dempsey