Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

A multithreaded demo on quadcore / corei7 (scaling manycore processor)

fraggy
Beginner
5,473 Views
Hello everybody,

I'm currently working on a multithreaded demo, and I usually test load-balancing performance on a quad-core 2.4 GHz machine. My library shows interesting scaling from 1 to 4 cores; I use a kind of data-parallel model to dispatch work across multiple cores. Yesterday we tried our demo on a Core i7, and the performance is not so great...

On my quad-core, going from 1 thread to 2 threads we can handle 1.6x more 3D objects on screen; from 1 to 3 threads, 2.1x more; and from 1 to 4 threads, 2.5x more (see the test here: http://www.acclib.com/2009/01/load-balancing-23012009.html).
The test is basically a fish tank into which I add spheres until the system runs out of processing power. Each sphere interacts with the fish tank (so it stays inside) and with the other spheres.

On my new Core i7, from 1 to 8 threads we can only handle 4x more objects (a detailed test is available here: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html).
What is going on? Can anybody explain this to me?

I know that the Core i7 is not an octo-core processor, but I expected a bit more...

Vincent. If you want more information on this project, feel free to visit www.acclib.com.
0 Kudos
1 Solution
Dmitry_Vyukov
Valued Contributor I
5,415 Views
Quoting - fraggy
In fact, I can handle 4x more spheres with 8 threads on a Core i7; see the videos: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html
I expected a bit more (5-6x more spheres), but as you know, hyperthreading is not as performant as real cores :p

Can anybody tell me whether this kind of performance is interesting for Intel? What do you people think about that?



I think your results are OK. You've gained an additional 60% by switching to the i7 (2.5x on the Core 2 Quad vs. 4x on the i7). For the Pentium 4's HT, Intel reported that the maximum a well-suited application could get from HT was some 30%. I haven't heard a similar number for the i7 yet, but I believe it must be very similar. So we can roughly say that you get an additional 30% from HT and another 30% from the other architectural improvements.

You want a 6x speedup on the i7; that would be an additional 140% from HT, which is just impossible.



0 Kudos
69 Replies
fraggy
Beginner
1,402 Views
As a side issue, I am not entirely sure that the metrics used for this demo have been normalized. For example, when going from one domain to two domains (the sub-boxes in the video), I would assume there is a little more work happening at the interface between sub-boxes than at an interface to open space (a particle can only bounce off an open wall, but it can penetrate through a domain interface and/or interact with particles in proximity to the interface of the adjacent box). So the scaling from 1 domain to many is roughly T1*nCells + CellToCellInterfaceOverhead * nCellToCellInterfaces. As currently implemented, nCells == nThreads.

You're completely right :)
If I test with additional walls (objects are stuck in their zone, no trespassing, no additional work), I get decent performance: x1.8 (2 threads), x2.56 (3 threads) and x3.03 (4 threads). The metrics were normalized only in that case (with additional walls). See this detailed test: http://www.acclib.com/2009/01/load-balancing-23012009.html

But if I test without walls (as on the Core i7), performance is lower: x1.6 (2 threads), x2.11 (3 threads) and x2.52 (4 threads). Going from 1 to 2 zones, I double the volume, the spheres and the collisions, but I also add work that doesn't exist with 1 thread: spheres that travel from one zone to another. The same problem applies from 1 thread to 3 threads, and so on.
The right metric would be counting the number of checked pairs and the number of collisions that occur, but in my software architecture that would be tricky to do (don't ask why...), and people are usually more interested in the number of spheres than in pairs and collisions :p

I know that using pairs and collisions would "improve" my numbers; just keep in mind that this demo shows the worst-case scenario: a bunch of spheres running everywhere and an awful number of spheres traveling from zone to zone. It's the lowest performance you can get with my library, which is not so bad :p

Vincent
0 Kudos
fraggy
Beginner
1,402 Views
Quoting - Dmitriy Vyukov
If you are developing a general-purpose library, then your users will probably have different kinds of tasks. I surmise that games contain some integer-intensive parts (probably AI, world-model manipulation, I really don't know) and some floating-point-intensive parts (physics). So a user might write something like:

your_lib::task* t1 = your_lib::spawn(calculate_AI, your_lib::hint_integer_intensive);
your_lib::task* t2 = your_lib::spawn(calculate_physics, your_lib::hint_floating_point_intensive);

And you will try to schedule these tasks on HT sibling threads.

In my software architecture, threads are scheduled on particular sets of data, not on source code. Even if you can hint the source code (hint_integer_intensive or hint_floating_point_intensive), it is not executed by threads directly: source code is used by zones, and zones are computed by threads. I can assign a particular thread to a particular zone, but I can't assign a particular thread to particular source code.
And there is no way to have a zone with only integer-intensive or only FP-intensive hints.

One solution is to assign 2 threads to 1 zone, 2 sibling threads. But my zones are not meant to be multithreaded, and everyone knows that the less thread-related source code you introduce, the better :p

Another solution would be to check every zone, before each iteration, for its integer/FP distribution. For instance, if I find 2 zones with >90% integer computation (thanks to the hint mechanism), I just have to schedule non-sibling threads for those zones.
On a processor with 8 HT threads per core, this optimization would be a must-have.

Vincent
0 Kudos
fraggy
Beginner
1,402 Views
Quoting - alpmestan
But if he specifies an integer-computation thread and then puts some floats in it, we'll have to introduce type-checking. So what would be a good solution to that problem for you?

I think we will trust the user :p
Or we could even do the type checking automatically and hint the source code ourselves.

0 Kudos
fraggy
Beginner
1,402 Views
Dmitriy suggested that your library interface, at the point where thread scheduling is directed, contain an (optional) hint parameter that can be used for thread-scheduling purposes. I agree with Dmitriy that you should consider adding this to your code.

Reason 1: HT is now coming back into vogue; the quad-core (8-thread) Core i7 has it now, and others will follow soon.

Reason 2: Some functionality may be better suited to run on the GPU. When the platform has a compatible GPU, the hint can be used to select the code path (you or your users will have to write the two program paths).

Reason 3: Larrabee is coming down the road. This kind of product will blur the distinction between CPU and GPU. I anticipate that you (or your library users) will find that some code unsuitable for export to the GPU is suitable for export to a Larrabee device. Although I expect initial Larrabee products to be GPU-like cards with video out occupying a PCI Express slot, later versions will likely be integrated into the motherboard, leaving the PCI Express slot available for a GPU (more likely a watered-down GPU).

Jim Dempsey

I agree with both of you.
On a processor with 2 HT threads per core it seems of limited use, but I plan to do something about it sooner or later. Things may be tricky to implement, though (see my previous posts).

Reason 1: OK.
Reason 2: This library will not work on a GPU/CPU architecture easily. I'm threading 3D geographical zones; I can't ask the GPU to compute some of them and the CPU the rest.
Reason 3: Larrabee is coming down the road.
Yes, I expect that too. Even more, I expect that Larrabee may one day be the only processor on the motherboard. My library would work just fine :p
0 Kudos
fraggy
Beginner
1,402 Views
Quoting - Dmitriy Vyukov

OK, here is a very quick test you can do to get some insight into the effectiveness of HT usage.
Just pin all worker threads so that only one hardware thread is used on every HT core. On Windows you can do this simply by inserting SetProcessAffinityMask(GetCurrentProcess(), 1 | 4 | 16 | 64) in main(). On Linux you must use pthread_setaffinity_np() or sched_setaffinity(). Then compare the results with the previous ones; this way you measure the speedup you are getting from HT.
If you do this, please post the numbers here.



As you can see, when I test with 4 threads (and 4 zones), Windows dispatches (without any help) 1 thread per hardware core:
- x1.80 (2 threads)
- x2.44 (3 threads)
- x2.78 (4 threads)

I get (almost) linear scaling from 1 to 4 threads, and a different (but still linear) slope from 4 to 8 (see the black line in the attached graphs).
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,402 Views

Fraggy,

I suspect that the drop-off for the 4th thread is due to the code that drives the display. You can quickly test this by having the display update only once every 100 display cycles. Although your end product won't do this, it will be good for testing the linearity of the scaling of the processing portion of the application. Later on you may want to insert some RDTSC control logic to update the display at a preferred frame rate. The simulator code I run includes a variable frame rate. Once I am satisfied, by watching the simulation at a higher refresh rate, that the initialization data seems correct, I reduce the refresh rate to once every several seconds. I tend to work on something else while the simulation progresses, with an occasional glance at the progress data. There is no sense wasting time refreshing a display that nobody is watching.

Jim Dempsey
0 Kudos
fraggy
Beginner
1,402 Views
I suspect that the drop-off for the 4th thread is due to the code that drives the display. You can quickly test this by having the display update only once every 100 display cycles.
I have already tried this on a quad-core, but the drop-off still occurs!!!
I think it's due to the OS and other background applications. I could redo the test with minimal background applications and the priority set to "high"...
But I don't have the Core i7 anymore, sorry :p

Vincent


0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,402 Views
Quoting - fraggy
In my software architecture, threads are scheduled on particular sets of data, not on source code. Even if you can hint the source code (hint_integer_intensive or hint_floating_point_intensive), it is not executed by threads directly: source code is used by zones, and zones are computed by threads. I can assign a particular thread to a particular zone, but I can't assign a particular thread to particular source code.
And there is no way to have a zone with only integer-intensive or only FP-intensive hints.

One solution is to assign 2 threads to 1 zone, 2 sibling threads. But my zones are not meant to be multithreaded, and everyone knows that the less thread-related source code you introduce, the better :p

Another solution would be to check every zone, before each iteration, for its integer/FP distribution. For instance, if I find 2 zones with >90% integer computation (thanks to the hint mechanism), I just have to schedule non-sibling threads for those zones.
On a processor with 8 HT threads per core, this optimization would be a must-have.



HT-aware scheduling must be done at the level of tasks, not threads. In the end, a thread executes some code on some data. Code and data are combined in the form of tasks (or zones, in your terminology). So the goal of an HT-aware scheduler is to try to schedule "different" tasks (integer-intensive/FP-intensive, or memory-intensive/compute-intensive) onto HT sibling threads at any given time.
Another possible optimization is to try to schedule tasks that work on the same data onto HT sibling threads. So, I think, the ideal variant would be: the first task does integer computations on the data, the second task does floating-point computations on the same data (how can data be integer and floating-point at the same time? :) ).
However, all this is probably of more theoretical interest for now...


0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,402 Views
Quoting - fraggy


As you can see, when I test with 4 threads (and 4 zones), Windows dispatches (without any help) 1 thread per hardware core:
- x1.80 (2 threads)
- x2.44 (3 threads)
- x2.78 (4 threads)

I get (almost) linear scaling from 1 to 4 threads, and a different (but still linear) slope from 4 to 8 (see the black line in the attached graphs).


Thank you for posting the results.
So it seems that you are gaining the performance boost exactly from HT (not from other architectural improvements). A ~25-35% improvement from HT seems quite reasonable.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,402 Views
Quoting - fraggy

I have already tried this on a quad-core, but the drop-off still occurs!!!
I think it's due to the OS and other background applications. I could redo the test with minimal background applications and the priority set to "high"...


The most straightforward reason for negative scaling is excessive contention on shared data structures (in user code or in library code). If user code doesn't modify any shared structures, then you can try to increase task granularity (so that calls to the scheduler become less frequent -> less contention on the library's internal structures).



0 Kudos
fraggy
Beginner
1,402 Views
Quoting - Dmitriy Vyukov
HT-aware scheduling must be done at the level of tasks, not threads. In the end, a thread executes some code on some data. Code and data are combined in the form of tasks (or zones, in your terminology). So the goal of an HT-aware scheduler is to try to schedule "different" tasks (integer-intensive/FP-intensive, or memory-intensive/compute-intensive) onto HT sibling threads at any given time.
I don't have tasks; zones are just... geographical zones (the set of all the 3D objects present in the zone). And normally I don't have a zone with just integer or just FP computation.
1 zone -> 1 thread.
I can't let 2 threads (even siblings) work on 1 zone at the same time: that would require too much thread-related source code to be written.
And since the performance gain is OK, I will not walk that path :)


0 Kudos
fraggy
Beginner
1,402 Views
Quoting - Dmitriy Vyukov
Another possible optimization is to try to schedule tasks that work on the same data onto HT sibling threads. So, I think, the ideal variant would be: the first task does integer computations on the data, the second task does floating-point computations on the same data (how can data be integer and floating-point at the same time? :) ).
However, all this is probably of more theoretical interest for now...

I don't have any "tasks that work on the same data"; this is a particularity of the architecture. I mean that 2 zones, even with common objects (objects that exist in both zones), don't have common/shared data.
I use separate copies of the data and synchronize on the fly using a message-passing algorithm.
Sorry, to be more precise, I do have shared data: the message boxes used by zones for the message-passing algorithm. Writing/reading a message is the only place that is locked by a mutex.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,402 Views

Sorry you lost access to the Core i7 for further testing.

A suggestion for scalability testing in general:

For your particular test, set up the test data identically and vary only the thread count (and thread placement with respect to HT). This means your container box should be divided into the same number of sub-boxes for all tests. The one-thread test would progress through each sub-box in sequence, the two-thread test would have each thread performing half the boxes, etc.

Unfortunately, a completely fair test requires the number of sub-boxes to be divisible by every thread count being tested. For the Core i7 this would be a multiple of 5*6*7*8 = 1,680 cells (1, 2 and 3 factor into the other numbers). Fortunately, the number of particles you were testing with was large enough to perform a proper test.

For the four-core test, use a multiple of 4*3 = 12 sub-boxes. Using more sub-boxes might let you detect the point where background processing starts to show.

Jim Dempsey
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,402 Views
Quoting - fraggy
I don't have any "tasks that work on the same data"; this is a particularity of the architecture. I mean that 2 zones, even with common objects (objects that exist in both zones), don't have common/shared data.
I use separate copies of the data and synchronize on the fly using a message-passing algorithm.
Sorry, to be more precise, I do have shared data: the message boxes used by zones for the message-passing algorithm. Writing/reading a message is the only place that is locked by a mutex.

Oh, I see. If you are executing only 1 kind of task, then it's senseless to talk about different kinds of tasks (integer/FP), of course.
You were saying that it's a framework for game development, so I was thinking that there must be graphics, physics, AI, helper, etc. tasks...


0 Kudos
fraggy
Beginner
1,402 Views
Quoting - Dmitriy Vyukov
The most straightforward reason for negative scaling is excessive contention on shared data structures (in user code or in library code). If user code doesn't modify any shared structures, then you can try to increase task granularity (so that calls to the scheduler become less frequent -> less contention on the library's internal structures).

I was aware of that when I designed the architecture (several years ago):
- I wanted a data-parallel architecture, and one of the first ways to get it is to divide the 3D world into zones.
- I wanted to disconnect each zone from the others: 1 thread -> 1 zone. Inside a zone I don't have any mutex or condition variable.
- I wanted an asynchronous way to communicate from one thread to another: when a zone wants to talk to another zone, it sends a message to a mutex-protected MessageBox. It's one of only two shared data structures in the software architecture (the other is used to assign threads to zones).
- About granularity: with 1 thread -> 1 zone I can't do better!!! Note that in all my tests, threads always stay on the same zone; they never switch.

So I can say that contention on shared data structures (in library code) is reduced to a minimum :p
In user code, I can't force users to avoid shared data, but I can test their code and show them that scaling will not be possible with it.

0 Kudos
fraggy
Beginner
1,402 Views
For your particular test, set up the test data identically and vary only the thread count (and thread placement with respect to HT). This means your container box should be divided into the same number of sub-boxes for all tests. The one-thread test would progress through each sub-box in sequence, the two-thread test would have each thread performing half the boxes, etc.

Unfortunately, a completely fair test requires the number of sub-boxes to be divisible by every thread count being tested. For the Core i7 this would be a multiple of 5*6*7*8 = 1,680 cells (1, 2 and 3 factor into the other numbers). Fortunately, the number of particles you were testing with was large enough to perform a proper test.

For the four-core test, use a multiple of 4*3 = 12 sub-boxes. Using more sub-boxes might let you detect the point where background processing starts to show.
I don't understand the part about the sub-box count being divisible by every thread count, but I will try this on my quad-core.
Stay posted.

0 Kudos
fraggy
Beginner
1,402 Views
Quoting - Dmitriy Vyukov
You were saying that it's a framework for game development, so I was thinking that there must be graphics, physics, AI, helper, etc. tasks...
There are many different kinds of things to compute. Users may specify all kinds of entities (3D object, AI entity, light source, shadow, particles...), and they have to deliver source code for every possible pair (detect-and-solve code for light_source/3DObject, light_source/shadow, light_source/AI, etc.).
So I may have a lot of computable "tasks" in a given zone. But since I have only 1 thread per zone, I can't do any HT optimization (in the zone part of the architecture).

But I think future manycore processors may use HT with 8 sibling threads per core (see the Rock processor). So I will probably have to do something one day :)

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,402 Views

Assume you are testing with 3 threads but your universe is divided into 4 sub-boxes.

3 of the threads would perform 3/4 of the sub-boxes in pass T=1, then 1 of the three threads would perform the remaining box in pass T=2: each pass takes 1/4 of the one-thread time, so the two passes take 2/4 of what 1 thread would take on the same 4 sub-boxes. Runtime = 0.5 that of 1 thread, for a scalability factor of 0.3333/0.500 = 0.6666.

When using 3*4 (12) sub-boxes and 3 threads:

The 3 threads perform 3/12 of the boxes, then another 3/12, then 3/12, then 3/12; each pass takes 1/12 of the time, and the 4 passes take 4/12 of the time. Runtime = 0.3333 that of 1 thread, for a scalability factor of 0.3333/0.3333 = 1.0 (actual results will vary).

When the number of boxes is divisible by every thread count under test (1,2,3,4,5,6,7,8), you will have no idle threads on the last pass.

1*2*3*4*5*6*7*8 (40,320) would qualify, but that many sub-boxes would be impractical for testing. Since 1 divides everything, 2 divides 4, 6 and 8, 3 divides 6, and 4 divides 8, the product 5*6*7*8 (1,680) is evenly divisible by 1 through 8 (the least common multiple of 1 through 8 is actually 840, half of that).
If you test scalability using only the 1, 2, 4 and 8 thread combinations, then since each divides 8, you could use 8 sub-boxes.

1, 2, 3, 4 thread tests using 4 sub-boxes (each '=' bar represents 1/4 of the one-thread runtime):

1 thread
============

2 threads
======
======

3 threads
======
=== (idle during the second pass)
=== (idle during the second pass)

4 threads
===
===
===
===

When using 12 sub-boxes you will have no idle threads (you can diagram this).

Jim Dempsey
0 Kudos
fraggy
Beginner
1,402 Views
1, 2, 3, 4 thread tests using 4 sub-boxes (each '=' bar represents 1/4 of the one-thread runtime):

1 thread
============

2 threads
======
======

3 threads
======
=== (idle during the second pass)
=== (idle during the second pass)

4 threads
===
===
===
===

I understand this!!!
If I test with 3 threads, one of them will work the whole time and the other 2 will idle 50% of the time.
It's perfectly logical. Thanks, Jim.

Vincent
0 Kudos
fraggy
Beginner
1,421 Views


Red is a 1-to-4-thread test with 4 zones.
Yellow is a 1-to-4-thread test with 6 zones.

Red shows a scary drop-off at the 3rd thread (the drop-off at the 4th thread is due to background applications).
Yellow shows an abnormal drop-off at the 4th thread.

Vincent. This test shows the "jimdempsey" effect :p
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,421 Views

Good work Vincent.

Using 12 zones should improve (reduce) the last-core drop-off: 12 is divisible by 1, 2, 3 and 4, whereas 6 is divisible only by 1, 2 and 3.

Using 6 zones and 4 cores, the estimated run times (sans overhead for display and other system tasks) are:

1 thread
==================

2 threads
=========
=========

3 threads
======
======
======

4 threads
======
======
===
===

This gives 3 and 4 threads approximately the same run time (just what you observed).

Jim Dempsey

0 Kudos