Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.
1696 Discussions

A multithreaded demo on quadcore / corei7 (scaling manycore processor)

fraggy
Beginner
5,714 Views
Hello everybody,

I'm currently working on a multithreaded demo, and I usually test the load-balancing performance on a quad-core 2.4 GHz machine. My library shows interesting performance from 1 to 4 cores; I use a kind of data-parallel model to dispatch work across multiple cores. Yesterday we tried our demo on a Core i7, and the performance is not so great...

On my quadcore, from 1 thread to 2 threads we can deal with 1.6x more 3D objects on screen; from 1 thread to 3 threads, 2.1x more objects; and from 1 thread to 4 threads, 2.5x more objects (see the test here: http://www.acclib.com/2009/01/load-balancing-23012009.html).
The test is basically a fish tank in which I add spheres until the system runs out of processing power. Each sphere can interact with the fish tank (so it stays inside) and with the other spheres.

On my new Core i7, from 1 to 8 threads we can only deal with 4x more objects. (A detailed test is available here: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html)
What is going on? Can anybody explain this to me?

I know that the Core i7 is not an octo-core processor, but I expected a bit more...

Vincent. If you want more information on this project, feel free to read www.acclib.com
0 Kudos
1 Solution
Dmitry_Vyukov
Valued Contributor I
5,656 Views
Quoting - fraggy
In fact, I can deal with 4x more spheres with 8 threads on a Core i7, see the videos: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html
I was expecting a bit more (5-6x more spheres), but as you know, hyperthreading is not as performant as a real core :p

Can anybody tell me whether this kind of performance is interesting for Intel? What do you people think about it?



I think your results are OK. You've gained an additional 60% by switching to the i7 (2.5x on the Core 2 Quad vs. 4x on the i7). For the Pentium 4's HT, Intel reported that the maximum a suitable application can get from HT is some 30% more performance. I haven't heard a similar number for the i7 yet, but I believe it must be very similar. So we can roughly say that you get an additional 30% from HT and another 30% from other architectural improvements.

If you want a 6x speedup on the i7, that would be an additional 140% from HT, which is just impossible.


View solution in original post

0 Kudos
69 Replies
TimP
Honored Contributor III
1,979 Views
In my experience, a single-socket Core i7 has performance similar to a dual-socket Core 2 Quad. Supposing you had an application which could achieve an additional 50% performance by use of HyperThreading, you would require care in affinity and in the sharing of fill buffers, among other possible issues. If you have already taken care to minimize latencies and cache misses, or are running a 32-bit OS, that additional 50% from hyperthreading is unlikely.
0 Kudos
fraggy
Beginner
1,979 Views
Quoting - fraggy
On my new Core i7, from 1 to 8 threads we can only deal with 3.5x more objects. (The video is not available yet.)
In fact, I can deal with 4x more spheres with 8 threads on a Core i7, see the videos: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html
I was expecting a bit more (5-6x more spheres), but as you know, hyperthreading is not as performant as a real core :p

Can anybody tell me whether this kind of performance is interesting for Intel? What do you people think about it?

Vincent
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,979 Views

In looking at your videos I have a few comments.

Your videos seem to indicate that the computational domains (sub-boxes) are running on threads that are not pinned to a specific hardware thread (you are not using processor affinity to bind a thread to a specific core). When a software thread migrates to a separate core, then, depending on the relationships, the thread may lose some of its cached data (e.g. if the threads do not share the same L2 cache). You might try adding a little code at thread startup to set the affinity.
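As an illustration of Jim's suggestion, here is a minimal sketch of pinning the calling thread to one logical CPU at startup, using the Linux pthread API. The name pin_current_thread is illustrative, not part of ACClib; on Windows, SetThreadAffinityMask plays the same role.

```cpp
#include <pthread.h>
#include <sched.h>

// Sketch only: pin the calling thread to logical CPU `cpu` so the OS
// scheduler cannot migrate it (and invalidate its cached data).
// Returns true on success. pthread_setaffinity_np is a non-portable
// GNU extension, hence Linux-specific.
bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

Each worker would call pin_current_thread(worker_index) as its first action; which worker indices share an L2 cache is then under the program's control rather than the scheduler's.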

The computational problem, other than the indexing of your objects, is primarily floating point. HT performs better when you have a blend of FP and integer computations.

Since the problem is collisions of particles (small objects), consider creating more problem domains than you have threads. Have each thread work on multiple domains (the same set of domains each iteration). This will create more wall checking, but will also reduce the volume (and the number of particles) for the particle-to-particle interactions.

Also, until you hit 8 threads, you can better assess the computational overhead of the other processing (display).

Jim Dempsey
0 Kudos
fraggy
Beginner
1,979 Views
>You are not using processor affinity to bind a thread to a specific core
You're right: I'm using POSIX :) For the moment I trust the OS to do the right thing. Maybe in the next version...

>The computational problem, other than for indexing your objects, is primarily all floating point. HT performs better
>when you have a blend of FP and integer computations.
Do you mean "only integer computation" or "a blend of floating-point and integer computation"? I don't get it. Anyway, my product is supposed to help developers parallelize their code; I can't ask them to rewrite all their code with integers :p
I'll keep that in mind...

>Since the problem is collisions of particles (small objects), consider creating more problem domains than you have
>threads. Have each thread work on multiple domains (same set of domains each iteration). This will create more
>wall checking, but will also reduce the volume (and number of particles) for the particle to particle interactions.
I have already found a way to reduce the number of particle/particle pairs to check. It's a grid algorithm: I only test particles that are in the same grid cell (only the nearest particles).
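For readers unfamiliar with the approach, the grid trick Vincent describes can be sketched as follows. This is a minimal illustration, not ACClib code: it buckets spheres by integer cell coordinates and only counts pairs within the same cell; a complete broad phase would also visit the neighbouring cells for spheres near a cell boundary.

```cpp
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Sphere { float x, y, z; };

// Hash a position into a grid cell of size `cell`.
// The multiplier constants are a common spatial-hashing choice.
static long long cell_key(const Sphere& s, float cell) {
    long long ix = (long long)std::floor(s.x / cell);
    long long iy = (long long)std::floor(s.y / cell);
    long long iz = (long long)std::floor(s.z / cell);
    return (ix * 73856093LL) ^ (iy * 19349663LL) ^ (iz * 83492791LL);
}

// Count candidate collision pairs: only spheres sharing a cell are
// tested against each other, instead of all n*(n-1)/2 pairs.
std::size_t candidate_pairs(const std::vector<Sphere>& spheres, float cell) {
    std::unordered_map<long long, std::vector<std::size_t>> grid;
    for (std::size_t i = 0; i < spheres.size(); ++i)
        grid[cell_key(spheres[i], cell)].push_back(i);
    std::size_t pairs = 0;
    for (const auto& kv : grid)
        pairs += kv.second.size() * (kv.second.size() - 1) / 2;
    return pairs;
}
```

With three spheres where only two share a cell, candidate_pairs reports one pair instead of three, which is where the saving over brute-force pair checking comes from.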

>Also, up until you hit 8 threads, you can better assess the computation overhead of the other processing (display)
I have some trouble with that. Drawing 10K spheres costs me almost 40 ms each frame (glVertex is so slow). I'm trying to improve this part.

Vincent
0 Kudos
Dmitry_Vyukov
Valued Contributor I
5,657 Views
Quoting - fraggy
In fact, I can deal with 4x more spheres with 8 threads on a Core i7, see the videos: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html
I was expecting a bit more (5-6x more spheres), but as you know, hyperthreading is not as performant as a real core :p

Can anybody tell me whether this kind of performance is interesting for Intel? What do you people think about it?



I think your results are OK. You've gained an additional 60% by switching to the i7 (2.5x on the Core 2 Quad vs. 4x on the i7). For the Pentium 4's HT, Intel reported that the maximum a suitable application can get from HT is some 30% more performance. I haven't heard a similar number for the i7 yet, but I believe it must be very similar. So we can roughly say that you get an additional 30% from HT and another 30% from other architectural improvements.

If you want a 6x speedup on the i7, that would be an additional 140% from HT, which is just impossible.


0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,979 Views
Quoting - fraggy

>The computational problem, other than for indexing your objects, is primarily all floating point. HT performs better
>when you have a blend of FP and integer computations.
Do you mean "only integer computation" or "a blend of floating-point and integer computation"? I don't get it. Anyway, my product is supposed to help developers parallelize their code; I can't ask them to rewrite all their code with integers :p
I'll keep that in mind...

To get the maximum from HT, you must schedule a task that does mostly integer computations to one hardware thread on a core, and a task that does mostly floating-point computations to the sibling hardware thread on the same core. This way the first task will be utilizing the integer execution units (EUs) and the second task will be utilizing the floating-point EUs, and from this you can get a speedup. If both tasks do mostly integer computations (or both mostly floating-point computations), then they will be constantly contending for the integer EUs; there will basically be time-slicing of the integer EUs, thus NO speedup.

I was thinking about introducing a special hint into my task scheduling system, i.e. when the user creates a task he can specify whether the task will do mostly integer computations (the default) or floating-point computations. Thus the runtime will be able to sensibly schedule tasks in an HT-aware way.
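A toy sketch of what such hint-aware pairing could look like (all names here are hypothetical, not a real scheduler API): tasks are queued by hint, and the dispatcher prefers to hand one integer-heavy and one FP-heavy task to the two sibling hardware threads of a core.

```cpp
#include <deque>

// Hypothetical hint, as proposed above: the user tags each task.
enum TaskHint { hint_integer_intensive, hint_floating_point_intensive };

struct Task { int id; TaskHint hint; };

class HtAwareQueue {
    std::deque<Task> int_q, fp_q;
public:
    void spawn(const Task& t) {
        (t.hint == hint_integer_intensive ? int_q : fp_q).push_back(t);
    }
    // Pick two tasks for the sibling HT threads of one core. Prefer one
    // integer task plus one FP task, so they contend for different EUs;
    // fall back to any two tasks of the same kind.
    bool next_pair(Task& a, Task& b) {
        if (!int_q.empty() && !fp_q.empty()) {
            a = int_q.front(); int_q.pop_front();
            b = fp_q.front();  fp_q.pop_front();
            return true;
        }
        std::deque<Task>& q = int_q.empty() ? fp_q : int_q;
        if (q.size() < 2) return false;
        a = q.front(); q.pop_front();
        b = q.front(); q.pop_front();
        return true;
    }
};
```

The design choice is simply that the pairing happens at dispatch time, so a mislabelled task degrades performance a little but never correctness.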




0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,979 Views
Quoting - fraggy
>You are not using processor affinity to bind a thread to a specific core
You're right: I'm using POSIX :) For the moment I trust the OS to do the right thing. Maybe in the next version...

Usually today's OSes are BAD at that. So in the context of a user-space task scheduler one had better NOT trust the OS. The OS doesn't know your data placement, the OS doesn't know which thread accesses which data, etc.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,979 Views
Quoting - Dmitriy Vyukov

To get the maximum from HT, you must schedule a task that does mostly integer computations to one hardware thread on a core, and a task that does mostly floating-point computations to the sibling hardware thread on the same core. This way the first task will be utilizing the integer execution units (EUs) and the second task will be utilizing the floating-point EUs, and from this you can get a speedup. If both tasks do mostly integer computations (or both mostly floating-point computations), then they will be constantly contending for the integer EUs; there will basically be time-slicing of the integer EUs, thus NO speedup.

I was thinking about introducing a special hint into my task scheduling system, i.e. when the user creates a task he can specify whether the task will do mostly integer computations (the default) or floating-point computations. Thus the runtime will be able to sensibly schedule tasks in an HT-aware way.



Another implication of HT is that sibling threads share the L1 data and instruction caches (L1D$, L1I$). So you probably want to keep two different "normal" threads as far apart as possible from the viewpoint of accessed data (to reduce contention and to give each thread a good piece of independent work), BUT you probably want to keep sibling HT threads close to each other from the viewpoint of accessed data/code (because otherwise their working sets may simply not fit into the L1D$/L1I$, and you will get performance degradation from HT; yes, a small performance degradation from HT is indeed possible).


0 Kudos
fraggy
Beginner
1,979 Views
>If you want a 6x speedup on the i7, that would be an additional 140% from HT, which is just impossible.
OK, thank you :p
Now I just have to wait for a real octo-core...

>To get the maximum from HT, you must schedule a task that does mostly integer computations to one hardware
> thread on a core, and a task that does mostly floating-point computations to the sibling hardware thread on the
> same core. This way the first task will be utilizing the integer execution units (EUs) and the second task will be
> utilizing the floating-point EUs, and from this you can get a speedup. If both tasks do mostly integer computations
> (or both mostly floating-point computations), then they will be constantly contending for the integer EUs; there
> will basically be time-slicing of the integer EUs, thus NO speedup.
All of my threads work on independent data (I use a data-parallel model in my API); every thread works on the exact same kind of data. I don't have a way to say "this one works on integers, this one works on floats"...
I guess this HT optimisation is not for me. But I have a decent performance gain, so...

>Usually today's OSes are BAD at that. So in the context of a user-space task scheduler one had better NOT trust
> the OS. The OS doesn't know your data placement, the OS doesn't know which thread accesses which data, etc.
I have a very simple architecture; all work items are strictly interchangeable with each other, so the OS can't make a mistake...

>Another implication of HT is that sibling threads share the L1D$ and L1I$...
Like I said before, there are no sibling threads for the moment. But maybe I can find a way in that direction...

Thank you for your participation in this discussion !!!
Vincent



0 Kudos
fraggy
Beginner
1,979 Views
I need to test my demo on many-core processors (I expect linear scaling).
Is anybody here working on Larrabee?

Vincent :p
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,979 Views
Quoting - fraggy
>To get the maximum from HT, you must schedule a task that does mostly integer computations to one hardware
> thread on a core, and a task that does mostly floating-point computations to the sibling hardware thread on the
> same core. This way the first task will be utilizing the integer execution units (EUs) and the second task will be
> utilizing the floating-point EUs, and from this you can get a speedup. If both tasks do mostly integer computations
> (or both mostly floating-point computations), then they will be constantly contending for the integer EUs; there
> will basically be time-slicing of the integer EUs, thus NO speedup.
All of my threads work on independent data (I use a data-parallel model in my API); every thread works on the exact same kind of data. I don't have a way to say "this one works on integers, this one works on floats"...
I guess this HT optimisation is not for me. But I have a decent performance gain, so...



If you are developing a general-purpose library, then your users will probably have different kinds of tasks. I would surmise that games contain some integer-intensive parts (probably AI, world-model manipulation, I really don't know) and some floating-point-intensive parts (physics). So a user might write something like:

your_lib::task* t1 = your_lib::spawn(calculate_AI, your_lib::hint_integer_intensive);
your_lib::task* t2 = your_lib::spawn(calculate_physics, your_lib::hint_floating_point_intensive);

And you will try to schedule these tasks on HT sibling threads.

I am thinking about the following test in order to estimate the possible performance gain. Create a function f_int() that does integer calculations and takes 10 seconds to complete. Create a function f_fp() that does floating-point calculations and takes 10 seconds to complete. Create 2 threads and bind them to sibling HT threads. Case 1: both threads execute f_int(). Case 2: both threads execute f_fp(). Case 3: one thread executes f_int(), and the other executes f_fp(). The results could be something like: case 1 - runtime 19 secs, case 2 - runtime 19 secs, case 3 - runtime 14 secs.
Unfortunately I don't have an HT machine now...
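Dmitriy's proposed experiment can be sketched roughly like this. It is a simplified sketch: the function names follow his description, the iteration counts would need tuning to reach his 10-second targets, and the pinning of the two threads to sibling HT threads (e.g. via pthread_setaffinity_np) is omitted for brevity.

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>
#include <thread>

// Integer-heavy work unit (xorshift); returns a checksum so the
// compiler cannot optimise the loop away.
uint64_t f_int(uint64_t iters) {
    uint64_t x = 88172645463325252ULL;
    for (uint64_t i = 0; i < iters; ++i) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    }
    return x;
}

// Floating-point-heavy work unit.
double f_fp(uint64_t iters) {
    double x = 1.0;
    for (uint64_t i = 0; i < iters; ++i) x = std::sqrt(x + 1.0);
    return x; // converges towards the golden ratio, ~1.618
}

// Run two work units on two threads and return the wall time in seconds.
template <class FA, class FB>
double run_pair(FA a, FB b) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread ta(a), tb(b);
    ta.join();
    tb.join();
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
}
```

Case 1 is then run_pair([]{ f_int(N); }, []{ f_int(N); }), case 2 the same with f_fp, and case 3 run_pair([]{ f_int(N); }, []{ f_fp(N); }); on an HT core, case 3 is the one expected to finish sooner.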



0 Kudos
jimdempseyatthecove
Honored Contributor III
1,979 Views

Dmitriy,

You will need a few more permutations. The HT siblings also share a memory port, so you would want to permute over memory access too (and the various cache levels). An application such as this demo program could by chance have a blend of integer and FP and get good scaling; however, the measured performance data does not suggest so.

As a side issue, I am not entirely sure that the metrics used for this demo have been normalized. For example, when going from one domain to two domains (sub-boxes in the video), I would assume that there is a little more work happening at the domain interface between sub-boxes than at the domain interface to open space (a particle can only bounce off an open wall, but it can penetrate through and/or interact with particles in proximity to the interface to the adjacent box). So the scaling from 1 domain to many is (T1*nCells + CellToCellInterfaceOverhead * nCellToCellInterfaces). As currently implemented, nCells == nThreads.

Jim
0 Kudos
alpmestan
Beginner
1,979 Views
Hi all,

I'm working with fraggy on this project.

First thank you all for your answers, it helps a lot.

However, we're trying to build something really OS-independent, as any general-purpose library must be. But such a general-purpose library can't benefit from optimisations such as dispatching integer calculations to one thread and float calculations to another, unless the person using our library specifies it explicitly? And if he specifies an integer-computation thread but puts some floats in it, we'll have to introduce type-checking. So what would be a good solution to that problem, in your view?

Moreover, is anybody able to run some tests for us on a many-core architecture, other than the Core i7, for which it's already done?

I have an additional question: we think we can get some speedup by modifying some code structures, but since, for the general-purpose reason, we haven't yet done anything related to int/float computations, what kinds of optimisations are left?

Thanks all.
0 Kudos
fraggy
Beginner
1,979 Views
ACClib is more of a video-game general-purpose library than a truly general-purpose lib :p
We plan to work on the general part as soon as possible.

Maybe a quick explanation may help (see http://www.acclib.com/search/label/ACCLib for more details) :

The Abstract World
Our API grants access to an abstract world in which the user (or developer) can add basic entities (such as ground, sphere, block, light source, etc.) and custom entities (3D object, rigid body, shadow, AI, particles, etc.). The abstract world internally computes the interactions occurring between existing entities (collision, illumination, change of direction or speed, etc.) and gives back data that can be used to draw each frame.

Customizing Entities
That's the key feature of the API. Users have to implement specific interfaces to allow their entities to be used in ACClib. Those interfaces will:
- define the 3D properties of the entity (shape, texture, etc.)
- define "two-by-two" interactions with other existing entities (what reaction when exposed to a light source, or when colliding with other entities, etc.)
- allow automatic validation of the entity to check, e.g., multithreaded performance.

Each zone is computed by a different thread (or by the same thread, but not at the same time). If half the computation is integer and the other half is FP, two threads may use different parts of the processor while working on two different zones. I will try to blend integer and FP; maybe performance can be improved that way.

->THANK YOU

Vincent
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,979 Views

fraggy, Vincent,

Dmitriy suggested that your library interface, at the point where thread scheduling is directed, contain an (optional) hint parameter that can be used for thread-scheduling purposes. I agree with Dmitriy that you should consider adding this to your code.

Reason 1: HT is now coming back into vogue; the quad-core (8-thread) Core i7 has it now, and others will follow soon.

Reason 2: Some functionality may be better suited to run on the GPU. When the platform has a compatible GPU, the hint can be used to select the code path. (You or your users will have to write the two program paths.)

Reason 3: Larrabee is coming down the road. This kind of product will blur the distinction between CPU and GPU. I anticipate that you (or your library users) will find some code that is not suitable for export to the GPU but is suitable for export to a Larrabee device. Although I expect initial Larrabee products to be a GPU-like card with video out occupying a PCI Express slot, later versions will likely be integrated into the motherboard, thus leaving the PCI Express slot available for a GPU (more likely a watered-down GPU).

Jim Dempsey
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,979 Views

Vincent,

RE: I will try to blend integer and FP, maybe performance can be improved that way.

When you code in C++ you will likely encounter a significant portion of integer computation. This is due to traversing the object hierarchy, as opposed to computing on a large (composite) vector as is often done in Fortran. So I suggest you examine the optimized release code before you put too much effort into co-mingling routines. Using a profiler may help too.

Jim Dempsey
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,979 Views

Dmitriy,

You will need a few more permutations. The HT siblings also share a memory port, so you would want to permute over memory access too (and the various cache levels). An application such as this demo program could by chance have a blend of integer and FP and get good scaling; however, the measured performance data does not suggest so.


Hmmm... yes... I missed many other parameters. It would be interesting and insightful to test task combinations that do/don't fit into L1, do/don't fit into L2, do/don't saturate the memory channel, and so on. Also, one HT thread can load mostly the FP_ADD execution unit and the other the FP_DIV EU, and there are also the MMX EUs...
It would be useful to identify the main tendencies wrt those permutations... However... I think it would already be infeasible for a general-purpose library to ask the user for so many parameters (L1D footprint, L1I footprint, L2 footprint, required memory bandwidth, exact types of EUs used, etc.)... Probably it's feasible to add only one additional hint, hint_execute_alone: the runtime would try to temporarily shut down the sibling HT thread while such a task executes... But the worst-case degradation from HT must be some 5%, so very likely hint_execute_alone would be useless.






0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,979 Views
Quoting - alpmestan

However, we're trying to build something really OS-independent, as any general-purpose library must be. But such a general-purpose library can't benefit from optimisations such as dispatching integer calculations to one thread and float calculations to another, unless the person using our library specifies it explicitly? And if he specifies an integer-computation thread but puts some floats in it, we'll have to introduce type-checking. So what would be a good solution to that problem, in your view?


Yes, the user specifies these hints himself.
If the user misleads the runtime with such a hint, then he will just experience some performance degradation (nothing terrible).

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,936 Views
Quoting - fraggy
On my new Core i7, from 1 to 8 threads we can only deal with 4x more objects. (A detailed test is available here: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html)
What is going on? Can anybody explain this to me?



OK, here is a very quick test that you can do in order to get some insight into the effectiveness of your HT usage.
Just pin all the worker threads so that only one hardware thread is used on each HT core. On Windows you can do this just by inserting SetProcessAffinityMask(GetCurrentProcess(), 1 | 4 | 16 | 64) in main(). On Linux you must use pthread_setaffinity_np() or sched_setaffinity(). Then compare the results obtained with your previous results. This way you will measure what speedup you are getting from HT.
If you do this, please post the numbers here.
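On Linux, the equivalent of the Windows one-liner could look like the sketch below. The name use_one_thread_per_core is illustrative, and the code assumes the common i7 enumeration where logical CPUs 0, 2, 4, 6 sit on distinct physical cores; verify that mapping against /proc/cpuinfo before trusting it.

```cpp
#include <sched.h>

// Restrict the whole process to one hardware thread per physical core,
// mirroring SetProcessAffinityMask(GetCurrentProcess(), 1|4|16|64).
// ASSUMPTION: even-numbered logical CPUs map to distinct cores.
bool use_one_thread_per_core(int cores) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < cores; ++c)
        CPU_SET(2 * c, &set); // CPUs 0, 2, 4, ...
    return sched_setaffinity(0, sizeof(set), &set) == 0; // pid 0 = this process
}
```

Calling use_one_thread_per_core(4) early in main() and re-running the 4-thread benchmark isolates the speedup contributed by HT from the speedup contributed by the physical cores.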

0 Kudos
Reply