Intel® Moderncode for Parallel Architectures

A multithreaded demo on quadcore / corei7 (scaling manycore processor)

fraggy
Beginner
7,766 Views
Hello everybody,

I am currently working on a multithreaded demo, and I usually test its load-balancing performance on a 2.4 GHz quadcore. My library shows interesting performance scaling from 1 to 4 cores; I use a kind of data-parallel model to dispatch work across multiple cores. Yesterday we tried our demo on a corei7, and the performance is not so great...

On my quadcore, going from 1 thread to 2 threads we can handle 1.6 times more 3D objects on screen; from 1 thread to 3 threads, 2.1x more objects; and from 1 thread to 4 threads, 2.5x more objects (see the test here: http://www.acclib.com/2009/01/load-balancing-23012009.html).
The test is basically a fish tank to which I add spheres until the system runs out of processing power. Each sphere can interact with the fish tank (so the spheres stay inside) and with the other spheres.

On my new corei7, going from 1 to 8 threads we can only handle 4x more objects (the detailed test is available here: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html).
What is going on? Can anybody explain this to me?

I know that the corei7 is not an octocore processor, but I expected a bit more...
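
To put the figures above on a common scale, here is a minimal sketch that turns each quoted speedup into a parallel efficiency (speedup divided by thread count). The speedups are the ones from this post; the function name `efficiency` is only illustrative, and the count of 4 physical cores for the corei7 is an assumption based on the remark that it is not an octocore.

```python
# Speedups quoted in the post: objects handled relative to the 1-thread run.
quadcore = {1: 1.0, 2: 1.6, 3: 2.1, 4: 2.5}
corei7 = {1: 1.0, 8: 4.0}

def efficiency(speedup, units):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / units

for name, results in (("quadcore", quadcore), ("corei7", corei7)):
    for threads, speedup in sorted(results.items()):
        print(f"{name}: {threads} thread(s) -> {speedup:.1f}x, "
              f"efficiency {efficiency(speedup, threads):.3f}")

# Assumption: the corei7 exposes 8 hardware threads on 4 physical cores,
# so the same 4x can also be read relative to the physical core count.
I7_PHYSICAL_CORES = 4
print("corei7 per physical core:", efficiency(corei7[8], I7_PHYSICAL_CORES))
```

Read per hardware thread, the corei7 result looks weak; read per physical core, it does not. Which denominator is fair depends on how much one expects from hyperthreading.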

Vincent, if you want more information on this project, feel free to read www.acclib.com
Dmitry_Vyukov
Valued Contributor I
7,708 Views
Quoting - fraggy
In fact, I can handle 4x more spheres with 8 threads on a corei7; see the videos: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html
I was expecting a bit more (5x or 6x more spheres), but as you know, hyperthreading does not perform as well as a real core :p

Can anybody tell me whether this kind of performance is interesting for Intel? What do you all think about that?



I think your results are OK. You gained an additional 60% by switching to the i7 (2.5x on Core2Quad vs. 4x on i7). For the Pentium 4's HT, Intel reported that the maximum a suitable application can get from HT is about 30% more performance. I have not yet heard a similar number for the i7, but I believe it must be very similar. So we can roughly estimate that you get an additional 30% from HT and another 30% from other architectural improvements.

You want a 6x speedup on the i7; that would mean an additional 140% from HT, which is just impossible.
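
The percentages above can be checked with a few lines of arithmetic. This is only a sketch: the speedups are the ones quoted in the thread, and the helper name `extra_gain` is hypothetical.

```python
def extra_gain(new_speedup, old_speedup):
    """Additional gain of new over old, as a fraction (0.6 means +60%)."""
    return new_speedup / old_speedup - 1.0

CORE2QUAD = 2.5    # 4 threads vs. 1 thread, quoted in the thread
I7_MEASURED = 4.0  # 8 threads vs. 1 thread, quoted in the thread
I7_HOPED = 6.0     # the speedup fraggy was hoping for

print(f"measured i7 vs Core2Quad: +{extra_gain(I7_MEASURED, CORE2QUAD):.0%}")
print(f"hoped-for i7 vs Core2Quad: +{extra_gain(I7_HOPED, CORE2QUAD):.0%}")
```

The first line reproduces the 60% figure (4 / 2.5 = 1.6) and the second the 140% figure (6 / 2.5 = 2.4).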


fraggy
Beginner
1,029 Views

Quoting - jimdempseyatthecove

Please run the test with 12 zones; you may be pleasantly surprised. The purpose of the additional zones is to accommodate background activity on the system. Your application will perform more computations; however, I anticipate you will also have more working threads for a longer duration. Your system has a non-zero amount of work to perform while the application is running. Therefore, if this time is X and your zone run time is Y, your 4-core system is unproductive for 3(Y-X). By doubling the number of zones, your system is unproductive for 3(Y/2-X), although the overhead of the additional zones increases, so the base Y is slightly different.

Jim Dempsey
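
As a rough illustration of the estimate quoted above, the sketch below simply evaluates 3(Y-X) and 3(Y/2-X) for made-up values. X and Y here are hypothetical numbers, not measurements from this thread, and the function name `unproductive_time` is an assumption for illustration only; the extra overhead of additional zones is ignored.

```python
def unproductive_time(zone_time, background_time, cores=4):
    """Idle core-time per the quoted estimate: (cores - 1) * (Y - X)."""
    return (cores - 1) * (zone_time - background_time)

# Hypothetical values in milliseconds; not taken from any test in the thread.
X = 1.0   # background work the system must do while the application runs
Y = 10.0  # run time of one zone

print(unproductive_time(Y, X))      # original zone count: 3 * (Y - X)
print(unproductive_time(Y / 2, X))  # doubled zone count:  3 * (Y/2 - X)
```

With these values the estimate drops from 27 ms to 12 ms, which is the effect the suggestion of 12 zones is aiming for.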



I apologize, my first attempt at the 12-core test was heartbreaking: 12 zones means 12 cores, and I allocate a LOT of memory for that. At that time I had only 2 GB of memory, Vista started swapping, and my performance went down...
I bought an extra GB (25 euros) and spent a few minutes tuning the memory allocation... and ta-da!!!!!

As you said, it is a very good-looking chart (the green one) and, surprisingly, the performance is not so bad :p

Vincent, faith seems to be the only required skill for R&D.
jimdempseyatthecove
Honored Contributor III
1,029 Views

Good job at running the test. Here are my comments.

The 1-thread test with 12 zones is significantly different from the 1-thread tests with 4 and 6 zones. Therefore I suspect you have differences in your optimization switch settings. There is almost no difference between your 4-zone and 6-zone 1-thread tests. You can observe a very slight dip in the 6-zone 1-thread performance vs. the 4-zone 1-thread test. I would expect a similar dip between the 6-zone and 12-zone test runs with 1 thread (about 1/2 marker size).

Looking at the scaling curve between 3 and 4 threads, you see a much better slope for the 12-zone test than for the 6-zone test. I believe that once you find and fix the problem with the 12-zone 1-thread test, the complete 12-zone curve for 2, 3, and 4 threads will rise accordingly.

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
1,029 Views

Why would the amount of memory change? The number of spheres is the same, isn't it? You do have more walls, but I wouldn't think that would cause much of a difference in footprint size.

Did the number of RAM chips change? Did the speed of the RAM chips change? If yes to either, did you re-run the 4 and 6 zone tests? (different memory may have different latencies and performance).

Jim Dempsey
fraggy
Beginner
1,029 Views

Quoting - jimdempseyatthecove

Why would the amount of memory change? The number of spheres is the same, isn't it? You do have more walls, but I wouldn't think that would cause much of a difference in footprint size.

Did the number of RAM chips change? Did the speed of the RAM chips change? If yes to either, did you re-run the 4 and 6 zone tests? (different memory may have different latencies and performance).

Jim Dempsey

Until now, choosing 12 zones meant that I planned to test on 12 cores. I have to allocate enough memory to test 12x more spheres than on 1 core...

As I said before, I just tuned the memory allocation; it was not a big deal.
As for the GB of memory I added, it is just an additional RAM chip; memory certainly can't run faster, but it can run slower.
With more memory I should retest, since conditions have changed :p Vista seems a bit faster (less swapping), and I compile and link a bit faster; maybe I will see some improvement (in raw performance, not scaling).

Vincent, stay posted
jimdempseyatthecove
Honored Contributor III
1,029 Views

When you added an additional chip did you go from 1 chip to 2 chips, 2 to 3 chips,...?

When you go from 1 chip to 2 chips, most motherboards will permit interleaving of the memory addresses; some BIOSes let you select interleaved or separate. Performance can vary depending on the interleave and the application. For interleaving you need an even number of chips (for 2-way interleaving, that is; 4 chips for 4-way interleaving, if your BIOS supports that).

If you now have an odd number of chips (3), the BIOS may have decided that, since it cannot interleave the 3rd chip, it will not interleave the first 2 chips either. So if you changed your evenness/oddness (which, by adding 1 chip, you had to, unless you replaced one smaller chip with one larger chip), then you may have changed the interleaving and thus affected the base-level computation time (i.e. your run data for 4 and 6 zones is no longer valid for comparison with new run data).

Another factor: if the new memory speed is different from the old memory speed, the BIOS will generally apply the slowest chip's settings to all memory chips. Again, this will affect the base-level test, and the 4-zone and 6-zone test runs will have to be redone.

In essence your curve data were produced on different machines.
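
To make the interleaving point concrete, here is a toy model of how addresses might map to memory modules with and without interleaving. It is only a sketch under stated assumptions: the 64-byte interleave granularity, the 1 GiB module size, and the function names are all hypothetical, and real chipsets use their own mapping schemes.

```python
from collections import Counter

LINE_BYTES = 64  # assumed interleave granularity (hypothetical)
GIB = 1 << 30    # assumed module size: 1 GiB each (hypothetical)

def module_for_address(addr, modules, module_bytes, interleaved):
    """Return the index of the module that serves this address in the toy model."""
    if interleaved:
        # Consecutive lines rotate across the modules.
        return (addr // LINE_BYTES) % modules
    # Separate mode: the first module is filled completely before the next.
    return addr // module_bytes

def access_histogram(num_lines, modules, module_bytes, interleaved):
    """Count how many of the first num_lines sequential lines hit each module."""
    addresses = (i * LINE_BYTES for i in range(num_lines))
    return Counter(module_for_address(a, modules, module_bytes, interleaved)
                   for a in addresses)

print(access_histogram(1024, 2, GIB, interleaved=True))   # lines split across both modules
print(access_histogram(1024, 2, GIB, interleaved=False))  # every line lands on module 0
```

In this model a sequential scan is shared by both modules only in the interleaved case, which is why a change in interleaving can shift the base-level timings even when the code is unchanged.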

Jim Dempsey
fraggy
Beginner
1,029 Views

Quoting - jimdempseyatthecove
In essence your curve data were produced on different machines.

I have redone all the testing; the curves seem to be the same (more or less).
For future testing I will keep the 1 thread/1 zone system; that way, the number of pairs and of effective collisions rises linearly.

By the way, I just found a performance scaling test for the corei7. Cinebench seems to show some of the best scaling on this processor: 4.32x faster with 8 threads (compared to 1).


My performance results seem to be correct after all :)
If you want more information on this project, feel free to read www.acclib.com.

Vincent
fraggy
Beginner
1,029 Views
Latest results on the corei7
The tests were run 20 times by Laurent; thanks to him.

With the latest revision of the library, we can demonstrate a 4.34x performance gain on a core i7; it is the largest performance gain ever seen on a real-time 3D application (such as a video game). That is the factor between 1 thread and 8 threads. A POV-Ray-like benchmark demonstrates exactly the same performance scaling on the corei7: the free lunch is not over for video games!!!


I'm looking for performance scaling tests of games or demos for comparison :p
