Black Belt

Looking for help defining a Threaded Programming Contest for 2009

Since we had such a great response to the previous Threading Challenge Contest, we want to do another one or two or six. From that year-long contest that ended in September 2008 there are a few things we hope we can do better this time around. Some of the lessons learned include:

* Don't go on so long. Twelve months is a virtual programming marathon and I appreciate everyone that stuck with it. While it did separate the dabblers from the hardcore hackers from the thread monkeys, it discouraged participation from some innovative and deserving contestants.

* Try to have more objective judging criteria. Execution speed is the whole reason for concurrent and multi-core programming. Does it matter how you get there or what tools are used? A code that screams through a data set should be judged higher than a code that is consistently well indented, right?

* Be more open and precise about the judging machine configuration(s). It would be much better to have a single machine that all contestants could access to develop, test, and submit their entries. Unfortunately, this would require administrators and rack space and all sorts of other expenses. Second best is to be sure everyone understands how the machines are set up, both for hardware and software.

The driving idea behind threaded programming events for 2009 would be to take "classic" algorithms and parallelize them. The term "classic" means something that might be studied in an undergraduate course on algorithm design. Categories of classic algorithms would include searching, sorting, numerical computations, string processing, computational geometry, and graph algorithms. In addition to fostering threaded solutions to "classic" algorithms, we are hoping to generate content for the Intel Software Network.

Since we are still in the planning stages, things are still in flux. Other ideas that have been brought up are to limit the languages used, pick a single OS, require that some problems be solved in a non-traditional programming language (e.g., Haskell), or offer more than one problem per month.

In order to make this contest the best it can be, we'd like to solicit your input. If you have any ideas (especially if you participated in the Threading Challenge Contest) about how the contest could be administered, judging criteria to be used, a problem that could be posed, or how contest entries could be featured in ISN to educate readers about multi-core programming, we want to hear them. You can reply to this post or send me a message through the email facility here in the forums.

With your help, we can make threaded programming contests this year and in coming years better and more fun, and provide exceptional examples for programmers around the world.
35 Replies


Very good outline for contest.

The contest duration should be long enough for the programmer to perform the task in intermittent steps (between their normal work obligations). As to what that duration should be, I do not have a good recommendation; 12 months is too long, and 1 month might be too short.

I agree, the contest should be judged on results: accuracy and performance. Because performance will vary with the target platform, might I suggest the following: perform the tests on multiple target platforms of varying numbers of cores, speeds, instruction sets, memory configurations, etc. Intel, I assume, has a test lab with various configurations. Choose selection criteria for platforms, such as no older than n years and no newer than what has been released. You may choose equal weighting or bias the results by some factor, such as the number of particular processors sold (e.g., the Q6600 would carry heavier weight than the Extreme series).

Performance and accuracy are all that matter. In the case of a tie, other factors could be used (e.g., earliest entry, ...)

Language or tool selection should be immaterial. A user should be free to hand-code in assembler or to write an Excel macro.

The challenge should be generic and offer the contestant the widest of leeway in producing the solution.

Bad example:

Write the fastest multi-threaded Quick Sort routine given this data-set and results desired.

Good example:

Write the fastest multi-threaded sort routine given this data-set and results desired.

If someone wishes to submit an entirely new sort routine that is entirely up to them.

Jim Dempsey
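Jim's "good example" wording leaves the algorithm open to the entrant. As a minimal sketch (not a contest-ready entry), one possible shape of such a submission is a naive two-thread merge sort; the two-way split and the use of std::sort for each half are arbitrary illustration choices:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Minimal sketch: sort each half on its own thread, then merge.
// A real entry would recurse, tune the grain size, and reuse a thread pool.
void parallel_sort(std::vector<int>& data) {
    auto mid = data.begin() + data.size() / 2;
    std::thread left([&] { std::sort(data.begin(), mid); });
    std::sort(mid, data.end());  // sort the right half on the calling thread
    left.join();
    std::inplace_merge(data.begin(), mid, data.end());
}
```

An entrant remains free to replace this with a radix sort, a sample sort, or an entirely new algorithm, which is exactly the leeway the "good example" phrasing grants.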

Beginner

The driving idea behind threaded programming events for 2009 would be to take "classic" algorithms and parallelize them. The term "classic" means something that might be studied in an undergraduate course on algorithm design. Categories of classic algorithms would include searching, sorting, numerical computations, string processing, computational geometry, and graph algorithms. In addition to fostering threaded solutions to "classic" algorithms, we are hoping to generate content for the Intel Software Network.
The future of the processor is multicore, so the future of software has to be multithreaded :)
The more people work and think on parallel software the better, and the Intel Threaded Programming Contest might help with that...

>"The driving idea [...] would be to take "classic" algorithms and parallelize them."
Does it mean that if you want to parallelize real-life applications, you need to parallelize all the algorithms within?
For instance, does it mean that if I want to parallelize video compression software, I should try to parallelize the compression algorithm? For antivirus software, do I really need to make the heuristic algorithm parallel? And what about a spell corrector, a video decompressor, a compiler, a speech recognizer, render software, a neural-network pattern recognizer, a 3D modeler, a physics engine, or a video game?

But most of this software can be parallelized pretty easily: an antivirus may load multiple files in memory, one file per core; a video compression algorithm may work on different parts of the video, if you can cut the video into enough pieces, etc. You can do the same for a spell/grammar corrector, render software, etc., but not for all of them.

Some software is "not so easy to parallelize": you really need to multithread the video decompression algorithm itself, since there is no point in decompressing different parts of the video at the same time... The same problem arises with a speech recognizer or a physics engine...
Parallelizing that kind of "not so easy to parallelize" software is more interesting than "classic" algorithms...

Who really thinks nowadays that multithreading every part of a single-threaded program makes a PERFORMANT multithreaded version of the software? Who really thinks that using a multithreaded library like TBB, with protected vectors/data structures and multithreaded algorithms, makes PERFORMANT multithreaded software?
A TBB-like library is essential, and access to a performant multithreaded quicksort is essential, but your application will not scale if you don't rethink and refactor the core architecture from scratch.
My very personal opinion: the path to performance in multithreading is refactoring the software architecture, not refactoring the tools.

The Intel Threaded Programming Contest should be oriented toward real-life applications (video decompression, physics engines, video games, or any "not so easy to parallelize" software). Rewards need to be more interesting (being a first client, financing with hardware/training/money, etc.), and the only judging criterion must be performance scaling on future many-core.

Vincent


Clay,

The problem with threading challenge contests is that when you define a problem as a write-from-scratch contest, you generally restrict yourself to small, easily understood algorithms. These types of contests are good, but they are not the only type of problem that can be presented for competition.

A second type of problem to parallelize is a more complicated application. An example might be to provide the contestants with a working base-level program and then challenge them to multi-thread it, such as the Havok Smoke demo or other similar applications. Note, although Smoke is already multi-threaded using Windows threads and/or TBB threads, these threading efforts leave considerable room for improvement.

Jim Dempsey
Valued Contributor I


A second type of problem to parallelize is a more complicated application. An example might be to provide the contestants with a working base-level program and then challenge them to multi-thread it, such as the Havok Smoke demo or other similar applications. Note, although Smoke is already multi-threaded using Windows threads and/or TBB threads, these threading efforts leave considerable room for improvement.


This will be interesting indeed! And close to real life.
The Smoke demo is a good candidate. Just take Smoke and make it as fast as possible.
Hmm... although now Smoke is already a bad candidate, because some cheaters have already started optimizing it speculatively :)

Valued Contributor I


I agree, the contest should be judged on results: accuracy and performance. Because performance will vary with the target platform, might I suggest the following: perform the tests on multiple target platforms of varying numbers of cores, speeds, instruction sets, memory configurations, etc. Intel, I assume, has a test lab with various configurations. Choose selection criteria for platforms, such as no older than n years and no newer than what has been released. You may choose equal weighting or bias the results by some factor, such as the number of particular processors sold (e.g., the Q6600 would carry heavier weight than the Extreme series).


I would prefer a single defined platform, because I will not have enough time to make optimized solutions for many platforms, though I can make one for a single platform. One may think of it in the following way: I made an optimized solution for a single platform just because time was limited; if it were a real-life problem, and I had more time, and speed mattered, then I would make solutions for many platforms, each as optimized as the one I made in the contest for a single platform. So the task is to determine how good a solution I can make, not how many solutions I can make.

Valued Contributor I

Quoting - fraggy

But most of this software can be parallelized pretty easily: an antivirus may load multiple files in memory, one file per core; a video compression algorithm may work on different parts of the video, if you can cut the video into enough pieces, etc. You can do the same for a spell/grammar corrector, render software, etc., but not for all of them.

Some software is "not so easy to parallelize": you really need to multithread the video decompression algorithm itself, since there is no point in decompressing different parts of the video at the same time... The same problem arises with a speech recognizer or a physics engine...
Parallelizing that kind of "not so easy to parallelize" software is more interesting than "classic" algorithms...


Yeah, many algorithms are trivially parallelizable, thus not interesting.
Many algorithms are hard to parallelize but have known published parallelization techniques; not interesting either - the task is just to find the right paper.
And many algorithms are hard to parallelize and don't have known published parallelization techniques, and such techniques would be a publishable result or even a PhD thesis. Not interesting either, because few or zero participants will figure out how to do it...


Valued Contributor I

Quoting - fraggy
And the only judging criterion must be performance scaling on future many-core.

That's an interesting judging criterion!
Assume we have a NUMA system with 4 processors x 6 cores x 2 HT threads - 48 hardware threads total.
First, the program is executed with SetProcessAffinityMask(GetCurrentProcess(), 1) - the baseline case.
Then, the program is executed with SetProcessAffinityMask(GetCurrentProcess(), 1 | 2) - how much speedup can you get from HT?
Then, the program is executed on 6 cores of 1 processor - how much speedup can you get from 6 full-fledged cores?
Then, the program is executed on 2 cores, but on 2 different processors - can you take advantage of NUMA?
And so on, up to all 48 hardware threads.

The task is to show good performance in the baseline case, and then linear or super-linear speedup in as many cases as possible. The idea is to determine whether a solution is really ready for the heterogeneous, distributed, hierarchical many-core future, or is just a silly solution that can take advantage of only a very limited number of cores. If a solution shows 1.99x speedup on 2 cores, then 3.5x on 4 cores, and 6.0x on 8 cores, that suboptimal scaling means there are some serious problems, like saturation of the cache-coherence interconnect or the memory interconnect, or excessive blocking/contention; and this usually grows with the number of cores and the introduction of NUMA-ness. So on 16 cores it will be 9.0x, and on 32 cores it will be 10.0x...

I would really like to see such a judging criterion on at least one appropriate problem. What do you think?

Beginner

Quoting - Dmitriy Vyukov
Yeah, many algorithms are trivially parallelizable, thus not interesting.
Many algorithms are hard to parallelize but have known published parallelization techniques; not interesting either - the task is just to find the right paper.
And many algorithms are hard to parallelize and don't have known published parallelization techniques, and such techniques would be a publishable result or even a PhD thesis. Not interesting either, because few or zero participants will figure out how to do it...
You're right :)
The contest should cover any algorithms (trivially parallelizable or not, with a known solution or not). But still, the reward for an algorithm with no known solution might be a little bit more interesting :)

Vincent. Maybe something like the Loebner Prize?
Valued Contributor I

I think that execution speed must be preferred over a consistent, well-structured solution that uses the "right" libraries. That's the approach taken by the AMD Multicore Threadfest, and I think it's right for this kind of contest. They allowed use of the AMD APL library, but that was neither obligatory nor affected the score. The judging criteria were very simple and formal. The problem for round 4 of the AMD Multicore Threadfest was an image filter, and the judging criterion was: SCORE = C1 / (execution_time + C2 * relative_error). In my opinion it's a very good judging criterion.
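To make the formula concrete, here is a tiny sketch of that scoring rule; C1 and C2 below are placeholder values for illustration, not the actual constants the contest used:

```cpp
// SCORE = C1 / (execution_time + C2 * relative_error), as described above.
// Higher is better: raw speed dominates, but accuracy errors are penalized.
double score(double execution_time, double relative_error,
             double c1 = 1000.0, double c2 = 100.0) {
    return c1 / (execution_time + c2 * relative_error);
}
```

With these placeholder constants, an entry finishing in 2.0 s with 1% relative error scores 1000 / (2.0 + 1.0) ≈ 333, beating an exact entry that takes 4.0 s (score 250); the constants tune how much accuracy is worth relative to speed.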

In the AMD Multicore Threadfest they precisely described the machine configuration that would be used for the competition: the number of processors, processor models, amount of cache, amount of memory, OS version, and compiler version. IMO, that's very right too. Although, in general, solutions should be general enough to run smoothly on any platform, given the limited amount of time competitors are simply unable to develop such general solutions, so it's OK to target one particular platform.

In the AMD Multicore Threadfest the only OS allowed was Linux. But I think that many programmers (including me) who feel more comfortable with the Windows platform and toolchain were very inconvenienced by that. I spent a huge amount of very limited time struggling with the OS API, the language dialect, and especially with asm and intrinsics. So I think that both Linux and Windows should be allowed.
Beginner

Quoting - Dmitriy Vyukov
Assume we have a NUMA system with 4 processors x 6 cores x 2 HT threads - 48 hardware threads total.

The task is to show good performance in the baseline case, and then linear or super-linear speedup in as many cases as possible. The idea is to determine whether a solution is really ready for the heterogeneous, distributed, hierarchical many-core future, or is just a silly solution that can take advantage of only a very limited number of cores. If a solution shows 1.99x speedup on 2 cores, then 3.5x on 4 cores, and 6.0x on 8 cores, that suboptimal scaling means there are some serious problems, like saturation of the cache-coherence interconnect or the memory interconnect, or excessive blocking/contention; and this usually grows with the number of cores and the introduction of NUMA-ness. So on 16 cores it will be 9.0x, and on 32 cores it will be 10.0x...
The judging criterion should be the "best" performance scaling only, even if it's suboptimal scaling performance :p.

Vincent, running a test on 48 hardware threads is a dream!!!
Valued Contributor I

As for non-traditional programming languages, well, I think that would limit the number of participants in such contests significantly. I will not be installing and learning a new language and toolchain for the competition. And even if I did, I would have no chance against participants already experienced in that language. Although I think it's good to provide a broad choice of languages for the competition - programmers always tend to make language shootouts - C/C++ programmers can use asm while Haskell programmers can use STM. It's interesting and fun.

As for generating content for the ISN, I think the good answer is to publish a description and the code of the winning solution (and maybe of some other interesting solutions). Descriptions can be provided by the participants or by the organizers.

As for tasks for competitions, I may describe my experience with the AMD Multicore Threadfest. The round 4 problem was an image filter. Although it was a very interesting competition in itself, it was in no way a multicore/threading competition. It was an advanced math competition (and it was formally classified on TopCoder that way); it required knowledge and application of such things as matrix decomposition and FFT-based convolution (participants without such advanced math were really out of the deal). And the threading part in many solutions was as simple as a single '#pragma omp parallel for' on the outer-most loop (and it was perfectly enough!). As for me, it's just ridiculous to call it a threading fest; it's a 100% plain old single-threaded problem.
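For reference, the "single '#pragma omp parallel for' on the outer-most loop" pattern being criticized looks like this; the per-pixel gain filter is an invented stand-in for the real round-4 filter:

```cpp
#include <vector>

// Embarrassingly parallel per-pixel work: one pragma on the outer loop is
// the entire threading effort. Compiles and runs serially without -fopenmp.
void apply_gain(std::vector<float>& pixels, int width, int height, float gain) {
    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            pixels[y * width + x] *= gain;
}
```

Every row is independent, so there is nothing to synchronize; this is why Dmitriy argues such problems test math skill, not threading skill.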

Valued Contributor I

I would like to see as tasks something like the following.

A parallel concurrent garbage collector (parallel here means that garbage collection works in several threads, and concurrent means that garbage collection runs concurrently with the user program, i.e. no stop-the-world). The problem statement must define a set of interfaces between the run-time and the garbage collector; some interfaces must be defined by the participant - for example, the object header layout; and the task is to implement the GC according to these interfaces. A test suite will stress the GC under different kinds of load.

A tricky concurrent data structure. The problem defines a set of use cases (read/write ratio, object size, object properties, and so on) and the interface to the data structure. The task is to implement that data structure.

A run-time for an asynchronous message-passing library (Erlang-like). The problem defines a set of interfaces between the library and user code and a set of test use cases. The task is to implement the run-time.

A banking server. The server holds a number of accounts, accounts are organized into a tree hierarchy (accounts/subaccounts), and a balance is connected to each account. The server handles a number of requests: adjustments to an account, transfers between accounts, output of an account's balance, aggregations over several accounts, and creation of new accounts. Accounts can be in several statuses (open, preclosed, closed); the status defines the set of allowed operations. Each request and each result must be logged to some common journal. Also there are some global settings, which can be changed at run-time and define some aspects of request processing. The task is to implement the server run-time: the request handlers, the logging subsystem, and the settings change handler.

Such tasks advisedly require frequent synchronization between threads (like some real-life problems), and not just a single 'omp parallel for' on the outer-most loop as the threading part. They also let you feel what 'my NUMA node' and 'a remote NUMA node' are, what a shared L2$ is, and what an rw-lock on a highly contended data structure is.
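As a hedged illustration of the banking-server task's starting point, here is a deliberately naive thread-safe account table behind one global mutex; the contest's real challenge would be replacing this coarse lock with finer-grained synchronization (per-account locks, lock-free reads, and so on) without breaking the invariants. The interface is invented for illustration, not part of the proposed problem statement:

```cpp
#include <mutex>
#include <unordered_map>

// Naive baseline: one global lock serializes every request. Correct, but
// it is exactly the highly contended bottleneck the tasks above target.
class AccountTable {
    std::mutex m_;
    std::unordered_map<int, long> balance_;
public:
    void deposit(int acct, long amount) {
        std::lock_guard<std::mutex> g(m_);
        balance_[acct] += amount;
    }
    bool transfer(int from, int to, long amount) {
        std::lock_guard<std::mutex> g(m_);
        if (balance_[from] < amount) return false;  // insufficient funds
        balance_[from] -= amount;
        balance_[to]   += amount;
        return true;
    }
    long balance(int acct) {
        std::lock_guard<std::mutex> g(m_);
        return balance_[acct];
    }
};
```

Scoring entries by throughput under a mixed deposit/transfer/query load, across rising core counts, would measure exactly the frequent-synchronization skill these tasks are meant to exercise.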
Valued Contributor I

Quoting - fraggy
The contest should cover any algorithms (trivially parallelizable or not, with a known solution or not). But still, the reward for an algorithm with no known solution might be a little bit more interesting :)


What's the point of a trivially parallelizable problem in a threading contest?

Beginner

Quoting - Dmitriy Vyukov
What's the point of a trivially parallelizable problem in a threading contest?
:)
A contest must be an opportunity for everyone. Some programmers don't have much experience in threading; a multithreaded quicksort could be interesting, for instance.
But I agree with you: if it's just a matter of an OpenMP pragma to be used, there is no point...

From the Intel point of view, the more participants, the better :)
Vincent

Quoting - Dmitriy Vyukov

I would prefer a single defined platform, because I will not have enough time to make optimized solutions for many platforms, though I can make one for a single platform. One may think of it in the following way: I made an optimized solution for a single platform just because time was limited; if it were a real-life problem, and I had more time, and speed mattered, then I would make solutions for many platforms, each as optimized as the one I made in the contest for a single platform. So the task is to determine how good a solution I can make, not how many solutions I can make.


In all fairness, the winners (awards) could be split into two or more categories:

Single socket
Multi-Socket
Multi-Socket NUMA

If you were to assign an overall winner, then some sort of weights would have to be applied to the configurations.


Jim Dempsey


Quoting - Dmitriy Vyukov
Quoting - fraggy
And the only judging criterion must be performance scaling on future many-core.

That's an interesting judging criterion!
Assume we have a NUMA system with 4 processors x 6 cores x 2 HT threads - 48 hardware threads total.
First, the program is executed with SetProcessAffinityMask(GetCurrentProcess(), 1) - the baseline case.
Then, the program is executed with SetProcessAffinityMask(GetCurrentProcess(), 1 | 2) - how much speedup can you get from HT?
Then, the program is executed on 6 cores of 1 processor - how much speedup can you get from 6 full-fledged cores?
Then, the program is executed on 2 cores, but on 2 different processors - can you take advantage of NUMA?
And so on, up to all 48 hardware threads.

The task is to show good performance in the baseline case, and then linear or super-linear speedup in as many cases as possible. The idea is to determine whether a solution is really ready for the heterogeneous, distributed, hierarchical many-core future, or is just a silly solution that can take advantage of only a very limited number of cores. If a solution shows 1.99x speedup on 2 cores, then 3.5x on 4 cores, and 6.0x on 8 cores, that suboptimal scaling means there are some serious problems, like saturation of the cache-coherence interconnect or the memory interconnect, or excessive blocking/contention; and this usually grows with the number of cores and the introduction of NUMA-ness. So on 16 cores it will be 9.0x, and on 32 cores it will be 10.0x...

I would really like to see such a judging criterion on at least one appropriate problem. What do you think?


The above requires that core/thread management be performed externally to the application, and that it not be part of the optimizations permitted by the programmer.

Jim

Quoting - fraggy
The judging criterion should be the "best" performance scaling only, even if it's suboptimal scaling performance :p.

Vincent, running a test on 48 hardware threads is a dream!!!

Vincent, I disagree. One entrant's single-threaded program might run 4x faster than another entrant's 4-threaded program. The first entrant's program might scale poorly while the other might scale better. The criterion should not be the slope of the curve but the area under the curve.

Jim Dempsey
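Jim's area-under-the-curve criterion can be sketched with a trapezoid-rule integral over throughput (relative to a common reference baseline) measured at each tested core count; the sample numbers in the note below are invented:

```cpp
#include <vector>

// Integrate throughput over core counts (trapezoid rule). This rewards
// both a fast baseline (curve height) and good scaling (curve growth),
// rather than scoring the slope alone.
double area_under_curve(const std::vector<double>& cores,
                        const std::vector<double>& throughput) {
    double area = 0.0;
    for (std::size_t i = 1; i < cores.size(); ++i)
        area += 0.5 * (throughput[i] + throughput[i - 1])
                    * (cores[i] - cores[i - 1]);
    return area;
}
```

For example, an entry that starts at 4x the reference but scales weakly ({4.0, 4.4, 4.8} at 1/2/4 cores, area 13.4) still beats a 1x entry with perfect linear scaling ({1.0, 2.0, 4.0}, area 7.5), matching Jim's point that slope alone should not decide.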


Dmitriy,

>>Parallel concurrent garbage collector

This implies your memory allocate/deallocate generates (much) garbage. The test program should be such that it adversely stresses memory allocation/deallocation, such that alternate means may be introduced to reduce the interaction. Often these alternate means generate garbage which periodically needs collection. The contest rules should be for the performance of allocation/deallocation (best case, average case, worst case), regardless of whether garbage collection is performed concurrently with each allocation/deallocation or at some later time.

Your other suggestions are good. These fall into the class of distributing a base-level program with input files and output files, and a set of guidelines as to whether the output file has to be exactly the same or merely has to pass an approximately-the-same check when run through a verification program (also supplied). The user can do whatever they want with the sample program, but they must use the supplied input data as-is.

Jim Dempsey
Valued Contributor I

Quoting - fraggy
The judging criterion should be the "best" performance scaling only, even if it's suboptimal scaling performance :p.


Assume my solution is:
on single-core - 1x
on dual-core - 1.9x
And your solution is:
on single-core - 0.9x
on dual-core - 1.8x
Which is the best performance?

Now assume that we add quad-core into the tests.
My solution is:
on quad-core - 3.5x
Your solution is:
on quad-core - 3.6x
Which is the best performance now?

In the single-core world, performance was a single value - a scalar. For example, memory allocation/free takes 20 cycles; that's enough to characterize the performance of the allocator.
But in the multi-core world that approach is totally busted. Performance is not a scalar any more; performance is a function of the number of cores, or more precisely of the system configuration: performance = f(system_configuration). OK, malloc/free takes 20 cycles, but what happens if I start calling malloc/free from 2/4/8 cores? Will it be 20/20/20 cycles, or 200/500/6000 cycles? Will it supply me with free lunches?
Is performance perfectly linear, y=x? Or is it y=0.8x? Or y=sqrt(x)? Or y=log(x)? Or y=x for x=1..4 and y=4-x for x>4? It's extremely critical to capture the whole y=f(x), and not some single y0, in order to evaluate a component.

But those are just two different competitions - "squeeze the max from today's multi-core" vs. "are you ready for the many-core future?" (or "scalability at any price!"). Both are useful and interesting.
