Re: Dynamic scaling of application..possible?

kv_ishl · ‎05-29-2009

Hello all,
I will ask my question with a real world scenario. I have a 16-node Quad-core Xeon machine. I have two applications:
- app1 (runs in 4 hours)
- app2 (runs in 10 hours)

Both of them are multithreaded and MPI-enabled, and I run them simultaneously. I specify 8 threads for each application through MPI. Since app1 finishes in 4 hours, for the remaining 6 hours while app2 is still running, 8 of my nodes are simply lying idle.

How can I make use of this idle time for app2?

Thanks,
Blue

jimdempseyatthecove · ‎05-29-2009

A simple method might be to set the runtime priority of app2 lower than for app1.

A nearly as simple method would be to add to app2 a means of detecting whether app1 is running (the existence or absence of a file would be sufficient, or use a shared mailbox file). This detection can be placed in an outer loop and/or performed at some interval (e.g. once per minute). When app2 sees that app1 is running, it sets its own runtime priority lower than that of app1; when app2 sees that app1 has completed, it resets itself to normal priority.
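The marker-file approach described above could look roughly like the following. This is only a sketch under assumptions that are not in the post: the file name app1.running, the helper names, and the nice value of 10 are all made up for illustration, and app1 is assumed to create the file at startup and delete it on exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/resource.h>
#endif

/* Hypothetical convention: app1 creates this file at startup and deletes it on exit. */
#define APP1_MARKER "app1.running"

typedef enum { APP_PRIO_BELOW_NORMAL, APP_PRIO_NORMAL } app_priority;

/* Returns true if the marker file can be opened, i.e. app1 is assumed to be running. */
bool marker_exists(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return false;
    fclose(f);
    return true;
}

/* Priority app2 should run at, given whether app1 is active. */
app_priority desired_priority(bool app1_running)
{
    return app1_running ? APP_PRIO_BELOW_NORMAL : APP_PRIO_NORMAL;
}

/* Apply the level to the current process. Returns 0 on success. */
int apply_priority(app_priority level)
{
#ifdef _WIN32
    DWORD cls = (level == APP_PRIO_NORMAL) ? NORMAL_PRIORITY_CLASS
                                           : BELOW_NORMAL_PRIORITY_CLASS;
    return SetPriorityClass(GetCurrentProcess(), cls) ? 0 : -1;
#else
    /* Nice value 10 is an arbitrary "below normal" choice. On many systems
       going back to 0 needs extra privileges, so check the return value. */
    return setpriority(PRIO_PROCESS, 0, level == APP_PRIO_NORMAL ? 0 : 10);
#endif
}

/* Call from app2's outer loop, or about once per minute.
   Only touches the process priority when the wanted level changes. */
int poll_app1_and_adjust(const char *marker, app_priority *current)
{
    app_priority wanted = desired_priority(marker_exists(marker));
    if (wanted == *current)
        return 0;
    int rc = apply_priority(wanted);
    if (rc == 0)
        *current = wanted;
    return rc;
}
```

In app2 you would keep one app_priority variable initialised to APP_PRIO_NORMAL and call poll_app1_and_adjust(APP1_MARKER, &current) at the chosen interval. Note that a stale marker file left behind by a crashed app1 would keep app2 at low priority, so a real setup would need some cleanup rule.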

If you need progress on all the threads in app2, then keep the detection of whether app1 is running but, instead of changing priority, insert the following in app2 at an appropriate loop level:

if(App1Running()) SwitchToThread();

You might want to generalize the test. Provide a means of specifying the priority of App(n) relative to other apps (e.g. file, database, registry entry, etc.), then a means for each app to (periodically) detect which other apps are running and compare their priorities with its own. If any are higher, lower your priority to Below Normal. If none are higher, restore your priority to Normal.

if(HigherPriorityAppRunning()) SwitchToThread();
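One possible shape for the generalized test is sketched below. The app_entry table, its field names and the "larger number means more important" convention are assumptions made for this example; how the table gets filled (file, database, registry entry) is left to the caller. SwitchToThread is the Windows call used above, and sched_yield is used as the POSIX counterpart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sched.h>
#endif

/* Hypothetical registry entry: one per known application. */
typedef struct {
    const char *name;
    int priority;   /* assumed convention: larger value = more important */
    bool running;   /* filled in by whatever detection mechanism is used */
} app_entry;

/* True if any running app in the table outranks my_priority. */
bool higher_priority_app_running(const app_entry *apps, size_t n, int my_priority)
{
    assert(apps != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (apps[i].running && apps[i].priority > my_priority)
            return true;
    }
    return false;
}

/* Give up the rest of the current time slice. */
void yield_cpu(void)
{
#ifdef _WIN32
    SwitchToThread();
#else
    sched_yield();
#endif
}

/* Intended to be called at an appropriate loop level in each cooperating app. */
void cooperate(const app_entry *apps, size_t n, int my_priority)
{
    if (higher_priority_app_running(apps, n, my_priority))
        yield_cpu();
}
```

With a table such as { {"app1", 2, true}, {"app2", 1, true} }, app2 (priority 1) yields while app1 is marked as running and stops yielding once app1's entry is marked as not running.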

By the way,

Do you mean 16 hardware threads composed of 4 quad-core Xeons?
(this could be assembled using 1, 2, 3 or 4 NUMA nodes)
or
Do you mean 64 hardware threads composed of 16 quad-core Xeons?
(this could be assembled using 4 to 16 NUMA nodes)

Jim Dempsey

Michael_K_Intel2 · ‎06-10-2009

There is ongoing research on making applications malleable in terms of processes and threads. AMPI and others extend the programming model of MPI with calls that allow programmers to manually trigger a change in the number of processes executing the application. That could be an option as well. It is, however, unlikely that these MPI implementations are also capable of adjusting the thread count of the application.

For OpenMP, I was working on malleability features. Alas, they are not available in any production compiler, but only in a research compiler for OpenMP/Java on clusters.

Cheers,
-michael

Alain_D_Intel · ‎06-11-2009

Quoting - Michael Klemm, Intel

There is ongoing research on making applications malleable in terms of processes and threads. AMPI and others extend the programming model of MPI with calls that allow programmers to manually trigger a change in the number of processes executing the application. That could be an option as well. It is, however, unlikely that these MPI implementations are also capable of adjusting the thread count of the application.

For OpenMP, I was working on malleability features. Alas, they are not available in any production compiler, but only in a research compiler for OpenMP/Java on clusters.

Cheers,
-michael

Your question is ambiguous: why not run app1 and app2 one after the other, each on all 16 cores?
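To put rough numbers on the back-to-back option, here is a small estimate. It assumes runtime scales inversely with the number of processing units, corrected by a parallel efficiency factor; that scaling model and the function names are assumptions for illustration, since nothing in the thread says how well either application actually scales from 8 to 16.

```c
#include <assert.h>

/* Estimated runtime after moving from old_units to new_units.
   efficiency = 1.0 means ideal scaling; smaller values model parallel overhead. */
double scaled_hours(double hours, int old_units, int new_units, double efficiency)
{
    assert(hours >= 0.0 && old_units > 0 && new_units > 0);
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return hours * (double)old_units / (double)new_units / efficiency;
}

/* Both apps side by side on half the machine each: done when the slower one is done. */
double concurrent_makespan(double hours_a, double hours_b)
{
    return hours_a > hours_b ? hours_a : hours_b;
}

/* Both apps one after the other, each on the whole machine. */
double sequential_makespan(double hours_a, double hours_b,
                           int old_units, int new_units, double efficiency)
{
    return scaled_hours(hours_a, old_units, new_units, efficiency)
         + scaled_hours(hours_b, old_units, new_units, efficiency);
}
```

With the figures from the question (4 hours and 10 hours on 8 units each), ideal scaling gives 2 + 5 = 7 hours back to back versus 10 hours side by side. At 50% efficiency the back-to-back run would take 14 hours, so the answer depends entirely on how the applications scale.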

What is your optimization criterion: idle time, total elapsed time, average turnaround time for users, power consumption, etc.?

Each of these aspects is a different problem.

For general dynamic load balancing (meaning management during execution), I would distinguish 3 levels:

1) Machine (cluster) level: resource management policy definition and job queue processing => currently static (e.g. PBS, LSF software)
2) OS level: quite well done and usable on one node (see answer 1), but does not scale to a cluster
3) Application level: almost never done, even with fixed resource allocation (number of cores, number of nodes, etc.)

=> All 3 levels need to cooperate (a policy definition, a communication scheme and priority settings)

=> We are very far from efficient management of cluster resources

A simple way could be to have some checkpoint capability in applications, combined with restart on modified resources, and a penalty function for doing so. It should then be easier to deploy efficient dynamic workload management.
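One way to read the penalty idea is as a simple break-even test: checkpoint and restart on more resources only if the estimated time after the restart, including the cost of the checkpoint/restart itself, is less than the time left on the current resources. The sketch below assumes that interpretation, a flat penalty in hours, and the same inverse scaling model with an efficiency factor; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Estimated hours to finish if the job is checkpointed now and restarted
   on new_units instead of old_units. penalty_hours covers writing the
   checkpoint, re-queueing and reloading (a flat cost is an assumption). */
double hours_after_restart(double remaining_hours, int old_units, int new_units,
                           double efficiency, double penalty_hours)
{
    assert(remaining_hours >= 0.0 && penalty_hours >= 0.0);
    assert(old_units > 0 && new_units > 0);
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return remaining_hours * (double)old_units / (double)new_units / efficiency
         + penalty_hours;
}

/* Restart only if it is strictly faster than staying on the current resources. */
bool restart_worthwhile(double remaining_hours, int old_units, int new_units,
                        double efficiency, double penalty_hours)
{
    return hours_after_restart(remaining_hours, old_units, new_units,
                               efficiency, penalty_hours) < remaining_hours;
}
```

For the scenario in the question, app2 has 6 hours left on 8 units when app1 finishes. With ideal scaling to 16 units and a half-hour penalty the estimate is 3.5 hours, so the restart pays off; with a 3-hour penalty it breaks even and is not worth doing.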