kmvinoth
Beginner
95 Views

MPI with 64bit compiler

Hi, I am writing MPI code with the 64-bit Intel compiler on a Linux cluster with 64-bit processors. In my code each process runs a loop from zero to 2^64-1, and at the end the partial results from all processes are added together to give the final result. I didn't get any warning or error message during compilation, but when I run the program my job gets killed on the cluster regardless of the time limit I give in the script (I tried from 2 to 240 hours). The only thing I have found is that the application rank reported as killed in the error message depends on the time I give in the script: sometimes it says rank 27, and with a different time it says rank 0 was killed, and so on. I don't know what's wrong with my code. I have given the code and the error message below. I hope someone can help me out. Thanks in advance.
6 Replies
jimdempseyatthecove
Black Belt
95 Views


Vinoth,


When you uncomment

//printf("n = %.30Le\n",n);

What does n print as?

Also, on some systems (long double) is equivalent to (double).

Insert some ASSERTs so that in debug builds your code will expose problems should the compile options generate doubles for long doubles.

Jim Dempsey
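A minimal sketch of the kind of assert suggested above, assuming the goal is to hold every integer up to 2^64-1 exactly, which needs a 64-bit mantissa. The function names `long_double_holds_u64` and `check_long_double` are hypothetical and not taken from the original code; `LDBL_MANT_DIG` is the standard macro from `<float.h>`.

```c
#include <assert.h>
#include <float.h>

/* Returns 1 if long double has at least 64 mantissa bits, so that every
   integer up to 2^64-1 is exactly representable; returns 0 otherwise. */
int long_double_holds_u64(void)
{
    return LDBL_MANT_DIG >= 64;
}

/* Call once at startup. In a debug build (compiled without -DNDEBUG) the
   program aborts here if long double has been mapped to plain double. */
void check_long_double(void)
{
    assert(sizeof(long double) > sizeof(double));
    assert(LDBL_MANT_DIG >= 64);
}
```

Calling `check_long_double()` at the top of `main` makes a debug build stop immediately when the compile options do not provide an extended long double, instead of silently losing the low bits of the loop bound.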
kmvinoth
Beginner
95 Views


Hi Jim

n prints as 1.844674407370955161500000000000e+19 (2^64-1). On my system double is not equivalent to long double: when I use double and print n (n = 1.844674407370955161500000000000e+19) and then define a new variable m where m = n-1, I get m = 1.844674407370955161500000000000e+19, which is the same as n. This problem doesn't happen when I use long double; in that case m prints as 1.844674407370955161400000000000e+19. I have also inserted the asserts and checked the values of n, m, sv and ev. I have given the sv and ev values when I use 16 processes (tp = 16 and np = 0 to 15) to run the code. I am still not able to find where the error is and why my job is getting killed.

Regards
vinoth
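The n versus n-1 comparison described above can be reproduced with a short sketch like the following. It is not the original program: the function names are hypothetical, and the long double result assumes a platform where long double carries at least 64 mantissa bits (as with the x87 extended format on x86-64 Linux).

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Returns 1 if n - 1 differs from n when n = 2^64-1 is held in a double.
   volatile forces each value to be stored at double precision. */
int double_sees_minus_one(void)
{
    volatile double n = 18446744073709551615.0;
    assert(n == 18446744073709551616.0); /* the literal was rounded up to 2^64 */
    volatile double m = n - 1.0;
    return m != n;
}

/* Same comparison in long double. */
int long_double_sees_minus_one(void)
{
    volatile long double n = 18446744073709551615.0L;
    volatile long double m = n - 1.0L;
    return m != n;
}

/* Prints n and m = n - 1 in both types, using the same format as the thread. */
void print_precision_report(void)
{
    double dn = 18446744073709551615.0;
    long double ln = 18446744073709551615.0L;
    printf("double:      n = %.30e  m = %.30e\n", dn, dn - 1.0);
    printf("long double: n = %.30Le  m = %.30Le\n", ln, ln - 1.0L);
}
```

A double has a 53-bit mantissa, so near 2^64 adjacent values are 2048 apart and subtracting 1 cannot change the result; that is why m and n compare equal in the double case.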
jimdempseyatthecove
Black Belt
95 Views


Your error message is coming from PBS. I assume this is the job scheduler for your server. The error message refers to your application exceeding a runtime quota. This sounds like an issue you should discuss with the system administrator. If you are the system administrator, then consult the PBS documentation (and that of related watchdog timer applications that run in the background).

Jim Dempsey
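A rough back-of-the-envelope sketch of why a runtime quota is likely to be hit here. It assumes the 2^64 loop trips are split evenly across the processes and that each process sustains a fixed number of trips per second; both the even split and the rate are assumptions for illustration, not measurements from the original job, and the function names are hypothetical.

```c
#include <assert.h>

/* Hours of wall-clock time needed for `iterations` loop trips at an
   assumed sustained rate of `trips_per_second`. */
double estimated_hours(double iterations, double trips_per_second)
{
    assert(trips_per_second > 0.0);
    return iterations / trips_per_second / 3600.0;
}

/* Loop trips per process when 2^64 trips are split evenly over tp processes. */
double trips_per_process(int tp)
{
    const double total = 18446744073709551616.0; /* 2^64, exact in double */
    assert(tp > 0);
    return total / (double)tp;
}
```

With 16 processes and an assumed 1e9 trips per second, `estimated_hours(trips_per_process(16), 1e9)` comes to roughly 320,000 hours, which is far beyond the 240-hour limit tried in the job script.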
kmvinoth
Beginner
95 Views


OK Jim. I will check this with my system administrator, and if the problem persists I will get back to you. It would help if you could explain a little about the watchdog timer application, which I am hearing about for the first time.

Regards
vinoth
jimdempseyatthecove
Black Belt
95 Views


In a shared system (e.g. your server) the operator of the server (system administrator) must make sure that the system remains shared. Unless otherwise instructed by policy, no one user can commandeer the system forever.

It is not unusual for programs to enter a condition that causes an infinite loop (e.g. a problem in convergence due either to data or to a programming oversight/error). Most "convergence" type programs should have a bail-out condition where they detect that convergence is impossible. Unfortunately, the system administrator cannot assume anything about the applications that run on the system; they can only guess. Therefore, as a guess, the system administrator will declare that, unless told otherwise, any program running longer than xxx will be deemed hung and will be terminated.
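As an illustration of such a bail-out condition, here is a minimal sketch using Newton's iteration for a square root as a stand-in for any convergence loop. The function name `newton_sqrt` and its parameters are hypothetical; the point is only the iteration cap and the distinct failure return value.

```c
#include <assert.h>

/* Newton's iteration for sqrt(a), a > 0. Stops when two successive
   estimates differ by at most tol, or gives up after max_iter steps.
   Returns the number of steps used, or -1 if the bail-out was hit.
   The last estimate is written to *result in both cases. */
int newton_sqrt(double a, double tol, int max_iter, double *result)
{
    assert(max_iter > 0);
    double x = a > 1.0 ? a : 1.0; /* starting guess at or above the root */
    for (int i = 1; i <= max_iter; ++i) {
        double next = 0.5 * (x + a / x);
        double diff = next > x ? next - x : x - next;
        x = next;
        if (diff <= tol) { /* converged */
            *result = x;
            return i;
        }
    }
    *result = x; /* bail-out: report failure instead of looping forever */
    return -1;
}
```

A caller checks for the -1 return and exits with an error message, so a run that cannot converge ends on its own rather than waiting to be terminated by the scheduler.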

Think of this as a parking limit for your car. You get 2 hours; if you take longer you get towed (with exceptions for emergency vehicles). Your job now is to argue the case with your system administrator that you deserve special treatment. (The other users of the system may object to you bogging down the system.)

Jim Dempsey
TimP
Black Belt
95 Views

It's normal practice to set time limits on jobs submitted in default queues, possibly with a shorter default time when you don't specify your time limit, and hold longer jobs for off-peak times.