Hi, I am doing MPI coding with the 64-bit Intel compiler on a Linux cluster with 64-bit processors. In my code each process runs a loop from zero to 2^64-1. At the end, the results from all processes are added together to give me the final result. I didn't get any warning or error message during compilation, but when I run the program my job gets killed on the cluster irrespective of the time I give (I tried from 2 to 240 hours in the script). The only thing I have found is that the application rank reported as killed in the error message depends on the time I give in the script: sometimes it says rank 27, and if I give another time it says rank 0 was killed, and so on. I don't know what's wrong with my code. I have given the code and the error message below. I hope someone can help me out. Thanks in advance.
Vinoth,
When you uncomment
//printf("n = %.30Le\n",n);
What does n print as?
Also, on some systems (long double) is equivalent to (double).
Insert some ASSERTs so that in debug builds your code will expose problems should the compile options generate doubles for long doubles.
Jim Dempsey
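A minimal sketch of the kind of assert suggested above, assuming the code needs long double to hold every integer up to 2^64-1 exactly. The helper names `long_double_resolves_u64_max` and `check_long_double` are hypothetical and not taken from the original program.

```c
#include <assert.h>
#include <float.h>

/* Returns 1 if long double can tell 2^64-1 apart from 2^64-2, else 0.
   volatile forces each value to be stored and reloaded as a long double. */
int long_double_resolves_u64_max(void)
{
    volatile long double n = 18446744073709551615.0L; /* 2^64 - 1 */
    volatile long double m = n - 1.0L;
    return m != n;
}

/* Hypothetical helper: call once at startup in a debug build. */
void check_long_double(void)
{
    /* A 64-bit significand is the minimum that holds every integer up to 2^64-1. */
    assert(LDBL_MANT_DIG >= 64);
    assert(long_double_resolves_u64_max());
}
```

If the compile options map long double onto double, `LDBL_MANT_DIG` would typically be 53 and the first assert fires in a debug build. Building with `NDEBUG` defined turns both asserts into no-ops.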
Hi Jim
n prints as 1.844674407370955161500000000000e+19 (2^64-1). On my system double is not equivalent to long double. When I use double and print n (n = 1.844674407370955161500000000000e+19), then define a new variable m where m = n-1, I get m = 1.844674407370955161500000000000e+19, which is the same as n. This problem doesn't happen when I use long double; in that case m prints as 1.844674407370955161400000000000e+19. I have also inserted the asserts and checked the values of n, m, sv and ev. I have given the sv and ev values I get when I use 16 processes (tp = 16 and np = 0 to 15) to run the code. I am still not able to find where the error is and why my job is getting killed.
Regards
vinoth
[cpp][/cpp]
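The original code that computes sv and ev is not shown in this thread, so the following is only a sketch of one way to split the range 0 to 2^64-1 across tp processes using exact 64-bit unsigned integer arithmetic instead of long double. The names `rank_range` and `split_range` are hypothetical. It assumes tp >= 1 and np < tp.

```c
#include <stdint.h>

/* Inclusive range [sv, ev] of loop indices owned by one rank. */
typedef struct {
    uint64_t sv;
    uint64_t ev;
} rank_range;

/* Split the 2^64 values 0..UINT64_MAX across tp ranks (tp >= 1, np < tp). */
rank_range split_range(uint64_t tp, uint64_t np)
{
    rank_range r = { 0, UINT64_MAX };
    if (tp == 1)
        return r;                       /* a single rank owns everything */

    /* 2^64 itself does not fit in uint64_t, so derive the quotient and
       remainder of 2^64 / tp from 2^64 - 1. */
    uint64_t q   = UINT64_MAX / tp;
    uint64_t rem = UINT64_MAX % tp + 1;
    if (rem == tp) {                    /* tp divides 2^64 exactly */
        q += 1;
        rem = 0;
    }

    /* The first rem ranks take one extra value each. */
    uint64_t extra_before = (np < rem) ? np : rem;
    r.sv = np * q + extra_before;
    r.ev = r.sv + (q - 1) + ((np < rem) ? 1 : 0);
    return r;
}
```

With tp = 16 this gives rank 0 the range 0 to 2^60-1 and rank 15 the range 15*2^60 to 2^64-1, with no rounding at either end.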
Your error message is coming from PBS. I assume this is the job scheduler for your server. The error message refers to your application exceeding a runtime quota. This sounds like an issue you should discuss with the system administrator. If you are the system administrator, then consult the PBS documentation (and the documentation for any related watchdog timer applications that run in the background).
Jim Dempsey
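To put the runtime quota in perspective, here is a rough back-of-the-envelope sketch of how long 2^64 loop iterations would take when shared evenly across tp ranks. The function name `estimated_hours` is hypothetical, and the per-rank rate passed in is an assumption rather than a measurement of the actual program.

```c
/* Estimated wall-clock hours for tp ranks to share 2^64 loop iterations.
   iters_per_sec is an assumed per-rank rate, not a measured one. */
double estimated_hours(unsigned tp, double iters_per_sec)
{
    const double total = 18446744073709551616.0; /* 2^64, exact as a double */
    return total / (double)tp / iters_per_sec / 3600.0;
}
```

Assuming 16 ranks at one billion iterations per second each, `estimated_hours(16, 1e9)` comes to roughly 320,000 hours, which is about 36 years. Under that assumption even a 240-hour limit would expire long before the loop finishes.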
Regards
vinoth
In a shared system (e.g. your server) the operator of the server (system administrator) must make sure that the system remains shared. Unless otherwise instructed by policy, no one user can commandeer the system forever.
It is not unusual for programs to enter a condition that causes an infinite loop (e.g. a problem in convergence due either to data or to a programming oversight/error). Most "convergence" type programs should have a bail-out condition that detects when convergence is impossible. Unfortunately, the system administrator cannot assume anything about the applications that run on the system; they can only guess. Therefore, as a guess, the system administrator will declare that, unless told otherwise, any program running longer than xxx will be deemed hung and will be terminated.
Think of this as a parking limit for your car. You get 2 hours; if you take longer, you get towed (with exceptions for emergency vehicles). Your job now is to argue the case with your system administrator that you deserve special treatment. (The other users of the system may object to you bogging down the system.)
Jim Dempsey
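As an illustration of the bail-out condition described above, here is a minimal sketch of a fixed-point iteration that gives up after a maximum number of steps instead of looping forever. The function `fixed_point` and the two example maps are hypothetical and are not related to the original poster's code.

```c
/* Iterate x <- f(x) until two successive values differ by less than tol.
   Returns 1 on convergence, 0 if max_iter is reached first (the bail-out). */
int fixed_point(double (*f)(double), double x0, double tol,
                long max_iter, double *result)
{
    double x = x0;
    for (long i = 0; i < max_iter; i++) {
        double next = f(x);
        double diff = next > x ? next - x : x - next;
        x = next;
        if (diff < tol) {
            *result = x;
            return 1;
        }
    }
    *result = x;    /* last iterate, kept for diagnostics */
    return 0;
}

/* Two hypothetical maps for illustration. */
double contracting(double x) { return 0.5 * x + 1.0; }  /* fixed point at 2 */
double diverging(double x)   { return x + 1.0; }        /* never settles */
```

Calling `fixed_point(contracting, 0.0, 1e-12, 1000, &r)` returns 1 with `r` close to 2, while the same call with `diverging` returns 0 once the iteration cap is hit, so the caller can report the failure and exit cleanly rather than wait for the scheduler to kill the job.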
It's normal practice to set time limits on jobs submitted in default queues, possibly with a shorter default time when you don't specify your time limit, and hold longer jobs for off-peak times.