Hi,
We have a small cluster running OSCAR on CentOS 5.5, with Torque and the Intel Cluster Toolkit (ICT). Both are configured, and jobs currently run without problems. However, the "elapsed time" that Torque displays with "qstat -a" is always 0. :'(
If we switch to Open MPI, the elapsed time of the running jobs is updated correctly.
Is this a known issue? Is there a solution?
Best regards,
Guillaume
3 Replies
It's a known issue and it comes up from time to time (e.g., elsewhere on this forum, http://software.intel.com/en-us/forums/showthread.php?t=76537 ). The issue is that Intel MPI isn't yet tightly integrated with Torque, so information like CPU time doesn't get propagated back: Torque doesn't know which processes running on the node belong to the job. Open MPI, on the other hand, can be compiled with explicit Torque support (but if you don't build it that way, you'll see the same issues).
Issues like elapsed CPU time are a nuisance, but this lack of integration can cause bigger problems if jobs fail: their processes won't be cleaned up properly when the job ends. Suspend/resume becomes impossible, too.
Rumour has it that the next version of Intel MPI, due out for SC10 in November, will have better Torque integration. Until then, using OSC's mpiexec launcher ( http://www.osc.edu/~djohnson/mpiexec/index.php ) instead of the launchers that come with Intel MPI is supposed to work.
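For what it's worth, a job script using that launcher might look roughly like the one below. This is only a sketch, not something taken from the posts in this thread: the install path /opt/mpiexec/bin/mpiexec, the resource request, the job name and the binary ./my_mpi_app are all placeholders, and the --comm=pmi option is my assumption about what an MPICH2-derived library such as Intel MPI needs, so check it against the mpiexec man page on your system.

```shell
#!/bin/sh
#PBS -N impi_test
#PBS -l nodes=2:ppn=4
#PBS -l walltime=01:00:00

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR" || exit 1

# Hypothetical install location of the OSC mpiexec; adjust for your site.
MPIEXEC=/opt/mpiexec/bin/mpiexec

# The OSC launcher starts the ranks through Torque's TM interface, so the
# processes run under pbs_mom and can be accounted for and cleaned up.
# --comm=pmi is an assumption for an MPICH2-style (PMI) library.
"$MPIEXEC" --comm=pmi ./my_mpi_app
```

Submitted with qsub, a job launched this way should show a non-zero elapsed time in "qstat -a" once it has been running for a while, if the integration works as described.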
OK, thanks for the reply. I will wait a little while, then.
Have a nice day
Hi,
The original problem occurred with this "initial" Torque configuration:
# config of TORQUE:
create queue batch
set queue batch queue_type = Execution
set queue batch resources_max.cput = 168:00:00
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = 1.
set server acl_roots = root@*
set server managers = root@*.
set server managers += sysgen@*.
set server operators = root@*.
set server operators += sysgen@*.
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.1.10
set server allow_node_submit = True
I replaced that Torque configuration with the following one, and the problem disappeared.
#
# Create queues and set their attributes.
#
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long Priority = 50
set queue long resources_max.walltime = 72:00:00
set queue long max_user_run = 10
set queue long enabled = True
set queue long started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default Priority = 50
set queue default max_running = 48
set queue default route_destinations = small
set queue default route_destinations += long
set queue default enabled = True
set queue default started = True
#
# Create and define queue small
#
create queue small
set queue small queue_type = Execution
set queue small Priority = 100
set queue small resources_max.walltime = 02:00:00
set queue small max_user_run = 10
set queue small enabled = True
set queue small started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = 1.
set server acl_roots = root@*
set server managers = root@*.
set server managers += sysgen@*.
set server operators = root@*.
set server operators += sysgen@*.
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.1.10
set server allow_node_submit = True
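A minimal sketch of how a configuration like this could be loaded, assuming the lines above are saved in a file. The file names torque-config.backup and new-torque-config.txt are placeholders, not names from this thread, and the commands must be run as a Torque manager (e.g. root) on the server node. qmgr may reject some lines when fed a full dump, for example read-only attributes such as pbs_version or a "create queue" for a queue that already exists.

```shell
# Keep a copy of the current settings before changing anything.
qmgr -c 'print server' > torque-config.backup

# Feed the new queue and server settings to qmgr (hypothetical file name).
qmgr < new-torque-config.txt

# Check what the server now holds, then look at the elapsed time of running jobs.
qmgr -c 'print server'
qstat -a
```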
Best regards
