Which technologies do you recommend for clustering and cluster management?

Deleted_U_Intel · ‎11-01-2004

This question was asked during the webcast "An Introduction to High Performance Computing: Parallel Computing Issues". Here is Tom Lehman's answer.

There are several that are available for free. The two that we play with the most within my group are OSCAR, which is available from SourceForge, and Rocks, another common cluster package from the San Diego Supercomputer Center. Both will allow you to build a cluster relatively easily. They take care of sorting out all of the communication paths between the members of your cluster, and basically I can build a Rocks cluster of, say, 256 nodes in about four hours. Of course, if you don't happen to have 256 nodes, maybe you're only doing four nodes, it'll take you about 45 minutes, max.

Along with those packages usually come management packages such as Ganglia, or another package from NCSA called CluMon. These give you an overall picture of the health of the software on your cluster. They show you what the load on any given processor is. You can see historical data as to where your load was and where it estimates your load is probably going, et cetera. Also being built into these cluster monitors are some monitors for the hardware as well, so that you can determine that you've got nodes that perhaps have fan failures and should be taken out of operation as soon as possible, or nodes that have flat-out failed because maybe they lost the power supply. So that's one form of cluster management.

Another form of cluster management is workload management. Most clusters are operated as a batch processing system, where you submit your job to the master node, as they did in days gone by, and then a queuing system puts you into the proper queue, executes your job once the necessary processors are available, and sends the results back to an appropriate place. Those packages are also part of OSCAR and Rocks. Again, they're automatically installed and you just start using them after you've put your cluster together.
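To make the batch idea concrete, here is a toy model of a queue that holds each job until enough processors are free. It is only a sketch of strict first-come, first-served behaviour; the `Job` class and `run_fifo` function are made up for illustration, and the real schedulers shipped with OSCAR and Rocks use more elaborate policies.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    procs: int    # processors requested
    runtime: int  # time units the job runs once started


def run_fifo(jobs, total_procs):
    """Return {job name: start time} for a strict first-come, first-served queue."""
    queue = deque(jobs)
    running = []        # (finish_time, procs) for jobs currently executing
    free = total_procs  # processors not in use
    clock = 0
    starts = {}
    while queue:
        job = queue[0]
        if job.procs > total_procs:
            raise ValueError(f"{job.name} requests more processors than the cluster has")
        if job.procs <= free:
            # Enough processors are available: start the job at the head of the queue.
            queue.popleft()
            starts[job.name] = clock
            running.append((clock + job.runtime, job.procs))
            free -= job.procs
        else:
            # Otherwise wait for the next running job to finish and reclaim its processors.
            running.sort()
            finish, procs = running.pop(0)
            clock = max(clock, finish)
            free += procs
    return starts


if __name__ == "__main__":
    jobs = [Job("A", 4, 10), Job("B", 2, 5), Job("C", 4, 3)]
    print(run_fifo(jobs, total_procs=4))
```

On a 4-processor cluster job B has to wait for A to finish even though it only needs two processors, which is exactly the "wait until the necessary processors are available" behaviour described above.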

Message Edited by hagabb on 11-01-2004 11:08 AM

Message Edited by hagabb on 11-01-2004 11:18 AM

laurenceliew · ‎01-03-2005

Hi

And if you are using Rocks, most of the Intel development tools (C/C++, Fortran, Intel MPI Library, Intel Cluster Math Kernel Library) are available as a Rocks "Roll". Rolls are the Rocks pre-packaged software distribution mechanism.

The Rocks Roll for Intel allows you to build your Rocks cluster with all the necessary Intel tools installed easily. (Basically just insert the Rocks Roll for Intel when prompted).

You will, however, need to get product licenses from Intel.

Cheers!
laurence
Scalable Systems

Message Edited by hagabb on 01-04-2005 06:42 AM

Message Edited by hagabb on 01-14-2005 07:23 AM

Henry_G_Intel · ‎01-20-2005

Hi Laurence,

I have a question about Scalable Rocks. Let's say I have several clusters at different geographical locations. They're different architectures (IA-32, EM64T, and Itanium) with a variable number of nodes but they're all configured with Scalable Rocks. Is it possible to use RxC and SGE to submit a job on one cluster that will run on a different cluster? For example, I have an Itanium application. I don't care where it runs. Can RxC and SGE be configured to find an available Itanium cluster to run my job if my local cluster is busy?

Thanks,

Henry

laurenceliew · ‎01-24-2005

Hi Henry

Yes, that is possible with SGE, as long as you select to run on the same binary platform and link in all the resources.

SGE allows you to build out campus/enterprise wide grids... not a problem.
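As a rough sketch of what "select to run on the same binary platform" can look like at submission time: SGE's `qsub` accepts resource requests with `-l` and a parallel environment with `-pe`. The helper below just assembles such a command line. The function name, the architecture string `lx24-ia64`, the parallel environment name `mpi`, and the script name are assumptions for illustration; the values valid on your grid depend on the SGE version and site configuration (`qhost` lists the architecture of each host).

```python
def build_qsub_command(script, arch, pe=None, slots=None):
    """Assemble a qsub argument list that pins the job to one binary platform.

    script: path to the job script (hypothetical name in the example below)
    arch:   SGE architecture string to request; the exact value is site-specific
    pe:     optional parallel environment name; also site-specific
    slots:  number of slots to request in that parallel environment
    """
    cmd = ["qsub", "-l", f"arch={arch}"]
    if pe is not None:
        if slots is None or slots < 1:
            raise ValueError("a parallel environment needs a positive slot count")
        cmd += ["-pe", pe, str(slots)]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Assumed values: "lx24-ia64" for Linux on Itanium, "mpi" as the PE name.
    print(" ".join(build_qsub_command("run_mpi.sh", "lx24-ia64", pe="mpi", slots=8)))
```

This only covers the platform selection; linking the clusters' resources together into one grid is a separate configuration step on the SGE side.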

Laurence

Henry_G_Intel · ‎01-24-2005

Hi Laurence,

I forgot about I/O. Does SGE have a facility to move the input files for my application to the remote host, then move the output file back to my local host?

Let's say, for example, that I have an MPI application (compiled for Itanium) that reads one input file and writes one output file. I log in to my local Itanium cluster and submit a job to SGE. My local cluster is busy so SGE transfers my job to another Itanium cluster in the grid. This cluster is at a different location. Does SGE have a mechanism to transfer the input file and executable to the remote cluster? Or does SGE expect a copy of the executable and input file to already reside on the remote cluster?

Thanks,

Henry

laurenceliew · ‎01-26-2005

Hi Henry

This is a file staging issue.

If your /home is globally accessible, then it is simple: on any machine SGE allocates to you, your application can get hold of the input files and write to the output directory.

However, if /home is not globally accessible, then it becomes more cumbersome. You will need to write your SGE script file so that it copies the files itself (you start from a well-known location, i.e. a server), but you need to take into account that SGE will allocate you a node which you do not know beforehand.

SGE does have ENV vars from which you can get the hostname, but you will have to assume you know the layout of the directory structure...
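A minimal sketch of that kind of staging, assuming the job can `scp` to and from the submit host without a password and that the input and output files live in the directory the job was submitted from. It reads the variables SGE sets in the job environment (`SGE_O_HOST`, `SGE_O_WORKDIR`, `TMPDIR`); the function name and the file names are made up for illustration.

```python
import posixpath


def staging_commands(env, input_name, output_name):
    """Build scp commands to stage a file in before the run and out after it.

    env is the job's environment (os.environ inside an SGE job). Assumptions:
    files sit in the submit directory on the submit host, and the execution
    node can reach that host with scp.
    """
    submit_host = env["SGE_O_HOST"]      # host the job was submitted from
    submit_dir = env["SGE_O_WORKDIR"]    # directory qsub was run in
    scratch = env.get("TMPDIR", "/tmp")  # per-job scratch directory on the node

    remote_in = f"{submit_host}:{posixpath.join(submit_dir, input_name)}"
    remote_out = f"{submit_host}:{posixpath.join(submit_dir, output_name)}"

    stage_in = ["scp", remote_in, posixpath.join(scratch, input_name)]
    stage_out = ["scp", posixpath.join(scratch, output_name), remote_out]
    return stage_in, stage_out


if __name__ == "__main__":
    # Stand-in environment so the sketch can be run outside of SGE.
    fake_env = {
        "SGE_O_HOST": "frontend.example.org",
        "SGE_O_WORKDIR": "/home/henry/run1",
        "TMPDIR": "/tmp/42.1.all.q",
    }
    for cmd in staging_commands(fake_env, "input.dat", "output.dat"):
        print(" ".join(cmd))
```

Inside a real job you would run the first command (for example with `subprocess.run(cmd, check=True)`), start the application in the scratch directory, then run the second command. The directory layout on the submit host is still something you have to know in advance.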

You are basically grappling with the Globus IO issue here if your clusters are not connected to a single /home over LAN/WAN.

Laurence