Running Your Job - Research Computing

Condor works by matching jobs in the queue to available machines in the pool. So far we have seen the two key elements of the pool: the job queue, which is displayed by condor_q, and the pool status, which is displayed by condor_status.

We will see how this works by submitting a simple job below.

Your First Condor Job

On the machine its-condor-submit copy the files hello.sh and hello.sub into your directory by running these commands or by copying the code below:

cp /home/dbrown10/teaching/phy607/hello.sh .
cp /home/dbrown10/teaching/phy607/hello.sub .

hello.sh

#!/bin/bash
# If any command fails, exit with a non-zero exit code
set -e
# Print a welcome message
/bin/echo "Hello, world!"
# Dump some information about the host that we are running on
/bin/echo -n "Host name is "
/bin/hostname
/bin/echo -n "Linux kernel version is "
uname -r
/bin/echo -n "Operating system install is "
cat /etc/issue
# Exit successfully
exit 0

hello.sub

universe = vanilla
executable = hello.sh
transfer_executable = true
should_transfer_files = yes
when_to_transfer_output = on_exit_or_evict
output = hello.$(cluster).$(process).out
error = hello.$(cluster).$(process).err
log = hello.$(cluster).log
queue 1

The script hello.sh is a simple shell script that prints a welcome message and then some information about the host that it is running on. We can run it from the command line to see what it prints on the head node. If you run the command:

./hello.sh

then it will print

Hello, world!
Host name is its-condor-submit
Linux kernel version is 2.6.38-10-virtual
Operating system install is Ubuntu 11.04 \n \l

The Condor Submit File

To submit this script to the OrangeGrid pool as a job, we need the Condor submit file hello.sub. We will look at each line of this in turn to see what it does. The first line of the file hello.sub is

universe = vanilla

This line tells Condor that our program should be run in the vanilla universe. Condor has several universes for running jobs which specify exactly how Condor should handle the job.

Details of the different universes can be found in the Condor manual. In this exercise, we will look at the vanilla universe, which can be used to run any regular executable that you can run on the command line, e.g. shell scripts, Python programs, regular C/C++ programs, etc.

The next line of the file hello.sub is

executable = hello.sh

This line tells Condor which program should be run. executable requires either a relative path from the location of the submit file, or an absolute path to an executable. Since hello.sh is in the same directory as hello.sub we can use a relative path here. If hello.sh was in another location, it would be best to use an absolute path, for example:

executable = /home/dbrown10/bin/hello.sh

The next three lines of the file hello.sub tell Condor how to deal with the input and output data:

transfer_executable = true
should_transfer_files = yes
when_to_transfer_output = on_exit_or_evict

These lines tell Condor that it must transfer the input and output files to and from the computer that will actually execute our job. This is important in the Campus Condor pool, where each machine in the pool is independent and cannot see another machine's files. Condor will automatically transfer back all files that your job creates, but if you require any input files, you must explicitly specify them in the submit file. To do this, you can add an additional line to the submit file which lists the input files:

transfer_input_files = file1,file2

where file1,file2 is a comma-delimited list of all the files and directories to be transferred into the working directory for the job, before the job is started. Since the simple program we are running does not require any input files, this line is omitted from hello.sub.
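As a rough sketch of how such a submit file might look, the shell snippet below writes a variant of hello.sub that also transfers two input files. The names hello_input.sub, file1 and file2 are placeholders chosen for illustration; they are not files used in this exercise.

```shell
#!/bin/bash
# Write a variant of hello.sub that also transfers two input files.
# hello_input.sub, file1 and file2 are placeholder names (assumptions).
# Quoting 'EOF' stops the shell from expanding $(cluster) and $(process),
# so they reach Condor unchanged.
cat > hello_input.sub <<'EOF'
universe = vanilla
executable = hello.sh
transfer_executable = true
should_transfer_files = yes
when_to_transfer_output = on_exit_or_evict
transfer_input_files = file1,file2
output = hello.$(cluster).$(process).out
error = hello.$(cluster).$(process).err
log = hello.$(cluster).log
queue 1
EOF
```

The only difference from hello.sub is the transfer_input_files line; the listed files would need to exist when the job is submitted.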

The next three lines in the file hello.sub tell Condor where it should store the output (stdout) and error (stderr) messages from the job, and where the Condor log messages should be stored:

output = hello.$(cluster).$(process).out
error = hello.$(cluster).$(process).err
log = hello.$(cluster).log

These three commands specify the (relative) paths to the files that will store this information. Notice that we have used two Condor variables in these file names: $(cluster) and $(process). These two variables make up the two parts of the Condor job ID that we saw earlier when running condor_q. For example, the job with ID 200749.495 has a cluster number of 200749 and a process number of 495. Each time you submit a job to the Condor queue, it is given a unique cluster number. It is possible to submit many jobs in one go (as we will see later). The process number tracks the sub-jobs in each cluster.
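To make the relationship between the job ID and the file names concrete, here is a small sketch that splits an ID of the form cluster.process into its two parts and builds the matching output file name. The variable names are our own, and the ID is the example from the text.

```shell
#!/bin/bash
# Split a Condor job ID of the form cluster.process into its two parts.
job_id="200749.495"        # the example job from the text
cluster="${job_id%%.*}"    # remove everything from the first dot onwards
process="${job_id#*.}"     # remove everything up to and including the first dot
echo "cluster=${cluster} process=${process}"
# The output file name that the pattern in hello.sub would give this job
echo "hello.${cluster}.${process}.out"
```

For this ID the pattern hello.$(cluster).$(process).out expands to hello.200749.495.out.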

Finally, the last line in the Condor submit file hello.sub is

queue 1

This tells Condor to submit one job of this type to the pool. If we change the integer after queue, we can submit multiple identical jobs to the pool. For example, if we change the last line of hello.sub to

queue 10

then ten identical jobs will be submitted to the pool. Each of these ten jobs will have the same cluster number, but a unique process number.
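Since each job writes to a file named after its cluster and process numbers, a queue of ten jobs produces ten separate output files. The sketch below lists the names we would expect; list_outputs is a hypothetical helper, and the cluster number is only an example.

```shell
#!/bin/bash
# List the output file names expected from "queue N" for one cluster.
# list_outputs is a hypothetical helper written for this illustration.
list_outputs() {
    local cluster="$1" count="$2" process
    # Process numbers start at 0, so they run from 0 to count-1.
    for ((process = 0; process < count; process++)); do
        echo "hello.${cluster}.${process}.out"
    done
}

# Example: cluster 200757 submitted with "queue 10"
list_outputs 200757 10
```

This prints hello.200757.0.out through hello.200757.9.out, one name per line.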

To submit our test job to the pool, run

condor_submit hello.sub

Condor will respond telling you that the job was submitted and give you the cluster number, for example:

Submitting job(s).
1 job(s) submitted to cluster 200757

Your cluster number will be different, as each submission is given a unique number. To check the status of your job, run condor_q with either the cluster number or your username as the argument. For example:

condor_q 200757

will return the status of job 200757:

[Screenshot: condor_q output for job 200757, showing the job in state I]

Notice that the job is in state I, which means that it is idle and waiting for a CPU to run on. After a while, your job will go into the running state and execute. How do you know this without constantly typing condor_q? Recall that in the Condor submit file, we specified a line that began with the log keyword. This line specifies the Condor job log file, which is updated with your job’s status. You can watch this file using tail -F to see the status of your job.

Remember that the name of this file contains the cluster number assigned to your job, so you will need to include that in the file name. You can check what the file is called with ls. In the example above, you would run

tail -F hello.200757.log

Initially this file will contain the line

000 (200757.000.000) 04/16 18:39:22 Job submitted from host: <10.5.0.6:47605>

which indicates that the job has been submitted and is in the queue. Once Condor has found an available computer to run the job, it will print:

001 (200757.000.000) 04/16 18:40:30 Job executing on host: <10.5.40.28:59902>
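Each event line in the log begins with a three-digit event code, so the latest state can also be read with a short script instead of watching tail. The following is a minimal sketch: job_state is a hypothetical helper, codes 000 and 001 are taken from the lines above, and treating 005 as the termination event is an assumption you should check against your own log.

```shell
#!/bin/bash
# Report the most recent state recorded in a Condor job log.
# job_state is a hypothetical helper. Codes 000 and 001 come from the log
# lines shown above; 005 meaning "terminated" is an assumption.
job_state() {
    local log_file="$1" code
    # Event lines start with a three-digit code followed by " (cluster.process.subprocess)"
    code=$(grep -E '^[0-9]{3} \(' "$log_file" | tail -n 1 | cut -c1-3)
    case "$code" in
        000) echo "submitted" ;;
        001) echo "executing" ;;
        005) echo "terminated" ;;
        "")  echo "no events yet" ;;
        *)   echo "other event ($code)" ;;
    esac
}
```

For the example job you would call job_state hello.200757.log, which reports executing once the second line above has been written to the log.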

When the Job Completes

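When the job completes, Condor will write a message to the log explaining how the job exited:

[Screenshot: job termination message, showing Normal termination and return value 0]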

The most important parts of this message are Normal termination and return value 0. The string Normal termination does not necessarily mean that the job worked, but rather that it ran to completion and did not exit with a problem like a segmentation fault. If your job completed successfully, Condor will indicate this by additionally printing return value 0 (assuming that you return zero on successful exit, as is standard on a UNIX system).
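The return value that Condor reports is the exit code of your executable, which is why hello.sh ends with exit 0. The sketch below shows the same convention on the command line; check_exit is a hypothetical helper that runs a command and reports its exit code.

```shell
#!/bin/bash
# Run a command and report its exit code, following the UNIX convention
# that 0 means success. check_exit is a hypothetical helper.
check_exit() {
    local status=0
    "$@" || status=$?    # capture the exit code without aborting the script
    if [ "$status" -eq 0 ]; then
        echo "success (return value 0)"
    else
        echo "failed (return value ${status})"
    fi
    return "$status"
}
```

Running check_exit ./hello.sh on the head node before submitting is a quick way to confirm that the script exits with zero, so that a non-zero return value from the pool points at the execution environment and not at the script itself.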

Since the program hello.sh exited successfully, we can look at its output. This script normally prints its messages to the terminal (stdout) and so we look in the output file to see these messages. Remember, the name of the output file contains the cluster and process numbers assigned to the job, so in this example we would use

less hello.200757.0.out

to look at the output. For this example, the output file contains

Hello, world!
Host name is MAX-E406B-02-S1-its-u11-boinc-20120415
Linux kernel version is 3.0.0-15-virtual
Operating system install is Ubuntu 11.10 \n \l

Although your exact output will be different, you can see that the host name is no longer its-condor-submit. In this case, the string MAX in the host name tells us that the job ran on an idle machine in the Maxwell School.

By default the ITS Condor pool is configured to send you an email when your job completes. If you check your Syracuse email account, you should have an email telling you that the job either completed or failed. If you do not want to receive these emails for a job, you can add the line

notification = Never

to your job’s Condor submit file. You may want to do this before starting the next section. More detail on this and other options is available on the man page for condor_submit.
