When Things Go Wrong

To see what happens when a job fails, let us deliberately break the script hello.sh.

First, edit the Condor submit file hello.sub and change the number of jobs submitted in a cluster back to one:

queue 1

Next, edit the file hello.sh and change the line

/bin/hostname to /bin/hostnamex

Since the program hostnamex does not exist, our script will fail. Check this by running it on the command line:

./hello.sh

Now the script prints

Hello, world!
Host name is ./hello.sh: line 11: /bin/hostnamex: No such file or directory

If you immediately type

echo $?

you will see that the script has returned 1, rather than zero, to indicate failure. Now use condor_submit to run the broken script. Since it has the same name as our original script, we don’t need to change hello.sub any further, so we can submit it just as before.
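The same check that echo $? performs by hand can be made inside a script. The following is a minimal sketch and is not part of hello.sh: run_step is a hypothetical helper name, true and false stand in for real commands, and bash is assumed.

```shell
#!/bin/bash
# run_step is a hypothetical helper: it runs the command given as its
# arguments and reports a non-zero exit code on standard error.
run_step() {
    "$@"
    local status=$?   # read $? immediately; the next command overwrites it
    if [ "$status" -ne 0 ]; then
        echo "command '$*' failed with exit code $status" >&2
    fi
    return "$status"
}

run_step true             # succeeds silently
run_step false || true    # reports the failure; "|| true" lets the demo continue
```

The key point is that $? must be saved straight after the command of interest, because every later command replaces its value.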

Now, when the job completes, you will see a different message in the Condor job log file.

Condor still indicates Normal termination, but we can see that the script failed because the return value is now 1 rather than zero. If we look at the output of the script in the output file, we see:

Hello, world!
Host name is

The script exited before it printed the hostname, as we expect. To see the error message, we look in the error file:

less hello.200765.0.err

which contains the error message

/var/lib/condor/execute/dir_708/condor_exec.exe: line 11: /bin/hostnamex: No such file or directory

as expected.
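The error message is missing from the output file because standard output and standard error are written to separate files. The same separation can be reproduced in a plain shell. The sketch below assumes bash; the function name demo_job and the file names demo.out and demo.err are made up for illustration.

```shell
#!/bin/bash
# demo_job is a hypothetical stand-in for a job script: it writes one
# line to standard output and one line to standard error.
demo_job() {
    echo "Hello, world!"
    echo "demo_job: something went wrong" >&2
}

workdir=$(mktemp -d)   # scratch directory for the two files

# Send standard output and standard error to separate files.
demo_job > "$workdir/demo.out" 2> "$workdir/demo.err"

echo "--- demo.out ---"; cat "$workdir/demo.out"
echo "--- demo.err ---"; cat "$workdir/demo.err"
```

Only the greeting ends up in demo.out, while the error text ends up in demo.err, which mirrors what we saw in the job's output and error files.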

Besides returning a non-zero exit code, another common way for your code to go wrong is a segmentation fault. The operating system raises this when you try to dereference a null pointer or access memory outside your process's address space. A segmentation fault typically means that you have a bug in the memory management of your code (for example, you have done something bad with pointers). To see what happens when a program fails in this way, copy the files below to a sub-directory of your home directory on its-condor-submit using the commands:

cp /home/dbrown10/teaching/phy607/badcode.c .
cp /home/dbrown10/teaching/phy607/badcode .
cp /home/dbrown10/teaching/phy607/badcode.sub .

The C code is fairly simple:

#include <stdio.h>

int main(void)
{
    int *i = NULL;             /* i points at the null address */
    printf("i is %d\n", *i);   /* dereferencing i triggers the fault */
    return 0;
}

The printf() line will cause an error when the program tries to access the memory at the null address (an invalid memory address). Run the program badcode from the command line and note what happens. You should see the error message:

Segmentation fault

and no output from the code. Submit this program to Condor using the submit file badcode.sub.
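When a program is killed by a signal, bash reports an exit status of 128 plus the signal number, so echo $? straight after a segmentation fault (signal 11 on Linux) typically shows 139. The sketch below illustrates this without needing badcode: a child shell sends itself SIGSEGV as a stand-in for the crash. describe_status is a hypothetical helper, and bash is assumed.

```shell
#!/bin/bash
# describe_status is a hypothetical helper that turns an exit status
# into a short description.
describe_status() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        # By convention the shell reports 128 + signal number
        # when a process is killed by a signal.
        local signum=$((status - 128))
        echo "killed by signal $signum (SIG$(kill -l "$signum"))"
    elif [ "$status" -ne 0 ]; then
        echo "exited with failure code $status"
    else
        echo "exited successfully"
    fi
}

# Stand-in for a crashing program: a child shell sends itself SIGSEGV.
status=0
bash -c 'kill -SEGV $$' || status=$?
describe_status "$status"
```

A program may also choose to exit with a value above 128 on its own, so treat the signal interpretation as a convention rather than a guarantee.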

When the program completes, you will see the following in the Condor job log:

[Figure: Condor Log]

Notice that Condor now reports an Abnormal termination, because the operating system halted the program. Condor also reports signal 11. This is the Linux signal number for a segmentation fault, telling us that our program was killed by a segmentation violation.
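To find the termination messages without paging through a whole job log, you can search for them. This is a minimal sketch, assuming bash and grep: show_termination is a hypothetical helper, and the log file name in the usage comment is an assumption, so substitute whatever name your submit file sets for the log.

```shell
#!/bin/bash
# show_termination is a hypothetical helper, not a Condor command.
# It prints every line of a job log that contains the word "termination",
# which covers both "Normal termination" and "Abnormal termination".
show_termination() {
    local logfile=$1
    if [ ! -r "$logfile" ]; then
        echo "cannot read log file: $logfile" >&2
        return 2
    fi
    grep -i "termination" "$logfile"
}

# Example usage (the file name is an assumption):
#   show_termination badcode.log
```

The matching lines show at a glance whether a job ended with a return value or with a signal.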