Monday, October 27, 2008

Watchdog Script for a Process

Many long-running processes like to just disappear at random intervals, especially late in the day on Friday so that a client calls you at 5pm to fix it. The following script doesn't solve the root cause of why your program is dying, but sometimes there is no reasonable way to fix the root cause, and a watchdog script will do the job.

Watchdog Script

I've written this with the intention of restarting a Tomcat-based service, but if you fiddle with the command assigned to FIND_PROC, you can find whatever you want. Just ONE BIG caveat: don't name your watchdog script so that the watchdog finds itself.

The first version of this script works on Solaris:
#!/bin/sh

# Find the PID of the process (its PPID will be the shell that started it).
# Remember: no spaces are allowed around the equals sign in an assignment.
FIND_PROC=`pgrep myprocess`
#FIND_PROC=`pgrep testscript`
# If FIND_PROC is empty, the process has died; restart it.

if [ -z "${FIND_PROC}" ]; then
    echo myprocess failed at `date`
    # The trailing & lets the watchdog exit instead of waiting for the process.
    nohup /export/home/path/to/your/process.sh &
    #nohup /export/home/admin/testscript.sh &
fi

exit 0
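
One way to reduce the risk of the watchdog finding itself is to ask pgrep for an exact name match with -x. The following is only a sketch under that assumption: restart_if_dead is a hypothetical helper name, not part of the script above, and sending the restarted process's output to /dev/null is a choice made for the example that you may want to replace with a log file.

```shell
#!/bin/sh

# Hypothetical helper: restart_if_dead NAME COMMAND [ARGS...]
# Starts COMMAND in the background when no process is named exactly NAME.
restart_if_dead() {
    name=$1
    shift
    # -x asks pgrep for an exact match on the process name, so a watchdog
    # called something like "myprocess_watchdog.sh" is not counted.
    if pgrep -x "$name" >/dev/null 2>&1; then
        return 0
    fi
    echo "$name failed at `date`"
    # Output is discarded here (an assumption for this sketch);
    # the trailing & keeps the watchdog from waiting on the process.
    nohup "$@" >/dev/null 2>&1 &
}

# Example call, using the placeholder names from the script above:
# restart_if_dead myprocess /export/home/path/to/your/process.sh
```

Check how your platform reports the name of a process started from a shell script before relying on the exact match.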


If you can't use pgrep, there's another way to do this. You can use grep and awk.
FIND_PROC=`ps -ef |grep myprocess | awk '{if ($8 !~ /grep/) print $2}'`

ps picks up the grep process, so the awk script removes it from the listing. The column numbers will likely need adjusting: change "print $2" to "print", run the command from the command line, and you will see the whole line. Then determine which column holds the PID.
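
To see what the awk filter does without depending on a live process list, you can feed it canned input. This is a small sketch: sample_ps_output is a hypothetical function, and the PIDs, times and commands in it are made up to resemble ps -ef output.

```shell
#!/bin/sh

# Made-up listing in the style of `ps -ef`; column 2 is the PID and
# column 8 is the first word of the command.
sample_ps_output() {
    cat <<'EOF'
     UID   PID  PPID  C    STIME TTY      TIME CMD
   admin  4321     1  0 09:15:02 ?        0:42 /usr/bin/java myprocess
   admin  5555  5000  0 17:00:01 pts/1    0:00 grep myprocess
EOF
}

# Same pipeline as above: keep lines that mention myprocess, drop the
# line whose command is grep itself, and print the PID column.
sample_ps_output | grep myprocess | awk '{if ($8 !~ /grep/) print $2}'
# prints 4321
```

If your ps lays out its columns differently, the $8 and $2 in the awk program are the two numbers to change.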

You can read more about awk here.

You might also need to make the ps listing wider than the default 80 characters to avoid having the process name cut off. In this case, your ps may support the --columns argument.

Cron Entry

To edit the cron file, you may first need to set the correct editor. I am assuming this is vi (Solaris likes you to use the absolutely atrocious "ed" by default, even now in 2008... as if vi isn't archaic enough). If you get stuck inside ed, just type "q" and press Enter (or "Q" to quit without saving changes).
EDITOR=vi; export EDITOR
Then run the crontab program:
crontab -e
It may very well be empty if you haven't set up any cron jobs. Each line consists of columns separated by spaces: minute, hour, day of month, month, day of week, and the command to execute. For each time column you can put a single number, a list of numbers (separated by commas), a range (using a dash), or an asterisk (*) to match all values.
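
If the column order is hard to remember, a few lines of shell can label the fields of a crontab entry. This is just an illustrative sketch; describe_cron_line is a hypothetical helper, and it only splits the line without validating it.

```shell
#!/bin/sh

# Hypothetical helper: label the six parts of one crontab line.
describe_cron_line() {
    echo "$1" | {
        # read splits on whitespace; the last variable gets the rest of
        # the line, so a command with arguments stays in one piece.
        read -r minute hour day_of_month month day_of_week command
        echo "minute:       $minute"
        echo "hour:         $hour"
        echo "day of month: $day_of_month"
        echo "month:        $month"
        echo "day of week:  $day_of_week"
        echo "command:      $command"
    }
}

# Every ten minutes, every hour of every day:
describe_cron_line '0,10,20,30,40,50 * * * * /export/home/admin/watchdog.sh'
```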

I used the following line for my watchdog script (which logs normal and error output to watchdog.log):
0,10,20,30,40,50 * * * * /export/home/admin/watchdog.sh >> /export/home/admin/watchdog.log 2>&1

(note, I couldn't paste the above line into crontab -e, but I could retype it no problem... go figure)
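
The ">> ... 2>&1" at the end of the cron line is what sends both normal and error output to the log, appending on each run. Here is a sketch you can try by hand; fake_watchdog and demo_log are hypothetical stand-ins, and it assumes your system has mktemp for creating a scratch file.

```shell
#!/bin/sh

# Hypothetical stand-in for watchdog.sh: one line to stdout, one to stderr.
fake_watchdog() {
    echo "normal output"
    echo "error output" >&2
}

demo_log() {
    log=`mktemp` || return 1        # scratch file standing in for watchdog.log
    fake_watchdog >> "$log" 2>&1    # same redirection as the cron line
    fake_watchdog >> "$log" 2>&1    # a second run appends instead of overwriting
    cat "$log"
    rm -f "$log"
}

demo_log
```

The log ends up with four lines, two from each run, because stderr is pointed at the same place as stdout.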

1 comment:

Ben Krasnow said...

Put a & at the end of the nohup command line in the script. Without this, the script will not terminate until the process does. I guess that's okay in this watchdog context, but it might as well be proper.
