2015-03-10

Setting Nagios Downtime from a Script

I don't know why I've never done this until now.

Create a script called "nagios_downtime" (or whatever you like):

#!/bin/bash

# check for usage
if [ $# -ne 4 ]
then
 echo "Usage: $0 <host> <service> <duration_in_sec> <message>" >&2
 echo "Example: nagios_downtime web3 check_http 3600 'downtime reason'" >&2
 echo "Must be run as the nagios user." >&2
 exit 1
fi

# snarf arguments
host=$1
svc=$2
dur=$3
message=$4

# calculate timestamps for now + duration
start=$(date +%s)
end=$((start + dur))

# initiate downtime
echo "[${start}] SCHEDULE_SVC_DOWNTIME;${host};${svc};${start};${end};1;0;0;nagiosadmin;${message}" > /var/lib/nagios3/rw/nagios.cmd

# print saying you did it
echo "$(date): scheduling downtime for ${svc} on ${host} for ${dur} seconds"

Then, when you're taking an action (say, a backup) and want to set downtime programmatically from within your backup script, just shell out to:

sudo -u nagios nagios_downtime db5 check_mysql_replication 600 'downtime for backup'
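
To make that shell-out reusable, here is a minimal sketch of a wrapper that schedules the downtime and then runs whatever command follows. The function name run_with_downtime and the DOWNTIME_CMD override are made up for illustration; the override only exists so you can try the wrapper without sudo or a live Nagios.

```shell
#!/bin/bash

# Hypothetical helper: schedule downtime, then run the given command.
# Usage: run_with_downtime <host> <service> <seconds> <reason> <command...>
# DOWNTIME_CMD is an assumed override hook; by default the helper shells
# out exactly as described in the text.
run_with_downtime() {
  local host=$1 svc=$2 dur=$3 reason=$4
  shift 4
  # Left unquoted on purpose so the default splits into separate words.
  ${DOWNTIME_CMD:-sudo -u nagios nagios_downtime} \
    "$host" "$svc" "$dur" "$reason" || return 1
  # Run the real work (the backup, for example).
  "$@"
}
```

Something like run_with_downtime db5 check_mysql_replication 600 'downtime for backup' /usr/local/bin/backup.sh would then do both steps (the backup path is a placeholder). This version skips the command if scheduling fails; you may prefer to run the backup regardless and just log the failure.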

For more information, look at the Nagios External Command documentation. Specifically, this script uses the SCHEDULE_SVC_DOWNTIME command, which schedules downtime for a single service.
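
For reference, the fields in that command line can be spelled out one by one. The sketch below is a hypothetical helper (format_svc_downtime is a made-up name, not part of Nagios) that builds the same line the script writes. It assumes the field order given in the external command documentation: host, service description, start, end, fixed, trigger ID, duration, author, comment.

```shell
#!/bin/bash

# Build one SCHEDULE_SVC_DOWNTIME line for the Nagios external command file.
# Arguments: host, service description, start (epoch), end (epoch), comment.
# The author is hard-coded to nagiosadmin, matching the script in this post.
format_svc_downtime() {
  local host=$1 svc=$2 start=$3 end=$4 comment=$5
  local fixed=1       # 1 = fixed window running from start to end
  local trigger_id=0  # 0 = not triggered by another downtime entry
  local duration=0    # only consulted for flexible (fixed=0) downtime
  printf '[%s] SCHEDULE_SVC_DOWNTIME;%s;%s;%s;%s;%s;%s;%s;%s;%s\n' \
    "$start" "$host" "$svc" "$start" "$end" \
    "$fixed" "$trigger_id" "$duration" "nagiosadmin" "$comment"
}
```

Redirecting its output to /var/lib/nagios3/rw/nagios.cmd would submit the command, the same way the echo in the script does.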

keywords: nagios service downtime command line

P.S. This is not a resilient script: it doesn't validate any of its input. Don't run it behind xinetd and think you have an API for remotely setting downtime in Nagios.
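
If you do want some basic input checking, something along these lines could go before the timestamps are calculated. It is only a sketch with made-up function names, and the rules are my assumptions about what is safe: the duration must be a plain positive integer (so nothing unexpected reaches the arithmetic expansion), and the host and service must not contain a semicolon or newline, since those would break the command's field layout. It still doesn't make the script safe to expose remotely.

```shell
#!/bin/bash

# A literal newline, used in the patterns below.
NL='
'

# Succeed only for a positive integer with no leading zero.
valid_duration() {
  case $1 in
    ''|*[!0-9]*|0*) return 1 ;;
  esac
  return 0
}

# Succeed only for a non-empty value without ';' or a newline.
valid_field() {
  case $1 in
    ''|*';'*|*"$NL"*) return 1 ;;
  esac
  return 0
}

# Check all four script arguments; print the first problem to stderr.
validate_args() {
  local host=$1 svc=$2 dur=$3 message=$4
  valid_field "$host"   || { echo "invalid host" >&2; return 1; }
  valid_field "$svc"    || { echo "invalid service" >&2; return 1; }
  valid_duration "$dur" || { echo "invalid duration" >&2; return 1; }
  case $message in
    *"$NL"*) echo "message must be a single line" >&2; return 1 ;;
  esac
  return 0
}
```

In the script, a line such as validate_args "$@" || exit 1 right after the usage check would be enough to wire it in.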

Removing Accidental Chef Attributes

If you've accidentally set attributes on a bunch of nodes in a way that breaks your system, you'll want to delete them. This can happen when you include a cookbook you didn't mean to, and that cookbook sets some attributes in its attributes/default.rb file.

Here's how to fix it:

for i in $(cat /tmp/hosts); do
  echo -n "$i: "
  knife exec -E "nodes.transform('name:$i') {|n| puts n.hostname ; n.normal_attrs['my_bad_attribute_name'].delete('self')}"
done

Note: replace the nested array format with underscores. In other words, node['my']['bad']['attribute']['name'] becomes my_bad_attribute_name.
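
Before deleting anything across a bunch of nodes, it can be worth looking at what each node currently holds. The sketch below is one way to do that, not part of the fix itself: read_hosts and preview_attribute are made-up names, the host file is the same /tmp/hosts style list (one node per line), and the lookup uses knife node show with -a. I'm assuming the attribute is passed in whatever form knife node show accepts on your setup (it takes dotted paths for nested attributes).

```shell
#!/bin/bash

# Print each usable host name from a list file, skipping blank lines
# and lines that start with '#'.
read_hosts() {
  local line
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      ''|'#'*) continue ;;
    esac
    printf '%s\n' "$line"
  done < "$1"
}

# Show the current value of one attribute on every host in the file.
# Nothing is modified; this only reads from the Chef server.
preview_attribute() {
  local hosts_file=$1 attr=$2 host
  read_hosts "$hosts_file" | while IFS= read -r host; do
    # stdin is redirected so knife cannot swallow the rest of the list
    knife node show "$host" -a "$attr" < /dev/null
  done
}
```

Running preview_attribute /tmp/hosts my.bad.attribute.name (attribute name is a placeholder) prints one result per node, which gives you a chance to spot a wrong host list before running the delete loop.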