Thursday, 30 June 2011

Software RAID, Linux and Nagios

Software RAID is a great thing. With RAID 1, for example, you can mirror a partition. If one disk dies, the system carries on regardless, and as a rule you notice nothing of the failure. That is exactly the problem: if the defective disk is not replaced, you will find out at the latest when the second disk fails, and by then it is too late.
The mdmonitor daemon can help here by sending an email when a disk fails.
If you already run Nagios on a machine, it makes sense to have Nagios monitor the state of the RAID array as well.
On the machine to be monitored (the Nagios client) you first need a working snmpd daemon.
snmpd can execute an external script and publish the script's output via SNMP.
To do this, install the following script as an executable file in the directory /usr/share/snmp/exec on the Nagios client (the snmpd configuration further down refers to it as nagios-linux-swraid.pl):

#!/usr/bin/env perl

# Get status of Linux software RAID for SNMP / Nagios
# Author: Michal Ludvig
#         http://www.logix.cz/michal/devel/nagios
#
# Simple parser for /proc/mdstat that outputs status of all
# or some RAID devices. Possible results are OK and CRITICAL.
# It could eventually be extended to output WARNING result in
# case the array is being rebuilt or if there are still some
# spares remaining, but for now leave it as it is.
#
# To run the script remotely via SNMP daemon (net-snmp) add the
# following line to /etc/snmpd.conf:
#
# extend raid-md0 /root/parse-mdstat.pl --device=md0
#
# The script result will be available e.g. with command:
#
# snmpwalk -v2c -c public localhost .1.3.6.1.4.1.8072.1.3.2

use strict;
use Getopt::Long;

# Sample /proc/mdstat output:
#
# Personalities : [raid1] [raid5]
# md0 : active (read-only) raid1 sdc1[1]
#       2096384 blocks [2/1] [_U]
#
# md1 : active raid5 sdb3[2] sdb4[3] sdb2[4](F) sdb1[0] sdb5[5](S)
#       995712 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
#       [=================>...]  recovery = 86.0% (429796/497856) finish=0.0min speed=23877K/sec
#
# unused devices: <none>

my $file = "/proc/mdstat";
my $device = "all";

# Get command line options.
GetOptions ('file=s' => \$file,
        'device=s' => \$device,
        'help' => sub { &usage() } );

## Strip leading "/dev/" from --device in case it has been given
$device =~ s/^\/dev\///;

## Return codes for Nagios
my %ERRORS=('OK'=>0,'WARNING'=>1,'CRITICAL'=>2,'UNKNOWN'=>3,'DEPENDENT'=>4);

## This is a global return value - set to the worst result we get overall
my $retval = 0;

my (%active_devs, %failed_devs, %spare_devs);

open FILE, "< $file" or die "Can't open $file : $!";
while (<FILE>) {
        next if ! /^(md\d+)+\s*:/;
        next if $device ne "all" and $device ne $1;
        my $dev = $1;

        my @array = split(/ /);
        for $_ (@array) {
                next if ! /(\w+)\[\d+\](\(.\))*/;
                if ($2 eq "(F)") {
                        $failed_devs{$dev} .= "$1,";
                }
                elsif ($2 eq "(S)") {
                        $spare_devs{$dev} .= "$1,";
                }
                else {
                        $active_devs{$dev} .= "$1,";
                }
        }
        if (! defined($active_devs{$dev})) { $active_devs{$dev} = "none"; }
                else { $active_devs{$dev} =~ s/,$//; }
        if (! defined($spare_devs{$dev}))  { $spare_devs{$dev}  = "none"; }
                else { $spare_devs{$dev} =~ s/,$//; }
        if (! defined($failed_devs{$dev})) { $failed_devs{$dev} = "none"; }
                else { $failed_devs{$dev} =~ s/,$//; }

        # The line after the "mdX :" line carries the [total/up] counters and the [U_] map
        $_ = <FILE>;
        /\[(\d+)\/(\d+)\]\s+\[(.*)\]$/;
        my $devs_total = $1;
        my $devs_up = $2;
        my $stat = $3;
        my $result = "OK";
        if ($devs_total > $devs_up or $failed_devs{$dev} ne "none") {
                $result = "CRITICAL";
                $retval = $ERRORS{"CRITICAL"};
        }

        print "$result - $dev [$stat] has $devs_up of $devs_total devices active (active=$active_devs{$dev} failed=$failed_devs{$dev} spare=$spare_devs{$dev})\n";
}
close FILE;
exit $retval;

# =====
sub usage()
{
        printf("
Check status of Linux SW RAID

Author: Michal Ludvig (c) 2006
        http://www.logix.cz/michal/devel/nagios

Usage: mdstat-parser.pl [options]

  --file=<file>        Name of file to parse. Default is /proc/mdstat
  --device=<device>    Name of MD device, e.g. md0. Default is \"all\"

");
        exit(1);
}
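
To see the rule the Perl script applies without wiring up SNMP first, the comparison of the two counters in [total/up] can be sketched in shell. The function name mdstat_degraded is made up for this illustration and is not part of the original script; it only looks at the counters and ignores the (F) and (S) markers.

```shell
# Hypothetical helper (not part of nagios-linux-swraid.pl): print the name of
# every md device whose [total/up] counters show fewer active than configured
# members. Reads /proc/mdstat unless another file is given as first argument.
mdstat_degraded() {
    awk '
        /^md[0-9]+ *:/ { dev = $1 }              # remember the current array
        match($0, /\[[0-9]+\/[0-9]+\]/) {        # e.g. "[2/1]"
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[1] + 0 > n[2] + 0) print dev   # total > up: degraded
        }
    ' "${1:-/proc/mdstat}"
}
```

Called without arguments it reads /proc/mdstat and prints nothing as long as every array has all of its members.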



Next, on the Nagios client, add the following line to the configuration file of the SNMP daemon (usually /etc/snmp/snmpd.conf):
extend raid-md0 /usr/share/snmp/exec/nagios-linux-swraid.pl --device=md0

Then have snmpd re-read its configuration. On Red Hat:
# service snmpd reload
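
Before moving on, it is worth checking on the client that snmpd really publishes the script output. The check script on the Nagios side queries NET-SNMP-EXTEND-MIB::nsExtendOutputFull by name, which needs the MIB files; where they are missing, the numeric OID can be used instead. The sketch below builds it, assuming the standard Net-SNMP layout in which nsExtendOutputFull is .1.3.6.1.4.1.8072.1.3.2.3.1.2 and the index is the length-prefixed entry name. extend_oid is a made-up helper name.

```shell
# Hypothetical helper: build the numeric OID of nsExtendOutputFull for an
# "extend" entry name. The table index is the length of the name followed by
# the ASCII code of each character.
extend_oid() {
    name=$1
    oid=".1.3.6.1.4.1.8072.1.3.2.3.1.2.${#name}"
    while [ -n "$name" ]; do
        rest=${name#?}                    # name without its first character
        char=${name%"$rest"}              # the first character itself
        oid="$oid.$(printf '%d' "'$char")"
        name=$rest
    done
    printf '%s\n' "$oid"
}

# Usage on the Nagios client (community string "public" is an assumption):
#   snmpget -v2c -c public localhost "$(extend_oid raid-md0)"
```

If the snmpget call returns the OK or CRITICAL line of the Perl script, the client side is working.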

Now back to the Nagios machine:
There, install the shell script check_snmp_extend.sh in the directory /usr/lib/nagios/plugins with the following content:
#!/bin/sh

# Nagios "check" for querying output of scripts
# from remote servers via SNMP "extend" mechanism.
#
# Author Michal Ludvig (c) 2006
#        http://www.logix.cz/michal/devel/nagios
#

# Example configuration
# =====================
# for monitoring SW RAID arrays. Any other service
# that can be checked with a script can be monitored
# with this approach.
#
# Put the following lines into nagios' configuration:
#
# ---- cut here ----
# $USER10$=/usr/local/nagios/libexec.local
#
# define command{
#       command_name    check_snmp_extend
#       command_line    $USER10$/check_snmp_extend.sh $HOSTADDRESS$ $ARG1$
#       }
#
# define service{
#       use                     generic-service
#       host_name               server.domain
#       service_description     RAID status
#       check_command           check_snmp_extend!raid-md0
# }
# ---- cut here ----
#
# On the host server.domain configure SNMP extension
# with name "raid-md0".
# Configuration goes to /etc/snmp/snmpd.conf or similar.
#
# ---- cut here ----
# extend raid-md0 /usr/local/bin/nagios-linux-swraid.pl --device=md0
# ---- cut here ----
#
# That's all. Just note that older versions of
# Net-SNMP package did not support "extend" keyword.
# You will have to use "exec" with check_snmp_exec.sh
#
# Both check_snmp_exec.sh and nagios-linux-swraid.pl
# scripts are available from:
#    http://www.logix.cz/michal/devel/nagios
#
# Enjoy!
# Michal Ludvig

. /usr/lib/nagios/plugins/utils.sh || exit 3

SNMPGET=$(which snmpget)

# Quote the variable: with an empty value, an unquoted "test -x" would succeed
test -x "${SNMPGET}" || exit $STATE_UNKNOWN

HOST=$1
shift
NAME=$1
shift
COMMUNITY=$1

test "${HOST}" -a "${NAME}" || exit $STATE_UNKNOWN

RESULT=$(snmpget -v2c -c "${COMMUNITY}" -OvQ "${HOST}" NET-SNMP-EXTEND-MIB::nsExtendOutputFull.\"${NAME}\" 2>&1)

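# First word of the first line; quoting keeps "[UU]" from being treated as a glob
STATUS=$(printf '%s\n' "$RESULT" | head -n 1 | cut -d' ' -f1)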

case "$STATUS" in
        OK|WARNING|CRITICAL|UNKNOWN)
                RET=$(eval "echo \$STATE_$STATUS")
                ;;
        *)
                RET=$STATE_UNKNOWN
                RESULT="UNKNOWN - SNMP returned unparsable status: $RESULT"
                ;;
esac

printf '%s\n' "$RESULT"
exit $RET
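
The heart of the script is the translation of the first word of the SNMP answer into a Nagios exit code. A minimal sketch of that step on its own, without utils.sh and eval, using the numeric codes listed in the Perl script above (status_to_exit is a name chosen for this example):

```shell
# Hypothetical helper: map the first word of a plugin message to the
# Nagios exit code (OK=0, WARNING=1, CRITICAL=2, anything else UNKNOWN=3).
status_to_exit() {
    case "${1%% *}" in
        OK)       echo 0 ;;
        WARNING)  echo 1 ;;
        CRITICAL) echo 2 ;;
        *)        echo 3 ;;
    esac
}
```

Anything that does not start with one of the three known words, for example an snmpget timeout message, ends up as UNKNOWN, which matches the behaviour of the case statement in the script.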



Now create the Nagios command check_snmp_extend by adding the following lines to the Nagios configuration (e.g. in command.cfg). $USER1$ has to point to the plugin directory used above:
define command{
        command_name    check_snmp_extend
        command_line    $USER1$/check_snmp_extend.sh $HOSTADDRESS$ $ARG1$ $_HOSTSNMPCOMMUNITY$
}

Now you can define a service that monitors the RAID array:
define service{
        use                     generic-service
        host_name               NAGIOSCLIENT
        service_description     RAID status md0
        check_command           check_snmp_extend!raid-md0
}

The corresponding host definition for NAGIOSCLIENT (host_name must match the name used in the service definition):
define host{
        use                     generic-linux
        host_name               NAGIOSCLIENT
        alias                   XXXX
        address                 SOME_IP_ADDRESS
        _SNMPVERSION            1
        _SNMPCOMMUNITY          public
}

This also explains the variable _SNMPCOMMUNITY used in the Nagios command definition: Nagios exposes a host's custom variable _SNMPCOMMUNITY as the macro $_HOSTSNMPCOMMUNITY$.
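
To make the macro substitution concrete, the following sketch mimics how the command line is assembled from the pieces defined above. expand_check is a hypothetical helper written for this post, not a Nagios tool, and the address and plugin directory in the usage line are example values.

```shell
# Hypothetical helper: mimic how Nagios fills in the macros of
#   $USER1$/check_snmp_extend.sh $HOSTADDRESS$ $ARG1$ $_HOSTSNMPCOMMUNITY$
expand_check() {
    plugin_dir=$1      # value of $USER1$ (set in resource.cfg)
    address=$2         # "address" from the host definition
    check_command=$3   # e.g. check_snmp_extend!raid-md0
    community=$4       # "_SNMPCOMMUNITY" from the host definition

    arg1=${check_command#*!}   # the text after the first "!" becomes $ARG1$
    printf '%s/check_snmp_extend.sh %s %s %s\n' \
        "$plugin_dir" "$address" "$arg1" "$community"
}

# Usage (example values):
#   expand_check /usr/lib/nagios/plugins 192.0.2.10 'check_snmp_extend!raid-md0' public
```

Running the printed command by hand on the Nagios machine and looking at its output and exit code is a quick way to test the whole chain before reloading Nagios.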