safte-monitor - Linux SAF-TE SCSI enclosure monitor

Renamed to safte-monitor due to a trademark issue:
"SAFTEMON" is a registered trademark of StorCase Technology Inc.

safte-monitor reads disk enclosure status information from SAF-TE (SCSI
Accessible Fault Tolerant Enclosures). SAF-TE is a component of SES
(SCSI Enclosure Services) which is common on most SCSI disk enclosures
these days. safte-monitor can monitor multiple SAF-TE devices and will
automatically probe and detect them.

The information retrieved includes power supply and fan status,
temperature, audible alarm, drive faults, array critical/failed/rebuilding
state and door lock status. safte-monitor logs changes in the status of these
enclosure elements to syslog and can optionally execute an alert helper program
with details of the component failure. This could, for example, send a pager
message. Temperature alert limits can also be set.

safte-monitor is specifically useful if you have equipment deployed in remote
locations or left unattended over weekends, when no one may hear an audible alarm.

By default safte-monitor will run in the background and log to syslog. It can be
run in one-shot scan mode to print found SAF-TE device info using the -p flag.
It will also respond to web requests on port 8123 and display HTML output
of enclosure status.

usage:	./safte-monitor [-h] [-p] [-n] [-a] [-T] [-t <max_temp>] [-A <alert_prog>]

-h     show this help message
-p     print - print device scan information then exit
-T     log temperature changes
-N     alert for non critical state changes
-A <f> program to run for alerts
-t <n> max temperature (default 35.0 celcius)
-n     numeric sg device names eg. /dev/sg0 (default)
-a     alpha sg device names eg. /dev/sga


By default, temperatures and temperature limits are in Celsius. This can be
changed to Fahrenheit by removing -DUSE_CELCIUS from the Makefile.
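
If the program is built for Fahrenheit, the limit passed to -t presumably has
to be given in Fahrenheit as well. The sketch below converts a Celsius limit
using the standard formula F = C * 9 / 5 + 32. The function name c_to_f is
made up for this example and is not part of safte-monitor.

```shell
#!/bin/sh
# c_to_f: hypothetical helper, not shipped with safte-monitor.
# Converts a Celsius value to Fahrenheit with one decimal place.
c_to_f() {
    awk -v c="$1" 'BEGIN { printf "%.1f\n", c * 9 / 5 + 32 }'
}
```

For example, the default limit of 35.0 Celsius corresponds to
./safte-monitor -t "$(c_to_f 35)" on a Fahrenheit build.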


Example alert helper program:
-----------------------------

#!/bin/sh
# Arguments: $1=device $2=message $3=system $4=partno $5=code
echo "ALERT device=$1 message=$2 system=$3 partno=$4 code=$5" | \
	mail -s 'safte alert' root
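
As a variation on the helper above, the following sketch writes the alert to
syslog with logger(1) instead of sending mail. It assumes the helper receives
the same five arguments in the same order (device, message, system, partno,
code). The function name format_alert and the tag safte-alert are choices made
for this example only.

```shell
#!/bin/sh
# format_alert: hypothetical function that builds the alert line from the
# five positional arguments; quoting keeps messages with spaces intact.
format_alert() {
    printf 'ALERT device=%s message=%s system=%s partno=%s code=%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Only log when the script is actually called with arguments.
if [ "$#" -gt 0 ]; then
    format_alert "$@" | logger -t safte-alert -p daemon.crit
fi
```

Saved as an executable file, this could be passed to safte-monitor with the
-A option.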


Example output from a scan:
---------------------------

# ./safte-monitor -p
SAF-TE Device ESG-SHV SCA HSBP M10 (0:0:6:0)
no. of fans           = 0
no. of power supplies = 1
no. of device slots   = 4
door lock installed   = 0
no. of temp sensors   = 1
audible alarm         = 0
no. of thermostats    = 0
power supply 0 is okay and on
device slot 0 disk present,active,unconfigured
device slot 1 insert ready
device slot 2 insert ready
device slot 3 insert ready
temp sensor 0 is 25.0 celcius and okay
overall temperature is okay

SAF-TE Device CNSi G8324 (2:0:0:50)
no. of fans           = 4
no. of power supplies = 2
no. of device slots   = 6
door lock installed   = 1
no. of temp sensors   = 3
audible alarm         = 1
no. of thermostats    = 0
power supply 0 is okay and on
power supply 1 is okay and on
fan 0 is operational
fan 1 is operational
fan 2 is operational
fan 3 is operational
device slot 0 disk not present,no error
device slot 1 disk not present,no error
device slot 2 disk not present,no error
device slot 3 disk present,no error
device slot 4 disk present,no error
device slot 5 disk present,no error
door lock is locked
speaker is off
temp sensor 0 is 25.6 celcius and okay
temp sensor 1 is 25.6 celcius and okay
temp sensor 2 is 21.7 celcius and okay
overall temperature is okay
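
The scan output is line oriented, so it can be post-processed with standard
tools. As a rough sketch, the function below reads -p output on stdin and
prints the highest temperature reading. It assumes the "temp sensor N is X ..."
line format shown above. The name safte_max_temp is made up for this example.

```shell
#!/bin/sh
# safte_max_temp: hypothetical filter, not part of safte-monitor.
# Reads scan output on stdin and prints the highest sensor reading.
safte_max_temp() {
    awk '/^temp sensor [0-9]+ is / {
             # Field 5 holds the reading, e.g. "25.6"
             if (!seen || $5 + 0 > max) { max = $5 + 0; seen = 1 }
         }
         END { if (seen) printf "%.1f\n", max }'
}
```

Usage: ./safte-monitor -p | safte_max_temp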


Example syslog messages:
------------------------

a temperature change (if option is selected to log temp changes)

SAF-TE Device CNSi G8324 (2:0:0:52): temp sensor 2 changed from '23.9 degrees' to '22.8 degrees'

a power supply failing

SAF-TE Device CNSi G8324 (2:0:0:51): power supply 1 changed from 'okay and on' to 'malfunctioning and commanded on'
SAF-TE Device CNSi G8324 (2:0:0:51): ALERT power supply 1 malfunctioning and commanded on

the power supply back up again

SAF-TE Device CNSi G8324 (2:0:0:51): power supply 1 changed from 'malfunctioning and commanded on' to 'okay and on'

a drive in an array fails

SAF-TE Device CNSi G8324 (2:0:0:51): device slot 4 changed from 'disk present,no error' to 'disk present,critical array'
SAF-TE Device CNSi G8324 (2:0:0:51): ALERT device slot 4 is disk present,critical array
SAF-TE Device CNSi G8324 (2:0:0:51): device slot 5 changed from 'disk present,no error' to 'disk present,faulty'
SAF-TE Device CNSi G8324 (2:0:0:51): ALERT device slot 5 is disk present,faulty

after a rebuild has been started

SAF-TE Device CNSi G8324 (2:0:0:51): device slot 4 changed from 'disk present,critical array' to 'disk present,rebuilding,critical array'
SAF-TE Device CNSi G8324 (2:0:0:51): ALERT device slot 4 is disk present,rebuilding,critical array
SAF-TE Device CNSi G8324 (2:0:0:51): device slot 5 changed from 'disk present,faulty' to 'disk present,rebuilding,critical array'
SAF-TE Device CNSi G8324 (2:0:0:51): ALERT device slot 5 is disk present,rebuilding,critical array

and now the array is okay again

SAF-TE Device CNSi G8324 (2:0:0:51): device slot 4 changed from 'disk present,rebuilding,critical array' to 'disk present,no error'
SAF-TE Device CNSi G8324 (2:0:0:51): device slot 5 changed from 'disk present,rebuilding,critical array' to 'disk present,no error'
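
To pick out only the alerts from a log, the messages above can be filtered on
the "ALERT" marker. The following is a minimal sketch that reads log lines on
stdin and prints the device address and the alert text. It assumes the message
format shown in these examples. The name safte_alerts is hypothetical, and the
location of the syslog file depends on the system.

```shell
#!/bin/sh
# safte_alerts: hypothetical filter, not part of safte-monitor.
# Prints "<address> <alert text>" for each ALERT line and drops
# plain state-change lines.
safte_alerts() {
    sed -n 's/^.*SAF-TE Device \(.*\) (\([0-9:]*\)): ALERT \(.*\)$/\2 \3/p'
}
```

Usage: safte_alerts < /var/log/messages (the log path is an assumption).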


Bugs
----

No known bugs.

Please send bug reports to michael@metaparadigm.com


To do
-----
* Make Celsius/Fahrenheit a command line option.
* Set temp limits separately for different enclosures or individual sensors.
* Implement SNMP traps for alerts - this could be done with an external alert
  helper program.
* Add a remote interface for checking status (SNMP).
* Add a mon-compatible service test agent.
* Finish off autoconfigure support.
* Make safte-monitor dynamically rescan for SAF-TE devices after startup to
  eliminate the need to restart the daemon.
* Integrate with software RAID.

The SAF-TE spec can be found here: http://www.nstor.com/white_papers.cfm

Copyright Metaparadigm Pte. Ltd. 2001. Michael Clark <michael@metaparadigm.com>


This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2 as published
by the Free Software Foundation.
