Monitoring
NeDi does monitoring as well as discovery. The program moni.pl checks the health and uptime of devices, and you can combine it with trap.pl for SNMP trap translation, syslog.pl for log messages, and nedi.pl itself for the monitoring of discovery events. NeDi uses levels and triggers to categorize events and to alert you when monitoring finds something interesting. Discovered devices are not monitored by default. Any thresholds (CPU, Mem etc.) and notification triggers are applied from nedi.conf. Syslog events only receive a level of 30 (Other), and thus can't generate alerts.
To monitor targets, you need to add them to the monitoring table. Devices and nodes are dynamically overwritten by the network discovery (nedi.pl), and you don't want to lose the list of monitored devices each time this happens. You can do this in Devices-List or Nodes-List by first filtering the devices you want to monitor with the list controls, then clicking the "Monitor" button. Alternatively, you can add single targets in Devices-Status by clicking on the binoculars icon. Once added to monitoring, targets can be further configured in Monitoring-Setup.
The monitoring daemon moni.pl first sends non-blocking uptime requests to all SNMP targets. Afterwards, all other targets are tested sequentially, taking the availability of their dependencies into account. For example, a dual-homed web server will only be checked if at least one of the connected switches returned an SNMP uptime.
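The dependency rule in the example above can be sketched as follows. This is a minimal illustration in Python, not moni.pl's actual Perl code: the function name should_check, the data layout, and the host names are hypothetical, and checking targets that have no dependencies at all is an assumption.

```python
def should_check(target, dependencies, snmp_up):
    """Return True if the target should be tested in this polling cycle.

    Assumption: a target without dependencies is always tested; otherwise
    at least one dependency must have answered the SNMP uptime request.
    """
    deps = dependencies.get(target, [])
    return not deps or any(snmp_up.get(dep, False) for dep in deps)


# Hypothetical data: web1 is dual-homed to sw-a and sw-b, web2 hangs off sw-c.
dependencies = {"web1": ["sw-a", "sw-b"], "web2": ["sw-c"]}
snmp_up = {"sw-a": False, "sw-b": True, "sw-c": False}

# Only targets with a reachable dependency are tested.
to_check = [t for t in dependencies if should_check(t, dependencies, snmp_up)]
print(to_check)
```

Here web1 is still tested because sw-b answered, while web2 is skipped because its only switch did not.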
- TCP ping is used by default for nodes and non-SNMP devices (this can be changed to ICMP in Monitoring-Setup)
- Uptime (or SNMP-Engine time, if set in .def) is chosen for devices as it can detect intermittent reboots as well
- BGP peers can be monitored as well, if BGP4-MIB is supported on a device
- IF oper-status can be monitored as well (e.g. on router or server switches)
- The monitoring daemon should be started automatically. It also relies on nedi.conf, where you can set the interval between polls, how many tests a device can fail before it is marked as down, and how alerts should be sent
- If you change the settings, they will be effective as of the next polling cycle. If you want to see results immediately, restart the daemon from System-Services
- If a target is reported to be down, an entry is created in the incidents table with the start time set to the time of detection. The end time is added automatically once the target responds again. Incidents are acknowledged by classifying them in Monitoring-Incidents
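The interplay between the failed-test limit and the incidents table can be sketched as below. This is a simplified Python model under stated assumptions, not NeDi's implementation: the class IncidentTracker, its field names, and the rule that a target counts as down once its consecutive failures reach the limit are all hypothetical.

```python
import time


class IncidentTracker:
    """Toy model of a failure counter plus an incident list."""

    def __init__(self, max_fails):
        self.max_fails = max_fails  # failed tests tolerated before "down" (assumed semantics)
        self.fails = {}             # target -> consecutive failed tests
        self.open = {}              # target -> start time of the open incident
        self.closed = []            # (target, start, end) of finished incidents

    def record(self, target, ok, now=None):
        """Record one test result for a target."""
        now = time.time() if now is None else now
        if ok:
            self.fails[target] = 0
            if target in self.open:
                # Target responds again: add the end time to the incident.
                self.closed.append((target, self.open.pop(target), now))
            return
        self.fails[target] = self.fails.get(target, 0) + 1
        if self.fails[target] >= self.max_fails and target not in self.open:
            # Target is considered down: open an incident at detection time.
            self.open[target] = now


tracker = IncidentTracker(max_fails=2)
for timestamp, ok in [(100, False), (160, False), (220, True)]:
    tracker.record("web1", ok, now=timestamp)
print(tracker.closed)
```

In this run the first failure is tolerated, the second opens an incident at 160, and the successful test at 220 closes it.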
Due to limitations of the SNMP Perl module and non-blocking requests, latencies are not accurate unless you modify Net::SNMP's Message.pm:
Line 23:
use Time::HiRes;
Around line 691, above the debug output in send():
$this->{_transport}->{_send_time} = Time::HiRes::time;
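The idea behind this change is that latency is the difference between two clock readings, so the send time has to be taken with a high-resolution clock right when the request leaves. The same principle is shown below for a TCP ping. This is a Python sketch for illustration only, not NeDi code: the function tcp_ping and its parameters are hypothetical, and the address and port in the usage line are placeholders.

```python
import socket
import time


def tcp_ping(host, port, timeout=1.0):
    """Return the TCP connect latency in milliseconds, or None on failure."""
    start = time.perf_counter()  # high-resolution timestamp taken just before sending
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Refused, unreachable or timed out: treat the test as failed.
        return None


if __name__ == "__main__":
    # Placeholder target; replace with a host and port you are allowed to probe.
    print(tcp_ping("127.0.0.1", 80))
```

A clock with whole-second resolution would report 0 or 1000 ms for most round trips, which is why the sub-second timestamp matters.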
Message Flows
The following diagram explains how events (originating from syslog, trap, discovery and monitoring) are processed.