Hardware Monitoring with Nagios on OpenBSD
Nagios (formerly “Netsaint”) is the monitoring system of choice at my employer, End Point. Nagios is extremely flexible out of the box, and with a bit of coaxing you can have it watch for hardware failures on your OpenBSD machines, as I describe below. I’m assuming you have a working Nagios and NRPE installation, and know how to install software from source.
How we use Nagios at End Point
We monitor local and remote services on all production End Point managed servers. Nagios tells us about problems before our clients notice them, and allows us to manage proactively. Anything Nagios monitors is a “service.”
Local services
Local services are things you can only know about if you’re running a monitoring program on the monitored host. The Nagios add-on NRPE (Nagios Remote Plugin Executor) makes this kind of monitoring easy and secure.
A selection of services NRPE allows us to monitor:
- Load,
- Disk space,
- Number of processes,
- Number of “zombie” (or “stuck”) processes,
- Number of users.
Remote services
These are services you can reach directly via TCP/IP connections – things like:
- SSL- and non-SSL websites,
- SMTP server performance,
- Web server uptimes,
- Interchange e-commerce application servers,
- OpenSSH,
- PostgreSQL and MySQL database servers,
- FTP servers.
Hardware Monitoring on OpenBSD servers
An End Point client uses an OpenBSD server we set up for them as an alternative to expensive, proprietary Cisco hardware. This machine is very important to the client:
- It serves as a VPN connecting offices thousands of miles away,
- It brokers voice-over-IP and shapes the bandwidth to ensure high call quality,
- It handles web traffic, and
- It passes incoming mail to the mail server behind it.
It’s doing a lot. It’s got to stay up. They asked us if we could set up our Nagios system to make sure they know if a case or CPU fan starts to die, as dead hardware is about the only thing that’ll bring an OpenBSD server down.
NRPE has built-in support for monitoring hardware sensors on Linux machines, but not on OpenBSD. No problem – OpenBSD (since 3.9) has an excellent hardware sensors framework built in. You can access the sensor information via the ‘sysctl’ command:
sysctl -a | grep 'hw\.sensors'
Our OpenBSD firewall/router gives output like:
hw.sensors.0=lm0, VCore A, volts_dc, 2.61 V
hw.sensors.1=lm0, VCore B, volts_dc, 0.86 V
hw.sensors.2=lm0, +3.3V, volts_dc, 3.23 V
hw.sensors.3=lm0, +5V, volts_dc, 5.00 V
hw.sensors.4=lm0, +12V, volts_dc, 11.31 V
hw.sensors.5=lm0, -12V, volts_dc, -8.36 V
hw.sensors.6=lm0, -5V, volts_dc, -7.72 V
hw.sensors.7=lm0, 5VSB, volts_dc, 4.51 V
hw.sensors.8=lm0, VBAT, volts_dc, 2.98 V
hw.sensors.9=lm0, Temp1, temp, 32.00 degC / 89.60 degF
hw.sensors.10=lm0, Temp2, temp, 41.00 degC / 105.80 degF
hw.sensors.11=lm0, Temp3, temp, 44.50 degC / 112.10 degF
hw.sensors.12=lm0, Fan1, fanrpm, 2860 RPM
hw.sensors.13=lm0, Fan2, fanrpm, 2636 RPM
This tells me that sensors 9 and 10 are probably case/chassis temp sensors and sensor 11 is probably the CPU temp (because it’s consistently the hottest). Sensors 12 and 13 are fans, though I’m not quite sure which is the CPU or case fan – it doesn’t really matter, as it’s a problem if either dies.
check_hw_sensors
If you wanted to roll your own NRPE plugin with a perl or shell script, the output of sysctl alone would be enough to get you started – NRPE and Nagios are that flexible.
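To give a feel for what a hand-rolled check might involve, here is a minimal Python sketch (the same idea would work in perl or shell). It is not the plugin we ended up using. It assumes the sysctl output format shown above, and the names parse_sensors and check_fans, the 1000 RPM default, and the SAMPLE text are illustrative only. The exit codes follow the usual Nagios plugin convention (0 OK, 1 warning, 2 critical, 3 unknown).

```python
import re

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Matches lines like: hw.sensors.12=lm0, Fan1, fanrpm, 2860 RPM
SENSOR_LINE = re.compile(
    r"^hw\.sensors\.(\d+)=([^,]+),\s*([^,]+),\s*([^,]+),\s*(-?\d+(?:\.\d+)?)"
)

# A few lines copied from the sysctl output above, used as sample input.
SAMPLE = """\
hw.sensors.5=lm0, -12V, volts_dc, -8.36 V
hw.sensors.9=lm0, Temp1, temp, 32.00 degC / 89.60 degF
hw.sensors.12=lm0, Fan1, fanrpm, 2860 RPM
hw.sensors.13=lm0, Fan2, fanrpm, 2636 RPM
"""


def parse_sensors(sysctl_output):
    """Return {sensor_number: (device, label, kind, value)} from sysctl text."""
    sensors = {}
    for line in sysctl_output.splitlines():
        match = SENSOR_LINE.match(line.strip())
        if match:
            number, device, label, kind, value = match.groups()
            sensors[int(number)] = (
                device.strip(), label.strip(), kind.strip(), float(value)
            )
    return sensors


def check_fans(sensors, min_rpm=1000.0):
    """Return (exit_code, message); any fan slower than min_rpm is critical."""
    fans = {n: s for n, s in sensors.items() if s[2] == "fanrpm"}
    if not fans:
        return UNKNOWN, "UNKNOWN: no fan sensors found"
    slow = [
        f"{s[1]}={s[3]:.0f} RPM"
        for n, s in sorted(fans.items())
        if s[3] < min_rpm
    ]
    if slow:
        return CRITICAL, "CRITICAL: " + ", ".join(slow)
    return OK, f"OK: {len(fans)} fans above {min_rpm:.0f} RPM"
```

A real plugin would feed parse_sensors the live sysctl output, print the message on one line, and exit with the returned code so that NRPE can pass the state back to Nagios.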
We can do better, though: a quick web search turned up check_hw_sensors, an NRPE plugin to monitor OpenBSD hardware attributes. The neat thing about this plugin is that it can use the /etc/sensorsd.conf file for its limits, meaning you can use the OpenBSD “sensorsd” daemon AND NRPE to monitor OpenBSD hardware. Installing check_hw_sensors is easy. Unfortunately, the documentation is a bit sparse, perhaps because the OpenBSD sensors framework is relatively new.
You can configure the check_hw_sensors plugin to have the typical “warning” and “critical” levels, but in our case we want all failures to be treated as critical – if a fan is running too slow (outside the normal bounds) or a CPU is too hot, it’s a problem that needs to be addressed immediately.
/etc/sensorsd.conf
I matched up the output of the sysctl -a command above with entries in /etc/sensorsd.conf, which now looks like:
hw.sensors.9:high=50C
hw.sensors.10:high=60C
hw.sensors.11:high=60C
hw.sensors.12:low=1000
hw.sensors.13:low=1000
This configuration says: if the fans drop below 1000 RPM (normal operating speeds are well above 2000 RPM), or the temperature sensors rise above 50 or 60 degrees Celsius (normal operating temperatures are 30 to 40 degrees Celsius), we’ve reached a critical state. Sound the klaxons.
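The logic behind these limits can be sketched in a few lines of Python. This is only an illustration of the low/high comparison, not how sensorsd or check_hw_sensors is actually implemented. It assumes the entry format shown above, treats “#” as a comment marker, assumes temperature limits are given in Celsius, and uses made-up names (parse_limits, out_of_range) with readings taken from the sysctl output earlier.

```python
import re

# Matches the entries shown above, e.g. "hw.sensors.9:high=50C" or
# "hw.sensors.12:low=1000". Only the "low" and "high" keys are handled.
LIMIT_LINE = re.compile(r"^(hw\.sensors\.\d+):(.+)$")

CONF = """\
hw.sensors.9:high=50C
hw.sensors.10:high=60C
hw.sensors.11:high=60C
hw.sensors.12:low=1000
hw.sensors.13:low=1000
"""

# Current readings, taken from the sysctl output earlier in the article.
READINGS = {
    "hw.sensors.9": 32.0,
    "hw.sensors.10": 41.0,
    "hw.sensors.11": 44.5,
    "hw.sensors.12": 2860.0,
    "hw.sensors.13": 2636.0,
}


def parse_limits(conf_text):
    """Return {sensor_name: {"low": float, "high": float}} (keys optional)."""
    limits = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        match = LIMIT_LINE.match(line)
        if not match:
            continue
        name, settings = match.groups()
        for setting in settings.split(":"):
            key, _, value = setting.partition("=")
            if key in ("low", "high"):
                # Strip the trailing "C" unit (assumption: Celsius only).
                limits.setdefault(name, {})[key] = float(value.rstrip("C"))
    return limits


def out_of_range(readings, limits):
    """Return the sorted names of sensors whose reading violates a limit."""
    bad = []
    for name, bounds in limits.items():
        value = readings.get(name)
        if value is None:
            continue  # no reading for this sensor; a real check might flag it
        if "low" in bounds and value < bounds["low"]:
            bad.append(name)
        elif "high" in bounds and value > bounds["high"]:
            bad.append(name)
    return sorted(bad)
```

With the readings above, out_of_range(READINGS, parse_limits(CONF)) returns an empty list. A stalled fan or an overheating sensor would put that sensor’s name in the list, which is the condition we want reported as critical.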
Getting NRPE to notice this stuff
Our /etc/nrpe.cfg file has the line:
command[check_hardware]=/usr/local/bin/check_hw_sensors -f
which tells the check_hw_sensors plugin to use the /etc/sensorsd.conf file for its level configuration. One could configure each hardware sensor with its own NRPE check and get notified separately about failures.
If any of the sensors go above or below the high/low levels we configured, check_hw_sensors will tell NRPE that we’re in a critical state. You’ll need to restart the nrpe daemon on the OpenBSD box to get it to notice the new settings.
Configure your monitoring host to use the check_hardware command like any other NRPE check, and you’ll have near real-time hardware monitoring.
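For readers who have not wired up an NRPE check before, the monitoring-host side might look roughly like the fragment below. It is a sketch under assumptions, not our production configuration: the host name firewall.example.com and the generic-service template are placeholders, and your installation may already define a check_nrpe command, in which case only the service block is needed.

```
# On the monitoring host. "firewall.example.com" and "generic-service"
# are placeholders; substitute your own host and service template.
define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

define service {
    use                     generic-service
    host_name               firewall.example.com
    service_description     Hardware sensors
    check_command           check_nrpe!check_hardware
}
```

The argument after the “!” must match the command name in the brackets of the command[check_hardware] line in /etc/nrpe.cfg on the OpenBSD box.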
Summary
OpenBSD has an excellent hardware monitoring framework. The check_hw_sensors plugin – with a bit of configuration for your hardware – makes it easy to monitor your OpenBSD hardware attributes from a central monitoring server. Together, Nagios and check_hw_sensors give you near real-time monitoring of hardware on your OpenBSD servers.
