Solaris Nagios Checks
With a production Solaris box (Solaris Mailserver) on my network, I needed to set up Nagios monitoring for it. For the most part it was easy - a few services like IMAP and SMTP were monitored externally by attempting a login. A few other services could be monitored using Perl, Python, and shell plugins, mostly found on my existing systems. I also made heavy use of the plugins available from NagiosExchange, my first stop when looking for plugins. Some of the plugins that I found needed to be modified a bit, mainly changing the script location (where it looks for
/usr/local/nagios/libexec to my default location (for Solaris) of
For the main plugins, I used the nagiosp package from Blastwave.org. After installation, I tested some of them, worked out the arguments I wanted, copied the command lines into a text editor, and then symlinked the plugins in the
/export/home/nagios directory. All checks are executed actively from my main Nagios box via check_by_ssh.
After setting up my usual service checks in addition to mailserver-specific ones, I found a few things missing:
- The check_ram script that I got seemed to be reporting RAM usage incorrectly.
- The check_cpu script was showing constant 100% CPU usage, 90% user and 10% system. This didn't seem to jibe with what SMC was showing.
- I didn't know how to monitor hardware (temperature, fans, etc.)
- Also, I wanted monitoring on the IDE disk.
With all of my Solaris work, one information source has come up in countless Google searches - the blog of Ben Rockwood, Director of Systems for Joyent, an outspoken evangelist of OpenSolaris, and a great source of Solaris-related information. I now read his blog daily. For anyone who happens to find my little corner of the web, I highly recommend his blog as a source of information, advice, and news.
One of the things that I monitor on all of my production servers is hardware status - fans, temperatures, power supplies, etc. Granted, my first Solaris box - a Sun Blade 150 workstation - is a simple one-processor, one-power-supply machine, so it's a lot simpler than, say, my HP Proliant ML370 with 2 processors, 3 PSUs, and 3 fan zones. Looking through Nagios Exchange, I was able to find check_prtdiag (which appears to have originated elsewhere), which uses a config file and a generic Perl-based parser to parse the output of
prtdiag, the Solaris built-in hardware diagnostics command. Running
prtdiag -v on the machine gave me output that included CPU status, PCI device status, fan status, and temp status (okay or... not okay?). That's pretty much what I need.
The main issue was that the check_prtdiag plugin includes configurations for a number of SunFire and Sun Enterprise servers, but not my Sun Blade 150 workstation. I've never taken the time to really learn Perl (I happen to prefer Python), and my RegEx experience isn't anywhere near as good as it should be.
When I actually ran
check_prtdiag, I got:
# ./check_prtdiag
Unrecognized escape \s passed through at ./check_prtdiag line 276.
Unrecognized escape \s passed through at ./check_prtdiag line 276.
Unable to identify system type !
I don't know Perl, so as far as I'm concerned, the script doesn't work. The bottom line: I'm probably going to write my own check script using Python or shell (sed, awk, grep, the whole deal).
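As a starting point, here is a minimal Python sketch of what such a check might look like. It does not parse the platform-specific tables at all. Instead it leans on my understanding of the prtdiag man page, which is that the command exits 0 when it detects no failures, 1 when it detects failures, and 2 on an internal error. That mapping, the /usr/sbin/prtdiag path, and the function names classify and run_check are all assumptions for illustration and would need to be verified on the actual machine.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(returncode, output):
    """Map a prtdiag exit status and its output to (nagios_code, message)."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    summary = lines[0] if lines else "no output"
    # Assumption: 0 = no failures detected, 1 = failures detected,
    # anything else = prtdiag itself had a problem.
    if returncode == 0:
        return OK, "PRTDIAG OK - " + summary
    if returncode == 1:
        return CRITICAL, "PRTDIAG CRITICAL - failures reported, see prtdiag -v"
    return UNKNOWN, "PRTDIAG UNKNOWN - prtdiag exited with %d" % returncode


def run_check(cmd=("/usr/sbin/prtdiag", "-v")):
    """Run prtdiag and classify the result; the default path is an assumption."""
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True)
    except OSError as exc:
        return UNKNOWN, "PRTDIAG UNKNOWN - cannot run %s: %s" % (cmd[0], exc)
    return classify(proc.returncode, proc.stdout)
```

To use this as a plugin, a small wrapper would call run_check(), print the message, and exit with the returned code. Relying on the exit status sidesteps the per-model regular expressions that tripped up check_prtdiag, at the cost of not reporting which component failed.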
First, I'll admit, having my own mail server is a convenience, not a necessity. I only have one user account set up, and all of my mail is forwarded to my ISP's server and then pulled to my mail server with fetchmail. If this whole thing went down, it wouldn't be a big deal, which is why I'm running it on a used (University surplus) machine. Old hardware means monitoring hard disk health is pretty important to me. Unfortunately, smartmontools doesn't currently have IDE support under Solaris. The best thing I could find is
iostat -Exn which shows various IO stats and error counts for the disk.
I can't find any existing check script for this (just errors, not stats), so I'm going to add this to my to-do list as well.
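A rough Python sketch of the parsing side of such a check follows. It assumes each device in the iostat -En style output starts with a line of the form "device Soft Errors: N Hard Errors: N Transport Errors: N". The sample text is made up for illustration, not captured from my machine, and the policy (soft errors warn, hard or transport errors are critical) is my own arbitrary choice. The names parse_errors and evaluate are hypothetical.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed per-device header line, e.g.:
#   c0t0d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
HEADER = re.compile(
    r"^(\S+)\s+Soft Errors:\s*(\d+)\s+Hard Errors:\s*(\d+)"
    r"\s+Transport Errors:\s*(\d+)"
)


def parse_errors(text):
    """Return {device: (soft, hard, transport)} from iostat -En style text."""
    counts = {}
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            counts[match.group(1)] = tuple(int(n) for n in match.group(2, 3, 4))
    return counts


def evaluate(counts):
    """Turn error counts into (nagios_code, message) using an assumed policy."""
    if not counts:
        return UNKNOWN, "DISK UNKNOWN - no devices found in iostat output"
    critical = sorted(d for d, (soft, hard, trn) in counts.items() if hard or trn)
    warning = sorted(d for d, (soft, hard, trn) in counts.items() if soft)
    if critical:
        return CRITICAL, "DISK CRITICAL - hard/transport errors on " + ", ".join(critical)
    if warning:
        return WARNING, "DISK WARNING - soft errors on " + ", ".join(warning)
    return OK, "DISK OK - no errors on %d device(s)" % len(counts)


# Illustrative sample only; real output has more fields per device.
SAMPLE = """\
c0t0d0          Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Size: 40.02GB <40020664320 bytes>
c0t1d0          Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
"""

code, message = evaluate(parse_errors(SAMPLE))
print(message)
```

In a real plugin, the text would come from running iostat on the Solaris box, and the script would exit with the returned code so that check_by_ssh can pass the state back to Nagios.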