Network Monitoring Comparison
NOTE: This page is a work in progress.
As posted in my blog (here and here), I decided to test some of the network monitoring software packages available.
At the risk of alienating any of the developers, my criteria for test selection were:
- Open-source.
- An all-in-one solution that includes monitoring, alerting, and trending (so I'm not testing Nagios, Cacti, Munin, etc.).
- Asset management is a big plus.
and some other things that I can't think of offhand.
So, as posted in my blog, I've narrowed the running (at least for the first round of tests) down to three systems:
While I got recommendations for all of those on my blog, including a comment from Jeff G. of OpenNMS, Dave Dennis of GroundWork was nice enough to drop me an e-mail, so GroundWork will be up first.
Unfortunately, my original idea of parallel testing is *not* going to work. I'll be testing in Xen virtual machines on one of my best boxes - a 1U Dell PowerEdge 650 with a 2.4 GHz P4, 768MB RAM, and 29 GB of free disk space - but this is still well below the minimum specs for GroundWork.
Comparison of Packages
Since I run Nagios (v2) right now, I'm leaving it in here for comparison:
Please Note: I make no guarantee as to the accuracy of this information. I tried to extract most of it from the respective packages' docs, but I don't know that it's 100% accurate, and I can't be sure it hasn't changed since.
Overview and Features
Feature | Nagios | GroundWork OS | OpenNMS | Hyperic HQ | Zabbix | Zenoss Core | Splunk (Free) |
---|---|---|---|---|---|---|---|
Basic Specs | |||||||
License | GPLv2 | GPL | GPLv2 | ||||
Server Recommended Processor | n/a | 2.4GHz P4 | 1GHz+ P4 | P4 | |||
Server Recommended RAM | n/a | 2GB | 1 GB | 100 MB | |||
Server Recommended Disk Space | n/a | 60 GB | 10 GB | 256 MB | |||
Agent Recommended Processor | n/a | 500 MHz+ | |||||
Agent Recommended RAM | n/a | 256MB | |||||
Agent Recommended Disk Space | n/a | 500MB | |||||
Interface Server | Any web server w/ CGI | Apache | Tomcat/JBoss | ||||
Database | Many (MySQL for me) | MySQL 5 | |||||
Interface Language | CGIs - mix of languages | PHP | Java Web App | Python | |||
Agent Type | NRPE, SSH, SNMP, etc. | Python Plugin - mostly SNMP. | Java Monolithic | ||||
Agent OS's | Almost Anything | Linux, Windows, most Unixes | Any. | ||||
Server OS | Almost anything | Linux, Unix | Almost Any. | Linux, FreeBSD, OS X | |||
Scalability | Ok (distributed) | Excellent | Resource Hog | ||||
Major Features | |||||||
Service Alerts | Yes. | Yes. | Yes. | Yes. | |||
Service Alerts | Yes. | Through Agent | Yes. | ||||
Inventory | ??? | Yes (Basic) | Yes. | ||||
Performance Monitoring (Graphical) | 3rd Party | Yes. (Default.) | Yes. | ||||
GUI Configuration | 3rd Party | Yes. (Default.) | |||||
Automatic Discovery | ??? | Yes. (Mostly.) | Yes. | ||||
Default Configurations | No. | Yes. | Yes. (ZenPacks) | ||||
Log Monitoring | Plugin | ???? | Yes. | Yes. | Yes. | ||
Config File Monitoring | Plugin | ???? | |||||
Plugin / Agent Type | Script | Python | Java | ||||
Ease of Plugin Creation | Easy. (script) | Easy. (Python) | Moderate - Diff. (Java plugin or script) | Easy. (Python, | |||
Graphical Mapping | 3rd Party | No. | Yes. | Yes. | |||
Network Topology Map | Yes. And 3-D. | No. | Yes. | Yes. | |||
Nagios Plugin Support? | Yes. | XML File. | Thru GUI. | n/a | |||
Ticketing? | Script. | HTTP Request Alert | Email Parser. | ||||
Additional Features | |||||||
Configuration | Text File | Web GUI | |||||
Ease of Customization | Easy - Scripts | Difficult/Impossible | |||||
Host Groups | Yes | Yes | |||||
Host/Service Templates | Yes | Metrics Only | |||||
Alert Templates | Yes | No | |||||
Time-Based Alerts | Yes | No | Yes. | ||||
Multi-Metric Alerts | No | ??? | |||||
Easy Escalation Schemes | Sort-of | Yes | None at all. | ||||
Alert Options | Anything script-based | Email, SMS, XMPP, Script, HTTP GET/Post, SNMP Trap | Email, SMS | ||||
Range of Available Plugins | Large Builtin and Nagios-Exchange | Limited. | |||||
RBAC | No. | No. | |||||
LDAP Auth. | No. | No. | |||||
User Groups | Yes. | No. | Yes. | ||||
Comparison with my Nagios
Here's a comparison between what I monitor on Nagios, and what the other contestants can monitor.
D: Default on
A: Default on, automatic configuration
I: Installed by default (but not enabled)
C: Custom script possible / relatively easy
X: Impossible or very time-consuming (more so than Nagios)
R: Requires Client Server/Service reconfiguration/customization
U: Untested
3: Third-Party Plugin Available
W: Work-In-Progress (Either OEM or Third-Party)
Nagios | Hyperic HQ | GroundWork OS | OpenNMS |
Apache Status | R | ||
APC SmartUPS - ALL (SNMP) | |||
APCUPSD (all) | |||
Bacula - Nightly Job Status | |||
Bacula - Nightly Job Time | |||
Bacula - Weekly Job Time | |||
DNS - Dig for Hostname | |||
DNS - Domain Name Expiration | |||
HTTP - Response (Webmin) | |||
HTTP - Check Remote Response | |||
IMAP Status, Response Time | |||
Linux - SSH Server | |||
Linux - Partition Free Space | |||
Linux - Current Users | |||
Linux - Load Avg | |||
Linux - Ping/Avail. | |||
Linux - CPU Load | |||
Linux - Uptime | |||
Linux - Total Procs | |||
Linux - NTP Offset | |||
MySQL - Server Status | |||
Ping - External | |||
Proliant - CPUs | X | ||
Proliant - Fans | X | ||
Proliant - Free Memory | X | ||
Proliant - PSUs | X | ||
Proliant - Temp | X | ||
SMTP - Status, Response Time | |||
Solaris - Package DB | |||
Solaris - Partition Free Space | |||
Solaris - Current Users | |||
Solaris - CPU | |||
Solaris - Load | |||
Solaris - NTP Offset | |||
Solaris - Processor | |||
Solaris - Free RAM | |||
Solaris - SMF Service Status | |||
Solaris - SWAP | |||
Solaris - Uptime | |||
Solaris - Sun SPARC Hardware | |||
Switch - Ping | |||
Switch - Telnet | |||
TCP - Bacula FD/SD |
Stuff Not Monitored With Nagios
Here's a comparison between what I would like to monitor on Nagios, and what the other contestants can monitor (i.e. stuff I haven't implemented in Nagios yet).
D: Default on
A: Default on, automatic configuration
I: Installed by default (but not enabled)
C: Custom script possible / relatively easy
X: Impossible or very time-consuming (more so than Nagios)
R: Requires Client Server/Service reconfiguration/customization
U: Untested
3: Third-Party Plugin
W: Work-In-Progress (Either OEM or Third-Party)
Nagios | Hyperic HQ | GroundWork OS | OpenNMS |
Alerting - Create Ticket | |||
Asterisk - Status | |||
Asterisk - Load | |||
Asterisk - Trunk Status | |||
Dell - Hardware | W(3) | ||
IPcop - Status | |||
IPcop - Interface Status | |||
IPcop - VPN Status | |||
MythTV - Backend Status | |||
MythTV - Tuner Status | |||
Printer - Status | |||
Printer - Alerts | |||
Status checks on Remote Subnet with no gateway | |||
Xen - domU Performance | |||
Xen - domU Status |
Cross-System Compatibility
Hyperic HQ: Hyperic HQ has the ability (non-default) to use custom scripts as plugins, OR to use Nagios plugins.
Hyperic HQ
Part I - Installation
- Set up a Xen virtual machine running OpenSuSE 10.3 base packages. (3 hours, some server problems, some Xen problems, and some time learning Xen administration from the CLI)
- Download hyperic-hq-installer-3.2.0-607-x86-linux.tgz from Hyperic and extract.
- Browse to http://support.hyperic.com/confluence/display/DOC/Full+Installation+Guide
- cd into hyperic-hq-installer and run ./setup.sh -full
- The installation can't be run as root (though I assumed it would need root privileges).
- I selected to install all 3 components - Server, Shell, and Agent.
- Well, whoops! Sorta stupid to not allow installation as root, when the default location to install to is /home/hyperic. How do they expect an arbitrary user to install there? Even worse, it appears that the default OpenSuSE 10.3 installation doesn't come with sudo (!!!!) so I can't try that.
- As root, create /home/hyperic and chown it to my user.
- Repeat the above steps (well, hopefully not all of them).
- Default ports for everything - web GUI on 7080, HTTPS web GUI on 7443, jnp service on 2099, mbean server on 9093.
- Change domain names in default URLs to logical ones for my test environment (no real DNS, just IPcop hosts, so devel-hyperic1.localdomian). I hope that I can change these later, or even better that absolute paths aren't used too much, as this will screw with my idea of using SSH port forwarding for remote access.
- Leave the default SMTP server alone and change it later - I don't even have mail running here at the apartment.
- Use the built-in PostgreSQL database with default port of 9432.
- Go with the defaults for everything after this.
- Everything runs nicely, and then it tells you to log in to another terminal as root and run a script. I'm not sure I like this method, but I guess it works. Log in and do it.
- How will it start the builtin database? As my user???? Yup. postgres is running as my user. Wonderful. Nothing in the install document mentioned user creation. Was this just assumed? Because in the naive world I live in, most installer scripts (think Nagios) create a user for you, or tell you to.
- Setup script complete. A few instructions follow...
- Run /home/hyperic/server-3.2.0/bin/hq-server.sh start ... as my user. Note to self: set up a user for Postgres and Hyperic. Believe it or not, it booted - but followed with the message, "Login to HQ at: http://127.0.0.1:7080/"
- Browsed to http://devel-hyperic1:7080 and was greeted by a startup page, saying that the server was 18% finished booting. My, I yearn for little C binaries and a PHP frontend.
- Page turns blank and stops there. I refresh, and get a login page. I enter my username and password, and get a little message box where the "invalid password" box usually is - says "Server is still booting". This is over a minute later. I'm happy to see Apache/Coyote1.1, but would like to be able to get into Hyperic in less time than it takes the machine to boot to a graphical login screen (ok, granted, I'm running XFCE). In SuSE's YaST Xen Monitor, I see that the VM is at 45% of its 464MB RAM, and 90% CPU - with 8.5% consumed by dom0.
- CPU usage for the VM drops to 1% and I log in again. BAM! Hyperic HQ. Aside from the fact that it shows NO resources... oh... start the Agent.
- Start the Agent on the VM running Hyperic. It asks me for the server IP address. What, no DNS? I enter the IP as it is... for now. I keep everything at defaults, including using the hqadmin username and password. Successfully started.
- BAM! In Dashboard, I see the auto-discovered host with the right hostname, as well as Tomcat, Agent, JBoss, and PostgreSQL. Amazing! Click "Add to Inventory".
- Check out the "Resources" -> "Browse" screen. It knows this machine is OpenSuSE 10.3, and I see my four services (listed above). Of course, no metrics yet, but I see the correct IP, gateway, DNS, vendor (SuSE), kernel version, RAM, architecture, and CPU speed.
- Looking through the "Inventory" screen, I see everything - NICs and MACs, running servers and one service (a CPU resource). What more could a man want in...let's see.. just over an hour!
- I really *love* the "Views" screen which, even out-of-the-box, allows "Live Exec" information from cpuinfo, df, ifconfig, netstat, top, who, and more.
- Well, it's 03:35, and I have work and class tomorrow. I think it's time to give Part I a rest. But first...
- Go to the "Platform" page for my one machine and... YES! Graphs are starting to appear!
- Following the suggestion here, I enable log and config tracking on the platform for /var/log/warn and /etc/hosts, respectively.
- Before I call it a night (now 03:42), I stop back at the downloads page and grab the Linux x86 Agent for the dom0 machine, hoping to get some physical information as well. While I'm at it, I grab the Linux AMD64 Agent to try on my laptop. I create "hyperic" users on each system. On the base Xen server, I give it a shot and get "Unable to register agent: Error communicating with agent: Unauthorized". Same thing on the laptop.
- Did a little reading here. As for keeping all of the defaults, it turns out that both clients had firewalls blocking TCP port 2144. I opened it up on both, and also set the IP address (that the server uses to contact the client) to the correct ones. Voilà! Now I have 3 clients connected, and gathering data for the next ~16 hours until I have time to check it out again.
More to come in Part II tomorrow - actually doing something with Hyperic. For now (04:08), time to sleep.
Part II - Configuration
Unfortunately, I haven't had much time to play with Hyperic in the two days since installation. The most I've really done is set up Agents on my laptop, desktop, and the host machine (both dom0 and domU for Hyperic), so that they start to collect data.
While I found a lot of upsetting stuff in the features list (see below), I decided to go ahead and add some other devices. On the network at the apartment, I have two manageable switches (a Linksys and a 3Com) - which pretty much make up the sum of non-host equipment. I also have an IPcop box, though I assume the standard Linux Agent will handle that. The one item missing that I have at home is my set of APC SmartUPS UPSs with SNMP cards, but I guess I'll just have to skip them for this review.
First, I went in and added a platform (Resources->Browse, Tools Menu->Add Platform) for the 3Com switch (a SuperStack II Switch 3300). It showed successful creation - but nothing else. I went in and entered the SNMP community string, IP, and version (1). In about a minute or so, I started to see metrics - Availability, IP Forwards, IP In Receives, and IP In Receives per Second. While it's quite basic, that's good for a starting point. While the Network Device Platform documentation (http://support.hyperic.com/confluence/display/DOCSHQ30/Network+Device+platform) lists lots of metrics that can be enabled, I'd also like telnet availability and - my big one, since I use a "cute" (crappy) IPcop installation for local DNS - a dig on DNS to make sure the entry is there. In the Monitor screen, I was able to enable a bunch of additional metrics (by clicking on the "Show All Metrics" link), though there's also no way (that I can find) to monitor the status of individual ports.
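The DNS check wished for above can be approximated in a few lines. This is only a sketch in Python, not a Hyperic plugin: it asks the system resolver rather than running dig against a specific server, and the function name dns_entry_ok, the hostname, and the address in the usage note are all hypothetical.

```python
import socket
from typing import Callable


def dns_entry_ok(
    hostname: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if hostname resolves to expected_ip.

    The resolver is injectable so the check can be exercised without
    a live DNS server; by default the system resolver is used.
    """
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # socket.gaierror and socket.herror are OSError subclasses:
        # the lookup failed, so treat the entry as missing.
        return False
```

Calling dns_entry_ok("switch1.localdomain", "192.168.0.2") would report whether the local DNS still has the entry; both values here are placeholders.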
Next, I browsed through the "Administration" pages, set up a few users, and started setting *way* more default metrics for various platforms, services, and servers. While I don't have mail running yet, that will come this weekend. While I added a lot of things as "Default On", I still need to go back and add more things in the templates as Indicators.
I also added some escalations, though they're quite simple - you can notify HQ users or "other users" by email or SMS, write to SysLog, or suppress alerts for 0 minutes to 24 hours. Hopefully I'll also find a plugin for Asterisk integration. One striking omission is user groups. Also, the concept of "Roles" (maybe their idea of groups?) is only available in the Enterprise version.
At this point, I also notice one other major issue, though perhaps I'll find a solution in my experimentation - there doesn't seem to be a way to set up default alerts for metrics. If they have all of this platform, server, and service information defined as default templates, why not just have a way to assign default users (and groups) to these objects, and have default alerts generated?
In terms of Apache 2.2 monitoring, out-of-the-box, nothing worked. No metrics at all. Firstly, Hyperic requires the mod_status module. Personally, I'd rather handle all of that through a backend, like Nagios. Secondly, it got the pidfile and apache2ctl paths wrong. Furthermore, it has no "smart" checking for resources - while my Apache 2.2 resource config was clearly wrong (wrong PID file path, no mod_status), Hyperic didn't detect this and simply showed the resource as "Down".
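For reference, mod_status can serve a machine-readable report (the server-status URL with the ?auto query string) as plain "Key: value" lines. A minimal Python sketch of reading those fields, assuming that format; the sample values are made up, and parse_server_status is a hypothetical helper, not part of Hyperic or Apache.

```python
def parse_server_status(text: str) -> dict:
    """Parse the 'Key: value' lines of mod_status ?auto output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip anything that is not a 'Key: value' line
            fields[key.strip()] = value.strip()
    return fields


# Illustrative sample only; real output depends on the server and its config.
SAMPLE = """Total Accesses: 1234
Total kBytes: 5678
Uptime: 3600
BusyWorkers: 1
IdleWorkers: 9
Scoreboard: _W________
"""

status = parse_server_status(SAMPLE)
print(status["BusyWorkers"], status["IdleWorkers"])
```

Values come back as strings, so a caller would convert the ones it wants to graph or alert on.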
After that, I set up a bunch of alerts for things that I thought would be off-kilter a lot (like WARN log entries on my laptop, high memory usage on some stressed machines, etc.) as well as log and config file monitoring and alerts for them. While I didn't have mail working yet, I figured I might as well get that stuff running.
On the Xen dom0 host that runs the Hyperic vm (box called xenmaster1), I wasn't able to add config file tracking for any of the /etc/xen/ files. At this point I notice some serious shortcomings - not only is it not possible to define a template of alerts for a given platform/server/service, it's also impossible to define a template for alerts. I also noticed that it's not possible to define groups of contacts. This wasn't much of a problem for my test installation - the alerts are only going to my roommate and me - but it would surely be an issue in any larger setting.
At this point in configuration, I come to a make-or-break point. With some of these shortcomings, I really need a way to call a script with alert information when an alert is generated - whether it's to dial out through Asterisk or just automatically create a ticket for the problem.
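A rough idea of what such a hook script might look like, sketched in Python. The calling convention (host, metric, and message passed as three command-line arguments) is purely an assumption, not something Hyperic is known to provide, and build_ticket and its field names are hypothetical.

```python
import sys
from datetime import datetime, timezone


def build_ticket(host: str, metric: str, message: str) -> dict:
    """Turn raw alert fields into a ticket-shaped record."""
    return {
        "title": f"[ALERT] {host}: {metric}",
        "body": message,
        "opened": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


# Assumed invocation: script.py <host> <metric> <message>
if __name__ == "__main__" and len(sys.argv) == 4:
    ticket = build_ticket(*sys.argv[1:4])
    # A real hook would hand this to a ticket system or a dial-out script.
    print(ticket)
```

The point is only that once alert data reaches a script, ticket creation or an Asterisk call is a small step; the hard part is getting the monitoring system to run the script at all.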
Adding alerts is a cumbersome process. You have to browse to a page for a specific metric - which means going to the page for a specific platform, server, or service - and then opening the page for that metric. The actual alert creation takes up two pages - one for the metric, threshold, and time-based criteria, and a second for who to alert. This means that to add alerts for a machine, you need to view the platform page as well as the services and servers pages, and each metric therein.
What's Missing / What I Don't Like
- No plugin for HP Proliant hardware, even in
- No plugin for Xen dom0 monitoring.
- No notion of contact groups.
- As of yet, no plugin for Dell hardware management.
- Plugins have to be written in Java (though there is an option to run a script as a plugin)
- No plugin for LDAP
- Apache 2.2 plugin didn't detect the right paths for my default SuSE installation, and requires mod_status.
Looking at the Product Comparison Chart, the Open Source version is missing some key features. Among the features that HQ Enterprise (usually with Gold or Diamond subscription) has which the basic HQ (F/OSS variant) lacks are:
- global alert templates for groups of resources
- multi-conditional alerting
- recovery alerts (?!?!?!)
- performance baselining for alert thresholds
- time-of-day alerting - this is a big problem for me.
- automatic corrective actions for alert responses
- RBAC
- external authentication support with LDAP or Kerberos
- (resource) group alerts
- SNMP traps
- resource-type alerts
- No apparent way to add a service check (Script or Nagios Plugin) to a resource (especially Network Device) that executes on the Hyperic server.
What's There that I love
- Just install an Agent, and it detects your system and starts monitoring. Period. That's it. No need to setup config files or specify lots of information.
- Web GUI works flawlessly with SSH port forwarding, and allows me to use multiple tabs for concurrent tasks.
- The Web GUI provides easy, usually intuitive settings and administration.
- With the platform/server/service templates, you can get default monitoring metrics within a few minutes, with little to no work.
- Intuitive alert escalations requiring little to no prior knowledge.
- Builtin RSS feeds of Alerts, control actions, etc.
- Amazing automatic graphs of many Indicators
- Built-in tracking and auditing of log files and config files
Plugin Architecture
As far as I can tell, all of the plugin checks execute as Java on the client. Therefore, while it can monitor the status of, say, an Apache configuration, it doesn't test actual connectivity. If someone mis-configures a firewall and blocks port 80, you won't be alerted (unless you set up alerting for requests per minute lower than a given threshold).
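By contrast, a check run from the monitoring server exercises the network path itself. A minimal Python sketch in the style of a Nagios plugin, using the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL); the function name check_tcp and the one-second warning threshold are assumptions for illustration.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_tcp(host: str, port: int, warn_s: float = 1.0, timeout: float = 5.0):
    """Connect to host:port and return (exit_code, message), Nagios-style."""
    start = time.monotonic()
    try:
        # A successful connect proves the port is reachable from here,
        # which an agent running on the target machine cannot tell you.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # Refused, timed out, or unreachable (e.g. a firewall rule).
        return CRITICAL, f"TCP CRITICAL - {host}:{port} unreachable ({exc})"
    if elapsed > warn_s:
        return WARNING, f"TCP WARNING - {host}:{port} answered in {elapsed:.3f}s"
    return OK, f"TCP OK - {host}:{port} answered in {elapsed:.3f}s"
```

A wrapper script would print the message and exit with the returned code, e.g. for check_tcp("www.example.com", 80), where the hostname is a placeholder.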
Subjective Review
Overall, it seems that HyperFORGE has very few plugins - between the base install and HyperFORGE, it's way behind Nagios, and what I've come to expect as a Nagios user. More importantly, writing a plugin that would take 20 minutes for Nagios becomes a time-intensive procedure.
Performance isn't wonderful compared to lighter solutions like Nagios. My test installation, at 5 platforms, 29 servers, and 290 services (4 agents) was collecting 274 metrics per minute. It was in a Xen VM with 464 MB dedicated memory, 10GB HDD, on a system with a 2.4GHz P-4. It's running OpenSuSE 10.3 as dom0 and this domU, as well as a minimal CentOS domU. Running normally with no GUI activity, I'm seeing 96-98% memory usage, 70MB swap usage, and load averages of about 0.50. Not exactly hopeful for scaling up to my whole network, given my minimal hardware resources. For comparison, my Nagios machine runs on a 1GHz box with 1GB RAM, and uses about 20% of RAM normally, even with Apache, MySQL, Nagios, and Munin.
The lack of LDAP authentication is a problem, as I've been planning on moving all of my systems and applications to LDAP, and hopefully moving everything web-based to an SSO architecture. I also find the lack of recovery alerts and time-of-day alerting to be especially troubling, both since Nagios has them, and because I rely heavily on them - such as disabling performance alerts when machines are running backups, or disabling all alerts during my router's weekly restart period (0400-0415 Sunday).
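Time-of-day suppression of the kind described above boils down to a window test. A minimal Python sketch, assuming alerts pass through a filter like this before being sent; QUIET_WINDOWS and alerts_suppressed are hypothetical names, and this is not how Nagios or Hyperic implement it.

```python
from datetime import datetime, time

# Weekly router restart window from the text: Sunday 04:00-04:15.
# datetime.weekday() numbers Monday as 0, so Sunday is 6.
QUIET_WINDOWS = [(6, time(4, 0), time(4, 15))]


def alerts_suppressed(now: datetime, windows=QUIET_WINDOWS) -> bool:
    """Return True if 'now' falls inside any (weekday, start, end) window."""
    return any(
        now.weekday() == day and start <= now.time() <= end
        for day, start, end in windows
    )
```

Backup windows would just be more entries in the list. Windows that cross midnight are not handled in this sketch.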
Also, while the GUI makes setup largely a breeze, I still need to define alerts and actions manually for every host, along with manually specifying the thresholds. Moving my Nagios configuration with 17 hosts and 300+ services over would be a big pain. The Enterprise version offers a lot more, but the Open Source version should at least let me define templates for resources that specify default alerts for a list of services/servers (Nagios allows this) so that, for example, I could specify all alerts I want for any Linux host that has those services running.
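The template idea amounts to expanding one definition across every host that runs a matching service. A minimal Python sketch of that expansion; the template contents, metric names, thresholds, and host names are all made up for illustration and do not correspond to any Nagios or Hyperic format.

```python
# Hypothetical template: service name -> (metric, comparison, threshold).
LINUX_TEMPLATE = {
    "ssh": ("availability", "<", 1.0),
    "load": ("load_avg_5min", ">", 4.0),
    "disk": ("partition_free_pct", "<", 10.0),
}


def expand_alerts(hosts: dict, template: dict) -> list:
    """Build one alert definition per host for each templated service it runs."""
    alerts = []
    for host, services in sorted(hosts.items()):
        for service in services:
            if service in template:  # services without a template get no alert
                metric, op, threshold = template[service]
                alerts.append((host, service, metric, op, threshold))
    return alerts


hosts = {"web1": ["ssh", "load", "apache"], "db1": ["ssh", "disk"]}
for alert in expand_alerts(hosts, LINUX_TEMPLATE):
    print(alert)
```

With something like this, adding a new Linux host means listing its services once rather than clicking through every metric page.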
Bottom Line
Hyperic is extremely easy-to-use, has a wonderful GUI, and an easy all-in-one Agent-based setup. Unfortunately, Agent plugins are much more difficult to develop than Nagios scripts, and there are many fewer available than for Nagios. It has little support for hardware-level monitoring (i.e. HP or Dell servers). It provides easy graphical trending and statistics, the type of setup that would take weeks to achieve with Nagios combined with MRTG or Cacti. Furthermore, in my testing, without even reading the manual I was able to get a level of monitoring in an hour that would have taken me days by editing Nagios config files by hand. While the default metrics are quick to implement, it seems that many major features are restricted to the Enterprise version.
The Bottom Line
In terms of extendability, you just can't beat Nagios. With enough time and sweat, you can do anything. With other tools, you get various levels of ease-of-use and ease-of-configuration (many times orders of magnitude faster) at the expense of reduced functionality and customizability. It's a matter of weighing the time savings of more advanced tools against the features that you need.