Network Monitoring Comparison

From JasonAntmanWiki

NOTE: This page is a work in progress.

As posted in my blog (here and here), I decided to test some of the network monitoring software packages available.

At the risk of alienating any of the developers, my criteria for test selection were:

Open-source.
An all-in-one solution that includes monitoring, alerting, and trending (so I'm not testing Nagios, Cacti, Munin, etc.).
Asset management is a big plus.

and some other things that I can't think of offhand.

So, as posted in my blog, I've narrowed the running (at least for the first round of tests) down to three systems:

While I got recommendations for all of those on my blog, including a comment from Jeff G. of OpenNMS, Dave Dennis of GroundWork was nice enough to drop me an e-mail, so GroundWork will be up first.

Unfortunately, my original idea of parallel testing is *not* going to work. I'll be testing in Xen virtual machines on one of my best boxes - a 1U Dell PowerEdge 650 with a 2.4 GHz P4, 768MB RAM, and 29 GB of free disk space - but this is still well below the minimum specs for GroundWork.

Comparison of Packages

Since I run Nagios (v2) right now, I'm leaving it in here for comparison:

Please Note: I make no guarantee as to the accuracy of this information. I tried to extract most of it from the respective packages' docs, but I don't know that it's 100% accurate, and I can't be sure it hasn't changed since.

Overview and Features

Open-Source Network, Server, Service & Log Monitoring Overview
	Nagios	GroundWork OS	OpenNMS	Hyperic HQ	Zabbix	Zenoss Core	Splunk (Free)
Basic Specs
License	GPLv2			GPL		GPLv2
Server Recommended Processor	n/a	2.4GHz P4		1GHz+ P4	P4
Server Recommended RAM	n/a	2GB		1 GB	100 MB
Server Recommended Disk Space	n/a	60 GB		10 GB	256 MB
Agent Recommended Processor	n/a			500 MHz+
Agent Recommended RAM	n/a			256MB
Agent Recommended Disk Space	n/a			500MB
Interface Server	Any web server w/ CGI	Apache		Tomcat/JBoss
Database	Many (MySQL for me)					MySQL 5
Interface Language	CGIs - mix of languages	PHP		Java Web App		Python
Agent Type	NRPE, SSH, SNMP, etc.		Python Plugin - mostly SNMP.	Java Monolithic
Agent OS's	Almost Anything			Linux, Windows, most Unixes	Any.
Server OS	Almost anything			Linux, Unix	Almost Any.	Linux, FreeBSD, OS X
Scalability	Ok (distributed)		Excellent	Resource Hog
Major Features
Service Alerts (on remote system)	Yes.			Yes.		Yes.	Yes.
Service Alerts (from local system)	Yes.			Through Agent			Yes.
Inventory	???			Yes (Basic)		Yes.
Performance Monitoring (Graphical)	3rd Party			Yes. (Default.)		Yes.
GUI Configuration	3rd Party			Yes. (Default.)
Automatic Discovery	???			Yes. (Mostly.)		Yes.
Default Configurations	No.			Yes.		Yes. (ZenPacks)
Log Monitoring	Plugin			????	Yes.	Yes.	Yes.
Config File Monitoring	Plugin			????
Plugin / Agent Type	Script		Python	Java
Ease of Plugin Creation	Easy. (script)		Easy. (Python)	Moderate - Diff. (Java plugin or script)		Easy. (Python, XML)
Graphical Mapping (Google Maps, etc.)	3rd Party			No.	Yes.	Yes.
Network Topology Map	Yes. And 3-D.			No.	Yes.	Yes.
Nagios Plugin Support?	Yes.			XML File.		Thru GUI.	n/a
Ticketing?	Script.		HTTP Request Alert	Email Parser.
Additional Features
	Nagios	GroundWork OS	OpenNMS	Hyperic HQ	Zabbix	Zenoss Core	Splunk (Free)
Configuration	Text File			Web GUI
Ease of Customization	Easy - Scripts			Difficult/Impossible
Host Groups	Yes			Yes
Host/Service Templates	Yes			Metrics Only
Alert Templates	Yes			No
Time-Based Alerts	Yes			No		Yes.
Multi-Metric Alerts	No			???
Easy Escalation Schemes	Sort-of			Yes	None at all.
Alert Options	Anything script-based		Email, SMS, XMPP, Script, HTTP GET/Post, SNMP Trap	Email, SMS
Range of Available Plugins	Large Builtin and Nagios-Exchange			Limited.
RBAC				No.		No.
LDAP Auth.				No.		No.
User Groups	Yes.			No.		Yes.
	Nagios	GroundWork OS	OpenNMS	Hyperic HQ	Zabbix	Zenoss Core	Splunk (Free)

Comparison with my Nagios

Here's a comparison between what I monitor on Nagios, and what the other contestants can monitor.
D: Default on
A: Default on, automatic configuration
I: Installed by default (but not enabled)
C: Custom script possible / relatively easy
X: Impossible or very time-consuming (more so than Nagios)
R: Requires Client Server/Service reconfiguration/customization
U: Untested
3: Third-Party Plugin Available
W: Work-In-Progress (Either OEM or Third-Party)

Nagios	Hyperic HQ	GroundWork OS	OpenNMS
Apache Status	R
APC SmartUPS - ALL (SNMP)
APCUPSD (all)
Bacula - Nightly Job Status
Bacula - Nightly Job Time
Bacula - Weekly Job Time
DNS - Dig for Hostname
DNS - Domain Name Expiration
HTTP - Response (Webmin)
HTTP - Check Remote Response
IMAP Status, Response Time
Linux - SSH Server
Linux - Partition Free Space
Linux - Current Users
Linux - Load Avg
Linux - Ping/Avail.
Linux - CPU Load
Linux - Uptime
Linux - Total Procs
Linux - NTP Offset
MySQL - Server Status
Ping - External
Proliant - CPUs	X
Proliant - Fans	X
Proliant - Free Memory	X
Proliant - PSUs	X
Proliant - Temp	X
SMTP - Status, Response Time
Solaris - Package DB
Solaris - Partition Free Space
Solaris - Current Users
Solaris - CPU
Solaris - Load
Solaris - NTP Offset
Solaris - Processor
Solaris - Free RAM
Solaris - SMF Service Status
Solaris - SWAP
Solaris - Uptime
Solaris - Sun SPARC Hardware
Switch - Ping
Switch - Telnet
TCP - Bacula FD/SD

Stuff Not Monitored With Nagios

Here's a comparison between what I would like to monitor on Nagios, and what the other contestants can monitor (i.e. stuff I haven't implemented in Nagios yet).
D: Default on
A: Default on, automatic configuration
I: Installed by default (but not enabled)
C: Custom script possible / relatively easy
X: Impossible or very time-consuming (more so than Nagios)
R: Requires Client Server/Service reconfiguration/customization
U: Untested
3: Third-Party Plugin
W: Work-In-Progress (Either OEM or Third-Party)

Nagios	Hyperic HQ	GroundWork OS	OpenNMS
Alerting - Create Ticket
Asterisk - Status
Asterisk - Load
Asterisk - Trunk Status
Dell - Hardware	W(3)
IPcop - Status
IPcop - Interface Status
IPcop - VPN Status
MythTV - Backend Status
MythTV - Tuner Status
Printer - Status
Printer - Alerts
Status checks on Remote Subnet with no gateway
Xen - domU Performance
Xen - domU Status

Cross-System Compatibility

Hyperic HQ: Hyperic HQ has the ability (non-default) to use custom scripts as plugins, OR to use Nagios plugins.

Hyperic HQ

Part I - Installation

Set up a Xen virtual machine running OpenSuSE 10.3 base packages (3 hours, including some server problems, some Xen problems, and some time learning Xen administration from the CLI).
Download hyperic-hq-installer-3.2.0-607-x86-linux.tgz from Hyperic and extract.
Browse to http://support.hyperic.com/confluence/display/DOC/Full+Installation+Guide
cd into hyperic-hq-installer and run ./setup.sh -full
1. The installation can't be run as root (though I assumed it would need root privileges).
2. I selected to install all 3 components - Server, Shell, and Agent.
3. Well, whoops! Sorta stupid to not allow installation as root, when the default location to install to is /home/hyperic. How do they expect an arbitrary user to install there? Even worse, it appears that the default OpenSuSE 10.3 installation doesn't come with sudo (!!!!) so I can't try that.
4. As root, create /home/hyperic and chown to my user.
5. Repeat the above steps (well, hopefully not all of them).
6. Default ports for everything - web GUI on 7080, HTTPS web GUI on 7443, jnp service on 2099, mbean server on 9093.
7. Change domain names in default URLs to logical ones for my test environment (no real DNS, just IPcop hosts, so devel-hyperic1.localdomian). I hope that I can change these later, or even better that absolute paths aren't used too much, as this will screw with my idea of using SSH port forwarding for remote access.
8. Leave the default SMTP server alone and change it later - I don't even have mail running here at the apartment.
9. Use the built-in PostgreSQL database with default port of 9432.
10. Go with the defaults for everything after this.
11. Everything runs nicely, and then it tells you to login to another terminal as root and run a script. I'm not sure I like this method, but I guess it works. Login and do it.
12. How will it start the builtin database? As my user???? Yup. postgres is running as my user. Wonderful. Nothing in the install document mentioned user creation. Was this just assumed? Because in the naive world I live in, most installer scripts (think Nagios) create a user for you, or tell you to.
13. Setup script complete. A few instructions follow...
Run /home/hyperic/server-3.2.0/bin/hq-server.sh start... as my user. Note to self: setup a user for Postgres and Hyperic. Believe it or not, but it booted - but followed with the message, "Login to HQ at: http://127.0.0.1:7080/"
Browsed to http://devel-hyperic1:7080 and was greeted by a startup page, saying that the server was 18% finished booting. My, I yearn for little C binaries and a PHP frontend.
Page turns blank and stops there. I refresh, and get a login page. I enter my username and password, and get a little message box where the "invalid password" box usually is - says "Server is still booting". This is over a minute later. I'm happy to see Apache/Coyote1.1, but would like to be able to get into Hyperic in less time than it takes the machine to boot to a graphical login screen (ok, granted, I'm running XFCE). In SuSE's YaST Xen Monitor, I see that the VM is at 45% of its 464MB RAM, and 90% CPU - with 8.5% consumed by dom0.
CPU usage for the VM drops to 1% and I login again. BAM! Hyperic HQ. Aside from the fact that it shows NO resources... oh... start the Agent.
Start the Agent on the VM running Hyperic. It asks me for the server IP address. What, no DNS? I enter the IP as it is... for now. I keep everything at defaults, including using the hqadmin username and password. Successfully started.
BAM! In Dashboard, I see the auto-discovered host with the right hostname, as well as Tomcat, Agent, JBoss, and PostgreSQL. Amazing! Click "Add to Inventory".
Check out the "Resources" -> "Browse" screen. It knows this machine is OpenSuSE 10.3, and I see my four services (listed above). Of course, no metrics yet, but I see the correct IP, gateway, DNS, vendor (SuSE), kernel version, RAM, architecture, and CPU speed.
Looking through the "Inventory" screen, I see everything - NICs and MACs, running servers and one service (a CPU resource). What more could a man want in...let's see.. just over an hour!
I really *love* the "Views" screen which, even out-of-the-box, allows "Live Exec" information from cpuinfo, df, ifconfig, netstat, top, who, and more.
Well, it's 03:35, and I have work and class tomorrow. I think it's time to give Part I a rest. But first...
Go to the "Platform" page for my one machine and... YES! Graphs are starting to appear!
Following the suggestion here, I enable log and config tracking on the platform for /var/log/warn and /etc/hosts, respectively.
Before I call it a night (now 03:42), I stop back at the downloads page and grab the Linux x86 Agent for the dom0 machine, hoping to get some physical information as well. While I'm at it, I grab the Linux AMD64 Agent to try on my laptop. I create "hyperic" users on each system. On the base Xen server, I give it a shot and get "Unable to register agent: Error communicating with agent: Unauthorized". Same thing on the laptop.
Did a little reading here. Although I had kept all of the defaults, it turns out that both clients had firewalls blocking TCP port 2144. I opened it up on both, and also set the IP address (that the server uses to contact the client) to the correct ones. Voila! Now I have 3 clients connected, and gathering data for the next ~16 hours until I have time to check it out again.
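
A blocked agent port like the one above can be confirmed from the server side before digging through agent logs. The following is only a minimal sketch, not anything Hyperic ships: it assumes the agent listens on TCP port 2144 as described above, and the host names in the example loop are placeholders for whatever machines run an Agent.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host names; 2144 is the agent port mentioned in the text.
    for host in ("xenmaster1", "laptop"):
        state = "open" if tcp_port_open(host, 2144) else "blocked or closed"
        print(f"{host}:2144 is {state}")
```

A closed result cannot distinguish a firewall from an agent that simply is not running, so it only narrows the problem down.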

More to come in Part II tomorrow - actually doing something with Hyperic. For now (04:08), time to sleep.

Part II - Configuration

Unfortunately, I haven't had much time to play with Hyperic in the two days since installation. The most I've really done is setup Agents on my laptop, desktop, and the host machine (both dom0 and domU for Hyperic), so that they start to collect data.

While I found a lot of upsetting stuff in the features list (see below), I decided to go ahead and add some other devices. On the network at the apartment, I have two manageable switches (a Linksys and a 3Com) - which pretty much make up the sum of non-host equipment. I also have an IPcop box, though I assume the standard Linux Agent will handle that. The one item missing that I have at home is my set of APC SmartUPS UPSs with SNMP cards, but I guess I'll just have to skip them for this review.

First, I went in and added a platform (Resources->Browse, Tools Menu->Add Platform) for the 3Com switch (a SuperStack II Switch 3300). It showed successful creation - but nothing else. I went in and entered the SNMP community string, IP, and version (1). In about a minute or so, I started to see metrics - Availability, IP Forwards, IP In Receives, and IP In Received per Second. While it's quite basic, that's good for a starting point. While the [http://support.hyperic.com/confluence/display/DOCSHQ30/Network+Device+platform Network Device Platform] documentation lists lots of metrics that can be enabled, I'd also like telnet availability and - my big one, since I use a "cute" (crappy) IPcop installation for local DNS - a dig on DNS to make sure the entry is there. In the Monitor screen, I was able to enable a bunch of additional metrics (by clicking on the "Show All Metrics" link), though there's also no way (that I can find) to monitor the status of individual ports.
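
For reference, the kind of DNS check wished for above is small when written as a standalone script. This is a rough sketch under a few assumptions: it uses the machine's configured resolver rather than querying a specific server the way dig can, the exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL), and the host name and address in the example are made up.

```python
import socket
import sys

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_dns(hostname, expected_ip=None, resolver=socket.gethostbyname):
    """Return (exit_code, message) for a single forward lookup.

    The resolver argument exists so the lookup can be swapped out in tests.
    """
    try:
        address = resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return CRITICAL, f"DNS CRITICAL - {hostname} did not resolve ({exc})"
    if expected_ip is not None and address != expected_ip:
        return CRITICAL, f"DNS CRITICAL - {hostname} is {address}, expected {expected_ip}"
    return OK, f"DNS OK - {hostname} resolves to {address}"


if __name__ == "__main__":
    # Hypothetical entry; substitute a name the local DNS should know about.
    code, message = check_dns("switch1.localdomain", expected_ip="192.168.1.2")
    print(message)
    sys.exit(code)
```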

Next, I browsed through the "Administration" pages, setup a few users, and started setting *way* more default metrics for various platforms, services, and servers. While I don't have mail running yet, that will come this weekend. While I added a lot of things as "Default On", I still need to go back and add more things in the templates as Indicators.

I also added some escalations, though they're quite simple - you can notify HQ users or "other users" by email or SMS, write to SysLog, or suppress alerts for 0 minutes to 24 hours. Hopefully I'll also find a plugin for Asterisk integration. One striking omission is user groups. Also, the concept of "Roles" (maybe their idea of groups?) is only available in the Enterprise version.

At this point, I also notice one other major issue, though perhaps I'll find a solution in my experimentation - there doesn't seem to be a way to set up default alerts for metrics. If they have all of this platform, server, and service information defined as default templates, why not just have a way to assign default users (and groups) to these objects, and have default alerts generated?

In terms of Apache 2.2 monitoring, out-of-the-box, nothing worked. No metrics at all. Firstly, Hyperic requires the mod_status module. Personally, I'd rather handle all of that through a backend, like Nagios. Secondly, it got the pidfile and apache2ctl paths wrong. Furthermore, it has no "smart" checking for resources - while my Apache 2.2 resource config was clearly wrong (wrong PID file path, no mod_status), Hyperic didn't detect this and was showing the resource as "Down".
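
To give an idea of what mod_status provides once it is enabled: its machine-readable view (the status page requested with ?auto) is a list of "Key: value" lines. The sketch below parses that format into a dictionary. It is an illustration only, not how the Hyperic plugin works internally, and the URL in the example assumes the status handler is mapped to /server-status on localhost, which depends on the Apache configuration.

```python
import urllib.request


def parse_server_status(text):
    """Parse mod_status '?auto' output into a dict.

    Numeric values become floats; anything else (such as the Scoreboard
    string) is kept as text.
    """
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'Key: value'
        value = value.strip()
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            metrics[key.strip()] = value
    return metrics


if __name__ == "__main__":
    # Assumed location of the status handler; adjust to the local config.
    url = "http://localhost/server-status?auto"
    with urllib.request.urlopen(url, timeout=5) as response:
        status = parse_server_status(response.read().decode("ascii", "replace"))
    print(status.get("BusyWorkers"), status.get("IdleWorkers"))
```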

After that, I setup a bunch of alerts for things that I thought would be off-kilter a lot (like WARN log entries on my laptop, high memory usage on some stressed machines, etc.) as well as log and config file monitoring and alerts for them. While I didn't have mail working yet, I figured I might as well get that stuff running.

On the Xen dom0 host that runs the Hyperic VM (box called xenmaster1), I wasn't able to add config file tracking for any of the /etc/xen/ files. At this point I notice some serious shortcomings - not only is it not possible to define a template of alerts for a given platform/server/service, it's also impossible to define a template for alerts. I also noticed that it's not possible to define groups of contacts. This wasn't much of a problem for my test installation - the alerts are only going to my roommate and me - but it would surely be an issue in any larger setting.

At this point in configuration, I come to a make-or-break point. With some of these shortcomings, I really need a way to call a script with alert information when an alert is generated - whether it's to dial out through Asterisk or just automatically create a ticket for the problem.
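
The sort of hook meant here could be as small as the sketch below, which turns alert details into an email for a ticket system that accepts tickets by mail. Everything specific in it is an assumption: how (or whether) Hyperic would pass alert details to a script is not established above, so the function simply takes them as arguments, and the addresses and SMTP host are placeholders.

```python
import smtplib
from email.message import EmailMessage


def build_ticket(resource, metric, value, threshold,
                 to_addr="tickets@example.com", from_addr="monitor@example.com"):
    """Build an email that an email-to-ticket gateway could turn into a ticket."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["From"] = from_addr
    msg["Subject"] = f"[ALERT] {resource}: {metric} = {value} (threshold {threshold})"
    msg.set_content(
        f"Resource:  {resource}\n"
        f"Metric:    {metric}\n"
        f"Value:     {value}\n"
        f"Threshold: {threshold}\n"
    )
    return msg


if __name__ == "__main__":
    # Placeholder values; a real hook would receive these from the monitoring system.
    ticket = build_ticket("xenmaster1", "Load Avg", 7.2, 5.0)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail server
        smtp.send_message(ticket)
```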

Adding alerts is a cumbersome process. You have to browse to a page for a specific metric - which means going to the page for a specific platform, server, or service - and then opening the page for that metric. The actual alert creation takes up two pages - one for the metric, threshold, and time-based criteria, and a second for who to alert. This means that to add alerts for a machine, you need to view the platform page as well as the services and servers pages, and each metric therein.

What's Missing / What I Don't Like

No plugin for HP Proliant hardware, even in HyperFORGE.

No plugin for Xen dom0 monitoring.
No notion of contact groups.
As of yet, no plugin for Dell hardware management.
Plugins have to be written in Java (though there is an option to run a script as a plugin)
No plugin for LDAP
Apache 2.2 plugin didn't detect the right paths for my default SuSE installation, and requires mod_status.

Looking at the Product Comparison Chart, the Open Source version is missing some key features. Among the features that HQ Enterprise (usually with Gold or Diamond subscription) has which the basic HQ (F/OSS variant) lacks are:

global alert templates for groups of resources
multi-conditional alerting
recovery alerts (?!?!?!)
performance baselining for alert thresholds
time-of-day alerting - this is a big problem for me.
automatic corrective actions for alert responses
RBAC
external authentication support with LDAP or Kerberos
(resource) group alerts
SNMP traps
resource-type alerts
No apparent way to add a service check (Script or Nagios Plugin) to a resource (especially Network Device) that executes on the Hyperic server.

What's There that I love

Just install an Agent, and it detects your system and starts monitoring. Period. That's it. No need to setup config files or specify lots of information.
Web GUI works flawlessly with SSH port forwarding, and allows me to use multiple tabs for concurrent tasks.
The Web GUI provides easy, usually intuitive settings and administration.
With the platform/server/service templates, you can get default monitoring metrics within a few minutes, with little to no work.
Intuitive alert escalations requiring little to no prior knowledge.
Builtin RSS feeds of Alerts, control actions, etc.
Amazing automatic graphs of many Indicators
Built-in tracking and auditing of log files and config files

Plugin Architecture

As far as I can tell, all of the plugin checks execute as Java on the client. Therefore, while it can monitor the status of, say, an Apache configuration, it doesn't test actual connectivity. If someone mis-configures a firewall and blocks port 80, you won't be alerted (unless you set up alerting for requests per minute lower than a given threshold).
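
For contrast, a check that runs on the monitoring server and actually requests a page over the network would catch the blocked-port case. The following is a minimal sketch of such an outside-in check, not a Hyperic feature: the return codes follow the usual Nagios plugin convention, the warning threshold is an arbitrary default, and the URL in the example is a placeholder.

```python
import time
import urllib.error
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def check_http(url, timeout=5.0, warn_seconds=2.0):
    """Fetch url from this machine and return (exit_code, message)."""
    # Bypass any configured proxy so the check exercises the direct path.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        return CRITICAL, f"HTTP CRITICAL - {url} returned {exc.code}"
    except OSError as exc:  # refused, timed out, unresolvable, ...
        return CRITICAL, f"HTTP CRITICAL - {url} unreachable ({exc})"
    elapsed = time.monotonic() - start
    if elapsed > warn_seconds:
        return WARNING, f"HTTP WARNING - {url} returned {status} in {elapsed:.2f}s"
    return OK, f"HTTP OK - {url} returned {status} in {elapsed:.2f}s"


if __name__ == "__main__":
    # Placeholder URL for a web server on the monitored network.
    code, message = check_http("http://devel-hyperic1/")
    print(message)
    raise SystemExit(code)
```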

Subjective Review

Overall, it seems that HyperFORGE has very few plugins - between the base install and HyperFORGE, it's way behind Nagios, and what I've come to expect as a Nagios user. More importantly, writing a plugin that would take 20 minutes for Nagios becomes a time-intensive procedure.

Performance isn't wonderful compared to lighter solutions like Nagios. My test installation, at 5 platforms, 29 servers, and 290 services (4 agents) was collecting 274 metrics per minute. It was in a Xen VM with 464 MB dedicated memory, 10GB HDD, on a system with a 2.4GHz P-4. It's running OpenSuSE 10.3 as dom0 and this domU, as well as a minimal CentOS domU. Running normally with no GUI activity, I'm seeing 96-98% memory usage, 70MB swap usage, and load averages of about 0.50. Not exactly hopeful for scaling up to my whole network, given my minimal hardware resources. For comparison, my Nagios machine runs on a 1GHz box with 1GB RAM, and uses about 20% of RAM normally, even with Apache, MySQL, Nagios, and Munin.

The lack of LDAP authentication is a problem, as I've been planning on moving all of my systems and applications to LDAP, and hopefully moving everything web-based to an SSO architecture. I also find the lack of recovery alerts and time-of-day alerting to be especially troubling, both since Nagios has them, and because I rely heavily on them - such as disabling performance alerts when machines are running backups, or disabling all alerts during my router's weekly restart period (0400-0415 Sunday).
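
To make the time-of-day requirement concrete, here is a small sketch of the logic involved, using the router restart window mentioned above. It is an illustration of the idea rather than how Nagios or Hyperic implement it, and the window list is a hypothetical structure.

```python
from datetime import datetime, time

# (weekday, start, end); weekday follows datetime.weekday(): Monday=0 ... Sunday=6
QUIET_WINDOWS = [
    (6, time(4, 0), time(4, 15)),  # router restart, Sunday 04:00-04:15
]


def alerts_suppressed(now, windows=QUIET_WINDOWS):
    """Return True if 'now' falls inside any quiet window (end time exclusive)."""
    return any(
        now.weekday() == day and start <= now.time() < end
        for day, start, end in windows
    )


if __name__ == "__main__":
    if alerts_suppressed(datetime.now()):
        print("inside a quiet window, alert dropped")
    else:
        print("alert would be sent")
```

Windows that cross midnight would need to be split into two entries with this representation.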

Also, while the GUI makes setup largely a breeze, I still need to define alerts and actions manually for every host, along with manually specifying the thresholds. Moving my Nagios configuration with 17 hosts and 300+ services over would be a big pain. The Enterprise version offers a lot more, but the Open Source version should at least let me define templates for resources that specify default alerts for a list of services/servers (Nagios allows this) so that, for example, I could specify all alerts I want for any Linux host that has those services running.
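
The template idea is simple enough to sketch: alerts are defined once per type of resource and then expanded for every host that carries that type. The structure below is hypothetical and only illustrates the concept; the metric names echo the checks listed earlier on this page, while the thresholds and host names are placeholders.

```python
# Alert definitions per resource type: (metric, comparison, threshold).
# Thresholds are placeholders, not recommendations.
ALERT_TEMPLATES = {
    "linux": [("Load Avg", ">", 5.0), ("Partition Free Space %", "<", 10.0)],
    "apache": [("Requests per Minute", "<", 1.0)],
}


def expand_alerts(hosts, templates=ALERT_TEMPLATES):
    """Expand templates into one (host, metric, comparison, threshold) per alert.

    'hosts' maps a host name to the list of template names that apply to it.
    """
    alerts = []
    for host, names in hosts.items():
        for name in names:
            for metric, comparison, threshold in templates[name]:
                alerts.append((host, metric, comparison, threshold))
    return alerts


if __name__ == "__main__":
    hosts = {"web1": ["linux", "apache"], "db1": ["linux"]}  # placeholder hosts
    for alert in expand_alerts(hosts):
        print(alert)
```

With 17 hosts, changing a threshold then means editing one template entry instead of 17 alert definitions.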

Bottom Line

Hyperic is extremely easy-to-use, has a wonderful GUI, and an easy all-in-one Agent-based setup. Unfortunately, Agent plugins are much more difficult to develop than Nagios scripts, and there are many fewer available than for Nagios. It has little support for hardware-level monitoring (i.e. HP or Dell servers). It provides easy graphical trending and statistics, the type of setup that would take weeks to achieve with Nagios combined with MRTG or Cacti. Furthermore, in my testing, without even reading the manual I was able to get a level of monitoring in an hour that would have taken me days by editing Nagios config files by hand. While the default metrics are quick to implement, it seems that many major features are restricted to the Enterprise version.

The Bottom Line

In terms of extendability, you just can't beat Nagios. With enough time and sweat, you can do anything. With other tools, you get various levels of ease-of-use and ease-of-configuration (many times orders of magnitude faster) at the expense of reduced functionality and customizability. It's a matter of weighing the time savings of more advanced tools against the features that you need.
