One of my professional duties in my past ten years was monitoring systems. Even my diploma thesis was dedicated to distributed monitoring (altough my professor sucked badly ). Apart from a few custom-programmed scripts to analyze special situations (e.g. proxy clusters) I used tools that fellow administrators will find familiar: Nagios and Cacti. And another less famous text-configuration-based monitoring tool called Cricket. That worked well somehow but Cricket was hard to learn for my coworkers and Cacti seems unreliable and fundamentally broken in terms of SNMP checking. Besides why do I have to set up availability checking in Nagios and set up checking of the same parameters in another software to draw graphs? Then in 2009 I came across an open-source software I hadn’t heard of before: Zabbix. And although it has a few rough edges it seems way more professional than other common tools (the commercial tools I saw were even worse than the open-source variants). I tried it and after a lot of reading and trying it looks like it has a good potential to replace Nagios and Cacti. So I thought I’d sum up my personal experiences with all of these tools.
Their makers claim that it’s the “industry-standard in IT infrastructure monitoring“. Honestly it’s a great tool but considering how many years it has been existing it barely evolved. During my diploma thesis in the year 2000 I wrote an alternative software that I called “MrNetwork” that dealt with flaws that Nagios hasn’t even fixed today. Still Nagios is a tool I have used for many years and it is very reliable.
- open source
- large community
- many powerful plugins (and own plugins are easy to create: just write a program that prints a one-line string and set a certain return code)
- easy-to-use web frontend
- debugging plugins is moderately simple.
- many thought-out features like host groups or notification options that make your life easier
- dependencies (so that you don’t get 100 alerts if a router between the Nagios server and other servers went down)
- nagvis plugin with a great interactive editor that draws nice management-suitable graphs (although I found the ndo2db interface hard to set up at first and a little flaky)
- The focus is on availability checking – you don’t get fancy graphs on the values that are monitored (e.g how the CPU load was over time). So you’ll need a second tool and set up the same checks there just to get graphs. But availability percentages are computed automatically.
- Textual configuration that has so many different settings that you need to look up the parameters often. A web-based configuration would probably be better (and is available as an add-on but I haven’t tested it).
- Third-party plugins are often very badly programmed and barely documented that it appeared easier to reinvent the wheel. (“Look, ma, I can has plugins.”)
- Some views on the web interface are not very obvious (e.g. clicking on the title of a host group gives a nice view of all hosts with all services).
- Many plugins don’t have corresponding configuration entries so you have to find out how they work and write configuration entries yourself (and those which are preconfigured take some archeology to find out which parameters they expect). This is a huge time-waster for beginners. And in your services configuration you have to verify your checks configuration to understand the meaning of each parameter. Or do you remember what “check_http 80!john!doe!10!30!body” is supposed to check?
- Every set of parameters of a certain plugin requires a distinct configuration entry. The plugins have dozens of configuration switches that you may need one day. Want to set a timeout on HTTP checks? Write another check configuration. Want to check for a certain string in the HTTP response? Write another check configuration. And so on.
- Most checks are run from the Nagios server itself (the NRPE plugin to do the checks on the respective remote systems somehow refused to work properly here) which is suboptimal and puts a lot of load on the server.
- By default every alert triggers a notification. So if you can’t define proper dependencies (e.g. if you want to check your web server in all 30 supported languages and there is some logical error) then you will get spammed with alerts.
As Nagios does not support plotting graphs of the monitored values I was in need of another piece of software. Basically Cricket is a frontend to RRD (which stores data in a rotating/round-robin file that keeps data of the last X minutes/hours/days). It has a textual configuration that takes a lot of getting used to. It’s main principle is inheritance of settings – they call it “configuration tree“. Which means you have a master DEFAULTS file that contains general settings like how to query SNMP. In a subdirectory you define a certain class of devices that you want to monitor – e.g. routers (the DEFAULTS are inherited to this level). Within the routers directory you can just define a list of routers you want to monitor. All settings are inherited from “above” (parent directories). It’s more a geek tool for shell lovers.
- very quick to monitor a large set of similar devices once the general device class is defined
- simple web interface
- very reliable
- can monitor SNMP values (it does this very well) or execute external scripts – thus can be easily extended
- flexible graphing – you can sum up values of two graphs into a new graph (aka “mtargets” – multiple targets)
- different check frequencies can be configured for different subtrees through cron (by default values are collected every 5 minutes – this can be set as low as one minute if needed)
- the textual configuration is error-prone (leading to funny Perl errors that can be hard to debug)
- users may expect to see all parameters of a certain device instead of all devices having a certain parameters (“Give me the statistics of router42″ instead of “Let’s see the temperature of all routers we have.”)
- customizing the graphs (drawn by RRD tool) isn’t trivial
- Frequency of checks is by default 5 minutes. Before RRD can draw the first value of a graph it needs three values. So you’ll be waiting 15 minutes before you see any results.
- RRD rounds data by default. So the yearly graph doesn’t show the peaks that the daily graphs do. (This can be fixed by not graphing the average values but the maximum values.) Long-time archiving of graphed data is not possible without throwing away the RRD files and manually customizing them. Changing the monitoring frequency (aka “heartbeat”) is not possible either without throwing away the data and starting from scratch either.
- No proper built-in alerting in case certain thresholds are exceeded.
Another frontend to RRD – and a pretty sophisticated one. Nearly everything is configured through its web interface. And the result is beautiful. It’s not entirely reliable though and SNMP support (at least in version 0.8.7b) is a big fail. I like Cacti because its user interface is much better than Cactis but it’s less reliable and flexible.
- Fine-grained permissions system. So a certain user may get read-only access to a certain subtree.
- The tree where graphs are placed can be configured freely so you get exactly the view you want.
- Doesn’t hide the RRD magic very well. The user is easily confused by templates, data sources and the like.
- Graphing sometimes just stops working for no reason or values are missing although the server isn’t overloaded and other software doesn’t show such outages. According to a quick search on the lazyweb I’m not the only one with such effects.
- Setting up many systems means a lot of clicking in the web interface. Setting up new kinds of checks (aka “templates”) means even more clicking and is very error-prone.
- The quality of some third-party templates I tested was pretty bad. Creating new templates is tedious, error-prone, frustrating and close to black magic. Nothing for the casual user at least.
- Doesn’t handle SNMP correctly (this is the biggest fail in my opinion and makes it unusable here). Although it knows how to query indexes (e.g. ifDescr to get the names of your network interfaces) it just seems to stored fixed OIDs. So once the SNMP tables change the order or number of items (which isn’t unusual) then suddenly other parameters get graphed.
- Frequency of checks is by default 5 minutes. Increasing the frequency leads to missing data and wrong results.
- As it uses RRD and needs 3 valid values you won’t see that your monitoring fails until you wait 3×5=15 minutes. Not suitable for impatient non-smokers like me. :)
- Debugging failed checks is close to impossible. If a check fails then I find myself clicking around randomly trying to find typos because the alternative is digging around in database entries.
- The web interface is sometimes confusing. A refresh of SNMP tables is done by clicking a unmeaning green circle icon. Adding new items to a list is done by clicking an inconspicious “Add” link that doesn’t even look like a link.
- Another UI confusion: graphs are created from the “devices” view. But they are deleted from the “graph management” view.
- No alerting in case certain thresholds are exceeded. Another tool like Nagios would still be needed to notify you.
- Can’t sum up multiple targets so monitoring failover clusters doesn’t work well.
- RRD averages data when putting daily values into weekly values, weekly into monthly and monthly into yearly. So the yearly graph doesn’t show the peaks that the daily graphs do. (This can be fixed by not graphing the average values but the maximum values.) Long-time archiving of detailed graphed data is not possible (RRD). Changing the monitoring frequency (aka “heartbeat”) is not possible either without throwing away the data and startin from scratch.
People pointed me to Zenoss which is supposed to offer the same functionality as other monitoring systems but is much more integrated. So this short list is more a quick one-day-experimental expression than a thorough analysis. But in the end much of the fuss is just marketing.
- Beautiful web interface
- Nice gimmicks like google maps integration to show you where your servers are down worldwide. Only makes sense when monitoring networks with several/many remote locations.
- Can partly discover parameters of systems automatically. That works for static routes (although I wonder why the heck I want to monitor static routes), file systems and network interfaces. On the other hand processes can’t be discovered automatically and have to be set up manually.
- Does not come with a specific agent but plays rather well with plain old SNMP.
- Can monitor Windows through native WMI.
- Large fanbase.
- Web interface feels slow (Zope is bloated)
- Opaque operation. It does not tell what’s actually going on. You can add monitoring and check back later if anything worked they way you want.
- Questionable reliability. In a test here I was monitoring a running process. The process was said to be down and suddenly went to “up” after a while although nothing had changed on the system.
- Just one dashboard. Several dashboards similar to Zabbix “screens” would be nice.
- Configuration and data is spread across MySQL, the internal Zope database storage and RRD files on disk.
- The dependency graph is a nice Flash-based applet displaying how the systems are connected. But it does not show any details about the systems aside from whether they are up or down. And clicking on a host does not take you anywhere but center on the system. Beautiful but it could do so much more than just look beautiful.
- Limited open-source version. Full version needs to be paid for.
I’m using the backported Zabbix 1.8.2 on Debian Lenny here. Debian Lenny’s native 1.4 version lacked some interesting features like proper SNMP handling. Zabbix seems close to the perfect monitoring system I had always dreamt of. I would have designed it differently in some aspects though.
- Availability Monitoring (like Nagios) and graphing (like Cacti) is combined into one tool.
- Highly configurable. User John may just get an SMS for problems of high severity during the weekend and on weekdays get a Jabber message. Even automatic actions like restarting services can be set up.
- The notifications actually help the person who gets the message. “Low disk space on /var on web5″ with an additional comment is pretty helpful even when sent via SMS. Notifications are completely customizable with macro variables.
- Very performant. A Zabbix agent can be installed on the systems (available for several operating systems – even for Windows) which gathers the information on each system efficiently. The agent can even call scripts or shell one-liners to gather information. This kind of data collection is very efficient. You will need a server with a good I/O performance and a lot of RAM though so that the database can work efficiently. At first I virtualized Zabbix on a Debian server on a VmWare server with 1 GB of RAM. The database access became so slow that showing graphs or recent events made me store at a busy mouse cursor for up to a minute. On a server with 4 GB of RAM, a large MySQL key buffer and SAS disks the system runs well again.
- Collecting items (gather information about system parameters) happens at set intervals. You don’t have to wait for several minutes until you see results (it usually takes half a minute). Each item can have a custom check interval. So you can check for the CPU load every 30 seconds but check the number of free inodes on /home just once an hour.
- Fast web interface.
- Sophisticated monitoring of web sites. Zabbix can follow a path of simulated mouse clicks on a web site and check for functionality and response time.
- Real-time graphs. Values are by default collected every 30 seconds. You quickly see where you are going.
- Permissions system. Certain users can be limited to certain views.
- Gathered data is stored in a database (MySQL, PostgreSQL, SQLite) instead of an unflexible RRD file. Storage periods (aka “history”) can be configured freely. Backing up the database is all there is to be done.
- Templates (that can even link to further templates) save time in setting up many checks.
- Graphs (plots of values over time) can be customized like which items are plotted and in what way. Even pie charts are possible.
- Even the parameters that don’t get an explicit graph can be graphed at any time. E.g. the agent has tracked the CPU load on a system that you never cared about then you can graph that with one mouse-click.
- Screens and slide shows can be used for high-level views (aka “dashboards”) or to be displayed on a big geeky display. They can combine textual display of the status as well as clocks, ad-hoc graphs or predefined graphs.
- Very flexible trigger expressions. For example you can tell a trigger to fire if the average system load over the last 15 minutes is above a certain value. As all measured parameters are stored in a backend database you can use all kinds of mathemical expressions. Like firing a trigger if the average number of running processes during the last half hour is above 50. All other software I tested just has access to the last value gathered.
- Alerting/notifications can be scripted easily by using shell scripts.
- Remote monitoring made easy by using a Zabbix proxy.
- Paid support and paid custom programming available. But the software is completely open-sourced.
- 320 page PDF manual with screenshots and nice references. (Although I’d personally prefer an online help. Currently the “?” link within the web interface just points to the PDF that you can download.)
- A lot of mouse moving and clicking is required to set up things. For example setting up an alert if the free space on a disk on a certain server is getting too low then you need to set up hosts, items, triggers and actions. Some of the clicking seems redundant. E.g. I didn’t find a way to create triggers automtically for a set of checks. If I monitor how full the “/home” partition is then I’d like to set a threshold in the same configuration step.
- Takes a little patience understanding the concepts (because there is no hidden magic) although they make sense after half a day.
- The web interface is crammed full of features. For casual users it’s confusing to navigate. In a real-life network you find yourself setting host variables, juggling templates and unless you remember everything you did you will not get a good overview of your configuration. Zabbix is very complex but in my opinion it would need an even better web interface to deal with its featurs properly.
- The map editor was close to unusable at first but has improved in version 1.8. It still takes a lot of time to set up the maps. The map editor could really use fewer mouse clicks to set up the map. I was also missing a feature to add current item values to the map (like the server room temperature or bandwidth on our load balancer). You can just add triggers which occupy a lot of space on the map, too.
- You can just return one value per item. Sometimes you need to return a good/bad value plus some additional information. Nagios for example delivers a return code for OK/WARNING/ALERT and also a text string. In Zabbix this are different items.
- Zabbix gets painful when you want to monitoring different assets of the same kind. Different network interfaces, disk partitions, MySQL instances or web server ports. Templates are pretty useless here. You will have to copy every item and trigger.
- Does not detect the available assets in a monitored server automatically. Imagine that you want to monitor the space on all disk partitions on a system. You will have to copy over or create the check items manually or define all possible checks in a template and disable those you don’t need. Cacti handles that better by offering you a list of partitions to monitor. Zenoss can do that partly. The zabbix agent should be able to handle such a service discover automatically. (The built-in “Discovery” feature rather seems to detect new servers in a given network range automatically. But that’s something different.)
- Hard to debug. Why was an action not run? Who would get alerted for a certain trigger? Why has a value become ‘unknown’ without a reason? Of course there are reasons for what Zabbix does. But it often takes clicking and guessing instead of telling the user.
See my series of Zabbix screencasts if you like to learn more.
See also: Ben Rockwood’s blog Further similar software I didn’ test thoroughly: Hyperic and Opsview. All of the above tools are great. I’m not meaning to say that “Zenoss is total crap” for example. The differences are subtle. And whether a piece of software suits your needs really depends on your expectations. I love that all this software is available as open-source. And a totally unscientific but fun analysis of the community is counting the number of active people in the respective channels on the Freenode IRC network:
- #nagios: 133 users
- #cacti: 58 users
- #cricket: 2 users
- #zenoss: 54 users
- #zabbix: 61 users
Either Nagios has the largest fanbase or perhaps that means that the majority of people needs help with it. :)