Server Monitoring – Few Winners

10 May

As a programmer, I like to know how my applications are holding up. I like pretty graphs of response times and I really want to know if they blow up. In our department we’ve been running a very old installation of Big Brother (BB) from before Quest. I kid you not. It’s old but it works with relatively little fuss, and the sheer lack of a compelling enough competitor has kept it humming away all these years.

Still, BB is very simplistic and we’d like to step up to something from this century. We’ve literally been watching for years for a suitable Open Source replacement to emerge but nothing seems to fit. Of course, we’re familiar with Hobbit but it’s in the same vein as BB. We were never really that happy with BB but it was already in place. Inertia is a powerful force.

And just yesterday I read that Nagios has been forked. Maybe I’m not the only one unhappy with the available choices. I’m about this close to writing my own. Might be fun.

Below is more or less my personal list of gripes, minus the names of the guilty. I really have no interest in gunning down well-meaning projects. Of course, some score better than others but none seem to do it all. You’ll notice I’m mostly concerned with the server itself, since for the most part the agents work great.

Want to add to the list? What do you use to monitor your apps?

  • Slow. Most of the time, if I’m checking the site I just want to see a graph or check that a specific service is working. It shouldn’t take forever. That includes navigation – it should be easy to find historical data.
  • Use existing agents. Not every monitor needs its own agents; there are plenty out there. $new-fangled-monitor would ideally work with agents I can apt-get (with Nagios being high on that priority list).
  • All configuration for alerts, plugins and tests should be stored on the server and centrally managed.
  • Should be able to make mass changes to alerts and agent configurations, something most lack.
  • If it uses a database, it should be able to use the major Open Source databases and at least Oracle (if I’m forced).
  • It should automatically alert on obvious things. If I’ve set up a ping or HTTP test on a server I probably also want to know if it stops responding. Just allow for a way to override the default.
  • Should only alert once. I don’t really want to take the time to designate which alerts are critical and which are not. That can add up to a lot of configuration time, and I have plenty of stuff to watch. I’ll get the email and decide if it’s worth checking out right now or not. Not to mention, since one outage can cause cascading outages, I don’t want to also cause an email outage.
  • On that note, it should have easily adjustable change windows for planned maintenance.
  • Configuration should be dead simple. I have better things to do than spend all day fooling with the monitor server. I don’t mind editing text files so much, but they need to be well documented like Apache’s. The problem with text files is that oftentimes you don’t know the possible values. XML is for programs and not for end users to hand edit.
  • Should integrate with the network. Here our network is unfortunately run by mostly Windows servers. At least that means I shouldn’t have to set up users, manage passwords, etc. Single sign-on with Kerberos or NTLM is a must.
  • On that note, don’t require logins for status pages. Or at least be able to allow access for any authenticated domain user. It’s not a state secret. They’re already on the network if they reached the monitor. If they cared, they could ping the server themselves. Automated monitoring is supposed to make it easier.
  • RSS feeds or portlets and possibly some embeddable AJAX widget would be a great way to integrate with the applications and various other web servers. I’d love to have a page in my own web apps where users could check the status of various systems and progress on fixing them.
  • Give me a way to configure a page or dashboard just for stakeholders. I want to email them a URL and let them see for themselves that the application is working.
  • It should look nice, too. I’m not sure why, but most of the monitoring solutions are ugly. Again, I want to give this to the business and let them get a warm fuzzy that everything is working. It should be simple, professional and quickly communicate where problems lie. They’re not going to build their own dashboard with flashing lights and server pictures. They just want to know what broke.
  • Should have a developer’s API. Everything and everybody knows HTTP. We have great proxies and load balancers. Firewalls know all about HTTP. It doesn’t make sense to write a new protocol. Should be usable from shell scripts.
  • Should always page if an agent stops sending updates. That seems kinda basic, but I shouldn’t have to configure an alert for each and every one. Of course, still allow for a way to override the default.
  • A nice mobile page is a must. I might not be in the office or I might be upgrading my workstation again.
  • Should work through, over and under firewalls. Unbelievably, this was an issue with one I tried.
  • Speaking of dumb problems, one I tried would show a blank page if my cookie expired. I’d have to manually remove it to log in again. Not awesome. The basics are important.
  • Statically typed language. Edit: eek, that’s what I meant. I know that’s a bit controversial but this is only my personal preference. Simply put, I’m probably going to install a monitor and forget about it until it breaks. I’ve been bitten by upgrading the PHP/Python/Perl package often enough that I’d prefer something less prone to incompatible changes.
  • It should not require an agent or convoluted configuration to set up a simple HTTP test from the monitor server itself. Oddly, one I’d tried required something like 30 clicks to set up a simple ping, not to mention a lot of head-scratching.
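
Two of the items above (alert only once, and always alert when an agent stops sending updates unless I override the default) are easy enough to sketch. This is just a rough Python illustration of the behavior I have in mind, not how any existing monitor does it. The `Monitor` class, its field names and the 300-second default are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Hypothetical default: an agent is "stale" after 5 minutes of silence.
DEFAULT_STALE_AFTER = 300.0


@dataclass
class Monitor:
    stale_after: float = DEFAULT_STALE_AFTER
    # Per-agent override in seconds; None disables the stale alert for that agent.
    overrides: Dict[str, Optional[float]] = field(default_factory=dict)
    last_seen: Dict[str, float] = field(default_factory=dict)
    alerted: Set[str] = field(default_factory=set)

    def heartbeat(self, agent: str, now: float) -> None:
        """Record an update from an agent and clear any outstanding alert."""
        self.last_seen[agent] = now
        self.alerted.discard(agent)

    def check(self, now: float) -> List[str]:
        """Return agents that have just gone stale; each outage is reported once."""
        new_alerts = []
        for agent, seen in self.last_seen.items():
            limit = self.overrides.get(agent, self.stale_after)
            if limit is None:
                continue  # alerting explicitly turned off for this agent
            if now - seen > limit and agent not in self.alerted:
                self.alerted.add(agent)
                new_alerts.append(agent)
        return new_alerts
```

Every agent that has ever checked in gets the stale alert without any per-agent configuration, and a repeat call to `check` during the same outage returns nothing, so one outage means one email.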

Server should run on Linux, obviously.

Would you use something that matched that description?

6 Responses to “Server Monitoring – Few Winners”

  1. Shahbaz Javeed 11. May, 2009 at 11:33 am #

    Looks like what you need is the open source Community edition of Groundwork Monitor at http://www.groundworkopensource.com/community/downloads/. It uses Nagios as the core scheduling engine and can utilize most (if not all) Nagios plugins without any changes. As a satisfied user I can tell you that it functions well for what you need.

    As far as the need to have an agent-less setup, you can use SNMP (use the net-snmp package for RHEL/CentOS) as the agent to provide a lot of the information that you might otherwise need to log in to the machine to get. I realize agent-less and “SNMP agent” don’t seem to jibe :) but I consider SNMP to be more part of the OS than anything else since you’ll find it implemented as part of Linux, Solaris, Windows, HP/UX and any other respectable server OS. Alternatively you can always stick to logging in using ssh to gather data but that won’t work on Windows machines.

  2. Dino 11. May, 2009 at 4:20 pm #

    You can try JOPR .. Open source java project .. from Redhat community

  3. Mike Johnson 11. May, 2009 at 7:41 pm #

    Interesting, I must have missed that one. Looks like it was just released in October?

    I’ve downloaded and installed. Pretty easy so far. Although it looks and acts a lot like Hyperic. I wrote about 5 or so of the list items thinking about Hyperic. :-) Jopr does have an improved interface.

    So far seems to have some weird Firefox bug on one of my machines (can’t view the right pane). Not a big deal. Otherwise I like that it doesn’t prompt me to buy the enterprise version just to get user roles.

    Unfortunately, adding a simple http check is throwing NullPointerExceptions in the log…

    I like that the alerts configuration has an availability option. I never really grokked Hyperic’s alert. Its availability metric was a percentage so I ended up making alerts when it was less than 95% or something.

    I’ll keep playing with it. Seems better than most but there are a few wish list items left.

  4. Kevin 12. May, 2009 at 3:14 am #

    Yeah, would never use Java to monitor anything… Why have a monitoring tool that takes more resources than most of the services you would monitor?

    Anyway, BB is painful. Very painful at times. The only “great” thing about it is that you can write just about any test you want and send alerts using the ‘bb’ binary.

  5. Ian McGowan 15. May, 2009 at 4:16 am #

    This is one of those rare cases where Sturgeon’s law fails – instead of 90% of everything being crap, 100% of monitoring solutions are…

Trackbacks/Pingbacks

  1. Client-side charting for Leemba | Another guy named Mike - 19. Dec, 2009

    [...] nobody’s heard from me in quite a while because I’ve been hard at work building my server monitor. Sorry, I suck. I have been, amongst other things, trying to get charting [...]