Monitoring software question

Dan Egli ddavidegli at gmail.com
Mon Oct 7 02:46:48 MDT 2013


Hey folks, I've got what may be either a common or a bizarre question about
network monitoring software.  I've heard of packages like Nagios and things
that supposedly let you monitor various boxes from a central point. That's
great, but my understanding of them (admittedly incomplete) is that they
monitor the server itself (basically pinging it) and check what ports are
open/responding. What if I have a specific program that I want to ensure
runs on certain computers? I don't just want to throw a check into crontab
(e.g. [ -z "$(pgrep progname)" ] && /usr/local/sbin/progname &) because if
it DOES fail, I want to know of the failure and to go read the logs to
determine WHY it failed. Maybe it's a hardware failure, maybe it's
software, maybe it's just a config issue. But re-running the program isn't
going to help if I haven't determined WHY it's crashing in the first place.
However, the program doesn't listen() on any ports. It is more of a client
than a server/daemon, but it's mission critical just the same. Just like
most clients, it connects to a central location, sends and receives data,
then closes the connection to work on what it just received. Think programs
like SETI@home or the distributed.net client (no, it's not them, but they
have very similar network functionality). Can Nagios (or any other network
monitor) actually connect to the server and read its process table (or in
some other way determine that the process is actually present)? I want some
kind of central notification for this since the project deals with over 40
systems each running a portion of the overall task. One program/computer
failing won't kill the work, but it will slow it down, and in this case
time is money.
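
To make the question concrete, here is a rough sketch of the kind of check I'd want a monitor to run on each box. It assumes the Nagios plugin convention of reporting status through the exit code (0 = OK, 2 = CRITICAL, 3 = UNKNOWN). The names check_proc and progname are placeholders of mine, not part of any existing package.

```shell
#!/bin/sh
# Sketch only: report whether a named process is present, without
# restarting it. Exit codes follow the Nagios plugin convention
# (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).

check_proc() {
    name="$1"
    if [ -z "$name" ]; then
        echo "UNKNOWN - no process name given"
        return 3
    fi
    # pgrep -x matches the process name exactly and exits non-zero
    # when nothing matches.
    if pids=$(pgrep -x "$name"); then
        echo "OK - $name is running (pids: $(echo $pids))"
        return 0
    fi
    echo "CRITICAL - $name is not running"
    return 2
}

# Hypothetical usage as a standalone plugin:
#   check_proc progname; exit $?
```

The point is that the check only reports the failure; it never restarts the program, so the logs are still there to read afterwards.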



If absolutely necessary, I can set up a periodic check in crontab and have
it ssh into the main server to set some kind of warning flag file or
something in the event of the check determining that the program has died,
and then have another check on the server to notice the warning, but that
seems like I'd be reinventing the wheel when I'd think many monitoring
programs would already have similar functionality. Why go through all that
effort if I can find a program that will do it all for me? Besides, I'd
still have to combine that with a general health status. If the computer
itself ever failed, then the warning wouldn't get triggered since the ssh
task never executed. So I'd have to combine some health monitoring software
(to make sure the machine is actually alive and responding) with the
scripts. That REALLY seems like overkill. I'm hoping someone knows of a
solution that would help me avoid re-inventing the logic to warn of things.
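
For reference, the home-grown version described above would look something like the following sketch. Everything named here (progname, mainserver, /var/tmp/procflags, and both function names) is a made-up placeholder, and it assumes key-based ssh from each worker to the main server.

```shell
#!/bin/sh
# Sketch only; all names and paths below are hypothetical.
PROG="${PROG:-progname}"
MAIN_SERVER="${MAIN_SERVER:-mainserver}"
FLAG_DIR="${FLAG_DIR:-/var/tmp/procflags}"

# Worker side, run from cron: if the program is gone, drop a flag file
# named after this host on the main server.
raise_flag_if_dead() {
    if ! pgrep -x "$PROG" >/dev/null; then
        ssh "$MAIN_SERVER" "touch '$FLAG_DIR/$(hostname)'"
    fi
}

# Server side, also run from cron: print every host that raised a flag.
# Note the gap: a machine that is completely down never raises a flag,
# so this alone cannot replace a basic liveness check.
list_flagged_hosts() {
    for f in "$FLAG_DIR"/*; do
        [ -e "$f" ] && basename "$f"
    done
    return 0
}
```

Even this small version needs two cron jobs, ssh keys on 40+ machines, and a separate liveness check, which is exactly the wheel I'd rather not reinvent.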



Thanks all!


--- Dan

