Monitoring software question

Dan Egli ddavidegli at gmail.com
Wed Oct 9 02:37:40 MDT 2013


On October 7, 2013, at 8:37pm, Brian Christiansen wrote:



> NRPE: Active Check (Nagios initiating the check)

> NSCA: passive check (the server initiating the check)



Now I'm lost. I thought Nagios ran on the server? If it does, how does NRPE
have Nagios initiate the check and NSCA have the server initiate the check?
That would be the same thing, wouldn't it?



Perhaps we have different setups in mind. In the setup I'm working on,
there is one "central server" and then a whole bunch (sounds like around
60) of "client stations" that will periodically pull workloads from the
internet, through the internet connection managed by the "central server".
What I'm looking for is to have that central server warn someone via e-mail
and other alert methods (SMS messaging, etc.) when any of the client
stations either fails to respond to a ping or no longer has an active
(non-zombied!) instance of the process running.
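
To make the process half of that concrete, here is a minimal sketch of the
per-client check I have in mind. Assumptions: "progname" is the same
placeholder as in my original post, the function names count_live and
check_proc are made up for illustration, and ps accepts the "stat" and
"comm" output fields (the procps and BSD versions do). The exit statuses
follow the Nagios plugin convention of 0 for OK and 2 for CRITICAL.

```shell
#!/bin/sh
# Hypothetical check script; "progname" is a placeholder, not a real program.

# Count non-zombie processes named $1 among "STAT COMMAND" lines on stdin.
# A state field that starts with Z marks a zombie, so those are skipped.
count_live() {
    awk -v name="$1" '$1 !~ /^Z/ && $2 == name { n++ } END { print n + 0 }'
}

# Print a one-line status and return a Nagios-style status code:
# 0 = OK, 2 = CRITICAL.
check_proc() {
    live=$(ps -A -o stat= -o comm= | count_live "$1")
    if [ "$live" -gt 0 ]; then
        echo "OK - $live live $1 process(es)"
        return 0
    fi
    echo "CRITICAL - no live $1 process found"
    return 2
}

# Example invocation: check_proc progname; exit $?
```

Note that on Linux the "comm" field is truncated to 15 characters, so a
longer program name would need to be matched some other way. This only
covers the process check; the ping side would still be a separate check.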



Would that be NRPE or NSCA?



Thanks!
--- Dan


On Mon, Oct 7, 2013 at 2:16 PM, Dan Egli <ddavidegli at gmail.com> wrote:

> Hey folks, I've got what may be a common or may be a bizarre question about
> network monitoring software.  I've heard of packages like Nagios and things
> that supposedly let you monitor various boxes from a central point. That's
> great, but my understanding of them (admittedly incomplete) is that they
> monitor the server itself (basically pinging it) and check what ports are
> open/responding. What if I have a specific program that I want to ensure
> runs on certain computers. I don't just want to throw a check into crontab
> (i.e. [ -z "$(pgrep progname)" ] && /usr/local/sbin/progname &) because if
> it DOES fail, I want to know of the failure and to go read the logs to
> determine WHY it failed. Maybe it's a hardware failure, maybe it's
> software, maybe it's just a config issue. But re-running the program isn't
> going to help if I haven't determined WHY it's crashing in the first place.
> However, the program doesn't listen() on any ports. It is more of a client
> than a server/daemon, but it's mission critical just the same. Just like
> most clients, it connects to a central location, sends and receives data,
> then closes the connection to work on what it just received. Think programs
> like SETI@home or the distributed.net client (no, it's not them, but they
> have very similar network functionality). Can Nagios (or any other network
> monitor) actually connect to the server and read its process table (or in
> some other way determine that the process is actually present)? I want some
> kind of central notification for this since the project deals with over 40
> systems each running a portion of the overall task. One program/computer
> failing won't kill the work, but it will slow it down, and in this case
> time is money.
>
>
>
> If absolutely necessary, I can set up a periodic check in crontab and have
> it ssh into the main server to set some kind of warning flag file or
> something in the event of the check determining that the program has died,
> and then have another check on the server to notice the warning, but that
> seems like I'd be reinventing the wheel when I'd think many monitoring
> programs would already have similar functionality. Why go through all that
> effort if I can find a program that will do it all for me? Besides, I'd
> still have to combine that with a general health status. If the computer
> itself ever failed, then the warning wouldn't get triggered since the ssh
> task never executed. So I'd have to combine some health monitoring software
> (to make sure the machine is actually alive and responding) with the
> scripts. That REALLY seems like overkill. I'm hoping someone knows of a
> solution that would help me avoid re-inventing the logic to warn of things.
>
>
>
> Thanks all!
>
>
> --- Dan
>

