HW Raid monitoring

Jacob Albretsen jakea at xmission.com
Mon Mar 18 09:13:40 MDT 2013

On Monday, March 18, 2013 08:59:57 AM Nicholas Leippe wrote:
> On Mon, Mar 18, 2013 at 12:05 AM, Dan Egli <ddavidegli at gmail.com> wrote:
> > *All this discussion about raid levels and what not has brought to my mind
> > a different, if related, question. One of the reasons I like software raid
> > is that it's easy to monitor. For example, I could have a cron script that
> > runs once every 15 minutes for example and checks the status of the
> > /proc/mdstat file to ensure any raid(s) listed show status of Healthy. But
> > how do you do something like that for a Hardware raid? How can you tell,
> > for example, if drive #3 in a HW raid10 has failed? This is something I
> > honestly don't know off my head. I know many of you folks have had
> > experience with HW raid and device failures in that array. How do you
> > know?
> > There's no file you can check like mdstat, is there? I'd think this would
> > be especially important for remote hosted/co-located servers.*
> IME it's vendor specific. Some of the cards I've used had their own
> monitoring software. Others had a utility that you could use to query
> and thus write your own monitoring plugin. Some had nothing--they
> would just beep and then you'd have to use their access tool (a front
> end to their bios software) and navigate their menus to figure it out
> and deal with it--*possibly* could be automated via an expect script,
> but not easily--navigating an ncurses-type interface.
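For the software-RAID side of Dan's question, the cron check he describes can be a few lines of shell. In /proc/mdstat, each member's health shows up in brackets: [UU] is healthy, and an underscore ([U_]) marks a failed or missing member. The helper name and canned sample output below are illustrative, not any standard tool:

```shell
#!/bin/sh
# Sketch of a cron-driven /proc/mdstat check. check_mdstat is a
# hypothetical helper name, not part of mdadm or the kernel.

check_mdstat() {
    # $1: contents of /proc/mdstat; prints OK or DEGRADED.
    # A degraded array has an underscore inside its [UU...]-style
    # status brackets, e.g. [U_] or [_U].
    if printf '%s\n' "$1" | grep -q '\[U*_[U_]*]'; then
        echo DEGRADED
    else
        echo OK
    fi
}

# Demo on canned output; a real cron job would instead run:
#   check_mdstat "$(cat /proc/mdstat)"
healthy='md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks [2/2] [UU]'
degraded='md0 : active raid1 sda1[0]
      1048512 blocks [2/1] [U_]'

check_mdstat "$healthy"      # prints OK
check_mdstat "$degraded"     # prints DEGRADED
```

In practice, mdadm --monitor (with MAILADDR set in mdadm.conf) can already email you on failure events, so a hand-rolled script like this is mostly useful when you want custom alerting behavior.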

For example, Dell has monitoring tools (OpenManage) for its servers, and 
there is a Nagios plugin you can use to monitor the health of the RAID and 
other hardware.  When I got it going where I work, we quickly found two 
servers with a degraded RAID that needed fixing.  I was also able to find a 
couple of other hardware problems that Dell fixed for me.
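The query-utility approach Nick mentioned can be wrapped in a small script in the same spirit as the Nagios plugin. The sketch below names Dell's omreport as the example CLI, but the exact invocation and output format vary by vendor, controller, and tool version, and the sample output here is made up for illustration -- treat it as a template:

```shell
#!/bin/sh
# Hedged sketch of wrapping a vendor RAID CLI for cron-based alerting.
# count_bad_vdisks is a hypothetical helper; the "State :" keywords it
# greps for are assumptions about the tool's output, not a documented API.

count_bad_vdisks() {
    # $1: captured tool output; prints the number of lines reporting
    # a Degraded or Failed virtual-disk state.
    printf '%s\n' "$1" | grep -ci 'State[[:space:]]*:[[:space:]]*\(Degraded\|Failed\)'
}

# Made-up output loosely resembling "omreport storage vdisk"; in real
# use you would capture the tool's output, e.g.:
#   bad=$(count_bad_vdisks "$(omreport storage vdisk controller=0)")
sample='ID    : 0
Status: Critical
State : Degraded
ID    : 1
Status: Ok
State : Ready'

count_bad_vdisks "$sample"   # prints 1
```

A cron job could mail the operator whenever the count is nonzero, which gets you roughly the same early warning the Nagios plugin provides without running a full monitoring stack.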

More information about the PLUG mailing list