Monitoring Hard Drive Failures through Kaseya

Does losing client data keep you up at night?

As a managed services provider, one of our biggest fears is the loss of a client’s data. It is probably the single thing we worry about most, and we discuss it constantly. At Network Depot (our local MSP division), we spend considerable resources monitoring backups to ensure that if disaster strikes, we will be prepared.

One thing that has eluded us is consistent monitoring of server hard drive failures. From time to time we would notice an array with a degraded or failed drive, but we were not being notified by the standard event set monitors.

A few months ago we figured out what was happening. It turns out that Dell and HP seem to think that a degraded drive is NOT an ERROR event type. It only merits a WARNING, and we weren’t paying attention to warnings.

We created a new event set called “VA – Hard Drive Warnings” and started monitoring the System log for any events containing the word “degraded”. This started bearing fruit, and we included it in all new templates across all our systems, but it didn’t seem to catch every drive problem.
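To see why a single keyword falls short, here is a minimal Python sketch of the same idea. It is not Kaseya code: the function name `matches_keyword` is made up for illustration, and the sample strings are simply the short event descriptions listed later in this post, not full log messages.

```python
def matches_keyword(message: str, keyword: str = "degraded") -> bool:
    """Return True if the keyword appears anywhere in the message (case-insensitive)."""
    return keyword.lower() in message.lower()


# Short descriptions taken from the HP and Dell tables in this post.
# Real event log messages are longer; these are stand-ins for illustration.
sample_descriptions = [
    "Virtual disk degraded",
    "Predictive Failure",
    "Physical disk removed",
    "Physical disk failed",
]

# Anything in this list would slip past a "degraded"-only event set.
missed = [d for d in sample_descriptions if not matches_keyword(d)]

for description in missed:
    print("not caught:", description)
```

Only the first description contains the word “degraded”, so the other three would go unnoticed by a keyword-only filter.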

Often as MSPs we get so lost in the weeds that we don’t have time to step back and see the big picture. Yesterday, our CTO Benjamin spent his Sunday researching this issue, and he documented the different types of events that HP and Dell generate.

HP: If you have HP servers, you need to make sure that HP Insight Manager is installed. It is the Insight Manager that writes these events.

Log: System
Source: Storage Agents

Event ID   Type            Description
1216       Warning         Predictive Failure
1200       Warning         Rebuilding
1200       Informational   OK
1216       Informational   OK
1200       Warning         Recovering
1216       Error           Disk Failed

Dell: If you have Dell servers, you need to make sure that Dell OpenManage Server Administrator is installed and configured. Dell OpenManage creates the event log entries.

Log: System
Source:  Server Administrator

Event ID   Type            Description
2065       Informational   Rebuild Started
2158       Informational   Physical disk online
2121       Informational   Device Returned to normal
2052       Informational   Physical disk inserted
2057       Warning         Virtual disk degraded
2049       Warning         Physical disk removed
2123       Warning         Redundancy lost: Virtual disk
2050       Warning         Physical Disk is offline
2048       Error           Physical disk failed
2299       Error           Bad PHY slot

We updated our event set, and I encourage all of you to review your server settings and templates and add this event set. When you build it, select the System log and be sure to check both Error and Warning.
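For anyone who wants to reason about the filter outside of Kaseya, the logic of the event set can be sketched in a few lines of Python. This is only an illustration of the matching rules from the tables in this post, not Kaseya's own event set format; the names `Event`, `WATCHED_IDS` and `should_alert` are made up for the example.

```python
from typing import NamedTuple


class Event(NamedTuple):
    log: str          # e.g. "System"
    source: str       # e.g. "Storage Agents" or "Server Administrator"
    event_id: int
    event_type: str   # "Error", "Warning" or "Informational"


# Event IDs from the HP and Dell tables that appear as Warning or Error.
WATCHED_IDS = {
    "Storage Agents": {1200, 1216},                                  # HP
    "Server Administrator": {2048, 2049, 2050, 2057, 2123, 2299},    # Dell
}

# The HP IDs are also logged as Informational ("OK"), so the type must be checked too.
ALERT_TYPES = {"Error", "Warning"}


def should_alert(event: Event) -> bool:
    """Return True if the event matches the hard drive event set."""
    if event.log != "System":
        return False
    if event.event_type not in ALERT_TYPES:
        return False
    return event.event_id in WATCHED_IDS.get(event.source, set())


if __name__ == "__main__":
    examples = [
        Event("System", "Storage Agents", 1216, "Warning"),
        Event("System", "Storage Agents", 1216, "Informational"),
        Event("System", "Server Administrator", 2057, "Warning"),
        Event("System", "Server Administrator", 2065, "Informational"),
    ]
    for e in examples:
        print(e.source, e.event_id, e.event_type, "->", should_alert(e))
```

Note that HP reuses IDs 1200 and 1216 for both problem and “OK” events, which is why filtering on the event type matters as much as filtering on the ID.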


BTW, I can’t stress enough that you MUST make sure that your vendor’s management software is installed and configured correctly! Without it, these events are unlikely to be written to the event log at all!

I hope this helps!   If anyone has any suggestions for improvement, or has settings for IBM or Lenovo, please let me know, and I will update the post!