Monitoring Hard Drive Failures through Kaseya

Does losing a client’s data keep you up at night?

As a managed services provider, one of our biggest fears is the loss of a client’s data. It is probably the single thing we worry about most, and we discuss it constantly. At Network Depot (our local MSP division), we spend considerable resources monitoring backups to ensure that if disaster strikes, we will be prepared.

One thing that seems to have eluded us is consistent monitoring of server hard drive failures. From time to time we would notice an array with a degraded or failed drive, but we were not getting notified by the standard event set monitors.

A few months ago we figured out what was happening. It turns out that Dell and HP do not treat a degraded drive as an ERROR event type; it only merits a WARNING, and we weren’t paying attention to warnings.

We created a new event set called “VA – Hard Drive Warnings” and started monitoring the System log for any events containing the word “degraded”. This started bearing fruit, and we included it in all new templates across all our systems, but it didn’t seem to catch everything.
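To make the gap concrete, here is a minimal Python sketch of that kind of keyword filter. It is not how Kaseya evaluates event sets internally; the event record layout and the function name `matches_event_set` are made up for illustration. It shows both problems: an Error-only monitor misses a Warning, and a “degraded” keyword misses events that are worded differently.

```python
# Hypothetical, simplified event records; the field names are illustrative only.
def matches_event_set(event, keyword="degraded", types=("Error", "Warning")):
    """Return True if the event is in the System log, has a watched type,
    and its description contains the keyword (case-insensitive)."""
    return (
        event["log"] == "System"
        and event["type"] in types
        and keyword.lower() in event["description"].lower()
    )

degraded = {"log": "System", "type": "Warning",
            "description": "Virtual disk degraded"}
failed = {"log": "System", "type": "Error",
          "description": "Physical disk failed"}

# An Error-only monitor never sees the Warning-level event.
error_only = matches_event_set(degraded, types=("Error",))
# Watching Error and Warning catches it.
error_and_warning = matches_event_set(degraded)
# The keyword alone still misses a failed disk, because the text differs.
keyword_miss = matches_event_set(failed)

print(error_only, error_and_warning, keyword_miss)
```

Matching on the vendor’s source and event ID instead of on a keyword avoids the second problem.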

Often as MSPs, we get so lost in the weeds that we don’t have time to step back and see the big picture. Yesterday, our CTO Benjamin spent his Sunday researching this issue, and he discovered and documented the different types of events that HP and Dell generate.

HP: If you have HP servers, you need to make sure that HP Insight Manager WBEM is installed. It is WBEM that writes these events to the Windows Event Logs. You can find a complete description of the events here: https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c04436799

Log: System
Source: HP SAS, HP SATA, HP SCSI, HP SmartArray

We included the following in the Event Set referenced below:

Source: HP SAS

Source           Event ID
HP SAS           102
HP SAS           103
HP SAS           202
HP SAS           204
HP SAS           311
HP SAS           312

Source: HP SATA

Source           Event ID
HP SATA          604
HP SATA          605

Source: HP SCSI

Source           Event ID
HP SCSI          3
HP SCSI          5
HP SCSI          8
HP SCSI          10

Source: HP SmartArray

Source           Event ID
HP SmartArray    102
HP SmartArray    103
HP SmartArray    104
HP SmartArray    202
HP SmartArray    204
HP SmartArray    206
HP SmartArray    207

Dell: If you have Dell servers, you need to make sure that Dell OpenManage Server Administrator is installed and configured. Dell OpenManage creates the event log entries.

Log: System
Source: Server Administrator

Event ID   Type            Description
2065       Informational   Rebuild Started
2158       Informational   Physical disk online
2121       Informational   Device Returned to normal
2052       Informational   Physical disk inserted
2057       Warning         Virtual disk degraded
2049       Warning         Physical disk removed
2123       Warning         Redundancy lost: Virtual disk
2050       Warning         Physical Disk is offline
2048       Error           Physical disk failed
2299       Error           Bad PHY slot

We updated our event set, and I encourage all of you to review your server policy settings or templates and add it. When you do, select the System log and be sure to check both Error and Warning.
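For anyone who wants to reason about the rule outside of Kaseya, here is a rough Python sketch of the same logic: match on source and event ID in the System log, and alert only on Error and Warning. The IDs are copied from the lists above; the dictionary layout and the function name `should_alert` are my own illustration, not a Kaseya export format.

```python
# Event IDs copied from the HP and Dell lists in this post.
# The data layout is an illustrative assumption, not a Kaseya format.
EVENT_SET = {
    "HP SAS": {102, 103, 202, 204, 311, 312},
    "HP SATA": {604, 605},
    "HP SCSI": {3, 5, 8, 10},
    "HP SmartArray": {102, 103, 104, 202, 204, 206, 207},
    "Server Administrator": {2065, 2158, 2121, 2052, 2057,
                             2049, 2123, 2050, 2048, 2299},
}

# Only these event types raise an alert, matching the advice above.
WATCHED_TYPES = {"Error", "Warning"}


def should_alert(log, source, event_id, event_type):
    """Return True when an event is in the System log, has a watched type,
    and its (source, event ID) pair is part of the event set."""
    return (
        log == "System"
        and event_type in WATCHED_TYPES
        and event_id in EVENT_SET.get(source, set())
    )


# Dell "Virtual disk degraded" is a Warning, so it alerts.
print(should_alert("System", "Server Administrator", 2057, "Warning"))
# Dell "Rebuild Started" is Informational, so it is ignored.
print(should_alert("System", "Server Administrator", 2065, "Informational"))
```

Keeping the Informational IDs in the set but filtering on type means you can turn on Informational later (for example, to see when a rebuild starts) without rebuilding the list.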

BTW, I can’t stress enough that you MUST make sure that your vendor’s management software is installed and configured correctly! Without it, you are unlikely to get these events!
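One way to spot-check a server is to ask the System log whether the vendor source has written anything at all. The sketch below is not part of our Kaseya setup; it uses the standard Windows `wevtutil` command from Python, and the helper names `build_query` and `recent_events` are hypothetical. It assumes the provider name in the log matches the source name shown above, which may differ between software versions.

```python
import subprocess


def build_query(source, event_ids=()):
    """Build an event log XPath filter for one source and optional event IDs."""
    provider = f"Provider[@Name='{source}']"
    if event_ids:
        ids = " or ".join(f"EventID={i}" for i in sorted(event_ids))
        return f"*[System[{provider} and ({ids})]]"
    return f"*[System[{provider}]]"


def recent_events(source, count=5):
    """Return the newest System-log events from `source` as text.

    Windows only. An empty result suggests the vendor's management
    software is not writing to the log on this machine.
    """
    cmd = [
        "wevtutil", "qe", "System",
        f"/q:{build_query(source)}",
        f"/c:{count}",   # limit the number of events returned
        "/rd:true",      # newest events first
        "/f:text",       # plain-text output instead of XML
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Building the filter works on any platform; running it needs Windows.
print(build_query("Server Administrator", {2057, 2048}))
```

On a Dell server you would call `recent_events("Server Administrator")`; on an HP server, try each of the HP sources in turn.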

I hope this helps! If anyone has any suggestions for improvement or has settings for IBM or Lenovo, please let me know, and I will update the post!