Does losing clients’ data keep you up at night?
As a managed services provider, one of our biggest fears is the loss of clients’ data. It is probably the single thing that we worry about most and discuss constantly. At Network Depot (our local MSP division), we spend considerable resources monitoring backups to ensure that if a disaster strikes, we will be prepared.
One of the things that seems to have eluded us is the consistent monitoring of server hard drive failures. From time to time we would notice an array with a degraded or failed drive, but we were not getting notified via the standard event set monitors.
A few months ago we figured out what was happening. It turns out that Dell and HP seem to think that a degraded drive is NOT an ERROR event type; it only merits a WARNING, and we weren’t paying attention to warnings.
We created a new event set called “VA – Hard Drive Warnings” and started monitoring the System log for any events with the word “degraded”. This started bearing fruit, and we included it in all new templates across all our systems, but it didn’t seem to find everything.
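To see why a single keyword can miss things, here is a minimal Python sketch. It is only an illustration, not how the event set itself works: the function name `matches_keyword` is hypothetical, and the sample strings are the short descriptions from the vendor tables later in this post (the actual event messages may be worded differently).

```python
def matches_keyword(description: str, keyword: str = "degraded") -> bool:
    """Case-insensitive substring match, similar in spirit to a keyword filter."""
    return keyword.lower() in description.lower()


# Short descriptions taken from the HP and Dell tables in this post.
samples = [
    "Virtual disk degraded",
    "Predictive Failure",
    "Physical disk removed",
    "Redundancy lost: Virtual disk",
]

caught = [s for s in samples if matches_keyword(s)]
missed = [s for s in samples if not matches_keyword(s)]

print("caught:", caught)
print("missed:", missed)
```

Only the first description contains the word “degraded”; the other three are drive problems that a keyword-only filter would let through.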
Often as MSPs we get so lost in the weeds that we don’t have time to step back and see the big picture. Yesterday, our CTO Benjamin spent his Sunday researching this issue and discovered and documented the different types of events that are generated by HP and Dell.
HP: If you have HP servers, you need to make sure that HP Insight Manager is installed. Insight Manager is what writes these events.
Log: System
Source: Storage Agents
Event ID | Type | Description
1216 | Warning | Predictive Failure
1200 | Warning | Rebuilding
1200 | Informational | OK
1216 | Informational | OK
1200 | Warning | Recovering
1216 | Error | Disk Failed
Dell: If you have Dell servers, you need to make sure that the Dell OpenManage Server Administrator is installed and configured. Dell OpenManage creates the event log entries.
Log: System
Source: Server Administrator
Event ID | Type | Description
2065 | Informational | Rebuild Started
2158 | Informational | Physical disk online
2121 | Informational | Device Returned to normal
2052 | Informational | Physical disk inserted
2057 | Warning | Virtual disk degraded
2049 | Warning | Physical disk removed
2123 | Warning | Redundancy lost: Virtual disk
2050 | Warning | Physical Disk is offline
2048 | Error | Physical disk failed
2299 | Error | Bad PHY slot
We updated our event set, and I encourage all of you to review your server settings and templates and add this event set. When you set it up, select the System log and be sure to check both Error and Warning.
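The rule behind the event set can be sketched in a few lines of Python. This is an illustration of the filtering logic only, not the monitoring tool’s actual implementation: the `Event` record and the `should_alert` function are hypothetical names, while the log, sources, event IDs and types come from the tables above.

```python
from dataclasses import dataclass

# Event sources named in the HP and Dell sections of this post.
VENDOR_SOURCES = {"Storage Agents", "Server Administrator"}
# Event types the event set should be checked for.
ALERT_TYPES = {"Error", "Warning"}


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified view of one Windows event log entry."""
    log: str
    source: str
    event_id: int
    type: str
    description: str


def should_alert(event: Event) -> bool:
    """Alert on Error or Warning events from the vendor agents in the System log."""
    return (
        event.log == "System"
        and event.source in VENDOR_SOURCES
        and event.type in ALERT_TYPES
    )


events = [
    Event("System", "Server Administrator", 2057, "Warning", "Virtual disk degraded"),
    Event("System", "Server Administrator", 2065, "Informational", "Rebuild Started"),
    Event("System", "Storage Agents", 1216, "Error", "Disk Failed"),
    Event("System", "Storage Agents", 1200, "Informational", "OK"),
]

alerts = [e for e in events if should_alert(e)]
for e in alerts:
    print(f"{e.source} {e.event_id} [{e.type}]: {e.description}")
```

The sketch filters on the event type rather than on the event ID alone because, in the HP table, the same ID (1200 or 1216) shows up as Informational, Warning or Error depending on the drive state.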
BTW, I can’t stress enough that you MUST make sure that your vendor’s management software is installed and configured correctly! Without it, it is unlikely that these events will be logged at all!
I hope this helps! If anyone has any suggestions for improvement, or has settings for IBM or Lenovo, please let me know, and I will update the post!