Disk Failures

With such a small server farm at home and a small budget, disk failures are especially troubling. I rarely have spare drives around, and a single failure can wreak havoc. As a result, I try to keep on top of the latest research on storage reliability and on failure prediction and detection.

Failure Prediction

The smartmontools package is a good place to start. Its smartctl utility reads a drive's internal SMART (Self-Monitoring, Analysis and Reporting Technology) data to report on the current health of the drive.
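As a rough illustration of how that SMART data can be checked from a script, here is a minimal Python sketch that parses the attribute table printed by `smartctl -A` and lists any attribute the drive reports as failed. The function names, the `/dev/sda` default and the dictionary layout are my own assumptions for this example, not part of smartmontools. The parser assumes the usual ten-column ATA attribute table, which may differ for SCSI or NVMe devices.

```python
import subprocess


def parse_smart_attributes(text):
    """Parse the attribute table from `smartctl -A` output (ATA layout assumed)."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # Skip rows whose normalized values are not plain numbers.
        if not all(f.isdigit() for f in fields[3:6]):
            continue
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            # The raw value may contain spaces, e.g. "31 (Min/Max 20/42)".
            "raw": " ".join(fields[9:]),
        }
    return attrs


def failing_attributes(attrs):
    """Return the names of attributes that have failed or are at/below threshold."""
    return sorted(
        name
        for name, a in attrs.items()
        if a["when_failed"] != "-"
        or (a["thresh"] > 0 and a["value"] <= a["thresh"])
    )


def read_smart_attributes(device="/dev/sda"):
    """Run `smartctl -A` on a device (hypothetical default; usually needs root)."""
    # smartctl's exit status is a bitmask and can be nonzero even when the
    # table was printed, so the return code is deliberately not checked here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)
```

Calling `failing_attributes(read_smart_attributes("/dev/sda"))` from a cron job would give a simple early warning, though an empty result is no guarantee that a drive is healthy.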

Most commercial servers include some sort of additional monitoring. The Compaq/HP ProLiant servers have advanced RAID controllers that incorporate fault tolerance and monitoring. Their logs can be accessed via the hpasm package's hpimlview command.


As of this writing, storage failure analysis is a growing topic in IT, but there have been few good failure studies of large disk populations. One of the leading studies was done by Google Labs: Failure Trends in a Large Disk Drive Population. It is also available as a PDF: http://labs.google.com/papers/disk_failures.pdf

Notice - this is a static HTML mirror of a previous MediaWiki installation. Pages are for historical reference only, and are greatly outdated (circa 2009).