SMART Study Mode

A point-by-point examination of SpinRite's S.M.A.R.T. System Monitor
The information below will make much more sense if you have already read the preceding SMART technology introduction page. If you have jumped ahead, please consider reading that short page first.
If SpinRite's “DynaStat” screen is the heart & soul of SpinRite's data recovery,
then its S.M.A.R.T. System Monitor is the heart & soul of SpinRite's equally
important long-term drive maintenance and failure prediction capability.

#1: The SMART specification defines a wide range of event and health attributes which might apply to any mass storage device (they even work with today's solid state drives, or SSDs). Because the SMART standard has been around since the mid-1990s, not all attributes make sense for modern drives. So drives will pick and choose which attributes make sense for them, and SpinRite will display and analyze whatever any given drive considers to be important. (The parameters shown for this 2 terabyte drive are typical and useful.)

SMART attributes reported by SpinRite (as shown above near #1):


ecc corrected : (Error Correction Code Used) This is probably the single most important and useful parameter for determining a drive's health status. Modern drives incorporate extremely sophisticated error correction capabilities which allow them to recover the originally written data even when it can no longer be read as it was written. This is done by appending a block of sophisticated “checksum” data to the end of each data sector to allow the original state of unknown missing bits to be reconstructed. Although error correction has always been present in hard drives, it has gradually evolved from being used as an exception, to being used much more routinely and even continuously because the limits of today's modern drives have been set with the assumption that all small errors can be easily corrected. So data no longer needs to be perfectly stored. This also means that the amount of on-the-fly error correction being performed can divulge a great deal of valuable information about a drive's true health status.

rd chan margin : (Read Channel Margin) This parameter reflects the available operating margin (headroom, leeway, etc.) present in the drive's data reading electronics. The SMART specification makes it available, but this is one of those attributes that has generally fallen into disuse as drive technology has evolved. We had room to display it, and if it were used it would be important, so it was included just in case.

relocated sect : (Relocated Sectors) As drive storage densities have skyrocketed, data bits have become so small that microscopic imperfections in the drive's storage platters have become large enough to swallow large numbers of bits. The read/write head passing over the surface slightly deforms the platter, and over time can either induce new defects or increase the size of existing defects. (This is how new defects appear where there were none before.) When the number of bits being lost inside a defect becomes worrisome (there is a limit to how many bits the drive's ECC error correction can reconstruct) the drive will spontaneously relocate the data from a sector, which is now considered to be defective, to a spare location. Since drives have a limited pool of available spare sectors, this SMART data parameter provides some indication of the diminishing size of the drive's spare sector pool.

realloc events : (Reallocation Events) This is closely related to the previous “Relocated Sectors” attribute. Drives usually display one or the other, so SpinRite makes both available.

recal retries : (Recalibration Retries) Early drives used a “stepping motor” head positioning system that could sometimes “misstep” to deliver the drive's read/write heads to the wrong location. When this occurred, the drive would decide that it was lost. So it would retract its heads back to track zero (a known location) to “recalibrate” their position. Modern drives employ “servo feedback” technology that can directly read the head's current location without returning home. It is not known whether the “recalibration retries” attribute will have any value in modern drives, but SpinRite includes it in case it might be used and useful.

cabling errors : An often overlooked source of apparent disk drive errors is not the drive at all, but the interconnecting cable that attaches every drive to its external interface or motherboard. Although these errors are rare, many SpinRite users have been surprised when SpinRite alerted them to this unusual trouble in their system. Because it can be a hidden source of errors, drives typically count and report their occurrence . . . and SpinRite definitely keeps an eye on them!

uncorrectable : (Uncorrectable Sectors) This is the drive's (and drive owner's) worst nightmare. The term “uncorrectable” implies that “error correction” was tried but failed. As mentioned above, a sector's missing bits can only be corrected up to a certain point, based upon the ECC algorithms used inside the drive. If a sector is read while an error is still small enough for all of its missing bits to be determined, but the size is large enough to “frighten” the drive about the severity of the error, the sector's corrected data will be immediately relocated to safety. But if the sector has not been read in a long time, and a latent error has been allowed to grow too large, error correction will be impossible and the drive will need to fail the read request and report the sector as uncorrectable, which means unreadable. (Many parts of SpinRite were designed to deal with even this dire situation, but that's another topic.)

write errors : Although write errors are far less common than read errors, it is possible for a drive to be unable to locate the sector whose data it has been asked to overwrite and replace. When this occurs it is unable to successfully accept and record (write) the data it has been given. Drives may report these problems through the SMART system using this attribute.

#2: Each SMART attribute consists of a “health” value and a separate “threshold” value. The idea is that, for any attribute pair, the closer the health value falls toward the threshold, the more worried we should be. After examining many thousands of drives spanning a decade of production and use, SpinRite has evolved a simpler interpretation and display of these values: SpinRite subtracts the “threshold” value from the “health” value and displays the difference. This conveniently shows the one thing we really care about, which is how far above the danger threshold the drive's health currently stands for each parameter. This simple subtraction converts two confusingly interrelated parameters into just one meaningful result, where more is better and zero is bad. (And the difference CAN go negative . . . which means really bad!)
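The arithmetic can be sketched in a few lines of Python, written so that the result is how far the health value stands above the threshold. This is only an illustration, not SpinRite's code; the function name `smart_margin` and the sample readings are hypothetical.

```python
def smart_margin(health: int, threshold: int) -> int:
    """Return how far the health value stands above the threshold.

    Positive means headroom remains, zero means the threshold has been
    reached, and negative means the attribute has fallen below it.
    """
    return health - threshold


# Hypothetical readings, not taken from any real drive.
print(smart_margin(health=100, threshold=36))  # 64
print(smart_margin(health=36, threshold=36))   # 0
print(smart_margin(health=30, threshold=36))   # -6
```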

So now we have just one “zero-based” value which we can use to represent the current state of health for each SMART attribute. The numerical display (shown just below the red #2 above) displays that current number and the maximum value we have seen for it during THIS run of SpinRite. This is useful because the entire SMART system provides its most useful feedback only while the drive is actually being used. (Any drive can sit there and spin. The question is, can it also write and read what it has written?) When this screen snapshot above was taken, the drive's reported “ecc corrected” parameter had dropped from a high of 114 (probably seen just before SpinRite began) to a current value of 110. Although it still has a long way to go to get down to zero, as we will see later, this is a cause for some concern.
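One way to picture the "current" and "max" pair is a small tracker that remembers the highest margin seen during a run. The sketch below is an assumption about how such bookkeeping could work, not a description of SpinRite's internals; the class name `MarginTracker` is hypothetical, and the 114 and 110 readings are the ones quoted above (the 112 in between is invented).

```python
from typing import Optional


class MarginTracker:
    """Remember the current margin and the highest margin seen this run."""

    def __init__(self) -> None:
        self.current: Optional[int] = None
        self.maximum: Optional[int] = None

    def update(self, margin: int) -> None:
        """Record a new reading and raise the maximum if it was exceeded."""
        self.current = margin
        if self.maximum is None or margin > self.maximum:
            self.maximum = margin

    def has_dropped(self) -> bool:
        """True when the current margin is below the run's maximum."""
        return self.current is not None and self.current < self.maximum


tracker = MarginTracker()
for reading in (114, 112, 110):
    tracker.update(reading)

print(tracker.current, tracker.maximum, tracker.has_dropped())  # 110 114 True
```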

#3: The bar-graphs shown in the region labelled #3 above, with the heading “margin”, form a visual representation of the two numeric values shown immediately to the right of each bar. Each bar is anchored at a value of zero (0) on the left, and the value for max sets the bar's “scale” at the far right. Finally, the current value sets the amount of cyan color filling the bar from the left toward the right. In the snapshot above, the amount of cyan coloring has been pulled back toward the left and reduced, revealing the RED colored squares beneath. This provides a visual indication that, for its corresponding SMART health attribute, the current value has fallen from a maximum value recently seen. (It's showing RED because that's never good news.)

#4: Restating what was explained in #3 above, any RED squares, which may appear at the far right of the bar-graph, are “revealed” as the bar-graph's cyan coloring is “pulled back” toward the left any time the current SMART health attribute drops to a value below the maximum that SpinRite has witnessed during this run on the drive.
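The bar behavior described in #3 and #4 can be imitated with a simple text rendering. This is a rough sketch under assumed details: the 20-cell width, the `#` (cyan) and `R` (red) characters, and the function name `render_margin_bar` are all hypothetical, and SpinRite's actual scaling may differ.

```python
def render_margin_bar(current: int, maximum: int, width: int = 20) -> str:
    """Draw a text bar: '#' for the cyan fill, 'R' for revealed red cells."""
    if maximum <= 0:
        return "R" * width  # no positive margin has been seen at all
    current = max(0, min(current, maximum))  # clamp to the bar's scale
    filled = round(width * current / maximum)
    return "#" * filled + "R" * (width - filled)


print(render_margin_bar(114, 114))  # ####################
print(render_margin_bar(110, 114))  # ###################R
```

With the bar anchored at zero and scaled to the maximum, even a small drop from 114 to 110 exposes a red cell at the right-hand end.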

#5: Every SMART attribute has an associated “blob” of binary data which SpinRite dutifully displays under the “raw data” heading (see #5 above). This data represents the drive's private internal scratch pad which it uses to perform various SMART-related work. Though the drive's ultimate goal is the distillation of this into the various SMART “health” indices we've been describing, SpinRite does much more with this data. After analyzing hundreds of drives of all makes and models, a series of “heuristic” algorithms were developed to extract meaning from these binary blobs. The exciting and worthwhile result is the topic of items 6 & 7 which follow:

#6: The primary data SpinRite extracts from the drive's raw data blobs is the count of the underlying events which the drive then reduces into the simplified SMART health attribute. As we've seen in points 1 through 4 above, the drive's own feelings about its current state of health provide an important guide. But in simplifying the past into a single “health” number some useful information is not preserved. Thus, the data appearing in the column under “error count” are the actual counts of respective events of the types described in section 1 above. But this intriguing information presents us with a dilemma: What's a high number? What should it be? When should we worry? The next section goes a long way toward answering that question . . .
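As a very rough illustration of turning a raw blob into a count, the sketch below assumes the common ATA layout in which each attribute carries a 6-byte raw field, and further assumes a drive that stores one plain little-endian counter there. Neither assumption holds for every drive (vendors are free to pack several values into those bytes), and SpinRite's own heuristics are not described on this page; the function name `raw_to_count` is hypothetical.

```python
def raw_to_count(raw: bytes) -> int:
    """Interpret a 6-byte SMART raw field as one little-endian integer.

    Assumption: the drive stores a single counter in this field.
    """
    if len(raw) != 6:
        raise ValueError("expected exactly 6 raw bytes")
    return int.from_bytes(raw, byteorder="little")


# Hypothetical blobs, not read from a real drive.
print(raw_to_count(bytes([0x2A, 0x00, 0x00, 0x00, 0x00, 0x00])))  # 42
print(raw_to_count(bytes(6)))                                     # 0
```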

#7: Region #7 of the SMART System Monitor contains three columns: “minimum”, “error rate”, and “maximum.” All three columns are error rates, so the center label serves as both a reminder and as a label implying “current error rate” where the others are minimum and maximum. Any “rate” must have “units” to confer meaning to the data, which is what makes this final display so powerful and important, because these numbers are calculated in counts of their respective events per million sectors of data transferred. That gives the numbers real meaning.
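The three columns can be modelled as a running minimum, current, and maximum of an "events per million sectors" figure. The following is a minimal sketch, assuming the rate is evaluated once per completed block of sectors; the names `events_per_million` and `RateTracker` are hypothetical, and the 286,000 figure is only a stand-in for the "more than 28.6%" region discussed further down the page.

```python
from typing import Optional

MILLION = 1_000_000


def events_per_million(events: int, sectors: int) -> float:
    """Normalize an event count to events per million sectors transferred."""
    if sectors <= 0:
        raise ValueError("sectors must be positive")
    return events * MILLION / sectors


class RateTracker:
    """Track the minimum, current, and maximum rate seen during a run."""

    def __init__(self) -> None:
        self.minimum: Optional[float] = None
        self.current: Optional[float] = None
        self.maximum: Optional[float] = None

    def add_block(self, events: int, sectors: int = MILLION) -> None:
        rate = events_per_million(events, sectors)
        self.current = rate
        self.minimum = rate if self.minimum is None else min(self.minimum, rate)
        self.maximum = rate if self.maximum is None else max(self.maximum, rate)


tracker = RateTracker()
for events in (8_323, 286_000, 103_901):  # 286_000 is an illustrative stand-in
    tracker.add_block(events)

print(tracker.minimum, tracker.current, tracker.maximum)  # 8323.0 103901.0 286000.0
```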

Putting it all together

What you will see below are common sense rules of thumb that we and SpinRite's many users have developed over decades of watching drives live and die. As you'll see, this is “data” rather than “conclusions”, so the data's interpretation is up to us. But it is only by running SpinRite across a drive that this data can be obtained, both to allow the drive to set its SMART health attributes and to allow SpinRite to collect this potentially valuable information.

Given everything you now know, consider this:

The reason why running SpinRite over a drive forms the best preventive maintenance
possible is that it is only by attempting to read its own data (which the drive needs
us to ask it to do) that it can itself detect early signs of trouble and take corrective
action . . . before the trouble grows too severe to be corrected without loss of data.

It's obvious when you think about it. Modern drives that store, for example, 2 terabytes of data in 512-byte sectors will contain more than 3.9 BILLION sectors. (Even if they use the newer “jumbo” 4k-byte sectors, that's still close to HALF A BILLION sectors!) Until the drive tries, it has NO IDEA whether it can actually read any particular one of those sectors. Such large drives may have written the sector a year before and NEVER had even a single occasion to read it back. So whether it's actually able to read it correctly is anyone's guess . . . until it tries.
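The sector counts quoted above are easy to check. This short calculation assumes a decimal terabyte (10^12 bytes), which is how drive capacities are normally marketed; the function name `sector_count` is just for illustration.

```python
def sector_count(drive_bytes: int, sector_bytes: int) -> int:
    """Number of whole sectors that fit in a drive of the given size."""
    return drive_bytes // sector_bytes


TWO_TERABYTES = 2 * 10**12  # assumes decimal terabytes

print(f"{sector_count(TWO_TERABYTES, 512):,}")   # 3,906,250,000
print(f"{sector_count(TWO_TERABYTES, 4096):,}")  # 488,281,250
```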

So, running SpinRite over a drive serves a very important dual purpose: It allows the drive to read and assess the condition of every single sector of data it contains and, thanks to SpinRite's SMART System Monitor, we are able to watch it work, peering into the drive's operation to develop the best possible sense for the drive's otherwise unknowable and undetectable overall condition.

So . . . 
If the screen shown above was of SpinRite running on YOUR
drive, what does this tell you, and how should you feel?

Seeing RED: For starters, the appearance of RED coloration at the right-hand end of any SMART parameter bar-graph is a clue that things are not all wonderful. 100% healthy drives will generally not experience any “depression” of any of their SMART attributes, no matter how aggressively the drives are used. As was mentioned above, the SMART system only provides useful feedback when a drive is placed under stress and being asked to do difficult things. Unfortunately, what has sometimes become difficult for modern drives, because they have been pushed right up to their theoretical limits, is reading their own data!

Consequently, the lowered “health” status for the ecc corrected attribute above indicates that the drive itself is being surprised by how much error correction it is being forced to use to successfully read its own sectors of data. That's probably not good.

On the other hand, SpinRite processes the SMART data so that a current health index of zero represents the drive's SMART attribute finally being lowered to the “danger threshold” level. So the fact that, from where we started, at “114” we are still well above zero at “110” indicates that things can likely get much worse before we really have cause to worry.

Also, the relocated sectors attribute has not dropped at all. And assuming that the “raw data” for relocated sectors (shown beneath the #5 above) represents the actual count of relocated sectors, even if only recently, it is showing an encouraging value of all 0's. From this we would conclude that while the drive is perhaps somewhat surprised that it is needing to work harder than it expects to read all of its own sectors, none of them have, so far, been worrisome enough to induce it to relocate any to a safer location.

The Error Counts: Turning to the “error count” column (#6 above) we see some large numbers which will probably be continually increasing as SpinRite moves through the drive. Why do we think they'll be increasing? Because the minimum rate of errors per million sectors of data transferred (shown in the next column over) is in the many thousands for each. This tells us that there has never been a block of a million sectors transferred where fewer than eight or five thousand ecc or seek errors were encountered. So it seems likely that this will continue to be the case.

But as the owner of this drive, you would be wondering: "Okay, so far we have more than 187 MILLION total sectors that needed correcting. Is that a low number, a medium one, or a high one?" Since different makes, models, and generations of drives are often entirely different internally, it's difficult, if not impossible, to be sure what a high number is for any specific single drive. Focusing just on these counts for a moment, we need something to compare these numbers with. And we have two possibilities: another drive of the same make and model, or this drive in the past:

Because internal drive technology may vary dramatically, any inter-drive comparisons need to be made within the same generation, if not identical drives. Since it's not uncommon for people to purchase more than one of a type of drive, this becomes feasible. Of course, it would be best to have three, so that an “outlier” differing significantly from the other two, could be detected.
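The "three drives" idea can be sketched as a comparison against the median of the group. Everything here is an assumption made for illustration: the function name `find_outliers`, the factor of 3 used to decide what counts as "differing significantly", and the sample counts are all hypothetical.

```python
from statistics import median
from typing import Dict, List


def find_outliers(counts: Dict[str, int], factor: float = 3.0) -> List[str]:
    """Return drives whose count differs from the group median by `factor`."""
    typical = median(counts.values())
    outliers = []
    for name, count in counts.items():
        if typical == 0:
            if count > 0:
                outliers.append(name)
        elif count > factor * typical or count < typical / factor:
            outliers.append(name)
    return outliers


# Hypothetical final error counts from three drives of the same model.
final_counts = {"drive_a": 1_900_000, "drive_b": 2_100_000, "drive_c": 9_500_000}
print(find_outliers(final_counts))  # ['drive_c']
```

With only two drives the median sits halfway between them, so neither can be singled out; that is why a third drive helps.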

But even if an owner only has a single drive of one type, keeping an eye on that one drive's SMART System Monitor numbers over time provides a valuable indication of the drive's condition. If, at the end of every SpinRite run, the final counts are recorded, the drive's owner will have an easy-to-interpret record of how much of what happened when. As long as those numbers remain relatively constant over time—pretty much independent of what they are—and including when the drive is new, it would be a generally safe bet that, even if in an absolute sense they are high, that's what they are supposed to be for that particular drive.

Similarly, if after several runs of SpinRite, over the course of a year or two or more, those error counts were to suddenly increase dramatically, there is probably no better early warning predictor available. If a drive which, for years, expended a relatively uniform level of effort reading its data, were to suddenly require significantly more effort to do exactly the same amount of work . . . at a minimum, that's a drive whose critical data you want to have well backed up, and probably run SpinRite on more often.
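Keeping a record of final counts makes this kind of check mechanical. The sketch below compares the latest run with the median of earlier runs; the function name `sudden_increase`, the doubling threshold, and the sample history are hypothetical choices made for illustration, not rules taken from SpinRite.

```python
from statistics import median
from typing import Sequence


def sudden_increase(history: Sequence[int], latest: int, factor: float = 2.0) -> bool:
    """True when the latest count exceeds `factor` times the historical median."""
    if not history:
        return False  # nothing to compare against yet
    baseline = median(history)
    if baseline == 0:
        return latest > 0
    return latest > factor * baseline


# Hypothetical final "ecc corrected" counts (in millions) from earlier runs.
previous_runs = [180, 187, 185, 190]

print(sudden_increase(previous_runs, latest=192))  # False
print(sudden_increase(previous_runs, latest=600))  # True
```

Using the median as the baseline means the absolute size of the numbers does not matter, only whether the drive has departed from its own history.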

The Error Rates: If you've been reading this closely, and have already studied the screen above, you may have noticed the single biggest concern revealed by SpinRite's careful processing of the drive's SMART data: the huge difference between the minimum and maximum rate of ecc corrections (per million sectors of data transferred).

Up to the point where this picture was taken, SpinRite had encountered a one million sector region requiring only 0.83% of its sectors to be corrected (8,323 sectors out of one million). But it also encountered another one million sector region where more than 28.6% of those sectors needed correcting to be read correctly. That's more than one out of every four . . . and there's a story there somewhere because that's a difference of nearly 35 to 1 in the trouble the drive had with two different one-million sector regions of the drive. And also notice that even the most recent one-million sector region processed incurred 103,901 corrections for those one million sectors, so about 10.4% of them needed correcting.
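The percentages and the ratio follow directly from the per-million counts. In this quick check, 8,323 and 103,901 are the figures quoted above, while 286,000 is only a stand-in for the "more than 28.6%" region, whose exact count is not given in the text.

```python
MILLION = 1_000_000


def percent_of_million(count: int) -> float:
    """Express a per-million-sector count as a percentage of those sectors."""
    return 100 * count / MILLION


lowest = 8_323
recent = 103_901
highest = 286_000  # stand-in for "more than 28.6%"; exact value not stated

print(f"{percent_of_million(lowest):.2f}%")   # 0.83%
print(f"{percent_of_million(recent):.2f}%")   # 10.39%
print(f"{highest / lowest:.1f} to 1")         # 34.4 to 1
```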

It would be really wonderful to know how large (long, actually) each of those required error corrections was. The drive knows. But the most we're able to obtain from the SMART system is their count, along with an assessment of how the drive feels about it in the corresponding overall “health” parameter (which it did lower slightly). Perhaps the errors were predominantly only one or two bits long, which could be chalked up to random occurrences caused by the fact that modern drives have been pushed so close to the edge that they are relying upon error correction to obtain reliability. (28.6% of one million sectors requiring correction pretty much demonstrates that assertion!)

The BEST ADVICE for the owner of this drive would probably be to immediately re-run SpinRite on the drive and compare the second pass results to that first pass. Another of the huge benefits of running SpinRite at “Level 4” on a drive is that after each sector has been read and corrected, it is freshly re-written, actually twice: first with all of its bits inverted, and a second time with them inverted again, thus putting them back the way they were. What SpinRite's users have learned is that a second pass usually returns much lower error correction numbers . . . which conclusively demonstrates the concept of a drive's data aging and becoming less easily, and reliably, read. People lacking our users' experience are often skeptical. But it's easy enough to demonstrate conclusively.
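The invert-then-invert-again rewrite can be modelled in a few lines to show why the data ends up unchanged. This is a toy model only: the "disk" is a Python dictionary, the names `invert_bits` and `refresh_sector` are hypothetical, and nothing here reflects how SpinRite actually talks to a drive.

```python
from typing import Dict


def invert_bits(data: bytes) -> bytes:
    """Flip every bit in the given bytes."""
    return bytes(b ^ 0xFF for b in data)


def refresh_sector(disk: Dict[int, bytes], lba: int) -> None:
    """Rewrite one sector twice so every bit is written in both states."""
    original = disk[lba]                  # read the sector's current data
    disk[lba] = invert_bits(original)     # first write: every bit flipped
    disk[lba] = invert_bits(disk[lba])    # second write: flipped back


# Toy "disk" holding a single hypothetical sector.
disk = {0: b"\x00\xffSpinRite"}
before = disk[0]
refresh_sector(disk, 0)
print(disk[0] == before)  # True
```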

If re-running SpinRite WERE to show a dramatic reduction in errors encountered, that might give the drive's owner some sense for how often SpinRite should be run on that drive for preventive maintenance. Although we have no way of knowing how long any of those errors are, we can at least see their count, and get some estimation of how that number increases the longer SpinRite is not run on the drive.

What does it all mean?

The SMART system is obviously not a panacea or oracle that answers all questions. In fact, as we've seen above, it can leave us with more questions than we initially had! But even with much left up to the judgement and experience of SpinRite's owner, this SMART System Monitor provides a crucially valuable view into the drive's inner workings.

It is also worth mentioning that we chose the screen above because it serves as a useful but not excessively alarming tutorial. It is typical of what SpinRite's users might encounter. However, many users have reported much more frightening displays consisting of entirely SOLID RED bar-graphs and explicit (dire) warnings from SpinRite of impending doom and imminent drive failure (which SpinRite does produce at times when there's virtually no doubt about what the numbers portend).

Most owners initially purchase SpinRite to help with a data
recovery emergency. But then, after seeing it in action, they
use it preemptively to prevent any future data loss.

Gibson Research Corporation is owned and operated by Steve Gibson.  The contents
of this page are Copyright (c) 2016 Gibson Research Corporation. SpinRite, ShieldsUP,
NanoProbe, and any other indicated trademarks are registered trademarks of Gibson
Research Corporation, Laguna Hills, CA, USA. GRC's web and customer privacy policy.

Last Edit: May 07, 2013 at 15:35