We added these commands in response to inefficiencies (and in some cases firmware bugs) associated with the built-in self-test functions found in most SCSI and Fibre Channel disk drives. We wanted to provide a tool that would scan the entire disk and produce a report of all errors (or warnings/retries) by block number. The administrator and storage vendor can then analyze and correct the most common errors, such as unrecoverable read/write errors due to a failed sector, without having to re-run the self-test after repairing each bad block. (Self-tests report only one error, then stop.)
Like the self-tests described in the Self-Test Diagnostics ANSI section, all of these tests are safe to run in a live environment with user I/O running in the background. Because the scrubbing tests described in this section are controlled by the host, there is additional overhead: one I/O per chunk read, where a chunk is a single block of 512, 520, or whatever block size the disk uses (with -scrub) or 32 blocks (with -scrubq). Covering every block on the drive this way generally takes 30 minutes to several hours, even on a lightly loaded system.
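The host-side overhead is simple arithmetic: one read command per chunk. The following is a minimal sketch of that calculation, using the 143374740-block drive from the Example Results in this section. The function name `io_count` is illustrative and is not part of smartmon-ux.

```python
def io_count(total_blocks: int, chunk_blocks: int) -> int:
    """Number of read commands needed to cover every block once."""
    # Integer ceiling division: a partial final chunk still costs one I/O.
    return (total_blocks + chunk_blocks - 1) // chunk_blocks


TOTAL_BLOCKS = 143_374_740  # block count reported for the ST373405FC example

print("-scrub  I/Os:", io_count(TOTAL_BLOCKS, 1))
print("-scrubq I/Os:", io_count(TOTAL_BLOCKS, 32))
```

Under these assumptions, -scrubq issues roughly 4.5 million commands where -scrub issues about 143 million, which is why the chunk size dominates the run time.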
If you have to test multiple drives, it is best to run multiple instances of the program concurrently. CPU overhead is almost zero. The bottleneck is your disk I/O channel.
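One way to run several instances at once is to start one process per disk and wait for all of them. This is a sketch only: the helper name `scrub_concurrently` is hypothetical, and the device paths in the commented usage are placeholders for your own devices.

```python
import subprocess
from typing import Dict, Sequence


def scrub_concurrently(program: Sequence[str], devices: Sequence[str]) -> Dict[str, int]:
    """Start one process per device, wait for all, and return exit codes by device."""
    # Launch everything first so the tests overlap instead of running back to back.
    procs = {dev: subprocess.Popen([*program, dev]) for dev in devices}
    return {dev: proc.wait() for dev, proc in procs.items()}


# Hypothetical usage (paths and device names are placeholders):
# codes = scrub_concurrently(["./smartmon-ux", "-scrubq"], ["/dev/sg9", "/dev/sg10"])
# for dev, code in codes.items():
#     print(dev, "exit code", code)
```

Because the bottleneck is the disk I/O channel, disks that share one channel may not finish much sooner when tested together.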
-scrubq | Initiates a full media read test with a 32-block chunk size.
-scrub | Initiates a full media read test with a 1-block chunk size.
-scrubr | Pseudo-random read test using the SEEK(10) SCSI command.
-scrubs | Sequential read fitness test using the SEEK(10) SCSI command.
-scrubv | May be combined with -scrub or -scrubq to set verbose mode, so that errors, percentage complete, and remaining time are displayed as the test runs.
-scrubt | Terminates any fitness test on the first error and causes the program to return error code #11 (SCRUB_T_ERR). -scrubt must be combined with -scrub or -scrubq.
-16 | May be combined with any of the above options to use the 16-byte SCSI commands READ(16) and WRITE(16).
-12 | May be combined with any of the above options to use the 12-byte SCSI commands READ(12) and WRITE(12).
Notes:
• If -scrubv is used without either -scrubq or -scrub, -scrubv assumes -scrub was entered and immediately begins the test.
• All options record errors in the event log, and each error line includes the make/model and device name of the disk.
• Only one disk is tested at a time. To test multiple drives concurrently, launch extra instances of the program and point each of them to a different disk, or to a different range of disks using wildcards.
• The scrubbing tests are not limited to disk drives. They may be run on optical media such as CDs and DVDs, as well as on ATAPI (IDE) devices. You would do this to perform an optical media certification, which ensures that every block of the CD/DVD is readable without errors. (If you find a problem, do not bother trying to remap it on a read-only device.)
• Running the scrub tests on optical devices also uncovers and reports other hardware problems, even if the drives are IDE.
• As of this version of the documentation, we have not tested remapping DVD R/W media when a defect is found. It should work, but we do not currently have the means to test it.
• These tests can be run with peripherals set to any block size up to 2048 bytes. However, your host operating system or SCSI/Fibre Channel controller may not recognize disk drives formatted with 520-byte or 528-byte blocks.
• The scrub tests terminate prematurely after 8190 different blocks report problems.
• Because SGI's IRIX operating system requires pass-through I/O to have exclusive access, scrubbing functions typically take 2 to 3 times longer under that O/S. The system overhead is significant because the device must be opened and closed between hundreds of millions of I/Os. With -scrubq, the performance impact is minor.
• The -16 and -12 options are mutually exclusive, as are the -scrubr and -scrubs commands.
Self-Test Characteristics
Test Option | Description | Type of Test / Methodology | Strengths | Weaknesses
-stsb | short background (ANSI-defined test, built into the device's firmware) | | |
-steb | extended background (ANSI-defined test, built into the device's firmware) | | |
scrub test | | | |
quick scrub test | | | |
random seek test | | | |
sequential seek test | | | |
terminate on first error | | | |
Verbose scrub | Combine with -scrub or -scrubq to show results in foreground. | | |
Example Results
[root@BOSS smartmon]# ./smartmon-ux -scrubv -scrub /dev/sg9
SMARTMon-ux [Release 1.26, Build 22-APR-2004] - Copyright 2001-2004 SANtools, Inc. http://www.SANtools.com
Discovered SEAGATE ST373405FC S/N "3EK0V6SG" on /dev/sg9 [SES] (Not Enabling SMART)(70007 MB)
(Note: percentage-complete and time-remaining information appears and updates automatically as the test progresses. It is not shown below.)
Beginning SANtools fitness test for SEAGATE ST373405FC at /dev/sg9 (143374740 blocks, blocksize=512)
Block 145614 Sense: 4/32/00 [Controller/drive hardware failed] No defect spare location available
Block 145615 Sense: 3/11/00 [Drive media failed] Unrecovered read error
Block 145616 Sense: 3/11/00 [Drive media failed] Unrecovered read error
Block 145617 Sense: 3/11/00 [Drive media failed] Unrecovered read error
Block 145618 Sense: 4/32/00 [Controller/drive hardware failed] No defect spare location available
Block 145619 Sense: 4/32/00 [Controller/drive hardware failed] No defect spare location available
Block scrubbing error summary:
Block 145614 4/32/00 Count=1 [Controller/drive hardware failed] No defect spare location available
Block 145615 3/11/00 Count=3 [Drive media failed] Unrecovered read error
Block 145616 3/11/00 Count=3 [Drive media failed] Unrecovered read error
Block 145617 3/11/00 Count=3 [Drive media failed] Unrecovered read error
Block 145618 4/32/00 Count=3 [Controller/drive hardware failed] No defect spare location available
Block 145619 4/32/00 Count=2 [Controller/drive hardware failed] No defect spare location available
Program Ended.
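Because each summary line follows a fixed layout (block number, sense key/ASC/ASCQ in hexadecimal, repeat count, bracketed category, message), the summary is easy to post-process. The sketch below parses single-block summary lines of the form shown above. The record type and function names are illustrative, and the pattern is inferred from this sample output rather than from a published format specification.

```python
import re
from typing import List, NamedTuple


class ScrubError(NamedTuple):
    block: int
    sense_key: int
    asc: int
    ascq: int
    count: int
    category: str
    message: str


# Matches e.g. "Block 145615 3/11/00 Count=3 [Drive media failed] Unrecovered read error"
_SUMMARY = re.compile(
    r"^Block\s+(\d+)\s+([0-9A-Fa-f]+)/([0-9A-Fa-f]{2})/([0-9A-Fa-f]{2})\s+"
    r"Count=(\d+)\s+\[([^\]]*)\]\s+(.*)$"
)


def parse_summary(text: str) -> List[ScrubError]:
    """Return one record per single-block summary line; other lines are ignored."""
    errors = []
    for line in text.splitlines():
        m = _SUMMARY.match(line.strip())
        if m:
            block, key, asc, ascq, count, category, message = m.groups()
            errors.append(ScrubError(int(block), int(key, 16), int(asc, 16),
                                     int(ascq, 16), int(count), category, message))
    return errors


sample = """\
Block scrubbing error summary:
Block 145614 4/32/00 Count=1 [Controller/drive hardware failed] No defect spare location available
Block 145615 3/11/00 Count=3 [Drive media failed] Unrecovered read error
Program Ended.
"""

for err in parse_summary(sample):
    print(err.block, f"{err.sense_key:X}/{err.asc:02X}/{err.ascq:02X}", err.count, err.message)
```

The per-event lines printed in verbose mode (those containing "Sense:") and the block-range lines printed by -scrubq do not match this pattern and are skipped.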
Completion and Test Time
The -scrub command reports errors at the block level by reading each block individually. As such, it sacrifices speed for granularity. Our 146GB 15000RPM SAS disk takes about 10 hours to complete using this option. If you don't care about individual block numbers but still want a count of the bad blocks, use -scrubq, which reads 32 blocks at a time. The same disk that took 10 hours to test with -scrub takes 32 minutes with -scrubq.
If you just need a pass-fail test to see whether a particular disk has any read problems, add the -scrubt option so that the test terminates on the first error. The results below were all run on the same disk, on which we used this software to create bad blocks at blocks 123 and 456.
Slow, Detailed Report
# time /etc/smartmon-ux -scrub /dev/rdsk/c4t15d0s0
SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com
Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)
Block scrubbing error summary:
Block 123 4/09/00 Count=3 [Controller/drive hardware failed] Track following error
Block 456 4/09/00 Count=3 [Controller/drive hardware failed] Track following error
Program Ended.
real 10h35m40.22s
user 27m8.57s
sys 2h43m53.15s
Faster Report
# time ./smartmon-ux -scrubq /dev/rdsk/c4t15d0s0
SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com
Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)
Block scrubbing error summary:
Blocks 96 - 112 4/09/00 Count=3 [Controller/drive hardware failed] Track following error
Blocks 448 - 464 4/09/00 Count=3 [Controller/drive hardware failed] Track following error
Program Ended.
real 32m15.85s
user 2m20.74s
sys 5m18.14s
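To compare the two runs, the `time` values can be converted to seconds. This is a small sketch; the helper name `to_seconds` is illustrative, and the pattern handles only the h/m/s form shown in these listings.

```python
import re

_TIME = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(\d+(?:\.\d+)?)s$")


def to_seconds(stamp: str) -> float:
    """Convert a value such as '10h35m40.22s' or '32m15.85s' to seconds."""
    m = _TIME.match(stamp)
    if m is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    hours, minutes, seconds = m.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds)


slow = to_seconds("10h35m40.22s")  # -scrub, real time from the detailed report
fast = to_seconds("32m15.85s")     # -scrubq, real time from the faster report
print(f"-scrubq finished {slow / fast:.1f}x faster than -scrub")
```

The measured speedup is a little under 20 times, which is less than the 32-fold reduction in the number of read commands. That is expected, since each -scrubq command transfers 32 times as much data.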
Fastest
# time ./smartmon-ux -scrubq -scrubt /dev/rdsk/c4t15d0s0
SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com
Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)
Block scrubbing error summary:
Blocks 96 - 128 4/09/00 Count=1 [Controller/drive hardware failed] Track following error
real 0m1.67s
user 0m0.00s
sys 0m0.02s
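A script can turn the -scrubt run into a pass/fail check by testing the exit code. The sketch below relies on the documented error code 11 (SCRUB_T_ERR). Treating exit code 0 as a clean pass is an assumption, and the function name `disk_passes` is illustrative.

```python
import subprocess
from typing import Sequence

SCRUB_T_ERR = 11  # exit code documented for -scrubt on the first error


def disk_passes(command: Sequence[str]) -> bool:
    """Run a -scrubt command line and map its exit code to pass/fail.

    Assumption: exit code 0 means the scrub completed without errors.
    Any other code is treated as an unexpected failure, not a media error.
    """
    result = subprocess.run(command, stdout=subprocess.DEVNULL)
    if result.returncode == SCRUB_T_ERR:
        return False
    if result.returncode != 0:
        raise RuntimeError(f"unexpected exit code {result.returncode}")
    return True


# Hypothetical usage, mirroring the command shown above:
# ok = disk_passes(["./smartmon-ux", "-scrubq", "-scrubt", "/dev/rdsk/c4t15d0s0"])
```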
If your disks support background media scanning, then you can just ask the disk if it has any problems via the -bmsr command (assuming scanning is enabled). This will generate a report based on the last background scan the selected disk ran, and any subsequent activity since that scan. It will take less than a second to report all bad blocks on the disk, regardless of how many you have and where they are located. The disk retains this information through power-cycles.
# time ./smartmon-ux -bmsr /dev/rdsk/c4t15d0s0
SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com
Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)
Background Media Scan Report @ Tue Jun 10 12:18:51 2008
Accumulated power-on minutes: 134911 [94 days]
Number of background scans performed: 37
Background scanning status: medium scan halted, waiting for interval timer expiration
Background scan percentage completed: 0.00
Defect# PowerOnMins HexBlockNumber State Reassignment Status AdditionalInfo
0 133 37fc7 OK recovered via in-place rewrite Recovered error Recovered data with retries
1 117114 2bf620f OK recovered via in-place rewrite Recovered error Recovered data with retries
2 130954 7b ERR waiting for WRITE Controller/drive hardware failed Track following error
3 130954 1c8 ERR waiting for WRITE Controller/drive hardware failed Track following error
4 130954 37fc7 OK recovered via in-place rewrite Recovered error Recovered data with retries
5 131392 37fc8 OK recovered via in-place rewrite Recovered error Recovered data with retries
6 133380 38039 OK recovered via in-place rewrite Recovered error Recovered data with retries
7 133792 d699104 OK recovered via in-place rewrite Recovered error Recovered data with retries
8 134753 dccde66 OK recovered via in-place rewrite Recovered error Recovered data with retries
9 134755 e2bede7 OK recovered via in-place rewrite Recovered error Recovered data with retries
Program Ended.
real 0m0.25s
user 0m0.00s
sys 0m0.02s
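The report lists block numbers in hexadecimal, so a short conversion makes them comparable with the decimal block numbers in the scrub summaries. The sketch below parses the defect rows. The names are illustrative, and the column layout is inferred from this sample report.

```python
from typing import List, NamedTuple


class BmsEntry(NamedTuple):
    defect: int
    power_on_minutes: int
    block: int   # converted from the hexadecimal column
    state: str   # "OK" or "ERR" in the sample report
    detail: str  # reassignment status and additional info, unparsed


def parse_bms_rows(text: str) -> List[BmsEntry]:
    """Parse defect rows; header, status, and timing lines are skipped."""
    entries = []
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 4 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        try:
            block = int(fields[2], 16)
        except ValueError:
            continue
        detail = fields[4] if len(fields) > 4 else ""
        entries.append(BmsEntry(int(fields[0]), int(fields[1]), block, fields[3], detail))
    return entries


sample = """\
Defect# PowerOnMins HexBlockNumber State Reassignment Status AdditionalInfo
0 133 37fc7 OK recovered via in-place rewrite Recovered error Recovered data with retries
2 130954 7b ERR waiting for WRITE Controller/drive hardware failed Track following error
3 130954 1c8 ERR waiting for WRITE Controller/drive hardware failed Track following error
"""

bad = [e.block for e in parse_bms_rows(sample) if e.state == "ERR"]
print("blocks still in error:", bad)
```

The two ERR rows convert to blocks 123 and 456, the same bad blocks found by the scrub tests above.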