Self-Test Diagnostics - SANtools
We added these commands in response to inefficiencies (and in some cases firmware bugs) associated with the built-in self-test functions found in most SCSI and Fibre Channel disk drives. We wanted to provide a tool that would scan the entire disk and produce a report of all errors (or warnings/retries) by block number. The administrator and storage vendor can then analyze and correct the most common errors, such as unrecoverable read/write errors due to a failed sector, without having to re-run the self-test after repairing each bad block. (The built-in self-tests report only one error, then stop.)
Like the self-tests described in the Self-Test Diagnostics ANSI section, all of these tests are safe to run in a live environment with user I/O running in the background. Because the scrubbing self-tests described in this section are controlled by the host, there is additional overhead. With -scrub, the host issues one I/O per block (512 bytes, 520 bytes, or whatever block size the drive uses) for every block on the disk drive; with -scrubq, it issues one I/O per 32 blocks. As a result, a test generally takes from 30 minutes to several hours to run, even on a system with little other load.
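To make the overhead concrete, the arithmetic can be sketched as below. This is only an illustrative calculation: the function name host_ios is hypothetical, and the block count is the one reported for the example drive later in this section (143374740 blocks).

```python
def host_ios(total_blocks: int, blocks_per_read: int) -> int:
    """Number of read commands the host must issue to cover the whole disk.

    blocks_per_read is 1 for -scrub and 32 for -scrubq, per the text above.
    The last read may be partial, so round up.
    """
    return (total_blocks + blocks_per_read - 1) // blocks_per_read


if __name__ == "__main__":
    blocks = 143374740  # block count of the example ST373405FC drive
    print(host_ios(blocks, 1))   # 143374740 reads with -scrub
    print(host_ios(blocks, 32))  # 4480461 reads with -scrubq
```

The 32-fold drop in the number of commands explains most of the difference in run time between the two options.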
If you have to test multiple drives, it is best to run multiple instances of the program concurrently. CPU overhead is almost zero. The bottleneck is your disk I/O channel.
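One way to start several instances at once is a small wrapper script. The sketch below is an assumption-laden example, not part of the product: the helper names, the log-file naming, and the device paths in the usage note are hypothetical, and the meaning of the program's exit status is not documented here, so the script only records it.

```python
import os
import subprocess


def build_command(device, binary="./smartmon-ux", mode="-scrubq"):
    """Command line for one scrub instance (the binary path is an assumption)."""
    return [binary, mode, device]


def scrub_concurrently(devices, log_dir="."):
    """Start one scrub per device, then wait for all of them to finish.

    Each instance writes its report to its own log file (hypothetical naming).
    Returns a dict mapping device -> exit status; the status is recorded
    as-is because its meaning is not specified in this section.
    """
    running = []
    for dev in devices:
        log_name = dev.strip("/").replace("/", "_") + ".log"
        log = open(os.path.join(log_dir, log_name), "w")
        proc = subprocess.Popen(build_command(dev), stdout=log,
                                stderr=subprocess.STDOUT)
        running.append((dev, proc, log))

    results = {}
    for dev, proc, log in running:
        results[dev] = proc.wait()
        log.close()
    return results
```

For example, scrub_concurrently(["/dev/sg9", "/dev/sg10"]) would launch two scans in parallel (the second device name is made up for illustration). Because the bottleneck is the disk I/O channel, drives that share a channel may not finish faster when scanned together.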
Self-Test Commands
Notes:
Self-Test Characteristics
Example Results
[root@BOSS smartmon]# ./smartmon-ux -scrubv -scrub /dev/sg9
SMARTMon-ux [Release 1.26, Build 22-APR-2004] - Copyright 2001-2004 SANtools, Inc. http://www.SANtools.com
Discovered SEAGATE ST373405FC S/N "3EK0V6SG" on /dev/sg9 [SES] (Not Enabling SMART)(70007 MB)
(Note: percentage-complete information and time remaining will appear and automatically update as this procedure progresses. This is not shown below.)
Beginning SANtools fitness test for SEAGATE ST373405FC at /dev/sg9 (143374740 blocks, blocksize=512)
Block 145614 Sense: 4/32/00 [Controller/drive hardware failed] No defect spare location available
Block 145615 Sense: 3/11/00 [Drive media failed] Unrecovered read error
Block 145616 Sense: 3/11/00 [Drive media failed] Unrecovered read error
Block 145617 Sense: 3/11/00 [Drive media failed] Unrecovered read error
Block 145618 Sense: 4/32/00 [Controller/drive hardware failed] No defect spare location available
Block 145619 Sense: 4/32/00 [Controller/drive hardware failed] No defect spare location available
Block scrubbing error summary:
Block 145614 4/32/00 Count=1 [Controller/drive hardware failed] No defect spare location available
Block 145615 3/11/00 Count=3 [Drive media failed] Unrecovered read error
Block 145616 3/11/00 Count=3 [Drive media failed] Unrecovered read error
Block 145617 3/11/00 Count=3 [Drive media failed] Unrecovered read error
Block 145618 4/32/00 Count=3 [Controller/drive hardware failed] No defect spare location available
Block 145619 4/32/00 Count=2 [Controller/drive hardware failed] No defect spare location available
Program Ended.
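If you want to post-process the report, the summary entries have a regular layout that is easy to parse. The following is a minimal sketch, assuming each summary entry sits on its own line in the form shown above and that the three slash-separated sense fields are hexadecimal sense key, ASC and ASCQ values; the ScrubError type and the function name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ScrubError(NamedTuple):
    block: int
    sense_key: int
    asc: int
    ascq: int
    count: int
    category: str
    description: str


# Matches e.g.:
#   Block 145615 3/11/00 Count=3 [Drive media failed] Unrecovered read error
_SUMMARY_RE = re.compile(
    r"^Block\s+(\d+)\s+([0-9A-Fa-f]+)/([0-9A-Fa-f]+)/([0-9A-Fa-f]+)\s+"
    r"Count=(\d+)\s+\[([^\]]+)\]\s+(.+)$"
)


def parse_summary_line(line: str) -> Optional[ScrubError]:
    """Parse one summary entry; return None for any other kind of line."""
    m = _SUMMARY_RE.match(line.strip())
    if m is None:
        return None
    block, key, asc, ascq, count, category, description = m.groups()
    # Sense fields are assumed to be hexadecimal.
    return ScrubError(int(block), int(key, 16), int(asc, 16), int(ascq, 16),
                      int(count), category, description)
```

Lines that carry the "Sense:" label (the per-event lines printed during the scan) deliberately do not match, so feeding the whole report through this function yields only the summary entries.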
Completion and Test Time
The -scrub command reports errors at the block level by reading each block individually. As such, it sacrifices speed for granularity. Our 146GB 15000RPM SAS disk takes over 10 hours to complete using this option. If you don't care about individual block numbers, but still want a count of the bad blocks, then use -scrubq, which reads 32 blocks at a time. The same disk that took over 10 hours to test with -scrub takes 32 minutes to complete with -scrubq.
If you just need a pass-fail test to see if a particular disk has any read problems, add the -scrubt option so that the test terminates on the first error. The results below were run on the same disk, on which we used this software to create bad blocks at blocks 123 and 456.
Slow, Detailed Report
# time /etc/smartmon-ux -scrub /dev/rdsk/c4t15d0s0
SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com
Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)
Block scrubbing error summary:
Block 123 4/09/00 Count=3 [Controller/drive hardware failed] Track following error
Block 456 4/09/00 Count=3 [Controller/drive hardware failed] Track following error
Program Ended.
real 10h35m40.22s
user 27m8.57s
sys 2h43m53.15s
Faster Report
# time ./smartmon-ux -scrubq /dev/rdsk/c4t15d0s0
SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com
Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)
Block scrubbing error summary:
Blocks 96 - 112 4/09/00 Count=3 [Controller/drive hardware failed] Track following error
Blocks 448 - 464 4/09/00 Count=3 [Controller/drive hardware failed] Track following error
Program Ended.
real 32m15.85s
user 2m20.74s
sys 5m18.14s
Fastest
# time ./smartmon-ux -scrubq -scrubt /dev/rdsk/c4t15d0s0
SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com
Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)
Block scrubbing error summary:
Blocks 96 - 128 4/09/00 Count=1 [Controller/drive hardware failed] Track following error
real 0m1.67s
user 0m0.00s
sys 0m0.02s
If your disks support background media scanning, then you can just ask the disk if it has any problems via the -bmsr command (assuming scanning is enabled). This will generate a report based on the last background scan the selected disk ran, and any subsequent activity since that scan. It will take less than a second to report all bad blocks on the disk, regardless of how many you have and where they are located. The disk retains this information through power-cycles.
# time ./smartmon-ux -bmsr /dev/rdsk/c4t15d0s0
SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com
Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)
Background Media Scan Report @ Tue Jun 10 12:18:51 2008
Accumulated power-on minutes: 134911 [94 days]
Number of background scans performed: 37
Background scanning status: medium scan halted, waiting for interval timer expiration
Background scan percentage completed: 0.00
Defect#  PowerOnMins  HexBlockNumber  State  Reassignment Status             AdditionalInfo
0        133          37fc7           OK     recovered via in-place rewrite  Recovered error Recovered data with retries
1        117114       2bf620f         OK     recovered via in-place rewrite  Recovered error Recovered data with retries
2        130954       7b              ERR    waiting for WRITE               Controller/drive hardware failed Track following error
3        130954       1c8             ERR    waiting for WRITE               Controller/drive hardware failed Track following error
4        130954       37fc7           OK     recovered via in-place rewrite  Recovered error Recovered data with retries
5        131392       37fc8           OK     recovered via in-place rewrite  Recovered error Recovered data with retries
6        133380       38039           OK     recovered via in-place rewrite  Recovered error Recovered data with retries
7        133792       d699104         OK     recovered via in-place rewrite  Recovered error Recovered data with retries
8        134753       dccde66         OK     recovered via in-place rewrite  Recovered error Recovered data with retries
9        134755       e2bede7         OK     recovered via in-place rewrite  Recovered error Recovered data with retries
Program Ended.
real 0m0.25s
user 0m0.00s
sys 0m0.02s
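The HexBlockNumber column in the report above is hexadecimal, so 7b and 1c8 correspond to decimal blocks 123 and 456, the two bad blocks created for these tests. A small sketch of that conversion follows; it assumes each defect row starts with the whitespace-separated fields Defect#, PowerOnMins, HexBlockNumber and State, as in the sample report, and the function name is hypothetical.

```python
def unresolved_blocks(report_lines):
    """Return decimal block numbers of defect rows whose State is ERR.

    Rows are assumed to begin with: Defect#  PowerOnMins  HexBlockNumber  State
    Header and banner lines are skipped because their first field is not a number.
    """
    blocks = []
    for line in report_lines:
        fields = line.split()
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        hex_block, state = fields[2], fields[3]
        if state == "ERR":
            blocks.append(int(hex_block, 16))
    return blocks


if __name__ == "__main__":
    sample = [
        "Defect# PowerOnMins HexBlockNumber State Reassignment Status AdditionalInfo",
        "1 117114 2bf620f OK recovered via in-place rewrite Recovered error",
        "2 130954 7b ERR waiting for WRITE Controller/drive hardware failed",
        "3 130954 1c8 ERR waiting for WRITE Controller/drive hardware failed",
    ]
    print(unresolved_blocks(sample))  # [123, 456]
```

The decimal values can then be compared directly with the block numbers reported by -scrub.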