Self-Test Diagnostics - SANtools


We added these commands in response to inefficiencies (and in some cases firmware bugs) associated with the built-in self-test functions found in most SCSI and Fibre Channel disk drives. We wanted to provide a tool that would scan the entire disk and produce a report of all errors (or warnings/retries) by block number. The administrator and storage vendor can then analyze and correct the most common errors, such as unrecoverable read/write errors due to a failed sector, without having to re-run the self-test after repairing each bad block. (The built-in self-tests report only one error, then they stop.)

 

Like the self-tests described in the Self-Test Diagnostics ANSI section, all of these tests are safe to run in a live environment with user I/O running in the background. Because the scrubbing self-tests described in this section are controlled by the host, there is additional overhead: one I/O per block (512 bytes, 520 bytes, or whatever block size the drive uses) with -scrub, or one I/O per 32 blocks with -scrubq, repeated until every block on the disk drive has been read. As a result, a test generally takes from 30 minutes to several hours to run, even on a system with little other load.
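To make the overhead concrete, here is a minimal Python sketch that counts how many read commands the host would have to issue for a given disk. The function name is hypothetical, and the block count is the one reported for the Seagate ST373405FC in the example results later on this page; the chunk sizes (1 and 32) are the ones described above.

```python
def host_io_count(total_blocks: int, blocks_per_io: int) -> int:
    """Return how many read commands are needed to cover every block once."""
    if total_blocks < 0 or blocks_per_io < 1:
        raise ValueError("total_blocks must be >= 0 and blocks_per_io >= 1")
    # Integer ceiling division: a final partial chunk still costs one I/O.
    return (total_blocks + blocks_per_io - 1) // blocks_per_io


# Block count taken from the example output on this page (blocksize=512).
TOTAL_BLOCKS = 143374740

print("-scrub :", host_io_count(TOTAL_BLOCKS, 1))    # one block per I/O
print("-scrubq:", host_io_count(TOTAL_BLOCKS, 32))   # 32 blocks per I/O
```

The count only describes the number of commands; actual run time also depends on the drive, the I/O channel, and the host operating system.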

 

If you have to test multiple drives, it is best to run multiple instances of the program concurrently. CPU overhead is almost zero. The bottleneck is your disk I/O channel.
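One way to run several instances side by side is a small wrapper script. The sketch below is an illustration only: the helper names are hypothetical, the binary path ./smartmon-ux is taken from the examples on this page and may differ on your system, and the device names you pass in are your own.

```python
import subprocess
from typing import Dict, List, Sequence

# Path as used in the examples on this page; adjust for your installation.
SMARTMON = "./smartmon-ux"


def build_command(device: str, options: Sequence[str] = ("-scrubq",)) -> List[str]:
    """Build the argument list for one instance testing one device."""
    return [SMARTMON, *options, device]


def scrub_concurrently(devices: Sequence[str],
                       options: Sequence[str] = ("-scrubq",)) -> Dict[str, int]:
    """Start one instance per device, then wait for all of them.

    Returns a mapping of device name to the program's exit code.
    """
    procs = {dev: subprocess.Popen(build_command(dev, options)) for dev in devices}
    return {dev: proc.wait() for dev, proc in procs.items()}


# Example call (not executed here):
#   codes = scrub_concurrently(["/dev/sg9", "/dev/sg10"])
```

Because CPU overhead is low, the number of useful concurrent instances is limited mainly by the disk I/O channel rather than by the host.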

 

Self-Test Commands

-scrubq

Initiates full media read test, with 32-block chunk size

-scrub

Initiates full media read test, with 1-block chunk size.

-scrubr

Pseudo-random read test using the SEEK(10) SCSI command

-scrubs

Sequential read fitness test using the SEEK(10) SCSI command

-scrubv

May be combined with either option above to set verbose mode so that errors, percentage complete, and remaining time appear as they are discovered.

-scrubt

This terminates any fitness test on the first error and causes the program to return error code #11 (SCRUB_T_ERR). The -scrubt option must be combined with the -scrub or -scrubq command.

-16

May be combined with any of the above options to utilize 16-byte SCSI commands READ(16) and WRITE(16)

-12

May be combined with any of the above options to utilize 12-byte SCSI commands READ(12) and WRITE(12)


Notes:

If -scrubv is used without either -scrubq or -scrub, -scrubv will assume -scrub was entered and immediately begin the test.
All options record errors in the event log, and each error line includes the make/model and device name for the disk as part of that error.
Only one disk is tested at a time. If you want to test multiple drives concurrently, launch extra instances of the program and point each of them to a different disk or to a different range of disks using wild cards.
The scrubbing tests are not limited to disk drives. They may be run on optical media such as CDs and DVDs, as well as ATAPI (IDE) devices. You would do this to perform an optical media certification, which would ensure that every block of the CD/DVD is readable without errors. (If you find a problem, do not bother trying to remap it on a read-only device.)
Running the scrub tests on optical devices would also uncover and report other hardware problems, even if the drives are IDE.
As of this version of the documentation, we have not tested remapping DVD R/W media in the event that a defect is found. It should work, but we do not currently have the means to test it.
These tests can be made with peripherals set to any block size, up to 2048 bytes. However, your host operating system or SCSI/Fibre channel controller may not recognize 520-byte or 528-byte formatted disk drives.
The scrub tests will terminate prematurely after 8190 different blocks report problems.
Due to limitations in SGI's IRIX operating system that require pass-through I/O to have exclusive access, scrubbing functions typically take 2 - 3 times longer under that O/S. There is also significant system overhead, as the device must be opened and closed between hundreds of millions of I/Os. If you use -scrubq, the performance impact is minor.
The -16 and -12 options are mutually exclusive, as are the -scrubr and -scrubs commands.
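The combination rules in these notes can be checked before a long test is launched. The following Python sketch is a hypothetical pre-flight check, not part of the product: it encodes only the rules stated on this page (the two mutually exclusive pairs, the requirement that -scrubt accompany -scrub or -scrubq, and the fact that -scrubv alone implies -scrub).

```python
from typing import Iterable, List


def check_scrub_options(options: Iterable[str]) -> List[str]:
    """Return a list of rule violations; an empty list means no conflict found."""
    opts = set(options)
    problems = []
    if {"-16", "-12"} <= opts:
        problems.append("-16 and -12 are mutually exclusive")
    if {"-scrubr", "-scrubs"} <= opts:
        problems.append("-scrubr and -scrubs are mutually exclusive")
    if "-scrubt" in opts and not opts & {"-scrub", "-scrubq"}:
        problems.append("-scrubt must be combined with -scrub or -scrubq")
    return problems


def effective_options(options: Iterable[str]) -> List[str]:
    """Mirror the documented default: -scrubv without -scrub/-scrubq implies -scrub."""
    opts = list(options)
    if "-scrubv" in opts and "-scrub" not in opts and "-scrubq" not in opts:
        opts.append("-scrub")
    return opts


print(check_scrub_options(["-scrubq", "-scrubt"]))   # no violations
print(check_scrub_options(["-16", "-12", "-scrub"]))
print(effective_options(["-scrubv"]))
```

The program itself remains the authority on which combinations it accepts; this check only catches the conflicts listed above.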

 

 

Self-Test Characteristics

Each test option is listed with its description, followed by its type of test / methodology, strengths, and weaknesses.

-stsb: short background (ANSI-defined test, built into the device's firmware)
Type of Test / Methodology: A single command is sent; the disk runs the test for up to 2 minutes and saves the result in a log page. Once the command is launched by SMARTMonUX, no further interaction is required. An unlimited number of disks can be tested concurrently without adversely affecting the host system or I/O bandwidth.
Strengths: Full test of everything except the media, which receives only a light test. Completes in less than 2 minutes regardless of host I/O load. Unlimited instances can be run concurrently without adverse effect on the host.
Weaknesses: Not good for certifying media, but can be combined with -scrub for a thorough test (although combining -steb with -scrub gives the most complete test). Useless for testing DVD and CD-ROM media.

-steb: extended background (ANSI-defined test, built into the device's firmware)
Type of Test / Methodology: Disk vendors use this as a pass/fail criterion to authorize warranty returns. Results are viewable with the -C and -str commands.
Strengths: Tests 100% of the disk, including random I/O. Like -stsb, this test has no host overhead once it is accepted by the disk.
Weaknesses: It returns only the first error, then terminates. If the disk has any errors, the only way to get a full disk test is to correct the problem and start again. This could take days of operator time if there are multiple errors toward the end of a large disk.

-scrub: scrub test
Type of Test / Methodology: Reads all blocks on the disk and reports the sense information resulting from every I/O. Retries automatically as necessary, depending on the errors.
Strengths: Full report of non-zero sense information and errors/retries. A single pass reads everything and returns all errors in a report by block number. Use the report to manually reassign sectors in a single pass, or send it to the storage vendor for analysis for drive replacement.
Weaknesses: No random I/O test. No non-media tests. You should combine this test with -steb to guarantee 100% testing. Run -scrub first and reassign all bad sectors so that -steb will not stop when it finds the first error.

-scrubq: quick scrub test
Type of Test / Methodology: Same as -scrub, but it reads 32 blocks at a time to finish the test much earlier.
Strengths: Does a full read, but finishes much faster than -scrub. Use it to quickly find out whether any sense data indicates that the drive needs to be replaced or that further action is required to repair it.
Weaknesses: Blocks are read in chunks of 32, so sense errors are tied to a range of blocks. You will have to run the -scrub or -steb options to determine exactly which block(s) you need to remap.

-scrubr: random seek test
Type of Test / Methodology: Repositions the head in a pseudo-random sequence until one seek has been done for every 16 blocks of data on the disk. This invokes the SEEK(10) SCSI CDB.
Strengths: This is an important test, because successful sequential read or write tests will not stress the drive arm assembly sufficiently.
Weaknesses: The -scrubr and -scrubs commands are mutually exclusive. You must perform each test separately.

-scrubs: sequential seek test
Type of Test / Methodology: Repositions the head from the beginning to the end of the disk using the SEEK(10) SCSI CDB.
Weaknesses: Arguably not as useful or as stressful on a disk as performing random seeks. The -scrubr and -scrubs commands are mutually exclusive. You must perform each test separately.

-scrubt: terminate on first error
Type of Test / Methodology: Terminates any of these self-test diagnostics upon the first error.
Strengths: The self-test aborts if a problem is found, dramatically speeding up the process of testing multiple devices.
Weaknesses: The test does not report all errors found and/or repaired.

-scrubv: verbose scrub
Type of Test / Methodology: Combine with -scrub or -scrubq to show results in the foreground.
Strengths: Shows percentage complete and remaining time.
Weaknesses: Do not redirect output to a file, as the file will contain a large amount of formatted text and backspace characters.

 

Example Results

[root@BOSS smartmon]# ./smartmon-ux -scrubv -scrub /dev/sg9

SMARTMon-ux [Release 1.26, Build 22-APR-2004] - Copyright 2001-2004 SANtools, Inc. http://www.SANtools.com

Discovered SEAGATE ST373405FC S/N "3EK0V6SG" on /dev/sg9 [SES] (Not Enabling SMART)(70007 MB)

(Note: percentage-complete information and time remaining appear and update automatically as this procedure progresses. They are not shown below.)

Beginning SANtools fitness test for SEAGATE ST373405FC at /dev/sg9 (143374740 blocks, blocksize=512)

Block 145614 Sense: 4/32/00 [Controller/drive hardware failed] No defect spare location available

Block 145615 Sense: 3/11/00 [Drive media failed] Unrecovered read error

Block 145616 Sense: 3/11/00 [Drive media failed] Unrecovered read error

Block 145617 Sense: 3/11/00 [Drive media failed] Unrecovered read error

Block 145618 Sense: 4/32/00 [Controller/drive hardware failed] No defect spare location available

Block 145619 Sense: 4/32/00 [Controller/drive hardware failed] No defect spare location available

 

Block scrubbing error summary:

Block 145614 4/32/00 Count=1 [Controller/drive hardware failed] No defect spare location available

Block 145615 3/11/00 Count=3 [Drive media failed] Unrecovered read error

Block 145616 3/11/00 Count=3 [Drive media failed] Unrecovered read error

Block 145617 3/11/00 Count=3 [Drive media failed] Unrecovered read error

Block 145618 4/32/00 Count=3 [Controller/drive hardware failed] No defect spare location available

Block 145619 4/32/00 Count=2 [Controller/drive hardware failed] No defect spare location available

 

Program Ended.
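If you want to post-process the error summary, for example to build a list of blocks to reassign, the summary lines are regular enough to parse. The Python sketch below is an assumption-laden illustration: the line layout is inferred only from the example output on this page, the class and function names are hypothetical, and the three numbers in each x/xx/xx field are treated as the SCSI sense key, ASC, and ASCQ in hexadecimal.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ScrubError:
    first_block: int
    last_block: int   # equals first_block for single-block lines
    sense_key: int
    asc: int
    ascq: int
    count: int
    category: str
    description: str


# Matches "Block N k/aa/qq Count=C [category] text" and "Blocks N - M ..." lines.
SUMMARY_LINE = re.compile(
    r"^Blocks?\s+(\d+)(?:\s*-\s*(\d+))?\s+"
    r"([0-9A-Fa-f])/([0-9A-Fa-f]{2})/([0-9A-Fa-f]{2})\s+"
    r"Count=(\d+)\s+\[(.+?)\]\s+(.*)$"
)


def parse_summary(text: str) -> List[ScrubError]:
    """Extract summary entries; lines that do not match are ignored."""
    errors = []
    for line in text.splitlines():
        m = SUMMARY_LINE.match(line.strip())
        if not m:
            continue
        first = int(m.group(1))
        last = int(m.group(2)) if m.group(2) else first
        errors.append(ScrubError(first, last,
                                 int(m.group(3), 16), int(m.group(4), 16),
                                 int(m.group(5), 16), int(m.group(6)),
                                 m.group(7), m.group(8).strip()))
    return errors


SAMPLE = """\
Block 145614 4/32/00 Count=1 [Controller/drive hardware failed] No defect spare location available
Block 145615 3/11/00 Count=3 [Drive media failed] Unrecovered read error
"""

for err in parse_summary(SAMPLE):
    print(err.first_block, hex(err.sense_key), hex(err.asc), err.count, err.description)
```

The per-I/O lines that contain "Sense:" are deliberately not matched, so only the summary section is collected.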

 

Completion and Test Time

The -scrub command reports errors at the block level by reading each block individually. As such, it sacrifices speed for granularity. Our 146GB 15000RPM SAS disk takes about 10 hours to complete with this option. If you don't care about individual block numbers, but still want a count of the bad blocks, then use -scrubq, which reads 32 blocks at a time. The same disk that took 10 hours to test with the -scrub command takes 32 minutes to complete with -scrubq.

 

If you just need a pass/fail test to see whether a particular disk has any read problems, be sure to add the -scrubt option so that the test terminates on the first error. The results below were run on the same disk, on which we used this software to create bad blocks at blocks 123 and 456.
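A script can turn this into a pass/fail check by looking at the program's exit code. The sketch below is hypothetical: it relies on the documented return code 11 (SCRUB_T_ERR) for a test stopped by -scrubt, and it assumes, without confirmation from this page, that an exit code of 0 means the scan completed without errors. Any other code is reported as unknown rather than guessed at.

```python
import subprocess
from typing import Callable

SCRUB_T_ERR = 11  # documented exit code when -scrubt stops on the first error


def quick_pass_fail(device: str, run: Callable = subprocess.run) -> str:
    """Run a quick scrub that stops on the first error and classify the result."""
    cmd = ["./smartmon-ux", "-scrubq", "-scrubt", device]
    result = run(cmd)
    if result.returncode == SCRUB_T_ERR:
        return "FAIL"
    if result.returncode == 0:
        return "PASS"      # assumption: 0 means no error was found
    return "UNKNOWN"       # some other condition; inspect the output manually


# Example call (not executed here):
#   print(quick_pass_fail("/dev/rdsk/c4t15d0s0"))
```

The run parameter exists only so the function can be exercised without real hardware.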

 

Slow, Detailed Report

time /etc/smartmon-ux -scrub /dev/rdsk/c4t15d0s0

SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com

Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)

 

 

Block scrubbing error summary:

Block 123 4/09/00 Count=3 [Controller/drive hardware failed] Track following error

Block 456 4/09/00 Count=3 [Controller/drive hardware failed] Track following error

 

 

Program Ended.

 

 

real    10h35m40.22s

user    27m8.57s

sys     2h43m53.15s

 

Faster Report

# time ./smartmon-ux -scrubq /dev/rdsk/c4t15d0s0

SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com

Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)

 

 

Block scrubbing error summary:

Blocks 96 - 112 4/09/00 Count=3 [Controller/drive hardware failed] Track following error

Blocks 448 - 464 4/09/00 Count=3 [Controller/drive hardware failed] Track following error

 

 

Program Ended.

real    32m15.85s

user    2m20.74s

sys     5m18.14s
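The difference between the two runs can be quantified from the "real" times reported above. This short Python sketch (the function name is hypothetical) converts the time strings to seconds and computes the speed-up of -scrubq over -scrub for this particular disk.

```python
import re


def parse_duration(text: str) -> float:
    """Convert strings such as '10h35m40.22s' or '32m15.85s' to seconds."""
    m = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(\d+(?:\.\d+)?)s", text.strip())
    if not m:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours = int(m.group(1) or 0)
    minutes = int(m.group(2) or 0)
    seconds = float(m.group(3))
    return hours * 3600 + minutes * 60 + seconds


scrub_real = parse_duration("10h35m40.22s")   # -scrub, from the slow report
scrubq_real = parse_duration("32m15.85s")     # -scrubq, from the faster report

print(f"-scrubq was about {scrub_real / scrubq_real:.1f}x faster on this disk")
```

The ratio is lower than the 32:1 reduction in I/O count, which suggests that per-command overhead is not the only cost; results on other drives and hosts will differ.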

 

Fastest

time ./smartmon-ux -scrubq -scrubt /dev/rdsk/c4t15d0s0

SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com

Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)

 

 

Block scrubbing error summary:

Blocks 96 - 128 4/09/00 Count=1 [Controller/drive hardware failed] Track following error

 

 

real    0m1.67s

user    0m0.00s

sys     0m0.02s

 

If your disks support background media scanning, then you can just ask the disk if it has any problems via the -bmsr command (assuming scanning is enabled).  This will generate a report based on the last background scan the selected disk ran, and any subsequent activity since that scan. It will take less than a second to report all bad blocks on the disk, regardless of how many you have and where they are located.  The disk retains this information through power-cycles.

 

time ./smartmon-ux -bmsr /dev/rdsk/c4t15d0s0

SMARTMon-UX [Release 1.36, Build 10-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com

Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)

Background Media Scan Report @ Tue Jun 10 12:18:51 2008

Accumulated power-on minutes:             134911 [94 days]

Number of background scans performed:     37

Background scanning status:               medium scan halted, waiting for interval timer expiration

Background scan percentage completed:     0.00 

Defect#   PowerOnMins   HexBlockNumber   State   Reassignment Status             AdditionalInfo

      0           133            37fc7   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

      1        117114          2bf620f   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

      2        130954               7b   ERR     waiting for WRITE               Controller/drive hardware failed Track following error

      3        130954              1c8   ERR     waiting for WRITE               Controller/drive hardware failed Track following error

      4        130954            37fc7   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

      5        131392            37fc8   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

      6        133380            38039   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

      7        133792          d699104   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

      8        134753          dccde66   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

      9        134755          e2bede7   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

 

Program Ended.

real    0m0.25s

user    0m0.00s

sys     0m0.02s
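The background media scan report lists block numbers in hexadecimal, whereas the scrub reports use decimal. A small Python sketch, with hypothetical names and a row layout inferred only from the report above, shows how the rows marked ERR map back to the bad blocks created at 123 and 456.

```python
import re
from typing import List, NamedTuple


class BmsDefect(NamedTuple):
    index: int
    power_on_minutes: int
    block: int      # decimal block number
    state: str      # "OK" or "ERR"


# Defect#, PowerOnMins, HexBlockNumber, State; the remaining columns are ignored.
DEFECT_ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([0-9A-Fa-f]+)\s+(OK|ERR)\b")


def parse_bms_rows(report: str) -> List[BmsDefect]:
    """Return one entry per defect row; header and other lines are skipped."""
    rows = []
    for line in report.splitlines():
        m = DEFECT_ROW.match(line)
        if m:
            rows.append(BmsDefect(int(m.group(1)), int(m.group(2)),
                                  int(m.group(3), 16), m.group(4)))
    return rows


SAMPLE = """\
Defect#   PowerOnMins   HexBlockNumber   State   Reassignment Status             AdditionalInfo
      1        117114          2bf620f   OK      recovered via in-place rewrite  Recovered error Recovered data with retries
      2        130954               7b   ERR     waiting for WRITE               Controller/drive hardware failed Track following error
      3        130954              1c8   ERR     waiting for WRITE               Controller/drive hardware failed Track following error
"""

bad_blocks = [d.block for d in parse_bms_rows(SAMPLE) if d.state == "ERR"]
print(bad_blocks)  # [123, 456]
```

Hex 7b is decimal 123 and hex 1c8 is decimal 456, so the two ERR rows correspond to the same blocks that the scrub tests reported.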