Background Media Scan (BGMS) Functions

Reasonably current SCSI, FC, and SAS disk drives (such as the Seagate 10K.5 family and later) have a programmable feature that lets the drive scan its own media for correctable errors during idle time.  If your disk has this firmware capability, you can use the software to enable or disable scanning and to report the scan results.

 

What is Background Scanning

The clearest description of background media scanning and its benefits comes from Seagate's patent #7490261, "Background media scan for recovery of data errors". The following abridged text comes from the published patent itself:

"Media defects can arise at any sector on your disk drive during the lifetime of the storage system (grown defects). These grown defects include, for example, invading foreign particles which become embedded onto the surface of the disc, or external shocks to the storage system which can cause the transducer to nick or crash onto the surface of the disc. Defective sectors pose either temporary or permanent data retrieval problems.

 

Read errors are typically determined when the host computer attempts to retrieve user data from a sector and one or more uncorrected errors exist. Typically, the data storage system includes internally programmed error recovery routines such that upon determination of a read error, the data storage system applies a variety of corrective operations to recover user data. Occasionally, the data storage system exhausts all available corrective operations for recovery of data without success. The data storage system will declare a hard error and reallocate the sector by mapping out the bad sector and substituting an unused, reserved sector. The use of these corrective operations and reallocation functions can require a significant amount of time during retrieval of user data and thus, limit the maximum data transfer rate of the data storage system."

 

Whether you are using JBOD, hardware RAID, or software-based RAID, BGMS provides a substantial improvement in reliability and data integrity with near-zero overhead.

 

Benefits of BGMS

First, BGMS fixes bad blocks on the fly as the firmware discovers them. The disk drive uses idle time to perform multiple re-reads to correct the data.  Because the bad blocks are discovered BEFORE the O/S actually needs the data on those blocks, no programs have to suspend processing while bad blocks are repaired.  If your host is streaming movies into hotel rooms, users won't suffer through a movie stopping for 5-30 seconds while the host and/or RAID subsystem goes through the data recovery/remapping process.

 

If you are using software RAID, BGMS can partly replace data consistency checks and give storage farms a degree of self-healing. If a BGMS-enabled disk cannot repair a bad block, the report that SMARTMon-UX generates gives you a list of physical disk drives and offsets where you know you have unrecoverable data.  You can then use a shell script to find the bad blocks and either run a parity rebuild or issue a single command that repairs the bad stripe by reading the part of the RAID volume that incorporates the bad block(s).  When that read is issued, the RAID software discovers for itself that there is unreadable data and repairs it.
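The "single command" that reads the affected part of a volume can be sketched in Python as follows. This is only an illustration under stated assumptions: the block size is taken to be 512 bytes, the volume path /dev/md/rdsk/d10 is hypothetical, and the block number reported for the physical disk is assumed to map one-to-one onto the volume (true only for simple layouts; striped or offset layouts need their own translation). The helper builds a standard dd command line and does not execute it.

```python
def block_read_command(device, hex_block, block_size=512):
    """Build a dd command that reads one block at a hex block number.

    Assumptions (not from the SMARTMon-UX documentation): 512-byte blocks,
    and a block number that is valid relative to `device`.
    """
    lba = int(hex_block, 16)  # the report prints block numbers in hex
    return [
        "dd",
        f"if={device}",
        "of=/dev/null",       # the data is discarded; only the read matters
        f"bs={block_size}",
        f"skip={lba}",        # skip this many input blocks before reading
        "count=1",
    ]


if __name__ == "__main__":
    # Hypothetical volume path; 1e240 is one of the ERR blocks in the sample report.
    cmd = block_read_command("/dev/md/rdsk/d10", "1e240")
    print(" ".join(cmd))
    print("byte offset:", int("1e240", 16) * 512)
```

Reading through the RAID volume rather than the raw member disk is the point of the exercise, since it is the RAID layer that notices the failed read and rewrites the block from redundant data.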

 

By exploiting BGMS, you can effectively scan and repair a storage farm of any size 24x7 without the overhead incurred when the host scans and repairs bad blocks by brute force.

 

Disable Background Media Scanning

The -bmsd command disables background media scanning.

 

Usage

smartmon-ux -bmsd DeviceList

 

Enable Background Media Scanning

The -bmse command enables background media scanning.

 

Usage

smartmon-ux -bmse n DeviceList

 

Where:  n is the scanning interval in hours.  Once the disk is programmed to enable scanning, it automatically begins a new scan each time the supplied interval elapses. If disk power is lost, the timer resets to zero and scanning continues automatically.  Send the -bmsd command to stop and disable scanning.
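If you drive these options from a script, a small helper can assemble the command line. The Python sketch below is an assumption-laden illustration, not part of SMARTMon-UX: the function name bgms_command is hypothetical, and the check that n is a positive whole number of hours is our assumption about valid input. It only builds the argument list using the -bmse and -bmsd options documented above.

```python
def bgms_command(action, devices, interval_hours=None, binary="smartmon-ux"):
    """Return an argument list for enabling or disabling BGMS."""
    if not devices:
        raise ValueError("at least one device path is required")
    if action == "enable":
        # Assumption: the interval must be a positive whole number of hours.
        if not isinstance(interval_hours, int) or interval_hours <= 0:
            raise ValueError("interval_hours must be a positive integer")
        return [binary, "-bmse", str(interval_hours), *devices]
    if action == "disable":
        return [binary, "-bmsd", *devices]
    raise ValueError(f"unknown action: {action!r}")


if __name__ == "__main__":
    disks = ["/dev/rdsk/c4t12d0s0", "/dev/rdsk/c4t13d0s0"]
    # Request a new scan every 24 hours on both disks.
    print(" ".join(bgms_command("enable", disks, interval_hours=24)))
    print(" ".join(bgms_command("disable", disks)))
```

Returning a list rather than a single string lets the result be passed directly to a process launcher without any shell quoting concerns.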

 

Enable Background Media Pre-Scanning

The -bmsep command enables background media pre-scanning. Many devices require the pre-scanning function to be enabled, and a pre-scan cycle to complete, before they start normal background scanning. The pre-scan only needs to be completed once. The -bmsdp command disables this feature.

 

Usage

smartmon-ux -bmsep n DeviceList

 

Where:  n is the scanning interval in hours.  Once the disk is programmed to enable scanning, it automatically begins a new scan each time the supplied interval elapses. If disk power is lost, the timer resets to zero and scanning continues automatically.  Send the -bmsdp command to stop and disable pre-scanning.

 

Report Background Media Scan Results

The -bmsr command reports the results of background media scanning.

 

Usage

smartmon-ux -bmsr DeviceList

 

The command below was run on a SPARC Solaris 10 system that has six SAS disks. We prefixed it with the time command so that you can see how quickly it runs, and used a wildcard to select all disks attached to controller #4.

 

# time ./smartmon-ux -bmsr /dev/rdsk/c4*s0

SMARTMon-UX [Release 1.36, Build  8-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com

Discovered SEAGATE ST3146855SS S/N "3LN23ER0" on /dev/rdsk/c4t12d0s0 (Not Enabling SMART)(140014 MB)

 

 

Background Media Scan Report @ Sun Jun  8 16:33:03 2008

Accumulated power-on minutes:             135086 [94 days]

Number of background scans performed:     34

Background scanning status:               medium scan halted, waiting for interval timer expiration

Background scan percentage completed:     0.00

Defect#   PowerOnMins   HexBlockNumber   State   Reassignment Status             AdditionalInfo

    0             8           577a4b   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    1         46392            381f8   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    2         46402          7598a8e   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    3        117139          2cfae2a   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    4        117149          9c9036c   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    5        131136          77b3f4d   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    6        135041          77339d3   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

 

Discovered SEAGATE ST3146855SS S/N "3LN2A027" on /dev/rdsk/c4t13d0s0 (Not Enabling SMART)(140014 MB)

 

 

Background Media Scan Report @ Sun Jun  8 16:33:03 2008

Accumulated power-on minutes:             134976 [94 days]

Number of background scans performed:     34

Background scanning status:               medium scan halted, waiting for interval timer expiration

Background scan percentage completed:     0.00

Number of defects reported:               0

 

Discovered SEAGATE ST3146855SS S/N "3LN29PAS" on /dev/rdsk/c4t14d0s0 (Not Enabling SMART)(140014 MB)

 

 

Background Media Scan Report @ Sun Jun  8 16:33:03 2008

Accumulated power-on minutes:             134904 [94 days]

Number of background scans performed:     35

Background scanning status:               medium scan halted, waiting for interval timer expiration

Background scan percentage completed:     0.00

Defect#   PowerOnMins   HexBlockNumber   State   Reassignment Status             AdditionalInfo

    0           148          d99d9f7   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    1          8855          761f75d   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

 

Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)

 

 

Background Media Scan Report @ Sun Jun  8 16:33:04 2008

Accumulated power-on minutes:             134325 [93 days]

Number of background scans performed:     35

Background scanning status:               medium scan halted, waiting for interval timer expiration

Background scan percentage completed:     0.00

Defect#   PowerOnMins   HexBlockNumber   State   Reassignment Status             AdditionalInfo

    0           133            37fc7   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    1        117114          2bf620f   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    2        130954               7b   ERR     waiting for WRITE               Controller/drive hardware failed Track following error

    3        130954              1c8   ERR     waiting for WRITE               Controller/drive hardware failed Track following error

    4        130954            37fc7   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    5        131392            37fc8   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    6        133380            38039   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    7        133792          d699104   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

 

Discovered SEAGATE ST3146855SS S/N "3LN27XJ9" on /dev/rdsk/c4t16d0s0 (Not Enabling SMART)(140014 MB)

 

 

Background Media Scan Report @ Sun Jun  8 16:33:04 2008

Accumulated power-on minutes:             134950 [94 days]

Number of background scans performed:     38

Background scanning status:               medium scan halted, waiting for interval timer expiration

Background scan percentage completed:     0.00

Defect#   PowerOnMins   HexBlockNumber   State   Reassignment Status             AdditionalInfo

    0         46356          3b46c18   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    1        133307            80a34   ERR     recovered via in-place rewrite  Controller/drive hardware failed Track following error

 

Discovered SEAGATE ST3146855SS S/N "3LN29QG4" on /dev/rdsk/c4t17d0s0 (SMART enabled)(140014 MB)

 

 

Background Media Scan Report @ Sun Jun  8 16:33:04 2008

Accumulated power-on minutes:             134993 [94 days]

Number of background scans performed:     35

Background scanning status:               medium scan halted, waiting for interval timer expiration

Background scan percentage completed:     0.00

Defect#   PowerOnMins   HexBlockNumber   State   Reassignment Status             AdditionalInfo

    0           127            381a8   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    1         46378          de80f44   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    2         56468          3a44867   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    3         86795          a817a7f   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    4        130059          de863e6   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    5        131031            1e240   ERR     waiting for WRITE               Controller/drive hardware failed Track following error

    6        132850          e01e8c4   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

    7        133350             1f62   ERR     waiting for WRITE               Controller/drive hardware failed Track following error

    8        133350            8034a   ERR     waiting for WRITE               Controller/drive hardware failed Track following error

    9        133350            805b4   ERR     waiting for WRITE               Controller/drive hardware failed Track following error

   10        134778          e01e8fa   OK      recovered via in-place rewrite  Recovered error Recovered data with retries

 

Program Ended.

 

 

real    0m1.15s

user    0m0.01s

sys     0m0.02s

#

 

The PowerOnMins field is the total number of minutes the disk had been powered on when the defect was logged. The counter is non-volatile and increases only while the disk is powered on.  Rows marked ERR in the State column correspond to defects that still need repair. These are bad blocks that cannot be read.  If the disks are part of a software RAID set, you should launch a data consistency repair using whatever utility is appropriate for your operating system.
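PowerOnMins values are easier to read as days, hours, and minutes, which is the form used in the Days:Hrs:Min column of the FindBadBlocks.sh output later in this section. A minimal Python sketch of that conversion follows; the function name is hypothetical and the formatting is inferred from the sample output rather than taken from the script itself.

```python
def power_on_dhm(minutes):
    """Convert power-on minutes to a days:hours:minutes string."""
    days, rem = divmod(minutes, 24 * 60)   # 1440 minutes per day
    hours, mins = divmod(rem, 60)
    return f"{days}:{hours:02d}:{mins:02d}"


if __name__ == "__main__":
    # Values taken from the sample report above.
    for value in (8, 46392, 130954):
        print(value, "->", power_on_dhm(value))
    # 8 -> 0:00:08
    # 46392 -> 32:05:12
    # 130954 -> 90:22:34
```

Comparing a defect's PowerOnMins with the drive's accumulated power-on minutes also shows how recently the defect was found: on c4t15d0s0, 134325 - 130954 = 3371 powered-on minutes (a little over two days) separate the two ERR entries from the time of the report.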

 

Note that it took a little over one second to report all unrecoverable blocks for nearly one terabyte worth of storage.  The blocks that it reports were discovered during prior automated background media scans (see the -bmse function in this section).

 

Using Media Scan Results with Software RAID

BGMS not only improves data integrity by automatically rewriting failing blocks, but also provides enough information to construct a script that rebuilds software RAID volumes when the need arises. For example, if you have two disks that mirror each other (RAID-1), and smartmon-ux tells you that block #1234 is bad and unreadable, you can instruct the operating system to run a consistency repair on the volume to recover.  If the -bmsr media scan report shows no bad blocks, there is no need to run a manual check for bad blocks, which could take hours or even days on a large storage pool.
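As a rough illustration of how such a script might decide whether a repair is needed, the Python sketch below extracts the ERR rows from -bmsr output and groups them by device. The line patterns are inferred solely from the sample report shown earlier and may not hold for other releases or drives; the function and variable names are hypothetical. It classifies rows by the State column only, so a row whose State is ERR is listed even if its reassignment status says it was rewritten.

```python
import re
from collections import defaultdict

# Patterns inferred from the sample -bmsr report; treat them as assumptions.
DEVICE_RE = re.compile(r"^Discovered .* on (\S+) \(")
DEFECT_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+([0-9a-fA-F]+)\s+(OK|ERR)\s+(.*)$")


def unrecovered_blocks(report_text):
    """Map each device path to the hex block numbers whose State is ERR."""
    bad = defaultdict(list)
    device = None
    for line in report_text.splitlines():
        found = DEVICE_RE.match(line)
        if found:
            device = found.group(1)      # later defect rows belong to this disk
            continue
        row = DEFECT_RE.match(line)
        if row and device is not None and row.group(4) == "ERR":
            bad[device].append(row.group(3))
    return dict(bad)


SAMPLE = """\
Discovered SEAGATE ST3146855SS S/N "3LN2A027" on /dev/rdsk/c4t13d0s0 (Not Enabling SMART)(140014 MB)
Number of defects reported:               0
Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)
Defect#   PowerOnMins   HexBlockNumber   State   Reassignment Status             AdditionalInfo
    1        117114          2bf620f   OK      recovered via in-place rewrite  Recovered error Recovered data with retries
    2        130954               7b   ERR     waiting for WRITE               Controller/drive hardware failed Track following error
    3        130954              1c8   ERR     waiting for WRITE               Controller/drive hardware failed Track following error
"""

if __name__ == "__main__":
    for dev, blocks in unrecovered_blocks(SAMPLE).items():
        print(dev, "needs repair at blocks:", ", ".join(blocks))
```

An empty result means no disk in the report has an outstanding ERR entry, which is the case where a lengthy manual consistency check can be skipped.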

 

The script FindBadBlocks.sh uses the -bmsr function to enumerate all bad blocks and report them by slice (the equivalent of a partition).  The system administrator can then use this report to determine whether a repair is warranted for any particular volume.  The script was run against the same Solaris 10 system that supplied the scan results shown above.

 

./FindBadBlocks.sh

PhysicalDevPath    Days:Hrs:Min  Offset State

/dev/rdsk/c1t2d0s0            -       - OK

/dev/rdsk/c4t12d0s0     0:00:08  577a4b Recovered via in-place rewrite

/dev/rdsk/c4t12d0s0    32:05:12   381f8 Recovered via in-place rewrite

/dev/rdsk/c4t12d0s0    32:05:22 7598a8e Recovered via in-place rewrite

/dev/rdsk/c4t12d0s0    81:08:19 2cfae2a Recovered via in-place rewrite

/dev/rdsk/c4t12d0s0    81:08:29 9c9036c Recovered via in-place rewrite

/dev/rdsk/c4t12d0s0    91:01:36 77b3f4d Recovered via in-place rewrite

/dev/rdsk/c4t12d0s0    93:18:41 77339d3 Recovered via in-place rewrite

/dev/rdsk/c4t14d0s0     0:02:28 d99d9f7 Recovered via in-place rewrite

/dev/rdsk/c4t14d0s0     6:03:35 761f75d Recovered via in-place rewrite

/dev/rdsk/c4t15d0s0     0:02:13   37fc7 Recovered via in-place rewrite

/dev/rdsk/c4t15d0s0    81:07:54 2bf620f Recovered via in-place rewrite

/dev/rdsk/c4t15d0s0    90:22:34      7b ERR waiting for WRITE Controller/drive hardware failed Track following error

/dev/rdsk/c4t15d0s0    90:22:34     1c8 ERR waiting for WRITE Controller/drive hardware failed Track following error

/dev/rdsk/c4t15d0s0    90:22:34   37fc7 Recovered via in-place rewrite

/dev/rdsk/c4t15d0s0    91:05:52   37fc8 Recovered via in-place rewrite

/dev/rdsk/c4t15d0s0    92:15:00   38039 Recovered via in-place rewrite

/dev/rdsk/c4t15d0s0    92:21:52 d699104 Recovered via in-place rewrite

/dev/rdsk/c4t16d0s0    32:04:36 3b46c18 Recovered via in-place rewrite

/dev/rdsk/c4t16d0s0    92:13:47   80a34 Recovered via in-place rewrite

/dev/rdsk/c4t17d0s0     0:02:07   381a8 Recovered via in-place rewrite

/dev/rdsk/c4t17d0s0    32:04:58 de80f44 Recovered via in-place rewrite

/dev/rdsk/c4t17d0s0    39:05:08 3a44867 Recovered via in-place rewrite

/dev/rdsk/c4t17d0s0    60:06:35 a817a7f Recovered via in-place rewrite

/dev/rdsk/c4t17d0s0    90:07:39 de863e6 Recovered via in-place rewrite

/dev/rdsk/c4t17d0s0    90:23:51   1e240 ERR waiting for WRITE Controller/drive hardware failed Track following error

/dev/rdsk/c4t17d0s0    92:06:10 e01e8c4 Recovered via in-place rewrite

/dev/rdsk/c4t17d0s0    92:14:30    1f62 ERR waiting for WRITE Controller/drive hardware failed Track following error

/dev/rdsk/c4t17d0s0    92:14:30   8034a ERR waiting for WRITE Controller/drive hardware failed Track following error

/dev/rdsk/c4t17d0s0    92:14:30   805b4 ERR waiting for WRITE Controller/drive hardware failed Track following error

/dev/rdsk/c4t17d0s0    93:14:18 e01e8fa Recovered via in-place rewrite