Background (BGMS) Media Scan Functions
Reasonably current SCSI, FC and SAS disk drives (such as the Seagate 10K.5 family and later) have a programmable feature that lets the drive scan its own media for correctable errors during idle time. If your disk's firmware supports this capability, you can use the software to enable and configure scanning, disable it, and report the scan results.
What is Background Scanning?
The best description of background media scanning and its benefits comes from Seagate's patent #7490261, "Background media scan for recovery of data errors". The following abridged text comes from the published patent itself:
"Media defects can arise at any sector on your disk drive during the lifetime of the storage system (grown defects). These grown defects include, for example, invading foreign particles which become embedded onto the surface of the disc, or external shocks to the storage system which can cause the transducer to nick or crash onto the surface of the disc. Defective sectors pose either temporary or permanent data retrieval problems.
Read errors are typically determined when the host computer attempts to retrieve user data from a sector and one or more uncorrected errors exist. Typically, the data storage system includes internally programmed error recovery routines such that upon determination of a read error, the data storage system applies a variety of corrective operations to recover user data. Occasionally, the data storage system exhausts all available corrective operations for recovery of data without success. The data storage system will declare a hard error and reallocate the sector by mapping out the bad sector and substituting an unused, reserved sector. The use of these corrective operations and reallocation functions can require a significant amount of time during retrieval of user data and thus, limit the maximum data transfer rate of the data storage system."
Whether you are using JBOD, hardware RAID or software-based RAID, BGMS can improve reliability and data integrity with near-zero overhead, because the drive performs the scan only during idle time.
Benefits of BGMS
First, BGMS fixes bad blocks on the fly as the firmware discovers them. The disk drive uses idle time to perform multiple re-reads to correct the data. Because the bad blocks are discovered before the O/S actually needs the data on them, no programs have to suspend processing while bad blocks are repaired. If your host is streaming movies into hotel rooms, users won't suffer through a movie stopping for 5-30 seconds while the host and/or RAID subsystem goes through the data recovery/remapping process.
If you are using software RAID, BGMS can partly replace data consistency checks and give your storage farm a degree of self-healing. If a BGMS-enabled disk cannot repair a bad block, you can use the report SMARTMon-UX generates to get a list of the physical disk drives and offsets where you know you have unrecoverable data. You can then use a shell script to find the bad blocks, and either run a parity rebuild or issue a single command that repairs the bad stripe by reading the part of the RAID volume that incorporates the bad block(s). When it services that read, the RAID software discovers for itself that the data is unreadable and repairs it for you.
By exploiting the power of BGMS, you could effectively scan and repair a storage farm of any size 24x7 without the overhead incurred when the host scans and repairs bad blocks by brute force.
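The targeted read mentioned above can be sketched in Python. This is only an illustration and not part of SMARTMon-UX: the names `block_offset` and `read_blocks` are hypothetical, a 512-byte block size is assumed, and the sketch assumes the reported block number can be used directly as an offset into the path being read. On a real RAID volume the mapping from a physical disk block to a volume offset depends on the RAID layout and has to be worked out separately.

```python
BLOCK_SIZE = 512  # assumption: 512-byte logical blocks


def block_offset(hex_block, block_size=BLOCK_SIZE):
    """Convert a hexadecimal block number (as reported) to a byte offset."""
    return int(hex_block, 16) * block_size


def read_blocks(path, hex_block, count=1, block_size=BLOCK_SIZE):
    """Read `count` blocks from `path`, starting at the given hex block number.

    Returns the bytes read. An OSError raised here would suggest the region
    is unreadable through `path`.
    """
    with open(path, "rb", buffering=0) as dev:
        dev.seek(block_offset(hex_block, block_size))
        return dev.read(count * block_size)
```

For example, `read_blocks(volume_path, "1e240")` would request the single block at hex offset 1e240 from whatever device or file `volume_path` names. Whether such a read triggers a repair depends on the RAID software in use.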
Disable Background Media Scanning
The -bmsd command disables background media scanning.
Usage
Enable Background Media Scanning
The -bmse command enables background media scanning.
Usage
smartmon-ux -bmse n DeviceList
Where: n is the scanning interval, in hours. Once the disk is programmed to enable scanning, it automatically begins a new scan each time the interval elapses. If disk power is lost, the timer resets to zero and scanning resumes automatically. Send the -bmsd command to stop and disable scanning.
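If you drive the tool from a script, the command line can be assembled as in the following Python sketch. Only the `-bmse n DeviceList` form comes from the usage line above. The function name `build_bmse_command`, the default binary path, and the rule that n must be a whole number of at least 1 are assumptions made for illustration.

```python
import glob


def build_bmse_command(interval_hours, device_patterns, binary="./smartmon-ux"):
    """Return the argument list for enabling BGMS on the matching devices."""
    interval = int(interval_hours)
    if interval < 1:
        # Assumption: the tool expects a positive whole number of hours.
        raise ValueError("interval must be at least 1 hour")
    devices = []
    for pattern in device_patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the pattern unchanged when nothing matches, so the tool
        # itself can report the problem.
        devices.extend(matches or [pattern])
    return [binary, "-bmse", str(interval)] + devices
```

The resulting list can be passed to `subprocess.run(..., check=True)`. For instance, `build_bmse_command(24, ["/dev/rdsk/c4*s0"])` expands the wildcard and requests a 24-hour interval for every matching disk.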
Report Background Media Scan Results
The -bmsr command reports the results of background media scanning.
Usage
The command below was run on a SPARC Solaris 10 system that has 6 SAS disks. We prefixed it with the time command so that you can see how quickly it runs, and used a wildcard to select all disks attached to controller #4.
# time ./smartmon-ux -bmsr /dev/rdsk/c4*s0
SMARTMon-UX [Release 1.36, Build 8-JUN-2008] - Copyright 2001-2008 SANtools(R), Inc. http://www.SANtools.com
Discovered SEAGATE ST3146855SS S/N "3LN23ER0" on /dev/rdsk/c4t12d0s0 (Not Enabling SMART)(140014 MB)
Background Media Scan Report @ Sun Jun 8 16:33:03 2008
Accumulated power-on minutes: 135086 [94 days]
Number of background scans performed: 34
Background scanning status: medium scan halted, waiting for interval timer expiration
Background scan percentage completed: 0.00
Defect# PowerOnMins HexBlockNumber State Reassignment Status            AdditionalInfo
      0           8 577a4b         OK    recovered via in-place rewrite Recovered error Recovered data with retries
      1       46392 381f8          OK    recovered via in-place rewrite Recovered error Recovered data with retries
      2       46402 7598a8e        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      3      117139 2cfae2a        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      4      117149 9c9036c        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      5      131136 77b3f4d        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      6      135041 77339d3        OK    recovered via in-place rewrite Recovered error Recovered data with retries
Discovered SEAGATE ST3146855SS S/N "3LN2A027" on /dev/rdsk/c4t13d0s0 (Not Enabling SMART)(140014 MB)
Background Media Scan Report @ Sun Jun 8 16:33:03 2008
Accumulated power-on minutes: 134976 [94 days]
Number of background scans performed: 34
Background scanning status: medium scan halted, waiting for interval timer expiration
Background scan percentage completed: 0.00
Number of defects reported: 0
Discovered SEAGATE ST3146855SS S/N "3LN29PAS" on /dev/rdsk/c4t14d0s0 (Not Enabling SMART)(140014 MB)
Background Media Scan Report @ Sun Jun 8 16:33:03 2008
Accumulated power-on minutes: 134904 [94 days]
Number of background scans performed: 35
Background scanning status: medium scan halted, waiting for interval timer expiration
Background scan percentage completed: 0.00
Defect# PowerOnMins HexBlockNumber State Reassignment Status            AdditionalInfo
      0         148 d99d9f7        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      1        8855 761f75d        OK    recovered via in-place rewrite Recovered error Recovered data with retries
Discovered SEAGATE ST3146855SS S/N "3LN29ZZ5" on /dev/rdsk/c4t15d0s0 (Not Enabling SMART)(140014 MB)
Background Media Scan Report @ Sun Jun 8 16:33:04 2008
Accumulated power-on minutes: 134325 [93 days]
Number of background scans performed: 35
Background scanning status: medium scan halted, waiting for interval timer expiration
Background scan percentage completed: 0.00
Defect# PowerOnMins HexBlockNumber State Reassignment Status            AdditionalInfo
      0         133 37fc7          OK    recovered via in-place rewrite Recovered error Recovered data with retries
      1      117114 2bf620f        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      2      130954 7b             ERR   waiting for WRITE              Controller/drive hardware failed Track following error
      3      130954 1c8            ERR   waiting for WRITE              Controller/drive hardware failed Track following error
      4      130954 37fc7          OK    recovered via in-place rewrite Recovered error Recovered data with retries
      5      131392 37fc8          OK    recovered via in-place rewrite Recovered error Recovered data with retries
      6      133380 38039          OK    recovered via in-place rewrite Recovered error Recovered data with retries
      7      133792 d699104        OK    recovered via in-place rewrite Recovered error Recovered data with retries
Discovered SEAGATE ST3146855SS S/N "3LN27XJ9" on /dev/rdsk/c4t16d0s0 (Not Enabling SMART)(140014 MB)
Background Media Scan Report @ Sun Jun 8 16:33:04 2008
Accumulated power-on minutes: 134950 [94 days]
Number of background scans performed: 38
Background scanning status: medium scan halted, waiting for interval timer expiration
Background scan percentage completed: 0.00
Defect# PowerOnMins HexBlockNumber State Reassignment Status            AdditionalInfo
      0       46356 3b46c18        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      1      133307 80a34          ERR   recovered via in-place rewrite Controller/drive hardware failed Track following error
Discovered SEAGATE ST3146855SS S/N "3LN29QG4" on /dev/rdsk/c4t17d0s0 (SMART enabled)(140014 MB)
Background Media Scan Report @ Sun Jun 8 16:33:04 2008
Accumulated power-on minutes: 134993 [94 days]
Number of background scans performed: 35
Background scanning status: medium scan halted, waiting for interval timer expiration
Background scan percentage completed: 0.00
Defect# PowerOnMins HexBlockNumber State Reassignment Status            AdditionalInfo
      0         127 381a8          OK    recovered via in-place rewrite Recovered error Recovered data with retries
      1       46378 de80f44        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      2       56468 3a44867        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      3       86795 a817a7f        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      4      130059 de863e6        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      5      131031 1e240          ERR   waiting for WRITE              Controller/drive hardware failed Track following error
      6      132850 e01e8c4        OK    recovered via in-place rewrite Recovered error Recovered data with retries
      7      133350 1f62           ERR   waiting for WRITE              Controller/drive hardware failed Track following error
      8      133350 8034a          ERR   waiting for WRITE              Controller/drive hardware failed Track following error
      9      133350 805b4          ERR   waiting for WRITE              Controller/drive hardware failed Track following error
     10      134778 e01e8fa        OK    recovered via in-place rewrite Recovered error Recovered data with retries
Program Ended.
real 0m1.15s
user 0m0.01s
sys 0m0.02s
#
The PowerOnMins field records the drive's accumulated power-on minutes at the time each defect was logged. The counter is non-volatile and advances only while the disk is powered on. Rows marked ERR correspond to defects that still need repair: these are bad blocks that cannot be read. If the disks are part of a software RAID set, you should launch a data consistency repair using whatever utility is appropriate for your operating system.
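Picking the ERR rows out of a saved -bmsr report can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the tool is assumed to print one defect per line in the column order given by the report header (Defect#, PowerOnMins, HexBlockNumber, State, then free text), and the names `unrecovered_blocks`, `ROW` and `DISCOVERED` are hypothetical.

```python
import re

# Assumed row layout: Defect# PowerOnMins HexBlockNumber State <free text>
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([0-9a-fA-F]+)\s+(OK|ERR)\s+(.*)$")
# Device path is taken from the "Discovered ... on <path>" banner line.
DISCOVERED = re.compile(r"^Discovered .* on (\S+)")

SAMPLE = (
    'Discovered SEAGATE ST3146855SS S/N "3LN27XJ9" on /dev/rdsk/c4t16d0s0 '
    "(Not Enabling SMART)(140014 MB)\n"
    "0 46356 3b46c18 OK recovered via in-place rewrite "
    "Recovered error Recovered data with retries\n"
    "1 133307 80a34 ERR recovered via in-place rewrite "
    "Controller/drive hardware failed Track following error\n"
)


def unrecovered_blocks(report_text):
    """Return (device, hex_block, power_on_minutes) for every ERR row."""
    device = None
    found = []
    for line in report_text.splitlines():
        banner = DISCOVERED.match(line)
        if banner:
            device = banner.group(1)
            continue
        row = ROW.match(line)
        if row and row.group(4) == "ERR":
            found.append((device, row.group(3), int(row.group(2))))
    return found
```

Calling `unrecovered_blocks(SAMPLE)` returns one tuple per ERR row, which a script could then hand to whatever consistency-repair utility the operating system provides.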
Note that it took a little over one second to report every logged defect across six 140 GB disks (roughly 840 GB of storage). The blocks it reports were discovered during prior automated background media scans (see the -bmse function in this section).
Using Media Scan Results with Software RAID
BGMS not only improves data integrity by automatically rewriting failing blocks, but also provides enough information to construct a script that rebuilds software RAID volumes when the need arises. For example, if you have two disks that mirror each other (RAID-1), and smartmon-ux tells you that block #1234 is bad and unreadable, you can instruct the operating system to run a consistency repair on the volume to recover. If the -bmsr report shows no bad blocks, there is no need to run a manual check for bad blocks, which could take hours or even days on a large storage pool.
The script FindBadBlocks.sh uses the -bmsr function to enumerate all bad blocks and report them by slice (the equivalent of a partition). The system administrator can then use this listing to determine whether a repair is warranted for any particular volume. This script was run against the same Solaris 10 system that supplied the scan results shown above.
./FindBadBlocks.sh
PhysicalDevPath     Days:Hrs:Min Offset  State
/dev/rdsk/c1t2d0s0  -            -       OK
/dev/rdsk/c4t12d0s0 0:00:08      577a4b  Recovered via in-place rewrite
/dev/rdsk/c4t12d0s0 32:05:12     381f8   Recovered via in-place rewrite
/dev/rdsk/c4t12d0s0 32:05:22     7598a8e Recovered via in-place rewrite
/dev/rdsk/c4t12d0s0 81:08:19     2cfae2a Recovered via in-place rewrite
/dev/rdsk/c4t12d0s0 81:08:29     9c9036c Recovered via in-place rewrite
/dev/rdsk/c4t12d0s0 91:01:36     77b3f4d Recovered via in-place rewrite
/dev/rdsk/c4t12d0s0 93:18:41     77339d3 Recovered via in-place rewrite
/dev/rdsk/c4t14d0s0 0:02:28      d99d9f7 Recovered via in-place rewrite
/dev/rdsk/c4t14d0s0 6:03:35      761f75d Recovered via in-place rewrite
/dev/rdsk/c4t15d0s0 0:02:13      37fc7   Recovered via in-place rewrite
/dev/rdsk/c4t15d0s0 81:07:54     2bf620f Recovered via in-place rewrite
/dev/rdsk/c4t15d0s0 90:22:34     7b      ERR waiting for WRITE Controller/drive hardware failed Track following error
/dev/rdsk/c4t15d0s0 90:22:34     1c8     ERR waiting for WRITE Controller/drive hardware failed Track following error
/dev/rdsk/c4t15d0s0 90:22:34     37fc7   Recovered via in-place rewrite
/dev/rdsk/c4t15d0s0 91:05:52     37fc8   Recovered via in-place rewrite
/dev/rdsk/c4t15d0s0 92:15:00     38039   Recovered via in-place rewrite
/dev/rdsk/c4t15d0s0 92:21:52     d699104 Recovered via in-place rewrite
/dev/rdsk/c4t16d0s0 32:04:36     3b46c18 Recovered via in-place rewrite
/dev/rdsk/c4t16d0s0 92:13:47     80a34   Recovered via in-place rewrite
/dev/rdsk/c4t17d0s0 0:02:07      381a8   Recovered via in-place rewrite
/dev/rdsk/c4t17d0s0 32:04:58     de80f44 Recovered via in-place rewrite
/dev/rdsk/c4t17d0s0 39:05:08     3a44867 Recovered via in-place rewrite
/dev/rdsk/c4t17d0s0 60:06:35     a817a7f Recovered via in-place rewrite
/dev/rdsk/c4t17d0s0 90:07:39     de863e6 Recovered via in-place rewrite
/dev/rdsk/c4t17d0s0 90:23:51     1e240   ERR waiting for WRITE Controller/drive hardware failed Track following error
/dev/rdsk/c4t17d0s0 92:06:10     e01e8c4 Recovered via in-place rewrite
/dev/rdsk/c4t17d0s0 92:14:30     1f62    ERR waiting for WRITE Controller/drive hardware failed Track following error
/dev/rdsk/c4t17d0s0 92:14:30     8034a   ERR waiting for WRITE Controller/drive hardware failed Track following error
/dev/rdsk/c4t17d0s0 92:14:30     805b4   ERR waiting for WRITE Controller/drive hardware failed Track following error
/dev/rdsk/c4t17d0s0 93:14:18     e01e8fa Recovered via in-place rewrite
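The Days:Hrs:Min column appears to be the PowerOnMins value from the -bmsr report rewritten as days, hours and minutes; the values in the two listings are consistent with that reading. The Python sketch below shows that conversion. The source of FindBadBlocks.sh is not shown here, so the function name `format_power_on` and the exact formatting are assumptions.

```python
def format_power_on(minutes):
    """Format accumulated power-on minutes as Days:Hrs:Min."""
    days, remainder = divmod(minutes, 24 * 60)
    hours, mins = divmod(remainder, 60)
    return f"{days}:{hours:02d}:{mins:02d}"
```

For instance, the defect logged at 46392 power-on minutes on /dev/rdsk/c4t12d0s0 works out to 32 days, 5 hours and 12 minutes, which matches the 32:05:12 entry for offset 381f8.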