

Replacing faulty disks in a ZFS RAID System

Here are my experiences with a faulty hard disk drive on a freshly installed FreeBSD 11.0 system, where I had used the auto ZFS option during the install to create a raidz2 pool with 4 disks.

The initial symptoms were that Xorg had frozen and some other services had crashed.

I was still able to ssh into the machine, and /var/log/messages showed a lot of disk I/O errors for ada0.
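To get a rough feel for how often a device shows up in the log, something like the following can help. It is only a sketch: `count_device_lines` is a name made up for this example, and it counts every line that mentions the device, not only error lines.

```shell
# Count the lines in a log file that mention a device such as ada0.
# Usage: count_device_lines DEVICE LOGFILE
count_device_lines() {
    device=$1
    logfile=$2
    # -c prints the number of matching lines instead of the lines themselves
    # -w matches the name as a whole word, so ada0 does not match ada01
    grep -c -w -- "$device" "$logfile"
}
```

For example, `count_device_lines ada0 /var/log/messages` prints the number of matching lines.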

So I decided to scrub the pool and check its status:

# zpool scrub zroot

# zpool status
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Apr 14 13:34:02 2017
        662G scanned out of 1.66T at 124M/s, 2h22m to go
        1.06M repaired, 38.96% done
config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    ada0p3  ONLINE       3    10    19  (repairing)
	    ada1p3  ONLINE       0     0     0
	    ada2p3  ONLINE       0     0     0
	    ada3p3  ONLINE       0     0     0
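With more disks the suspicious rows are harder to spot by eye. The following is a minimal sketch that filters `zpool status` output down to the devices with a non-zero READ, WRITE or CKSUM counter. The function name `vdevs_with_errors` is made up for this example, and it assumes the column layout shown above.

```shell
# Print every device row of `zpool status` whose READ, WRITE or CKSUM
# counter is non-zero. Reads the status text from standard input.
vdevs_with_errors() {
    awk '
        # The device table starts right after the NAME/STATE header row
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # A blank line ends the table
        in_table && NF == 0 { in_table = 0 }
        # Columns 3 to 5 are the READ, WRITE and CKSUM counters
        in_table && NF >= 5 && ($3 != 0 || $4 != 0 || $5 != 0) {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}
```

It would be used as `zpool status zroot | vdevs_with_errors`.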

After the scrub completed I was still getting a lot of I/O errors in /var/log/messages, so I decided to replace the SATA cable on my ada0 drive.

The motherboard manual came in handy here to determine which drive was ada0. Alternatively, you could use something like the following to find the drive's serial number:

# dmesg | grep -B1 'ada0: Serial Number'
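To list the serial numbers of all disks at once, the same dmesg lines can be filtered with awk. This is a sketch only: `disk_serials` is a hypothetical helper, and it assumes the kernel prints lines of the form `ada0: Serial Number ...`, which is what the grep above relies on.

```shell
# Print "device serial" pairs from dmesg-style text on standard input.
disk_serials() {
    awk '
        # Match lines such as "ada0: Serial Number <serial>"
        $2 == "Serial" && $3 == "Number" {
            sub(/:$/, "", $1)   # drop the trailing colon from "ada0:"
            print $1, $4
        }
    '
}
```

It would be used as `dmesg | disk_serials`, and the serial numbers can then be compared with the labels printed on the drives.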

After replacing the SATA cable there were still I/O errors in /var/log/messages. Worse still, after a reboot the system didn't want to boot at all and I got the following error:

gptzfsboot: error 16 lba XXXXXXXXX

Decimal 16 is hex 0x10, written as 10h in the list below.

The following status codes are quoted from an external reference:

00h  successful completion
01h  invalid function in AH or invalid parameter
02h  address mark not found
03h  disk write-protected
04h  sector not found/read error
05h  reset failed (hard disk)
05h  data did not verify correctly (TI Professional PC)
06h  disk changed (floppy)
07h  drive parameter activity failed (hard disk)
08h  DMA overrun
09h  data boundary error (attempted DMA across 64K boundary or >80h sectors)
0Ah  bad sector detected (hard disk)
0Bh  bad track detected (hard disk)
0Ch  unsupported track or invalid media
0Dh  invalid number of sectors on format (PS/2 hard disk)
0Eh  control data address mark detected (hard disk)
0Fh  DMA arbitration level out of range (hard disk)
10h  uncorrectable CRC or ECC error on read
11h  data ECC corrected (hard disk)
20h  controller failure
31h  no media in drive (IBM/MS INT 13 extensions)
32h  incorrect drive type stored in CMOS (Compaq)
40h  seek failed
80h  timeout (not ready)
AAh  drive not ready (hard disk)
B0h  volume not locked in drive (INT 13 extensions)
B1h  volume locked in drive (INT 13 extensions)
B2h  volume not removable (INT 13 extensions)
B3h  volume in use (INT 13 extensions)
B4h  lock count exceeded (INT 13 extensions)
B5h  valid eject request failed (INT 13 extensions)
B6h  volume present but read protected (INT 13 extensions)
BBh  undefined error (hard disk)
CCh  write fault (hard disk)
E0h  status register error (hard disk)
FFh  sense operation failed (hard disk)

So error 16 is 10h: uncorrectable CRC or ECC error on read.
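The decimal-to-hex step can be left to printf. A small sketch, where `to_status_hex` is a name made up for this example, turns the decimal number from the gptzfsboot message into the two-digit form used in the list above:

```shell
# Convert a decimal error number into the "NNh" notation of the status list.
to_status_hex() {
    # %02X prints the value as upper-case hex, zero-padded to two digits
    printf '%02Xh\n' "$1"
}
```

Calling `to_status_hex 16` prints `10h`.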

Time to try a replacement hard disk!

So, going back to my original zpool status output:

config:

NAME        STATE     READ WRITE CKSUM
zroot       ONLINE       0     0     0
  raidz2-0  ONLINE       0     0     0
    ada0p3  ONLINE       3    10    19  (repairing)
    ada1p3  ONLINE       0     0     0
    ada2p3  ONLINE       0     0     0
    ada3p3  ONLINE       0     0     0

If the faulty disk is still running, take it offline first. This will allow you to easily reattach the drive again if something goes wrong later.

# zpool offline zroot ada0p3

Shut down the system, replace the faulty disk drive, and power the system back on.

# zpool status

config:

NAME                      STATE     READ WRITE CKSUM
zroot                     DEGRADED     0     0     0
  raidz2-0                DEGRADED     0     0     0
    15788859347225537330  UNAVAIL      0     0     0  was ada0
    ada1p3                ONLINE       0     0     0
    ada2p3                ONLINE       0     0     0
    ada3p3                ONLINE       0     0     0
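The long number is the identifier ZFS shows in place of the missing device, and it is easy to mistype. As a sketch (the function name `unavailable_vdevs` is made up here), it can be pulled out of the status output instead of being copied by hand:

```shell
# Print the name column of every device whose state is UNAVAIL.
# Reads `zpool status` text from standard input.
unavailable_vdevs() {
    awk '$2 == "UNAVAIL" { print $1 }'
}
```

It would be used as `zpool status zroot | unavailable_vdevs`, and the printed value is what the zpool online, offline and replace commands below expect.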

# zpool online zroot 15788859347225537330

# zpool replace zroot 15788859347225537330 ada0p3

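zpool replace complains about missing labels, so the new disk needs to be partitioned first. Take the device offline again and check how the other disks are laid out: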

# zpool offline zroot 15788859347225537330

# ls /dev/ada*
/dev/ada0    /dev/ada1p3  /dev/ada2p3  /dev/ada3p3
/dev/ada1    /dev/ada2    /dev/ada3
/dev/ada1p1  /dev/ada2p1  /dev/ada3p1
/dev/ada1p2  /dev/ada2p2  /dev/ada3p2

# gpart show
=>       40  976773088  ada1  GPT  (466G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194303     2  freebsd-swap  (2.0G)
    4196352  972576768     3  freebsd-zfs  (464G)
  976773120          8        - free -  (4.0K)

=>       40  976773088  ada2  GPT  (466G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194303     2  freebsd-swap  (2.0G)
    4196352  972576768     3  freebsd-zfs  (464G)
  976773120          8        - free -  (4.0K)

=>       40  976773088  ada3  GPT  (466G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194303     2  freebsd-swap  (2.0G)
    4196352  972576768     3  freebsd-zfs  (464G)
  976773120          8        - free -  (4.0K)

=>       40  976773088  ada0  GPT  (466G)

# gpart show -l ada1
=>       40  976773088  ada1  GPT  (466G)
         40       1024     1  gptboot1  (512K)
       1064        984        - free -  (492K)
       2048    4194303     2  swap1  (2.0G)
    4196352  972576768     3  zfs1  (464G)
  976773120          8        - free -  (4.0K)
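The first two columns of the gpart output are the start and size in sectors, which is where the sizes for the gpart add commands come from. Here is a minimal sketch of the conversion, assuming 512-byte sectors (an assumption, since the sector size is not shown above). `sectors_to_kib` is a name made up for this example.

```shell
# Convert a sector count into KiB.
# Usage: sectors_to_kib SECTORS [BYTES_PER_SECTOR]
sectors_to_kib() {
    # 512 bytes per sector is an assumption; pass 4096 for 4K-native disks
    sector_bytes=${2:-512}
    echo $(( $1 * sector_bytes / 1024 ))
}
```

With that assumption, `sectors_to_kib 1024` prints 512, which matches the 512K boot partition, and `sectors_to_kib 984` prints 492, which matches the free gap after it.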

# gpart add -a 4k -s 512k -l gptboot0 -t freebsd-boot ada0
# gpart add -b 1m -s 2g -l swap0 -t freebsd-swap ada0
# gpart add -a 4k -l zfs0 -t freebsd-zfs ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
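Since only the disk name and the label suffix change from disk to disk, the four commands can be generated rather than retyped. The sketch below only prints them so they can be checked before anything touches the disk. `print_partition_cmds` is a hypothetical helper, and it assumes the new disk should get exactly the same layout and label scheme as above.

```shell
# Print the gpart commands that partition a replacement disk.
# Usage: print_partition_cmds DISK INDEX   (e.g. print_partition_cmds ada0 0)
print_partition_cmds() {
    disk=$1
    idx=$2
    # Nothing is executed here; the commands are only written to stdout
    cat <<EOF
gpart add -a 4k -s 512k -l gptboot${idx} -t freebsd-boot ${disk}
gpart add -b 1m -s 2g -l swap${idx} -t freebsd-swap ${disk}
gpart add -a 4k -l zfs${idx} -t freebsd-zfs ${disk}
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ${disk}
EOF
}
```

After reviewing the output of `print_partition_cmds ada0 0`, it could be piped to sh as root to run the commands.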

# zpool online zroot 15788859347225537330

# zpool replace zroot 15788859347225537330 ada0p3

# zpool status

The status output should now show replacing-0 and ada0p3 resilvering.
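Resilvering can take hours, so it may be handy to check for it from a script. The sketch below tests whether the status text still reports a resilver in progress. `resilver_in_progress` is a name made up for this example, and it assumes the scan line reads "resilver in progress", in the same style as the "scrub in progress" line shown earlier.

```shell
# Succeed (exit status 0) while the status text on standard input
# reports a running resilver.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# Possible use, polling once a minute until the resilver has finished:
#   while zpool status zroot | resilver_in_progress; do sleep 60; done
```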

Note: if you make a mistake with the 'gpart add' partitioning, you can delete the partitions as follows:

# gpart delete -i 3 ada0
# gpart delete -i 2 ada0
# gpart delete -i 1 ada0

zfsvdevs.1492233999.txt.gz · Last modified: 2017/04/15 05:26 by matti.k
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International