This page documents my experiences in converting a 3-way RAIDZ1 pool into a 6-way RAIDZ2 pool (adding 3 new disks to the 3 original disks), with minimal downtime. The system boots off a UFS gmirror and this remains unchanged (though I have allocated equivalent space on the new disks to allow potential future conversion to ZFS).
In my case, I was expanding a pool named tank, using partition 5 of disks ada0, ada1 and ada2, by adding 3 new disks ada3, da0 and da1. (Despite the names, the latter are SATA disks, attached to a 3ware 9650SE-2LP since I'd run out of motherboard ports.)
The starting configuration was:

server# gpart show
=>        34  1953522988  ada0  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128     6291456     2  freebsd-ufs  (3.0G)
     6291584     6291456     3  freebsd-swap  (3.0G)
    12583040     4194304     4  freebsd-ufs  (2.0G)
    16777344  1936745678     5  freebsd-zfs  (924G)

=>        34  1953522988  ada1  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128     6291456     2  freebsd-ufs  (3.0G)
     6291584     6291456     3  freebsd-swap  (3.0G)
    12583040     4194304     4  freebsd-ufs  (2.0G)
    16777344  1936745678     5  freebsd-zfs  (924G)

=>        34  1953525101  ada2  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128     6291456     2  freebsd-ufs  (3.0G)
     6291584     6291456     3  freebsd-swap  (3.0G)
    12583040     4194304     4  freebsd-ufs  (2.0G)
    16777344  1936747791     5  freebsd-zfs  (924G)

server# zpool status -v
  pool: tank
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 14h22m with 0 errors on Thu Oct 28 18:22:28 2010
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0p5  ONLINE       0     0     0
            ada1p5  ONLINE       0     0     0
            ada2p5  ONLINE       0     0     0

errors: No known data errors
And the final configuration, for comparison:

server% gpart show
=>        34  1953525101  da0  GPT  (932G)
          34          94    1  freebsd-boot  (47K)
         128    10485760    2  freebsd-zfs  (5.0G)
    10485888     6291456    3  freebsd-swap  (3.0G)
    16777344  1936747791    5  freebsd-zfs  (924G)

=>        34  1953525101  da1  GPT  (932G)
          34          94    1  freebsd-boot  (47K)
         128    10485760    2  freebsd-zfs  (5.0G)
    10485888     6291456    3  freebsd-swap  (3.0G)
    16777344  1936747791    5  freebsd-zfs  (924G)

=>        34  1953522988  ada0  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128     6291456     2  freebsd-ufs  (3.0G)
     6291584     6291456     3  freebsd-swap  (3.0G)
    12583040     4194304     4  freebsd-ufs  (2.0G)
    16777344  1936745678     5  freebsd-zfs  (924G)

=>        34  1953522988  ada1  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128     6291456     2  freebsd-ufs  (3.0G)
     6291584     6291456     3  freebsd-swap  (3.0G)
    12583040     4194304     4  freebsd-ufs  (2.0G)
    16777344  1936745678     5  freebsd-zfs  (924G)

=>        34  1953525101  ada2  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128     6291456     2  freebsd-ufs  (3.0G)
     6291584     6291456     3  freebsd-swap  (3.0G)
    12583040     4194304     4  freebsd-ufs  (2.0G)
    16777344  1936747791     5  freebsd-zfs  (924G)

=>        34  1953525101  ada3  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128    10485760     2  freebsd-zfs  (5.0G)
    10485888     6291456     3  freebsd-swap  (3.0G)
    16777344  1936747791     5  freebsd-zfs  (924G)

server% zpool status
  pool: tank
 state: ONLINE
 scrub: scrub completed after 3h54m with 0 errors on Sat Nov 6 16:35:22 2010
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            ada0p5  ONLINE       0     0     0
            ada1p5  ONLINE       0     0     0
            ada2p5  ONLINE       0     0     0
            ada3p5  ONLINE       0     0     0
            da0p5   ONLINE       0     0     0
            da1p5   ONLINE       0     0     0

errors: No known data errors
The overall process is:

1. Partition the new disks, splitting the space intended for ZFS on each into two equal halves.
2. Create the new 6-way RAIDZ2 pool ('tank2') from the six half-sized partitions.
3. Copy the bulk of the data from 'tank' to 'tank2' using a recursive snapshot and 'zfs send | zfs recv', while the system remains in normal use.
4. Drop to single-user mode, send the final incremental changes, and swap the pool names so that 'tank2' becomes 'tank'.
5. Replace the three temporary half-sized vdevs with the ZFS partitions on the original disks, one at a time, letting each resilver complete.
6. Delete the temporary partitions, grow the remaining ZFS partitions on the new disks to full size and let the pool expand.
In detail:
In my case, I have root and swap on the same disks, so I needed to carve out space for those. Even if you use the disks solely for ZFS, it's probably a good idea to partition them and leave a couple of MB unused, in case a replacement disk turns out to be slightly smaller. As shown above, the old disks have 5 partitions (boot, UFS root, swap, UFS /var and ZFS). My long-term plan is to switch to a ZFS root, so on the new disks I combined the space allocated to the two UFS partitions into one, but skipped p4 so that the ZFS partition remained at p5 for consistency.
Apart from the boot partition, all partitions are aligned on 8-sector boundaries to simplify a possible future migration to 4KiB-sector disks.
Initially, the space set aside for ZFS on each new disk is split into two equal pieces:
for i in da0 da1 ada3; do
	gpart add -b 34 -s 94 -t freebsd-boot $i                    # boot code
	gpart add -b 128 -s 10485760 -i 2 -t freebsd-zfs $i         # future ZFS root
	gpart add -b 10485888 -s 6291456 -i 3 -t freebsd-swap $i    # swap
	gpart add -b 16777344 -s 968373895 -i 5 -t freebsd-zfs $i   # first half of the ZFS area
	gpart add -b 985151239 -s 968373895 -i 6 -t freebsd-zfs $i  # second (temporary) half
	gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 $i
done
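If a brand-new disk has no partition table at all, the gpart add commands will fail; in that case a GPT scheme needs to be created on each disk first, along the lines of:

for i in da0 da1 ada3; do gpart create -s gpt $i; done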
At this stage, my disk layout is:
# gpart show
=>        34  1953522988  ada0  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128     6291456     2  freebsd-ufs  (3.0G)
     6291584     6291456     3  freebsd-swap  (3.0G)
    12583040     4194304     4  freebsd-ufs  (2.0G)
    16777344  1936745678     5  freebsd-zfs  (924G)

=>        34  1953522988  ada1  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128     6291456     2  freebsd-ufs  (3.0G)
     6291584     6291456     3  freebsd-swap  (3.0G)
    12583040     4194304     4  freebsd-ufs  (2.0G)
    16777344  1936745678     5  freebsd-zfs  (924G)

=>        34  1953525101  ada2  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128     6291456     2  freebsd-ufs  (3.0G)
     6291584     6291456     3  freebsd-swap  (3.0G)
    12583040     4194304     4  freebsd-ufs  (2.0G)
    16777344  1936747791     5  freebsd-zfs  (924G)

=>        34  1953525101  ada3  GPT  (932G)
          34          94     1  freebsd-boot  (47K)
         128    10485760     2  freebsd-zfs  (5.0G)
    10485888     6291456     3  freebsd-swap  (3.0G)
    16777344   968373895     5  freebsd-zfs  (462G)
   985151239   968373895     6  freebsd-zfs  (462G)
  1953525134           1        - free -  (512B)

=>        34  1953525101  da0  GPT  (932G)
          34          94    1  freebsd-boot  (47K)
         128    10485760    2  freebsd-zfs  (5.0G)
    10485888     6291456    3  freebsd-swap  (3.0G)
    16777344   968373895    5  freebsd-zfs  (462G)
   985151239   968373895    6  freebsd-zfs  (462G)
  1953525134           1       - free -  (512B)

=>        34  1953525101  da1  GPT  (932G)
          34          94    1  freebsd-boot  (47K)
         128    10485760    2  freebsd-zfs  (5.0G)
    10485888     6291456    3  freebsd-swap  (3.0G)
    16777344   968373895    5  freebsd-zfs  (462G)
   985151239   968373895    6  freebsd-zfs  (462G)
  1953525134           1       - free -  (512B)
If the new disks have previously been used, particularly for ZFS, it's a good idea to zero out the first and last 512KiB or more - which is where ZFS stores its vdev labels.
for i in da0 da1 ada3; do
	dd if=/dev/zero of=/dev/${i}p5 count=1024
	dd if=/dev/zero of=/dev/${i}p5 seek=968372000
	dd if=/dev/zero of=/dev/${i}p6 count=1024
	dd if=/dev/zero of=/dev/${i}p6 seek=968372000
done
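If you want to double-check that no stale labels survived, zdb can be asked to dump whatever labels it finds on a partition; after the zeroing above it should report that it fails to unpack all four labels, for example:

zdb -l /dev/da0p5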
I wanted my final pool configuration to have all the vdevs in alphabetical order, so I allocated the temporary vdevs first.
zpool create tank2 raidz2 da0p6 da1p6 ada3p6 ada3p5 da0p5 da1p5
zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
tank   2.70T  2.34T   369G    86%  ONLINE  -
tank2  2.70T   202K  2.70T     0%  ONLINE  -
By using ZFS snapshots, I can transfer the majority of the pool contents over to the new pool without impacting normal system operation. This significantly reduces the necessary system outage.
I recommend the use of ports/misc/mbuffer or similar between the “send” and “recv” to improve throughput (I used my own equivalent tool but have left it out of the following commands).
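For illustration, with mbuffer installed the full send below would become something like the following (the block size and buffer size here are arbitrary examples, not tuned values):

zfs send -R tank@20101104bu | mbuffer -s 128k -m 1G | zfs recv -vuF -d tank2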
Note that the '-u' option to 'zfs recv' is crucial - otherwise the filesystems copied to 'tank2' will be automatically mounted. Where any filesystems have 'mountpoint' specified, this would result in the 'tank2' filesystem being mounted over the equivalent 'tank' filesystem.
zfs snapshot -r tank@20101104bu
zfs send -R tank@20101104bu | zfs recv -vuF -d tank2
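To confirm that '-u' did its job and nothing from 'tank2' has been mounted over the live filesystems, the mounted property can be listed:

zfs list -r -o name,mountpoint,mounted tank2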
If you are paranoid, you can then do a scrub on 'tank2'. This will not be especially quick because having multiple vdevs per physical disk causes additional seeking between vdevs.
zpool scrub tank2
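The scrub runs in the background, so its progress (and the eventual result) can be checked with:

zpool status tank2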
This step entails a system outage but should be relatively quick because the bulk of the data was copied in the previous step and this step just needs to copy changes since the snapshot was taken. In my case, this took approximately 25 minutes but that included a second send/recv onto my external backup disk as well as a couple of mistakes.
In order to prevent any updates, the system should be brought down to single-user mode:
shutdown now
Once nothing is writing to ZFS, a second snapshot can be taken and transferred to the new pool. The rollback is needed if tank2 has been altered since the previous 'zfs recv' (this includes atime updates).
zfs snapshot -r tank@20101105bu
zfs rollback -R tank2@20101104bu
zfs send -R -I tank@20101104bu tank@20101105bu | zfs recv -vu -d tank2
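As a quick check that the incremental stream applied cleanly, the snapshots now on the new pool can be listed; every filesystem should show the 20101105bu snapshot:

zfs list -r -t snapshot tank2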
The original pool is now renamed by exporting it and re-importing it under a new name, then exporting it again so that it is unmounted and out of the way.
zpool export tank
zpool import tank tanko
zpool export tanko
And the new pool is renamed to the desired name, again via export/import.
zpool export tank2
zpool import tank2 tank
The system can now be returned to multiuser mode and any required testing performed.
exit
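The kind of checks worth doing at this point would be along these lines (adapt to your own filesystem layout):

zpool status tank
zfs list -r -o name,used,mountpoint tank
df -h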
I didn't explicitly destroy the old pool, but just wiped the vdev labels as I reused each disk. This gave me slightly more scope for recovery: since the old pool was RAIDZ1, I could (in theory) still have recreated it even after reusing the first disk.
Note that the resilver appears to be achieved by regenerating the new member's contents from the other members of the RAIDZ vdev, rather than simply copying the device being replaced (though normal filesystem writes do appear to be directed to it).
First disk:
dd if=/dev/zero of=/dev/ada0p5 count=1024
dd if=/dev/zero of=/dev/ada0p5 seek=1936744000
zpool replace tank da0p6 ada0p5
In my case, this took 7h20m to resilver 342G.
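The resilver progress (and the temporary 'replacing' vdev that appears while it runs) can be watched with:

zpool status tank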
Second disk:
dd if=/dev/zero of=/dev/ada1p5 count=1024
dd if=/dev/zero of=/dev/ada1p5 seek=1936744000
zpool replace tank da1p6 ada1p5
In my case, this took 4h53m to resilver 341G.
Third disk:
dd if=/dev/zero of=/dev/ada2p5 count=1024
dd if=/dev/zero of=/dev/ada2p5 seek=1936744000
zpool replace tank ada3p6 ada2p5
In my case, this took 5h41m to resilver 342G.
At this point the pool is spread across all 6 disks but is still limited to ~500GB per vdev member.
In order to expand the pool, the vdevs on the 3 new disks need to be resized. It was not possible to expand a gpart partition in place on this FreeBSD release, so the partitions have to be deleted and recreated, which requires another (short) outage.
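As an aside, more recent FreeBSD releases have a 'gpart resize' verb, so after deleting the p6 partitions the p5 partitions could probably be grown in place rather than deleted and recreated (untested here), something like:

gpart resize -i 5 da0
gpart resize -i 5 da1
gpart resize -i 5 ada3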
For safety (to avoid confusing ZFS), the vdev labels at the end of the temporary p6 vdevs were destroyed first, since they would otherwise appear at the end of the expanded p5 vdevs.
dd if=/dev/zero of=/dev/da0p6 seek=968372000
dd if=/dev/zero of=/dev/da1p6 seek=968372000
dd if=/dev/zero of=/dev/ada3p6 seek=968372000
The system needs to be placed in single-user mode to allow the partitions and pool to be manipulated:
shutdown now
Once in single-user mode, all three p6 partitions can be deleted and the p5 partitions expanded (by deleting them and recreating them at the larger size):
zpool export tank
for i in da0 da1 ada3; do
	gpart delete -i 6 $i
	gpart delete -i 5 $i
	gpart add -b 16777344 -i 5 -t freebsd-zfs -s 1936747791 $i
done
zpool import tank
The pool has now expanded to its full size, around 4TB of usable space (the SIZE reported by 'zpool list' is the raw capacity, including parity):
zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
back2  1.81T  1.57T   243G    86%  ONLINE  -
tank   5.41T  2.02T  3.39T    37%  ONLINE  -
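One caveat for anyone repeating this on a newer ZFS version: later pool versions do not pick up the extra space automatically on import unless the autoexpand property is set (or the devices are expanded explicitly), along the lines of:

zpool set autoexpand=on tank
zpool online -e tank da0p5 da1p5 ada3p5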
And the system can be restarted:
exit
Remember to add the new disks to any per-disk monitoring, e.g. daily_status_smart_devices in /etc/periodic.conf.
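Assuming the smartmontools periodic script, the relevant /etc/periodic.conf entry would end up looking something like:

daily_status_smart_devices="/dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 /dev/da0 /dev/da1"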