====== Expanding a ZFS RAID System ======
This page documents my experiences in converting a 3-way RAIDZ1 pool into a 6-way RAIDZ2 pool (adding 3 new disks to the 3 original disks), with minimal downtime. The system boots off a UFS gmirror and this remains unchanged (though I have allocated equivalent space on the new disks to allow a potential future conversion to ZFS).
In my case, I was expanding a pool named ''tank'' built on partition ''5'' of disks ''ada0'', ''ada1'' and ''ada2'',
by adding 3 new disks: ''ada3'', ''da0'' and ''da1''. (Despite the names, the latter two are also SATA disks, attached to
a 3ware 9650SE-2LP since I'd run out of motherboard ports.)
===== Original Configuration =====
server# gpart show
=> 34 1953522988 ada0 GPT (932G)
34 94 1 freebsd-boot (47K)
128 6291456 2 freebsd-ufs (3.0G)
6291584 6291456 3 freebsd-swap (3.0G)
12583040 4194304 4 freebsd-ufs (2.0G)
16777344 1936745678 5 freebsd-zfs (924G)
=> 34 1953522988 ada1 GPT (932G)
34 94 1 freebsd-boot (47K)
128 6291456 2 freebsd-ufs (3.0G)
6291584 6291456 3 freebsd-swap (3.0G)
12583040 4194304 4 freebsd-ufs (2.0G)
16777344 1936745678 5 freebsd-zfs (924G)
=> 34 1953525101 ada2 GPT (932G)
34 94 1 freebsd-boot (47K)
128 6291456 2 freebsd-ufs (3.0G)
6291584 6291456 3 freebsd-swap (3.0G)
12583040 4194304 4 freebsd-ufs (2.0G)
16777344 1936747791 5 freebsd-zfs (924G)
server# zpool status -v
pool: tank
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scrub: scrub completed after 14h22m with 0 errors on Thu Oct 28 18:22:28 2010
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1 ONLINE 0 0 0
ada0p5 ONLINE 0 0 0
ada1p5 ONLINE 0 0 0
ada2p5 ONLINE 0 0 0
errors: No known data errors
===== Final Configuration =====
server% gpart show
=> 34 1953525101 da0 GPT (932G)
34 94 1 freebsd-boot (47K)
128 10485760 2 freebsd-zfs (5.0G)
10485888 6291456 3 freebsd-swap (3.0G)
16777344 1936747791 5 freebsd-zfs (924G)
=> 34 1953525101 da1 GPT (932G)
34 94 1 freebsd-boot (47K)
128 10485760 2 freebsd-zfs (5.0G)
10485888 6291456 3 freebsd-swap (3.0G)
16777344 1936747791 5 freebsd-zfs (924G)
=> 34 1953522988 ada0 GPT (932G)
34 94 1 freebsd-boot (47K)
128 6291456 2 freebsd-ufs (3.0G)
6291584 6291456 3 freebsd-swap (3.0G)
12583040 4194304 4 freebsd-ufs (2.0G)
16777344 1936745678 5 freebsd-zfs (924G)
=> 34 1953522988 ada1 GPT (932G)
34 94 1 freebsd-boot (47K)
128 6291456 2 freebsd-ufs (3.0G)
6291584 6291456 3 freebsd-swap (3.0G)
12583040 4194304 4 freebsd-ufs (2.0G)
16777344 1936745678 5 freebsd-zfs (924G)
=> 34 1953525101 ada2 GPT (932G)
34 94 1 freebsd-boot (47K)
128 6291456 2 freebsd-ufs (3.0G)
6291584 6291456 3 freebsd-swap (3.0G)
12583040 4194304 4 freebsd-ufs (2.0G)
16777344 1936747791 5 freebsd-zfs (924G)
=> 34 1953525101 ada3 GPT (932G)
34 94 1 freebsd-boot (47K)
128 10485760 2 freebsd-zfs (5.0G)
10485888 6291456 3 freebsd-swap (3.0G)
16777344 1936747791 5 freebsd-zfs (924G)
server% zpool status
pool: tank
state: ONLINE
scrub: scrub completed after 3h54m with 0 errors on Sat Nov 6 16:35:22 2010
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2 ONLINE 0 0 0
ada0p5 ONLINE 0 0 0
ada1p5 ONLINE 0 0 0
ada2p5 ONLINE 0 0 0
ada3p5 ONLINE 0 0 0
da0p5 ONLINE 0 0 0
da1p5 ONLINE 0 0 0
errors: No known data errors
===== Procedure =====
The overall process is:
- Create a 6-way RAIDZ2 pool across the 3 new disks (i.e. each new disk provides two vdevs).
- Copy the existing pool onto the new disks.
- Switch the system to use the new 6-way pool.
- Destroy the original pool.
- Replace the temporary second vdev on each new disk with one of the original disks, one disk at a time.
- Re-partition the new disks to expand the remaining vdev to occupy the now unused space.
In detail:
==== Partition up new disks ====
In my case, I have root and swap on the same disks, so I needed to
carve out space for that.
Even if you use the disks solely for ZFS, it's probably a good idea to
leave a couple of MB at the end of each disk unpartitioned, in case a
replacement disk turns out to be slightly smaller.
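As an illustration only (not part of my setup), a disk dedicated to ZFS could be partitioned along the following lines; the device name and size here are hypothetical, with the ''-s'' value simply chosen to stop a little short of the disk's full capacity.
gpart create -s gpt da2                        # hypothetical spare disk
gpart add -b 2048 -s 930G -t freebsd-zfs da2   # deliberately smaller than the whole disk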
As shown above, the old disks have 5 partitions (boot, UFS root, UFS
/var, swap and ZFS).
My long-term plan is to switch to a ZFS root, so I combined the space
allocated to the two UFS partitions into a single partition (p2, typed
''freebsd-zfs''), and skipped index 4 so that the main ZFS partition
remains at p5 for consistency with the old disks.
Apart from the boot partition, all partitions are aligned on 8-sector
boundaries to simplify possible future migration to 4KiB disks.
Initially, the space destined for the large ZFS partition is split into two equal pieces (p5 and p6):
for i in da0 da1 ada3; do
gpart create -s gpt $i    # set up the GPT first (skip if the disk already has one)
gpart add -b 34 -s 94 -t freebsd-boot $i
gpart add -b 128 -s 10485760 -i 2 -t freebsd-zfs $i
gpart add -b 10485888 -s 6291456 -i 3 -t freebsd-swap $i
gpart add -b 16777344 -s 968373895 -i 5 -t freebsd-zfs $i
gpart add -b 985151239 -s 968373895 -i 6 -t freebsd-zfs $i
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 $i
done
At this stage, my disk layout is:
# gpart show
=> 34 1953522988 ada0 GPT (932G)
34 94 1 freebsd-boot (47K)
128 6291456 2 freebsd-ufs (3.0G)
6291584 6291456 3 freebsd-swap (3.0G)
12583040 4194304 4 freebsd-ufs (2.0G)
16777344 1936745678 5 freebsd-zfs (924G)
=> 34 1953522988 ada1 GPT (932G)
34 94 1 freebsd-boot (47K)
128 6291456 2 freebsd-ufs (3.0G)
6291584 6291456 3 freebsd-swap (3.0G)
12583040 4194304 4 freebsd-ufs (2.0G)
16777344 1936745678 5 freebsd-zfs (924G)
=> 34 1953525101 ada2 GPT (932G)
34 94 1 freebsd-boot (47K)
128 6291456 2 freebsd-ufs (3.0G)
6291584 6291456 3 freebsd-swap (3.0G)
12583040 4194304 4 freebsd-ufs (2.0G)
16777344 1936747791 5 freebsd-zfs (924G)
=> 34 1953525101 ada3 GPT (932G)
34 94 1 freebsd-boot (47K)
128 10485760 2 freebsd-zfs (5.0G)
10485888 6291456 3 freebsd-swap (3.0G)
16777344 968373895 5 freebsd-zfs (462G)
985151239 968373895 6 freebsd-zfs (462G)
1953525134 1 - free - (512B)
=> 34 1953525101 da0 GPT (932G)
34 94 1 freebsd-boot (47K)
128 10485760 2 freebsd-zfs (5.0G)
10485888 6291456 3 freebsd-swap (3.0G)
16777344 968373895 5 freebsd-zfs (462G)
985151239 968373895 6 freebsd-zfs (462G)
1953525134 1 - free - (512B)
=> 34 1953525101 da1 GPT (932G)
34 94 1 freebsd-boot (47K)
128 10485760 2 freebsd-zfs (5.0G)
10485888 6291456 3 freebsd-swap (3.0G)
16777344 968373895 5 freebsd-zfs (462G)
985151239 968373895 6 freebsd-zfs (462G)
1953525134 1 - free - (512B)
==== Create 6-way RAIDZ2 zpool ====
If the new disks have previously been used, particularly for ZFS, it's a
good idea to zero out at least the first and last 512KiB of each partition,
which is where ZFS stores its vdev labels (two copies at the start and two
at the end of each device).
for i in da0 da1 ada3; do
dd if=/dev/zero of=/dev/${i}p5 count=1024       # first 512KiB of p5
dd if=/dev/zero of=/dev/${i}p5 seek=968372000   # last ~950KiB of the 968373895-sector p5
dd if=/dev/zero of=/dev/${i}p6 count=1024       # first 512KiB of p6
dd if=/dev/zero of=/dev/${i}p6 seek=968372000   # last ~950KiB of p6
done
I wanted my final pool configuration to have all the vdevs in alphabetical
order, so I allocated the temporary (p6) vdevs first; they will later be
replaced by ada0p5, ada1p5 and ada2p5 respectively.
zpool create tank2 raidz2 da0p6 da1p6 ada3p6 ada3p5 da0p5 da1p5
zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
tank 2.70T 2.34T 369G 86% ONLINE -
tank2 2.70T 202K 2.70T 0% ONLINE -
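Before copying any data across, it's worth confirming that the temporary pool has the intended layout:
zpool status tank2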
==== Initial data copy to new pool ====
By using ZFS snapshots, I can transfer the majority of the pool contents
over to the new pool without impacting normal system operation. This
significantly reduces the necessary system outage.
I recommend the use of ports/misc/mbuffer or similar between the
"send" and "recv" to improve throughput (I used my own equivalent
tool but have left it out of the following commands).
Note that the '-u' option to 'zfs recv' is crucial - otherwise the
filesystems copied to 'tank2' will be automatically mounted. Where
any filesystems have 'mountpoint' specified, this would result in the
'tank2' filesystem being mounted over the equivalent 'tank' filesystem.
zfs snapshot -r tank@20101104bu
zfs send -R tank@20101104bu | zfs recv -vuF -d tank2
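For reference, with mbuffer in the pipeline the transfer would look something like the following; the block size and buffer size shown are illustrative rather than tuned values.
zfs send -R tank@20101104bu | mbuffer -s 128k -m 1G | zfs recv -vuF -d tank2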
If you are paranoid, you can then do a scrub on 'tank2'. This will not
be especially quick because having multiple vdevs per physical disk
causes additional seeking between vdevs.
zpool scrub tank2
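Scrub progress can be followed with:
zpool status tank2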
==== Switch to new pool ====
This step entails a system outage but should be relatively quick because
the bulk of the data was copied in the previous step and this step just
needs to copy changes since the snapshot was taken. In my case, this
took approximately 25 minutes but that included a second send/recv onto
my external backup disk as well as a couple of mistakes.
In order to prevent any updates, the system should be brought down to
single-user mode:
shutdown now
Once nothing is writing to ZFS, a second snapshot can be taken and
transferred to the new pool. The rollback is needed if tank2 has been
altered since the previous 'zfs recv' (this includes atime updates).
zfs snapshot -r tank@20101105bu
zfs rollback -R tank2@20101104bu
zfs send -R -I tank@20101104bu tank@20101105bu | zfs recv -vu -d tank2
The original pool is now renamed by exporting it and importing it under a
new name, then exporting it again so that it is unmounted and out of the way.
zpool export tank
zpool import tank tanko
zpool export tanko
The new pool is then renamed to the desired name via export/import.
zpool export tank2
zpool import tank2 tank
The system can now be returned to multiuser mode and any required testing
performed.
exit
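A quick sanity check at this point (my suggestion, not strictly part of the procedure) is to confirm that the renamed pool is healthy and the expected filesystems are present:
zpool status tank
zfs list -r tank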
==== Replace vdevs ====
I didn't explicitly destroy the old pool but just wiped the vdev labels
as I reused each disk. This gave me slightly more recovery scope, as I
could (in theory) have recreated the old pool even after reusing the first
disk (since it was RAIDZ1).
Note that the resilver appears to be achieved by regenerating the new
disk's contents from the remaining vdevs, rather than by simply copying
the disk being replaced (though normal filesystem writes do appear to be
addressed to the new disk).
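Each replacement can be monitored while it runs:
zpool status tank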
First disk:
dd if=/dev/zero of=/dev/ada0p5 count=1024
dd if=/dev/zero of=/dev/ada0p5 seek=1936744000
zpool replace tank da0p6 ada0p5
In my case, this took 7h20m to resilver 342G.
Second disk:
dd if=/dev/zero of=/dev/ada1p5 count=1024
dd if=/dev/zero of=/dev/ada1p5 seek=1936744000
zpool replace tank da1p6 ada1p5
In my case, this took 4h53m to resilver 341G.
Third disk:
dd if=/dev/zero of=/dev/ada2p5 count=1024
dd if=/dev/zero of=/dev/ada2p5 seek=1936744000
zpool replace tank ada3p6 ada2p5
In my case, this took 5h41m to resilver 342G.
At this point the pool is spread across all 6 disks, but it is still limited
to ~462G per vdev (the size of the temporary partitions).
==== Expand pool ====
In order to expand the pool, the vdevs on the 3 new disks need to be
resized. It's not possible to expand the gpart partition so this also
requires a (short) outage.
For safety (to prevent ZFS confusion), the vdev labels at the end of the
temporary (p6) vdevs were destroyed first: the end of p6 becomes the end of
the expanded p5, so stale labels would otherwise appear at the end of the
expanded vdevs.
dd if=/dev/zero of=/dev/da0p6 seek=968372000
dd if=/dev/zero of=/dev/da1p6 seek=968372000
dd if=/dev/zero of=/dev/ada3p6 seek=968372000
The system needs to be placed in single-user mode to allow the partitions
and pool to be manipulated:
shutdown now
Once in single-user mode, all three p6 partitions can be deleted and the
p5 partitions expanded (by deleting and recreating them at the larger
size):
zpool export tank
for i in da0 da1 ada3; do
gpart delete -i 6 $i
gpart delete -i 5 $i
gpart add -b 16777344 -i 5 -t freebsd-zfs -s 1936747791 $i
done
zpool import tank
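On the ZFS version in use here, the export/import was enough for the pool to pick up the larger partitions. If that does not happen on a newer ZFS version, the ''autoexpand'' pool property and ''zpool online -e'' exist for this purpose, e.g. (not needed on my system):
zpool online -e tank da0p5 da1p5 ada3p5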
The pool has now expanded to roughly 4TB of usable space (''zpool list'' reports the raw size, 5.41T, before RAIDZ2 parity):
zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
back2 1.81T 1.57T 243G 86% ONLINE -
tank 5.41T 2.02T 3.39T 37% ONLINE -
And the system can be returned to multi-user mode:
exit
Finally, remember to add the new disks to any monitoring configuration, e.g. ''daily_status_smart_devices''.
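A sketch of what that might look like in ''/etc/periodic.conf'', assuming the knob takes a space-separated list of devices:
daily_status_smart_devices="/dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 /dev/da0 /dev/da1"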