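The purpose of this post is to record how to expand a zpool storage pool (or, put differently, a ZFS file system) on FreeBSD. My understanding of zpool/ZFS is still far from complete, mainly because I rarely touch it: once a setup is configured and nothing goes wrong, I don't dig any deeper. Good enough and usable is about where I stop, and so far I haven't run into any performance questions.
From what I know so far, a fairly good setup would be four or more mechanical disks of the same capacity in a raidz2 array, plus one extra SSD as a fast cache device. I remember seeing a configuration like this in the zpool man page. I'd like to try it if I get the chance, although for an ordinary PC it is hard to find a small case that fits. For now the HP MicroServer Gen8 (link placeholder; I'll add an introduction to the Gen8 later. I bought one from JD.com) looks like a good choice.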
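I found an article online about RAID levels. FreeBSD's raidz2 is roughly equivalent to RAID 6 (raidz1 is the RAID 5 counterpart): it uses double parity and tolerates two failed disks, meaning two disks may fail at the same time. I haven't tried that myself, but a couple of days ago the system suddenly could not find the root file system at boot. After a couple more reboots it came up fine, and I don't know why. Once it was up, I saw that zpool was resilvering data on one of the disks.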
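First some basic information. This machine is an entry-level Fujitsu server. It shipped with a Xeon E1245 CPU, two 4GB ECC memory modules and one 500G Western Digital disk. I added three more disks and built a raidz2 array. It runs FreeBSD 10.2, installed with the ZFS option in the installer, without disk encryption.
The machine already held a lot of data before the expansion, so to be safe I ran the experiment in a virtual machine first. The system information of the virtual machine is as follows: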
FreeBSD 10.2-RELEASE (GENERIC) #0 r286666: Wed Aug 12 15:26:37 UTC 2015
Welcome to FreeBSD!
Release Notes, Errata: https://www.FreeBSD.org/releases/
Security Advisories:   https://www.FreeBSD.org/security/
FreeBSD Handbook:      https://www.FreeBSD.org/handbook/
FreeBSD FAQ:           https://www.FreeBSD.org/faq/
Questions List: https://lists.FreeBSD.org/mailman/listinfo/freebsd-questions/
FreeBSD Forums:        https://forums.FreeBSD.org/
Documents installed with the system are in the /usr/local/share/doc/freebsd/
directory, or can be installed later with:  pkg install en-freebsd-doc
For other languages, replace "en" with a language code like de or fr.
Show the version of FreeBSD installed:  freebsd-version ; uname -a
Please include that output and any error messages when posting questions.
Introduction to manual pages:  man man
FreeBSD directory layout:      man hier
Edit /etc/motd to change this login announcement.
Ever wonder what those numbers after command names were, as in cat(1)?  It's
the section of the manual the man page is in.  "man man" will tell you more.
                -- David Scheidt <dscheidt@tumbolia.com>
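The above is the first block of text shown after logging in to the shell; the only part that matters is the version: FreeBSD 10.2-RELEASE (GENERIC) #0 r286666
The default layout is shown below. I used the stock installer, chose the ZFS install, enabled data encryption, and left the swap partitions unencrypted.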
$ df -h
Filesystem            Size    Used   Avail Capacity  Mounted on
zroot/ROOT/default     25G    2.6G     22G    10%    /
devfs                 1.0K    1.0K      0B   100%    /dev
bootpool              1.9G    509M    1.4G    26%    /bootpool
zroot/tmp              22G     45M     22G     0%    /tmp
zroot/usr/home         23G    271M     22G     1%    /usr/home
zroot/usr/ports        23G    900M     22G     4%    /usr/ports
zroot/usr/src          22G    140K     22G     0%    /usr/src
zroot/var/audit        22G    140K     22G     0%    /var/audit
zroot/var/crash        22G    140K     22G     0%    /var/crash
zroot/var/log          22G    645K     22G     0%    /var/log
zroot/var/mail         22G    192K     22G     0%    /var/mail
zroot/var/tmp          22G    140K     22G     0%    /var/tmp
zroot                  22G    140K     22G     0%    /zroot
zpool list shows that the installer created two storage pools, bootpool and zroot. Frankly, at the time of writing I was still unclear about how these capacity figures come about.
$ zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bootpool  1.98G   510M  1.49G         -    18%    25%  1.00x  DEGRADED  -
zroot     55.5G  7.74G  47.8G         -     4%    13%  1.00x  DEGRADED  -
zfs list shows that the datasets are all the installer defaults; nothing has been adjusted.
$ zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
bootpool             510M  1.42G   509M  /bootpool
zroot               3.75G  22.3G   140K  /zroot
zroot/ROOT          2.56G  22.3G   140K  none
zroot/ROOT/default  2.56G  22.3G  2.56G  /
zroot/tmp           44.7M  22.3G  44.7M  /tmp
zroot/usr           1.14G  22.3G   140K  /usr
zroot/usr/home       271M  22.3G   271M  /usr/home
zroot/usr/ports      900M  22.3G   900M  /usr/ports
zroot/usr/src        140K  22.3G   140K  /usr/src
zroot/var           1.37M  22.3G   140K  /var
zroot/var/audit      140K  22.3G   140K  /var/audit
zroot/var/crash      140K  22.3G   140K  /var/crash
zroot/var/log        651K  22.3G   651K  /var/log
zroot/var/mail       192K  22.3G   192K  /var/mail
zroot/var/tmp        140K  22.3G   140K  /var/tmp 
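The two size figures are easier to reconcile with a little arithmetic. As far as I understand it, zpool list reports the raw size of a raidz vdev with parity included, while zfs list and df report what is left after parity. The sketch below is only an approximation (it ignores metadata, padding and reserved space), and the function name raidz_capacity is made up for this example:

```shell
#!/bin/sh
# Rough raidz capacity estimate; ignores metadata, padding and reserved space.
# Usage: raidz_capacity DISKS SIZE_GB PARITY
raidz_capacity() {
    disks=$1; size=$2; parity=$3
    raw=$((disks * size))                  # what zpool list roughly reports
    usable=$(( (disks - parity) * size ))  # what zfs list / df roughly report
    echo "raw=${raw}G usable=${usable}G"
}

# Four 14G partitions in raidz2, as in the listings above
raidz_capacity 4 14 2
```

For four 14G partitions this gives 56G raw and 28G usable, which is in the neighbourhood of the 55.5G shown by zpool list and the roughly 26G (USED plus AVAIL) shown by zfs list.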
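ls shows the disks. The pool was built from four disks (da0, da1, da2 and da3), although only da0, da1 and da2 appear in the listing below, which matches the one device that zpool status reports as UNAVAIL. What I didn't understand at this point was why /dev/da0p4 is followed by /dev/da0p4.eli. (As far as I can tell now, the .eli node is the one GELI, FreeBSD's disk encryption layer, creates for an attached encrypted partition.)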
$ ls /dev/da*
/dev/da0        /dev/da0p4      /dev/da1p2      /dev/da2        /dev/da2p4
/dev/da0p1      /dev/da0p4.eli  /dev/da1p3      /dev/da2p1      /dev/da2p4.eli
/dev/da0p2      /dev/da1        /dev/da1p4      /dev/da2p2      
/dev/da0p3      /dev/da1p1      /dev/da1p4.eli  /dev/da2p3 
zpool status shows that zroot is made up of /dev/da0p4.eli, /dev/da1p4.eli, /dev/da2p4.eli and /dev/da3p4.eli rather than /dev/da0p4, /dev/da1p4, /dev/da2p4 and /dev/da3p4, which puzzled me at the time. It also shows that one disk is already UNAVAIL and that both pools are running DEGRADED.
$ zpool status
  pool: bootpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 172K in 0h0m with 0 errors on Wed Dec 23 23:27:09 2015
config:
        NAME                      STATE     READ WRITE CKSUM
        bootpool                  DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            14532417844132395237  UNAVAIL      0     0     0  was /dev/da0p2
            gpt/boot1             ONLINE       0     0     0
            gpt/boot2             ONLINE       0     0     0
            gpt/boot3             ONLINE       0     0     0
errors: No known data errors
  pool: zroot
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 600K in 0h0m with 0 errors on Wed Dec 23 23:27:06 2015
config:
        NAME                      STATE     READ WRITE CKSUM
        zroot                     DEGRADED     0     0     0
          raidz2-0                DEGRADED     0     0     0
            13574931247595481435  UNAVAIL      0     0     0  was /dev/da0p4.eli
            da0p4.eli             ONLINE       0     0     0
            da1p4.eli             ONLINE       0     0     0
            da2p4.eli             ONLINE       0     0     0
errors: No known data errors
The diskinfo output shows that each disk is 20G.
root@freebsd:~ # diskinfo /dev/da0
/dev/da0        512     21474836480     41943040        0       0       2610    255     63
root@freebsd:~ # diskinfo -v /dev/da0
/dev/da0
        512             # sectorsize
        21474836480     # mediasize in bytes (20G)
        41943040        # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        2610            # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
                        # Disk ident.
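As for partitioning, the gpart output shows four partitions: one freebsd-boot, one freebsd-swap and two freebsd-zfs. Judging from the zpool status output, the 2.0G freebsd-zfs partition (p2) belongs to bootpool and the 14G one (p4) belongs to zroot, which holds the root file system, the home directories and everything else. bootpool is not encrypted while zroot is, something I only learned later. Apparently the system cannot boot directly from an encrypted root pool, hence the separate unencrypted bootpool. I still need to verify how accurate this is.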
root@freebsd:~ # gpart show da0
=>      34  41942973  da0  GPT  (20G)
        34         6       - free -  (3.0K)
        40      1024    1  freebsd-boot  (512K)
      1064       984       - free -  (492K)
      2048   4194304    2  freebsd-zfs  (2.0G)
   4196352   8388608    3  freebsd-swap  (4.0G)
  12584960  29356032    4  freebsd-zfs  (14G)
  41940992      2015       - free -  (1.0M)
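Each disk has a 4GB swap partition. Later, top reported 16GB of swap, which is the swap partitions of all four disks added together. This puzzled me at first, but every swap device listed in /etc/fstab is activated and the reported total is simply their sum.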
root@freebsd:~ # cat /etc/fstab
# Device                Mountpoint      FStype  Options         Dump    Pass#
/dev/da0p3              none    swap    sw              0       0
/dev/da1p3              none    swap    sw              0       0
/dev/da2p3              none    swap    sw              0       0
/dev/da3p3              none    swap    sw              0       0
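The 16GB total can be checked against the fstab entries with a small sketch. It reads fstab-style text on standard input, counts the entries whose file system type is swap, and multiplies by the per-partition size. The function name total_swap_gb is made up for this example, and the 4GB size is taken from the gpart output rather than detected:

```shell
#!/bin/sh
# Usage: total_swap_gb SIZE_GB < fstab
# Counts non-comment lines whose 3rd column (FStype) is "swap" and
# multiplies the count by the given per-partition size in GB.
total_swap_gb() {
    per_part=$1
    count=$(awk '$1 !~ /^#/ && $3 == "swap" { n++ } END { print n + 0 }')
    echo $((count * per_part))
}

total_swap_gb 4 <<'EOF'
# Device                Mountpoint      FStype  Options         Dump    Pass#
/dev/da0p3              none    swap    sw              0       0
/dev/da1p3              none    swap    sw              0       0
/dev/da2p3              none    swap    sw              0       0
/dev/da3p3              none    swap    sw              0       0
EOF
```

On a live system, swapinfo lists the active swap devices and their sizes directly.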
Next, add a disk and use gpart to copy the partition table of a healthy disk onto the new one unchanged. Here the new disk shows up as da3.
root@freebsd:~ # gpart backup da0 | gpart restore -F da3
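Now look at the new disk: /dev/da3 has been partitioned exactly like /dev/da0. Oddly, /dev/da3p4.eli did not show up, which I couldn't explain at the time. Presumably this is because the new partition has not been initialized and attached with geli, so no encrypted provider exists for it yet.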
root@freebsd:~ # ls /dev/da*
/dev/da0        /dev/da1        /dev/da2        /dev/da3
/dev/da0p1      /dev/da1p1      /dev/da2p1      /dev/da3p1
/dev/da0p2      /dev/da1p2      /dev/da2p2      /dev/da3p2
/dev/da0p3      /dev/da1p3      /dev/da2p3      /dev/da3p3
/dev/da0p4      /dev/da1p4      /dev/da2p4      /dev/da3p4
/dev/da0p4.eli  /dev/da1p4.eli  /dev/da2p4.eli
I rebooted the system at this point:
root@freebsd:~ # reboot
Then check zpool status again to have the latest information at hand for the replacement below.
$ zpool status
  pool: bootpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 172K in 0h0m with 0 errors on Wed Dec 23 23:27:09 2015
config:
        NAME                      STATE     READ WRITE CKSUM
        bootpool                  DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            14532417844132395237  UNAVAIL      0     0     0  was /dev/da0p2
            gpt/boot1             ONLINE       0     0     0
            gpt/boot2             ONLINE       0     0     0
            gpt/boot3             ONLINE       0     0     0
errors: No known data errors
  pool: zroot
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0h18m with 0 errors on Thu Jan 28 03:50:46 2016
config:
        NAME                      STATE     READ WRITE CKSUM
        zroot                     DEGRADED     0     0     0
          raidz2-0                DEGRADED     0     0     0
            13574931247595481435  UNAVAIL      0     0     0  was /dev/da0p4.eli
            da0p4.eli             ONLINE       0     0     0
            da1p4.eli             ONLINE       0     0     0
            da2p4.eli             ONLINE       0     0     0
errors: No known data errors
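First replace the device in the zroot pool. Note that da3p4 is the plain partition rather than a .eli provider, so this new pool member is not encrypted like the other three: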
root@freebsd:/usr/home/beta4better # zpool replace zroot 13574931247595481435 da3p4
Make sure to wait until resilver is done before rebooting.
If you boot from pool 'zroot', you may need to update
boot code on newly attached disk 'da3p4'.
Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:
        gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0
Following the hint, run gpart bootcode, but against the new disk da3 rather than the da0 used in the example message:
root@freebsd:/usr/home/beta4better # gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da3
bootcode written to da3
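The replacement steps so far can be summarized in a small dry-run sketch. It only prints the commands used in this post and does not execute anything; the function name plan_replace and its arguments are made up for this example, and the device names have to be adapted to what zpool status and ls /dev/da* show on the machine in question:

```shell
#!/bin/sh
# Usage: plan_replace SRC_DISK NEW_DISK POOL OLD_GUID PART_INDEX
# Prints (does not execute) the command sequence for replacing one pool member.
plan_replace() {
    src=$1; new=$2; pool=$3; guid=$4; idx=$5
    # 1. copy the partition table from a healthy disk to the new disk
    echo "gpart backup $src | gpart restore -F $new"
    # 2. replace the missing member (identified by its GUID) with the new partition
    echo "zpool replace $pool $guid ${new}p${idx}"
    # 3. write boot code to the new disk so it can be booted from
    echo "gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 $new"
    # 4. watch the resilver
    echo "zpool status $pool"
}

plan_replace da0 da3 zroot 13574931247595481435 4
```

Printing the plan first makes it easier to double-check disk names before anything destructive is typed.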
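Turn on the autoexpand property in preparation for the expansion later. I only noticed this property afterwards. Note that the command below sets it on bootpool; for zroot, the pool that is actually expanded later, to grow automatically, the same property would have to be set on zroot.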
root@freebsd:/usr/home/beta4better # zpool set autoexpand=on bootpool
zpool status shows that the zroot pool has started resilvering.
root@freebsd:/usr/home/beta4better # zpool status
  pool: bootpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 172K in 0h0m with 0 errors on Wed Dec 23 23:27:09 2015
config:
        NAME                      STATE     READ WRITE CKSUM
        bootpool                  DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            14532417844132395237  UNAVAIL      0     0     0  was /dev/da0p2
            gpt/boot1             ONLINE       0     0     0
            gpt/boot2             ONLINE       0     0     0
            gpt/boot3             ONLINE       0     0     0
errors: No known data errors
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan 28 04:03:52 2016
        199M scanned out of 7.79G at 3.16M/s, 0h41m to go
        48.9M resilvered, 2.49% done
config:
        NAME                        STATE     READ WRITE CKSUM
        zroot                       DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              13574931247595481435  UNAVAIL      0     0     0  was /dev/da0p4.eli
              da3p4                 ONLINE       0     0     0  (resilvering)
            da0p4.eli               ONLINE       0     0     0
            da1p4.eli               ONLINE       0     0     0
            da2p4.eli               ONLINE       0     0     0
errors: No known data errors
Next, replace the partition in bootpool as well:
root@freebsd:/usr/home/beta4better # zpool replace bootpool 14532417844132395237 da3p2
Now both zroot and bootpool are resilvering.
root@freebsd:/usr/home/beta4better # zpool status
  pool: bootpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan 28 04:06:17 2016
        62.0M scanned out of 510M at 3.10M/s, 0h2m to go
        59.7M resilvered, 12.16% done
config:
        NAME                        STATE     READ WRITE CKSUM
        bootpool                    DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14532417844132395237  UNAVAIL      0     0     0  was /dev/da0p2
              da3p2                 ONLINE       0     0     0  (resilvering)
            gpt/boot1               ONLINE       0     0     0
            gpt/boot2               ONLINE       0     0     0
            gpt/boot3               ONLINE       0     0     0
errors: No known data errors
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan 28 04:03:52 2016
        581M scanned out of 7.79G at 3.52M/s, 0h34m to go
        144M resilvered, 7.29% done
config:
        NAME                        STATE     READ WRITE CKSUM
        zroot                       DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              13574931247595481435  UNAVAIL      0     0     0  was /dev/da0p4.eli
              da3p4                 ONLINE       0     0     0  (resilvering)
            da0p4.eli               ONLINE       0     0     0
            da1p4.eli               ONLINE       0     0     0
            da2p4.eli               ONLINE       0     0     0
errors: No known data errors
Comparing the old and new disks, the partition layouts are identical.
root@freebsd:/usr/home/beta4better # gpart show da0
=>      34  41942973  da0  GPT  (20G)
        34         6       - free -  (3.0K)
        40      1024    1  freebsd-boot  (512K)
      1064       984       - free -  (492K)
      2048   4194304    2  freebsd-zfs  (2.0G)
   4196352   8388608    3  freebsd-swap  (4.0G)
  12584960  29356032    4  freebsd-zfs  (14G)
  41940992      2015       - free -  (1.0M)
root@freebsd:/usr/home/beta4better # gpart show da3
=>      34  41942973  da3  GPT  (20G)
        34         6       - free -  (3.0K)
        40      1024    1  freebsd-boot  (512K)
      1064       984       - free -  (492K)
      2048   4194304    2  freebsd-zfs  (2.0G)
   4196352   8388608    3  freebsd-swap  (4.0G)
  12584960  29356032    4  freebsd-zfs  (14G)
  41940992      2015       - free -  (1.0M)
A few hours later the resilver was done. The 1.87G in zroot took two hours and three minutes.
root@freebsd:/usr/home/beta4better # zpool status
  pool: bootpool
 state: ONLINE
  scan: resilvered 510M in 0h1m with 0 errors on Thu Jan 28 04:08:11 2016
config:
        NAME           STATE     READ WRITE CKSUM
        bootpool       ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            da3p2      ONLINE       0     0     0
            gpt/boot1  ONLINE       0     0     0
            gpt/boot2  ONLINE       0     0     0
            gpt/boot3  ONLINE       0     0     0
errors: No known data errors
  pool: zroot
 state: ONLINE
  scan: resilvered 1.87G in 2h3m with 0 errors on Thu Jan 28 06:07:20 2016
config:
        NAME           STATE     READ WRITE CKSUM
        zroot          ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            da3p4      ONLINE       0     0     0
            da0p4.eli  ONLINE       0     0     0
            da1p4.eli  ONLINE       0     0     0
            da2p4.eli  ONLINE       0     0     0
errors: No known data errors
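That completes the disk replacement and confirms that my approach works. Next comes the expansion: replace da0 with a 40G disk. The procedure is the same, except that 20G of free space is left at the end of the disk, as shown below: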
root@freebsd:~ # gpart show da0
=>      34  83886013  da0  GPT  (40G)
        34         6       - free -  (3.0K)
        40      1024    1  freebsd-boot  (512K)
      1064       984       - free -  (492K)
      2048   4194304    2  freebsd-zfs  (2.0G)
   4196352   8388608    3  freebsd-swap  (4.0G)
  12584960  29356032    4  freebsd-zfs  (14G)
  41940992  41945055       - free -  (20G)
Then grow the partition with gpart resize:
root@freebsd:~ # gpart resize -i 4 da0
da0p4 resized
root@freebsd:~ # gpart show da0
=>      34  83886013  da0  GPT  (40G)
        34         6       - free -  (3.0K)
        40      1024    1  freebsd-boot  (512K)
      1064       984       - free -  (492K)
      2048   4194304    2  freebsd-zfs  (2.0G)
   4196352   8388608    3  freebsd-swap  (4.0G)
  12584960  71301087    4  freebsd-zfs  (34G)
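Put together, growing one pool member might be sketched as the dry run below. It only prints commands. The function name plan_expand is made up, and the zpool online -e step is an assumption on my part that does not appear in the procedure above: it asks ZFS to expand onto the enlarged partition in case autoexpand does not pick it up by itself. A GELI-encrypted member would presumably need extra steps that are not covered here.

```shell
#!/bin/sh
# Usage: plan_expand POOL DISK PART_INDEX VDEV
# Prints (does not execute) the commands for growing one pool member.
# VDEV is the member name exactly as zpool status shows it.
plan_expand() {
    pool=$1; disk=$2; idx=$3; vdev=$4
    echo "zpool set autoexpand=on $pool"   # per-pool property
    echo "gpart resize -i $idx $disk"      # grow the partition into the free space
    echo "zpool online -e $pool $vdev"     # assumption: explicit expand request
    echo "zpool list $pool"                # check SIZE and EXPANDSZ afterwards
}

plan_expand zroot da0 4 da0p4
```

The pool only gains capacity once every member of the raidz2 vdev has been enlarged, so the SIZE column is not expected to change after the first disk.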
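Partition 4, the one that backs zroot, is now 34G. Then replace the other three disks one by one in the same way. Be sure to wait until the resilver has finished before swapping the next disk, and check progress with zpool status at any time.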
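Waiting for the resilver can be left to a small polling loop instead of checking by hand. This is a minimal sketch that relies on the "resilver in progress" wording seen in the zpool status output above; the function names and the default 60 second interval are my own choices:

```shell
#!/bin/sh
# resilver_running: exit status 0 if the zpool status text on stdin
# reports a resilver that is still in progress.
resilver_running() {
    grep -q "resilver in progress"
}

# Usage: wait_for_resilver POOL [INTERVAL_SECONDS]
# Polls zpool status until no resilver is reported any more.
wait_for_resilver() {
    pool=$1; interval=${2:-60}
    while zpool status "$pool" | resilver_running; do
        sleep "$interval"
    done
    echo "resilver on $pool finished"
}
```

Calling wait_for_resilver zroot 30 before pulling the next disk would block until the status line changes from "resilver in progress" to "resilvered".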
Comparison of the pools before and after. Before:
$ zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bootpool  1.98G   510M  1.49G         -    18%    25%  1.00x  DEGRADED  -
zroot     55.5G  7.74G  47.8G         -     4%    13%  1.00x  DEGRADED  -
After:
root@freebsd:~ # zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bootpool  1.98G   511M  1.49G         -    18%    25%  1.00x  ONLINE  -
zroot      136G  7.52G   128G         -     2%     5%  1.00x  ONLINE  -
Comparison of the df output. Before the expansion:
root@freebsd:~ # df -h
Filesystem            Size    Used   Avail Capacity  Mounted on
zroot/ROOT/default     25G    2.7G     22G    11%    /
devfs                 1.0K    1.0K      0B   100%    /dev
bootpool              1.9G    509M    1.4G    26%    /bootpool
zroot/tmp              22G     45M     22G     0%    /tmp
zroot/usr/home         22G    186K     22G     0%    /usr/home
zroot/usr/ports        23G    900M     22G     4%    /usr/ports
zroot/usr/src          22G    140K     22G     0%    /usr/src
zroot/var/audit        22G    140K     22G     0%    /var/audit
zroot/var/crash        22G    140K     22G     0%    /var/crash
zroot/var/log          22G    680K     22G     0%    /var/log
zroot/var/mail         22G    192K     22G     0%    /var/mail
zroot/var/tmp          22G    140K     22G     0%    /var/tmp
zroot                  22G    140K     22G     0%    /zroot
After the expansion:
root@freebsd:~ # df -h
Filesystem            Size    Used   Avail Capacity  Mounted on
zroot/ROOT/default     63G    2.7G     60G     4%    /
devfs                 1.0K    1.0K      0B   100%    /dev
bootpool              1.9G    509M    1.4G    26%    /bootpool
zroot/tmp              60G     45M     60G     0%    /tmp
zroot/usr/home         60G    186K     60G     0%    /usr/home
zroot/usr/ports        61G    900M     60G     1%    /usr/ports
zroot/usr/src          60G    140K     60G     0%    /usr/src
zroot/var/audit        60G    140K     60G     0%    /var/audit
zroot/var/crash        60G    140K     60G     0%    /var/crash
zroot/var/log          60G    680K     60G     0%    /var/log
zroot/var/mail         60G    192K     60G     0%    /var/mail
zroot/var/tmp          60G    140K     60G     0%    /var/tmp
zroot                  60G    140K     60G     0%    /zroot
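I haven't had the chance to try this on the physical machine yet. The above records the basic method and procedure. This is my first time doing this kind of operation, so there may be mistakes; please be careful, especially with any command that touches disk partitions.
Unless you know what a command really does, don't just hit Enter.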
Update: