Migrate from RAID1 disk to ZFS on NixOS

June 06, 2020

As of 2020-06-06, I have only tested this inside a VM (see below if you want to play with it too). I’m not yet comfortable enough with the process to apply it to my server. Use at your own risk!

Edit 2020-08-13: Adjusted the process following advice from Linus Heckemann (sphalerite on freenode): the /boot partition stays outside of ZFS and some flags are now set by default on the zpool.

Edit 2020-08-25: The configuration was successfully applied. An issue occurred with two ZFS pools having the same name, which prevented the system from booting. Thanks to sphalerite, BiBi and Raito_Bezarius for their support during the hours of debugging it took to figure out the issue.

Context

I’m the happy owner of a server that has held my whole infrastructure for more than a year now, powered by NixOS for declarative deployments. When I first installed it, I didn’t know about ZFS and all its features (see there if you want some examples).

Since I cannot afford to reinstall everything from scratch, I had to find a way to deploy ZFS safely (i.e. without losing redundancy). This article explains step by step the choices I made.

Setup

The server has a single relevant partition mounted on / (the other partitions are BIOS boot and swap, not relevant here). This partition is a RAID1 array backed by two disks. On the underlying disks, the partition holding / is the third one: /dev/md0 is made of /dev/sda3 and /dev/sdb3.

My server runs NixOS, deployed remotely via nixops. Some commands below rely on that, but they can be adapted to your distribution (or if you don’t use nixops).

I ordered an additional disk from my server provider (/dev/sdc). Its sole purpose is to preserve redundancy in case of a failure during the process. The process itself could be adapted to do without it if you’re confident enough. The disk can be discarded at the end.

Play with libvirtd

Since I didn’t want to break my server, I created a rough libvirtd equivalent of my setup: three disk images, two of them assembled into a RAID array. Since this is quite a specific setup, I couldn’t make a fully declarative VM handled by nixops, but I still made use of some nixpkgs helpers.

The derivation below produces an output with three disk images as described above:

# base_image.nix
{ system ? builtins.currentSystem, size ? "10" }:
let
  pkgs = import <nixpkgs> {};
  config = (import <nixpkgs/nixos/lib/eval-config.nix> {
    inherit system;
    modules = [ {
      fileSystems."/".device = "/dev/disk/by-label/root";

      boot.loader.grub.version = 2;
      boot.loader.grub.devices = [ "/dev/vda" "/dev/vdb" ];
      boot.loader.timeout = 0;
      boot.kernelParams = ["console=ttyS0,115200"];

      services.openssh.enable = true;
      services.openssh.startWhenNeeded = false;
      services.openssh.extraConfig = "UseDNS no";
    } ];
  }).config;
  the_key = builtins.getEnv "NIXOPS_LIBVIRTD_PUBKEY";
in pkgs.vmTools.runInLinuxVM (
  pkgs.runCommand "libvirtd-image"
    { memSize = 768;
      preVM =
        ''
          mkdir $out
          diskImage1=$out/image
          diskImage2=$out/image2
          diskImage3=$out/image3
          ${pkgs.vmTools.qemu}/bin/qemu-img create -f qcow2 $diskImage1 "${size}G"
          ${pkgs.vmTools.qemu}/bin/qemu-img create -f qcow2 $diskImage2 "${size}G"
          ${pkgs.vmTools.qemu}/bin/qemu-img create -f qcow2 $diskImage3 "${size}G"
          mv closure xchg/
        '';
      postVM =
        ''
          mv $diskImage1 $out/disk.qcow2
          mv $diskImage2 $out/disk2.qcow2
          mv $diskImage3 $out/disk3.qcow2
        '';
      QEMU_OPTS = builtins.concatStringsSep " " [
        "-drive file=$diskImage1,if=virtio,cache=unsafe,werror=report"
        "-drive file=$diskImage2,if=virtio,cache=unsafe,werror=report"
        "-drive file=$diskImage3,if=virtio,cache=unsafe,werror=report"
      ];
      buildInputs = [ pkgs.utillinux pkgs.perl pkgs.kmod ];
      exportReferencesGraph =
        [ "closure" config.system.build.toplevel ];
    }
    ''
      ${pkgs.parted}/bin/parted --script /dev/vda -- \
        mklabel gpt \
        mkpart ESP fat32 8MiB 256MiB \
        set 1 boot on \
        set 1 bios_grub on \
        mkpart sap1 linux-swap 256MiB 512MiB \
        mkpart primary ext4 512MiB -1
      ${pkgs.parted}/bin/parted --script /dev/vdb -- \
        mklabel gpt \
        mkpart ESP fat32 8MiB 256MiB \
        set 1 boot on \
        set 1 bios_grub on \
        mkpart sap1 linux-swap 256MiB 512MiB \
        mkpart primary ext4 512MiB -1
      ${pkgs.mdadm}/bin/mdadm --create /dev/md0 --metadata=0.90 --level=1 --raid-devices=2 /dev/vda3 /dev/vdb3

      # Create an empty filesystem and mount it.
      ${pkgs.e2fsprogs}/sbin/mkfs.ext4 -L root /dev/md0
      ${pkgs.e2fsprogs}/sbin/tune2fs -c 0 -i 0 /dev/md0
      mkdir /mnt
      mount /dev/md0 /mnt

      export HOME=$TMPDIR
      export NIX_STATE_DIR=$TMPDIR/state

      mkdir -p /mnt/etc/nixos

      # The initrd expects these directories to exist.
      mkdir /mnt/dev /mnt/proc /mnt/sys
      mount --bind /proc /mnt/proc
      mount --bind /dev /mnt/dev
      mount --bind /sys /mnt/sys

      # Copy all paths in the closure to the filesystem.
      storePaths=$(perl ${pkgs.pathsFromGraph} /tmp/xchg/closure)

      echo "filling Nix store..."
      mkdir -p /mnt/nix/store
      set -f
      cp -prd $storePaths /mnt/nix/store/

      mkdir -p /mnt/etc/nix
      echo 'build-users-group = ' > /mnt/etc/nix/nix.conf
      export USER=root

      ## Register the paths in the Nix database.
      printRegistration=1 perl ${pkgs.pathsFromGraph} /tmp/xchg/closure | \
          chroot /mnt ${config.nix.package.out}/bin/nix-store --load-db

      mkdir -p /mnt/nix/var/nix/profiles
      # Create the system profile to allow nixos-rebuild to work.
      chroot /mnt ${config.nix.package.out}/bin/nix-env \
          -p /nix/var/nix/profiles/system --set ${config.system.build.toplevel}

      # `nixos-rebuild' requires an /etc/NIXOS.
      mkdir -p /mnt/etc/nixos
      touch /mnt/etc/NIXOS

      # `switch-to-configuration' requires a /bin/sh
      mkdir -p /mnt/bin
      ln -s ${config.system.build.binsh}/bin/sh /mnt/bin/sh

      # Generate the GRUB menu.
      chroot /mnt ${config.system.build.toplevel}/bin/switch-to-configuration boot

      mkdir -p /mnt/etc/ssh/authorized_keys.d
      echo '${the_key}' > /mnt/etc/ssh/authorized_keys.d/root
      umount /mnt/proc /mnt/dev /mnt/sys
      umount /mnt
    ''
)

When deploying with nixops (via the libvirtd backend), you need to make each image available. However, nixops handles exactly one image, so a few manual steps are needed. This is the nixops configuration I’m using:

# libvirtd.nix
{
  testzfs = { pkgs, lib, ... }:
  {
    fileSystems."/".device = lib.mkForce "/dev/disk/by-label/root";

    # Serial access via virsh console (quite handy for debugging)
    boot.kernelParams = ["console=ttyS0,115200"];
    boot.loader.grub.extraConfig = ''
      serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1
      terminal_output serial
      terminal_input serial
    '';
    boot.loader.timeout = lib.mkForce 2;

    # You need to explicitly specify the additional disk here
    boot.loader.grub.devices = [ "/dev/sdb" ];

    deployment = {
      targetEnv = "libvirtd";
      libvirtd.baseImage = pkgs.callPackage ./base_image.nix {};
      # Additional images need to be specified explicitly here (only the sda one will be picked by nixops)
      libvirtd.extraDevicesXML = ''
        <disk type="file" device="disk" snapshot="external">
          <driver name="qemu" type="qcow2"/>
          <source file="/path/to/disk2.qcow2"/>
          <target dev="hdb"/>
        </disk>
        <disk type="file" device="disk" snapshot="external">
          <driver name="qemu" type="qcow2"/>
          <source file="/path/to/disk3.qcow2"/>
          <target dev="hdc"/>
        </disk>
      '';
    };

    # Some dummy service that writes to disk regularly
    systemd.services.nag-var = {
      description = "Some service reading and writing to /var";
      after = [ "network.target" ];
      wantedBy = [ "multi-user.target" ];
      script = ''
        #!${pkgs.stdenv.shell}
        mkdir -p /var/nagvar
        while true; do
          ${pkgs.coreutils}/bin/date > /var/nagvar/last
          ${pkgs.coreutils}/bin/sleep 10
        done
      '';
    };
  };
}

Now prepare the VM. Beware, this will quickly fill your /nix/store with big images (use nix-store --delete /nix/store/*libvirtd-image* to clean them up selectively if you’re doing tests).

# This command will fail due to missing images
nixops deploy --create-only
# Find the path to images at the beginning of the output. It will be
# slightly different from what you would get with nix-build due to
# some parameters given by nixops
P=/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-libvirtd-image

# Stop VM.
virsh destroy nixops-...-testzfs

# Copy additional disks to places written in libvirtd.nix
# For some reason, sometimes I had to replace the first disk too in
# libvirtd folder.
cp $P/disk2.qcow2 /path/to/disk2.qcow2
cp $P/disk3.qcow2 /path/to/disk3.qcow2
chmod gu+w /path/to/disk2.qcow2 /path/to/disk3.qcow2

# Same action done by nixops on the first disk
qemu-img rebase -f qcow2 -b "" /path/to/disk2.qcow2
qemu-img rebase -f qcow2 -b "" /path/to/disk3.qcow2

# Edit libvirtd and add console configuration (in the <devices> section)
virsh edit nixops-...-testzfs
# <serial type='pty'><target port='0'/></serial>
# <console type='pty'><target type='serial' port='0'/></console>

nixops deploy --force-reboot

You should now have a running VM with two drives in a RAID1 array plus one unused drive, mimicking the production server. I used it as the base for the migration process below.

If something goes wrong, you should be able to use virsh console to see what’s happening on your VM (as early as the grub stage). Also consider taking snapshots if you want to repeat some steps.

Migration process

Add the new disk to the RAID array

# Copy the partition layout to sdc, retyping the BIOS boot partition as Linux swap so that sdc has no boot partition
sfdisk -d /dev/sda | grep -v ^sector-size: | sed -e "s/21686148-6449-6E6F-744E-656564454649/0657FD6D-A4AB-43C4-84E5-0933C84B4F4F/" | sfdisk /dev/sdc

# Add the new partition to RAID array
mdadm --grow /dev/md0 --level=1 --raid-devices=3 --add /dev/sdc3

# Wait for synchronisation to finish
cat /proc/mdstat
(...)
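
Rather than re-running cat by hand, the wait can be scripted. Below is a minimal sketch that polls the file until no resync, recovery or reshape line is left. The function names and the 30 second default interval are my own assumptions, not part of the original procedure.

```shell
# Hypothetical helper: succeeds while the given mdstat file reports
# a resync, recovery or reshape (running, pending or delayed).
md_is_syncing() {
  local mdstat="${1:-/proc/mdstat}"
  grep -Eq '(resync|recovery|reshape) *=' "$mdstat"
}

# Hypothetical helper: block until no such line remains.
# $1: mdstat file (defaults to /proc/mdstat), $2: polling interval in seconds.
wait_for_md_sync() {
  local mdstat="${1:-/proc/mdstat}" interval="${2:-30}"
  while md_is_syncing "$mdstat"; do
    sleep "$interval"
  done
}

# Usage on the real machine:
#   wait_for_md_sync
# mdadm also has a built-in equivalent: mdadm --wait /dev/md0
```

The file argument only exists so that the helpers can be tried against a saved copy of /proc/mdstat.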

Remove sda from the array

Be careful with this step: depending on your grub configuration, the system could very well end up using sda for the next boot if you don’t wipe it correctly (last command of the step).

mdadm /dev/md0 --fail /dev/sda3 --remove /dev/sda3
mdadm --grow /dev/md0 --raid-devices=2

# Wipe the old partition’s signatures (so that grub doesn’t find it by mistake)
wipefs -a /dev/sda3
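
Once sda3 is out, it may be worth confirming that md0 is still fully redundant on its two remaining members before going further. The sketch below reads the member status from /proc/mdstat (the `[UU]` marker, where `_` stands for a missing member). The function name is hypothetical.

```shell
# Hypothetical helper: succeeds only if array $1 exists in the mdstat
# file $2 (default /proc/mdstat) and all of its members are up.
md_is_healthy() {
  local array="$1" mdstat="${2:-/proc/mdstat}"
  awk -v md="$array" '
    $1 == md && $2 == ":" { found = 1; next }   # header line of the array
    found && /\[[U_]+\]/ {                       # status line, e.g. "[2/2] [UU]"
      ok = ($NF !~ /_/)                          # an underscore means a missing member
      exit
    }
    END { exit !(found && ok) }
  ' "$mdstat"
}

# Usage on the real machine:
#   md_is_healthy md0 || echo "md0 is degraded or missing" >&2
```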

Add ZFS-specific configuration to nix

Add this to your NixOS configuration:

boot.supportedFilesystems = [ "zfs" ];
networking.hostId = "9e16a79b";

# Maintenance target for later
systemd.targets.maintenance = {
  description = "Maintenance target with only sshd";
  after = [ "network-online.target" "network-setup.service" "sshd.service" ];
  requires = [ "network-online.target" "network-setup.service" "sshd.service" ];
  unitConfig = {
    AllowIsolate = "yes";
  };
};
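
The hostId above is the one from my setup; each machine should get its own 8 hex digit value. As far as I know, the NixOS option documentation suggests deriving it from random bytes, which can be sketched as follows (the function name is hypothetical).

```shell
# Hypothetical helper: print a random 32-bit value as 8 lowercase hex digits,
# suitable as a value for networking.hostId.
gen_host_id() {
  head -c4 /dev/urandom | od -An -tx4 | tr -d ' \n'
}

# Usage:
#   echo "networking.hostId = \"$(gen_host_id)\";"
```

Generate it once and keep it: the pool remembers which hostId last imported it, so changing it afterwards is likely to cause trouble at import time.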

And deploy:

nixops deploy
# nixos-rebuild switch

Convert sda3 to a ZFS filesystem

I wanted to use this migration to encrypt my filesystem at the same time. But doing it correctly requires specific configuration (in the initrd), which I didn’t want to risk doing concurrently with the migration. So for now the passphrase is stored in cleartext. I’m aware this makes the encryption useless, but since encryption cannot be switched on later for an existing dataset, I need to enable it now. If someone obtains root access to your system during that time, your encryption is compromised (they can obtain the ZFS master encryption key); otherwise, it can simply be secured with a proper process later.

# Repartition your disk: it’s not recommended to have /boot in ZFS for now.
# Remove sda3, then create a 2GB partition for /boot (sda3) and a new root partition (sda4)
sfdisk --delete /dev/sda 3
fdisk /dev/sda
mdadm --create /dev/md1 --metadata=0.90 --level=1 --force --raid-devices=1 /dev/sda3
mkfs.ext4 /dev/md1
mkdir /mnt && mount /dev/md1 /mnt && echo -n "12345678" > /mnt/pass.key && chmod go-rwx /mnt/pass.key
echo -n "12345678" > /boot/pass.key && chmod go-rwx /boot/pass.key
zpool create -O xattr=sa -O acltype=posixacl -O atime=off -o ashift=12 -O mountpoint=legacy -f zpool sda4
zfs create -o encryption=on -o keyformat=passphrase -o keylocation=file:///boot/pass.key zpool/root
zfs create zpool/root/nix
zfs create -o atime=on zpool/root/var
zfs create -o sync=disabled zpool/root/tmp
zfs create zpool/root/etc
umount /mnt
mount -t zfs zpool/root /mnt
mkdir /mnt/nix && mount -t zfs zpool/root/nix /mnt/nix
mkdir /mnt/var && mount -t zfs zpool/root/var /mnt/var
mkdir /mnt/tmp && mount -t zfs zpool/root/tmp /mnt/tmp
mkdir /mnt/etc && mount -t zfs zpool/root/etc /mnt/etc
mkdir /mnt/boot && mount /dev/md1 /mnt/boot
rsync -aHAXS --one-file-system / /mnt/
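
If one of the mounts above silently failed, the rsync would write that directory into the parent dataset instead. A quick check that can be run just before the rsync is sketched below. It reads /proc/mounts and verifies that each target is a ZFS mount. The function names are hypothetical, and the directory list simply mirrors the datasets created above.

```shell
# Hypothetical helper: print the filesystem type mounted at $1,
# according to the mounts file $2 (default /proc/mounts).
# The last matching entry wins, since later mounts shadow earlier ones.
mount_fstype() {
  local target="$1" mounts="${2:-/proc/mounts}"
  awk -v t="$target" '$2 == t { type = $3 } END { if (type != "") print type }' "$mounts"
}

# Hypothetical helper: fail if any of the migration targets is not on ZFS.
check_zfs_targets() {
  local mounts="${1:-/proc/mounts}" dir rc=0
  for dir in /mnt /mnt/nix /mnt/var /mnt/tmp /mnt/etc; do
    if [ "$(mount_fstype "$dir" "$mounts")" != "zfs" ]; then
      echo "$dir is not a ZFS mount" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage on the real machine:
#   check_zfs_targets && rsync -aHAXS --one-file-system / /mnt/
```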

Let NixOS know about new filesystem

Up to this point, you could do everything while keeping your system running, rebooting, etc. From now on, everything must be done in one go (no reboot in between), or you might not be able to boot properly.

Obtain your /boot UUID and replace it below:

fileSystems."/"     = { fsType = "zfs"; device = "zpool/root"; };
fileSystems."/boot" = { fsType = "ext4"; device = "/dev/disk/by-uuid/5b27af91-f515-44f4-9a65-1516326d9297"; };
fileSystems."/etc"  = { fsType = "zfs"; device = "zpool/root/etc"; };
fileSystems."/nix"  = { fsType = "zfs"; device = "zpool/root/nix"; };
fileSystems."/tmp"  = { fsType = "zfs"; device = "zpool/root/tmp"; };
fileSystems."/var"  = { fsType = "zfs"; device = "zpool/root/var"; };
boot.initrd.secrets = {
  "/boot/pass.key" = "/boot/pass.key";
};
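
To avoid a copy and paste mistake in the UUID, the /boot line can be generated. This is a minimal sketch with a hypothetical function name; it assumes /boot lives on /dev/md1 as in the previous step and relies on blkid to read the UUID.

```shell
# Hypothetical helper: validate a filesystem UUID and print the matching
# fileSystems."/boot" line for the NixOS configuration.
boot_fs_line() {
  local uuid="$1"
  if ! printf '%s\n' "$uuid" |
      grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'; then
    echo "not a UUID: $uuid" >&2
    return 1
  fi
  printf 'fileSystems."/boot" = { fsType = "ext4"; device = "/dev/disk/by-uuid/%s"; };\n' "$uuid"
}

# Usage on the real machine (assumes /boot is on /dev/md1):
#   boot_fs_line "$(blkid -s UUID -o value /dev/md1)"
```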

Deploy partially:

nixops deploy --dry-activate
# nixos-rebuild dry-activate

Go in maintenance mode, resynchronize and prepare next boot

systemctl isolate maintenance.target
systemctl stop systemd-journald systemd-journald.socket systemd-journald-dev-log.socket systemd-journald-audit.socket
rsync -aHAXS --delete --one-file-system / /mnt/

# Prepare next boot in zfs filesystem
NIXOS_INSTALL_BOOTLOADER=1 nixos-enter --root /mnt/ -- /nix/var/nix/profiles/system/bin/switch-to-configuration boot
# Prepare next boot in raid array - for grub
/nix/var/nix/profiles/system/bin/switch-to-configuration boot

# Unmount everything and prepare the filesystem
umount -R /mnt

# Remove sdb3 from raid array and attach it to ZFS. We still have
# the data both in raid and zfs, and no file is modified due to
# maintenance mode so they’re synchronized
mdadm /dev/md0 --fail /dev/sdb3 --remove /dev/sdb3

# Repartition /dev/sdb similarly to /dev/sda
fdisk /dev/sdb
# Add sdb3 to the /boot array
mdadm --grow /dev/md1 --level=1 --raid-devices=2 --add /dev/sdb3
zpool attach -f zpool sda4 sdb4

# Wait until it’s fully synchronized (or feel lucky and don’t wait)
zpool status

# Restart
shutdown -r now
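
If you would rather not feel lucky, the wait before the restart can be scripted. The sketch below polls zpool status until the scan line no longer says that a resilver is in progress. The function names and the 60 second default interval are assumptions of mine.

```shell
# Hypothetical helper: reads `zpool status` output on stdin and succeeds
# while a resilver is still running.
pool_is_resilvering() {
  grep -q 'resilver in progress'
}

# Hypothetical helper: block until pool $1 has finished resilvering.
# $2: polling interval in seconds.
wait_for_resilver() {
  local pool="$1" interval="${2:-60}"
  while zpool status "$pool" | pool_is_resilvering; do
    sleep "$interval"
  done
}

# Usage on the real machine:
#   wait_for_resilver zpool && shutdown -r now
```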

Cleanup old system

Now that the installation is finished, you can clean up the additional disk and profit.

mdadm --stop /dev/md0
wipefs -a /dev/sdc3
shred -v /dev/sdc
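
Since shred is not reversible, a guard before it does not hurt. The sketch below checks /proc/mdstat for any array that still lists a partition of the disk as a member. The function name is hypothetical, and the check only covers md arrays, not ZFS pools.

```shell
# Hypothetical helper: succeeds if any md array in the mdstat file $2
# (default /proc/mdstat) still has a partition of disk $1 as a member.
disk_in_md_array() {
  local disk="$1" mdstat="${2:-/proc/mdstat}"
  grep -Eq "(^| )${disk}[0-9]*\[[0-9]+\]" "$mdstat"
}

# Usage on the real machine, after mdadm --stop /dev/md0:
#   disk_in_md_array sdc && echo "sdc is still in use, not shredding" >&2
```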

Immae's blog