Database prepare time
Let's see how fast (or slow) it can be to prepare a PostgreSQL database via the sysbench prepare command. Please note that sysbench prepare is much heavier than a “simple” raw SQL import, as it issues many synchronized writes (fsync) rather than a single sync at the end.
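For reference, a typical invocation looks like the following (the table count, row count, and connection parameters here are illustrative, not the exact ones used in this test):

```shell
# Populate the test tables against a PostgreSQL instance.
# --tables/--table-size values are examples only.
sysbench oltp_read_write \
    --db-driver=pgsql \
    --pgsql-host=127.0.0.1 \
    --pgsql-user=sbtest \
    --pgsql-password=sbtest \
    --pgsql-db=sbtest \
    --tables=16 \
    --table-size=1000000 \
    prepare
```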
The fastest performer was LVM with preallocated space, with preallocated Qcow2 slightly behind. ZFS is again very fast, with ThinLVM+nozeroing on its heels.
But the real winner is the ThinLVM+EXT4 combo: its performance was very similar to that of a preallocated LVM volume, with the added flexibility of a thin volume. And while you can argue that using a direct-attached ThinLVM volume without zeroing is a security hazard (at least in untrusted environments), it is relatively safe to use it as the backing device of a host-side filesystem installation.
What about the large XFS vs EXT4 gap? It can probably be attributed to more efficient journal commits, but maybe it is another problem entirely: I noticed that with default 512KB chunks, the XFS log buffer is (by default) not optimally sized. Technical explanation: the XFS log buffer size is MAX(32768, log_sunit). At filesystem creation mkfs.xfs tries to optimize the sunit value for the current RAID setup, but a 512KB stripe value is too large, so mkfs.xfs reverts to a sunit value of only 32KB. In fact, during filesystem creation you can see something similar to:
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
This, in turn, causes a low default log buffer size. Please note that while logbsize is a tunable mount parameter (and we can indeed tune the mkfs options as well), for this comparison I decided to stick with default options, so I left XFS with its default settings. In a future article I'll dissect this (and other) behaviors, but that is work for another time...
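To make the sizing logic concrete, here is a small sketch (my own illustration, not the actual XFS source code) of how the default log buffer size falls out of the stripe-unit fallback:

```python
KiB = 1024

def effective_log_sunit(raid_chunk_bytes, cap=256 * KiB):
    """mkfs.xfs caps the log stripe unit at 256 KiB; larger RAID
    chunks (e.g. 512 KiB) make it fall back to 32 KiB."""
    return raid_chunk_bytes if raid_chunk_bytes <= cap else 32 * KiB

def default_logbsize(log_sunit_bytes):
    """Default in-memory log buffer size: MAX(32768, log_sunit)."""
    return max(32 * KiB, log_sunit_bytes)

sunit = effective_log_sunit(512 * KiB)   # 512 KiB chunk -> 32 KiB sunit
print(sunit, default_logbsize(sunit))    # 32768 32768
```

With a 128KB chunk, by contrast, the sunit would survive intact and the log buffer would grow to 128KB accordingly.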
Another possible reason is XFS log placement, right at the middle of the volume: with a mostly-empty filesystem (as in this case: of our 800 GB volume, a maximum of only 32 GB were allocated) it is not the best placement. On the other hand, as usage grows, it surely becomes a good choice.
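A back-of-the-envelope calculation (using the numbers above, and assuming the allocated data sits at the start of the volume) shows why the mid-disk log hurts a mostly-empty filesystem:

```python
VOLUME_GB = 800
ALLOC_GB = 32                    # only ~32 GB of 800 GB actually allocated

data_mid = ALLOC_GB / 2          # assume data lives at the volume start
log_default = VOLUME_GB / 2      # XFS places the log near the middle
log_remapped = 0                 # a log placed at the beginning instead

# Approximate head travel between the data region and the journal
seek_default = abs(log_default - data_mid)     # 384.0 GB
seek_remapped = abs(log_remapped - data_mid)   # 16.0 GB
print(seek_default, seek_remapped)
```

Of course real seek patterns are more complex, but a ~24x difference in average head travel is hard to ignore on rotating media.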
What about BTRFS? It surprised me: it is fast when using CoW + preallocation or NoCoW + thin, and slow in the other cases. The first result puzzles me: for a CoW filesystem, preallocation should not matter much, as new blocks are never rewritten in place.
On the other hand, the fast NoCoW + thin result can be explained: thin provisioning gives the filesystem a chance to turn sparse writes into contiguous ones. But hey – the DB inserts performed by sysbench should already be contiguous, right? True, but don't forget the XFS layer inside the guest image: as explained, by default XFS stores its journal at the middle of the volume. Thin provisioning means that the host can remap the log to another location (the beginning of the volume, as guest filesystem creation is one of the first things the installer accomplishes), so total seek time can go down (even with small 8 GB images, the guest filesystem was under 20% full, so the middle-of-disk placement was not optimal). In other words, we are probably observing an interesting interaction between the host and guest filesystems, and their respective methods of allocating data/metadata blocks.
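This remapping effect can be sketched with a toy thin-pool allocator (a simplification of my own, not the actual dm-thin code): virtual chunks get physical space in first-write order, so the guest's mid-disk journal, being written early by the installer, lands near the start of the backing store:

```python
class ToyThinPool:
    """Toy thin-provisioning map: a virtual chunk gets physical space
    the first time it is written, in strictly increasing order."""
    def __init__(self, chunk_size=512 * 1024):
        self.chunk_size = chunk_size
        self.mapping = {}        # virtual chunk -> physical chunk
        self.next_phys = 0

    def write(self, virt_offset):
        vchunk = virt_offset // self.chunk_size
        if vchunk not in self.mapping:
            self.mapping[vchunk] = self.next_phys
            self.next_phys += 1
        return self.mapping[vchunk]

pool = ToyThinPool()
GiB = 1024 ** 3
# The installer's mkfs writes the guest journal (middle of an
# 8 GB image) before most of the data blocks...
log_phys = pool.write(4 * GiB)   # virtual middle -> physical chunk 0
data_phys = pool.write(0)        # virtual start  -> physical chunk 1
print(log_phys, data_phys)
```

So even though the guest thinks its journal sits mid-disk, on the physical media it ends up adjacent to the earliest-written data, shortening seeks.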