ES Cluster on ZFS with PCIe3.0 SSD/SATA SSD

(Peter) #1


Just wondering if anyone has configured ES on a ZFS pool of PCIe 3.0 SSDs and SATA3 SSDs yet and, if so, how they found the performance?

If so did you do anything special with the PCIe SSD and the L2ARC?


(Jörg Prante) #2

Are you running Linux?

(Horst Birne) #3

I can't say anything about PCIe SSDs, but we use a RAID 5 of SATA SSDs and switched from BTRFS to ZFS about 1 year ago.

The performance is much better with ZFS now and it runs very smoothly, with no problems since the migration.

Luckily our servers have plenty of RAM, so we didn't configure L2ARC, as the ARC and the speed of the SSDs are more than enough for ES (ES is CPU bound rather than disk bound).

We also enabled the built-in ZFS compression with the fast LZ4 compression algorithm. The CPU overhead is rather low and it's saving a huge amount of disk space.

(Peter) #4

Yep, my test box is running Debian 8

(Peter) #5

Good to know. I figured this could be a useful way to give the 'Hot' (all SSD) and 'Warm' (all HDD) clusters a nice speed boost at little cost. What if I added 2x SSDs to the 'Warm' tier as L2ARC with LZ4, and left the 'Hot' tier as standard without compression?

(Jörg Prante) #6

While ZFS is surely an advanced file system, if you want performance with SSDs, I recommend XFS and hardware RAID. HW RAID controllers are much faster than ZFS pools.

(Peter) #7

If budget is a serious constraint, where nice array controllers, enterprise-level disks etc. aren't achievable, would you say that ZFS would be a better use of disk resources than just going with Ext3?

(Sean Johnson) #8

I'm just working on preliminary setup and config, but so far putting Elasticsearch on ZFS has been as performant as XFS on CentOS 7 systems. Personally, I find the other features of ZFS to be the benefits that push me over to using ZFS on all but the OS disk.

Now, there is currently one HUGE caveat to this. If you are going to put Elasticsearch on ZFS using the current ZoL release (, MAKE SURE you create the ZFS filesystem with the xattr=sa option. Without this, there's a very good chance that the ZFS filesystem will not correctly free up deleted blocks.

The GitHub Issue is here :
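
For anyone scripting their setup, here is a minimal sketch of the xattr=sa advice above. It only builds the `zfs create` command line with one `-o property=value` flag per property; it does not run anything. The dataset name `tank/elasticsearch` is a made-up placeholder, and `compression=lz4` is included only because it was mentioned earlier in the thread.

```python
def zfs_create_command(dataset, properties):
    """Build the argv for `zfs create`, one -o flag per property."""
    argv = ["zfs", "create"]
    # Sort so the generated command is stable between runs.
    for name, value in sorted(properties.items()):
        argv += ["-o", f"{name}={value}"]
    argv.append(dataset)
    return argv


# "tank/elasticsearch" is a hypothetical pool/dataset name.
cmd = zfs_create_command(
    "tank/elasticsearch",
    {"xattr": "sa", "compression": "lz4"},
)
print(" ".join(cmd))
```

Review the printed command before running it as root. Setting the properties at creation time is the point here, since the advice above is to create the filesystem with xattr=sa rather than to change it later.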

(Peter) #9

That's a good shout, thanks. It wasn't something I was aware of!

(Nik Everett) #10

Just so you know I've been following along with this. I don't have anything to add other than "cool" and "good luck". Personally I'd love to be able to use ZFS and I salute you for doing so. Just keep in mind that you are one of the few folks who do, so getting help might be difficult.

(Jörg Prante) #11

While you're at it, here is the best summary I know of what has to be planned before using ZFS, a little old, so some facts are outdated, but most points are still valid:

For what it's worth, I have never been able to set up ZFS to match the performance of XFS under Linux.

(Peter) #12

Great information, thanks everyone. I'll maybe add XFS into my benchmarking to ensure I've covered all bases. The Hot tier is likely to be working with 8,035,200,000 docs (4.3TB) a month, so I need to try and optimize every possible piece.
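
To put the monthly figure above into per-second terms, here is a rough back-of-the-envelope sketch. It assumes a 31-day month, decimal terabytes, and an even ingest rate with no peaks, none of which is stated in the post.

```python
DOCS_PER_MONTH = 8_035_200_000   # figure quoted in the post
BYTES_PER_MONTH = 4.3e12         # 4.3TB, assuming decimal terabytes
DAYS_IN_MONTH = 31               # assumption, not stated in the post

seconds = DAYS_IN_MONTH * 24 * 60 * 60

docs_per_second = DOCS_PER_MONTH / seconds
avg_doc_bytes = BYTES_PER_MONTH / DOCS_PER_MONTH
write_mb_per_second = BYTES_PER_MONTH / seconds / 1e6

print(f"{docs_per_second:,.0f} docs/s")
print(f"{avg_doc_bytes:,.0f} bytes per doc on average")
print(f"{write_mb_per_second:.2f} MB/s sustained")
```

Under those assumptions the sustained volume works out to a few thousand small documents per second and well under 2 MB/s of data, before replicas, merges and translog writes are counted.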


Hi! As far as I remember, ZFS on Linux is very memory hungry by default, and in most cases it should be limited by tuning the ARC min/max memory. Another point is that Marvel does not understand how to get disk usage from ZFS, though this may be fixed in recent versions (I don't know).
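
A minimal sketch of the ARC limit mentioned above, assuming the limit is applied through the `zfs_arc_max` module parameter (a value in bytes, usually placed in a file under /etc/modprobe.d/). The sizing rule and the numbers are illustrative assumptions, not a recommendation: take total RAM, subtract the ES heap and a reserve for the OS, and give the rest to ARC.

```python
GIB = 1024 ** 3


def arc_max_bytes(total_ram_gib, es_heap_gib, os_reserve_gib):
    """Leave room for the ES heap and the OS; the remainder goes to ARC."""
    remaining = total_ram_gib - es_heap_gib - os_reserve_gib
    if remaining <= 0:
        raise ValueError("no memory left over for ARC")
    return remaining * GIB


def modprobe_line(arc_max):
    """Render the module option line for the ZFS kernel module."""
    return f"options zfs zfs_arc_max={arc_max}"


# Hypothetical host: 64 GiB RAM, 30 GiB ES heap, 8 GiB kept for the OS.
limit = arc_max_bytes(64, 30, 8)
print(modprobe_line(limit))
```

The printed line would go into a modprobe configuration file and take effect when the module is next loaded.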

(Peter) #14

I figured I'd post a follow-up on this. I've now got the hardware and have been benchmarking with Bonnie++.

2x Xeon 2.1GHz E5-2620 v4
Supermicro X10DAI
128GB DDR4 2133MHz ECC
2x Samsung 512GB 950 Pro (via PCIe Gen3) (for cache drives)
1x LSI MegaRAID 93622-8i
8x Samsung 1TB 850 Pro in a RAID5 hardware array.

I'm running VMware ESXi 6 to separate the guest OS from the underlying hardware, so I was able to spin up a Windows VM and run CrystalDiskMark as a comparison.

8x 850 Drives in RAID 5

  • CrystalDiskMark Results
  • Seq Q32T1 - Read: 4137MB/s - Write: 2932MB/s
  • 4K Q32T1 - Read: 492MB/s - Write: 155MB/s
  • Seq - Read: 2544MB/s - Write: 2357MB/s
  • 4K - Read: 30MB/s - Write: 65MB/s

2x 950 NVMe Drives in RAID 0

  • CrystalDiskMark Results
  • Seq Q32T1 - Read: 3219MB/s - Write: 3065MB/s
  • 4K Q32T1 - Read: 447MB/s - Write: 436MB/s
  • Seq - Read: 2574MB/s - Write: 2511MB/s
  • 4K - Read: 41MB/s - Write: 96MB/s

When I compare that with a ZFS pool (on the RAID5 volume) + 2 NVMe cache drives, the performance isn't great!

Output Block: 298MB/sec
Output Rewrite: 166MB/sec
Input Block: 1564MB/sec

Does anyone have any suggestions on a better configuration or disk config? I'm going to try EXT4 and XFS as a comparison now to rule out any Debian/Bonnie++ issues.
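
To make the gap concrete, here is a quick sketch that sets the Bonnie++ block figures against the plain "Seq" CrystalDiskMark figures for the RAID5 array. It assumes that Bonnie++ "Input Block" is a sequential read and "Output Block" a sequential write, and that both tools report comparable MB/s. The two tools run on different OSes with different access patterns, so this is only a rough ratio.

```python
# Sequential figures quoted in the post, in MB/s.
crystal_raid5_seq = {"read": 2544, "write": 2357}
bonnie_zfs = {"read": 1564, "write": 298}  # Input Block / Output Block

ratios = {
    op: bonnie_zfs[op] / crystal_raid5_seq[op]
    for op in ("read", "write")
}

for op, ratio in ratios.items():
    print(f"{op}: ZFS pool reaches {ratio:.0%} of the raw RAID5 figure")
```

The write side is where the difference is largest, which fits with the ZFS pool sitting on top of a RAID5 volume.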


(Jörg Prante) #15

Don't use RAID5 if you are after write speed.

ZFS uses advanced features (checksums for everything, deduplication, compression) which are not targeted for maximum speed.
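
A rough sketch of why RAID5 hurts small random writes: in the textbook model each such write costs four disk I/Os (read old data, read old parity, write new data, write new parity), against two for mirrored layouts. The per-disk IOPS value below is a made-up placeholder, and a controller with write-back cache can hide part of this penalty.

```python
def raid_random_write_iops(disks, iops_per_disk, write_penalty):
    """Textbook estimate: aggregate IOPS divided by I/Os per logical write."""
    return disks * iops_per_disk / write_penalty


PER_DISK_IOPS = 10_000  # hypothetical figure, not measured on this hardware

raid5 = raid_random_write_iops(8, PER_DISK_IOPS, write_penalty=4)
raid10 = raid_random_write_iops(8, PER_DISK_IOPS, write_penalty=2)

print(f"RAID5 : {raid5:,.0f} random write IOPS")
print(f"RAID10: {raid10:,.0f} random write IOPS")
```

With the same eight disks, the model gives RAID10 twice the random write throughput of RAID5, at the cost of usable capacity.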

(Peter) #16

I've checked that controller and hoped to see JBOD, but I don't think it supports it. The RAID5 was also to help reduce the number of drives that need to be passed through to ESXi (and then the guest VM). I'm not that worried about write speed for ES, as it'll be continuous slow small writes, whereas the reads are where the performance is needed.

(Peter) #17

Just as an update.

I've tried ZFS/XFS/bcache and can't get any of the benchmarks to match the Windows CrystalDiskMark results... I'm really at a loss as to why performance is so different between the two OSes. (Unless it's just the way Bonnie++ and CrystalDiskMark display the results?)

Anyone any thoughts?
