Using RAID 0 vs multiple data paths after commit #10461

Hello all,
I posted this on the Google mailing list, but I thought I'd ask here as well.

I currently have 2 servers, each with 128GB RAM, 2 HDDs in a RAID 1, and 10 SSDs. I was instructed to create an ELK stack on these servers, using the SSDs as separate disks and pointing ES at each one in the config. The reasoning behind this was to avoid the speed hit from the RAID controller, as well as to gain flexibility. My plan was for the 1st server to run a single instance of ES with 31GB RAM, leaving the rest of the memory for Kibana, Logstash, and Redis, and for the 2nd server to run 2 instances of ES with 31GB each.

When I researched RAID vs multiple data paths, everybody seemed to recommend not using RAID (or using RAID 0), but I fail to see how multiple data paths will give flexibility. Up until today, I was under the impression that ES stripes the data anyway, so it would be like a RAID 0 (except with a lot more work on the fstab side of things). If one disk went down, you'd lose the whole node, since ES didn't care where it placed the data. Now, however, I read commit #10461 on GitHub, and it seems to indicate that the code was changed to use a single path for each shard? If that's the case, and I have 3 shards + 1 replica each (because we have 3 nodes), how does this utilize all 10 SSDs? If this is NOT the case, the only real data redundancy and resiliency is still via the replicas, correct? So it doesn't really matter whether I use RAID 0 or multiple data paths? Can anybody shed some light on this issue? I appreciate any and all help!
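To put rough numbers on the question above, here is a minimal sketch. It assumes that each shard copy lives on exactly one data path (my reading of the commit, not something confirmed in this thread) and that shard copies are spread evenly across nodes. The function name and its parameters are made up for illustration.

```python
import math


def max_paths_used(primaries: int, replicas: int, nodes: int, paths_per_node: int) -> int:
    """Upper bound on the data paths a single index can occupy on one node.

    Assumes one data path per shard copy and an even spread of copies
    across nodes (both are assumptions, not documented guarantees).
    """
    copies = primaries * (1 + replicas)    # primaries plus all replica copies
    per_node = math.ceil(copies / nodes)   # copies landing on the busiest node
    return min(per_node, paths_per_node)   # cannot use more paths than exist


# 3 primaries, 1 replica each, 3 nodes, 10 SSDs per node
print(max_paths_used(3, 1, 3, 10))  # 2
```

Under those assumptions a single index would touch at most 2 of the 10 SSDs on a node. More indices mean more shard copies per node, so more paths would come into use over time.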

There are two very different approaches.

One approach is when you can configure your own machines in your own data center. Then you can select your hardware, maintain it, manage failures, choose your failure recovery strategy, etc. The other approach is when you are "in the cloud", that is, when you must provision your software on other people's machines.

For the first approach, I recommend RAID0 if you can afford to let complete nodes fail in case of disk errors. It has maximum performance because of the hardware controller support. To ensure availability, this requires replica level 1 (or higher) so you can still operate your cluster when one node fails and must be decommissioned.

For the other approach, you should listen to the people who prepare and maintain the hardware. If they set up disks without RAID (JBOD), ES helps you with the multiple data path feature to spread disk load.
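For the JBOD case, the multiple data path feature is configured through the `path.data` setting in elasticsearch.yml, which accepts a comma-separated list of directories. As a small sketch, the helper below builds that line from a list of mount points. The helper name and the `/mnt/ssdN` mount points are hypothetical; substitute whatever your fstab actually mounts.

```python
def render_path_data(mount_points):
    """Return an elasticsearch.yml line that lists several data paths."""
    if not mount_points:
        raise ValueError("need at least one data path")
    return "path.data: " + ",".join(mount_points)


# Hypothetical mount points, one per SSD
mounts = [f"/mnt/ssd{i}" for i in range(1, 11)]
print(render_path_data(mounts))
```

Each ES instance on a machine would need its own set of directories, so with two instances on one server you would split the mount points between them.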

Thanks Jorg. This is a server that I'll be setting up myself (hardware), so I can configure the RAID or leave it JBOD. The issue I'm dealing with is: if a disk does fail, what happens in the JBOD scenario? Is it possible that just that shard will fail, and the replica shard (or the primary, if the failed drive held a replica) will take over? I know that for a failed node that's the way it's supposed to work, but what about a single shard? Also, if each shard is only on a single data path, is the inverse true as well? Will ES only place a single shard on each data path? Or will a data path contain multiple shards if it has the space, so that potentially multiple shards (but not the whole node) can fail?

It's difficult. If one disk fails you lose everything on it, and
Elasticsearch gets upset until you shut down the service, remove the
disk from its configuration, and start it again. This is the best case scenario.
As it is now I suspect it'll amount to the same thing as losing the whole
RAID0 - the files on the other disks aren't super useful if you have
replicas because they won't contain a full replica (until that commit you
mention), and Elasticsearch will have spun up a new replica when it noticed
that that node was down anyway. It might have trouble noticing, too.

If you have 2 disks in the system I'd go RAID0. If you have ten then RAID10,
though that is just my gut from relational db days. They hate RAID5/6. I
suspect ES actually is less bad but I don't know how much.

That's what I'm not sure about: the behavior of ES if one of the disks goes bad. When a node goes bad, the replicas are supposed to be promoted for any primaries that were on that node. Is that true when a single shard goes bad as well? I suppose I could test by force-unmounting the drive on the OS side and seeing how ES reacts, but I was hoping there were actual test cases, or at least a theory of behavior, before I had to resort to that.
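If it does come down to a manual test, one way to watch the cluster while a drive is pulled is to poll the cluster health API. The sketch below says nothing about how ES will actually behave on a disk failure; it only reads the `status` and `unassigned_shards` fields from `GET /_cluster/health`. The `http://localhost:9200` address and the function names are assumptions for illustration.

```python
import json
from urllib.request import urlopen


def summarize_health(health: dict) -> str:
    """Condense a _cluster/health response into a one-line summary."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    return f"status={status} unassigned_shards={unassigned}"


def fetch_health(base_url: str = "http://localhost:9200") -> dict:
    """Fetch cluster health from a node (address is an assumption)."""
    with urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        return json.load(resp)


# Against a running node you would call:
#   print(summarize_health(fetch_health()))
```

Running this before and after unmounting the drive would show whether the status changes and whether any shards become unassigned.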

I think I should also ask a fundamental question, because I'm pretty new to development (more of a systems guy until I got this job): if there is a pull request on the Elasticsearch GitHub that talks about this change in data path storage (from striping to a single data path per shard), that doesn't mean the code has been integrated into the main release, right?