Hybrid RAID 0 and Multiple Data Paths

I'm in the process of designing a cluster and would like to get as much capacity (using SSDs) as I can squeeze out of my budget. I've been reading documentation and various posts regarding the changes to multiple data paths as well as the benefits of RAID 0.

I don't like the idea of a single drive failure in a RAID 0 bringing down the entire node because I'm looking at ~20TB worth of data, but I'm not sure I need the performance gains of RAID 0. What I'd like to do is create multiple smaller RAID 0 groups and then use multiple data paths to stripe across them.

So I have a few questions:

  1. If I match my shard count to the number of RAID volumes, will ES store one shard per volume per index? For example, if I have 6 RAID 0 volumes and I specify 6 shards for a single index, will that result in each RAID volume containing 1 shard of that index?

  2. In the event of disk failure, ES should become angry. My plan is to monitor the logs for a disk failure, recreate the volume, and mount it back to the original location. When ES resumes, will it replicate the exact same shards that were on the failed volume back to the new volume?

  3. Should I even use RAID 0 in this situation? Would having 25 disks mounted separately be a safer bet if I don't need the performance gains of RAID 0?
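For reference, here's roughly the layout I have in mind in `elasticsearch.yml` (the mount points are hypothetical placeholders):

```yaml
# One data path per RAID 0 volume; Elasticsearch spreads shards
# across these paths. Mount points are made up for illustration.
path.data:
  - /mnt/raid0-1
  - /mnt/raid0-2
  - /mnt/raid0-3
  - /mnt/raid0-4
  - /mnt/raid0-5
  - /mnt/raid0-6
```

with each index then created with `number_of_shards: 6` to match.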

I look forward to testing this all myself, but I'm wondering if I've missed anything before I start the proof of concept.

Thanks for your help!

Am I interpreting this correctly that you're planning to put 20TB on a single node?

Yes, but based on your response, you're making me think I shouldn't! :sweat_smile:

The plan is 3 nodes, mirroring each other with 25 bay disk enclosures attached.

I'm not 100% sure of all of your questions but I can answer a few from memory.

When Elaticsearch loses a shard copy it waits something like a minute and then it assigns it to to some node with free space to host the copy. It seems unlikely that you'd get the node back online after a disk failure that fast so it is unlikely that it'd place the shard copy back on the failed volume.

25 disks per node is a lot. With that many disks you may prefer RAID 10 or even RAID 6 just so that management is easy. When you lose a disk on a node with ES you typically have to shut the node down until you can do something about the hardware. 25 disks means you'll have to do a lot of data copying to get the right number of shard copies back. It'll take a while, making losing a disk into an exciting event. If you use RAID 10 or RAID 6, losing a disk becomes less exciting. That lack of excitement may well be worth any performance penalty or space loss that comes with the choice.

My idea was to have a script watch the hardware log files and in the event of a disk failure event, immediately stop Elasticsearch, create a new volume, mount the volume back to the original location and start Elasticsearch back up.
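A rough sketch of what that watcher could look like, assuming Linux software RAID via `mdadm` and systemd (the log pattern, device names, and mount point are all hypothetical):

```shell
#!/bin/sh
# Hypothetical disk-failure watcher. Reads controller/kernel log lines
# on stdin and rebuilds the failed RAID 0 volume. With DRY_RUN=1 (the
# default) it only prints the commands it would execute.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

handle_line() {
  case "$1" in
    *"I/O error"*|*"Failed Drive"*)   # pattern depends on your controller
      run systemctl stop elasticsearch
      # Device names and mount point below are made up for illustration.
      run mdadm --create /dev/md3 --level=0 --raid-devices=4 \
        /dev/sde /dev/sdf /dev/sdg /dev/sdh
      run mkfs.xfs /dev/md3
      run mount /dev/md3 /var/lib/elasticsearch/raid0-3
      run systemctl start elasticsearch
      ;;
  esac
}

# Usage sketch: journalctl -kf | sh this-script.sh
while read -r line; do
  handle_line "$line"
done
```

One caveat worth testing: stopping the whole node for one failed volume also takes the healthy volumes offline, which is part of why I'm asking question 3.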

I plan on testing the theory prior to putting in a purchase order, but I was posting to find any glaring mistakes. Sounds like my first mistake was thinking I could create a node with a lot of disks.

If it makes any difference, the usage is for a SIEM, so server, router, firewall, etc. logs. My estimates are around 1000-1500 events per second, but I will have separate Logstash nodes for ingesting the data.

This is indeed a lot. Usually folks go with more nodes with fewer disks. Disk layouts like this work fairly well when you want to keep a lot of data online for a long time but don't plan on querying it super frequently. Especially if you don't mind waiting a little while when you want to run aggregations across tons of data. It'd still be fairly quick to do aggregations across small bits of data, but less quick than it is when your disk-to-RAM ratio is better.

Another thing to think about is that Elasticsearch has compatibility with data on disk across 2 major versions - so if you start with 2.x you'll be able to read those indexes until 6.x. Which is a thing to think about if you plan to keep data in there for years.

I planned on using 64GB of RAM. Do you think this use case would warrant going beyond that "golden number"?

My current proof of concept cluster started on 1.7 and I've been keeping it relatively up to date. While I wouldn't call our use case mission critical, we would be a little sad if it was down. But it's certainly not at the criticality shown in some of the customer stories. I had a couple of issues going from 1.7 to 2.x, but nothing major. I'm thinking when I go full production, I'll keep our cluster within a few months of the latest release.

Thanks for the insight! Now I'm definitely waiting for our Gold subscription before I nail down a design.

1000-1500 log events/s doesn't seem like that much, which implies you may be planning long retention in order to get to 20TB. If you're planning daily indices then you'll potentially have a lot of shards, and the overhead of those is another problem with trying to put a lot of data on few nodes. We tried using 16TB nodes once for archive purposes and couldn't actually utilize more than 5-6TB of the space before the nodes ran out of heap due to the overhead of trying to load too many shards per node. That was on ES 1.7 though, so the newer versions are probably better, but it's still something to be aware of. If you have fewer, larger shards you can probably get much further, but then there are other tradeoffs. So in general, yes, using more smaller nodes will be better.
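To put rough numbers on that shard overhead, here's a back-of-envelope calculation (the retention, shard count, and replica count are hypothetical, matching the 6-volume idea above):

```shell
# Shard copies for one index pattern: daily indices kept for a year,
# 6 primaries each, 1 replica. All numbers are illustrative.
days=365
primaries=6
replicas=1

copies=$(( days * primaries * (replicas + 1) ))
echo "$copies shard copies in the cluster"

per_node=$(( copies / 3 ))   # spread across the 3 planned nodes
echo "~$per_node shard copies per node"
```

Each of those shard copies carries fixed heap and file-handle overhead, which is why a few thousand per node gets painful fast.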


Currently, I set up my index naming convention to support daily, weekly, or monthly indices because I was concerned about how many indices I would generate after a year. Also, this use case will only have around 10 users. We do plan on having some dashboards on TVs that refresh every 5 or 10 minutes. The users may query via Kibana a few times per day as part of troubleshooting, but this would only hit recent logs. At worst, we'd run a report looking for the internet activity of a particular user for the past year. As a user, I'd be completely fine if that aggregation took 5 minutes, timeouts aside. This index generates around 11 million records per day and around 3-4TB of storage per year.

Do you know of any guidance regarding the heap sizing as it relates to total disk storage? All I ever see is don't make your heap too big and don't make it too small.

The recommendation is still the smaller of half your RAM and 30GB. Elasticsearch makes extensive use of the disk cache. There are some features that unfortunately still use a lot of heap, mostly the terms aggregation. Mostly.
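Applied to the 64GB boxes you mentioned, that works out to something like this in `jvm.options` (just the standard guidance, not a tuned value):

```
# min(half of 64GB RAM, 30GB) = 30GB heap. Staying at or below ~30GB
# also keeps you under the compressed-oops cutoff, and the remaining
# RAM goes to the OS disk cache that Elasticsearch leans on.
-Xms30g
-Xmx30g
```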