Cluster recovery is slower than molasses

We've been running into this issue with multiple clusters and I'm wondering if there's something inherently wrong with what we're doing.

We take in logs from various services and systems to keep an eye on how things are running, and store them in hourly indices. We're currently running on 3x r3.2xlarge instances in AWS; each one has 4x 500GB SSDs to try to increase storage throughput.

The systems run pretty smoothly for a good chunk of time. Our current cluster is pretty fresh, taking in about 1.2 billion documents into roughly 6000 indices over the past 18 days. We currently have 3 shards/index and 1 replica.

Eventually, some sort of brief outage will occur that will cause the cluster to go into recovery. Recovery will take anywhere from 12 hours to a couple of days, which is really rough on us. It's currently faster for us to create a fresh cluster and use one of our snapshots to restore everything.
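For reference, the `_cat/recovery` API is what shows the per-shard progress while this crawls along; a minimal sketch of polling it (the node address is just a placeholder, not our actual endpoint):

```python
# Minimal sketch: poll the _cat/recovery API (available in ES 2.x) to watch
# shard recoveries progress. http://localhost:9200 is a placeholder for any
# node in the cluster.
import time

import requests

ES = "http://localhost:9200"

while True:
    resp = requests.get(ES + "/_cat/recovery", params={"v": "true"})
    resp.raise_for_status()
    # Skip lines whose stage column already reads "done" so only active work shows.
    active = [line for line in resp.text.splitlines() if " done " not in line]
    print("\n".join(active))
    time.sleep(30)
```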

Any information would be greatly appreciated. I can provide config info and diagnostics if they'll help.

What version are you on?

Like what?

Thanks for the quick response! We're currently on version 2.2.2.

The one we just ran into was a brief network interruption. We also had pings between nodes time out during recovery because so many tasks pile up. We've also run into this when we've needed to do rolling restarts, such as for cluster expansion, but the last one of those was a while back on an older version.

Why are you using hourly indices? If I calculate correctly, your 6000 indices consist of 36000 shards. With a total of 6TB of storage, this means the average shard size cannot exceed 175MB, which is small for Elasticsearch under most circumstances.
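Spelling that arithmetic out (a quick sketch, treating the 6TB in binary units):

```python
# Back-of-the-envelope check of the numbers above.
indices = 6000                      # hourly indices over ~18 days
primaries_per_index = 3
replicas = 1

total_shard_copies = indices * primaries_per_index * (1 + replicas)
print(total_shard_copies)           # 36000

total_storage_mb = 6 * 1024 * 1024  # ~6TB (3 nodes x 4 x 500GB) in binary MB
print(round(total_storage_mb / total_shard_copies))  # ~175MB per shard at best
```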

Having that many indices and shards in a cluster that size is excessive and wastes a lot of resources. I would recommend changing to daily indices in order to reduce the total shard count, and would expect that to help speed up your recoveries.
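If it helps, here is a minimal sketch of what the daily setup could look like, keeping your 3 shards / 1 replica in an index template; the `logs-*` pattern, template name, and node address are placeholders, not your actual names:

```python
# Minimal sketch of a daily-index setup on ES 2.x, keeping the existing
# 3 primaries / 1 replica. The "logs-*" pattern and the node address are
# placeholders, not the actual index names.
import requests

ES = "http://localhost:9200"

template = {
    "template": "logs-*",   # ES 2.x uses "template"; later versions use "index_patterns"
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
    },
}

resp = requests.put(ES + "/_template/logs-daily", json=template)
resp.raise_for_status()

# The hourly-vs-daily change itself happens wherever the index name is built
# at ingest time, e.g. a date pattern producing "logs-2016.05.01" (per day)
# instead of "logs-2016.05.01.00" (per hour).
```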

Thanks for the input. Switched everything over to daily indices and we're taking data in now to see how it goes. Off the top of my head, I don't remember why we went with hourly indices in the first place.

Out of curiosity, if 175MB is too small for a shard, is there an "ideal" size for a shard?

Ideal shard size depends on the use case, but for log analytics use cases shards between a few GB and a few tens of GB are not unusual. Shard size drives query latency, and the right size depends on the type of data and queries you have, so benchmarking is recommended.
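A quick way to see where your shards end up size-wise is the `_cat/shards` API; a minimal sketch (the node address is a placeholder):

```python
# Minimal sketch: list per-shard store sizes to see whether the daily indices
# land in the few-GB range. The node address is a placeholder.
import requests

ES = "http://localhost:9200"

resp = requests.get(
    ES + "/_cat/shards",
    params={"v": "true", "h": "index,shard,prirep,state,store"},
)
resp.raise_for_status()
print(resp.text)
```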


Sounds good. We'll see how it goes over the next couple of days as we take in more logs, and I'll post updates. We primarily use dashboards to keep an eye on things over the past half hour, so that goes pretty quickly. I'm curious to see what happens when we have to start running queries over a few days to hunt things down.

Have about a week's worth of data in the cluster and everything is running smoothly with daily indices. Still have some indices that are pretty small, but I'm not as worried about it since our shard count is way down. It's sitting at a little over 500 shards at the moment, compared to the 12k-ish we would have had with hourly indices.

Tried running queries for the week through one of our larger indices, which is about 10GB/day, and didn't notice any difference in performance. Also took one of the nodes down for a moment and recovery was silky smooth.

Thanks for the help!