We've been running into the same issue across multiple clusters, and I'm wondering if there's something inherently wrong with what we're doing.
We ingest logs from various services and systems to keep an eye on how things are running, and we store them in hourly indices. The cluster currently runs on 3x r3.2xlarge instances in AWS, each with 4x 500GB SSDs to try to increase storage throughput.
The systems run smoothly for a good stretch of time. Our current cluster is fairly fresh: over the past 18 days it has taken in about 1.2 billion documents across roughly 6000 indices, each with 3 shards and 1 replica.
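For scale: with 3 primaries and 1 replica each, those ~6000 indices work out to something like 36,000 shard copies spread across the 3 nodes. Here's a rough sketch of how the per-node counts could be double-checked (assuming a node is reachable on localhost:9200 and Python with the requests library; adjust for your environment):

```python
# Rough sanity check of shard counts per node. Assumes a cluster node is
# reachable on localhost:9200 (adjust the URL for your setup).
from collections import Counter

import requests

# _cat/shards returns one row per shard copy (primary or replica),
# including the node it is currently allocated to.
resp = requests.get("http://localhost:9200/_cat/shards?format=json")
resp.raise_for_status()
shards = resp.json()

per_node = Counter(s.get("node") or "UNASSIGNED" for s in shards)
print("total shard copies:", len(shards))
for node, count in per_node.most_common():
    print(f"{node}: {count}")
```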
Eventually some sort of brief outage occurs and the cluster goes into recovery. Recovery then takes anywhere from 12 hours to a couple of days, which is really rough on us; at this point it's faster for us to stand up a fresh cluster and restore everything from one of our snapshots.
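In case it helps frame the diagnostics, something like the following is how the stuck recoveries could be watched via _cat/recovery (same localhost:9200 assumption as above; purely a sketch, not exactly what we run):

```python
# Rough sketch for watching recovery progress. Assumes the same
# localhost:9200 endpoint as in the previous snippet.
import requests

resp = requests.get("http://localhost:9200/_cat/recovery?format=json")
resp.raise_for_status()
recoveries = resp.json()

# One entry per shard recovery; "stage" reads "done" once a shard finishes.
in_progress = [r for r in recoveries if r.get("stage") != "done"]
print(f"{len(in_progress)} of {len(recoveries)} shard recoveries still in progress")
for r in in_progress[:20]:  # print a small sample; the full list is huge
    print(r.get("index"), r.get("shard"), r.get("stage"))
```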
Any information would be greatly appreciated. I can provide config info and diagnostics if they'd help.