Moving shards is slow

I did not mean to imply that Elasticsearch should balance primaries; I meant that I should be doing it manually (or with an external daemon).

My particular configuration has two zones: one zone is for "backup", made up of weak machines with large drives, and the other is the "spot" zone with powerful machines and faster drives. The spot zone is a bit unstable, but it provides us with 10x the processing power for a given price point. Stability in the spot zone is enhanced by distributing over many machine types at a variety of price points, and nodes get plenty of time to shut down when they are about to be reclaimed. Elasticsearch 1.7 is awesome in this configuration! It handles the shard replication and distribution on top of an ever-changing node environment.
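
For context, the zone split is just ordinary shard allocation awareness; a minimal sketch below, assuming each node is already tagged with a `zone` attribute in its elasticsearch.yml (`node.zone` on 1.x, `node.attr.zone` on 5.x+), and with the endpoint and zone values as placeholders:

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Spread shard copies across the "zone" node attribute so every shard
# keeps a copy in both the backup zone and the spot zone.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": "zone",
            # Forced awareness: never pile all copies of a shard into one
            # zone, even while the other zone is temporarily short on nodes.
            "cluster.routing.allocation.awareness.force.zone.values": "backup,spot",
        }
    },
)
resp.raise_for_status()
print(resp.json())
```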

Our use case is a few users querying billions of records covering our test results, CI metadata, and data extracted from a few other reasonably large systems. Data consistency is nice-to-have, but at our scale data is lost all the time. The CI writes partial files, CI has bugs, tests are failing, S3 times out, schemas are changing, the ETL pipeline does not handle corner cases, etc. The losses that Elasticsearch introduces are relatively tiny. The whole system is statistically accurate for our needs. Even if we lose live data, we have the original data sources to fall back on. Elasticsearch 1.7 is awesome for this use case! We need not manage the changing schemas, and queries are significantly faster than any database or big data solution on the given hardware (not including the 10x savings from spot nodes).

So, we have had wonderful success with Elasticsearch 1.7 for over two years! Now it is time to get even more benefit from an upgrade! Even if (manually) moving primaries is expensive, the orders-of-magnitude benefit we get from Elasticsearch will be worth it, as long as it can recover shards in a timely manner.
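
To be concrete, by "manually" moving primaries I mean nothing fancier than the cluster reroute API. A rough sketch of the kind of request an external daemon might issue; the index name, shard number, and node names are all placeholders:

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Explicitly relocate one shard copy to another node via the reroute API.
resp = requests.post(
    f"{ES}/_cluster/reroute",
    json={
        "commands": [
            {
                "move": {
                    "index": "unittest-2018-01-01",  # placeholder index
                    "shard": 0,
                    "from_node": "backup-node-7",    # placeholder nodes
                    "to_node": "spot-node-3",
                }
            }
        ]
    },
)
resp.raise_for_status()
```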

The 5GB shards you see are on the new ES 6.1 cluster; they are just small samples of production data while we run scaling tests. In production we create a 60-shard index each week to store 2TB of test data. We target 20GB shards (±10GB) so we can move shards quickly for the inevitable recovery. We also limit our concurrent recoveries so that each shard moves as fast as possible and can accept query load ASAP; it is better to move 1 shard in 1 minute than 2 shards in 2 minutes, because the former gives us a shard to query sooner.
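
Roughly, the weekly setup looks like the sketch below. The index name, replica count, and endpoint are placeholders, and the two cluster settings are just one way to cap concurrent recoveries:

```python
import requests

ES = "http://localhost:9200"   # placeholder cluster address
WEEK = "2018-01-01"            # placeholder weekly suffix

# Weekly 60-shard index: ~2TB/week over 60 shards lands near the
# 20GB-per-shard target.
requests.put(
    f"{ES}/unittest-{WEEK}",
    json={
        "settings": {
            "index.number_of_shards": 60,
            "index.number_of_replicas": 1,  # placeholder replica count
        }
    },
).raise_for_status()

# Keep concurrent recoveries low so each relocating shard gets the full
# bandwidth and is back serving queries as soon as possible.
requests.put(
    f"{ES}/_cluster/settings",
    json={
        "persistent": {
            "cluster.routing.allocation.node_concurrent_recoveries": 1,
            "cluster.routing.allocation.cluster_concurrent_rebalance": 1,
        }
    },
).raise_for_status()
```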