There were several variables in play for this scenario, and my impression is that shard (re)allocation can be expensive. With that in mind, I'll try to present a succinct description so that if the answer is "don't do that", I've sacrificed few electrons and few (of your) brain cells.
We wanted to expand our system and upgrade from 1.4.4 to 1.6. Our system is
- 6 logstash boxen (using node protocol)
- 5 es data nodes (32G RAM/1.6TB)
- 1 es no-data no-master (http frontend)
- 5 shards / 1 replica per index
- 7 daily indices (so 70 shards per day) / ~30 days retention
- 5.5 TB total storage used
We started by adding four 1.6 nodes to the cluster. Things went OK, apart from some strange unbalanced shard distribution and periods of no data ingestion overnight. We closed a majority of the indices, then moved the original nodes from 1.4.4 to 1.6.
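For reference, we closed and re-opened indices with the standard index close/open APIs; something like the following, where the index name is just an example:

```shell
# Close an index before moving nodes (index name is hypothetical)
curl -XPOST 'localhost:9200/logstash-2015.06.01/_close'

# Re-open it once the nodes are back
curl -XPOST 'localhost:9200/logstash-2015.06.01/_open'
```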
When we started re-opening indices (several at once), ES node load went through the roof and ingestion of data stopped. Data ingestion would not resume until all shards were allocated and initialized and the subsequent re-allocation activity was complete. We were using kopf as the indicator of these operations.
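In case it helps anyone reproduce what we were watching: the same state kopf shows can be checked from the command line (assuming the default HTTP port on the frontend node):

```shell
# Cluster-wide view: status plus initializing/relocating shard counts
curl 'localhost:9200/_cluster/health?pretty'

# Per-shard recovery progress
curl 'localhost:9200/_cat/recovery?v'
```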
We used the kopf "cluster settings" page to manipulate parameters:
- routing: concurrent_rebalance, concurrent_recoveries
- recovery: concurrent_streams, max_bytes_per_sec
In the end we were using:
- concurrent_rebalance=1, concurrent_recoveries=10
- concurrent_streams=3, max_bytes_per_sec=200M
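For completeness, my understanding is that the kopf changes above correspond to something like the following against the cluster settings API (setting names as I believe they are in 1.x; transient, so they don't survive a full cluster restart):

```shell
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 1,
    "cluster.routing.allocation.node_concurrent_recoveries": 10,
    "indices.recovery.concurrent_streams": 3,
    "indices.recovery.max_bytes_per_sec": "200mb"
  }
}'
```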
... and the cluster just would NOT ingest data until all the shard processing was complete.
I have seen performance degradation around addition/deletion of nodes, but this just seemed SO finicky, unexpected, and mildly concerning. We just were not aware of this level of impact.
I have tried to balance info with brevity. I'm hoping it's best to provide any additional info in response to questions or hints from more informed users.
Thanks so much, and thanks to the Elastic team: awesome stuff!