We're in the process of reconfiguring our Elasticsearch clusters to have separate client and master nodes. While adding the master nodes was non-intrusive, the rolling restart of the data nodes after setting node.master to false turned out to be disruptive, causing a red cluster.
Our setup, outlined:
ES 1.7.5
4 client nodes
12 data nodes
3 master nodes
Our elasticsearch.yml has:
discovery.zen.ping.unicast.hosts: ["esmaster1", "esmaster2", "esmaster3", "esclient1", "esclient2", "esclient3", "esclient4", "esdata1", "esdata2", "esdata3", "esdata4", "esdata5", "esdata6", "esdata7", "esdata8", "esdata9", "esdata10", "esdata11", "esdata12"]
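For reference, the node-role settings implied by the setup above look roughly like this (a sketch of the relevant elasticsearch.yml lines per node type, not copied verbatim from our files):

    # master nodes (esmaster1-3): master-eligible, no data
    node.master: true
    node.data: false

    # client nodes (esclient1-4): neither master-eligible nor data
    node.master: false
    node.data: false

    # data nodes (esdata1-12): data only, after setting node.master to false
    node.master: false
    node.data: true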
Before the reconfig, discovery.zen.minimum_master_nodes was set to 6, i.e. int((number of master-eligible nodes / 2) + 1). After the reconfig, we set it to 2, based on the 3 dedicated master nodes.
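Concretely, that change amounts to something like the following (a sketch, assuming the setting is applied dynamically through the cluster settings API on localhost:9200; it is also kept in elasticsearch.yml):

    # lower minimum_master_nodes to match the 3 dedicated master nodes
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "persistent": { "discovery.zen.minimum_master_nodes": 2 }
    }'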
The rolling restart was done as follows for each node (a sketch of the corresponding API calls follows the list):
- stop all indexers
- disable shard allocation
- shut down Elasticsearch on the node via the API
- restart the Elasticsearch service
- re-enable shard allocation
- wait until cluster is green
- start indexers
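For completeness, a minimal sketch of the API calls behind the middle steps, assuming each node answers on localhost:9200 and Elasticsearch runs as a system service (the indexer start/stop commands are specific to our environment and left out):

    # disable shard allocation so shards don't start relocating
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.enable": "none" }
    }'

    # shut down Elasticsearch on this node via the 1.x shutdown API
    curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'

    # start the service again (with node.master: false in elasticsearch.yml)
    sudo service elasticsearch start

    # re-enable shard allocation
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.enable": "all" }
    }'

    # block until the cluster reports green
    curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30m'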
Once the indexers had caught up with real time, I would proceed with the next node. This went according to plan until the last data node was restarted, which happened to be the elected master at the time (I saved it for last; perhaps I shouldn't have?).
These graphs show exactly the timeframe during which we were impacted:
[continuing in a follow-up post as I have exceeded my 5000 character max]