Cluster broke after some network troubles

Hello, I need some advice.
Is it possible to configure an Elasticsearch 2.3.4 cluster so that nodes running on different servers restore their connection to each other after network troubles?

I have the following configuration:
server1 - node1:
cluster.name: name
node.name: node-1
node.master: true
node.data: false
index.number_of_shards: 2
index.number_of_replicas: 1
index.refresh_interval: 15s
threadpool.search.queue_size: 10000
path.logs: /data/logs/elasticsearch/node1
bootstrap.mlockall: true
network.publish_host: server1
network.bind_host: 0
discovery.zen.ping.unicast.hosts: ["server1","server2"]

server1 - node2:
cluster.name: eventhandler-main-db
node.name: node-2
node.master: false
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
index.refresh_interval: 15s
threadpool.search.queue_size: 10000
path.logs: /data/logs/elasticsearch/node2
bootstrap.mlockall: true
network.publish_host: server1
network.bind_host: 0
discovery.zen.ping.unicast.hosts: ["server1", "server2"]

The configuration on the second server (server2) is similar.

After some network troubles I got a NodeDisconnectedException:

[2017-09-12 18:02:25,111][INFO ][cluster.service] [node-2] removed {{node-3}{8W1PsU7zR72UhnKTY_h2gQ}{server2}{server2:9300}{data=false, master=true},}, reason: zen-disco-master_failed ({node-3}{8W1PsU7zR72UhnKTY_h2gQ}{server2}{server2:9300}{data=false, master=true})
[2017-09-12 18:02:25,127][DEBUG][action.admin.cluster.health] [node-2] connection exception while trying to forward request with action name [cluster:monitor/health] to master node [{node-3}{8W1PsU7zR72UhnKTY_h2gQ}{server2}{server2:9300}{data=false, master=true}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [node-3][server2:9300][cluster:monitor/health] disconnected]
[2017-09-12 18:02:25,127][DEBUG][action.admin.cluster.health] [node-2] connection exception while trying to forward request with action name [cluster:monitor/health] to master node [{node-3}{8W1PsU7zR72UhnKTY_h2gQ}{server2}{server2:9300}{data=false, master=true}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [node-3][server2:9300][cluster:monitor/health] disconnected]
[2017-09-12 18:02:25,128][DEBUG][action.admin.cluster.state] [node-2] connection exception while trying to forward request with action name [cluster:monitor/state] to master node [{node-3}{8W1PsU7zR72UhnKTY_h2gQ}{server2}{server2:9300}{data=false, master=true}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [node-3][server2:9300][cluster:monitor/state] disconnected]
[2017-09-12 18:02:25,130][DEBUG][action.admin.cluster.health] [node-2] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
NodeDisconnectedException[[node-3][server2:9300][cluster:monitor/health] disconnected]
[2017-09-12 18:02:29,524][INFO ][cluster.service] [node-2] detected_master {node-1}{F1CvVKftTnmRezujp7Werw}{server1}{server1:9300}{data=false, master=true}, added {{node-1}{F1CvVKftTnmRezujp7Werw}{server1}{server1:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{node-1}{F1CvVKftTnmRezujp7Werw}{server1}{server1:9300}{data=false, master=true}])
[2017-09-12 18:03:29,485][WARN ][transport] [node-2] Received response for a request that has timed out, sent [58961ms] ago, timed out [28961ms] ago, action [internal:discovery/zen/fd/master_ping], node [{node-1}{F1CvVKftTnmRezujp7Werw}{server1}{server1:9300}{data=false, master=true}], id [2706]
[2017-09-12 18:03:59,575][INFO ][cluster.service] [node-2] removed {{node-4}{TthpIEPcSBCE3Irzzhblvw}{server2}{server2:9301}{master=false},}, reason: zen-disco-receive(from master [{node-1}{F1CvVKftTnmRezujp7Werw}{server1}{server1:9300}{data=false, master=true}])
[2017-09-12 18:04:00,123][DEBUG][action.search] [node-2] Node [TthpIEPcSBCE3Irzzhblvw] not available for scroll request [cXVlcnlUaGVuRmV0Y2g7MjsyMjpUdGhwSUVQY1NCQ0UzSXJ6emhibHZ3OzIzOlR0aHBJRVBjU0JDRTNJcnp6aGJsdnc7MDs=]
[2017-09-12 18:04:00,123][DEBUG][action.search] [node-2] Node [TthpIEPcSBCE3Irzzhblvw] not available for scroll request [cXVlcnlUaGVuRmV0Y2g7MjsyMjpUdGhwSUVQY1NCQ0UzSXJ6emhibHZ3OzIzOlR0aHBJRVBjU0JDRTNJcnp6aGJsdnc7MDs=]
[2017-09-12 18:04:01,385][DEBUG][action.admin.cluster.node.info] [node-2] failed to execute on node [TthpIEPcSBCE3Irzzhblvw]
NodeDisconnectedException[[node-4][server2:9301][cluster:monitor/nodes/info[n]] disconnected]
[2017-09-12 18:04:01,387][WARN ][action.index] [node-2] [events-2017.09.12][1] failed to perform indices:data/write/index[r] on node {node-4}{TthpIEPcSBCE3Irzzhblvw}{server2}{server2:9301}{master=false}
NodeDisconnectedException[[node-4][server2:9301][indices:data/write/index[r]] disconnected]
[2017-09-12 18:04:01,410][WARN ][action.index] [node-2] [events-2017.09.12][1] failed to perform indices:data/write/index[r] on node {node-4}{TthpIEPcSBCE3Irzzhblvw}{server2}{server2:9301}{master=false}
NodeDisconnectedException[[node-4][server2:9301][indices:data/write/index[r]] disconnected]
[2017-09-12 18:04:01,387][WARN ][action.index] [node-2] [events-2017.09.12][0] failed to perform indices:data/write/index[r] on node {node-4}{TthpIEPcSBCE3Irzzhblvw}{server2}{server2:9301}{master=false}

The workaround is to restart the master node on one of the two servers, after which the cluster recovers.
Is it possible to automate this so that the cluster restores itself? Are there any ways to do that?

It looks like the nodes simply disconnected. However, a much bigger concern is that discovery.zen.minimum_master_nodes is not set. This can lead to a split brain, where the network partitions and each side elects its own master. If you set it to a quorum of the master-eligible nodes (2 in your case, though adding another master-eligible node would be better, so you can keep it at 2 with 3 master-eligible nodes), then during a network disconnection the nodes will not elect two masters, allowing them to reconnect when the partition ends.
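For illustration, a minimal elasticsearch.yml sketch of that setup, assuming a third master-eligible node on a hypothetical host server3 (the host name is a placeholder, not taken from your configuration):

node.master: true
discovery.zen.minimum_master_nodes: 2   # quorum of 3 master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.ping.unicast.hosts: ["server1", "server2", "server3"]   # server3 is the hypothetical third host

Note that with only 2 master-eligible nodes a value of 2 also means the cluster cannot elect a master while either of them is unreachable, which is why a third master-eligible node is the safer layout.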


dakrone thanks, I have a question:
If the partition lasts 1 hour, will the disconnected node keep trying to reconnect to the cluster automatically, or not?
This is for the case with 3 master-eligible nodes and discovery.zen.minimum_master_nodes: 2.
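As far as I understand, reconnection is governed by the zen fault-detection settings; the values below are what I believe the 2.3.4 defaults are (assumptions on my part, not taken from my config):

discovery.zen.fd.ping_interval: 1s    # how often a node pings the master it follows
discovery.zen.fd.ping_timeout: 30s    # how long to wait for each ping response
discovery.zen.fd.ping_retries: 3      # failed pings before the master is considered gone

So after the master is dropped, I assume the node keeps pinging the hosts in discovery.zen.ping.unicast.hosts until the partition heals and then rejoins. Is that right?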

dakrone thanks, I checked this case, and the cluster successfully restored itself after several network partitions.
I don't have any more questions.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.