Elasticsearch cluster of 4 nodes has "master not discovered exception"

Hi,

I had an error last week, apparently a Java memory (RAM) problem, that took down my install on the first node (where the Kibana role was installed). Ever since then I have been unable to get the cluster operational again.

After a full restart, all 4 nodes seem to communicate OK and report alive. Some shards are not allocated, but I can at least get to Yellow status. If left long enough, though, the cluster transitions to RED and stays there. I can still telnet between all of the nodes.

Which logs do I need to look in, and does anyone have any good ideas?

root@SERVER:~# curl -XGET 'http://localhost:9200/_cluster/state?pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

What does your node configuration(s) look like?

Hi Christian, pasted below.

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluste$
#
# Please consult the documentation for further information on configuration opt$
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: *************
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: ${HOSTNAME}
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by com$
#
path.data: /data/data/elasticsearch
#
# Path to log files:
#
path.logs: /data/logs/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: [_tun0_, _local_, _site_]
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.$
#
# Prevent the "split brain" by configuring the majority of nodes (total number $
#
#discovery.zen.minimum_master_nodes: 1
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true

It looks like all 4 nodes are master-eligible. If that is the case, you need to set discovery.zen.minimum_master_nodes to 3 (a majority of the 4 master-eligible nodes: 4 / 2 + 1) in accordance with these guidelines in order to avoid split-brain scenarios.
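For anyone wondering where the 3 comes from: the usual quorum rule for Zen discovery in 6.x is (number of master-eligible nodes / 2) + 1, using integer division. A minimal sketch of that arithmetic follows; the function name is only illustrative and not part of any Elasticsearch API.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Print the quorum for a few small cluster sizes, including the
# 4-node cluster discussed in this thread.
for n in (1, 2, 3, 4, 5):
    print(n, "master-eligible ->", minimum_master_nodes(n))
```

The result goes into discovery.zen.minimum_master_nodes in elasticsearch.yml on every master-eligible node.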

Thank you, I will try that and feed back to the guys who helped with this install.

Finally got a chance to try this. Unfortunately, it has not changed the status at all.

root@SERVER:~# curl -XGET 'http://localhost:9200/_cluster/state?pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

Hold that thought... I forgot to apply the change to elasticsearch.yml on all nodes. Will update on whether it works now that I have corrected my mistake.

This has not worked... Suggestions for advanced troubleshooting would be appreciated.

Can you telnet from one node to port 9300 on another? Are there any firewall rules that prevent the nodes from connecting? Is there anything in the Elasticsearch logs?
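If telnet is not handy on every node, the same reachability check can be scripted. This is a rough sketch using only the Python standard library; the function names are made up for illustration, and 9300 is the transport port mentioned above. A successful connect only shows that the port was reachable at that moment, not that the nodes stay connected.

```python
import socket


def can_connect(host: str, port: int = 9300, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_hosts(hosts, port: int = 9300) -> dict:
    """Map each host to True/False depending on whether the port is reachable."""
    return {host: can_connect(host, port) for host in hosts}
```

Run something like check_hosts(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]) from each node in turn (addresses as they appear in this thread) to cover every direction.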

Hi, I have checked this before, but yes: I can telnet between all nodes on port 9300, and there are no firewall rules (nothing has changed since it was working).

Which log am I looking for?

Look in the Elasticsearch logs, which should be located under /data/logs/elasticsearch based on your config above. Is there anything related to zen discovery there?

Hi, yes, I do seem to have lines related to zen discovery; however, I am unsure how to summarise 121576 lines of log...

Can you perhaps provide some examples or put relevant sections in a gist and link to it here?
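One possible way to boil a log of that size down to something postable is to count entries per level and logger and then pull out only the ZenDiscovery lines. The sketch below assumes the 6.x log layout seen elsewhere in this thread ([timestamp][LEVEL][logger] ...); the helper names and the regular expression are my own, not anything shipped with Elasticsearch.

```python
import re
from collections import Counter

# Matches the start of a log entry such as
# [2018-04-18T00:00:03,166][INFO ][o.e.d.z.ZenDiscovery     ] [SERVER1] ...
ENTRY = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<level>[A-Z]+)\s*\]\[(?P<logger>[^\]]+?)\s*\]"
)


def summarise(lines):
    """Count log entries per (level, logger); stack-trace lines are skipped."""
    counts = Counter()
    for line in lines:
        m = ENTRY.match(line)
        if m:
            counts[(m.group("level"), m.group("logger"))] += 1
    return counts


def discovery_lines(lines):
    """Return only the entries written by the ZenDiscovery logger."""
    out = []
    for line in lines:
        m = ENTRY.match(line)
        if m and "ZenDiscovery" in m.group("logger"):
            out.append(line.rstrip("\n"))
    return out


# Small sample in the same layout as the real log (messages shortened).
sample = [
    "[2018-04-18T00:00:03,166][INFO ][o.e.d.z.ZenDiscovery     ] [SERVER1] failed to send join request to master",
    "[2018-04-18T00:00:30,205][WARN ][r.suppressed             ] path: /system/events, params: {index=system, type=events}",
    "\tat java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]",
]
for (level, logger), n in summarise(sample).most_common():
    print(level, logger, n)
```

For the real file, pass an open file handle instead of the sample list, for example the main log under /data/logs/elasticsearch (typically named after the cluster, which is an assumption here).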

Sure, here are a couple of snippets (which mostly repeat in the log):

[2018-04-18T00:00:03,166][INFO ][o.e.d.z.ZenDiscovery     ] [SERVER1] failed to send join request to master [{SERVER4}{Z3e14QZHRuabo3e0sVyUIw}{q_ybVx2CRAKc5jIl0kfDYA}{10.0.0.4}{10.0.0.4:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-04-18T00:00:30,205][WARN ][r.suppressed             ] path: /system/events, params: {index=system, type=events}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
	at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:165) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:387) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:273) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:421) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.2.jar:6.1.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
[2018-04-18T00:00:35,251][WARN ][r.suppressed             ] path: /system/events, params: {index=system, type=events}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
	at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:165) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:387) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:273) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:421) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.2.jar:6.1.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
[2018-04-18T09:05:01,738][WARN ][o.e.c.a.s.ShardStateAction] [SERVER1] [winlogbeat-6.1.2-2018.04.16][1] received shard failed for shard id [[winlogbeat-6.1.2-2018.04.16][1]], allocation id [aF_mHJp0Qf6CwS1-jiFPoA], primary term [3], message [failed to perform indices:data/write/bulk[s] on replica [winlogbeat-6.1.2-2018.04.16][1], node[m4g37E_rR56fgQa0iT0aRg], [R], s[STARTED], a[id=aF_mHJp0Qf6CwS1-jiFPoA]], failure [NodeDisconnectedException[[SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected
[2018-04-18T09:05:01,739][WARN ][o.e.c.a.s.ShardStateAction] [SERVER1] [winlogbeat-6.1.2-2018.04.16][1] received shard failed for shard id [[winlogbeat-6.1.2-2018.04.16][1]], allocation id [aF_mHJp0Qf6CwS1-jiFPoA], primary term [3], message [failed to perform indices:data/write/bulk[s] on replica [winlogbeat-6.1.2-2018.04.16][1], node[m4g37E_rR56fgQa0iT0aRg], [R], s[STARTED], a[id=aF_mHJp0Qf6CwS1-jiFPoA]], failure [NodeDisconnectedException[[SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected
[2018-04-18T09:05:01,739][WARN ][o.e.c.a.s.ShardStateAction] [SERVER1] [winlogbeat-6.1.2-2018.04.16][1] received shard failed for shard id [[winlogbeat-6.1.2-2018.04.16][1]], allocation id [aF_mHJp0Qf6CwS1-jiFPoA], primary term [3], message [failed to perform indices:data/write/bulk[s] on replica [winlogbeat-6.1.2-2018.04.16][1], node[m4g37E_rR56fgQa0iT0aRg], [R], s[STARTED], a[id=aF_mHJp0Qf6CwS1-jiFPoA]], failure [NodeDisconnectedException[[SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected
[2018-04-18T09:05:01,740][WARN ][o.e.c.a.s.ShardStateAction] [SERVER1] [winlogbeat-6.1.2-2018.04.16][1] received shard failed for shard id [[winlogbeat-6.1.2-2018.04.16][1]], allocation id [aF_mHJp0Qf6CwS1-jiFPoA], primary term [3], message [failed to perform indices:data/write/bulk[s] on replica [winlogbeat-6.1.2-2018.04.16][1], node[m4g37E_rR56fgQa0iT0aRg], [R], s[STARTED], a[id=aF_mHJp0Qf6CwS1-jiFPoA]], failure [NodeDisconnectedException[[SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected
[2018-04-18T09:05:01,740][WARN ][o.e.c.a.s.ShardStateAction] [SERVER1] [winlogbeat-6.1.2-2018.04.16][1] received shard failed for shard id [[winlogbeat-6.1.2-2018.04.16][1]], allocation id [aF_mHJp0Qf6CwS1-jiFPoA], primary term [3], message [failed to perform indices:data/write/bulk[s] on replica [winlogbeat-6.1.2-2018.04.16][1], node[m4g37E_rR56fgQa0iT0aRg], [R], s[STARTED], a[id=aF_mHJp0Qf6CwS1-jiFPoA]], failure [NodeDisconnectedException[[SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected
[2018-04-18T09:05:01,741][WARN ][o.e.c.a.s.ShardStateAction] [SERVER1] [winlogbeat-6.1.2-2018.04.16][1] received shard failed for shard id [[winlogbeat-6.1.2-2018.04.16][1]], allocation id [aF_mHJp0Qf6CwS1-jiFPoA], primary term [3], message [failed to perform indices:data/write/bulk[s] on replica [winlogbeat-6.1.2-2018.04.16][1], node[m4g37E_rR56fgQa0iT0aRg], [R], s[STARTED], a[id=aF_mHJp0Qf6CwS1-jiFPoA]], failure [NodeDisconnectedException[[SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected
[2018-04-18T09:05:01,741][WARN ][o.e.c.a.s.ShardStateAction] [SERVER1] [winlogbeat-6.1.2-2018.04.16][1] received shard failed for shard id [[winlogbeat-6.1.2-2018.04.16][1]], allocation id [aF_mHJp0Qf6CwS1-jiFPoA], primary term [3], message [failed to perform indices:data/write/bulk[s] on replica [winlogbeat-6.1.2-2018.04.16][1], node[m4g37E_rR56fgQa0iT0aRg], [R], s[STARTED], a[id=aF_mHJp0Qf6CwS1-jiFPoA]], failure [NodeDisconnectedException[[SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [SERVER3][10.0.0.3:9300][indices:data/write/bulk[s][r]] disconnected
[2018-04-18T09:05:01,742][WARN ][o.e.c.a.s.ShardStateAction] [SERVER1] [winlogbeat-6.1.2-2018.04.16][1] received shard failed for shard id [[winlogbeat-6.1.2-2018.04.16][1]], allocation id [aF_mHJp0Qf6CwS1-jiFPoA], primary term [3], message [failed to perform indices:data/write/bulk[s] on replica [winlogbeat-6.1.2-2018.04.16][1], node[m4g37E_rR56fgQa0iT0aRg], [R], s[STARTED], a[id=aF_mHJp0Qf6CwS1-jiFPoA]], failure [NodeNotConnectedException[[SERVER3][10.0.0.3:9300] Node not connected]]
org.elasticsearch.transport.NodeNotConnectedException: [SERVER3][10.0.0.3:9300] Node not connected
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:692) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:122) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:525) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:501) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.support.replication.TransportReplicationAction.sendReplicaRequest(TransportReplicationAction.java:1188) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1152) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:171) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:155) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:122) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:358) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:298) ~[elasticsearch-6.1.2.jar:6.1.2]

OK, well, I might have to look at getting this back to a "clean" state and trying again.

Such frustrating software... great when it works, but completely hosed when it "falls over". It does not feel like a sustainable long-term solution right now.

Which Elasticsearch version are you using?

root@SERVER:~# curl -XGET 'http://localhost:9200'
{
  "name" : "SERVER",
  "cluster_name" : "servername",
  "cluster_uuid" : "QyHQj6TXTzG9ZYjC0B5ZCA",
  "version" : {
    "number" : "6.1.2",
    "build_hash" : "5b1fea5",
    "build_date" : "2018-01-10T02:35:59.208Z",
    "build_snapshot" : false,
    "lucene_version" : "7.1.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
