Data nodes failed to send join request to master


I have an Elasticsearch cluster that has been running with no problems for the past year or so. We are on version 5.4, running in an OpenStack environment.

We had an outage, and it looks like some nodes didn't shut down properly.
After I restarted all ES instances, starting with the master nodes, then data, and finally client nodes, my data nodes are failing to join the cluster.

On the data nodes:

[2018-04-18T17:57:32,857][WARN ][o.e.n.Node ] [esdata-11] timed out while waiting for initial discovery state - timeout: 30s


[2018-04-18T17:25:10,267][INFO ][o.e.d.z.ZenDiscovery ] [esdata-11] failed to send join request to master [{master-1}{-0ZXpK0VRV6yf8QdTwewPg}{xayChL4oQB6etcAwguIfJQ}{}{}{ml.enabled=true}], reason [RemoteTransportException[[master-1][][internal:discovery/zen/join]]; nested: ConnectTransportException[[esdata-11][] connect_timeout[30s]]; nested: IOException[connection timed out:]; ]

On the Master node:

[2018-04-18T10:51:16,477][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-2-2018.04.18][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-2-2018.04.18][0]] containing [5] requests]]


[2018-04-18T13:24:49,315][WARN ][o.e.d.z.PublishClusterStateAction] [master-1] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{master-2}{0aacWZoiTNaNFxSIn0sETg}{L-jh0PFDRzaABYjcQLXmwA}{}{}{ml.enabled=true}])

[2018-04-18T14:05:04,037][WARN ][o.e.x.m.MonitoringService] [master-2] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk

I have looked at as many resources as possible, and each one offered a different solution, which in most cases didn't apply to me.

The weird thing is that the data nodes can see the masters changing, but they still fail to join!

Any idea what's wrong here? Is it OpenStack network related, or more ES related?

I was able to resolve my problem and get my cluster to start recovering by turning off iptables. More generally, iptables needs to be configured so the service can use the ports it needs, i.e. 9300 (transport) and 9200 (HTTP).
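For anyone hitting the same `ConnectTransportException`, a quick TCP check run from each data node can confirm whether the master's ports are actually reachable before you start digging into iptables rules. This is a minimal sketch; the `127.0.0.1` address is a placeholder for your master node's address:

```python
# Minimal TCP reachability check for the Elasticsearch transport (9300)
# and HTTP (9200) ports. The host below is a placeholder -- substitute
# the address of the node you are trying to reach.
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

if __name__ == "__main__":
    # Run from a data node against a master node's address (placeholder).
    for port in (9200, 9300):
        state = "open" if port_open("127.0.0.1", port) else "blocked/closed"
        print(f"port {port}: {state}")
```

If 9300 shows as blocked while the node is up, the firewall (iptables/security groups in OpenStack) is the likely culprit rather than Elasticsearch itself.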

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.