Second ES Instance Unable to Discover Other Nodes

seth.yes · October 10, 2017, 9:16pm

ES/Kibana/Logstash v5.6.2

I have three problematic machines in a 12 node cluster. These machines run two instances of ES, one is pointed to SSDs, and the other is pointed towards HDDs on the same machine. This is setup in a hot/warm architecture.

Prior to upgrading, and even a week after upgrading, both ES instances would happily operate on the same node. However, this morning I've been fighting with them and one instance typically refuses to join the cluster. Currently the warm instance won't join the cluster, giving the error:

[2017-10-10T14:55:47,758][INFO ][o.e.d.z.ZenDiscovery     ] [elkserver-prod-node03] failed to send join request to master [{wbu2-elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{10.191.4.62}{10.191.4.62:9300}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}], reason [RemoteTransportException[[elkserver-prod-node02][10.191.4.62:9300][internal:discovery/zen/join]]; nested: IndexNotFoundException[no such index]; ]

and the error:

[2017-10-10T15:01:45,875][WARN ][r.suppressed             ] path: /.reporting-*/esqueue/_search, params: {index=.reporting-*, type=esqueue, version=true}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

here, elkserver-prod-node02 is the master.

If I attempt to modify the elasticsearch.yml and comment out network.host: 10.191.5.42, I get a different error:

[2017-10-10T14:55:00,659][INFO ][o.e.d.z.ZenDiscovery     ] [elkserver-prod-node03] failed to send join request to master [{elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{10.191.4.62}{10.191.4.62:9300}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}], reason [RemoteTransportException[[elkserver-prod-node02][10.191.4.62:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[elkserver-prod-node03][127.0.0.1:9300] handshake failed. unexpected remote node {elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{10.191.4.62}{10.191.4.62:9300}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}]; ]

And the node still doesn't join.. Any ideas?

My warm elasticsearch.yml:

 cluster.name: ELK-CLUSTER
#
 node.name: wbu2-elkserver-prod-node03
#
 node.master: false
 node.data: true
 node.ingest: false
#
 node.attr.box_type: warm
 node.attr.tag: warm
#
 path.data: /elasticsearch/warm/data
 path.logs: /elasticsearch/warm/logs
# 
 bootstrap.memory_lock: true
#
 network.host: 10.191.5.42
 network.bind_host: 0.0.0.0
 discovery.zen.ping.unicast.hosts: ["wbu2-elkserver-prod-node01.mydomain","elkserver-prod-node02.mydomain","elkserver-prod-node03.mydomain","elkserver-prod-node03.mydomain:9301","elkserver-prod-node04.mydomain","elkserver-prod-node04.mydomain:9301","elkserver-prod-node05.mydomain","elkserver-prod-node06.mydomain","elkserver-prod-node07.mydomain","elkserver-prod-node08.mydomain","elkserver-prod-node10.mydomain","elkserver-prod-node11.mydomain","elkserver-prod-node11.mydomain:9301","gpuserver-prod-node02.mydomain"]
#
 discovery.zen.minimum_master_nodes: 2
 gateway.recover_after_nodes: 5
#
 xpack.security.enabled: false

My hot elasticsearch.yml:

 cluster.name: ELK-CLUSTER
#
 node.name: elkserver-prod-node03-hot
#
 node.master: false
 node.data: true
 node.ingest: false
#
 node.attr.box_type: hot
#
 path.data: /elasticsearch/hot/data/
 path.logs: /elasticsearch/warm/logs/hot/
#
 network.host: 10.191.5.42
 network.bind_host: 0.0.0.0

Most concerning of all -- the master node is completely freaking out with log messages:

[2017-10-10T15:13:12,082][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [elkserver-prod-node02] [nginx-2017.08.14s][0]: failed to list shard for shard_store on node [cW3TmEVIShSGxkCA8_zRew]
org.elasticsearch.action.FailedNodeException: Failed node [cW3TmEVIShSGxkCA8_zRew]....
.
.
.
Caused by: java.io.FileNotFoundException: no segments* file found in store(mmapfs(/elasticsearch/warm/data/nodes/0/indices/qrGQNPRBSjqWBG92jW-GgQ/0/index)): files: [recovery.AV8H6oCofOIfywKhusjN._0.dii, recovery.AV8H6oCofOIfywKhusjN._0.dim, recovery.AV8H6oCofOIfywKhusjN._0.fdx, recovery.AV8H6oCofOIfywKhusjN._0.fnm....

Which is likely referencing the shrunk shards, which I shrunk last week. Those shards are on node03 warm, which is refusing to connect.

Any ideas at all? I'm at my wits-end on this one.

seth.yes · October 10, 2017, 9:53pm

I was able to redirect Kibana to another (non dual-instance) ES node and got it running, however one machine (elkserver-prod-node03) is still inaccessible as it refuses to join the cluster, as seen above.

seth.yes · October 27, 2017, 8:50pm

Anyone? I've tried everything I can think of and can't find any resolution..

Igor_Motov · October 27, 2017, 10:36pm

Is 10.191.5.42 the actual ip address of the the node where you run elasticsearch? Do you have proper IP addresses configured on each node. If not, try setting the correct address and restarting the nodes.

seth.yes · October 30, 2017, 5:32pm

10.191.5.42 is indeed the address of that node. I hadn't specified specific ports for the hot instance, so the hot instance was taking the default port 9200, which would stop the warm instance from starting.. apparently the warm instance would only start on port 9200. I updated the warm instance to use ports 9201 & 9301.. it started up fine but the whole cluster went down with all nodes reporting the same issue:

...failed to send join request to master [{wbu2-elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{10.191.4.62}{10.191.4.62:9300}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}], reason [RemoteTransportException[[elkserver-prod-node02][10.191.4.62:9300][internal:discovery/zen/join]]; nested: IndexNotFoundException[no such index]; ]

I updated all the nodes to: network.host: 0.0.0.0 and restarted the entire cluster. It's up for now but I've no idea what's happening in the background. When I had listed the actual IP of the node in network.host, that node would not join the cluster, with the message: ...failed to send join request to master...

e: something funky is going on with the zen discovery, as I shouldn't have to do the network.host workaround. I don't know enough about the internals of ES Zen discovery to figure it out though.

Igor_Motov · October 30, 2017, 5:48pm

Can you post a full log with stacktraces?

seth.yes · October 30, 2017, 6:52pm

Is there an Elastic resource for how to implement Stack Traces? This is a production cluster and I don't want to hinder it by implementing this improperly.

Where would I dump these traces? The discussion forum here doesn't allow for the amount of data I'd need to post.

Lastly, the issue isn't occurring anymore, so I'll likely only be able to update this thread with stack traces once I'm seeing the issue again.

Igor_Motov · October 30, 2017, 6:59pm

I was asking for stacktraces that showed up in the log file. You posted only the first line of the stacktrace and I was simply asking for the rest, no additional implementation is required. You can dump these logs to http://gist.github.com/gists

seth.yes · October 30, 2017, 7:34pm

Thanks for the clarification. Logs can be seen here:

gist.github.com

https://gist.github.com/sethyes/2415eb29d767e75b7bd47a42233e5e7e.js

elasticsearch.log

[2017-10-30T00:00:03,247][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: RemoteTransportException[[wbu2-elkserver-prod-node02][10.191.4.62:9300][indices:admin/create]]; nested: IndexNotFoundException[no such index];
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:130) ~[?:?]
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_111]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_111]
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_111]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_111]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_111]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_111]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_111]

This file has been truncated. show original

seth.yes · November 3, 2017, 4:55pm

This is low-priority now as all my nodes containing both hot & warm instances are up.

system · December 1, 2017, 4:55pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch 6.2.4 nodes can’t discover each other in AWS Elasticsearch	2	1006	July 6, 2018
Nodes not discovering in ELK 7.0 Elasticsearch	3	327	July 12, 2019
ES 2.3.3 node cannot join the master node Elasticsearch	5	2139	July 5, 2017
Master election problem in 3 node cluster when one died Elasticsearch	7	2571	July 5, 2017
Data nodes are not able to join master, failed to send join request to master Elasticsearch	2	879	February 25, 2019

Second ES Instance Unable to Discover Other Nodes

Related topics