Second ES Instance Unable to Discover Other Nodes

ES/Kibana/Logstash v5.6.2

I have three problematic machines in a 12-node cluster. Each of these machines runs two instances of ES: one pointed at SSDs and the other at HDDs on the same machine. This is set up as a hot/warm architecture.

Prior to upgrading, and even a week after upgrading, both ES instances would happily operate on the same node. However, this morning I've been fighting with them and one instance typically refuses to join the cluster. Currently the warm instance won't join the cluster, giving the error:

[2017-10-10T14:55:47,758][INFO ][o.e.d.z.ZenDiscovery     ] [elkserver-prod-node03] failed to send join request to master [{wbu2-elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{}{}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}], reason [RemoteTransportException[[elkserver-prod-node02][][internal:discovery/zen/join]]; nested: IndexNotFoundException[no such index]; ]

and the error:

[2017-10-10T15:01:45,875][WARN ][r.suppressed             ] path: /.reporting-*/esqueue/_search, params: {index=.reporting-*, type=esqueue, version=true}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

Here, elkserver-prod-node02 is the master.

If I attempt to modify the elasticsearch.yml and comment out, I get a different error:

[2017-10-10T14:55:00,659][INFO ][o.e.d.z.ZenDiscovery     ] [elkserver-prod-node03] failed to send join request to master [{elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{}{}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}], reason [RemoteTransportException[[elkserver-prod-node02][][internal:discovery/zen/join]]; nested: ConnectTransportException[[elkserver-prod-node03][] handshake failed. unexpected remote node {elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{}{}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}]; ]

And the node still doesn't join. Any ideas?

My warm elasticsearch.yml: ELK-CLUSTER
# wbu2-elkserver-prod-node03
 node.master: false true
 node.ingest: false
 node.attr.box_type: warm
 node.attr.tag: warm
# /elasticsearch/warm/data
 path.logs: /elasticsearch/warm/logs
 bootstrap.memory_lock: true
 network.bind_host: ["wbu2-elkserver-prod-node01.mydomain","elkserver-prod-node02.mydomain","elkserver-prod-node03.mydomain","elkserver-prod-node03.mydomain:9301","elkserver-prod-node04.mydomain","elkserver-prod-node04.mydomain:9301","elkserver-prod-node05.mydomain","elkserver-prod-node06.mydomain","elkserver-prod-node07.mydomain","elkserver-prod-node08.mydomain","elkserver-prod-node10.mydomain","elkserver-prod-node11.mydomain","elkserver-prod-node11.mydomain:9301","gpuserver-prod-node02.mydomain"]
 discovery.zen.minimum_master_nodes: 2
 gateway.recover_after_nodes: 5
# false

My hot elasticsearch.yml: ELK-CLUSTER
# elkserver-prod-node03-hot
 node.master: false true
 node.ingest: false
 node.attr.box_type: hot
# /elasticsearch/hot/data/
 path.logs: /elasticsearch/warm/logs/hot/

Most concerning of all -- the master node is completely freaking out with log messages:

[2017-10-10T15:13:12,082][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [elkserver-prod-node02] [nginx-2017.08.14s][0]: failed to list shard for shard_store on node [cW3TmEVIShSGxkCA8_zRew]
org.elasticsearch.action.FailedNodeException: Failed node [cW3TmEVIShSGxkCA8_zRew]....
Caused by: no segments* file found in store(mmapfs(/elasticsearch/warm/data/nodes/0/indices/qrGQNPRBSjqWBG92jW-GgQ/0/index)): files: [recovery.AV8H6oCofOIfywKhusjN._0.dii, recovery.AV8H6oCofOIfywKhusjN._0.dim, recovery.AV8H6oCofOIfywKhusjN._0.fdx, recovery.AV8H6oCofOIfywKhusjN._0.fnm....

This is likely referencing the shards I shrank last week. Those shards are on the node03 warm instance, which is refusing to connect.
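To see which shard directories are in the state the master complains about ("no segments* file found"), something like the following read-only check could be run on the affected machine. This is a minimal sketch, assuming the on-disk layout shown in the log path above (`nodes/<n>/indices/<index-uuid>/<shard>/index`); the function name is hypothetical, and the script only lists directories, it does not repair anything.

```python
from pathlib import Path


def shards_missing_segments(data_path):
    """Return shard index directories under data_path with no segments* file.

    Assumes the layout seen in the log message:
    <data_path>/nodes/<n>/indices/<index-uuid>/<shard>/index
    """
    root = Path(data_path)
    missing = []
    for index_dir in sorted(root.glob("nodes/*/indices/*/*/index")):
        # A shard copy that was fully recovered has at least one segments_N file.
        if index_dir.is_dir() and not any(index_dir.glob("segments*")):
            missing.append(index_dir)
    return missing


if __name__ == "__main__":
    # Path taken from the warm instance's log message; adjust as needed.
    for shard_dir in shards_missing_segments("/elasticsearch/warm/data"):
        print(shard_dir)
```

Any directory it prints holds only partial recovery files, which would match the shards named in the master's warnings.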

Any ideas at all? I'm at my wits' end on this one.

I was able to redirect Kibana to another (non-dual-instance) ES node and got it running. However, one machine (elkserver-prod-node03) is still inaccessible because it refuses to join the cluster, as seen above.

Anyone? I've tried everything I can think of and can't find any resolution.

Is the actual IP address of the node where you run Elasticsearch? Do you have proper IP addresses configured on each node? If not, try setting the correct address and restarting the nodes.

is indeed the address of that node. I hadn't specified ports for the hot instance, so the hot instance was taking the default port 9200, which would stop the warm instance from starting. Apparently the warm instance would only start on port 9200. I updated the warm instance to use ports 9201 & 9301. It started up fine, but the whole cluster went down, with all nodes reporting the same issue:

...failed to send join request to master [{wbu2-elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{}{}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}], reason [RemoteTransportException[[elkserver-prod-node02][][internal:discovery/zen/join]]; nested: IndexNotFoundException[no such index]; ]
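The port clash described above (two instances on one host competing for the same default port) can be checked before starting the second instance. A minimal Python sketch, assuming the instances listen on localhost and use the defaults 9200/9300 plus the 9201/9301 pair mentioned above; the function names are illustrative, not part of any Elasticsearch tooling.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when a listener accepted the connection.
        return sock.connect_ex((host, port)) != 0


def check_ports(ports, host="127.0.0.1"):
    """Map each port to True (free) or False (already in use)."""
    return {port: port_is_free(port, host) for port in ports}


if __name__ == "__main__":
    # 9200/9300 are the default HTTP/transport ports; 9201/9301 are the
    # ports assigned to the warm instance in this thread.
    for port, free in check_ports([9200, 9300, 9201, 9301]).items():
        print(port, "free" if free else "in use")
```

If an instance is bound to a specific non-loopback address, pass that address as `host` instead.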

I updated all the nodes to: and restarted the entire cluster. It's up for now but I've no idea what's happening in the background. When I had listed the actual IP of the node in, that node would not join the cluster, with the message: ...failed to send join request to master...

Edit: something funky is going on with Zen discovery, as I shouldn't have to use the workaround. I don't know enough about the internals of ES Zen discovery to figure it out, though.

Can you post a full log with stacktraces?

Is there an Elastic resource for how to implement Stack Traces? This is a production cluster and I don't want to hinder it by implementing this improperly.

Where would I dump these traces? The discussion forum here doesn't allow for the amount of data I'd need to post.

Lastly, the issue isn't occurring anymore, so I'll likely only be able to update this thread with stack traces once I'm seeing the issue again.

I was asking for stacktraces that showed up in the log file. You posted only the first line of the stacktrace and I was simply asking for the rest, no additional implementation is required. You can dump these logs to
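For anyone else gathering the same information: the stack trace lines follow the timestamped entry in the log file and do not start with a timestamp themselves. A rough Python sketch for pulling complete WARN/ERROR entries, trace lines included, out of a log file. It assumes the line format seen in this thread (`[timestamp][LEVEL ][logger] ...`); the function name and the file path in the usage line are hypothetical.

```python
import re

# Matches the start of a log entry such as:
# [2017-10-10T15:01:45,875][WARN ][r.suppressed             ] path: ...
ENTRY_START = re.compile(
    r"^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\]\[(\w+)\s*\]"
)


def collect_entries(lines, levels=("WARN", "ERROR")):
    """Return full log entries (first line plus trace lines) for the given levels."""
    entries, current = [], None
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY_START.match(line)
        if match:
            # A new entry begins: flush the one being collected, if any.
            if current is not None:
                entries.append("\n".join(current))
            current = [line] if match.group(1) in levels else None
        elif current is not None:
            # Continuation line (exception message, "at ..." frames, "Caused by").
            current.append(line)
    if current is not None:
        entries.append("\n".join(current))
    return entries


if __name__ == "__main__":
    # Hypothetical file name inside the warm instance's path.logs directory.
    with open("/elasticsearch/warm/logs/ELK-CLUSTER.log") as handle:
        for entry in collect_entries(handle):
            print(entry, end="\n\n")
```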

Thanks for the clarification. Logs can be seen here:

This is low-priority now as all my nodes containing both hot & warm instances are up.
