ES/Kibana/Logstash v5.6.2
I have three problematic machines in a 12-node cluster. Each of these machines runs two instances of ES: one pointed at SSDs and the other at HDDs on the same machine, set up in a hot/warm architecture.
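To be explicit about how the two instances coexist on one host: the key differences are the data paths and the box_type attribute, sketched minimally below. I don't pin transport ports in the real configs (posted further down), so the second instance to start presumably picks up port 9301, which is why the unicast host list references :9301 entries.
# hot instance (SSD-backed)
node.attr.box_type: hot
path.data: /elasticsearch/hot/data
#
# warm instance (HDD-backed)
node.attr.box_type: warm
path.data: /elasticsearch/warm/data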
Prior to upgrading, and even for a week after upgrading, both ES instances would happily operate on the same node. However, this morning I've been fighting with them, and one instance typically refuses to join the cluster. Currently the warm instance won't join the cluster, giving this error:
[2017-10-10T14:55:47,758][INFO ][o.e.d.z.ZenDiscovery ] [elkserver-prod-node03] failed to send join request to master [{wbu2-elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{10.191.4.62}{10.191.4.62:9300}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}], reason [RemoteTransportException[[elkserver-prod-node02][10.191.4.62:9300][internal:discovery/zen/join]]; nested: IndexNotFoundException[no such index]; ]
and this error:
[2017-10-10T15:01:45,875][WARN ][r.suppressed ] path: /.reporting-*/esqueue/_search, params: {index=.reporting-*, type=esqueue, version=true}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
Here, elkserver-prod-node02 is the master.
If I modify the elasticsearch.yml and comment out network.host: 10.191.5.42, I get a different error:
[2017-10-10T14:55:00,659][INFO ][o.e.d.z.ZenDiscovery ] [elkserver-prod-node03] failed to send join request to master [{elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{10.191.4.62}{10.191.4.62:9300}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}], reason [RemoteTransportException[[elkserver-prod-node02][10.191.4.62:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[elkserver-prod-node03][127.0.0.1:9300] handshake failed. unexpected remote node {elkserver-prod-node02}{7mM4RuPsTXOoafvHNfPHWA}{lPgPb2yrR6m5klEgksa0yQ}{10.191.4.62}{10.191.4.62:9300}{ml.max_open_jobs=10, box_type=warm, ml.enabled=true, tag=warm}]; ]
And the node still doesn't join. Any ideas?
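For reference, the relevant network lines of the warm config looked like this during that test (everything else in the file, posted below, was left unchanged). I'm assuming the 127.0.0.1 in the handshake error comes from the instance publishing a loopback address once network.host is unset:
# network.host: 10.191.5.42
network.bind_host: 0.0.0.0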
My warm elasticsearch.yml:
cluster.name: ELK-CLUSTER
#
node.name: wbu2-elkserver-prod-node03
#
node.master: false
node.data: true
node.ingest: false
#
node.attr.box_type: warm
node.attr.tag: warm
#
path.data: /elasticsearch/warm/data
path.logs: /elasticsearch/warm/logs
#
bootstrap.memory_lock: true
#
network.host: 10.191.5.42
network.bind_host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["wbu2-elkserver-prod-node01.mydomain","elkserver-prod-node02.mydomain","elkserver-prod-node03.mydomain","elkserver-prod-node03.mydomain:9301","elkserver-prod-node04.mydomain","elkserver-prod-node04.mydomain:9301","elkserver-prod-node05.mydomain","elkserver-prod-node06.mydomain","elkserver-prod-node07.mydomain","elkserver-prod-node08.mydomain","elkserver-prod-node10.mydomain","elkserver-prod-node11.mydomain","elkserver-prod-node11.mydomain:9301","gpuserver-prod-node02.mydomain"]
#
discovery.zen.minimum_master_nodes: 2
gateway.recover_after_nodes: 5
#
xpack.security.enabled: false
My hot elasticsearch.yml:
cluster.name: ELK-CLUSTER
#
node.name: elkserver-prod-node03-hot
#
node.master: false
node.data: true
node.ingest: false
#
node.attr.box_type: hot
#
path.data: /elasticsearch/hot/data/
path.logs: /elasticsearch/warm/logs/hot/
#
network.host: 10.191.5.42
network.bind_host: 0.0.0.0
Most concerning of all, the master node is completely freaking out with log messages like:
[2017-10-10T15:13:12,082][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [elkserver-prod-node02] [nginx-2017.08.14s][0]: failed to list shard for shard_store on node [cW3TmEVIShSGxkCA8_zRew]
org.elasticsearch.action.FailedNodeException: Failed node [cW3TmEVIShSGxkCA8_zRew]....
.
.
.
Caused by: java.io.FileNotFoundException: no segments* file found in store(mmapfs(/elasticsearch/warm/data/nodes/0/indices/qrGQNPRBSjqWBG92jW-GgQ/0/index)): files: [recovery.AV8H6oCofOIfywKhusjN._0.dii, recovery.AV8H6oCofOIfywKhusjN._0.dim, recovery.AV8H6oCofOIfywKhusjN._0.fdx, recovery.AV8H6oCofOIfywKhusjN._0.fnm....
That is likely referencing the shards I shrank last week. Those shards are on the node03 warm instance, which is the one refusing to connect.
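For completeness, the shrink last week was done roughly along these lines (the index names here are placeholders, not the real ones). The allocation-filter step is what parked every copy of those shards on the node03 warm instance:
PUT /source-index/_settings
{
  "index.routing.allocation.require._name": "wbu2-elkserver-prod-node03",
  "index.blocks.write": true
}
POST /source-index/_shrink/shrunk-index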
Any ideas at all? I'm at my wits' end on this one.