ELK 5.4 data node is failing to join the cluster, it is stuck at discovery phase

chakri276 · May 9, 2018, 3:53pm

I am using Elasticsearch 5.4.
I have 3 master nodes and 8 data nodes. The Elasticsearch process on one data node stopped abruptly because its supervisord went down.
I now have 111 unassigned shards, and the allocation explanation gives this reason:
"can_allocate" : "no_valid_shard_copy",
"allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",

I think the node that stopped holds the primary shards. That node is failing to join the cluster. Its log shows the following:

[2018-05-09T11:39:55,759][INFO ][o.e.n.Node ] [ah-1007168-003] initializing ...
[2018-05-09T11:39:55,812][INFO ][o.e.e.NodeEnvironment ] [ah-1007168-003] using [1] data paths, mounts [[/apps/cie_dashboard (/dev/mapper/volgrp02-appscie_dashboard)]], net usable_space [529.1gb], net total_space [899.5gb], spins? [possibly], types [xfs]
[2018-05-09T11:39:55,813][INFO ][o.e.e.NodeEnvironment ] [ah-1007168-003] heap size [23.9gb], compressed ordinary object pointers [true]
[2018-05-09T11:39:56,141][INFO ][o.e.n.Node ] [ah-1007168-003] node name [ah-1007168-003], node ID [iUAPJ0glTV2fP5_isJE4Jg]
[2018-05-09T11:39:56,141][INFO ][o.e.n.Node ] [ah-1007168-003] version[5.4.0], pid[6077], build[780f8c4/2017-04-28T17:43:27.229Z], OS[Linux/3.10.0-693.17.1.el7.x86_64/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_72/25.72-b15]
[2018-05-09T11:39:56,914][INFO ][o.e.p.PluginsService ] [ah-1007168-003] loaded module [aggs-matrix-stats]
[2018-05-09T11:39:56,914][INFO ][o.e.p.PluginsService ] [ah-1007168-003] loaded module [ingest-common]
[2018-05-09T11:39:56,915][INFO ][o.e.p.PluginsService ] [ah-1007168-003] loaded module [lang-expression]
[2018-05-09T11:39:56,915][INFO ][o.e.p.PluginsService ] [ah-1007168-003] loaded module [lang-groovy]
[2018-05-09T11:39:56,915][INFO ][o.e.p.PluginsService ] [ah-1007168-003] loaded module [lang-mustache]
[2018-05-09T11:39:56,915][INFO ][o.e.p.PluginsService ] [ah-1007168-003] loaded module [lang-painless]
[2018-05-09T11:39:56,915][INFO ][o.e.p.PluginsService ] [ah-1007168-003] loaded module [percolator]
[2018-05-09T11:39:56,915][INFO ][o.e.p.PluginsService ] [ah-1007168-003] loaded module [reindex]
[2018-05-09T11:39:56,915][INFO ][o.e.p.PluginsService ] [ah-1007168-003] loaded module [transport-netty3]
[2018-05-09T11:39:56,915][INFO ][o.e.p.PluginsService ] [ah-1007168-003] loaded module [transport-netty4]
[2018-05-09T11:39:56,916][INFO ][o.e.p.PluginsService ] [ah-1007168-003] no plugins loaded
[2018-05-09T11:39:58,062][INFO ][o.e.d.DiscoveryModule ] [ah-1007168-003] using discovery type [zen]
[2018-05-09T11:39:59,561][INFO ][o.e.n.Node ] [ah-1007168-003] initialized
[2018-05-09T11:39:59,561][INFO ][o.e.n.Node ] [ah-1007168-003] starting ...
[2018-05-09T11:39:59,677][INFO ][o.e.t.TransportService ] [ah-1007168-003] publish_address {171.135.145.72:9300}, bound_addresses {0.0.0.0:9300}
[2018-05-09T11:39:59,683][INFO ][o.e.b.BootstrapChecks ] [ah-1007168-003] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2018-05-09T11:39:59,687][ERROR][o.e.b.Bootstrap ] [ah-1007168-003] node validation exception
bootstrap checks failed
max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
[2018-05-09T11:39:59,690][INFO ][o.e.n.Node ] [ah-1007168-003] stopping ...
[2018-05-09T11:39:59,713][INFO ][o.e.n.Node ] [ah-1007168-003] stopped
[2018-05-09T11:39:59,713][INFO ][o.e.n.Node ] [ah-1007168-003] closing ...
[2018-05-09T11:39:59,721][INFO ][o.e.n.Node ] [ah-1007168-003] closed
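
The line that stops the node in the log above is the file-descriptor bootstrap check, not discovery. As a rough sketch, the two numbers in that message (the limit the process actually saw and the minimum it requires) can be pulled out of a log file like this. The function name `parse_fd_check` is made up for illustration, and the pattern only assumes the message wording shown in this log.

```python
import re

# Matches the file-descriptor bootstrap check message as worded in the log above.
FD_CHECK = re.compile(
    r"max file descriptors \[(\d+)\] for elasticsearch process is too low, "
    r"increase to at least \[(\d+)\]"
)


def parse_fd_check(log_text):
    """Return (current_limit, required_limit), or None if the message is absent."""
    match = FD_CHECK.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The message text copied from the log in this post.
line = (
    "max file descriptors [4096] for elasticsearch process is too low, "
    "increase to at least [65536]"
)
current, required = parse_fd_check(line)
print(f"process sees {current}, needs {required}")
```

Here the process sees 4096, which matches neither of the values configured for the users in limits.conf, so the limit in effect for the Elasticsearch process is coming from somewhere else.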

My /etc/security/limits.conf already contains:
root - nofile 100000
ahselkpd - nofile 82919
ahselkpd - memlock unlimited # #unlimited memory lock for elk_user

ulimit -n for ahselkpd returns 82919.
ulimit -n for root returns 100000.

vm.max_map_count=262144 is also set globally in /etc/sysctl.conf.
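
The ulimit values above come from a login shell, while the log reports 4096 for the Elasticsearch process itself. /etc/security/limits.conf is applied by pam_limits to login sessions, so a process launched by a daemon may run with different limits. On Linux, /proc/<pid>/limits shows what a running process actually has. The sketch below parses that file; the helper names and the sample text are illustrative assumptions, and 65536 is the minimum quoted in the log.

```python
from pathlib import Path

REQUIRED_NOFILE = 65536  # minimum quoted by the bootstrap check in the log


def parse_open_files(limits_text):
    """Return (soft, hard) for the 'Max open files' row; 'unlimited' becomes None."""
    for row in limits_text.splitlines():
        if row.startswith("Max open files"):
            soft, hard = row.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def read_open_files(pid):
    """Read the limits in effect for a running process (Linux only)."""
    return parse_open_files(Path(f"/proc/{pid}/limits").read_text())


def is_sufficient(soft):
    """True if the soft limit meets the bootstrap check minimum."""
    return soft is None or soft >= REQUIRED_NOFILE


# Hypothetical excerpt in the /proc/<pid>/limits layout, using the value from the log.
sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            4096                 4096                 files\n"
)
soft, hard = parse_open_files(sample)
print(soft, hard, is_sufficient(soft))
```

Because the Elasticsearch process exits right after the failed check, `read_open_files` is probably more useful when pointed at the PID of the process that launches it, since child processes inherit their parent's limits unless something changes them.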

How do I resolve this issue?

