Node discovery issue post upgrade to newer version of ELK stack


(VivSam) #1

I have migrated all my ELK nodes to new versions. I have 2 ES nodes with the configs below:
10.***.235
elk_es_crt_00
cluster.name: csm_elk_es_crt_00
node.name: "elkescrt00.***.net"
bootstrap.memory_lock: false
network.host: 192.168.0.1
transport.host: 127.0.0.1
http.port: 9200
http.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["elkescrt00.***.net", "elkescrt01.***.net"]

10.***.234
elk_es_crt_01
cluster.name: csm_elk_es_crt_01
node.name: "elkescrt01.***.net"
bootstrap.memory_lock: false
network.host: 192.168.0.1
transport.host: 127.0.0.1
http.port: 9200
http.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["elkescrt00.***.net", "elkescrt01.***.net"]

and my kibana.yml has the config below:
kibana.yml

server.port: 5601
server.host: "10.***.232"
elasticsearch.url: "http://elkescrt01.***.net:9200"

After migration, the ES indices in elk_es_crt_01 have been hashed, but not those in elk_es_crt_00.

Am I missing something?


(Christian Dahlqvist) #2

The configuration does not make sense to me. It looks like the nodes reside on different IP addresses, and based on discovery.zen.ping.unicast.hosts it also looks like you are expecting the nodes to form a cluster. This will not be possible, however, as the two nodes do not have the same cluster.name and transport.host only binds to localhost. What are you expecting the setup to look like? How did it work before the migration? Which version of Elasticsearch are you using?


(VivSam) #3

Yes, both nodes reside on different IPs.
This was a cluster with 2 ES nodes before migration. You are correct, I am trying to form a cluster as before.
My ES version is 5.5.1.

Now I did bind transport.host to localhost.

When I do http://elkcrt.***.com:9200/_cat/nodes

it's showing only
127.0.0.1 21 97 7 0.11 0.17 0.12 mdi * elkescrt00.***.net
and GET /_cat/nodes

127.0.0.1 11 96 1 0.00 0.11 0.20 mdi * elkescrt01.***.net


(Christian Dahlqvist) #4

In order to form a cluster the nodes need to:

  1. Have the same cluster name
  2. Be able to communicate via the transport protocol on port 9300

In your config the cluster name is different and the nodes cannot reach each other's transport port, as it binds to localhost. You will need to change both of these in order for your nodes to form a cluster.
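
The two conditions can be checked mechanically. Below is a minimal Python sketch, assuming the settings from post #1 are copied into plain dicts; the function names and the dict layout are hypothetical helpers for illustration, not an Elasticsearch API.

```python
import ipaddress


def is_loopback(host):
    """Return True if host is 'localhost', '_local_' or a loopback IP such as 127.0.0.1."""
    if host in ("localhost", "_local_"):
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (e.g. a DNS name); assume it is not loopback.
        return False


def cluster_blockers(nodes):
    """List reasons why the given nodes cannot join one cluster.

    `nodes` maps a node label to its elasticsearch.yml settings (hypothetical layout).
    """
    problems = []
    names = {cfg.get("cluster.name") for cfg in nodes.values()}
    if len(names) > 1:
        problems.append("cluster.name differs: " + ", ".join(sorted(map(str, names))))
    for node, cfg in sorted(nodes.items()):
        # transport.host takes precedence over network.host for the transport layer.
        host = cfg.get("transport.host", cfg.get("network.host", "localhost"))
        if is_loopback(host):
            problems.append(f"{node}: transport binds to loopback ({host})")
    return problems


# Settings as posted in #1, reduced to the relevant keys.
nodes = {
    "elkescrt00": {"cluster.name": "csm_elk_es_crt_00", "transport.host": "127.0.0.1"},
    "elkescrt01": {"cluster.name": "csm_elk_es_crt_01", "transport.host": "127.0.0.1"},
}
for problem in cluster_blockers(nodes):
    print(problem)
```

For the posted configs this reports three problems: one for the mismatched cluster names and one per node for the loopback transport binding.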

You should also set discovery.zen.minimum_master_nodes to 2 once you establish a cluster if both your nodes are master eligible.
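
The value 2 follows from the quorum rule given in the Elasticsearch documentation for Zen discovery: (number of master-eligible nodes / 2) + 1, using integer division. A small Python sketch of that arithmetic, where the function name is just an illustrative helper:

```python
def minimum_master_nodes(master_eligible):
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


for n in (2, 3, 5):
    print(n, "master-eligible nodes ->", minimum_master_nodes(n))
```

Note that with two master-eligible nodes and a value of 2, neither node can elect a master while the other one is down.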


(VivSam) #5

So should I mention the port number 9300 explicitly?
I changed the cluster names to be the same and I think it was a success. In the logs I got the below:

[o.e.c.s.ClusterService   ] [elkescrt00.***.net] new_master {elkescrt00.***.net}{qgbCeUyrQtSkrL9ciZcnlQ}{6o3tCU2tRg6QV95dnJrwBA}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2017-08-22T04:55:17,735][INFO ][o.e.h.n.Netty4HttpServerTransport] [elkescrt00.***.net] publish_address {10.***.235:9200}, bound_addresses {10.***.235:9200}

But the problem is that when I check the nodes using

http://elkcrt.***.com:9200/_cat/nodes
it is giving me
127.0.0.1 72 98 8 0.42 0.99 0.99 mdi * elkescrt01.***.net

only 1 node.
I don't understand.


(Christian Dahlqvist) #6

Your node is still binding to 127.0.0.1, which will not work. You need to bind to an IP address (10.***.234 and 10.***.235) that can be reached from the other node.
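
One way to confirm reachability is to try a plain TCP connection to the other node's transport port from each host. This is a minimal Python sketch, assuming the default transport port 9300 mentioned earlier in the thread; `transport_reachable` is a hypothetical helper name, and a successful connection only shows that something is listening there, not that the nodes will join.

```python
import socket


def transport_reachable(host, port=9300, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run it on one node with the other node's address as `host`, then repeat in the opposite direction. While a node binds its transport to 127.0.0.1, the call from the other host is expected to return False.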


(VivSam) #7

Thanks. So it should be

network.host: 10.***.235 ?

If so, I am getting the error

elasticsearch dead but subsys locked


(VivSam) #8

I have been getting this error.
I am not able to work out what it is.

I changed my ES config to bind the transport port to the host IP, but the ES service dies every time.

10.***.235
elk_es_crt_00
cluster.name: csm_elk_es_crt_01
node.name: "elkescrt00.***.net"
bootstrap.memory_lock: true
http.port: 9200
discovery.zen.ping.unicast.hosts: ["elkescrt00.***.net", "elkescrt01.***.net"]

10.***.234
elk_es_crt_01
cluster.name: csm_elk_es_crt_01
node.name: "elkescrt01.***.net"
bootstrap.memory_lock: true
http.port: 9200
discovery.zen.ping.unicast.hosts: ["elkescrt00.***.net", "elkescrt01.***.net"]

Error in logs

[2017-08-23T05:07:17,732][INFO ][o.e.t.TransportService   ] [elkescrt00.***.net] publish_address {10.162.119.235:9300}, bound_addresses {10.162.119.235:9300}
[2017-08-23T05:07:17,746][INFO ][o.e.b.BootstrapChecks    ] [elkescrt00.***.net] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-08-23T05:07:17,748][ERROR][o.e.b.Bootstrap          ] [elkescrt00.***.net] node validation exception
[2] bootstrap checks failed
[1]: max number of threads [1024] for user [elasticsearch] is too low, increase to at least [2048]
[2]: system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk
[2017-08-23T05:07:17,751][INFO ][o.e.n.Node               ] [elkescrt00.***.net] stopping ...
[2017-08-23T05:07:17,797][INFO ][o.e.n.Node               ] [elkescrt00.***.net] stopped
[2017-08-23T05:07:17,797][INFO ][o.e.n.Node               ] [elkescrt00.***.net] closing ...
[2017-08-23T05:07:17,809][INFO ][o.e.n.Node               ] [elkescrt00.***.net] closed

I really need to fix this soon.
Can somebody help?


(Christian Dahlqvist) #9

In addition to what you have here, you need to set network.host to a non-loopback interface, e.g. the IP of the host. Once you do this, Elasticsearch will determine that you are no longer running a single instance locally for development, and will add additional checks to ensure the deployment is sound and follows best practices. There is system configuration you will need to pay attention to, and also a number of bootstrap checks that will need to pass. Information about any checks that fail will be written to the log.


(VivSam) #10

Thanks for the quick reply.
I have the same indices on both nodes, and the same logs are being written to both nodes.
How do I know I have a cluster for my 2-node ES setup?
Is this the command to find the number of ES nodes in my cluster?

GET /_nodes


(Christian Dahlqvist) #11

The cat nodes API will show the nodes in the cluster.
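
A formed two-node cluster returns both nodes in a single `GET /_cat/nodes` response, with exactly one of them marked `*` as the elected master. The following Python sketch parses that plain-text response, assuming the default column order seen in the output earlier in this thread; `parse_cat_nodes` is a hypothetical helper name.

```python
def parse_cat_nodes(text):
    """Parse the plain-text body of GET /_cat/nodes into a list of dicts.

    Assumes the default column order of the output shown in this thread:
    ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
    """
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or unexpected lines
        nodes.append({
            "ip": fields[0],
            "role": fields[7],
            "is_master": fields[8] == "*",
            "name": fields[9],
        })
    return nodes


# Response body copied from post #3: a single line, so a single-node cluster.
sample = "127.0.0.1 21 97 7 0.11 0.17 0.12 mdi * elkescrt00.***.net"
nodes = parse_cat_nodes(sample)
print(len(nodes), "node(s):", [n["name"] for n in nodes])
```

If each node answers with only itself and each is marked as master, the nodes are running as two separate one-node clusters.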


(VivSam) #12

The problem is that as soon as I add a non-loopback interface, for example
network.host: 10.162.119.235

and restart the ES service it gives me
elasticsearch dead but subsys locked
error.


(Christian Dahlqvist) #13

What is in the Elasticsearch logs?


(VivSam) #14

Here are the logs from one of the 2 nodes [elkescrt01.***.net, elkescrt00.***.net]. It's the same log on both except for the host name elkescrt01.***.net.

[2017-08-23T06:19:06,719][INFO ][o.e.n.Node               ] [elkescrt01.***.net] initializing ...
[2017-08-23T06:19:06,988][INFO ][o.e.e.NodeEnvironment    ] [elkescrt01.***.net] using [1] data paths, mounts [[/var/lib/elasticsearch (/dev/mapper/elasticsearch-elasticsearch--data)]], net usable_space [1003gb], net total_space [1.1tb], spins? [possibly], types [ext4]
[2017-08-23T06:19:06,988][INFO ][o.e.e.NodeEnvironment    ] [elkescrt01.***.net] heap size [5.9gb], compressed ordinary object pointers [true]
[2017-08-23T06:19:17,348][INFO ][o.e.n.Node               ] [elkescrt01.***.net] node name [elkescrt01.***.net], node ID [Gn5tyVUAQKmetNeJGyjSFg]
[2017-08-23T06:19:17,348][INFO ][o.e.n.Node               ] [elkescrt01.***.net] version[5.5.1], pid[5210], build[19c13d0/2017-07-18T20:44:24.823Z], OS[Linux/2.6.32-573.18.1.el6.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_71/25.71-b15]
[2017-08-23T06:19:17,354][INFO ][o.e.n.Node               ] [elkescrt01.***.net] JVM arguments [-Xms6g, -Xmx6g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.path.home=/usr/share/elasticsearch]
[2017-08-23T06:19:19,299][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [aggs-matrix-stats]
[2017-08-23T06:19:19,299][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [ingest-common]
[2017-08-23T06:19:19,299][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [lang-expression]
[2017-08-23T06:19:19,300][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [lang-groovy]
[2017-08-23T06:19:19,300][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [lang-mustache]
[2017-08-23T06:19:19,300][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [lang-painless]
[2017-08-23T06:19:19,300][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [parent-join]
[2017-08-23T06:19:19,300][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [percolator]
[2017-08-23T06:19:19,300][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [reindex]
[2017-08-23T06:19:19,300][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [transport-netty3]
[2017-08-23T06:19:19,300][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] loaded module [transport-netty4]
[2017-08-23T06:19:19,301][INFO ][o.e.p.PluginsService     ] [elkescrt01.***.net] no plugins loaded
[2017-08-23T06:19:22,653][INFO ][o.e.d.DiscoveryModule    ] [elkescrt01.***.net] using discovery type [zen]
[2017-08-23T06:19:27,048][INFO ][o.e.n.Node               ] [elkescrt01.***.net] initialized
[2017-08-23T06:19:27,048][INFO ][o.e.n.Node               ] [elkescrt01.***.net] starting ...
[2017-08-23T07:16:43,879][INFO ][o.e.t.TransportService   ] [elkescrt01.***.net] publish_address {10.162.119.234:9300}, bound_addresses {10.162.119.234:9300}
[2017-08-23T07:16:43,908][INFO ][o.e.b.BootstrapChecks    ] [elkescrt01.***.net] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-08-23T07:16:43,911][ERROR][o.e.b.Bootstrap          ] [elkescrt01.***.net] node validation exception
[1] bootstrap checks failed
[1]: max number of threads [1024] for user [elasticsearch] is too low, increase to at least [2048]
[2017-08-23T07:16:43,915][INFO ][o.e.n.Node               ] [elkescrt01.***.net] stopping ...
[2017-08-23T07:16:43,974][INFO ][o.e.n.Node               ] [elkescrt01.***.net] stopped
[2017-08-23T07:16:43,974][INFO ][o.e.n.Node               ] [elkescrt01.***.net] closing ...
[2017-08-23T07:16:43,991][INFO ][o.e.n.Node               ] [elkescrt01.***.net] closed

(Christian Dahlqvist) #15

As you can see from the logs, the bootstrap checks are not passing, preventing the node from starting up. Go through the full list of bootstrap checks and make sure they are all correctly configured.


(VivSam) #16

I have done this, @Christian_Dahlqvist, and set ulimit -u to 2048 as the root user.
But this message keeps popping up.


(Christian Dahlqvist) #17

If you have installed from a package, Elasticsearch should be running as the elasticsearch user, not root.
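
The failed check in the log names user [elasticsearch], so the limit that counts is the one seen by the process running as that user, not the one set in a root shell. Here is a minimal Python sketch for reading that limit, assuming a Linux host and that the script is run as the elasticsearch user (for example through `sudo -u elasticsearch`, if sudo is available). The 2048 threshold is taken from the log message, and `nproc_ok` is a hypothetical helper name.

```python
import resource


def nproc_ok(soft_limit, required=2048):
    """Return True if the soft process/thread limit satisfies the bootstrap check."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required


# RLIMIT_NPROC is the limit that `ulimit -u` reports; it is read for the
# process running this script, so the result depends on which user runs it.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("soft:", soft, "hard:", hard, "ok:", nproc_ok(soft))
```

If this prints a soft limit of 1024 for the elasticsearch user, the change made as root has not reached the account the service runs under.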


(VivSam) #18

Can't thank you enough :slight_smile: :smiley:

