Elasticsearch loses its Master every few minutes

Michel_Laporte · October 5, 2015, 12:53pm

Hi,

I have Graylog working with ES.

I have 2 nodes: one is the elected master, and the other (the slave) is master-eligible. I have installed KOPF on each one.

However, after ES has been running for a few days, I cannot load the KOPF information in the web UI. After 15 minutes it reappears and the cluster goes back to green. About 15 minutes later the same thing happens: either KOPF cannot load information ( http://i.imgur.com/QghoQnM.jpg ) or the cluster goes red or yellow, and the error I get is here ( http://i.imgur.com/hrEt6ik.jpg ).

At first I thought it was RAM, but I have upgraded the RAM on the VM and increased the heap size. If I restart ES on the master, everything goes back to normal for a few days until the problem appears again. While it is happening, I'm unable to query the ES index (using Graylog) until I restart ES; then I'm able to search my logs again. What could be going wrong? The main error I always get is "No active Master, switching to basic mode". I keep losing my master.

Thanks,
Michel

msimos · October 5, 2015, 6:16pm

Hi,

You need to check the Elasticsearch log file on each node to see what's happening. If a node doesn't respond within 90 seconds for some reason (3 attempts at 30 seconds each), the node is removed from the cluster. So something is happening, but you need to look in the Elasticsearch log file on each node for any additional errors.

Michel_Laporte · October 6, 2015, 9:38am

Hi Mike,

Thanks for taking the time to reply.
I currently have 2 main errors that keep popping up.

[2015-10-05 11:36:44,406][WARN ][discovery.zen.publish ] [ess-lon-gray-003_master] timed out waiting for all nodes to process published state [8625] (timeout [30s], pending nodes: [[graylog2-server][IuYwswm-QGCfMoIIYEGyMA][ess-lon-gray-002][inet[/192.168.32.70:9350]]{client=true, data=false, master=false}, [ess-lon-gray-002_slave][Gy1EMkaTSgilwenuqw5o7Q][ess-lon-gray-002][inet[/192.168.32.70:9300]]])

&

[2015-10-05 11:47:54,495][INFO ][cluster.service ] [ess-lon-gray-003_master] removed {[graylog2-server][IuYwswm-QGCfMoIIYEGyMA][ess-lon-gray-002][inet[/192.168.32.70:9350]]{client=true, data=false, master=false},}, reason: zen-disco-node_failed([graylog2-server][IuYwswm-QGCfMoIIYEGyMA][ess-lon-gray-002][inet[/192.168.32.70:9350]]{client=true, data=false, master=false}), reason transport disconnected

The second one states Transport Disconnected?

For some reason the Master is being kicked out of the ES Cluster.

Is there anything else I'm missing? What could be causing this?

Thanks

msimos · October 6, 2015, 9:03pm

Hi,

Unfortunately it's hard to say what the issue is, since the log only says that the node graylog2-server timed out. According to the logs, ess-lon-gray-003_master removed the graylog2-server node on ess-lon-gray-002 from the cluster, so it sounds like the connection was dropped by ess-lon-gray-002. You should check the log file on both nodes around these times. I'd also rule out any network issues. And if the data node goes into a long old-generation GC you can see issues like this, because Java pauses the JVM to do garbage collection.
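
As a rough aid for comparing the log files on both nodes around the same times, the sketch below pulls out lines that look related to cluster membership. It assumes the `[timestamp][LEVEL][logger] [node] message` layout seen in the log lines quoted in this thread. The function name `find_cluster_events` and the keyword shortlist are illustrative guesses, not an official list of messages.

```python
import re

# Layout assumed from the log lines quoted in this thread:
# [timestamp][LEVEL ][logger ] [node_name] message
# The leading "[" is optional so that lines pasted without it still match.
LINE_RE = re.compile(
    r"^\[?(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[(?P<level>[A-Z]+)\s*\]"
    r"\[(?P<logger>[^\]]+?)\s*\]\s+"
    r"\[(?P<node>[^\]]+)\]\s+"
    r"(?P<msg>.*)$"
)

# Hypothetical shortlist of substrings hinting at membership trouble.
KEYWORDS = ("timed out", "removed {", "no known master", "transport disconnected")


def find_cluster_events(lines):
    """Return (timestamp, level, logger, node, message) for suspicious lines."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a log line in the assumed layout
        msg = match.group("msg")
        if any(keyword in msg for keyword in KEYWORDS):
            events.append(
                (
                    match.group("ts"),
                    match.group("level"),
                    match.group("logger"),
                    match.group("node"),
                    msg,
                )
            )
    return events
```

Running it over the log file of each node (for example `find_cluster_events(open(path))`, with `path` pointing at wherever that node writes its log) and lining the timestamps up side by side should show whether both nodes notice the problem at the same moment.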

Michel_Laporte · October 7, 2015, 10:45am

Hi Mike,

Is there any way to delay the GC process or disable it?

(I'm a bit of a novice when it comes to how Java handles its garbage collection.)

Or is there any way to increase the Java heap size?

Thanks

Michel_Laporte · October 8, 2015, 3:11pm

I've also realised that this happens every 2 hours on the dot.

Very weird.

[2015-10-08 01:02:07,956][DEBUG][action.admin.indices.alias.exists] [ess-lon-gray-003_master] no known master node, scheduling a retry
[2015-10-08 01:02:09,436][INFO ][cluster.service ] [ess-lon-gray-003_master] added {[ess-lon-gray-002_slave][waK4DyM2SIqg161gYdlbuA][ess-lon-gray-002][inet[/192.168.32.70:9300]],}, reason: zen-disco-receive(join from node[[ess-lon-gray-002_slave][waK4DyM2SIqg161gYdlbuA][ess-lon-gray-002][inet[/192.168.32.70:9300]]])
[2015-10-08 01:02:10,016][INFO ][cluster.metadata ] [ess-lon-gray-003_master] [graylogemea_21] update_mapping [message] (dynamic)

[2015-10-08 03:02:28,182][WARN ][discovery.zen.publish ] [ess-lon-gray-003_master] timed out waiting for all nodes to process published state [12166] (timeout [30s], pending nodes: [[graylog2-server][IuYwswm-QGCfMoIIYEGyMA][ess-lon-gray-002][inet[/192.168.32.70:9350]]{client=true, data=false, master=false}, [ess-lon-gray-002_slave][waK4DyM2SIqg161gYdlbuA][ess-lon-gray-002][inet[/192.168.32.70:9300]]])
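
To check the "every 2 hours" observation against the timestamps rather than by eye, a small sketch like the following computes the gap between consecutive incidents. The two timestamps are taken from the log lines above; the helper name `gaps_between` is made up for this example.

```python
from datetime import datetime

# Timestamp layout used in the Elasticsearch log lines quoted above.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gaps_between(timestamps):
    """Return the time differences between consecutive (sorted) timestamps."""
    parsed = sorted(datetime.strptime(ts, TS_FORMAT) for ts in timestamps)
    return [later - earlier for earlier, later in zip(parsed, parsed[1:])]


# Timestamps of the two incidents quoted in the post above.
incidents = ["2015-10-08 01:02:07,956", "2015-10-08 03:02:28,182"]

for gap in gaps_between(incidents):
    print(gap)  # prints 2:00:20.226000
```

With more incident timestamps in the list, a gap that stays close to two hours would point towards something scheduled (a cron job, a backup, a periodic task) rather than random network trouble, though that is only an inference from the pattern.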

msimos · October 8, 2015, 7:00pm

Hi,

There isn't any way to disable or delay the GC process. You either need to reduce the amount of heap you are using (e.g. by disabling norms or using doc values), or increase the Java heap to reduce the frequency of GCs. Refer to this URL:

https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html

If you're using the RPM or DEB install, you should set the heap size in /etc/sysconfig/elasticsearch or /etc/default/elasticsearch respectively. Since this happens almost exactly every 2 hours, I'd check what else is happening on your servers or network. Set up something to monitor port 9300 between the two nodes, and check the load and CPU.
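
One simple way to monitor port 9300 between the two nodes is to attempt a TCP connection at a regular interval and record how long it takes. The sketch below is a minimal illustration of that idea, not a replacement for a proper monitoring tool. The function names `probe` and `monitor`, the interval, and the timeout are all assumptions chosen for the example.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return None


def monitor(host, port, interval=10.0, rounds=6):
    """Probe host:port `rounds` times, printing one timestamped line each time."""
    for _ in range(rounds):
        elapsed = probe(host, port)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if elapsed is None:
            print(f"{stamp} {host}:{port} FAILED")
        else:
            print(f"{stamp} {host}:{port} ok {elapsed * 1000:.1f} ms")
        time.sleep(interval)
```

Run from the master with something like `monitor("192.168.32.70", 9300, rounds=1000)` (the address that appears in the logs above) and left going across one of the two-hour marks, failed or slow probes at the same moment as the Elasticsearch warnings would suggest a network problem. Clean probes would point back towards the node itself, for example a long GC pause.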

Michel_Laporte · October 9, 2015, 9:28am

Hi,

I did increase the heap in /etc/default/elasticsearch.

The new value shows up in KOPF; however, the Graylog instance still says 972MB (which is what I found weird).

http://imgur.com/Br5qAdY

I'm starting to think it's Graylog rather than ES?
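
One possible explanation, offered as an assumption rather than a diagnosis: the logs earlier in the thread show graylog2-server joining the cluster as a client node (`client=true, data=false`), and a client node embedded in Graylog runs inside Graylog's own JVM, so its heap would be configured with Graylog and not in /etc/default/elasticsearch. The sketch below lists the maximum heap per node from the node stats API (`GET /_nodes/stats/jvm`), which should make such a difference visible if the client node is reported there. The base URL and the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_jvm_stats(base_url="http://localhost:9200"):
    """Fetch JVM stats for all nodes; base_url is an assumed HTTP endpoint."""
    with urlopen(base_url + "/_nodes/stats/jvm", timeout=10) as resp:
        return json.load(resp)


def heap_by_node(stats):
    """Map node name to maximum heap in MB, from a _nodes/stats/jvm response."""
    result = {}
    for node in stats.get("nodes", {}).values():
        mem = node.get("jvm", {}).get("mem", {})
        max_bytes = mem.get("heap_max_in_bytes")
        if max_bytes is not None:
            result[node.get("name", "?")] = round(max_bytes / (1024 * 1024))
    return result
```

Calling `heap_by_node(fetch_jvm_stats())` against one of the Elasticsearch nodes returns a dictionary of node names and heap sizes. If the two Elasticsearch nodes show the new heap size and only graylog2-server shows roughly 972, the smaller figure belongs to the Graylog process.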

Christian_Dahlqvist · October 9, 2015, 9:37am

Have you set up monitoring of the network connection between the nodes? If so, did it show anything? Are the nodes located in the same data centre?

Michel_Laporte · October 9, 2015, 11:01am

I'm going to run that now.

Any tools you recommend?
