Elasticsearch loses its Master every few minutes

Hi,

I have Graylog working with ES.

I have 2 nodes: one is elected as master and the other (the slave) is master-eligible. I have installed KOPF on each one.

However, a few days after ES has been started, I can no longer load the KOPF information in the web UI. After 15 minutes it reappears and the cluster goes back to green. 15 minutes or so later the same thing happens: either KOPF cannot load information ( http://i.imgur.com/QghoQnM.jpg ) or the cluster goes red or yellow, and the error I get is here ( http://i.imgur.com/hrEt6ik.jpg )

At first I thought it was RAM, but I have upgraded the RAM on the VM and increased the heap. If I restart ES on the master, everything goes back to normal for a few days until the problem appears again. In the meantime I'm unable to query the ES index either (using Graylog) until I restart ES; then I'm able to search my logs again. What could be going wrong? The main error I always get is "No active Master, switching to basic mode". I keep losing my master.

Thanks,
Michel

Hi,

You need to check the Elasticsearch log file on each node to see what's happening. If a node doesn't respond within 90 seconds for some reason (3 attempts at 30 seconds each), it is removed from the cluster. So something is going on, but you need to look in the Elasticsearch log file on each node for any additional errors.
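
For reference, that 90 seconds comes from the zen fault-detection defaults, which look like this in elasticsearch.yml (these are the 1.x defaults; raising them usually only hides the underlying problem):

# zen fault-detection defaults (elasticsearch.yml)
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3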

Hi Mike,

Thanks for taking the time to reply :smile:
I currently have 2 main errors that keep popping up.

[2015-10-05 11:36:44,406][WARN ][discovery.zen.publish ] [ess-lon-gray-003_master] timed out waiting for all nodes to process published state [8625] (timeout [30s], pending nodes: [[graylog2-server][IuYwswm-QGCfMoIIYEGyMA][ess-lon-gray-002][inet[/192.168.32.70:9350]]{client=true, data=false, master=false}, [ess-lon-gray-002_slave][Gy1EMkaTSgilwenuqw5o7Q][ess-lon-gray-002][inet[/192.168.32.70:9300]]])

&

[2015-10-05 11:47:54,495][INFO ][cluster.service ] [ess-lon-gray-003_master] removed {[graylog2-server][IuYwswm-QGCfMoIIYEGyMA][ess-lon-gray-002][inet[/192.168.32.70:9350]]{client=true, data=false, master=false},}, reason: zen-disco-node_failed([graylog2-server][IuYwswm-QGCfMoIIYEGyMA][ess-lon-gray-002][inet[/192.168.32.70:9350]]{client=true, data=false, master=false}), reason transport disconnected

The second one says "transport disconnected"?

For some reason the master is being kicked out of the ES cluster.

Is there anything else I'm missing? What could be causing this?

Thanks

Hi,

Unfortunately it's hard to say what the issue is, since it just says the node graylog2-server timed out. The logs show ess-lon-gray-003_master removing ess-lon-gray-002 from the cluster, so it sounds like the connection was dropped by ess-lon-gray-002. You should check the log file on both nodes around these times. I'd also rule out any network issues, and keep in mind that if the data node goes into a long old-generation GC you can see issues like this, because Java pauses the JVM to do garbage collection.
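
A quick way to check the GC side is to pull the JVM stats and grep the Elasticsearch logs for slow-GC warnings, something along these lines (the log path assumes the default DEB/RPM location):

# heap usage and GC counters for every node
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'

# long old-generation collections logged by the JVM monitor, if there are any
grep '\[gc\]\[old\]' /var/log/elasticsearch/*.log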

Hi Mike,

Is there any way to delay the GC process or disable it?

(I'm a bit of a novice when it comes to how Java handles garbage collection.)

Or is there any way to increase the Java heap size?

Thanks

I've also realised that this happens every 2 hours on the dot.

Very weird.

[2015-10-08 01:02:07,956][DEBUG][action.admin.indices.alias.exists] [ess-lon-gray-003_master] no known master node, scheduling a retry
[2015-10-08 01:02:09,436][INFO ][cluster.service ] [ess-lon-gray-003_master] added {[ess-lon-gray-002_slave][waK4DyM2SIqg161gYdlbuA][ess-lon-gray-002][inet[/192.168.32.70:9300]],}, reason: zen-disco-receive(join from node[[ess-lon-gray-002_slave][waK4DyM2SIqg161gYdlbuA][ess-lon-gray-002][inet[/192.168.32.70:9300]]])
[2015-10-08 01:02:10,016][INFO ][cluster.metadata ] [ess-lon-gray-003_master] [graylogemea_21] update_mapping [message] (dynamic)

[2015-10-08 03:02:28,182][WARN ][discovery.zen.publish ] [ess-lon-gray-003_master] timed out waiting for all nodes to process published state [12166] (timeout [30s], pending nodes: [[graylog2-server][IuYwswm-QGCfMoIIYEGyMA][ess-lon-gray-002][inet[/192.168.32.70:9350]]{client=true, data=false, master=false}, [ess-lon-gray-002_slave][waK4DyM2SIqg161gYdlbuA][ess-lon-gray-002][inet[/192.168.32.70:9300]]])

Hi,

There isn't any way to disable or delay the GC process. You either need to reduce the amount of heap you are using (e.g. disable norms or use doc values), or increase the Java heap to reduce the frequency of GCs. Refer to this URL:

https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
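
Just as a rough illustration of the mapping side (Graylog manages its own index templates, so don't copy this verbatim; the template name and the field source are placeholders, and it would only affect newly created indices), doc values and disabled norms look roughly like this in a 1.x template:

curl -XPUT 'http://localhost:9200/_template/heap_savings_example' -d '{
  "template": "graylogemea_*",
  "mappings": {
    "message": {
      "properties": {
        "source": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true,
          "norms": { "enabled": false }
        }
      }
    }
  }
}'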

If you're using the RPM or DEB install you should set it in /etc/sysconfig/elasticsearch or /etc/default/elasticsearch respectively. Since this happens almost every 2 hours, I'd check what else is happening on your servers or network at those times. Set up something to monitor port 9300 between the two nodes, and check the load and CPU.
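
To be concrete about the heap setting, it's along these lines (4g is just a placeholder; keep it at or below half the machine's RAM and under ~32GB per the link above, then restart Elasticsearch):

# /etc/default/elasticsearch (DEB) or /etc/sysconfig/elasticsearch (RPM)
ES_HEAP_SIZE=4g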

Hi,

I did increase the heap in /etc/default/elasticsearch.

The new value shows up in KOPF; however, the Graylog instance still says 972MB (that's the part I found weird).

http://imgur.com/Br5qAdY

I'm starting to think it's Graylog rather than ES?
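
I'm guessing Graylog's own JVM heap is set separately from Elasticsearch's, e.g. something like this in /etc/default/graylog-server if it was installed from the DEB package (the path and variable name may differ depending on the install):

GRAYLOG_SERVER_JAVA_OPTS="-Xms1g -Xmx1g"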

Have you set up monitoring of the network connection between the nodes? If so, did it show anything? Are the nodes located in the same data centre?

I'm going to set that up now.
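
Probably just going to loop something like this from the master towards ess-lon-gray-002 (192.168.32.70, from the logs above) unless there's a better tool:

# crude check of the transport port every 30 seconds
while true; do
    if nc -z -w 5 192.168.32.70 9300; then
        echo "$(date) 9300 reachable" >> /tmp/es-port-check.log
    else
        echo "$(date) 9300 NOT reachable" >> /tmp/es-port-check.log
    fi
    sleep 30
done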

Any tools you recommend?