I have 2 nodes: one is elected as the master and the slave is master-eligible. I have installed KOPF on each one.
However, after a few days of ES running, I cannot load the KOPF information in the web UI. After 15 minutes it reappears and the cluster goes back to green. 15 minutes or so later the same thing happens: either KOPF cannot load information ( http://i.imgur.com/QghoQnM.jpg ) or the cluster goes red or yellow, and the error I get is here ( http://i.imgur.com/hrEt6ik.jpg ).
At first I thought it was RAM, but I have upgraded the RAM on the VM and increased the heap size. If I restart ES on the master, everything goes back to normal for a few days until the problem appears again. During that time I'm unable to query the ES index either (using Graylog) until I restart ES, then I'm able to search my logs again. What could be going wrong? The main error I always get is "No active Master, switching to basic mode". I keep losing my master.
You need to check the Elasticsearch log file on each node to see what's happening. If a node doesn't respond within 90 seconds for some reason (3 attempts at 30 seconds each), the node is removed from the cluster. So something is happening, but you need to look in the Elasticsearch log file on each node for any additional errors.
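For reference, that 90-second window comes from the zen fault-detection settings; a minimal sketch of the relevant elasticsearch.yml entries (these are already the 1.x defaults, shown explicitly only for illustration):

    # elasticsearch.yml -- zen fault-detection defaults (illustrative)
    discovery.zen.fd.ping_timeout: 30s   # wait up to 30s for each ping response
    discovery.zen.fd.ping_retries: 3     # 3 failed pings x 30s ~= 90s before the node is dropped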
Thanks for taking the time to reply.
I currently have 2 main errors that keep popping up.
[2015-10-05 11:36:44,406][WARN ][discovery.zen.publish ] [ess-lon-gray-003_master] timed out waiting for all nodes to process published state [8625] (timeout [30s], pending nodes: [[graylog2-server][IuYwswm-QGCfMoIIYEGyMA][ess-lon-gray-002][inet[/192.168.32.70:9350]]{client=true, data=false, master=false}, [ess-lon-gray-002_slave][Gy1EMkaTSgilwenuqw5o7Q][ess-lon-gray-002][inet[/192.168.32.70:9300]]])
Unfortunately it's hard to say what the issue is, since it just says the node graylog2-server timed out. From the logs it says ess-lon-gray-003_master removed ess-lon-gray-002 from the cluster, so it sounds like the connection was dropped by ess-lon-gray-002. You should check the log file on both nodes around these times. Also, I'd rule out any network issues; and if the data node goes into a long old-gen GC you can see issues like this, because Java will pause the JVM to do garbage collection.
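If you want to check whether long old-gen GC pauses line up with the drops, one quick way (assuming the stock node stats API on the default HTTP port) is to pull the JVM stats on each node and compare them before and after an incident:

    # heap usage plus GC collection counts/times for every node (run on either host)
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'

A large jump in the old collector's collection_time_in_millis around the times in your logs would point at GC pauses rather than the network.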
I've also realised that this happens every 2 hours on the dot.
Very weird.
[2015-10-08 01:02:07,956][DEBUG][action.admin.indices.alias.exists] [ess-lon-gray-003_master] no known master node, scheduling a retry
[2015-10-08 01:02:09,436][INFO ][cluster.service ] [ess-lon-gray-003_master] added {[ess-lon-gray-002_slave][waK4DyM2SIqg161gYdlbuA][ess-lon-gray-002][inet[/192.168.32.70:9300]],}, reason: zen-disco-receive(join from node[[ess-lon-gray-002_slave][waK4DyM2SIqg161gYdlbuA][ess-lon-gray-002][inet[/192.168.32.70:9300]]])
[2015-10-08 01:02:10,016][INFO ][cluster.metadata ] [ess-lon-gray-003_master] [graylogemea_21] update_mapping [message] (dynamic)
[2015-10-08 03:02:28,182][WARN ][discovery.zen.publish ] [ess-lon-gray-003_master] timed out waiting for all nodes to process published state [12166] (timeout [30s], pending nodes: [[graylog2-server][IuYwswm-QGCfMoIIYEGyMA][ess-lon-gray-002][inet[/192.168.32.70:9350]]{client=true, data=false, master=false}, [ess-lon-gray-002_slave][waK4DyM2SIqg161gYdlbuA][ess-lon-gray-002][inet[/192.168.32.70:9300]]])
There isn't any way to disable or delay the GC process. You either need to reduce the amount of heap you are using (i.e. disable norms or use doc values), or increase the Java heap to reduce the frequency of GCs. Refer to this URL:
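As a concrete sketch of the first option, a mapping for a new index could disable norms on analyzed strings and use doc values on not_analyzed fields. The index and field names below are placeholders, and in 1.x these settings can only be applied when a field is first created (e.g. via an index template), not retrofitted onto existing indices:

    # illustrative mapping only -- "graylog_example" and the field names are placeholders
    curl -XPUT 'http://localhost:9200/graylog_example/_mapping/message' -d '{
      "properties": {
        "message": { "type": "string", "norms": { "enabled": false } },
        "source":  { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }'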
If you're using the RPM or DEB install, you should set it in /etc/sysconfig/elasticsearch or /etc/default/elasticsearch respectively. Since this happens almost every 2 hours, I'd also check what else is happening on your servers or network. Set up something to monitor port 9300 between the two nodes, and keep an eye on load and CPU, for example:
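A rough sketch (the heap value is a placeholder to adjust for your VM's RAM, and nc is just one simple way to watch the transport port):

    # /etc/sysconfig/elasticsearch (RPM) or /etc/default/elasticsearch (DEB)
    ES_HEAP_SIZE=4g    # placeholder -- roughly half the VM's RAM, and below ~32g

    # e.g. run from cron every minute on the master to log any drops of the transport port
    nc -z -w 5 192.168.32.70 9300 || echo "$(date) port 9300 unreachable" >> /var/log/es-port-check.log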