Hi,
Lately we've been seeing stability issues with our Elasticsearch cluster in production.
We'd appreciate any hints on troubleshooting this.
Our cluster
6 c3.xlarge instances on AWS, running Elasticsearch 1.3.0.
Small documents (~1-3 KB each) are indexed continuously at ~500 per second.
Documents are written to ES from a 3-node Storm cluster.
Aggregation queries are occasionally run over this data, spanning a range of 30-60 minutes.
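For context, our write path is roughly equivalent to the sketch below, which builds the NDJSON body for the Elasticsearch _bulk API (the index name, type, and document shape are illustrative, not our exact Storm bolt code):

```python
import json

def build_bulk_body(docs, index="metrics", doc_type="metric"):
    """Build an NDJSON body for the Elasticsearch _bulk API:
    one action line followed by one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

# Each document is small (~1-3 KB); batches like this are POSTed to /_bulk
# at an aggregate rate of ~500 documents per second.
body = build_bulk_body([{"metric": "cpu", "value": 0.42}])
```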
Observations
The cluster state very frequently goes red and reports fewer active nodes than expected; worse, the state reported by each node in the cluster is different.
The following log messages are seen on some of the nodes:
[2015-07-03 12:24:53,724][DEBUG][action.bulk ] [metrics-datastore-3-QA2906-perf] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-07-03 12:25:10,315][DEBUG][action.bulk ] [metrics-datastore-3-QA2906-perf] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
Sometimes, just before these messages, we find logs like:
[2015-06-24 06:39:27,679][DEBUG][action.admin.cluster.health] [metrics-datastore-1-production_rel_1_7azure] connection exception while trying to forward request to master node [[metrics-datastore-3-production_rel_1_7azure][4Ri9yww7RNa-TPAfLlgoNQ][ip-172-31-32-159.us-west-2.compute.internal][inet[/172.31.32.159:9300]]{max_local_storage_nodes=1}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [metrics-datastore-3-production_rel_1_7azure][inet[/172.31.32.159:9300]][cluster/health] disconnected]
[2015-06-24 06:39:28,790][INFO ][discovery.zen ] [metrics-datastore-1-production_rel_1_7azure] master_left [[metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][ip-172-31-19-48.us-west-2.compute.internal][inet[/172.31.19.48:9300]]{max_local_storage_nodes=1}], reason [no longer master]
[2015-06-24 06:39:28,790][INFO ][cluster.service ] [metrics-datastore-1-production_rel_1_7azure] master {new [metrics-datastore-2-production_rel_1_7azure][E9RKBA8yT2Wf-sp93XrTCQ][ip-172-31-0-179.us-west-2.compute.internal][inet[/172.31.0.179:9300]]{max_local_storage_nodes=1}, previous [metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][ip-172-31-19-48.us-west-2.compute.internal][inet[/172.31.19.48:9300]]{max_local_storage_nodes=1}}, removed {[metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][ip-172-31-19-48.us-west-2.compute.internal][inet[/172.31.19.48:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][ip-172-31-19-48.us-west-2.compute.internal][inet[/172.31.19.48:9300]]{max_local_storage_nodes=1})
A node that shows these log messages remains in this state (possibly forever) until it is restarted, after which the cluster state goes back to green with consistent health reporting from all the nodes.
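To quantify the "each node reports a different state" symptom, we hit GET /_cluster/health on every node and diff the responses. A minimal sketch of the comparison (the node names and health values below are made-up examples of the divergence we observe, not real captures):

```python
from collections import Counter

def divergent_nodes(health_by_node):
    """Given {node_name: cluster-health dict} collected by querying
    GET /_cluster/health on each node individually, return the nodes
    whose reported (status, number_of_nodes) differs from the majority
    view -- the inconsistency we see during these episodes."""
    views = {n: (h.get("status"), h.get("number_of_nodes"))
             for n, h in health_by_node.items()}
    majority_view, _count = Counter(views.values()).most_common(1)[0]
    return sorted(n for n, v in views.items() if v != majority_view)

# Hypothetical example: one node disagrees with the rest of the cluster.
reports = {
    "node-1": {"status": "green", "number_of_nodes": 6},
    "node-2": {"status": "green", "number_of_nodes": 6},
    "node-3": {"status": "red", "number_of_nodes": 4},
}
```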
Thanks,
Srinath.