Stability issues with Elasticsearch cluster


Of late, we've observed some stability issues with our Elasticsearch cluster in production.
We'd appreciate any hints for troubleshooting this one.

Our cluster:
- 6 c3.xlarge instances on AWS, running Elasticsearch 1.3.0.
- Small documents (~1-3 KB) are continuously indexed at a rate of ~500 per second.
- Documents are written to ES from a 3-node Storm cluster.
- Aggregation queries are occasionally run over this data, spanning a range of 30-60 minutes.
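For context, the Storm writers push documents through the bulk API. A minimal sketch of what such a request body looks like (index, type, and field names here are placeholders, not our actual mapping):

```shell
# Build a small bulk request body: one action line plus one source line per document.
BULK_FILE=$(mktemp)
for i in 1 2 3; do
  printf '{"index":{"_index":"metrics","_type":"doc"}}\n' >> "$BULK_FILE"
  printf '{"ts":"2015-07-03T12:24:%02d","value":%d}\n' "$i" "$i" >> "$BULK_FILE"
done
wc -l < "$BULK_FILE"   # 6 lines: 3 documents x (action + source)
# To actually send it to a node (requires a running cluster):
# curl -s -XPOST 'localhost:9200/_bulk' --data-binary @"$BULK_FILE"
```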

The cluster state very frequently goes red and reports fewer active nodes, and the state reported by each node in the cluster differs.

The following log messages are seen on some of the nodes:
[2015-07-03 12:24:53,724][DEBUG][action.bulk ] [metrics-datastore-3-QA2906-perf] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-07-03 12:25:10,315][DEBUG][action.bulk ] [metrics-datastore-3-QA2906-perf] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

Sometimes, just before these messages, we find logs like:
[2015-06-24 06:39:27,679][DEBUG][] [metrics-datastore-1-production_rel_1_7azure] connection exception while trying to forward request to master node [[metrics-datastore-3-production_rel_1_7azure][4Ri9yww7RNa-TPAfLlgoNQ][][inet[/]]{max_local_storage_nodes=1}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [metrics-datastore-3-production_rel_1_7azure][inet[/]][cluster/health] disconnected]
[2015-06-24 06:39:28,790][INFO ][discovery.zen ] [metrics-datastore-1-production_rel_1_7azure] master_left [[metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][][inet[/]]{max_local_storage_nodes=1}], reason [no longer master]
[2015-06-24 06:39:28,790][INFO ][cluster.service ] [metrics-datastore-1-production_rel_1_7azure] master {new [metrics-datastore-2-production_rel_1_7azure][E9RKBA8yT2Wf-sp93XrTCQ][][inet[/]]{max_local_storage_nodes=1}, previous [metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][][inet[/]]{max_local_storage_nodes=1}}, removed {[metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][][inet[/]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][][inet[/]]{max_local_storage_nodes=1})

The node that shows these log messages remains in this state (possibly forever) until it is restarted, after which the cluster state returns to green with consistent health reporting from all nodes.
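To see the disagreement directly, each node can be asked for its own view of cluster health; `?local=true` makes a node answer from its local cluster state rather than forwarding to the master. The hostnames below are placeholders for our six nodes:

```shell
# Print the health endpoint queried on each node; on a wedged node the
# reported status will disagree with the rest of the cluster.
for node in es-node-1 es-node-2 es-node-3; do
  echo "GET http://$node:9200/_cluster/health?local=true"
  # Against a live cluster:
  # curl -s "http://$node:9200/_cluster/health?local=true" | grep -o '"status":"[a-z]*"'
done
# Also useful: which node does each member believe is the master?
# curl -s 'http://es-node-1:9200/_cat/master?v'
```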


How much data in the cluster? How many shards and indices (and replicas)?

You should really upgrade as well, 1.3.0 is pretty old by today's standards.

Thanks Mark for the reply. We have already begun the upgrade, but we want to be sure there are no other shortcomings leading to this situation.

There are around 23 indices, but only around 6 of them have a considerable amount of data being indexed. Each index is configured with 3 shards and the number of replicas (async) set to 1.

The active indices (around 5 of them) have a primary storage of around 20 GB each; the rest hold data on the order of a few megabytes.
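For reference, a back-of-the-envelope shard count from the numbers above (assuming all 23 indices use the same 3-shard/1-replica settings):

```shell
# 23 indices x 3 primary shards x 2 copies (primary + 1 replica)
echo $((23 * 3 * 2))
# Cross-check against the live cluster:
# curl -s 'localhost:9200/_cat/indices?v'
# curl -s 'localhost:9200/_cat/shards' | wc -l
```

That works out to 138 shards spread over the 6 data nodes.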

Let me know if you need any more data.

Bump! Anyone?....

Could it be a networking issue?
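If it is, the zen discovery settings would be the place to start. A sketch of the relevant elasticsearch.yml knobs on the 1.x line (the values are illustrative, not taken from this cluster):

```yaml
# elasticsearch.yml — zen discovery settings relevant to master flapping.
# Values below are illustrative only; check what this cluster actually uses.
discovery.zen.minimum_master_nodes: 4   # quorum for 6 master-eligible nodes (6/2 + 1)
discovery.zen.ping.timeout: 10s         # default is 3s; a flaky network may need more
```

Without a quorum set, a brief network partition can elect a second master, which matches the master_left churn in the logs.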

I don't think it's a networking issue; this happens most of the time during our stability tests.