Stability issues with elasticsearch cluster

Srinath_C · July 6, 2015, 7:12pm

Hi,

Of late, we've observed some stability issues with our Elasticsearch cluster in production.
I'd appreciate any hints on troubleshooting this one.

Our cluster
6 c3.xlarge instances on AWS running Elasticsearch 1.3.0
There is continuous indexing of small documents (~1-3kb) at the rate of ~500 per second.
Documents are being written to ES from a storm cluster of 3 nodes.
Aggregation queries are occasionally run over this data, each spanning a range of 30-60 minutes.

Observations
The cluster state very frequently goes red and reports fewer active nodes. The state reported by each node in the cluster is different.
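One way to pin down which nodes disagree is to ask every node for its own view of cluster health and group the answers. This is only a minimal sketch: it assumes each node serves HTTP on port 9200, and the host names in the usage comment are placeholders, not our real nodes. It reads the `status` and `number_of_nodes` fields from the `_cluster/health` response.

```python
import json
from urllib.request import urlopen


def fetch_health(host, port=9200, timeout=5):
    """Ask one node for its view of cluster health (GET /_cluster/health).

    Port 9200 and plain HTTP are assumptions; adjust for your setup.
    Returns the parsed JSON, or {"error": ...} if the node is unreachable.
    """
    url = "http://%s:%d/_cluster/health" % (host, port)
    try:
        with urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError) as exc:
        return {"error": str(exc)}


def distinct_views(reports):
    """Group hosts by the (status, number_of_nodes) pair each one reports.

    More than one key in the result means the nodes disagree.
    Unreachable nodes end up under the key (None, None).
    """
    views = {}
    for host, health in sorted(reports.items()):
        key = (health.get("status"), health.get("number_of_nodes"))
        views.setdefault(key, []).append(host)
    return views


# Usage (host names are hypothetical):
#   hosts = ["es-node-1", "es-node-2", "es-node-3"]
#   reports = {h: fetch_health(h) for h in hosts}
#   for view, members in distinct_views(reports).items():
#       print(view, members)
```

Running this periodically and logging the result would show when the views start to diverge, and which node is the odd one out.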

The following log messages are seen on some of the nodes:
[2015-07-03 12:24:53,724][DEBUG][action.bulk ] [metrics-datastore-3-QA2906-perf] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-07-03 12:25:10,315][DEBUG][action.bulk ] [metrics-datastore-3-QA2906-perf] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

Sometimes, just before these messages, we find logs like:
[2015-06-24 06:39:27,679][DEBUG][action.admin.cluster.health] [metrics-datastore-1-production_rel_1_7azure] connection exception while trying to forward request to master node [[metrics-datastore-3-production_rel_1_7azure][4Ri9yww7RNa-TPAfLlgoNQ][ip-172-31-32-159.us-west-2.compute.internal][inet[/172.31.32.159:9300]]{max_local_storage_nodes=1}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [metrics-datastore-3-production_rel_1_7azure][inet[/172.31.32.159:9300]][cluster/health] disconnected]
[2015-06-24 06:39:28,790][INFO ][discovery.zen ] [metrics-datastore-1-production_rel_1_7azure] master_left [[metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][ip-172-31-19-48.us-west-2.compute.internal][inet[/172.31.19.48:9300]]{max_local_storage_nodes=1}], reason [no longer master]
[2015-06-24 06:39:28,790][INFO ][cluster.service ] [metrics-datastore-1-production_rel_1_7azure] master {new [metrics-datastore-2-production_rel_1_7azure][E9RKBA8yT2Wf-sp93XrTCQ][ip-172-31-0-179.us-west-2.compute.internal][inet[/172.31.0.179:9300]]{max_local_storage_nodes=1}, previous [metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][ip-172-31-19-48.us-west-2.compute.internal][inet[/172.31.19.48:9300]]{max_local_storage_nodes=1}}, removed {[metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][ip-172-31-19-48.us-west-2.compute.internal][inet[/172.31.19.48:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([metrics-datastore-6-production_rel_1_7azure][1VJKk1OXTPW7xx5iNSkmEg][ip-172-31-19-48.us-west-2.compute.internal][inet[/172.31.19.48:9300]]{max_local_storage_nodes=1})

The node that shows these log messages remains in this state (probably forever) until it is restarted. After the restart, the cluster state goes back to green, with consistent health reported by all the nodes.

Thanks,
Srinath.

warkolm · July 6, 2015, 9:04pm

How much data in the cluster? How many shards and indices (and replicas)?

You should really upgrade as well, 1.3.0 is pretty old by today's standards.

Srinath_C · July 7, 2015, 12:47am

Thanks, Mark, for the reply. We have already begun the upgrade, but we want to be sure that no other shortcomings are leading to this situation.

There are around 23 indices, but only around 6 of them have a considerable amount of data being indexed. Each index is configured with 3 shards and the number of replicas (async) set to 1.

The active indices (around 5 of them) have a primary storage of around 20 GB, but the rest hold data on the order of a few megabytes.
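For reference, a rough shard count from the numbers above. This is a back-of-the-envelope sketch that assumes exactly 23 indices and an even spread over the 6 nodes, neither of which is guaranteed.

```python
def total_shard_copies(indices, primaries_per_index, replicas):
    """Total shard copies = indices * primaries * (1 primary + replicas)."""
    return indices * primaries_per_index * (1 + replicas)


# Figures taken from this post; "around 23" is treated as exactly 23.
total = total_shard_copies(indices=23, primaries_per_index=3, replicas=1)
per_node = total / 6  # assumes a perfectly balanced 6-node cluster

print(total, per_node)  # 138 23.0
```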

Let me know if you need any more data.

Srinath_C · July 8, 2015, 3:35am

Bump! Anyone?....

warkolm · July 8, 2015, 4:09am

It could be a networking issue?

Srinath_C · July 10, 2015, 4:02am

I don't think it's a networking issue. This happens most of the time during our stability tests.
