Last week we had to restart three of the data nodes in our cluster. These three nodes hold our main index (~1 billion documents, ~450GB across 10 shards with 1 replica).
Since the restart, all queries to the cluster have been sluggish, especially those against the main index. Even a simple search either times out or takes between 100 and 300 seconds.
I added 2 more data nodes into the mix, but they don't seem to help.
After the restart, one node ended up with all the primary shards of the main index. I'm not sure if that is the cause of the performance degradation.
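For reference, this is how I'm checking the shard distribution (assuming the cluster is reachable on localhost:9200, and using `main-index` as a stand-in for our real index name):

```shell
# List every shard of the index with its role and node.
# prirep shows "p" for primary and "r" for replica; in my case
# all the "p" rows point at the same node.
curl -s 'localhost:9200/_cat/shards/main-index?v&h=index,shard,prirep,state,docs,node'
```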
Another curiosity is that Kibana now fails to load. It errors out while fetching its settings, even though the same _mget query succeeds when I run it manually. Not sure if it's related.
Using Elasticsearch 1.7.1, and Kibana 4.1.2.
The settings we arrived at through trial and error (circuit breaker limits) have persisted across the restart, so I'm not sure what is causing this.
There are 4 other data nodes in the cluster, but they don't hold the problematic indices and they weren't restarted either.
Can anyone tell me how I can diagnose why queries are so slow? I have slow logging enabled on all the nodes.
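In case it matters, these are the slow log thresholds I have set (applied as index settings; the index name and threshold values here are just examples of what we use):

```shell
# Search slow log thresholds on the main index ("main-index" is a placeholder).
# Queries slower than these values get written to the slow log on each node.
curl -s -XPUT 'localhost:9200/main-index/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'
```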
We have allocated more than 50% of the RAM to the ES heap (12G out of 15.25G). This was done incrementally over a period of a few months (from 7G to 12G). I'm guessing this could be a problem, but we hadn't had any query problems before last week, so I'm not sure how it would suddenly become a factor.
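This is what I'm using to watch heap usage per node (again assuming localhost:9200; if the heap is pinned near the limit on the restarted nodes, I'd expect heavy GC to explain some of the slowness, but I haven't confirmed that yet):

```shell
# Per-node heap and RAM usage; a heap.percent that stays in the
# high 90s usually means the node is spending its time in GC.
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'
```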
When a query is running, the load average spikes, but CPU usage remains flat.
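Since load can climb without CPU doing the work, I've been watching iowait on the data nodes while a query runs (plain OS tooling, nothing ES-specific):

```shell
# Extended disk stats, one sample per second for five seconds.
# High %iowait with flat %user would point at disk rather than CPU.
iostat -x 1 5
```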
Our use case is mostly indexing-heavy. Most of the heavy queries/aggregations run as scripts once a day that write aggregated data into other indices, which are then queried (using either Kibana or Grafana).