Production cluster slows down after 15-20 days of starting the services

Hanish_Bansal · May 3, 2016, 3:04pm

We are running a 5-machine Elasticsearch cluster in production.
Initially the cluster responds well, but 15-20 days after starting the
services, ES responses slow down and indexing slows down as well. To
resolve the issue we have to restart Elasticsearch on a couple of nodes.

Details:
ES Version: 1.5.2
RAM per node: 30 GB
ES Heap size per node: 12 GB
Disk Space per node: 1 TB

We index 40 million records per day, about 30 GB in total. Indexes are created on a daily basis. The replication factor is currently 2.

We have to restart ES after 15-20 days to keep it stable.

It would be a great help if anyone could share their suggestions.

Bruce_Ritchie · May 3, 2016, 6:43pm

Do you have GC logging enabled? If so, you may want to check whether ES is slowly eating up its heap and slowing down because of GCs. There have been a couple of improvements and fixes post-1.5 that may impact your cluster - the one that comes to mind is described here - JVM heap wasted in Segments fixedBitSet

Hanish_Bansal · May 5, 2016, 3:22pm

Hi Bruce,

Thanks for the reply.

I have not enabled GC logging explicitly. Can you please let me know how to enable it?

As of now we cannot upgrade ES, as there are also major changes in the Java APIs.

Bruce_Ritchie · May 5, 2016, 3:35pm

API changes, while they exist, should not be horribly onerous (we did the upgrade from 1.3 to 2.3 in a week or so), especially for a 1.5 -> 1.7.5 upgrade.

With respect to GC logging, I'm pretty sure there is an environment variable to enable it. Check your bin/elasticsearch and/or bin/elasticsearch.in.sh file for details.

Hanish_Bansal · May 5, 2016, 5:13pm

Okay. I will check about GC logging.

I am just curious: did you face any data compatibility issues while upgrading ES from 1.3 to 2.3? I mean, is data indexed in the old version of ES compatible with the new version, or is there some process for this?

Bruce_Ritchie · May 5, 2016, 5:15pm

We had to reindex because we had a field with a period in the name, which is disallowed in 2.3. There is a compatibility plugin for 1.x that can tell you whether your system can be upgraded in place.
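
For a quick look at the period-in-field-name case specifically, here is a minimal Python sketch that walks a mapping and lists field names containing a period. It assumes the response has the shape index -> "mappings" -> type -> "properties" (as returned by GET /_mapping on 1.x, to the best of my knowledge). The function names and the sample data are hypothetical, and multi-fields under "fields" are not checked. It is not a substitute for the compatibility plugin.

```python
def find_dotted_fields(properties, path=""):
    """Return paths of fields whose own name contains a period."""
    dotted = []
    for name, spec in properties.items():
        # "/" separates nesting levels so it cannot be confused with a dot in a name.
        full = f"{path}/{name}" if path else name
        if "." in name:
            dotted.append(full)
        # Object fields carry their children under "properties".
        dotted.extend(find_dotted_fields(spec.get("properties", {}), full))
    return dotted


def dotted_fields_in_mappings(mappings_response):
    """Map 'index/type' to the list of offending field paths."""
    report = {}
    for index, body in mappings_response.items():
        for doc_type, type_mapping in body.get("mappings", {}).items():
            hits = find_dotted_fields(type_mapping.get("properties", {}))
            if hits:
                report[f"{index}/{doc_type}"] = hits
    return report


# Hypothetical sample, shaped like the assumed GET /_mapping response.
sample = {
    "logs-2016.05.05": {
        "mappings": {
            "event": {
                "properties": {
                    "user.name": {"type": "string"},
                    "geo": {
                        "properties": {
                            "lat.deg": {"type": "double"},
                            "city": {"type": "string"},
                        }
                    },
                }
            }
        }
    }
}

print(dotted_fields_in_mappings(sample))
# {'logs-2016.05.05/event': ['user.name', 'geo/lat.deg']}
```

An empty result only means no field name contains a period; other incompatibilities would still need the plugin.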

Hanish_Bansal · May 6, 2016, 3:20pm

Thanks Bruce for this information.

Is there any REST API in ES that we can use to identify heap usage, i.e. what is eating up the memory?

Bruce_Ritchie · May 6, 2016, 3:54pm

I'd suggest looking at the stats API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_monitoring_individual_nodes.html

If that doesn't help, then perhaps a heap dump + Eclipse Memory Analyzer. Be aware though that you'll need a decent understanding of ES's internals to make sense of the classes and their hierarchy.
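
For the stats API route, a rough Python sketch of pulling per-node heap numbers might look like the following. The base URL and the function names are assumptions. The JSON field names (heap_used_percent, the "old" GC collector, fielddata and segments memory) are what I recall from nodes stats output, so verify them against what your version actually returns.

```python
import json
from urllib.request import urlopen


def fetch_node_stats(base_url="http://localhost:9200"):
    """Fetch JVM and indices stats for all nodes (base_url is an assumption)."""
    with urlopen(f"{base_url}/_nodes/stats/jvm,indices") as resp:
        return json.load(resp)


def summarize_heap(stats):
    """Return one row per node, highest heap usage first.

    Missing fields come back as None rather than raising, since the exact
    layout may differ between versions.
    """
    rows = []
    for node in stats.get("nodes", {}).values():
        jvm = node.get("jvm", {})
        mem = jvm.get("mem", {})
        old_gc = jvm.get("gc", {}).get("collectors", {}).get("old", {})
        indices = node.get("indices", {})
        rows.append({
            "node": node.get("name"),
            "heap_used_percent": mem.get("heap_used_percent"),
            "old_gc_count": old_gc.get("collection_count"),
            "old_gc_millis": old_gc.get("collection_time_in_millis"),
            "fielddata_bytes": indices.get("fielddata", {}).get("memory_size_in_bytes"),
            "segments_bytes": indices.get("segments", {}).get("memory_in_bytes"),
        })
    return sorted(rows, key=lambda r: r["heap_used_percent"] or 0, reverse=True)
```

Calling summarize_heap(fetch_node_stats()) every few hours and keeping the rows would show whether heap usage and old-generation GC time creep up over the 15-20 days, and whether fielddata or segment memory grows along with them.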
