Production cluster slows down after 15-20 days of starting the services

We are running a 5-machine ElasticSearch cluster in production.
Initially the cluster responds well, but 15-20 days after starting
the services, ES responses slow down and indexing slows down as well.
To resolve the issue we have to restart ElasticSearch on a couple of
nodes.

Details:
ES Version: 1.5.2
RAM per node: 30 GB
ES Heap size per node: 12 GB
Disk Space per node: 1 TB

We index 40 million records a day, about 30 GB in total. Indexes are created on a daily basis. The replication factor is currently 2.
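To put these numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes that "replication factor 2" means two replica copies per primary, that the 30 GB figure approximates the on-disk size of one day's primaries, and that shards are spread evenly over the 5 nodes. None of these assumptions is confirmed in the post, and the function name is made up for illustration.

```python
def per_node_growth_gb(daily_primary_gb, replicas, nodes, days):
    """Estimate the data accumulated per node after `days` of daily indices."""
    copies = 1 + replicas                      # the primary plus its replicas
    cluster_total = daily_primary_gb * copies * days
    return cluster_total / nodes               # assumes an even shard distribution


# Figures from the post; replicas=2 is an assumption about "replication factor 2".
after_20_days = per_node_growth_gb(daily_primary_gb=30, replicas=2, nodes=5, days=20)
print(f"roughly {after_20_days:.0f} GB per node after 20 days")
```

If the replication factor actually means one replica, pass `replicas=1` instead; the point is only to see how much data each node holds by the time the slowdown appears.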

We have to restart ES every 15-20 days to keep it stable.

It would be a great help if anyone could share their suggestions.

Do you have GC logging enabled? If so, you may want to check whether ES is slowly eating up its heap and slowing down because of GCs. There have been a couple of improvements and fixes post 1.5 that may affect your cluster. The one that comes to mind is described here - JVM heap wasted in Segments fixedBitSet
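Even without GC logs, the same question can be approached by comparing two snapshots of a node's JVM statistics. The sketch below computes the share of wall-clock time spent in old-generation collections between two snapshots. It assumes each snapshot is the "jvm" section of one node from the node stats response, with the field names (`timestamp`, `gc.collectors.old.collection_time_in_millis`) as they appear in 1.x output as far as I know; check them against your own response. The sample values are invented.

```python
def old_gc_overhead(before, after):
    """Fraction of wall-clock time spent in old-generation GC between two snapshots.

    `before` and `after` are the "jvm" sections for the same node, taken from
    two node stats responses at different times.
    """
    old_before = before["gc"]["collectors"]["old"]
    old_after = after["gc"]["collectors"]["old"]
    gc_ms = old_after["collection_time_in_millis"] - old_before["collection_time_in_millis"]
    wall_ms = after["timestamp"] - before["timestamp"]
    if wall_ms <= 0:
        raise ValueError("snapshots must be taken at different times")
    return gc_ms / wall_ms


# Hypothetical snapshots taken 60 seconds apart (values invented for illustration).
t0 = {"timestamp": 1_000_000,
      "gc": {"collectors": {"old": {"collection_count": 10,
                                    "collection_time_in_millis": 5_000}}}}
t1 = {"timestamp": 1_060_000,
      "gc": {"collectors": {"old": {"collection_count": 14,
                                    "collection_time_in_millis": 17_000}}}}
print(f"old GC overhead: {old_gc_overhead(t0, t1):.0%}")
```

A node that spends a steadily growing share of its time in old-generation GC as the days pass would fit the "slowly eating up its heap" explanation.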

Hi Bruce,

Thanks for the reply.

I have not enabled GC logging explicitly. Can you please let me know how to enable it?

As of now we cannot upgrade ES, as there are major changes in the Java APIs as well.

API changes, while they exist, should not be horribly onerous (we did the upgrade from 1.3 to 2.3 in a week or so), especially for 1.5 -> 1.7.5.

With respect to GC logging, I'm pretty sure there is an environment variable to enable it. Check your bin/elasticsearch and/or bin/elasticsearch.in.sh file for details.
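One way to follow that suggestion is to scan the startup script for GC-related lines and read off which variable guards them. The sketch below does only that and deliberately does not assume the variable's name, since it may differ between versions. On Java 7/8 HotSpot JVMs the flags to look for are typically `-Xloggc:<file>` and `-XX:+PrintGCDetails`. The function name and the sample script text are made up for illustration.

```python
import re

# Case-insensitive match on "gc": catches comments such as "GC logging options"
# as well as flags such as -Xloggc and -XX:+PrintGCDetails.
GC_PATTERN = re.compile(r"gc", re.IGNORECASE)


def find_gc_lines(script_text):
    """Return (line_number, line) pairs for lines of a script that mention GC."""
    return [(number, line.rstrip())
            for number, line in enumerate(script_text.splitlines(), start=1)
            if GC_PATTERN.search(line)]


# Invented sample text standing in for the real startup script.
sample = (
    'JAVA_OPTS="$JAVA_OPTS -Xss256k"\n'
    "# GC logging options\n"
    'JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDetails"\n'
)
for number, line in find_gc_lines(sample):
    print(number, line)
```

To run it against a real installation, read the file first, for example `find_gc_lines(open("bin/elasticsearch.in.sh").read())`, with the path adjusted to wherever ES is installed.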

Okay. I will check about GC logging.

I am just curious to know: did you not face any data compatibility issues while upgrading ES from 1.3 to 2.3? I mean, is data indexed in the old version of ES compatible with the new version, or is there some process for this?

We had to reindex because we had a field with a period in its name, which is disallowed in 2.3. There is a compatibility plugin for 1.x that can tell you if your system can be upgraded in place or not.
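The period-in-field-name problem can also be checked by hand. This is a minimal sketch, not the compatibility plugin's own logic: it walks the "properties" section of a mapping (as returned by the get mapping API) and lists every field whose own name contains a period. The function name and the sample mapping are hypothetical.

```python
def dotted_fields(properties, prefix=""):
    """Yield the path of every field whose own name contains a period.

    Path segments are joined with "/" so that they cannot be confused with
    periods inside a field name.
    """
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "." in name:
            yield path
        # Object and nested fields carry their children under "properties".
        children = spec.get("properties")
        if children:
            yield from dotted_fields(children, prefix=path + "/")


# Hypothetical "properties" section of one type's mapping.
mapping = {
    "user": {"properties": {"first.name": {"type": "string"},
                            "age": {"type": "integer"}}},
    "host.ip": {"type": "string"},
}
print(list(dotted_fields(mapping)))
```

Any field it reports would need renaming and reindexing before an in-place upgrade of the kind described above.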

Thanks Bruce for this information.

Is there any REST API in ES that we can use to inspect heap usage, i.e. to identify what is eating up the memory?

I'd suggest looking at the stats API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_monitoring_individual_nodes.html
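As a rough illustration of using the node stats endpoint for this, the sketch below fetches `/_nodes/stats` and, for each node, reports the heap usage percentage plus every counter in the "indices" section whose name ends in `memory_in_bytes` or `memory_size_in_bytes`. The host and port are assumptions, and the generic suffix match is used on purpose so that no particular counter is assumed to exist in 1.5.2. The sample response is invented and heavily trimmed.

```python
import json
from urllib.request import urlopen


def fetch_node_stats(base_url="http://localhost:9200"):
    """GET /_nodes/stats and return the parsed JSON (host and port are assumptions)."""
    with urlopen(f"{base_url}/_nodes/stats") as response:
        return json.load(response)


def memory_report(stats):
    """Map node name -> (heap_used_percent, {"section.counter": bytes})."""
    report = {}
    for node in stats["nodes"].values():
        counters = {}
        for section, values in node.get("indices", {}).items():
            if not isinstance(values, dict):
                continue
            for key, value in values.items():
                if key.endswith(("memory_in_bytes", "memory_size_in_bytes")):
                    counters[f"{section}.{key}"] = value
        report[node["name"]] = (node["jvm"]["mem"]["heap_used_percent"], counters)
    return report


# Invented, trimmed-down response used only to show the shape of the result.
sample = {"nodes": {"abc": {
    "name": "node-1",
    "jvm": {"mem": {"heap_used_percent": 88}},
    "indices": {"fielddata": {"memory_size_in_bytes": 100, "evictions": 0},
                "segments": {"count": 5, "memory_in_bytes": 200}},
}}}
print(memory_report(sample))
```

Against a live cluster the call would be `memory_report(fetch_node_stats())`. Running it daily and comparing the counters should show which category grows over the 15-20 days.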

If that doesn't help, then perhaps a heap dump + Eclipse Memory Analyzer. Be aware though that you'll need a decent understanding of ES's internals to make sense of the classes and their hierarchy.
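For the heap dump itself, one common route is the JDK's `jmap` tool, which writes an .hprof file that Eclipse Memory Analyzer can open. The sketch below only builds and runs that command. The helper names, pid, and output path are placeholders, and it assumes `jmap` from the same JDK as the ES process is on the PATH and is run as the same user.

```python
import subprocess


def jmap_dump_command(pid, out_file, live_only=True):
    """Build the jmap command line for a binary heap dump of process `pid`."""
    # "live" restricts the dump to reachable objects, which forces a full GC first.
    options = "live,format=b" if live_only else "format=b"
    return ["jmap", f"-dump:{options},file={out_file}", str(pid)]


def dump_heap(pid, out_file):
    """Run jmap; the target JVM is paused while the dump is being written."""
    subprocess.run(jmap_dump_command(pid, out_file), check=True)


# Placeholder pid and path, shown only to illustrate the resulting command.
print(" ".join(jmap_dump_command(1234, "/tmp/es-heap.hprof")))
```

With a 12 GB heap the dump file can approach the size of the heap, and the node stops responding while it is written, so it is safer to take it on one node at a time.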