Incomprehensible Elasticsearch behaviour

I'm running a two-node Elasticsearch cluster (v5.6.10) which I monitor, and I've noticed some graphs I can't explain. The first one is a graph of merge operations over time:

As far as I can see, the growth is roughly linear, so I assume there is an explanation for it, but I can't find any information. Even more interesting, at 00:00 it drops to zero. Can someone explain what is causing this?

The second graph looks much the same as the first one, but it shows the heap used by the cluster:

This looks like a memory leak to me, and again the heap usage resets around 00:00.

Here's a graph of the Elasticsearch operations (indexing rate and search rate). As you can see, there is almost no indexing, and searches peak at about 40 requests per second, which I don't think is much load.
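
For anyone who wants to cross-check those numbers: the rates are presumably just per-second deltas of the cumulative counters Elasticsearch exposes, so something like this sketch reproduces them straight from the index stats API (the URL and sampling interval are placeholders):

```python
import time

import requests

ES = "http://localhost:9200"  # placeholder; point this at one of the nodes

def totals():
    # Cumulative counters since node start, summed over all indices
    stats = requests.get(f"{ES}/_all/_stats/indexing,search").json()
    total = stats["_all"]["total"]
    return total["indexing"]["index_total"], total["search"]["query_total"]

INTERVAL = 60  # seconds between the two samples
idx0, q0 = totals()
time.sleep(INTERVAL)
idx1, q1 = totals()

print(f"indexing rate: {(idx1 - idx0) / INTERVAL:.1f} docs/s")
print(f"search rate:   {(q1 - q0) / INTERVAL:.1f} queries/s")
```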


The issue I'm facing is that the peaks on all graphs coincide with the 'rush hour' of my application, and during that time the system becomes unresponsive.

Here's some information about the setup of the cluster:

I have 2 virtual machines, and each node runs in a separate Docker container on one of them.

Node 1 hardware (the GREEN graph):

  • 8-core Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
  • a spinning hard drive
  • 16 GB of heap allocated

Node 2 hardware (the YELLOW graph):

  • 8-core Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
  • a spinning hard drive
  • 16 GB of heap allocated

The database cluster also runs on these virtual machines, so the CPUs are shared between the database cluster and the Elasticsearch cluster.

Another thing worth mentioning: because of the spinning disks I tried the recommendation from https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-merge.html, but it didn't change anything. I applied the setting at the index level without restarting the cluster, since several sources say it's a dynamic (runtime) setting.
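
Concretely, applying that kind of setting at runtime looks roughly like this (the index name and URL are placeholders, and the single-thread merge scheduler value is the spinning-disk recommendation as I understood it):

```python
import requests

ES = "http://localhost:9200"  # placeholder for one of the cluster nodes

# Dynamic index-level setting, applied without a restart: limit the merge
# scheduler to a single thread, as suggested for spinning disks.
resp = requests.put(
    f"{ES}/my_index/_settings",  # 'my_index' is a placeholder
    json={"index.merge.scheduler.max_thread_count": 1},
)
print(resp.json())  # expect {"acknowledged": true}
```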

A few general comments:

5.X has been EOL for some time now. You should upgrade as a matter of urgency.

Also, 2 nodes is not ideal: you run the risk of not being able to maintain a quorum, i.e. split brain.
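
If you do add a third master-eligible node, the 5.x quorum setting can, as far as I recall, be changed at runtime through the cluster settings API; a rough sketch (URL is a placeholder):

```python
import requests

ES = "http://localhost:9200"  # any node in the cluster

# With 3 master-eligible nodes, a quorum is 2 (a majority of master-eligible
# nodes). discovery.zen.minimum_master_nodes is a dynamic setting in 5.x,
# so it can be pushed without a restart.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"discovery.zen.minimum_master_nodes": 2}},
)
print(resp.json())
```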

Sharing the hosts with your database cluster is also unlikely to be a good thing, due to resource contention.

Moving on, though: what do your Elasticsearch logs show for things like GC?
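
If GC logging isn't enabled, the nodes stats API still gives a rough picture of heap pressure and old-generation collections per node; something along these lines should do (URL is a placeholder, field names as I recall them for 5.x):

```python
import requests

ES = "http://localhost:9200"  # any node in the cluster

stats = requests.get(f"{ES}/_nodes/stats/jvm").json()

for node_id, node in stats["nodes"].items():
    jvm = node["jvm"]
    old = jvm["gc"]["collectors"]["old"]
    print(
        f"{node['name']}: "
        f"heap {jvm['mem']['heap_used_percent']}% used, "
        f"old-gen GCs {old['collection_count']} "
        f"({old['collection_time_in_millis']} ms total)"
    )
```

If the old-gen collection time climbs sharply during your rush hour, that would point at heap pressure rather than merges on the spinning disks.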

I can't upgrade right now, because our usage is tightly coupled to Hibernate Search, and unfortunately the new version of Hibernate Search that supports ES 6 and above is still in development.

Regarding the logs: no, there's nothing suspicious in them (no GC logs at all).

Does anyone know of any automatic operations executed on the cluster? As the graphs show, the problem happens at exactly the same time every day. It looks as if a cron job runs and leads to some service interruption.

Not in 5.X. Is there anything in cron on the hosts?
