High indexing ES cluster sawtooth pattern

Hello folks,
I have a classic ELK cluster for storing logs from my applications.

Elasticsearch:
version: 5.1.2
3 nodes, 8 GB of RAM per box (6 GB dedicated to ES)
Running on AWS gp2 volumes provisioned at 3000 IOPS.
mlockall enabled

version: 5.0.0 (5.2.2 gives the same results)
1 instance with default configuration.

Recently I realized that we can't keep up with the amount of data we are pushing into ES.
After installing X-Pack, I found a sawtooth pattern in the "Indexing rate" section.
It looks like the whole box suspends completely every 90 seconds.
I increased the memory allocation from 4 GB to 6 GB, which didn't help much.

The weird thing is that I see the same behavior when I take ES off the load.
And these gaps show up in all the charts, which looks very odd.
I see two possible scenarios: either the ES box freezes periodically, or X-Pack is losing or mishandling monitoring packets.
Personally, I find it implausible that GC alone would make the JVM heap drop to zero.

Any thoughts?

[Screenshot: master node, under load]

[Screenshot: master node, off the load]

Can you see anything in the logs?

Some comments:

Allocate no more than 50% of the RAM to the heap.
Use machines with internal SSD drives.

What does the advanced node screen show with respect to GC? What is the average size of your documents?
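For anyone who wants the GC numbers without the monitoring UI, here is a minimal sketch that summarizes the response of the node stats API (`GET /_nodes/stats/jvm`). The `base_url` default and the helper names are assumptions for illustration, not something from this cluster.

```python
import json
from urllib.request import urlopen


def summarize_gc(stats):
    """Return per-node heap usage and GC totals from a _nodes/stats/jvm response."""
    summary = {}
    for node in stats["nodes"].values():
        jvm = node["jvm"]
        collectors = jvm["gc"]["collectors"]
        summary[node["name"]] = {
            "heap_used_percent": jvm["mem"]["heap_used_percent"],
            "young_count": collectors["young"]["collection_count"],
            "young_ms": collectors["young"]["collection_time_in_millis"],
            "old_count": collectors["old"]["collection_count"],
            "old_ms": collectors["old"]["collection_time_in_millis"],
        }
    return summary


def fetch_gc_summary(base_url="http://localhost:9200"):
    # base_url is an assumption; point it at any node in the cluster.
    with urlopen(base_url + "/_nodes/stats/jvm") as response:
        return summarize_gc(json.load(response))
```

The counters are cumulative since node start, so calling `fetch_gc_summary()` twice a minute or so apart and comparing the two results shows how much time went to old-generation collections in that window.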

The logs are completely clean; indexing_slowlog and index_search_slowlog are empty.
Unfortunately, I can't use internal SSD drives. We are running on AWS, and switching is not an option.

My average document size is 724 bytes.
The daily index size is about 60-75 GB.
The daily document count is about 72-100 million.
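To put these figures in perspective, here is a rough back-of-the-envelope calculation of the average indexing rate they imply. It assumes documents arrive evenly over the day, which real log traffic rarely does, so peaks will be higher. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def docs_per_second(docs_per_day):
    """Average indexing rate if documents arrived evenly over the day."""
    return docs_per_day / SECONDS_PER_DAY


def raw_gb_per_day(docs_per_day, avg_doc_bytes):
    """Raw source volume in GB (10^9 bytes), before any index overhead."""
    return docs_per_day * avg_doc_bytes / 1e9


# Figures quoted above: 72-100 million docs per day, 724 bytes on average.
for docs in (72_000_000, 100_000_000):
    rate = docs_per_second(docs)
    volume = raw_gb_per_day(docs, 724)
    print("{:,} docs/day: {:.0f} docs/s, {:.1f} GB raw".format(docs, rate, volume))
```

That works out to roughly 830 to 1160 documents per second on average, and about 52 to 72 GB of raw source per day, which is in the same range as the reported index size.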

Here are the advanced tabs:
[Screenshot: master node]

[Screenshot: secondary node]

You are probably going to pay the price with segment merging.

You didn't disable refresh, right?

No, I didn't.
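For reference, the refresh setting can be confirmed from the output of `GET /<index>/_settings`. Below is a small sketch that reads `index.refresh_interval` out of such a response. The index name in the usage line is hypothetical, and the helper names are my own.

```python
def refresh_interval(settings_response, index_name):
    """Return the explicit refresh_interval for an index, or None if unset."""
    index_settings = settings_response[index_name]["settings"].get("index", {})
    return index_settings.get("refresh_interval")


def describe_refresh(value):
    """Turn the raw setting into a short human-readable description."""
    if value is None:
        # Nothing set explicitly, so the index uses the 1s default.
        return "default (1s)"
    if value == "-1":
        return "disabled"
    return value


# Hypothetical response for an index named "logstash-2017.03.01".
response = {"logstash-2017.03.01": {"settings": {"index": {"number_of_shards": "5"}}}}
print(describe_refresh(refresh_interval(response, "logstash-2017.03.01")))
```

A value of `-1` means refresh is disabled. A longer interval such as `30s` is sometimes used on heavy-indexing clusters to reduce the number of small segments created.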

Here is my active index monitoring

It does segment merging, but I don't see any big merges related to these gaps.

Based on your description, it sounds like you might be running on m4.large nodes. If that is the case, these only have 2 CPU cores and moderate networking performance, both of which could be limiting indexing throughput. When I look at your monitoring graphs, it looks like CPU is pegged at 100% for periods of time and the heap usage is quite high (especially given that we recommend a 4 GB heap on an 8 GB host), resulting in a shallow sawtooth pattern. Given the size of the documents and the complexity of the mappings in use, it is possible that you have reached the limit of what your cluster can handle and may actually need to upgrade to a larger instance type that provides more CPU and RAM as well as better networking performance.
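As a rough illustration of the heap guidance, here is a sketch that derives the JVM heap flags from a host's RAM using the 50% rule. The 31 GB cap is an assumption based on the commonly cited advice to keep the heap below about 32 GB, which was not discussed in this thread. The instance memory sizes are AWS's published figures for the m4 family.

```python
# RAM per instance type in GB, from AWS's published m4 specifications.
INSTANCE_RAM_GB = {"m4.large": 8, "m4.xlarge": 16, "m4.2xlarge": 32}


def heap_gb(ram_gb, cap_gb=31):
    """Half of the host RAM, capped (the cap value is an assumption)."""
    return min(ram_gb // 2, cap_gb)


def jvm_heap_flags(ram_gb):
    """Matching min and max heap flags, as they would appear in jvm.options."""
    size = heap_gb(ram_gb)
    return ["-Xms{}g".format(size), "-Xmx{}g".format(size)]


for instance, ram in INSTANCE_RAM_GB.items():
    print(instance, " ".join(jvm_heap_flags(ram)))
```

For the 8 GB hosts in question this gives `-Xms4g -Xmx4g`, leaving the other half of the RAM to the operating system's file cache.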

Yes, we are running on m4.large boxes.
It is very possible.
I will give it another try on bigger boxes.
Thank you.
