Up to 5 hours of delay in Elasticsearch indexing

Hi,

I'm experiencing an issue with Elasticsearch indexing. I've searched online quite a lot, but I could not find anyone with a problem even remotely as severe as mine, hence this message.
My Elasticsearch cluster is composed of:

  • 15 nodes, of which 14 are data nodes
  • 8 CPUs and 32GB each
  • Elasticsearch 6.5.3, configured with a 16GB max heap

Two index templates run on this cluster:

  • one is fed by a custom component that pushes logs to a Kafka topic; Logstash listens to this topic and adds some processing before shipping the logs to Elasticsearch via the elasticsearch output plugin
  • the other is fed by a Filebeat -> Logstash -> Elasticsearch pipeline

Both indexes are rotated every day and have similar shard settings (the shards of both indexes stay between 20 and 40 GB, as advised by the tuning guide).

What's weird is that, while the index from the first template looks fine, with minimal delay, the index from the second template has up to 5 hours of delay. I've measured this by looking at the time at which each daily index stops receiving documents: around 5am for the second index, compared to midnight for the first.
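
A minimal sketch of how such a delay could be measured per document instead of per index, assuming each document carries both the time the log line was produced and the time it arrived in the pipeline. The field names `event_time` and `ingest_time` and the sample values are hypothetical, not taken from my actual mappings:

```python
from datetime import datetime
from statistics import median

def lag_seconds(event_time: str, ingest_time: str) -> float:
    """Seconds between when a log line was produced and when it was ingested."""
    produced = datetime.fromisoformat(event_time)
    ingested = datetime.fromisoformat(ingest_time)
    return (ingested - produced).total_seconds()

def summarize(docs):
    """Median and maximum lag over a list of documents (plain dicts)."""
    lags = [lag_seconds(d["event_time"], d["ingest_time"]) for d in docs]
    return {"median_s": median(lags), "max_s": max(lags)}

# Made-up sample: one document arriving 5 hours late, one arriving 30 seconds late.
sample = [
    {"event_time": "2019-01-10T23:59:00+00:00", "ingest_time": "2019-01-11T04:59:00+00:00"},
    {"event_time": "2019-01-10T12:00:00+00:00", "ingest_time": "2019-01-10T12:00:30+00:00"},
]
print(summarize(sample))
```

Comparing this summary between the two indexes would show whether the lag is already present when documents reach Elasticsearch or only builds up afterwards.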

The effect is also quite visible in Kibana, as logs tend to form a slope at the far right end of the log count histogram.

Any idea of what that could be?
I've already tried removing as much full-text indexing as possible, so most of the text fields in the second index are mapped as keyword.
I've also disabled swapping, as per the Elasticsearch tuning guide.
Since we mainly use it for logs, I've raised the refresh interval to 5s.
I was about to set the index buffer to 512MB, but the default of 10% of the heap already puts it at 1.6GB, and I see no reason to bring it down. Please advise if I'm wrong.
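
To make the numbers above concrete, here is a small sketch of the buffer arithmetic and of the settings body for the refresh interval. `index.refresh_interval` and `indices.memory.index_buffer_size` are the actual setting names; the helper functions are my own illustrative names, and nothing here talks to a cluster:

```python
import json

HEAP_GB = 16             # max heap per node, as configured on this cluster
DEFAULT_BUFFER_PCT = 10  # default indices.memory.index_buffer_size is 10% of the heap

def index_buffer_gb(heap_gb: float, pct: float) -> float:
    """Size of the indexing buffer in GB for a given heap size and percentage."""
    return heap_gb * pct / 100

def refresh_settings_body(interval: str) -> dict:
    """Request body for PUT <index>/_settings that changes the refresh interval."""
    return {"index": {"refresh_interval": interval}}

default_buffer = index_buffer_gb(HEAP_GB, DEFAULT_BUFFER_PCT)
print(f"default index buffer: {default_buffer} GB, proposed 512MB: {512 / 1024} GB")
print(json.dumps(refresh_settings_body("5s")))
```

The default buffer is already more than three times the 512MB I was considering, which is why lowering it looks pointless to me.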

Nothing seems to have worked so far, and I'm puzzled about where hundreds of gigabytes of logs could sit for 5 hours. I know they are eventually stored, because I can see the logs the next day, and there is nothing in our cluster big enough to hold them except Elasticsearch persistence. So I've also checked the translog, which rises to ~300MB per node for about a minute and then goes back down, as I would expect.

Any ideas?

Best regards,
Andrea

I'm adding a screenshot of what the histogram looks like.


The expectation is that it should look more or less flat. In fact, if I go back the next day and look at the same timeframe, it does look flat.
