Elasticsearch and log load

Hello, we're currently on Elasticsearch 6.3.2 with the log sources sending their data to Elasticsearch via Logstash.

I've recently noticed that Elasticsearch doesn't seem to be indexing anywhere near the number of documents it should be. For example, our Winlogbeat index used to receive over 200,000 logs per 15 minutes and our syslog index about 700,000 logs per 15 minutes (roughly 1,000 events per second combined). However, this has dropped drastically over time as we've added more log sources (Winlogbeat is now at about 2,000 logs per 15 mins, syslog about 60,000).

Over the months we've added more log sources, and I can only assume this is what's causing the slowdown / non-indexing, because when I disable the other inputs and leave only the Winlogbeat input enabled in Logstash, the logs per 15 minutes seem to return to normal (over 200,000 again).

I've looked at both the Logstash (logstash-plain.log) and Elasticsearch (es-cluster.log) application logs and can't see anything about "throttling" or anything else indicating that Elasticsearch can't handle the log load I'm throwing at it - can anyone point me in the right direction?

Many thanks.

If you use Kibana, you can check the Indexing Rate and Indexing Latency for Elasticsearch under
Monitoring > Clusters > Overview
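
If you'd rather pull the raw numbers yourself, here is a rough sketch that derives an indexing rate and an average per-document latency from the index stats API. The host URL, index pattern and sampling interval are assumptions - adjust them for your cluster.

```python
# Sample the indexing counters twice and derive a rate and an average
# per-document latency. Host, index pattern and interval are assumptions.
import time
import requests

ES = "http://localhost:9200"      # assumption: a reachable Elasticsearch node
INDEX = "winlogbeat-*"            # assumption: your Winlogbeat index pattern
INTERVAL = 60                     # seconds between the two samples

def sample():
    stats = requests.get(f"{ES}/{INDEX}/_stats/indexing").json()
    indexing = stats["_all"]["primaries"]["indexing"]
    return indexing["index_total"], indexing["index_time_in_millis"]

docs1, millis1 = sample()
time.sleep(INTERVAL)
docs2, millis2 = sample()

delta_docs = docs2 - docs1
rate = delta_docs / INTERVAL
latency_ms = (millis2 - millis1) / delta_docs if delta_docs else 0.0
print(f"indexing rate: {rate:.0f} docs/s, average latency: {latency_ms:.2f} ms/doc")
```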

It is also possible to add monitoring to Logstash.
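
Even before wiring that up, Logstash exposes its own node stats API (port 9600 by default), which shows per-pipeline event counts and queue timings. A quick sketch, with the host and port as assumptions:

```python
# Pull per-pipeline event counts from the Logstash node stats API.
# Host and port (9600 is the default) are assumptions.
import requests

LS = "http://localhost:9600"   # assumption: the Logstash monitoring API endpoint

stats = requests.get(f"{LS}/_node/stats/pipelines").json()
for name, pipeline in stats["pipelines"].items():
    events = pipeline["events"]
    print(name,
          "in:", events["in"],
          "out:", events["out"],
          "queue_push_ms:", events["queue_push_duration_in_millis"])
```

If the "in" counter keeps pulling ahead of "out", that would suggest the backlog is building inside Logstash rather than downstream.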

What is the CPU load like on your machines?

Hello, thanks for getting back to me; we don't currently use X-Pack - can I enable monitoring on the free license?

The resource stats (CPU, memory, disk, etc.) for our Logstash and Elasticsearch nodes are OK.

Sorry, I've answered my own question - I can see that monitoring is available on the Basic license.

Hello again - as per the documentation, I've set up a separate monitoring cluster and configured the Elasticsearch nodes and the Logstash node to send their metrics to it. From a cursory look I can't see anything obviously wrong in the stats - can anyone give me guidance on which metrics I should be looking at?

Thanks.

EDIT:

Indexing rate varies between 1500 and 2500 per second
Indexing latency is sub 1 millisecond (which seems very fast to me)

Are your indices keeping up? With a bottleneck that drastic, it seems they can't be keeping current.

On the other hand, with that index latency, maybe your log sources aren't sending data.

Check Logstash for delays - are you using memory or disk queues? Also check the Logstash logs and some sample Winlogbeat logs.

Thanks for getting back to me.

I'm using Logstash with the memory queue - the monitoring shows Logstash emitting the same number of logs it has received, with an event latency of 6 to 8 ms. The Logstash JVM heap usage is about half of the 1 GB assigned to it.

Last week I enabled debug logging for one of our Winlogbeat sources; I noted the record_number for one of the events it claimed to have sent to Logstash/Elasticsearch - the document did eventually appear in the Kibana search, but only around 7 minutes or so after Winlogbeat said it had submitted it.
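
For what it's worth, a rough way to spot-check that lag without debug logging is to compare the newest @timestamp in the index against the wall clock - a sketch, assuming the host, the index pattern, and that the source clocks are reasonably in sync:

```python
# Spot-check end-to-end indexing lag: newest @timestamp vs. current time.
# Host and index pattern are assumptions; source clocks must be in sync.
import time
import requests

ES = "http://localhost:9200"   # assumption
INDEX = "winlogbeat-*"         # assumption

body = {"size": 0, "aggs": {"latest": {"max": {"field": "@timestamp"}}}}
resp = requests.get(f"{ES}/{INDEX}/_search", json=body).json()
latest_ms = resp["aggregations"]["latest"]["value"]   # epoch milliseconds
lag_seconds = time.time() - latest_ms / 1000.0
print(f"newest indexed event is {lag_seconds:.0f}s behind the clock")
```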

So there's obviously a slowdown somewhere, but the monitoring seems to be saying there isn't an issue. I did grep the Elasticsearch logs for "TOO_MANY_REQUESTS" but found nothing. I can see in the Elasticsearch logs that the 1,000-field limit for the Winlogbeat indices has been reached, but this only affects the odd few documents; it doesn't account for the millions of missing documents that should be there each day.
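
As an aside, if the field-limit messages ever become more than a nuisance, my understanding is the limit can be raised on the existing indices with something like the following - the value of 2000 and the index pattern are just examples:

```python
# Raise the mapping total-fields limit on existing Winlogbeat indices.
# The limit value (2000) and index pattern are examples, not recommendations.
import requests

ES = "http://localhost:9200"   # assumption
requests.put(
    f"{ES}/winlogbeat-*/_settings",
    json={"index.mapping.total_fields.limit": 2000},
)
```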

The monitoring shows the segment count for the Elasticsearch nodes to be about 7,100 - is this high? Should I change the index refresh interval from 1 second to 30 seconds?
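
For reference, I believe the refresh change itself would look something like this (host and index pattern are assumptions):

```python
# Set the refresh interval to 30s on the Winlogbeat indices.
# Host and index pattern are assumptions.
import requests

ES = "http://localhost:9200"   # assumption
requests.put(
    f"{ES}/winlogbeat-*/_settings",
    json={"index": {"refresh_interval": "30s"}},
)
```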

Thanks.

How is the I/O load? How big are the OS-level I/O queues?

Changing the refresh interval probably won't hurt and could help. I had node stability problems - lots of GCs and heap-related crashes - that were fixed by reducing the segment count. Since I don't expect any log data more than three days old to arrive, I set indices to read-only and force-merge them after they are three days old.
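
Roughly what that step looks like, sketched out - the daily winlogbeat-YYYY.MM.DD naming, the host and the three-day cutoff are assumptions from my own setup:

```python
# Write-block and force-merge the index that has just turned three days old.
# Daily winlogbeat-YYYY.MM.DD naming, host and cutoff are assumptions.
import datetime
import requests

ES = "http://localhost:9200"   # assumption
cutoff = datetime.date.today() - datetime.timedelta(days=3)
index = f"winlogbeat-{cutoff:%Y.%m.%d}"

# Block further writes (one way of making the index effectively read-only),
# then merge each shard down to a single segment.
requests.put(f"{ES}/{index}/_settings", json={"index.blocks.write": True})
requests.post(f"{ES}/{index}/_forcemerge", params={"max_num_segments": 1})
```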

How long are you keeping data? How many shards per node?

Hello again Len,

The disk I/O does queue from time to time; it's using slow storage, but it's always been that way. In terms of data retention, as far as the Winlogbeat indices go we're keeping the logs for 12 months. There are 509 shards per node.

As I say, there's no indication from the Elasticsearch / Logstash nodes that they can't handle what I'm throwing at them - I know that via the Beats framework, Logstash / Elasticsearch has the ability to tell the collectors to back off - would that appear in a log somewhere?
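
In the meantime, here's a sketch of the kind of check I'm planning to run for rejected bulk requests on the Elasticsearch side - the host is an assumption, and on 6.x the relevant thread pool may show up as either bulk or write:

```python
# Check bulk/write thread pool queue and rejection counters on every node.
# Rejections here usually mean Elasticsearch is pushing back on indexing.
# Host is an assumption; the pool name differs between 6.x minor versions.
import requests

ES = "http://localhost:9200"   # assumption
stats = requests.get(f"{ES}/_nodes/stats/thread_pool").json()
for node_id, node in stats["nodes"].items():
    pools = node["thread_pool"]
    for pool_name in ("bulk", "write"):
        if pool_name in pools:
            pool = pools[pool_name]
            print(node["name"], pool_name,
                  "queue:", pool["queue"], "rejected:", pool["rejected"])
```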

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.