Regarding Bulk Indexing Requests

Hi,
Is there a way I can view the size (number of documents) of a bulk index request once it hits the Elasticsearch queue? I want to estimate how efficiently Logstash is batching the requests it pushes to Elasticsearch.

Thx
D

No, I do not think there is. Logstash works with a specific max pipeline batch size (defaults to 125), and as far as I know this is the largest bulk size that will be sent. If you have multiple Elasticsearch output plugins configured, controlled by conditionals, the bulk sizes can by default end up much smaller, which can indeed be inefficient.
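
To make the conditional-output point concrete, here is a minimal sketch of the pattern that fragments bulks (hosts, field names and index names are purely illustrative):

```
# Anti-pattern sketch: each conditional branch has its own elasticsearch
# output, so a single 125-event pipeline batch is split across the branches
# and each output ends up sending a much smaller bulk request.
output {
  if [type] == "apache" {
    elasticsearch { hosts => ["http://localhost:9200"] index => "apache-%{+YYYY.MM.dd}" }
  } else if [type] == "syslog" {
    elasticsearch { hosts => ["http://localhost:9200"] index => "syslog-%{+YYYY.MM.dd}" }
  } else {
    elasticsearch { hosts => ["http://localhost:9200"] index => "misc-%{+YYYY.MM.dd}" }
  }
}
```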

So how can we measure our pipeline efficiency? Sending lots of small, salami-sized bulks to Elasticsearch can make it look like Elasticsearch is the bottleneck when it may not be.

I would recommend the following:

  • Try to have reasonably focused pipelines with a single Elasticsearch output plugin each. Use multiple defined pipelines together with pipeline-to-pipeline communication to control the flow. Avoid having a large number of different Elasticsearch outputs controlled by conditionals. If you think you need the Elasticsearch output to send data to multiple indices, store the index name in the metadata and use this in a single Elasticsearch output (see the sketch after this list).
  • Increase the batch size to something larger, e.g. 1000 or 2000 (see the pipelines.yml example after this list). If you have focused pipelines there is rarely a need to go much higher.
  • Use Logstash monitoring metrics to see how much data is processed by different plugins.
  • Have a look at the Logstash performance tuning and monitoring guide if you have not already.
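
For the first point, a minimal sketch of the single-output pattern (field and index names are illustrative, not a recommendation for any specific setup):

```
# Route events by setting the target index in @metadata early on, then let a
# single elasticsearch output build full-sized bulks for all events.
filter {
  if [type] == "apache" {
    mutate { add_field => { "[@metadata][target_index]" => "apache-%{+YYYY.MM.dd}" } }
  } else {
    mutate { add_field => { "[@metadata][target_index]" => "misc-%{+YYYY.MM.dd}" } }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "%{[@metadata][target_index]}"
  }
}
```

For the second point, batch size and worker count are set per pipeline in pipelines.yml; the values below are examples only:

```
# pipelines.yml sketch -- pipeline id, path and values are illustrative
- pipeline.id: beats-ingest
  path.config: "/etc/logstash/conf.d/beats-ingest.conf"
  pipeline.workers: 4        # defaults to the number of CPU cores
  pipeline.batch.size: 1000  # default is 125
  pipeline.batch.delay: 50   # ms to wait for a batch to fill, default 50
```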

Am I correct in thinking that reading messages from message queues will saturate the Logstash pipeline more effectively than having large numbers of Beats, or any other agent types, fire their output directly at the input port? And that, assuming a sane config, it will therefore size bulk writes more efficiently?

Reading messages from a large message queue is likely to result in full batches of events being processed. I am not sure how the input plugins receiving data from Beats map this to batches, so I am not sure whether it would have any impact. That said, I do not think it would hurt performance.

I say this because I've seen much better scaling on the backend when reading from queues (pull) vs receiving directly from agents (push). But my thinking is being disputed internally because there aren't metrics which directly corroborate my stance.

I have not run any tests or benchmarks on it so do not know for sure. I do suspect using a message queue would help in a number of ways if you have a very large number of Beats feeding data, e.g. more evenly distributed load and added resiliency. If the pipelines that are receiving data are inefficient (as per the initial notes), adding a message queue is unlikely to improve performance much by itself.

Our pipelines are set up in the manner you recommend.

Have you overridden the default batch size?

Yes. We also run extra workers. Should we consider increasing the batch delay?

Not sure that will help much. Let's take a step back first.

How have you determined that Elasticsearch is not the limiting factor? What indexing throughput are you seeing? How many indices and shards are you actively indexing into?

What is the size and specification of your cluster? What kind of storage are you using?

Elasticsearch isn't the limiting factor because we've run tests where, for a given fixed topology (Elasticsearch/Logstash), we see higher sustained throughput with no loss when reading from queues. We see loss when pushing, and the ingestion rate is more erratic. Logstash isn't seeing 429s from the backend.
The shard layout is set at one primary and one replica per node.
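
(For anyone following along, one way to cross-check that on the Elasticsearch side is the cat thread pool API; the host below is a placeholder.)

```
# A growing "rejected" count on the write thread pool is what would surface
# as 429s in the Logstash logs.
curl -s "http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected"
```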

OK. Good to see that you have taken a systematic approach. Sounds like a reasonable conclusion.

What do your pipelines look like? Have you looked at the pipeline performance statistics to see if you have any slow/expensive filters configured?

That is a next step to take.

How can we determine, aside from cpu/heap utilisation, how/why logstash is choking on the frontend?

How many indices are you actively indexing into? Is it just one?

With respect to tuning Logstash I would look at the performance metrics and try to identify inefficient filters. I have on numerous occasions seen, e.g., inefficient grok filters slow down throughput a lot. There are also filters that rely on callouts that can severely limit throughput.
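
If it helps, the per-plugin numbers are exposed through the Logstash monitoring API (default port 9600); the per-pipeline filter statistics in the response are what I would look at first:

```
# Per-pipeline, per-plugin event counts and cumulative processing time.
# Filters whose events.duration_in_millis is large relative to events.out
# are the ones worth looking at first.
curl -s "http://localhost:9600/_node/stats/pipelines?pretty"
```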

Yes, just a single index. We don't actually do any heavy transforms on our data. We try to keep it relatively simple - just index the data as smoothly as possible and leave data formatting to the upstream.