Optimizing process time

Hello everyone,

I'm pretty new to the Elastic ecosystem.
For work, I have to analyze our Logstash process, understand it, and improve it if possible.
So I have a lot of questions.

Firstly, a little bit of context:

Logstash is used to transfer data from ES index A to ES index B (each located on a different server).

So for the input, we use the elasticsearch plugin to query index A.
Then a lot of filters are applied to each record/event.
Finally, the output uses the elasticsearch plugin to write into index B.

For now, the ES query returns approximately 500k documents.
The complete processing time is 2 hours. That's too much for us, since we must run the process during the night.
So the goal is to reduce the time to 1 hour maximum.

So, to reduce this time, a few questions:

  • Is there a way to make a bulk query? => If I understand correctly, each event is processed individually: input -> filters -> output. I'm wondering if it would be possible to filter all the events and then send all the records to ES index B in one go.

  • We also use plugins-filters-translate to compare some properties against a dictionary. Are the dictionaries read for each event during the process, or does Logstash read them once and keep them in memory? => For now, I'm not sure this is a performance issue, because the dictionaries are very light, but over time they will grow.

  • About the configuration, I read the documentation about workers and batch.size. I guess we need to audit our infra while the process runs before increasing those properties?
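For the last question, the relevant knobs live in logstash.yml (or can be passed on the command line). A minimal sketch, assuming roughly default values as the starting point; the numbers below are placeholders to tune against your own CPU and heap measurements, not recommendations:

```yaml
# logstash.yml -- a sketch; tune against measurements, not guesses
pipeline.workers: 8        # defaults to the number of CPU cores; raise only if cores sit idle
pipeline.batch.size: 500   # events each worker collects before running filters/outputs
pipeline.batch.delay: 50   # ms to wait for a batch to fill before flushing it anyway
```

Note that the elasticsearch output already sends its documents to ES using the bulk API, with batches derived from pipeline.batch.size, which partly addresses the first question as well.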

Thanks,

It is worth looking at those filters and the order they're used in, and thinking about whether you can make them work more efficiently. I once made Logstash process some data 50x faster by looking at the data and altering the Logstash filters it went through based on what I found. E.g. the logs were going through a grok filter which was trying multiple patterns for every event. I realised that one pattern matched much more often than the other patterns did, so I made grok try that pattern first and Logstash got faster.

There were a bunch of conditional statements like

 if [field] == "one" {
   do_stuff
 } else if [field] == "two" {
   do_other_stuff
 } else if [field] == "three" {
   do_some_other_stuff
 } else if [field] == "four" {
   do_yet_other_stuff
 }
I realised that the =="four" condition would be matched a lot more often than =="two", so I moved the =="four" check ahead of =="two" and Logstash got faster.
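The grok case looks like this in config form (a sketch with hypothetical patterns and field names; grok tries the listed patterns top to bottom and stops at the first match, so putting the most frequent format first saves work on most events):

```
filter {
  grok {
    # Patterns are tried in order; list the one that matches most often first.
    match => {
      "message" => [
        "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}",  # common format, tried first
        "%{SYSLOGTIMESTAMP:ts} %{GREEDYDATA:msg}"                       # rarer format, fallback
      ]
    }
  }
}
```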

When I was experimenting I used the metrics filter and the stdout output plugin to compare how many events per second each variant of the filtering could process, as described at
https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html
While testing I had Logstash send all the processed data to

output {
    null {}
}

because it's not necessary to have Logstash output the data anywhere useful to find out if its filters are faster.
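Put together, a throughput-test pipeline along those lines might look like this (a sketch; the meter name, tag, and flush interval are arbitrary choices, and the input/filter sections stand in for your real ones):

```
input {
  elasticsearch {
    # ... same input as the real pipeline ...
  }
}
filter {
  # ... the filters being benchmarked ...
  metrics {
    meter => "events"        # counts events and derives 1m/5m/15m rates
    add_tag => "metric"      # lets the output tell metric events apart from data
    flush_interval => 10     # emit a metrics event every 10 seconds
  }
}
output {
  if "metric" in [tags] {
    stdout {
      # print only the throughput, e.g. the 1-minute rate
      codec => line { format => "rate_1m: %{[events][rate_1m]}" }
    }
  } else {
    null {}                  # discard the real data while testing
  }
}
```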
