Logstash's Elasticsearch filter plugin bottleneck

I have a Logstash pipeline set up with multiple Elasticsearch filter plugins that look up field values on events ingested via an input plugin. The pipeline normally processes and writes around 500,000 events per hour into Elasticsearch. I recently added a few more Elasticsearch filter plugins and, to be fair, their queries are a little more complex than the previous ones, so I was expecting a slight hit in performance. However, the number of events written to Elasticsearch is currently less than 100,000 per hour.

I checked the CPU usage of the Elasticsearch nodes and they do not exceed 30%. I suspect that these new Elasticsearch filters are bottlenecking the flow. How can I check further to verify this? Perhaps by getting some stats that show the bottleneck?

To add on, what are some configurations I can look into to potentially increase the search performance? On Logstash's side I have configured persistent queueing, but due to the bottleneck I'm assuming events are getting dropped.

TIA

I suggest using the pipeline stats API to look at the time spent in each plugin (input, filter, and output). If most of the cost of your pipeline is in the elasticsearch filter and you add a second one then that could double the cost of the pipeline and halve the throughput.

It will be easier to interpret the output if you set the id option on each filter plugin. Otherwise the ids will be randomly generated.
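For example (a sketch, assuming the Logstash monitoring API is on its default of localhost:9600; the id and hosts below are placeholders):

# Per-plugin timings for every pipeline
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'

And setting an explicit id on a filter looks like:

filter {
  elasticsearch {
    id    => "es_port_lookup"             # shows up as-is in the stats output
    hosts => ["https://my-es-host:9200"]  # placeholder
    # ... query/query_template and other options as before
  }
}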


In the past I had an issue with Logstash being limited on writes to Elasticsearch; it turned out to be an open file limit and a couple of kernel parameters on my Linux host. I now write around 420,000 events per hour without issue.

One of them was in the startup service file; increase that:

LimitNOFILE=

Another kernel parameter is fs.file-max.

Look for the shared memory segment settings in sysctl.conf and increase them if you can.

Also check the /etc/security/limits.conf file and increase the open file limit for Logstash.
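To illustrate, the places mentioned above look roughly like this (paths are the usual defaults; the numbers are examples, not recommendations):

# systemd unit override for Logstash (e.g. via `systemctl edit logstash`)
[Service]
LimitNOFILE=65536

# /etc/sysctl.conf (apply with `sysctl -p`)
fs.file-max = 200000

# /etc/security/limits.conf -- raise the open file limit for the logstash user
logstash  soft  nofile  65536
logstash  hard  nofile  65536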

Thanks for the suggestion. I was able to get some stats for the filter plugin.

Below are the stats for an existing filter that did not encounter any slowdowns:

{
  "id": "es_filter_fp_port",
  "name": "elasticsearch",
  "events": {
    "out": 66702,
    "duration_in_millis": 97493,
    "in": 66702
  },
  "flow": {
    "worker_millis_per_event": {
      "current": 1.386,
      "last_1_minute": 1.432,
      "last_5_minutes": 1.392,
      "last_15_minutes": 1.4,
      "lifetime": 1.462
    },
    "worker_utilization": {
      "current": 0.07578,
      "last_1_minute": 0.1105,
      "last_5_minutes": 0.1369,
      "last_15_minutes": 0.1384,
      "lifetime": 0.1983
    }
  }
}

Below are the stats for the new filter:

{
  "id": "es_filter_flow_dest_geo",
  "name": "elasticsearch",
  "events": {
    "out": 173832,
    "duration_in_millis": 29832992,
    "in": 176230
  },
  "flow": {
    "worker_millis_per_event": {
      "current": 176.8,
      "last_1_minute": 181,
      "last_5_minutes": 172,
      "last_15_minutes": 171.5,
      "lifetime": 169.3
    },
    "worker_utilization": {
      "current": 71.63,
      "last_1_minute": 69.51,
      "last_5_minutes": 73.35,
      "last_15_minutes": 72.74,
      "lifetime": 60.69
    }
  }
}

As seen above, the millis-per-event values are much higher for the new filter, as is the worker utilization. Are there optimizations I can do on Logstash's side, perhaps increasing the number of workers? Or is this purely down to Elasticsearch's search performance?
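For reference, the worker-related settings live in pipelines.yml; a sketch with example values only (the pipeline id and numbers are placeholders):

# pipelines.yml -- example values, not recommendations
- pipeline.id: main            # placeholder pipeline id
  pipeline.workers: 8          # defaults to the number of CPU cores
  pipeline.batch.size: 250     # default is 125 events per worker batch
  queue.type: persisted        # persistent queue, as already configured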

Thanks for the suggestion. I will try this out, but it most likely won't fix the current issue, as I found the main culprit to be Elasticsearch's search performance.

Please do not post pictures of text; they are not searchable and are not readable for some people. Just post the text.

Er, you may also want to revisit that claim/assumption and double-check the filters. Back of the envelope from the stats you posted, the new filter spends over 100x more milliseconds per event (~172 ms vs ~1.4 ms).


Can you share your old and new queries?


Got it. I replaced the images with text



Old query:

{
  "size": 1,
  "sort": [ {"insert_time" : "desc"} ],
  "query": {
    "term": {
      "port.keyword": "%{[destination_port_raw]}"
    } 
  }
}

New query:

{
  "size": 1,
  "sort": [ {"geo_lower_decimal": "desc"} ],
  "query": {
    "range": {
      "geo_lower_decimal": {
        "lte": "%{[destinationAddress_dec]}"
      }
    }
  }
}

Those queries are significantly different.
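If you want to see where Elasticsearch spends the time on the new query, one option is to run it by hand with the search profile API (a sketch; the host, index name, and substituted decimal value are placeholders):

curl -s 'http://localhost:9200/my-geo-lookup-index/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "profile": true,
  "size": 1,
  "sort": [ { "geo_lower_decimal": "desc" } ],
  "query": {
    "range": {
      "geo_lower_decimal": { "lte": 167772160 }
    }
  }
}'

The profile section of the response breaks down how long each query component took on each shard.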