Lots of queue files on Logstash

Hi,

We are running an ELK cluster on version 5.6.16.

Our setup is as below:
Filebeat --> Logstash --> Elasticsearch (3 servers in a cluster) --> Kibana

We have been observing very high queuing on our Logstash servers for quite some time. It keeps increasing during peak hours and settles down during lean hours. At times the queue file count reaches 32 files.

We are using a hybrid SAN volume for storing data (we didn't see much I/O wait on it), and have configured heap sizes of 31GB for Elasticsearch and 12GB for Logstash.

Are there any other settings we should configure on our stack to boost Logstash + Elasticsearch performance?

Let us know if any other info is required.
Regards
Nishant

What filters are you using? It is possible that some are consuming more resources than they need to (e.g., some grok patterns are susceptible to backtracking on certain pathological inputs), but in general it looks like your Logstash may be underprovisioned. I would check the node stats API (http://localhost:9600/_node/stats/pipelines?pretty; note that the API is bound to localhost by default and won't respond on public interfaces) and look for plugins with a high duration_in_millis, indicating that a lot of time is spent blocking in that plugin.


I presume you're using the persistent queue? If so, a word of caution: the PQ uses memory-mapped files to hold the queue, and behaviour on non-local volumes is neither defined nor tested (e.g., it may be slow, or it may not work as expected).
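For reference, the relevant persistent-queue settings live in logstash.yml; the values below are illustrative placeholders, not a tuning recommendation:

```yaml
# logstash.yml -- illustrative values only
queue.type: persisted
# keep the queue on fast local storage; remote/SAN volumes are untested
path.queue: /var/lib/logstash/queue
# cap on the on-disk queue size; when full, Logstash back-pressures the beats input
queue.max_bytes: 8gb
```

The key point is path.queue: the memory-mapped page files should sit on a local filesystem, not a network-attached volume.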

Hi yaauie,

Yes, we are using the persistent queue, and it keeps growing throughout the day. Because of this, we don't see logs in Kibana in real time. Sometimes the lag can be 30 minutes, 1 hour, 2 hours, and so on.

{
	"host" : "1-.t.com",
	"version" : "5.6.16",
	"http_address" : "127.0.0.1:9600",
	"id" : "3a60c1c6-b58a-46cd-a61f-3945774b6c97",
	"name" : "1-.t.com",
	"pipeline" : {
		"events" : {
			"duration_in_millis" : 9523765260,
			"in" : 3105093640,
			"filtered" : 3106875704,
			"out" : 3106869704,
			"queue_push_duration_in_millis" : 385434330
		},
		"plugins" : {
			"inputs" : [
				{
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-1",
					"events" : {
						"out" : 3105093640,
						"queue_push_duration_in_millis" : 385434330
					},
					"current_connections" : 30,
					"name" : "beats",
					"peak_connections" : 44
				}
			],
			"filters" : [
				{
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-7",
					"events" : {
						"duration_in_millis" : 157696843,
						"in" : 1260179547,
						"out" : 1260179547
					},
					"matches" : 1260179547,
					"failures" : 0,
					"patterns_per_field" : {
						"message" : 1
					},
					"name" : "grok"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-17",
					"events" : {
						"duration_in_millis" : 153603440,
						"in" : 529352300,
						"out" : 529352300
					},
					"matches" : 529347697,
					"failures" : 4603,
					"patterns_per_field" : {
						"message" : 1
					},
					"name" : "grok"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-11",
					"events" : {
						"duration_in_millis" : 199681865,
						"in" : 1260179546,
						"out" : 1260179546
					},
					"name" : "prune"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-41",
					"events" : {
						"duration_in_millis" : 528051431,
						"in" : 2170767,
						"out" : 2170767
					},
					"name" : "geoip"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-37",
					"events" : {
						"duration_in_millis" : 125614046,
						"in" : 569540943,
						"out" : 569540942
					},
					"name" : "kv"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-16",
					"events" : {
						"duration_in_millis" : 12582232,
						"in" : 62976421,
						"out" : 62976421
					},
					"name" : "prune"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-5",
					"events" : {
						"duration_in_millis" : 30860886,
						"in" : 634806175,
						"out" : 634806175
					},
					"name" : "mutate"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-47",
					"events" : {
						"duration_in_millis" : 153855446,
						"in" : 35960305,
						"out" : 35960305
					},
					"name" : "geoip"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-4",
					"events" : {
						"duration_in_millis" : 140536649,
						"in" : 634806176,
						"out" : 634806175
					},
					"name" : "kv"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-6",
					"events" : {
						"duration_in_millis" : 144866215,
						"in" : 634806175,
						"out" : 634806175
					},
					"name" : "prune"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-33",
					"events" : {
						"duration_in_millis" : 137444946,
						"in" : 569540946,
						"out" : 569540946
					},
					"matches" : 569540870,
					"failures" : 76,
					"patterns_per_field" : {
						"message" : 1
					},
					"name" : "grok"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-38",
					"events" : {
						"duration_in_millis" : 126561866,
						"in" : 569540944,
						"out" : 569540942
					},
					"name" : "prune"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-9",
					"events" : {
						"duration_in_millis" : 172135833,
						"in" : 1260179547,
						"out" : 1260179547
					},
					"name" : "kv"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-2",
					"events" : {
						"duration_in_millis" : 167517968,
						"in" : 634806178,
						"out" : 634806178
					},
					"matches" : 634372418,
					"failures" : 433760,
					"patterns_per_field" : {
						"message" : 1
					},
					"name" : "grok"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-12",
					"events" : {
						"duration_in_millis" : 14023274,
						"in" : 62976423,
						"out" : 62976423
					},
					"matches" : 62976422,
					"failures" : 1,
					"patterns_per_field" : {
						"message" : 1
					},
					"name" : "grok"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-29",
					"events" : {
						"duration_in_millis" : 687720912,
						"in" : 11890809,
						"out" : 11890809
					},
					"name" : "geoip"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-21",
					"events" : {
						"duration_in_millis" : 106403046,
						"in" : 529352300,
						"out" : 529352300
					},
					"name" : "prune"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-35",
					"events" : {
						"duration_in_millis" : 3846197030,
						"in" : 569540947,
						"out" : 569540947
					},
					"name" : "geoip"
				}, {
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-19",
					"events" : {
						"duration_in_millis" : 101377489,
						"in" : 529352300,
						"out" : 529352300
					},
					"name" : "kv"
				}
			],
			"outputs" : [
				{
					"id" : "889fe0d951b4ffe768cadccd6ef5912482dbf67a-56",
					"events" : {
					"duration_in_millis" : 1543045781,
					"in" : 3106875704,
					"out" : 3106869704
					},
					"name" : "elasticsearch"
				}
			]
		},
		"reloads" : {
			"last_error" : null,
			"successes" : 0,
			"last_success_timestamp" : null,
			"last_failure_timestamp" : null,
			"failures" : 0
		},
		"queue" : {
			"events" : 49659,
			"type" : "persisted",
			"capacity" : {
				"queue_size_in_bytes" : 3215836206,
				"page_capacity_in_bytes" : 1073741824,
				"max_queue_size_in_bytes" : 42949672960,
				"max_unread_events" : 0
			},
			"data" : {
				"path" : "/var/lib/logstash/queue/main",
				"free_space_in_bytes" : 779875213312,
				"storage_type" : "ext4"
			}
		},
		"id" : "main"
	}
}

[EDIT: fixed a syntax error in the above JSON and enclosed in code fences -- @yaauie]

Our Logstash file with filters. (I have removed a few filters due to the length limitation of this post; there are similar filters for 4 to 5 more types.)

input {
    beats {
      port => 5044
      ssl => true
      ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
      ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
    }

}

filter {
        if [type] == "test1" {
                grok {
                        patterns_dir => "/etc/logstash/patterns"
                        match => {
                                "message" => "%{NGINXACCESS}"
                        }
                }
                date {
                        match => [ "timestamp_match", "dd/MMM/YYYY:HH:mm:ss Z" ]
#                       match => [ "log_timestamp" , "dd/MMM/YYYY:HH:mm:ss Z" ]
                        target => "@timestamp"
                }
		kv {
			source => "request"
                        field_split => "&?"
                }


               mutate {
                      convert => [ "[response]", "integer"]
                      convert => [ "[bytes_read]", "integer"]
                      convert => [ "[request_duration]", "float"]
                      convert => [ "[upstream_request_duration]", "float"]
                }


    		prune {
                        whitelist_names => 	["request","httpversion","timestamp","message","Country","beat.name","bytes_read","clientip","host","http_pkaxff","http_host","http_real_ip","request_duration","request_header_agent","request_header_referer","response","source","upstream_request_duration","x_forwarded_for","auth","user","method","xforwardedfor","geoip"]
		}
		

        }
}

filter {
        if [type] == "test2" {
                grok {
                        patterns_dir => "/etc/logstash/patterns"
                        match => {
                                "message" => "%{WEBSERVICES}"
                        }
                }
                date {
                        match => [ "log_timestamp" , "dd/MMM/YYYY:HH:mm:ss Z" ]
                }
                kv {
                        source => "request"
                        field_split => "&?"
                }

               mutate {
                      convert => [ "[response]", "integer"]
                      convert => [ "[bytes_read]", "integer"]
                      convert => [ "[request_duration]", "float"]
                      convert => [ "[upstream_request_duration]", "float"]
                }

		prune {
                        whitelist_names =>      ["request","httpversion","timestamp","message","Country","beat.name","bytes_read","clientip","host","http_pkaxff","http_host","http_real_ip","request_duration","request_header_agent","request_header_referer","response","source","upstream_request_duration","x_forwarded_for","auth","user","method"]
                }

		
        }
}

filter {
        if [type] == "test8" {
                grok {
                        patterns_dir => "/etc/logstash/patterns"
                        match => {
                                "message" => "%{NGINXACCESS}"
                        }
                }
                date {
                        match => [ "log_timestamp" , "dd/MMM/YYYY:HH:mm:ss Z" ]
                }

                geoip {
                      source => "http_real_ip"
                      target => "geoip"
                      database => "/etc/logstash/GeoLite2-City.mmdb"
                      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
                      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
                }
                mutate {
                      convert => [ "[geoip][coordinates]", "float"]
                      convert => [ "[response]", "integer"]
                      convert => [ "[bytes_read]", "integer"]
                      convert => [ "[request_duration]", "float"]
                      convert => [ "[upstream_request_duration]", "float"]

                }

                kv {
                        source => "request"
                        field_split => "&?"
                }
                prune {
                        whitelist_names =>      ["request","httpversion","timestamp","message","Country","beat.name","bytes_read","clientip","host","http_pkaxff","http_host","http_real_ip","request_duration","request_header_agent","request_header_referer","response","source","upstream_request_duration","x_forwarded_for","auth","user","method","xforwardedfor","geoip"]
                }


        }
}

filter {
        if [type] == "error" {
                grok {
                        patterns_dir => "/etc/logstash/patterns"
                        match => {
                                "message" => "%{NGINXERROR}"
                        }
                }
		date {
			match => [ "timestamp" , "YYYY/MM/dd HH:mm:ss" ]
			remove_field => [ "timestamp" ]
		}
                kv {
                        source => "request"
                        field_split => "&?"
                }
        }
}


filter {
        if [type] == "php-fpmerror" {
                grok {
		        match => [ "message", "\[%{MONTHDAY:day}-%{MONTH:month}-%{YEAR:year} %{TIME:time} %{WORD:zone}/%{WORD:country}\] PHP %{DATA:level}\:  %{GREEDYDATA:error}" ]
		        add_field    => { "timestamp" => "%{day}-%{month}-%{year} %{time} %{zone}/%{country}" }
		        add_tag      => [ "%{level}" ]
		        add_field    => { "index" => "%{[@metadata][beat]}-%{+YYYY.MM.dd}" }
		        remove_field => [ "day", "month", "year", "time", "zone", "country" ]
                }
                kv {
                        source => "request"
                        field_split => "&?"
                }
        }
}


output {

    elasticsearch { 
        hosts => ["192.168.2.125:9200", "192.168.2.126:9200", "192.168.2.127:9200"]
        sniffing => true
        manage_template => true
	index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
        document_type => "%{[@metadata][type]}"
	#stdout { debug => true }

    }
    #stdout { codec => rubydebug }	

}

When Logstash cannot reliably keep up with inbound traffic, it is a throughput issue, indicating that the Logstash host is underprovisioned for the current workload.

There are two paths forward:

  • provision Logstash onto hardware with more resources (e.g., more CPUs on one host or multiple hosts)
  • tune your pipeline to reduce the cost of the workload
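For the tuning path, the usual first knobs are the pipeline worker and batch settings in logstash.yml. The values below are placeholders to benchmark against your own workload, not recommendations:

```yaml
# logstash.yml -- placeholder values; benchmark before adopting
# one worker per CPU core is the usual starting point
pipeline.workers: 8
# larger batches amortise per-batch overhead, at the cost of latency and heap
pipeline.batch.size: 250
```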

I've taken all of the filters in your pipeline from the node_stats you sent and have determined each of their raw throughputs, showing that your four geoip plugins are hogging a lot of the CPU time:

+------------------------+----------------------------------------------+------------+-------------+----------+
|         FILTER         |                  PLUGIN ID                   |   EVENTS   |    RATE     | CPU TIME |
+------------------------+----------------------------------------------+------------+-------------+----------+
| logstash-filter-geoip  | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-41: |    2170767 | @    4.11/s |  528051s |
| logstash-filter-geoip  | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-29: |   11890809 | @   17.29/s |  687720s |
| logstash-filter-geoip  | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-35: |  569540947 | @  148.07/s | 3846197s |
| logstash-filter-geoip  | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-47: |   35960305 | @  233.72/s |  153855s |
| logstash-filter-grok   | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-17: |  529352300 | @ 3446.22/s |  153603s |
| logstash-filter-grok   | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-2:  |  634806178 | @ 3789.48/s |  167517s |
| logstash-filter-grok   | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-33: |  569540946 | @ 4143.77/s |  137444s |
| logstash-filter-prune  | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-6:  |  634806175 | @ 4382.01/s |  144866s |
| logstash-filter-grok   | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-12: |   62976423 | @ 4490.85/s |   14023s |
| logstash-filter-prune  | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-38: |  569540944 | @ 4500.09/s |  126561s |
| logstash-filter-kv     | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-4:  |  634806176 | @ 4517.01/s |  140536s |
| logstash-filter-kv     | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-37: |  569540943 | @ 4534.05/s |  125614s |
| logstash-filter-prune  | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-21: |  529352300 | @ 4974.97/s |  106403s |
| logstash-filter-prune  | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-16: |   62976421 | @ 5005.18/s |   12582s |
| logstash-filter-kv     | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-19: |  529352300 | @ 5221.59/s |  101377s |
| logstash-filter-prune  | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-11: | 1260179546 | @ 6310.93/s |  199681s |
| logstash-filter-kv     | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-9:  | 1260179547 | @ 7320.84/s |  172135s |
| logstash-filter-grok   | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-7:  | 1260179547 | @ 7991.15/s |  157696s |
| logstash-filter-mutate | 889fe0d951b4ffe768cadccd6ef5912482dbf67a-5:  |  634806175 | @20569.92/s |   30860s |
+------------------------+----------------------------------------------+------------+-------------+----------+

You would likely stand to benefit greatly by providing a cache_size directive to amortise the cost of the expensive geoip lookups.

There are a couple of caveats, called out in the linked docs but repeated here so you don't miss them:

  • the default value for this (if left unspecified) is 1000, so specifying a larger value will increase the likelihood of a cache hit.
  • due to how the underlying library is written, this value is global within the Logstash process, with the last declaration "winning" and overwriting any previous declarations. You must therefore specify the same value on every invocation of the geoip filter; omitting it on one may cause that filter's default to override an explicit value set on a different filter.
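Applied to one of the geoip blocks from the pipeline above, that would look like the sketch below (the cache_size of 25000 is an arbitrary starting point; per the second caveat, use the same value in all four geoip filters):

```
geoip {
    source => "http_real_ip"
    target => "geoip"
    database => "/etc/logstash/GeoLite2-City.mmdb"
    # cache_size is global to the process: declare the same value in every
    # geoip filter, or the last-loaded declaration silently wins
    cache_size => 25000
}
```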

Thanks. Our setup has been working smoothly since we removed the geoip overhead from Logstash. We are now getting logs in real time.