Speed up processing of logs

Hi, I am using Logstash 5.1.1 to parse files of 2-3 GB each (7-8M records), and there are thousands of them in an S3 bucket. My current speed is barely one file per day (260 logs per second) on an m3.xlarge AWS instance (4 CPUs, 15 GB memory), with CPU utilization at 97%. I start with -w 30, which means 30 workers. I am sending to a single ES server with basic settings. I am using quite a lot of filters, shown below.

Are there any suggestions on how to speed up my parsing?

Would a larger instance for Logstash help?

I don't think I can just start another Logstash instance, because it would mess up the sincedb file.

Thanks!!

filter {
  grok {
    break_on_match => false
    match => { "message" => "%{SYSLOGBASE} %{GREEDYDATA:message}" }
    overwrite => [ "message" ]
  }
  json {
    source => "message"
  }
  date {
    match => [ "date", "YYYY-MM-dd HH:mm:ss", "ISO8601" ]
    timezone => "UTC"
    target => "date"
  }
  urldecode {
    field => "url"
  }
  mutate {
    lowercase => [ "url" ]
    gsub => [ "url", "[\\"]", "" ]
  }
  kv {
    source => "url"
    field_split => "&?"
  }
  grok {
    break_on_match => false
    match => { "url" => "%{URI:uri}id/%{GREEDYDATA:image_id}" }
    match => { "url" => "%{URI:uri}category/%{GREEDYDATA:category_id}" }
    match => { "url" => "%{URI:uri}tag/%{GREEDYDATA:tag_content}" }
    match => { "image_id" => "%{NUMBER:image_id}\?%{GREEDYDATA:uri_else}" }
    overwrite => "image_id"
    match => { "category_id" => "%{NUMBER:category_id}\?%{GREEDYDATA:uri_else}" }
    overwrite => "category_id"
    match => { "tag_content" => "%{GREEDYDATA:tag_content}\?%{GREEDYDATA:uri_else}" }
    overwrite => "tag_content"
  }

  mutate {
    rename => { "[k]" => "search_term" }
    gsub => [ "search_term", "[++]", " " ]
  }

  mutate {
    rename => { "[details][nb_content_ids]" => "nb_contents_ids" }
    rename => { "[details][collection_positions]" => "collection_positions" }
    rename => { "[id]" => "image_id" }
    split => { "collection_positions" => "," }
  }

  geoip {
    source => "ip"
  }

  prune {
    whitelist_names => [ "@timestamp", "geoip", "url", "search_query", "app_id", "member_id", "ip", "is_buyer", "action", "date", "nb_contents_ids", "content_ids", "collection_positions", "image_id", "tag_content", "search_term", "category_id" ]
  }
}

How Logstash is tuned has changed over time as the pipeline has changed, and you do not state which version you are using. I will therefore assume you are using the latest version.

Unless input or output performance is the limiting factor, Logstash processing is generally limited by the amount of CPU available. Having 30 worker threads with only 4 CPU cores does seem excessive and could result in a fair bit of context switching. I would recommend starting at 4 and slowly increasing until you are able to saturate the CPU like you do now. You may also want to increase the internal batch size a bit, e.g. to 500, in order to get larger bulk requests sent to Elasticsearch.
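To make the "start at 4 and slowly increase" advice concrete, here is a minimal sketch that only builds the command lines for such a sweep. The `-w` and `-b` flags are the ones mentioned in this thread and `-f` points at the config file; the `bin/logstash` path, the `pipeline.conf` file name, and the step size of 2 are assumptions for illustration, not recommendations.

```python
def logstash_command(config_path, workers, batch_size, binary="bin/logstash"):
    """Build the argument list for one Logstash run with explicit tuning flags."""
    # -w sets the number of pipeline workers, -b the internal batch size.
    return [binary, "-f", config_path, "-w", str(workers), "-b", str(batch_size)]


def worker_steps(cores, max_workers, step=2):
    """Worker counts to try: start at the core count and increase in small steps."""
    return list(range(cores, max_workers + 1, step))


# Hypothetical sweep for a 4-core machine with the batch size suggested above.
for workers in worker_steps(cores=4, max_workers=8):
    print(" ".join(logstash_command("pipeline.conf", workers, 500)))
```

Run each command for a while, note throughput and CPU utilization, and stop increasing the worker count once the CPU is saturated.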

As Logstash in your case is CPU bound, moving to a larger instance with more CPU should improve throughput.

As your processing logic uses a decent number of filters, you may also be able to improve the efficiency of your filter configuration. The best way to get an understanding of what is taking time is to use the relatively new node stats API, and get a breakdown by filter. I can e.g. see that you have set break_on_match to false, which will cause all grok patterns to always be evaluated, using up precious CPU cycles.
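As a rough illustration of getting a breakdown by filter, the sketch below ranks filters by average time per event from a node stats response that has already been parsed into a dict. The field layout (`pipeline` -> `plugins` -> `filters`, each with `name`, `id`, and an `events` object holding `in` and `duration_in_millis`) is my assumption about the Logstash 5.x response, so check it against your own output. The sample numbers are made up.

```python
def rank_filters(stats):
    """Return (name, id, total_ms, ms_per_event) tuples, slowest per event first."""
    # Assumed layout for Logstash 5.x; later versions nest plugin stats
    # under "pipelines" -> <pipeline id> instead of "pipeline".
    filters = stats.get("pipeline", {}).get("plugins", {}).get("filters", [])
    rows = []
    for f in filters:
        events = f.get("events", {})
        seen = events.get("in", 0)
        total_ms = events.get("duration_in_millis", 0)
        per_event = total_ms / seen if seen else 0.0
        rows.append((f.get("name", "?"), f.get("id", "?"), total_ms, per_event))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical sample data, not real measurements.
sample = {
    "pipeline": {
        "plugins": {
            "filters": [
                {"id": "mutate-1", "name": "mutate",
                 "events": {"in": 1000, "out": 1000, "duration_in_millis": 50}},
                {"id": "grok-2", "name": "grok",
                 "events": {"in": 1000, "out": 1000, "duration_in_millis": 900}},
                {"id": "geoip-3", "name": "geoip",
                 "events": {"in": 1000, "out": 1000, "duration_in_millis": 300}},
            ]
        }
    }
}

for name, plugin_id, total_ms, per_event in rank_filters(sample):
    print(f"{name:<8} {plugin_id:<10} {total_ms:>6} ms total  {per_event:.3f} ms/event")
```

The filters at the top of this list are the ones where optimisation is most likely to pay off.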

Hi Christian,
Thanks for the advice.

Do I set batch_size in the yml file, i.e. change the pipeline batch_size to 1250 (default 125)? And should I increase pipeline.max_inflight (to 10k, default 1k)?
Does batch size mean the number of rows read from a file at one time?
Is there anything else I can change in the yml file?

I just looked at the node stats API, but I still don't know how to optimize my config file...

Also, if I manage to split my files (2-3 GB) into smaller pieces, would that help speed things up?

If you look at the pipeline stats you can see which filters and filter types take up most of the processing time, either in total or calculated per event. This will allow you to determine which parts can benefit most from optimisation.

The internal batch size is set through the -b command line parameter. If you are limited by CPU it may however not make much difference.

As long as you are limited by CPU you need to either add more CPU or make your processing pipeline more efficient. Splitting into multiple files is unlikely to give any performance boost.

I changed the batch_size in the yml file from 125 to 1250. Is that OK?

I have now switched to a 32-core machine; I hope it helps.

I am using Logstash and ES 5.1.1. It seems like GET /_nodes/stats/pipeline doesn't work. It returns this error:
{ "error": { "root_cause": [ { "type": "illegal_argument_exception", "reason": "request [/_nodes/stats/pipeline] contains unrecognized metric: [pipeline]" } ], "type": "illegal_argument_exception", "reason": "request [/_nodes/stats/pipeline] contains unrecognized metric: [pipeline]" }, "status": 400 }

That is an API that Logstash exposes on port 9600, so it cannot be accessed through Elasticsearch in Console.
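For example, the stats can be fetched straight from the Logstash process rather than from Elasticsearch. This is a minimal sketch assuming Logstash runs on the same machine with the API on port 9600; note that the Logstash path uses `_node` (singular), unlike the Elasticsearch `_nodes` endpoint. The function names are my own.

```python
import json
import urllib.request


def node_stats_url(host="localhost", port=9600):
    """URL of the Logstash pipeline stats endpoint (not an Elasticsearch URL)."""
    return f"http://{host}:{port}/_node/stats/pipeline"


def fetch_pipeline_stats(host="localhost", port=9600, timeout=5):
    """Fetch and decode the pipeline stats from a running Logstash instance."""
    with urllib.request.urlopen(node_stats_url(host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# With Logstash running locally:
# stats = fetch_pipeline_stats()
# print(json.dumps(stats, indent=2))
```

The same URL can be opened with any HTTP client on the Logstash host.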
