I have a pipeline for Java logs configured with a grok filter as follows:
grok {
  id => 'grok'
  match => {
    "message" => [
      "(?m)\[%{DATA:time}\]\[%{DATA:traceid}\] \[%{DATA:level}\] %{JAVACLASS:class} - (((?<message>.*?)( :: )(?<json>\{(.|\r|\n)*\})(?<stacktrace>(.|\r|\n)*)?)|((?<message>[^\n]*))(?<stacktrace>(.|\r|\n)*)?)",
      "(?m)\[%{DATA:time}\]\[%{DATA:traceid}\] (\[%{DATA:service}\] )?\[%{DATA:level}\][\s]*(?<class>[\S]+)(:\d+)? - (?<message>[^\n]+)(\n(?<stacktrace>.*))?"
    ]
  }
  overwrite => [ "message" ]
}
The regex was provided to me, but my own manual testing indicated it worked with the log files.
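If it helps to reproduce, a harness along these lines should exercise the pattern outside the full pipeline (a sketch only: stdin plus the multiline codec is my stand-in for what Filebeat sends and may not be an exact equivalent, and only the simpler of the two patterns is shown):

input {
  stdin {
    # mirror the Filebeat multiline settings shown further down
    codec => multiline {
      pattern => "^\["
      negate => true
      what => "previous"
    }
  }
}
filter {
  grok {
    match => { "message" => "(?m)\[%{DATA:time}\]\[%{DATA:traceid}\] (\[%{DATA:service}\] )?\[%{DATA:level}\][\s]*(?<class>[\S]+)(:\d+)? - (?<message>[^\n]+)(\n(?<stacktrace>.*))?" }
    overwrite => [ "message" ]
  }
}
output {
  # print each parsed event so the captured fields can be checked by eye
  stdout { codec => rubydebug }
}

Saved as test.conf, it can be run with bin/logstash -f test.conf < /var/log/java/java.log to replay an existing file through it.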
The logs are shipped by Filebeat. This worked initially, but I had forgotten to turn on multiline handling, so the Filebeat configuration was modified to the following:
- type: log
  enabled: true
  paths:
    - /var/log/java/java.log
  multiline.pattern: '^\['
  multiline.negate: true
  multiline.match: after
  fields:
    log_type: java
  fields_under_root: true
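To sanity-check that the multiline grouping itself is doing what I expect, I can temporarily point Filebeat at the console instead of Logstash and eyeball the combined events, something like this (with the Logstash output commented out while it is enabled):

# run with: filebeat -e -c filebeat.yml
output.console:
  pretty: true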
This configuration works for a while, but eventually Logstash hangs, and however many workers are allocated to the pipeline, that many cores sit at 100%. It currently has 2 workers, so once it starts to fail I see two cores at 100%. Events also stop flowing on another pipeline, but I suspect that is because I'm using the distributor pattern.

While it is hung, the pipeline stats show that the difference between events in and out on the grok plugin is pipeline.batch.size x pipeline.workers, e.g. 250 for the current configuration (and indeed 6875 - 6625 = 250 in the snapshot below). The counts stay the same until Logstash is restarted:
{
  "id": "grok",
  "patterns_per_field": {
    "message": 2
  },
  "events": {
    "in": 6875,
    "out": 6625,
    "duration_in_millis": 2434
  },
  "failures": 3652,
  "name": "grok",
  "matches": 3094
},
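For reference, that snippet is the grok section of the pipeline stats from the node stats API, e.g.:

curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'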
Once Logstash is restarted, the logs flow once again, but it eventually goes back to the situation described above.
Each time it is restarted, the number of events that flow through before the hang differs, which suggests an issue with specific messages.
I've tried increasing the log level to trace, but it didn't help to identify the issue.
I have also separated the Java pipeline out from the distributor setup so that it listens on an independent port with the beats protocol, with the same result.
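Concretely that just meant giving the Java pipeline its own beats input, roughly like this (the port number here is a placeholder):

input {
  beats {
    # dedicated port, with Filebeat pointed straight at it instead of at the distributor
    port => 5045
  }
}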
I've also switched the queue from in-memory to persisted.
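That is, in logstash.yml (or in the pipeline's entry in pipelines.yml):

queue.type: persisted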
My guess is that some of the messages are malformed in a way that makes the grok match spin, but I don't know how to identify which messages are failing so that I can fix the match.
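The only concrete idea I've had so far is to bound the match and siphon off the offending events, along these lines (a sketch only; the timeout value and file path are placeholders, and I'm assuming grok's timeout_millis / tag_on_timeout options behave the way I expect):

filter {
  grok {
    id => 'grok'
    # match and overwrite exactly as in the filter above (omitted here for brevity)
    # stop matching a single event after 5s instead of spinning indefinitely
    timeout_millis => 5000
    # tag timed-out events so they can be routed somewhere inspectable
    tag_on_timeout => "_groktimeout"
  }
}
output {
  # write anything grok timed out on (or failed to parse) to a file for inspection
  if "_groktimeout" in [tags] or "_grokparsefailure" in [tags] {
    file {
      path => "/var/log/logstash/grok-problem-events.log"
    }
  }
}

Is something like that a reasonable way to pin down the failing messages, or is there a better approach?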