Identifying events that cause Logstash to hang

I have a pipeline for Java logs configured with a grok filter as follows:

    grok {
      id => 'grok'
      match => { "message" => [
        "(?m)\[%{DATA:time}\]\[%{DATA:traceid}\] \[%{DATA:level}\] %{JAVACLASS:class} - (((?<message>.*?)( :: )(?<json>\{(.|\r|\n)*\})(?<stacktrace>(.|\r|\n)*)?)|((?<message>[^\n]*))(?<stacktrace>(.|\r|\n)*)?)",
        "(?m)\[%{DATA:time}\]\[%{DATA:traceid}\] (\[%{DATA:service}\] )?\[%{DATA:level}\][\s]*(?<class>[\S]+)(:\d+)? - (?<message>[^\n]+)(\n(?<stacktrace>.*))?"
      ] }
      overwrite => [ "message" ]
    }

The regex was provided to me, but my own manual testing indicated that it worked with the log files.
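For this kind of manual testing, a throwaway pipeline can replay a log file through the same filter. A minimal sketch, not part of the original configuration: the multiline codec mirrors the Filebeat multiline settings shown below, since grok behaves very differently on joined stack traces than on single lines, and only the second of the two match patterns is repeated here for brevity:

    input {
      stdin {
        # join continuation lines the same way the Filebeat multiline settings do,
        # so grok sees complete multi-line events including stack traces
        codec => multiline {
          pattern => "^\["
          negate => true
          what => "previous"
        }
      }
    }

    filter {
      grok {
        match => { "message" => ["(?m)\[%{DATA:time}\]\[%{DATA:traceid}\] (\[%{DATA:service}\] )?\[%{DATA:level}\][\s]*(?<class>[\S]+)(:\d+)? - (?<message>[^\n]+)(\n(?<stacktrace>.*))?"] }
        overwrite => [ "message" ]
      }
    }

    output { stdout { codec => rubydebug } }

Running this with `bin/logstash -f test.conf < /var/log/java/java.log` replays the real file; if it stalls, the last event printed points at the neighbourhood of the offending message.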

The logs are shipped by Filebeat. This worked initially, but I had forgotten to turn on multiline handling, so the Filebeat configuration was modified to the following (any line not starting with `[` is appended to the preceding event):

    - type: log
      enabled: true
      paths:
        - /var/log/java/java.log
      multiline.pattern: '^\['
      multiline.negate: true
      multiline.match: after
      fields:
        log_type: java
      fields_under_root: true

This configuration works for a while, but eventually Logstash hangs, and depending on the number of workers allocated to the pipeline, that many cores sit at 100%. It currently has 2 workers, so once it starts to fail I see two cores at 100%. Events also stop flowing on another pipeline, but I suspect that is because I'm using the distributor pattern.

While it is hung, the pipeline stats show that the difference between events in and out of the grok plugin equals pipeline.batch.size × pipeline.workers, i.e. 250 for the current configuration (125 × 2, and indeed 6875 − 6625 = 250 below). The counts remain frozen until Logstash is restarted:

    {
      "id": "grok",
      "patterns_per_field": {
        "message": 2
      },
      "events": {
        "in": 6875,
        "out": 6625,
        "duration_in_millis": 2434
      },
      "failures": 3652,
      "name": "grok",
      "matches": 3094
    }

Once Logstash is restarted, the logs flow once again, but it eventually goes back to the situation described above.

Each time it is restarted, the number of events that flow through differs, which suggests an issue with specific messages.

So far I have tried:

- increasing the log level to trace, which didn't help identify the issue;
- separating the Java pipeline out of the distributor setup so that it listened on an independent port with the Beats protocol, with the same result;
- switching the queue from memory to persisted.

My guess is that some of the messages are malformed in a way that makes the grok regex freak out, but I don't know how to identify which messages are failing so that I can fix the match.
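One way to surface such messages, assuming the stock grok filter options (`timeout_millis`, which defaults to 30000, and `tag_on_timeout`, which defaults to `_groktimeout`; non-matching events additionally get a `_grokparsefailure` tag), would be to fail fast on a timeout and divert tagged events to a file for inspection. A sketch, with an illustrative output path:

    filter {
      grok {
        id => 'grok'
        # one of the two original patterns shown; both would be listed as before
        match => { "message" => ["(?m)\[%{DATA:time}\]\[%{DATA:traceid}\] (\[%{DATA:service}\] )?\[%{DATA:level}\][\s]*(?<class>[\S]+)(:\d+)? - (?<message>[^\n]+)(\n(?<stacktrace>.*))?"] }
        overwrite => [ "message" ]
        # give up on a single match attempt after 3 seconds instead of the 30 s default
        timeout_millis => 3000
        tag_on_timeout => "_groktimeout"
      }
    }

    output {
      # divert problem events so the raw messages can be inspected
      if "_groktimeout" in [tags] or "_grokparsefailure" in [tags] {
        file { path => "/var/log/logstash/grok-problem-events.log" }
      }
    }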

When using grok you should always try to use patterns that are as specific as possible, for performance reasons. This is discussed in this blog post. You should avoid having multiple DATA or GREEDYDATA patterns in a single expression, so even though your pattern may work, it is not very efficient. Worse, unanchored DATA captures and nested constructs like `(.|\r|\n)*` are prone to catastrophic backtracking on multi-line events, which fits the hang you describe: it started once Filebeat began joining stack traces into single events. I would recommend you read the post I linked to and then rewrite the pattern.
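For example, here is a sketch of a tightened first section, assuming the timestamp is ISO8601-like and the level is a standard level name (neither of which the log snippet confirms); anchoring the expression and pinning down the bracketed fields removes most of the backtracking that free-floating DATA captures allow:

    grok {
      # anchored, with the bracketed fields matched by stricter patterns than DATA
      match => { "message" => "(?m)^\[%{TIMESTAMP_ISO8601:time}\]\[%{NOTSPACE:traceid}\] \[%{LOGLEVEL:level}\] %{JAVACLASS:class} - %{GREEDYDATA:message}" }
      overwrite => [ "message" ]
    }

The same idea applies to the JSON and stack-trace branches: every nested `(.|\r|\n)*` that gets replaced or anchored removes a potential source of catastrophic backtracking.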

If you have a single format using different types of separators, you may also want to look into using the dissect filter instead.
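For the second format above, a dissect sketch might look like this (a hypothetical mapping; dissect uses literal delimiters rather than regular expressions, so it cannot hang on pathological input, and failures are tagged `_dissectfailure` by default):

    dissect {
      # literal delimiters, no regex engine involved; the trailing field
      # swallows the rest of the event, including any stack trace
      mapping => {
        "message" => "[%{time}][%{traceid}] [%{level}] %{class} - %{msg}"
      }
    }

Note that the optional `[service]` block from the second grok pattern cannot be expressed in a single dissect mapping; it would need a conditional or a second attempt.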
