Changed grok pattern for ingest pipeline and now grok pattern matching is timing out after 1000ms

Hello,

I'm using ingest pipelines to do some grokking and formatting before inserting documents into Elasticsearch, and I've been using this grok pattern for months without any issue. Today the format of my logs changed a little, so I updated the pattern (changed one field from WORD to DATA), tested it with the _ingest/pipeline/<pipeline>/_simulate API, and everything works fine.
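For reference, the simulate test looked roughly like this; the inline pipeline, field names, and sample message below are simplified placeholders rather than my real config:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{TIMESTAMP_ISO8601:ts} %{DATA:action} %{IP:src_ip} %{GREEDYDATA:rest}"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-11-13T20:31:53 reset-both 10.0.0.1 some more fields"
      }
    }
  ]
}

Run like that, the document grokked fine, which is why I thought the change was safe.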

I'm using Logstash containers to pull my data from a Kafka topic and write it to ES. When I start my Logstash container back up, I start to get these exceptions.

[2019-11-13T20:31:53,097][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 500 ({"type"=>"exception", "reason"=>"java.lang.IllegalArgumentException: java.lang.RuntimeException: grok pattern matching was interrupted after [1000] ms", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"java.lang.RuntimeException: grok pattern matching was interrupted after [1000] ms", "caused_by"=>{"type"=>"runtime_exception", "reason"=>"grok pattern matching was interrupted after [1000] ms"}}, "header"=>{"processor_type"=>"grok"}})

On the Elasticsearch side, I see an exception saying the same.

[2019-11-13T12:28:23,839][DEBUG][o.e.a.b.TransportBulkAction] [ranch-vm-node-001] failed to execute pipeline [pan_traffic_syslogs_pipeline] for document [pan_traffic_syslogs/_doc/null]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: java.lang.RuntimeException: grok pattern matching was interrupted after [1000] ms
    at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:194) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:133) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:100) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.ingest.IngestService.innerExecute(IngestService.java:481) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.ingest.IngestService.access$100(IngestService.java:70) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.ingest.IngestService$4.doRun(IngestService.java:410) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.1.1.jar:7.1.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: java.lang.IllegalArgumentException: java.lang.RuntimeException: grok pattern matching was interrupted after [1000] ms
    ... 11 more
Caused by: java.lang.RuntimeException: grok pattern matching was interrupted after [1000] ms
    at org.elasticsearch.grok.Grok.captures(Grok.java:254) ~[?:?]
    at org.elasticsearch.ingest.common.GrokProcessor.execute(GrokProcessor.java:66) ~[?:?]
    at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:123) ~[elasticsearch-7.1.1.jar:7.1.1]
    ... 9 more

I've never seen this before, and when I test using _simulate, the data is grokked instantly, so I'm a bit confused. Any ideas?

Thank you very much in advance.

Solved my problem. The logs I was testing weren't actually the ones causing the failures. I had assumed all the logs were hitting the issue, when in fact it was only a small subset, and I was able to surface those by adding this to my pipeline config:

"on_failure": [
{
  "set": {
    "field": "_index",
    "value": "failed-{{ _index }}"
  }
}
]

Per the documentation here: https://www.elastic.co/guide/en/elasticsearch/reference/master/handling-failure-in-pipelines.html
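For context, the on_failure block sits at the top level of the pipeline definition. Something like this (the grok processor and pattern shown here are a simplified stand-in, not my real pipeline):

PUT _ingest/pipeline/pan_traffic_syslogs_pipeline
{
  "description": "Parse PAN traffic syslogs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{TIMESTAMP_ISO8601:ts} %{DATA:action} %{IP:src_ip} %{GREEDYDATA:rest}"]
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "_index",
        "value": "failed-{{ _index }}"
      }
    }
  ]
}

With that in place, documents that fail the pipeline get redirected into a failed-<original index> index instead of erroring out of the bulk request, so you can go look at the raw messages that didn't parse.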

Once I could see the logs causing the issue, the root problem was clear: the grok pattern was failing on those logs, but the way I had changed the pattern meant the matcher churned instead of failing fast. It never reached a clean "grok pattern doesn't match" failure within 1000 ms, so it never showed me which logs were the problematic ones.

Anyway, lesson learned, I'm adding this on_failure block to all my pipelines so I can clearly see when they're running into issues. Hope this helps someone some day!


It sounds like you may have multiple occurrences of DATA and/or GREEDYDATA in your patterns, which can be quite inefficient. I would recommend reading this blog post if you have not already.
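As a rough illustration (not your actual pattern), an expression built from several unanchored DATA captures, e.g.

%{DATA:client} %{DATA:user} %{DATA:method} %{NUMBER:bytes}

has to try a large number of split points before it can give up on a line that does not quite fit, whereas anchoring it and using more specific sub-patterns, e.g.

^%{IPORHOST:client} %{USER:user} %{WORD:method} %{NUMBER:bytes}$

lets the matcher reject a non-matching line almost immediately.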

Thanks, I did end up reading that blog during my investigation yesterday.

Alas, even though I am using DATA in several places in my pattern, it's still plenty fast. The issue was that some of the logs were being parsed incorrectly, with fields being interpreted as something they weren't. It appeared that using WORD on a field containing non-WORD-compliant characters made the grok matcher take longer rather than failing immediately. When I changed the remaining WORD fields to DATA, everything got better. I think I'll avoid WORD from now on; personally, it's caused me a lot of grief.
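Roughly what the change looked like, with made-up field names rather than my real pattern: I had something along the lines of

%{TIMESTAMP_ISO8601:ts} %{DATA:src_zone} %{DATA:dst_zone} %{WORD:action} %{GREEDYDATA:rest}

and the problem logs had values like reset-both in the action field. WORD is essentially \b\w+\b, so the hyphen breaks it, and with DATA captures in front of it the matcher spent ages retrying different splits (and sometimes captured the wrong thing) instead of failing straight away. Changing that field to %{DATA:action} made those lines match quickly again.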
