Grok parser incorrectly matches JAVACLASS

Hello,

I'm upgrading ELK from version 7.3.1 to 8.13.4. I have a problem with the Logstash pipeline as it does not work as expected, or I am missing something completely.

I have Java logs, and I'm sending them to Logstash via Filebeat. The logs are multiline, and I have a multiline pattern on Filebeat, so they are sent as one event. Furthermore, I have a Logstash pipeline where I perform some extractions and mutations on these logs using Grok.

My problem is that the JAVACLASS predefined pattern matches random words, not only Java classes. An example from the debugger is in the code snippet below:

Sample Data:

Error occurred
java.lang.Exception
	at test.TestController.getServerError(TestController.java:28)
	at jdk.internal.reflect.GeneratedMethodAccessor55.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
...

Grok Pattern:

^%{JAVACLASS:exception}

Structured Data:

{
  "exception": "Error"
}

I expected java.lang.Exception to be matched, but instead the first word of the message is always matched. Other patterns, such as %{IP:ip}, work as expected and match the IP wherever it appears in the input text.

I did a lot of debugging and found that this problem started happening from Logstash version 7.12.0.

Please, can you help me with this?

Random words (such as "Error") are valid Java class names. The syntax does not require a package name.

7.12.0 introduced the ecs-v1 patterns, but AFAIK they were off by default, and in any case, both legacy and ecs-v1 patterns defined JAVACLASS as

(?:[a-zA-Z$_][a-zA-Z$_0-9]*\.)*[a-zA-Z$_][a-zA-Z$_0-9]*

which would definitely match "Error".

Going further back (8 years ago) there was a duplicate definition that did require a package name, but I don't see how you could have been referencing that in 7.3.1.

You could define your own pattern, changing

(?:[a-zA-Z$_][a-zA-Z$_0-9]*\.)*[a-zA-Z$_][a-zA-Z$_0-9]*

to

(?:[a-zA-Z$_][a-zA-Z$_0-9]*\.)+[a-zA-Z$_][a-zA-Z$_0-9]*
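
To see the effect of that one-character change outside Logstash, here is a minimal sketch that uses Python's re module as a stand-in for the regex engine behind grok. I am assuming the two engines agree on patterns this simple. The names IDENT, LOOSE and STRICT are mine, and the patterns are left unanchored.

```python
import re

# One Java identifier, copied from the JAVACLASS definition above.
IDENT = r"[a-zA-Z$_][a-zA-Z$_0-9]*"

# LOOSE is the shipped definition: zero or more "package." prefixes.
LOOSE = re.compile(rf"(?:{IDENT}\.)*{IDENT}")
# STRICT is the suggested change: at least one "package." prefix.
STRICT = re.compile(rf"(?:{IDENT}\.)+{IDENT}")

sample = (
    "Error occurred\n"
    "java.lang.Exception\n"
    "\tat test.TestController.getServerError(TestController.java:28)"
)

loose_match = LOOSE.search(sample)
strict_match = STRICT.search(sample)

print(loose_match.group(0))   # Error
print(strict_match.group(0))  # java.lang.Exception
```

With the stricter pattern a bare word such as "Error" no longer qualifies, so the first match is the fully qualified class name on the second line.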

Yes, you are right, the pattern is exactly the same. But how is it possible that in Logstash 7.11.2 (the version prior to 7.12.0) I get a different output? You can see it below.

Pipeline config:

        input { stdin { } }
        output {
          stdout { codec => rubydebug }
        }
        filter {
          grok {
            match => { "message" => "%{JAVACLASS:exception}"}
          }
        }

Input:

test

Output:

{
          "tags" => [
        [0] "_grokparsefailure"
    ],
      "@version" => "1",
    "@timestamp" => 2024-06-06T14:05:06.635Z,
       "message" => "test",
          "host" => "bbcf307fbbf4"
}

But in Logstash 7.12.0, for the same input, I get the following output:

{
          "host" => "5a4962317c19",
     "exception" => "test",
       "message" => "test",
      "@version" => "1",
    "@timestamp" => 2024-06-06T14:11:49.749Z
}

OK, so I downloaded 7.11.2 and 7.12.0 on Windows. I confirmed that 7.11.2 does not match "test\r" to JAVACLASS but 7.12.0 does.

I then modified the configuration to be

      grok {
        pattern_definitions => { "MYCLASS" => "(?:[a-zA-Z$_][a-zA-Z$_0-9]*\.)*[a-zA-Z$_][a-zA-Z$_0-9]*" }
        match => { "message" => "%{MYCLASS:exception}"}
      }

and found that 7.11.2 does match that. That means 7.11.2 is not using the JAVACLASS definition that we expected. I started Logstash with log.level set to debug and found that it also loads the duplicate pattern that I thought had been removed much earlier; as the log excerpt below shows, it is added after the expected definition. I guess the main Logstash release didn't pick up the updated logstash-patterns-core as soon as I expected.

[2024-06-06T11:33:46,735]... Adding pattern {"COMMONAPACHELOG"=>"%{HTTPD_COMMONLOG}"}
[2024-06-06T11:33:46,736]... Adding pattern {"COMBINEDAPACHELOG"=>"%{HTTPD_COMBINEDLOG}"}
[2024-06-06T11:33:46,736]... Adding pattern {"JAVACLASS"=>"(?:[a-zA-Z$_][a-zA-Z$_0-9]*\\.)*[a-zA-Z$_][a-zA-Z$_0-9]*"}
[2024-06-06T11:33:46,736]... Adding pattern {"JAVAFILE"=>"(?:[A-Za-z0-9_. -]+)"}
[2024-06-06T11:33:46,737]... Adding pattern {"JAVAMETHOD"=>"(?:(<(?:cl)?init>)|[a-zA-Z$_][a-zA-Z$_0-9]*)"}
[2024-06-06T11:33:46,737]... Adding pattern {"JAVASTACKTRACEPART"=>"%{SPACE}at %{JAVACLASS:class}\\.%{JAVAMETHOD:method}\\(%{JAVAFILE:file}(?::%{NUMBER:line})?\\)"}
[2024-06-06T11:33:46,737]... Adding pattern {"JAVATHREAD"=>"(?:[A-Z]{2}-Processor[\\d]+)"}
[2024-06-06T11:33:46,738]... Adding pattern {"JAVACLASS"=>"(?:[a-zA-Z0-9-]+\\.)+[A-Za-z0-9$]+"}
[2024-06-06T11:33:46,739]... Adding pattern {"JAVAFILE"=>"(?:[A-Za-z0-9_.-]+)"}
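
The debug log above shows JAVACLASS being added twice. Assuming the patterns end up in a name-keyed map where a later definition replaces an earlier one (that is my reading of the log, not something I have checked in the grok source), the behaviour can be mimicked with the rough sketch below. Python's re module stands in for grok's regex engine, and load_patterns is a hypothetical helper, not a Logstash function.

```python
import re

# (name, regex) pairs in the order the 7.11.2 debug log shows them being added.
# The log prints "\\." because it escapes the backslash; the regex itself is "\.".
loaded_in_order = [
    ("JAVACLASS", r"(?:[a-zA-Z$_][a-zA-Z$_0-9]*\.)*[a-zA-Z$_][a-zA-Z$_0-9]*"),
    ("JAVACLASS", r"(?:[a-zA-Z0-9-]+\.)+[A-Za-z0-9$]+"),
]


def load_patterns(pairs):
    """Hypothetical loader: a later definition overwrites an earlier one."""
    patterns = {}
    for name, regex in pairs:
        patterns[name] = regex
    return patterns


effective = load_patterns(loaded_in_order)["JAVACLASS"]  # second definition
first_defined = loaded_in_order[0][1]                    # the one we expected

for text in ("test", "java.lang.Exception"):
    print(
        text,
        "effective:", bool(re.search(effective, text)),
        "first defined:", bool(re.search(first_defined, text)),
    )
```

Under that assumption the effective pattern requires at least one dot, which would explain why "test" ends in _grokparsefailure on 7.11.2 but matches once only the first definition is loaded.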

Hope that helps.

Thank you for the clarification! I was very curious about this issue and spent three days investigating how it was possible. Now everything makes sense.