Failing to match "53...\n" with "[0-9][0-9]\\.\\.\\.\n?" in Discover filter. Help?

I'm on Kibana v7.10.0.
My logs are line-based JSON, and a record looks like this:

    {"dlog":{ ..., "line":"53...\n", ...}, ...}

Those dots are three . characters, not a unicode ellipsis:

0000000   ,   "   l   i   n   e   "   :   "   5   7   .   .   .   \   n
0000020   "   ,   "   i   n   d   e   x   "   :   7   3   }

I'm trying to exclude records looking like that (and other patterns), to leave the more interesting records showing. In Discover, I added a filter and gave it this Query DSL:

    { "query": { "regexp": { "dlog.line": { "value": "[0-9][0-9]\\.\\.\\.\n?" } } } }

but it matches nothing: toggled either way, the filter includes everything or excludes everything. I can't have just a single backslash in front of each dot, or Kibana won't let me press Enter on the query. (I removed the case-sensitive flag because there are no letters here.)

I've tried replacing the \\. with [.], in case the double escape is wrong somehow, but that makes no difference. I've tried \\n in case that needs to be literal. No dice.

If I remove all the dot-matching stuff then my target lines match... along with everything else that has two digits in it, even if those digits aren't at the start of the line - which seems to contradict the dictum I read everywhere about Lucene regexes needing to match the entire line.

What completely freaks me out is that if I just put "[0-9][0-9]..." then these lines do NOT match, despite having two digits followed by three of any character:

    27	List of devices attached
localhost [] 5542 (?) : Connection refused

but these do:

== STATUS: 2024-01-15.17:49:01 watchdog-quitting-adb-tail-10s 5542
                 ^ 15.17 highlit
INFO    | boot time 49978 ms
                  ^ 49978 highlit
17:22:08 up 373 days, 23:45,  0 users,  load average: 27.76, 88.87, 73.64
                                                    ^ 27.76 and 88.87 and 73.64 all separately highlit

Making things even more confusing are the yellow highlights, which light things up seemingly at random: 23:45 is two digits followed by three others, but it's not highlit.

I do observe that if I put "[0-9][0-9].." with only two dots, suddenly the strings that match are those with two digits followed by two other characters. Is it that these regexes are matched against every 'word' in the string, for some definition of 'word' that sometimes includes a colon, and that someone forgot to mention that in the regex documentation, or to provide a useful example demonstrating as much?

So... questionable documentation aside, how can I match the equivalent of PCRE /^\d\d\.\.\.\n?$/?

What is the type of this dlog.line field?

It's a string:

t dlog.line 11...


t dlog.line localhost [] 5539 (?) : Connection refused

I meant the data type in the Elasticsearch mapping.

What is the output of (replace INDEXNAMEHERE with the actual index name):

GET /INDEXNAMEHERE/_mapping/field/dlog.line

See Get field mapping API | Elasticsearch Guide [8.11] | Elastic

Ah - I didn't know about http://-some-host-/app/dev_tools#/console - I tried to use it as a curl API at first!

GET /mobile_core_emubroker-2021.09.30/_mapping/field/dlog.line

    {
      "mobile_core_emubroker-2024.01.19" : {
        "mappings" : {
          "dlog.line" : {
            "full_name" : "dlog.line",
            "mapping" : {
              "line" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
Try on dlog.line.keyword then.

Huh - thank you - that worked...

    {
      "query": {
        "regexp": {
          "dlog.line.keyword": {
            "value": "[0-9][0-9]*\\.\\.\\.\n"
          }
        }
      }
    }

successfully matched precisely the lines starting with some digits followed by the literal dots.

So .keyword actually means 'the actual data' and the field name is... a set of words?

It's a bit surprising - might be worth adding something to the documentation about this.

How would I go about making that suggestion?

Thanks again.

It's all about mapping.

If your field type is text, then it's analyzed at index time. Read this about analysis.
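For example, the per-token matching that the Discover highlights were hinting at can be sketched in Python. The tokenizer below is only a rough approximation of the standard analyzer, not its real implementation, but it reproduces the behaviour seen above:

```python
import re

def simple_tokens(line):
    # Rough stand-in for Elasticsearch's standard analyzer: lowercase,
    # then split into words and numbers. A '.' between digits stays
    # inside the token (so "15.17" is one token), while ':' splits
    # (so "23:45" becomes "23" and "45").
    return re.findall(r"[0-9]+(?:\.[0-9]+)*|[a-z]+", line.lower())

def token_matches(line, pattern):
    # A Lucene regexp query is implicitly anchored: it must match a
    # WHOLE token, not a whole line and not a substring of a token.
    return [t for t in simple_tokens(line) if re.fullmatch(pattern, t)]

token_matches("INFO    | boot time 49978 ms", r"[0-9][0-9]...")  # ['49978']
token_matches("17:22:08 up 373 days", r"[0-9][0-9]...")          # []
```

That is why tokens like "49978", "15.17" and "27.76" light up under "[0-9][0-9]..." while "23:45" does not: the colon splits it into two two-digit tokens, each too short for the five-character pattern.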

If your field type is keyword, then it's indexed as is. No analysis process.
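A keyword field, by contrast, keeps the whole string as a single term, so the anchored regexp has to cover the entire value, trailing newline included. A minimal sketch of that semantics, with Python's re.fullmatch standing in for the anchored Lucene regexp and sample values taken from this thread:

```python
import re

def keyword_regexp_match(value, pattern):
    # A keyword field stores the whole string as one term, so the
    # (implicitly anchored) regexp must match the entire value.
    return re.fullmatch(pattern, value) is not None

keyword_regexp_match("53...\n", r"[0-9][0-9]\.\.\.\n?")   # True
keyword_regexp_match("localhost [] 5539 (?) : Connection refused",
                     r"[0-9][0-9]\.\.\.\n?")              # False
```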

See Field data types | Elasticsearch Guide [8.12] | Elastic

For example, you can index strings to both text and keyword fields. However, text field values are analyzed for full-text search while keyword strings are left as-is for filtering and sorting.

That's a super important notion you need to understand when using a search engine.
