Failing to match "53...\n" with "[0-9][0-9]\\.\\.\\.\n?" in Discover filter. Help?

I'm on v 7.10.0 of Kibana.
My logs are line-based JSON, and each record looks like this:

    {"dlog":{ ..., "line":"53...\n", ...}, ...}

Those dots are three . characters, not a unicode ellipsis:

    0000000   ,   "   l   i   n   e   "   :   "   5   7   .   .   .   \   n
    0000020   "   ,   "   i   n   d   e   x   "   :   7   3   }

I'm trying to exclude records looking like that (and other patterns), to leave more interesting records showing. In Discover, I added a filter and gave it a Query DSL filter like this:

    { "query": { "regexp": { "dlog.line": { "value": "[0-9][0-9]\\.\\.\\.\n?" } } } }

but it matches nothing: as an include filter it hides everything, and as an exclude filter it hides nothing. I can't just have a single backslash in front of each dot, or Kibana won't accept the query when I press Enter. (I removed the case-sensitive flag because there are no letters here.)

I've tried replacing the \\. with [.], in case the double escape is wrong somehow, but that makes no difference. I've tried \\n in case that needs to be literal. No dice.
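For reference, the escaping can be checked outside Kibana. This is only a sketch using Python's `json` module as a stand-in for whatever JSON parser Kibana uses (an assumption); it shows what string the regexp engine would receive after JSON decoding, and that a single backslash before a dot is not valid JSON at all. The names `filter_text`, `pattern` and `is_valid_json` are made up for the illustration.

```python
import json

# The filter exactly as typed into the Query DSL box (raw string, so the
# backslashes below are the literal characters that were typed).
filter_text = r'{ "query": { "regexp": { "dlog.line": { "value": "[0-9][0-9]\\.\\.\\.\n?" } } } }'

# After JSON decoding, each "\\." becomes backslash-dot and "\n" becomes
# a real newline character inside the pattern.
pattern = json.loads(filter_text)["query"]["regexp"]["dlog.line"]["value"]
print(repr(pattern))


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A single backslash before a dot is an invalid JSON escape sequence,
# which would explain why the editor refuses it.
single_backslash_ok = is_valid_json(r'{"value": "[0-9][0-9]\.\.\.\n?"}')
print(single_backslash_ok)
```

So the double backslashes themselves look correct; the pattern that reaches the engine is two digits, three escaped dots and an optional newline.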

If I remove all the dot-matching stuff then my target lines match... along with everything else that has two digits in it, even if those digits aren't at the start of the line - which seems to contradict the dictum I read everywhere about Lucene regexes needing to match the entire line.

What completely freaks me out is that if I just put "[0-9][0-9]..." then these lines do NOT match, despite having two digits followed by 3 of any character:

    27	List of devices attached
11...
localhost [127.0.0.1] 5542 (?) : Connection refused

but these do:

== STATUS: 2024-01-15.17:49:01 watchdog-quitting-adb-tail-10s 5542
                 ^ 15.17 highlit
INFO    | boot time 49978 ms
                  ^ 49978 highlit
17:22:08 up 373 days, 23:45,  0 users,  load average: 27.76, 88.87, 73.64
                                                    ^ 27.76 and 88.87 and 73.64 all separately highlit

Making things even more confusing are the yellow highlights, which seem to light things up at random: 23:45 is two digits followed by three other characters, but it's not highlit.

I do observe that if I put "[0-9][0-9].." with only two dots, suddenly the strings that match are those with two digits and two others. Is it that these regexes are matched against every 'word' in the string, for some definition of 'word' that sometimes matches colon, but that someone's forgotten to mention that in the regex documentation, or provide a useful example demonstrating as much?

So... questionable documentation aside, how can I match the equivalent of PCRE /^\d\d\.\.\.\n?$/?

What is the type of this dlog.line field?

It's a string:

t dlog.line 11...

or

t dlog.line localhost [127.0.0.1] 5539 (?) : Connection refused

I meant the data type in the Elasticsearch mapping.

What is the output of (replace INDEXNAMEHERE with the actual index name):

GET /INDEXNAMEHERE/_mapping/field/dlog.line

See Get field mapping API | Elasticsearch Guide [8.11] | Elastic

Ah - I didn't know about http://-some-host-/app/dev_tools#/console - I tried to use it as a curl api at first!

GET /mobile_core_emubroker-2021.09.30/_mapping/field/dlog.line
{
  "mobile_core_emubroker-2024.01.19" : {
    "mappings" : {
      "dlog.line" : {
        "full_name" : "dlog.line",
        "mapping" : {
          "line" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

Try on dlog.line.keyword then.

Huh - thank you - that worked...

{
  "query": {
    "regexp": {
      "dlog.line.keyword": {
        "value": "[0-9][0-9]*\\.\\.\\.\n"
      }
    }
  }
}

successfully matched the lines precisely starting with some digits followed by literal dots.

So .keyword actually means 'the actual data' and the field name is... a set of words?

It's a bit surprising - might be worth adding something to the documentation about this.

How would I go about making that suggestion?

Thanks again.

It's all about mapping.

If your field type is text, then it's analyzed at index time. Read this about analysis.

If your field type is keyword, then it's indexed as is. No analysis process.

See Field data types | Elasticsearch Guide [8.12] | Elastic

For example, you can index strings to both text and keyword fields. However, text field values are analyzed for full-text search while keyword strings are left as-is for filtering and sorting.
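To make that concrete for the lines in this thread, here is a rough sketch in Python. It is not the real analyzer: `approx_tokens` is a made-up, simplified stand-in for the standard tokenizer (it keeps runs of letters and digits, and keeps a `.` or `,` only when it sits between two digits), and Python's `re` is used in place of the Lucene regex engine. Under those assumptions it reproduces the behaviour described above: on a text field the pattern must match a whole token, while on a keyword field it must match the whole original string.

```python
import re

# Simplified tokenizer: runs of letters/digits, with '.' or ',' kept only
# between two digits. An approximation, not the real segmentation rules.
TOKEN = re.compile(r"[A-Za-z0-9_]+(?:(?<=[0-9])[.,](?=[0-9])[A-Za-z0-9_]+)*")


def approx_tokens(text):
    """Split text into lowercased tokens, roughly like an analyzed text field."""
    return [t.lower() for t in TOKEN.findall(text)]


def text_field_matches(pattern, text):
    """Return the tokens that the pattern matches in full (text field behaviour)."""
    return [t for t in approx_tokens(text) if re.fullmatch(pattern, t)]


def keyword_field_matches(pattern, text):
    """Return True if the pattern matches the entire string (keyword field behaviour)."""
    return re.fullmatch(pattern, text) is not None


uptime = "17:22:08 up 373 days, 23:45,  0 users,  load average: 27.76, 88.87, 73.64"

# "11...\n" is reduced to the single token "11", so a five-character
# pattern has nothing to match on the text field.
print(approx_tokens("11...\n"))
print(text_field_matches(r"[0-9][0-9]...", "11...\n"))

# The colon splits "23:45" into "23" and "45", but "27.76" stays one token.
print(text_field_matches(r"[0-9][0-9]...", uptime))

# On the keyword field the whole line, newline included, is one value.
print(keyword_field_matches(r"[0-9][0-9]*\.\.\.\n", "53...\n"))
```

The highlighting in Discover lines up with this: only the individual tokens that matched the pattern are lit.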

That's a super important notion you need to understand when using a search engine.
