Matching documents ending with an empty value using regexp

Hi, I have Elasticsearch documents that can contain various forms of the string "creative=".
Well-formatted logs look like "creative=xxx", and badly formatted logs end in a bare "creative=".
Example of bad log:

https://s2s.adjust.com/bigclient?s2s=1&idfa=123456789&rt=1&campaign=crto_nb_ios&criteo_click_id=1&os_name=ios&cost_amount=x&cost_type=CPC&cost_currency=TRY&creative=

Example of good log:

https://s2s.adjust.com/bigclient?s2s=1&idfa=123456789&rt=1&campaign=crto_nb_ios&criteo_click_id=1&os_name=ios&cost_amount=x&cost_type=CPC&cost_currency=TRY&creative=xyz

So I'd like to find all documents with the bad format, i.e. ones ending in a bare "creative=".
I tried /.*creative=$/ and /.*creative=/ but get no results, just an empty response. According to the Elastic docs (Regexp support for matching at the beginning (^) and the end ($)), Lucene patterns are always anchored, so there is no need to anchor with "^" and "$".

Any ideas how to find the documents ending with "creative="? (Basically I'm trying to parse ad-server logs and find the records that don't have the creative macro filled in properly.)
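In query DSL terms, what I'm trying is roughly the following (my-index and the field name url are placeholders for my real names):

    GET my-index/_search
    {
      "query": {
        "regexp": {
          "url": ".*creative="
        }
      }
    }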

You're probably searching the tokenised strings, not the full value.
Unless you've defined something special in the index mapping, the default behaviour is that strings like your URL are tokenized into words and then indexed, minus the "=" sign.
The other thing that happens by default is that whole strings of 256 characters or fewer are also indexed into an un-tokenized field with the same name but a .keyword suffix added.
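For reference, this is what the default dynamic mapping looks like for a string field, shown here for a hypothetical field named url; the ignore_above setting is what imposes the 256-character limit:

    "url": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }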

So the solution is probably to search the url.keyword field if your original field was url.
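Something along these lines should work (my-index is a placeholder for your index name):

    GET my-index/_search
    {
      "query": {
        "regexp": {
          "url.keyword": ".*creative="
        }
      }
    }

Since the regexp runs against the whole un-tokenized value and Lucene patterns are anchored, .*creative= will only match values that end in creative=.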

Thank you. So the "=" sign is removed during tokenization? (similar to how whitespace would be removed)

Just to confirm: is the tokenizer rule defined in the index mapping? (e.g. if I wanted to remove "=" from the tokenizer rules)
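(To double-check what's configured, I assume I can dump the settings with something like the request below, my-index standing in for the real index name:)

    GET my-index/_settings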

In any case, even if the tokenizer removes "=", I'd expect the word "creative" to be at the end of the text. Is there no way to tell Kibana to match the last word? (There is a long thread about it here: Match text ending in a word - #8; not sure if anything has changed since.)

url.keyword didn't work.

P.S. Here is the content of my analysis settings. I don't see anything about "=" in the tokenizers.

      "analysis": {
        "normalizer": {
          "lowercase": {
            "filter": "lowercase",
            "type": "custom"
          }
        },
        "analyzer": {
          "whitespace_lowercase": {
            "filter": [
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "whitespace"
          },
          "keyword_lowercase": {
            "filter": [
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      }

I'd recommend checking out the _analyze API to see what your mappings do to your strings before indexing.

This will help you solve current and future search and mapping problems.
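For example, something like this (my-index is a placeholder; the analyzer name comes from your settings above):

    GET my-index/_analyze
    {
      "analyzer": "whitespace_lowercase",
      "text": "https://s2s.adjust.com/bigclient?s2s=1&idfa=123456789&rt=1&campaign=crto_nb_ios&criteo_click_id=1&os_name=ios&cost_amount=x&cost_type=CPC&cost_currency=TRY&creative="
    }

The response lists the exact tokens that would be indexed, so you can see whether the trailing creative= survives tokenization.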
