Matching documents ending with an empty value using regexp

Hi, I have Elasticsearch documents that can contain various forms of the string "creative=".
Well-formatted logs look like "creative=xxx", and badly formatted logs end in a bare "creative=".
Example of bad log:

https://s2s.adjust.com/bigclient?s2s=1&idfa=123456789&rt=1&campaign=crto_nb_ios&criteo_click_id=1&os_name=ios&cost_amount=x&cost_type=CPC&cost_currency=TRY&creative=

Example of good log:

https://s2s.adjust.com/bigclient?s2s=1&idfa=123456789&rt=1&campaign=crto_nb_ios&criteo_click_id=1&os_name=ios&cost_amount=x&cost_type=CPC&cost_currency=TRY&creative=xyz

So I'd like to find all documents with the bad format, i.e. ones ending in a bare "creative=".
I tried /.*creative=$/ and /.*creative=/ but get no results, just an empty response. According to the Elastic docs (Regexp support for matching at the beginning (^) and the end ($)), Lucene patterns are always anchored, so there is no need to anchor with "^" and "$".

Any ideas how to find the documents ending with "creative="? (Basically I'm trying to parse ad-server logs and find the records that don't have the creative macro filled in properly.)
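In query DSL terms, what I'm trying is roughly the following (my-index and the field name url are placeholders for my real names):

    GET my-index/_search
    {
      "query": {
        "regexp": {
          "url": ".*creative="
        }
      }
    }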

You're probably searching the tokenised strings, not the full value.
Unless you've defined something special in the index mapping, the default behaviour is that strings like your URL are tokenized into words and then indexed, minus the "=" sign.
The other thing that happens by default is that whole strings of 256 characters or fewer are also indexed into an un-tokenized field with the same name but a .keyword suffix added.
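For reference, this is what the default dynamic mapping looks like for a string field, shown here for a hypothetical field named url; the ignore_above setting is what imposes the 256-character limit:

    "url": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }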

So the solution is probably to search the url.keyword field if your original field was url.
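Something along these lines should work (my-index is a placeholder for your index name):

    GET my-index/_search
    {
      "query": {
        "regexp": {
          "url.keyword": ".*creative="
        }
      }
    }

Since the regexp runs against the whole un-tokenized value and Lucene patterns are anchored, .*creative= will only match values that end in creative=.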

Thank you. So the "=" sign is removed during tokenization? (similar to how whitespace would be removed)

Just to confirm: is the tokenizer rule defined in the index mapping? (e.g. if I wanted to remove "=" from the tokenizer rules)
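(To double-check what's configured, I assume I can dump the settings with something like the request below, my-index standing in for the real index name:)

    GET my-index/_settings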

In any case, even if the tokenizer removes "=", I'd expect the word "creative" to be at the end of the text. Is there no way to tell Kibana to match the last word? (There is a long thread about it here: Match text ending in a word - #8; not sure if anything has changed since.)

url.keyword didn't work.

P.S. Here is the content of my analysis settings. I don't see anything about "=" in the tokenizers.

      "analysis": {
        "normalizer": {
          "lowercase": {
            "filter": "lowercase",
            "type": "custom"
          }
        },
        "analyzer": {
          "whitespace_lowercase": {
            "filter": [
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "whitespace"
          },
          "keyword_lowercase": {
            "filter": [
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      }

I'd recommend checking out the _analyze API to see what your mappings do to your strings before indexing.

This will help you solve current and future search and mapping problems.
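For example, something like this (my-index is a placeholder; the analyzer name comes from your settings above):

    GET my-index/_analyze
    {
      "analyzer": "whitespace_lowercase",
      "text": "https://s2s.adjust.com/bigclient?s2s=1&idfa=123456789&rt=1&campaign=crto_nb_ios&criteo_click_id=1&os_name=ios&cost_amount=x&cost_type=CPC&cost_currency=TRY&creative="
    }

The response lists the exact tokens that would be indexed, so you can see whether the trailing creative= survives tokenization.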
