Pattern analyzer regex help

Got a question regarding the pattern analyzer. Example text:

S6UlZgYCJaSIQcy03OOA==Ieuwc7Ix/CQfwoDSOVJl== 2oZjflRSRkcj4/OHcp78==

The text is encrypted: each letter is hashed to 20 characters followed by ==. I would like to create one token per letter, so the tokens should be:

S6UlZgYCJaSIQcy03OOA==
Ieuwc7Ix/CQfwoDSOVJl==
2oZjflRSRkcj4/OHcp78==

If I set "pattern": "==", the tokens are created without the == (e.g. S6UlZgYCJaSIQcy03OOA). Is there any way to include the separator as part of the token, or some other logic such as "take 20 characters [skip whitespace] and create the first token, then take another 20 characters [skip whitespace] and create the second token", and so on?

The rules are quite simple:

  • 1 letter = 20 characters
  • Every hashed letter ends with ==
  • Whitespace is also a separator: when a new word starts, it is separated from the previous one by whitespace.

It would be great if I could combine a pattern whose match stays in the token with the whitespace analyzer.

Hi @ansamHox

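I hope this helps. The idea is to use a pattern_replace char filter to append a marker (<>) after every ==, and then split on that marker with a pattern tokenizer, so the == stays in the token.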

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "==",
          "replacement": "==<>"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ["<>"]
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "S6UlZgYCJaSIQcy03OOA==Ieuwc7Ix/CQfwoDSOVJl== 2oZjflRSRkcj4/OHcp78=="
}
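
An alternative that is not from the reply above, offered only as a sketch: the pattern tokenizer also accepts a group parameter, and with "group": 0 it emits the text that matches the pattern instead of splitting on it. Assuming every hashed letter is exactly 20 Base64 characters (A-Z, a-z, 0-9, +, /) followed by ==, matching the tokens directly keeps the == and skips whitespace without a char filter. The index name my-index-000002 and the names my_match_tokenizer and my_match_analyzer are hypothetical.

```
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_match_tokenizer": {
          "type": "pattern",
          "pattern": "[A-Za-z0-9+/]{20}==",
          "group": 0
        }
      },
      "analyzer": {
        "my_match_analyzer": {
          "tokenizer": "my_match_tokenizer"
        }
      }
    }
  }
}
```

One trade-off to keep in mind: in this mode any text that does not match the pattern is dropped rather than tokenized, so it only fits if the input always follows the 20-characters-plus-== rule.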

Yep, that's it, I never thought about replacing.

I just had to add whitespace to the pattern list, and it works as it should:

"pattern": ["<>", " "]
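
For reference, the pattern tokenizer documents pattern as a single Java regular expression, so the same split could presumably also be written as one alternation. This is an untested sketch of the tokenizer definition only; \\s+ (any run of whitespace) is an assumption that goes slightly beyond the single space used above.

```
"tokenizer": {
  "my_tokenizer": {
    "type": "pattern",
    "pattern": "<>|\\s+"
  }
}
```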

Thanks.

