Pattern analyzer regex help

ansamHox · July 26, 2022, 10:38pm

Got a question regarding the pattern analyzer. Example text:

S6UlZgYCJaSIQcy03OOA==Ieuwc7Ix/CQfwoDSOVJl== 2oZjflRSRkcj4/OHcp78==

It's encrypted: each letter is hashed to 20 characters followed by ==. I would like a tokenizer that emits one token per letter, so the tokens should be:

S6UlZgYCJaSIQcy03OOA==
Ieuwc7Ix/CQfwoDSOVJl==
2oZjflRSRkcj4/OHcp78==

If I put "pattern": "==", it creates tokens without the == sign (e.g. S6UlZgYCJaSIQcy03OOA). Is there any way to include the separator as part of the token, or some other logic like "take 20 characters [skip whitespace] and create one token, then take another 20 characters [skip whitespace] and create a second token", etc.?

The rules are quite simple:

1 letter = 20 characters
Every hashed letter ends with ==
Whitespace is also a separator: when a new word starts, it is separated from the previous one by whitespace.

It would be cool if I could combine a pattern whose match stays in the token with the whitespace analyzer.

RabBit_BR · July 27, 2022, 12:37am

Hi @ansamHox

I hope this helps.

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "==",
          "replacement": "==<>"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ["<>"]
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "S6UlZgYCJaSIQcy03OOA==Ieuwc7Ix/CQfwoDSOVJl== 2oZjflRSRkcj4/OHcp78=="
}
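
For comparison, the same split can probably be done without a char filter. The pattern tokenizer has a group parameter; when it is set to 0, each whole match is emitted as a token instead of being used as a separator. The following is only a sketch, assuming every hashed letter is exactly 20 characters that are neither whitespace nor =, followed by ==. The index, tokenizer, and analyzer names are made up for illustration.

```console
# Hypothetical index and names; assumes 20 non-whitespace, non-"=" characters followed by "==".
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "hashed_letter_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\s=]{20}==",
          "group": 0
        }
      },
      "analyzer": {
        "hashed_letter_analyzer": {
          "type": "custom",
          "tokenizer": "hashed_letter_tokenizer"
        }
      }
    }
  }
}

# Same sample text as in the question.
POST my-index-000002/_analyze
{
  "analyzer": "hashed_letter_analyzer",
  "text": "S6UlZgYCJaSIQcy03OOA==Ieuwc7Ix/CQfwoDSOVJl== 2oZjflRSRkcj4/OHcp78=="
}
```

With this variant, whitespace needs no special handling because it never matches the pattern. The trade-off is that any text that does not fit the 20-characters-plus-== shape is dropped rather than tokenized, so the marker approach may be the safer choice if the input is not always well formed.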

ansamHox · July 27, 2022, 12:31pm

Yep, that's it. I never thought about replacing.

I just had to add whitespace to the pattern list and it works as it should:

"pattern": ["<>", " "]

Thanks
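
A note on the pattern setting: the pattern tokenizer documentation describes pattern as a single Java regular expression, not a list. A sketch of the same tokenizer written as one string with alternation, assuming the <> marker inserted by the char filter from the reply above:

```console
# Fragment of the "tokenizer" section only; splits on the inserted marker or on runs of whitespace.
"tokenizer": {
  "my_tokenizer": {
    "type": "pattern",
    "pattern": "<>|\\s+"
  }
}
```

Where a marker is immediately followed by whitespace, the two separators sit next to each other. The tokenizer should skip the zero-length piece between them, so no empty token is expected.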

system · August 24, 2022, 12:32pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
