Custom analyzer and phrase search

I have some encrypted text that I need to search. Each character of a word is encoded as 3 groups of 3 characters, with 3 random characters in between the groups. Example:
Character "O" consists of [*A1B*C1E*D5X]
Character "K" consists of [*Bx2*Ik8*CC9*]

The word "OK" then looks like *A1B*C1E*D5X*Bx2*Ik8*CC9*, where * stands for 3 random characters, so the stored form of "OK" can actually look like:

  • XA2A1BC5lC1Eah7D5Xx11Bx2jkiIk820bCC91ht
  • S2yA1BvkqC1Eou6D5X908Bx2mh3Ik8jutCC9asx

I'm struggling to come up with an optimal analyzer (other than tokenizing N by N: say, 3 good characters, followed by 3 random ones, followed by 3 good ones, etc.). What I would like to do is search for A1BC1ED5XBx2Ik8CC9 (the groups need to appear in that same order) and get back the matching results.
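To make the layout concrete, here is a minimal Python sketch (not Elasticsearch configuration) of the "N by N" idea. It assumes the stored text strictly alternates 3 random characters and 3 good characters, starting with a random group, as in the two samples above. The helper names `chunk` and `signal_chunks` are made up for illustration.

```python
def chunk(text, size=3):
    """Split text into consecutive fixed-size pieces (the last one may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def signal_chunks(text, size=3):
    """Keep every second chunk, assuming the text starts with a random chunk."""
    return chunk(text, size)[1::2]


samples = [
    "XA2A1BC5lC1Eah7D5Xx11Bx2jkiIk820bCC91ht",
    "S2yA1BvkqC1Eou6D5X908Bx2mh3Ik8jutCC9asx",
]

for sample in samples:
    # Both samples reduce to the same string: A1BC1ED5XBx2Ik8CC9
    print("".join(signal_chunks(sample)))
```

Under that assumption the good groups always sit at the odd chunk positions (1, 3, 5, ...), which is why a plain 3-character tokenization of the stored text puts one random token between every pair of good tokens.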

I tried ngram (min and max 3), and also something like this:

"analysis": {
  "tokenizer": {
    "sequence_tokenizer": {
      "type": "pattern",
      "pattern": "(?<=\\G.{3})",
      "group": -1
    }
  }
}

I also tried regexp, but regex is quite slow, of course. Any hints or ideas?

Hi @nestor1

I don't know whether you have found a solution yet, but if the search term is not encrypted, you could try indexing with the regex-based analyzer and searching with a different analyzer, an ngram one that produces tokens of size 3.
This way, the regex analyzer is not used at search time. Check whether the performance is acceptable.
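One possible reading of this suggestion, sketched in Python rather than as Elasticsearch settings: the index-time step uses a regex to keep only the good 3-character groups, the search-time step just cuts the plain query into 3-character tokens, and a phrase match then means the query tokens appear consecutively and in order. The function names are hypothetical, and the sketch assumes the stored text strictly alternates 3 random and 3 good characters, starting with a random group.

```python
import re


def index_tokens(stored):
    """Index-time analysis: skip 3 random characters, capture the next 3."""
    return re.findall(r".{3}(.{3})", stored)


def search_tokens(query):
    """Search-time analysis: plain 3-character pieces, no regex involved."""
    return [query[i:i + 3] for i in range(0, len(query), 3)]


def phrase_match(stored, query):
    """True if the query tokens occur as a consecutive, ordered run in the document tokens."""
    doc = index_tokens(stored)
    wanted = search_tokens(query)
    if not wanted:
        return False
    return any(
        doc[i:i + len(wanted)] == wanted
        for i in range(len(doc) - len(wanted) + 1)
    )


stored = "XA2A1BC5lC1Eah7D5Xx11Bx2jkiIk820bCC91ht"
print(phrase_match(stored, "A1BC1ED5XBx2Ik8CC9"))  # True: all groups in order
print(phrase_match(stored, "C1EA1B"))              # False: groups in the wrong order
```

In Elasticsearch terms this might correspond to a pattern tokenizer with a capture group at index time and a separate search analyzer on the field, but that mapping is an untested assumption here, so it is worth checking against a few real documents first.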