I have some encrypted text that I need to search. Every character of a word is encoded as 3 groups of 3 characters, with 3 random characters in between. Example:
Character "O" consists of [*A1B*C1E*D5X]
Character "K" consists of [*Bx2*Ik8*CC9*]
The word "OK" looks like *A1B*C1E*D5X*Bx2*Ik8*CC9*
where * stands for 3 random characters, so the word "OK" can actually look like:
- XA2A1BC5lC1Eah7D5Xx11Bx2jkiIk820bCC91ht
- S2yA1BvkqC1Eou6D5X908Bx2mh3Ik8jutCC9asx
I'm struggling to come up with an optimal analyzer (other than tokenizing N by N: say 3 good characters, followed by 3 random ones, followed by 3 good ones, etc.). What I would like to do is search for A1BC1ED5XBx2Ik8CC9 (the groups need to appear in that same order) and get back the results.
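To make the layout concrete, here is a minimal Python sketch (plain Python, not Elasticsearch code) of the fixed-width chunking described above. It assumes every value starts with a filler chunk and then alternates good/filler, as both examples do; the function name `strip_filler` and its `width` and `first_good` parameters are made up for illustration.

```python
def strip_filler(text, width=3, first_good=1):
    """Split text into fixed-width chunks and keep every second chunk.

    Assumption: chunks alternate filler/good, and the first good chunk
    sits at chunk index `first_good` (1 in both examples above).
    """
    chunks = [text[i:i + width] for i in range(0, len(text), width)]
    return "".join(chunks[first_good::2])


samples = [
    "XA2A1BC5lC1Eah7D5Xx11Bx2jkiIk820bCC91ht",
    "S2yA1BvkqC1Eou6D5X908Bx2mh3Ik8jutCC9asx",
]
query = "A1BC1ED5XBx2Ik8CC9"

for sample in samples:
    # Both samples reduce to the search string once the filler is dropped.
    print(strip_filler(sample) == query)
```

If the filler could be dropped like this before indexing, the search would become an ordinary exact or prefix match on the reduced string, but that only holds as long as the alternation is strictly fixed-width.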
I tried ngram (min and max both 3), and also something like this:
"analysis": {
  "tokenizer": {
    "sequence_tokenizer": {
      "type": "pattern",
      "pattern": "(?<=\\G.{3})",
      "group": -1
    }
  }
}
I also tried regexp, but regex is of course quite slow. Any hints/ideas?