Custom analyzer and phrase search

nestor1 · June 14, 2024, 12:03pm

I have some encrypted text that I need to search. Every character in a word is encoded as 3 groups of 3 characters, with 3 random characters in between. Example:
Character "O" is encoded as [*A1B*C1E*D5X]
Character "K" is encoded as [*Bx2*Ik8*CC9*]

The word "OK" looks like *A1B*C1E*D5X*Bx2*Ik8*CC9*, where * stands for any other characters, so the word "OK" can actually look like:

XA2A1BC5lC1Eah7D5Xx11Bx2jkiIk820bCC91ht
S2yA1BvkqC1Eou6D5X908Bx2mh3Ik8jutCC9asx

I'm struggling to create an optimal analyzer (other than tokenizing N by N: let's say 3 characters are good, followed by 3 random ones, followed by 3 good ones, and so on). What I would like to do is search for A1BC1ED5XBx2Ik8CC9 (the groups need to be in that same order) and get back the results.
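The layout described above can be modeled in a few lines of Python. This is only a sketch of the desired tokenization, not Elasticsearch code. It assumes that every stored string starts with a filler group and that filler and signal groups are always exactly 3 characters long, which holds for both samples in this post. The helper name `signal_chunks` is made up for illustration.

```python
def signal_chunks(text: str, size: int = 3) -> list[str]:
    """Split text into fixed-size chunks and keep every second one.

    Assumption: chunks 0, 2, 4, ... are random filler and chunks
    1, 3, 5, ... carry the searchable groups.
    """
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    return chunks[1::2]


samples = [
    "XA2A1BC5lC1Eah7D5Xx11Bx2jkiIk820bCC91ht",
    "S2yA1BvkqC1Eou6D5X908Bx2mh3Ik8jutCC9asx",
]
query = "A1BC1ED5XBx2Ik8CC9"

for sample in samples:
    tokens = signal_chunks(sample)
    # With the filler dropped, both samples reduce to the search string.
    print(tokens, "".join(tokens) == query)
```

If the analyzer emits only these groups at consecutive positions, the search reduces to an ordinary phrase match on 3-character tokens.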

I tried ngram (min and max 3), and also something like this:

"analysis": {
  "tokenizer": {
    "sequence_tokenizer": {
      "type": "pattern",
      "pattern": "(?<=\\G.{3})",
      "group": -1
    }
  }
}

I also tried regexp, but regex is quite slow, of course. Any hints or ideas?

RabBit_BR · June 17, 2024, 12:17am

Hi @nestor1

I don't know whether you have found a solution, but if the search term is not encrypted, you could try indexing with the regex-based analyzer and searching with a different analyzer, an ngram one that produces tokens of size 3.
That way the regex analyzer is not used at search time. Check whether the performance is acceptable.
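A rough sketch of the split between index-time and search-time analysis, written as Python dicts that could be sent as request bodies. Everything named here (`signal_index`, `signal_search`, the tokenizer names, the `payload` field) is hypothetical, and the layout assumption is the one from the question: 3 filler characters, then 3 meaningful ones, repeating. One deliberate difference from the suggestion: the search side uses a `pattern` tokenizer that cuts the query into non-overlapping 3-character tokens, because an ngram tokenizer with min and max 3 emits overlapping grams (A1B, 1BC, BC1, ...), which would probably not line up with the indexed tokens in a phrase query. The regexes are exercised with Python's `re` only as a sanity check; this has not been run against a cluster.

```python
import re

# Index side: skip 3 filler characters, capture the next 3 (capture group 1).
INDEX_PATTERN = r".{3}(.{3})"
# Search side: the query has no filler, so every 3 characters form a token.
SEARCH_PATTERN = r"(.{3})"

# Hypothetical index-creation body; all names are illustrative.
index_body = {
    "settings": {"analysis": {
        "tokenizer": {
            "signal_index_tokenizer": {
                "type": "pattern", "pattern": INDEX_PATTERN, "group": 1},
            "signal_search_tokenizer": {
                "type": "pattern", "pattern": SEARCH_PATTERN, "group": 1},
        },
        "analyzer": {
            "signal_index": {"type": "custom", "tokenizer": "signal_index_tokenizer"},
            "signal_search": {"type": "custom", "tokenizer": "signal_search_tokenizer"},
        },
    }},
    "mappings": {"properties": {"payload": {
        "type": "text",
        "analyzer": "signal_index",
        "search_analyzer": "signal_search",
    }}},
}

# Hypothetical search body: a phrase query keeps the groups in order.
search_body = {"query": {"match_phrase": {"payload": "A1BC1ED5XBx2Ik8CC9"}}}


def contains_phrase(doc_tokens: list[str], query_tokens: list[str]) -> bool:
    """Simplified model of a phrase match: query tokens appear consecutively."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


doc = "S2yA1BvkqC1Eou6D5X908Bx2mh3Ik8jutCC9asx"
doc_tokens = re.findall(INDEX_PATTERN, doc)
query_tokens = re.findall(SEARCH_PATTERN, search_body["query"]["match_phrase"]["payload"])
print(doc_tokens)
print(contains_phrase(doc_tokens, query_tokens))
```

Since both sides produce plain 3-character terms, the search is a term-position lookup rather than a regexp scan, which is where the performance gain would be expected. It is still worth checking the real token output with the `_analyze` API before relying on it.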
