Fuzzy regexp search


#1

Hi,

I am trying to find matches between words and their reduced forms using Elasticsearch.

Let's say I have the input word shmp (indexed as shampoo in ES). I generate the regex s.*?h.*?m.*?p.*? and execute the following query:

{
  "query": {
    "regexp": { "name": "s.*?h.*?m.*?p.*?" }
  }
}

Is there a way to also do a fuzzy search at the same time (with max_expansions 1, for example) so that s.*?h.*?n.*?p.*? would match?

However, I'm not sure that this is the best way to go; maybe there is a workaround that I did not think of.


(Nik Everett) #2

Can you be more specific about what you want the input to the search to be and what you want to find?

I don't know how much background knowledge you have so I'll just throw out two things that are relevant to what you are talking about that you might already know and have worked through:

  1. Regex searches search against analyzed terms, not the whole text. Searching against the whole text would be slow and there isn't anything built into Elasticsearch to do that.
  2. You can always wrap queries in a bool query. If you have two queries listed in its should section then (by default) either one can match.
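
As a rough sketch of point 2, the regex from the first post and its variant with n in place of m can be listed as two should clauses of a bool query, so a term matching either pattern is returned. The field name and both patterns are taken from the question; the pairing itself is only an illustration:

```json
{
  "query": {
    "bool": {
      "should": [
        { "regexp": { "name": "s.*?h.*?m.*?p.*?" } },
        { "regexp": { "name": "s.*?h.*?n.*?p.*?" } }
      ]
    }
  }
}
```

This only covers the variants you list up front, so it assumes you can guess which letter might be wrong.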

#3

What I really want to do is, starting from a reduced version of a word, find its full version that is indexed in ES.
The input to the search is the reduced form of a word:

  • mssge should match message
  • shmp should match shampoo

The problem is that there might be a mistake in the reduced version that I use for the query (for instance, because it is the result of OCR, which is not 100% accurate). But I don't know how it will differ, so it is difficult to predict all the possibilities for the bool query. Usually a single letter could be wrong in the reduced form.

Maybe I should not use regex, but I am not sure which built-in feature of ES is best suited to my problem.


(Nik Everett) #4

You are probably better off using a fuzzy query in that case. It'll match things within an edit distance of 1 or 2. It's much, much slower than a term query or a normal match query, but it'll get the job done. It'll be slightly less slow if you set the prefix_length option to 1 or 2.
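
A minimal sketch of such a fuzzy query, assuming the same name field as in the first post and the reduced form mssge from post #3 as input. The fuzziness and prefix_length values are illustrative choices, not recommendations:

```json
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "mssge",
        "fuzziness": 2,
        "prefix_length": 1
      }
    }
  }
}
```

mssge is two insertions away from message, so fuzziness has to be 2 here, and prefix_length 1 means the first letter must match exactly. Fuzziness tops out at 2, so a form like shmp, which is three insertions away from shampoo, would not be reached this way.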

