PostgreSQL "LIKE" alternative: Custom Tokenizer for both way edgeNGram?

Aldo_Sarmiento · June 1, 2017, 7:24pm

In PostgreSQL there's a LIKE clause where you can wildcard a search term like WHERE content LIKE '%blue%'.

This will match any record where the content column contains "blue".

With Elasticsearch, I was thinking I could get away with something less complex: checking whether any word either starts or ends with the non-analyzed search term "blue".

I started looking into edgeNGram, which is really good for front-side, autocomplete-like searching. That covers finding a word that starts with "blue", but not a word that ends with "blue".

e.g.: edgeNGram for term "buy blueshield":
['b', 'bu', 'buy', 'b', 'bl', 'blu', 'blue', 'blues', 'bluesh', 'blueshi', 'blueshie', 'blueshiel', 'blueshield']

So searching for the non-analyzed term "blue" would indeed match here. But what if "blue" came at the end of the word instead, as in "shieldblue"?

The term "buy shieldblue" would tokenize to:
['b', 'bu', 'buy', 's', 'sh', 'shi', 'shie', 'shiel', 'shield', 'shieldb', 'shieldbl', 'shieldblu', 'shieldblue']

This wouldn't be a hit for the non-analyzed search term "blue".
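
The two token lists above can be reproduced with a small pure-Python sketch. It only imitates what an edge n-gram filter does after splitting on whitespace. The function names are made up for illustration, and this is not Elasticsearch's implementation.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Return the leading-edge n-grams of each whitespace-separated word."""
    grams = []
    for word in text.lower().split():
        upper = min(len(word), max_gram)
        for size in range(min_gram, upper + 1):
            grams.append(word[:size])
    return grams


def matches_unanalyzed(term, text):
    """True if the raw (non-analyzed) term equals one of the indexed grams."""
    return term in edge_ngrams(text)


print(edge_ngrams("buy blueshield"))
print(matches_unanalyzed("blue", "buy blueshield"))  # "blue" starts a word
print(matches_unanalyzed("blue", "buy shieldblue"))  # "blue" ends a word
```

The second check fails because every gram is anchored at the start of its word, so no gram ever begins in the middle of "shieldblue".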

So I'm assuming that in order to achieve this I'd have to write my own tokenizer? If so, do I have to do this in Java? Or is there a way to do it via the _settings API?

dakrone · June 1, 2017, 10:30pm

It sounds like you want the ngram token filter (there is also an ngram tokenizer, but I recommend the token filter). It breaks tokens apart like edgeNGram does, but is not limited to one edge of the token. In your case you'd get (with bigrams): [bu, uy] and [bl, lu, ue, es, sh, hi, ie, el, ld] for "buy blueshield", assuming you tokenized on the space character.

https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-ngram-tokenizer.html
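
As a rough illustration of the bigram output described above, here is a minimal Python sketch. It assumes whitespace tokenization followed by fixed-size n-grams; `ngrams` is a hypothetical helper, not an Elasticsearch API.

```python
def ngrams(text, size=2):
    """Return every contiguous n-gram of each whitespace-separated word."""
    grams = []
    for word in text.lower().split():
        # A word shorter than `size` produces no grams at all.
        for start in range(len(word) - size + 1):
            grams.append(word[start:start + size])
    return grams


print(ngrams("buy blueshield", size=2))
```

Unlike edge n-grams, the grams here start at every position in the word, which is what makes a match in the middle or at the end of a word possible.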

Aldo_Sarmiento · June 2, 2017, 9:59pm

In that case though, if I did a search for "blue", the search term would be tokenized too: "blue" => [bl, lu, ue]

Then it would also match "blackshield", since they have "bl" in common, no?

I wanted the search term to be non-analyzed, so that it is looked up as a whole against the edgeNGram tokens.

dakrone · June 2, 2017, 10:27pm

That's correct, though blueshield would score higher than blackshield because it has more in common.

You could alleviate this by using trigrams instead of bigrams, so blueshield and blackshield become:

[blu, lue, ues, esh, shi, hie, iel, eld]
[bla, lac, ack, cks, ksh, shi, hie, iel, eld]

And then blue would be [blu, lue] and only match blueshield.
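
To make the bigram versus trigram difference concrete, here is a small Python sketch that lists which grams of the query also occur in each indexed word. The helper names are hypothetical, and real scoring in Elasticsearch involves more than counting shared grams.

```python
def ngrams(word, size):
    """Return every contiguous n-gram of a single word."""
    word = word.lower()
    return [word[i:i + size] for i in range(len(word) - size + 1)]


def shared_grams(query, indexed_word, size):
    """Grams of the query that also occur among the indexed word's grams."""
    indexed = set(ngrams(indexed_word, size))
    return [gram for gram in ngrams(query, size) if gram in indexed]


for size in (2, 3):
    for word in ("blueshield", "blackshield"):
        print(size, word, shared_grams("blue", word, size))
```

With bigrams, "blackshield" still shares "bl" with the query. With trigrams it shares nothing, while "blueshield" shares both "blu" and "lue".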

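You could also do this with regular ngrams that cover a range of sizes. The only issue is that the search term might be longer than the largest gram you index, in which case it can't match as a single token. Either way, I would recommend an ngram approach over an edgengram approach here (see example above).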

Aldo_Sarmiento · June 2, 2017, 10:34pm

I see. Yeah, there were other cases where bigrams and trigrams were producing some pretty off results. For instance, when searching for "martin", an actual "martin" match scored lower than a body of text that had "martinez" twice.

Also, people will be using this to look for phone numbers, and splitting the search term into grams gave useless results there as well.

Search term: 949531531 => [949 495 953 .... 531]

So anyone whose phone number contains some of those grams would show up. Ultimately, what I'm saying is that the entire search term needs to be taken into consideration, not broken up. Order and size of the search term matter a lot in this case.
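
A quick Python sketch of the phone number problem: the search term is the one from this post, while the second number is made up purely for illustration. It shows two unrelated numbers sharing trigrams even though the full search term appears nowhere in the other number.

```python
def trigrams(text):
    """Return every contiguous 3-character gram of the text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


QUERY = "949531531"          # search term from the post above
OTHER_NUMBER = "7145319490"  # hypothetical, unrelated phone number

# Grams that the query and the unrelated number have in common.
shared = sorted(set(trigrams(QUERY)) & set(trigrams(OTHER_NUMBER)))

print(trigrams(QUERY))
print(shared)
print(QUERY in OTHER_NUMBER)  # whole-term containment check
```

Any gram-level match that accepts a partial overlap would return the unrelated number here, whereas comparing the whole term would not.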

dakrone · June 2, 2017, 11:08pm

I would recommend analyzing the field multiple ways (using a multi-field). At query time you can then choose to search either the regular or the ngram version, or both. You can also boost one particular field (for instance, the non-ngram version of the field) so that exact matches score higher than ngram matches. For phone numbers you can choose to skip searching the ngram version of the field entirely.
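
One way the multi-field idea could look, sketched as plain Python dicts that would be sent as JSON request bodies. Everything named here is an assumption for illustration: the mapping type `doc`, the field `content`, the sub-field `ngram`, the names `trigram_filter` and `trigram_analyzer`, and the boost of 3. The mapping uses the 5.x layout with a type name, which later major versions dropped. It has not been run against a cluster.

```python
import json

# Index body: a trigram analyzer plus a multi-field on "content".
INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": {
                "trigram_filter": {"type": "ngram", "min_gram": 3, "max_gram": 3}
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "trigram_filter"],
                }
            },
        }
    },
    "mappings": {
        "doc": {  # hypothetical type name (5.x-style mapping)
            "properties": {
                "content": {
                    "type": "text",  # regular analyzed version
                    "fields": {
                        "ngram": {"type": "text", "analyzer": "trigram_analyzer"}
                    },
                }
            }
        }
    },
}


def build_query(term, use_ngrams=True, exact_boost=3):
    """Build a multi_match body; the regular field is boosted over the ngram one."""
    fields = ["content^{}".format(exact_boost)]
    if use_ngrams:
        fields.append("content.ngram")
    return {"query": {"multi_match": {"query": term, "fields": fields}}}


print(json.dumps(build_query("blue"), indent=2))
# Phone numbers: skip the ngram sub-field entirely.
print(json.dumps(build_query("949531531", use_ngrams=False), indent=2))
```

Whether a term looks like a phone number is left to the caller here; the `use_ngrams` flag only controls which fields end up in the query.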
