Proximity and "OR" or equivalency searching

We're migrating from Oracle's text search using CONTAINS to Elastic, and one search commonly performed by our "power users" is a NEAR (proximity) search in conjunction with an EQUIValence (=) operator.

Equivalence operators basically let users define "synonyms" at runtime. They act sort of like an 'OR', but the whole expression is treated as a single term. So something like:

apple=orange=fruit

Would match on all three terms equally.

By itself it's not much more than an OR, but within a proximity (NEAR) search in Oracle you can do things like:

NEAR((apple=orange=fruit, smoothie=shake=milkshake), 3)

and the above would do a proximity search where apple, orange, or fruit appears within a span of three terms of smoothie, shake, or milkshake.

So far, if I use query string query syntax, the only way I could see to search the above would be something like:

"apple smoothie"~3 "orange smoothie"~3 "fruit smoothie"~3 "apple shake"~3 ... etc.
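To show how quickly that expansion grows, here is a small Python sketch that builds the full query string from the two groups of equivalent terms. The helper name `expand_near` is made up for illustration and is not part of any Elastic client.

```python
from itertools import product


def expand_near(groups, slop):
    """Build one sloppy-phrase clause per combination of equivalent terms."""
    clauses = []
    for combo in product(*groups):
        phrase = " ".join(combo)
        clauses.append(f'"{phrase}"~{slop}')
    # Clauses are joined with spaces, as in the hand-written query above.
    return " ".join(clauses)


fruits = ["apple", "orange", "fruit"]
drinks = ["smoothie", "shake", "milkshake"]

query = expand_near([fruits, drinks], 3)
print(query)
```

Two groups of three terms already produce nine clauses, and every extra synonym or extra group multiplies that count.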

If we employed a synonym token filter for the above, I assume (?) it could work the same, but is there any way this can be done at runtime? If not, perhaps it's something worthwhile to add?

Yep! You can index your data using one analyzer, then search it using a different analyzer. One of the best use-cases for that functionality is synonyms because most people don't want to actually index the synonyms, just match them at query time.

Here is a quick demo I whipped up. Basically you create an index with a synonym analyzer, index docs regularly, then use a match-phrase query with slop of 3 and the new synonym analyzer.
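As a rough sketch of that kind of setup (not the demo itself), the request bodies could look like the Python dicts below. The filter name `fruit_drink_synonyms`, the analyzer name `synonym_search`, and the field name `body` are assumptions chosen for illustration.

```python
import json

# Index settings: a custom analyzer that expands synonyms.
# All names here are hypothetical.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "fruit_drink_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "apple, orange, fruit",
                        "smoothie, shake, milkshake",
                    ],
                }
            },
            "analyzer": {
                "synonym_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "fruit_drink_synonyms"],
                }
            },
        }
    }
}

# Search body: match_phrase with a slop of 3, analyzed at query time
# with the synonym analyzer instead of the field's index analyzer.
search_body = {
    "query": {
        "match_phrase": {
            "body": {
                "query": "apple smoothie",
                "slop": 3,
                "analyzer": "synonym_search",
            }
        }
    }
}

print(json.dumps(search_body, indent=2))
```

Note that in this sketch the synonym list is part of the index settings rather than the query itself, so the documents are indexed without synonyms but the groups are still defined ahead of time.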


I don't think you can solve this issue with the query_string syntax. You would have to use the query dsl and the span_or and span_near queries.
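A minimal sketch of how the NEAR/EQUIV example might map onto those queries, built as a Python dict. The helper names `span_or` and `near` and the field name `body` are hypothetical, and Oracle's distance semantics may not line up exactly with `slop`.

```python
import json


def span_or(field, terms):
    """One span_or clause that matches any of the equivalent terms."""
    return {
        "span_or": {
            "clauses": [{"span_term": {field: term}} for term in terms]
        }
    }


def near(field, groups, slop, in_order=False):
    """span_near over one span_or per group of equivalent terms."""
    return {
        "span_near": {
            "clauses": [span_or(field, group) for group in groups],
            "slop": slop,
            "in_order": in_order,
        }
    }


span_query = near(
    "body",
    [["apple", "orange", "fruit"], ["smoothie", "shake", "milkshake"]],
    slop=3,
)
print(json.dumps({"query": span_query}, indent=2))
```

Because `span_term` is not analyzed, the terms have to be supplied in the form they were indexed in (lowercased, for example). The groups are given per request, so nothing has to be configured in the index settings beforehand.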


awesome thanks man! I'll check it out

Ah, @jpountz brings up a good point I forgot about: phrase matching with slop doesn't guarantee ordering. So "smoothie fruit" can match as well as "fruit smoothie", since slop only counts how many position moves are needed to line the terms up (swapping two adjacent terms costs 2).

If you need order-dependent, sloppy phrase matching (e.g. "fruit" must come before "smoothie"), you'll probably need the span family like he mentioned. Or you could index 3-word shingles and search those instead.

Order generally doesn't matter for our purposes, and we've got a 1-3 shingle setup for some of the relevant fields.

Part of this is getting our users to use the system more appropriately. I think our power users will find that there are better ways to actually get at the data they want. The NEAR + EQUIV searches they've been doing with the Oracle search are, IMHO, more of a workaround for the limitations of the indexing therein.

Either way this is informative, thanks all!