Workaround for using wildcards in phrase and proximity searches (Elasticsearch)


(saud.rehman) #1

Problem:

Recently I wanted to do a proximity search on an Elasticsearch index. I wanted to find all docs where ‘measles’ and ‘vaccin*’ occur within 25 positions of each other (slop is counted in token positions, i.e. words, not characters). I also wanted the two terms to appear in that order.

The built-in Elasticsearch proximity search wasn’t an option, for two reasons.

  1. Proximity search doesn’t support wildcards, e.g. (“measles vaccine”)~25 is supported, but (“measles vacci*”)~25 and (“measle* vacci*”) are not.

  2. Proximity search doesn’t respect the order of the words in the phrase, e.g. (“measles vaccine”)~25 and (“vaccine measles”)~25 give the same results.

Solution:
A few examples that work around this using span_near:

  1. (“measles vacci*”)~25
    {
      "query": {
        "span_near": {
          "clauses": [
            {
              "span_or": {
                "clauses": [
                  { "span_term": { "text": "measles" } }
                ]
              }
            },
            {
              "span_or": {
                "clauses": [
                  {
                    "span_multi": {
                      "match": {
                        "prefix": { "text": { "value": "vacci" } }
                      }
                    }
                  }
                ]
              }
            }
          ],
          "slop": 25,
          "in_order": "true",
          "collect_payloads": "true"
        }
      }
    }

// in_order can be used to toggle between ordered or unordered.
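Writing this JSON by hand gets tedious, so the first example can be generated from a token list. The Python sketch below is only an illustration under a few assumptions: the helper names `span_clause` and `span_near` are hypothetical, the field is called `text` as in the example, a trailing `*` marks a prefix wildcard, and the single-clause span_or wrappers and collect_payloads are left out to keep the output short.

```python
import json


def span_clause(field, token):
    """Build one span clause: a prefix span for 'tok*', a span_term otherwise."""
    if token.endswith("*"):
        # span_multi wraps a multi-term query (here: prefix) so it can be used as a span
        return {"span_multi": {"match": {"prefix": {field: {"value": token[:-1]}}}}}
    return {"span_term": {field: token}}


def span_near(field, tokens, slop, in_order=True):
    """Build a span_near query body from a list of tokens."""
    return {
        "query": {
            "span_near": {
                "clauses": [span_clause(field, t) for t in tokens],
                "slop": slop,
                "in_order": in_order,
            }
        }
    }


# Equivalent of ("measles vacci*")~25 with the word order enforced
query = span_near("text", ["measles", "vacci*"], slop=25)
print(json.dumps(query, indent=2))
```

Passing `in_order=False` to the helper gives the unordered variant of the same query.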

  2. “measle* vacci*”
    {
      "query": {
        "span_near": {
          "clauses": [
            {
              "span_or": {
                "clauses": [
                  {
                    "span_multi": {
                      "match": {
                        "prefix": { "text": { "value": "measle" } }
                      }
                    }
                  }
                ]
              }
            },
            {
              "span_or": {
                "clauses": [
                  {
                    "span_multi": {
                      "match": {
                        "prefix": { "text": { "value": "vacci" } }
                      }
                    }
                  }
                ]
              }
            }
          ],
          "slop": 0,
          "in_order": "true",
          "collect_payloads": "true"
        }
      }
    }

  3. Grouping. Now let’s assume you want to find all docs where (canada OR toronto OR “north york”) NEAR (measles OR vaccin*), with the two groups within 30 positions of each other. span_term values are not analyzed, so they are written in lowercase here on the assumption that the field’s analyzer lowercases tokens.
    {
      "query": {
        "span_near": {
          "clauses": [
            {
              "span_or": {
                "clauses": [
                  {
                    "span_near": {
                      "clauses": [
                        { "span_term": { "text": "north" } },
                        { "span_term": { "text": "york" } }
                      ],
                      "slop": 0,
                      "in_order": "true",
                      "collect_payloads": "true"
                    }
                  },
                  { "span_term": { "text": "toronto" } },
                  { "span_term": { "text": "canada" } }
                ]
              }
            },
            {
              "span_or": {
                "clauses": [
                  { "span_term": { "text": "measles" } },
                  {
                    "span_multi": {
                      "match": {
                        "prefix": { "text": { "value": "vaccin" } }
                      }
                    }
                  }
                ]
              }
            }
          ],
          "slop": 30,
          "in_order": "false",
          "collect_payloads": "true"
        }
      }
    }
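The grouping pattern is regular enough to generate: each group becomes a span_or, and each multi-word alternative becomes an exact-order span_near with slop 0. The following Python sketch shows one way this might be done. The function names and the `FIELD` constant are hypothetical, collect_payloads is omitted, and the code lowercases input words on the assumption that the field's analyzer lowercases tokens.

```python
import json

FIELD = "text"  # assumption: the field name used in the examples above


def word_span(word):
    """A trailing '*' becomes a prefix span; anything else is a span_term."""
    if word.endswith("*"):
        return {"span_multi": {"match": {"prefix": {FIELD: {"value": word[:-1]}}}}}
    return {"span_term": {FIELD: word}}


def alternative_span(text):
    """One alternative of a group: a single word, or a phrase matched in exact order."""
    # Lowercasing is an assumption about the analyzer; span_term is not analyzed.
    spans = [word_span(w) for w in text.lower().split()]
    if len(spans) == 1:
        return spans[0]
    return {"span_near": {"clauses": spans, "slop": 0, "in_order": True}}


def group_span(alternatives):
    """An OR group: any one of the alternatives may match."""
    return {"span_or": {"clauses": [alternative_span(a) for a in alternatives]}}


def near_groups(left, right, slop, in_order=False):
    """Two OR groups that must occur within `slop` positions of each other."""
    return {
        "query": {
            "span_near": {
                "clauses": [group_span(left), group_span(right)],
                "slop": slop,
                "in_order": in_order,
            }
        }
    }


query = near_groups(["North york", "toronto", "canada"], ["measles", "vaccin*"], slop=30)
print(json.dumps(query, indent=2))
```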

If anyone knows a better solution than this one, please comment. I would also welcome suggestions on how to build a parser that takes a user query, e.g. (quick AND near(foxes OR rats, toronto OR ontario, 30)), and converts it to an Elasticsearch span_near query using the workaround above. Handling boolean operators and parenthesis precedence is the part I am finding hard. Is there an open source PHP library that can help me convert user-written queries with parentheses and boolean operators into ES filters?
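One possible starting point for the precedence problem is a small recursive-descent parser, in which AND binds tighter than OR and parentheses restart at the lowest-precedence rule. The sketch below is in Python rather than PHP and is an illustration, not an existing library. The grammar, the `near(group, group, slop)` syntax, the `FIELD` constant and all function names are assumptions. Plain words outside `near(...)` are mapped to match queries, AND to bool/must and OR to bool/should.

```python
import re

FIELD = "text"  # assumption: the field name used in the examples above


def tokenize(s):
    # Quoted phrases, single punctuation characters, or bare words
    return re.findall(r'"[^"]*"|[(),]|[^\s(),"]+', s)


def word_span(word):
    if word.endswith("*"):
        return {"span_multi": {"match": {"prefix": {FIELD: {"value": word[:-1]}}}}}
    return {"span_term": {FIELD: word}}


class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected or 'a token'!r}, got {tok!r}")
        self.i += 1
        return tok

    def parse(self):
        node = self.or_expr()
        if self.peek() is not None:
            raise ValueError(f"unexpected token {self.peek()!r}")
        return node

    def or_expr(self):  # lowest precedence
        parts = [self.and_expr()]
        while self.peek() == "OR":
            self.take()
            parts.append(self.and_expr())
        return parts[0] if len(parts) == 1 else {"bool": {"should": parts}}

    def and_expr(self):  # binds tighter than OR
        parts = [self.atom()]
        while self.peek() == "AND":
            self.take()
            parts.append(self.atom())
        return parts[0] if len(parts) == 1 else {"bool": {"must": parts}}

    def atom(self):
        tok = self.take()
        if tok == "(":
            node = self.or_expr()
            self.take(")")
            return node
        if tok.lower() == "near" and self.peek() == "(":
            self.take("(")
            left = self.span_group()
            self.take(",")
            right = self.span_group()
            self.take(",")
            slop = int(self.take())
            self.take(")")
            return {"span_near": {"clauses": [left, right],
                                  "slop": slop, "in_order": False}}
        if tok in (")", ",", "AND", "OR"):
            raise ValueError(f"unexpected token {tok!r}")
        if tok.startswith('"'):
            return {"match_phrase": {FIELD: tok.strip('"')}}
        return {"match": {FIELD: tok}}

    def span_group(self):  # word-or-phrase alternatives joined by OR
        alts = [self.span_item()]
        while self.peek() == "OR":
            self.take()
            alts.append(self.span_item())
        return alts[0] if len(alts) == 1 else {"span_or": {"clauses": alts}}

    def span_item(self):
        tok = self.take()
        # Lowercasing assumes the field's analyzer lowercases tokens
        words = tok.strip('"').lower().split()
        if tok in ("(", ")", ",") or not words:
            raise ValueError(f"expected a word or phrase, got {tok!r}")
        spans = [word_span(w) for w in words]
        if len(spans) == 1:
            return spans[0]
        return {"span_near": {"clauses": spans, "slop": 0, "in_order": True}}


def to_es(query_string):
    return {"query": Parser(tokenize(query_string)).parse()}


print(to_es("(quick AND near(foxes OR rats, toronto OR ontario, 30))"))
```

Each grammar rule is one method, so adding another operator (NOT, for instance) would mean adding one more precedence level between `and_expr` and `atom`.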

@shay banon, @steven, @uri: are there any plans to support such operators in query_string?

References:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-span-near-query.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-span-multi-term-query.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-span-or-query.html

