Marking the missing words in search results


(Ibrahim Tasyurt) #1

Hello,
I want to implement a feature to show missing search terms in search results similar to Google's.
Is there a proper way to do this in Elasticsearch? We are currently using Elasticsearch 1.5.2.

Thank you.

An example from Google is below, where "potatoes" is missing.


(Isabel Drost-Fromm) #2

You should have a look at the match query, in particular the minimum_should_match option:

https://www.elastic.co/guide/en/elasticsearch/reference/5.x/query-dsl-match-query.html

By the way - what's the reason for still being on 1.5.2?


(David Pilato) #3

The only thing I can think of is to do the following on the client:

Let's say you search for elasticsearch potatoes:

  • Split the text into tokens (you can use the _analyze endpoint for that).
  • For each generated token (here: elasticsearch and potatoes), generate a should clause using named queries.
  • In the response, you get for each doc the list of queries which matched, so on the client side you can work out which queries did not match.
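
As a rough illustration, the steps above could be sketched on the client in Python like this. The helper names (tokens_from_analyze, build_named_should_query) are made up for this sketch, and sending the HTTP requests is left out; only the request and response shapes are taken from this thread.

```python
from typing import Any, Dict, List


def tokens_from_analyze(analyze_response: Dict[str, Any]) -> List[str]:
    """Pull the token strings out of an _analyze response body."""
    return [entry["token"] for entry in analyze_response["tokens"]]


def build_named_should_query(field: str, tokens: List[str]) -> Dict[str, Any]:
    """Build a bool query with one named match clause per token.

    Each clause is named after its token, so the response's
    matched_queries list tells us which tokens matched a given doc.
    """
    should = [
        {"match": {field: {"query": token, "_name": token}}}
        for token in tokens
    ]
    return {"query": {"bool": {"should": should}}}


# Hypothetical, trimmed-down _analyze response for "elasticsearch potatoes".
analyze_response = {
    "tokens": [
        {"token": "elasticsearch", "position": 0},
        {"token": "potatoes", "position": 1},
    ]
}

tokens = tokens_from_analyze(analyze_response)
search_body = build_named_should_query("content", tokens)
```

The resulting search_body is what would be sent to the _search endpoint.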

Full example

Tested on 5.0.1

Index some data

DELETE index
POST index/doc
{
  "content": "elasticsearch you know for search"
}
POST index/doc
{
  "content": "potatoes you know for lunch"
}

Analyze the text

GET _analyze
{
  "text": "elasticsearch potatoes"
}

It gives back 2 tokens:

{
  "tokens": [
    {
      "token": "elasticsearch",
      "start_offset": 0,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "potatoes",
      "start_offset": 14,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Create the search

Iterate over the tokens and create the should clause for them:

GET index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "content": {
              "query": "elasticsearch",
              "_name": "elasticsearch"
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "potatoes",
              "_name": "potatoes"
            }
          }
        }
      ]
    }
  }
}

It gives:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2824934,
    "hits": [
      {
        "_index": "index",
        "_type": "doc",
        "_id": "AViRm-3jWVDw7QDfjwKn",
        "_score": 0.2824934,
        "_source": {
          "content": "elasticsearch you know for search"
        },
        "matched_queries": [
          "elasticsearch"
        ]
      },
      {
        "_index": "index",
        "_type": "doc",
        "_id": "AViRm_VGWVDw7QDfjwKo",
        "_score": 0.2824934,
        "_source": {
          "content": "potatoes you know for lunch"
        },
        "matched_queries": [
          "potatoes"
        ]
      }
    ]
  }
}

So you know that for the first doc, potatoes is missing and for the second doc elasticsearch is missing.
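
A minimal sketch of that client-side step in Python, assuming each named query was named after its search term. The function name missing_terms is hypothetical, and the hits are trimmed copies of the response above.

```python
from typing import Any, Dict, List


def missing_terms(terms: List[str], hit: Dict[str, Any]) -> List[str]:
    """Return the search terms whose named query did not match this hit."""
    matched = set(hit.get("matched_queries", []))
    return [term for term in terms if term not in matched]


terms = ["elasticsearch", "potatoes"]

# Trimmed copies of the two hits from the response above.
hits = [
    {
        "_source": {"content": "elasticsearch you know for search"},
        "matched_queries": ["elasticsearch"],
    },
    {
        "_source": {"content": "potatoes you know for lunch"},
        "matched_queries": ["potatoes"],
    },
]

for hit in hits:
    print(hit["_source"]["content"], "-> missing:", missing_terms(terms, hit))
```

A hit without a matched_queries key is treated as matching no named query, so every term is reported as missing for it.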

Using a template

You can also use a template like this:

GET index/_search/template
{
  "inline": "{\"query\":{\"bool\":{\"should\":[{{#term}}{\"match\":{\"content\":{\"query\": \"{{.}}\",\"_name\": \"{{.}}\"}}},{{/term}}{}]}}}",
  "params": {
    "term": [ "elasticsearch", "potatoes" ]
  }
}

The template part is actually the following, but it needs to be wrapped in a string here because it is not valid JSON otherwise. The trailing {} is there to absorb the comma emitted after the last match clause:

{
  "query": {
    "bool": {
      "should": [
        {{#term}}
        {
          "match": {
            "content": {
              "query": "{{.}}",
              "_name": "{{.}}"
            }
          }
        },
        {{/term}}
        {}
      ]
    }
  }
}
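
If you build the request from code, a JSON library can do the string escaping for you. Here is a small Python sketch of that idea; build_template_request is a hypothetical helper, and the request shape ("inline" plus "params") is simply copied from the 5.0.1 example above.

```python
import json
from typing import Any, Dict, List

# The Mustache template as plain text. It is not valid JSON by itself,
# so it travels as a string inside the request body.
TEMPLATE = (
    '{"query":{"bool":{"should":['
    '{{#term}}{"match":{"content":{"query":"{{.}}","_name":"{{.}}"}}},{{/term}}'
    '{}]}}}'
)


def build_template_request(terms: List[str]) -> Dict[str, Any]:
    """Build the body for GET index/_search/template."""
    return {"inline": TEMPLATE, "params": {"term": terms}}


# json.dumps escapes the quotes inside the template string.
body = json.dumps(build_template_request(["elasticsearch", "potatoes"]))
```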

I hope this helps.


(Ibrahim Tasyurt) #4

Thank you,
We're using 1.5.2 since the system is already live and we have some other blockers (plugin migration etc.) before we upgrade to a new version.

Let me give more context: I don't want to exclude results that match fewer than some minimum number of terms, but to mark the missing words for each field. You can think of it as a sort of inverse of highlighting.


(David Pilato) #5

I just updated my answer to show you also how you can use search templates to make that even easier to use.


(Ibrahim Tasyurt) #7

Thank you very much,
your answer and matched_queries look like an option.
Some concerns we have:

  1. We use a custom analyzer with a number of synonyms like 'javascript=>js', so we're going to need the original terms (before analysis).
  2. Adding other clauses may affect our score calculation, but I think it's something that can be easily handled.

Btw, does implementing a server-side solution (like a plugin) make sense here, or is it a bit too heavy for this use case?

Thank you for your help.
ibrahim--


(David Pilato) #8

Yeah, maybe. Or use the same custom analyzer in the _analyze API, but then use a term query instead of a match query, and you'll have the same behavior.
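
One possible way to keep the original terms, sketched in Python: analyze each original word separately with the custom analyzer, then name every term clause after the original word rather than the analyzed token. This is only an assumption about how the pieces could fit together; build_named_term_query and the analyzed mapping below are hypothetical, and the _analyze calls themselves are not shown.

```python
from typing import Any, Dict, List


def build_named_term_query(
    field: str, analyzed: Dict[str, List[str]]
) -> Dict[str, Any]:
    """Build a bool/should query of term clauses named after original words.

    `analyzed` maps each original search word to the tokens the custom
    analyzer produced for it (for example via one _analyze call per word).
    """
    should = []
    for original, tokens in analyzed.items():
        for token in tokens:
            # Query the analyzed token, but report the original word.
            should.append({"term": {field: {"value": token, "_name": original}}})
    return {"query": {"bool": {"should": should}}}


# Assumed analyzer output, with the synonym rule 'javascript=>js' applied.
analyzed = {"javascript": ["js"], "potatoes": ["potatoes"]}
search_body = build_named_term_query("content", analyzed)
```

If one word analyzes to several tokens, all of its clauses share the same name in this sketch, so the word counts as present when any of them matches.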

You can, but the problem with plugins is that they are tied to the Elasticsearch version, so you will need to update the plugin any time you upgrade Elasticsearch. Using only the REST endpoint would be easier to maintain IMO.


(Ibrahim Tasyurt) #9

Thanks. I'll give it a try.

