Proximity between multiple (boolean) queries


#1

I have a bool query that retrieves documents containing two specific keywords/phrases ("keyword one" and "keyword two"):

        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"main_text": "keyword one"}},
                    {"match_phrase": {"main_text": "keyword two"}}
                ]
            }
        }

Is it possible to add proximity between "keyword one" and "keyword two"? For example, given these documents:

Document 1: this document contains both keyword one and keyword two.
Document 2: this document contains keyword one as well as other keywords, including keyword two.
Document 3: this documents contains keywords. Keyword one refers to a topic that is very interesting. One should definitely look into it. Keyword two is boring.

I need to retrieve all three documents because they contain both phrases, but I want documents where the two phrases are closer together to score higher; in this case document 1 > document 2 > document 3. Essentially a kind of "slop" between the two queries rather than within each one (because "keyword one" should be matched exactly, and likewise "keyword two").

Cheers


(Christoph) #2

Hi,

I was going to suggest taking a look at Span Queries, but since they are a bit messy to use, I thought of another approach, and it seems to work at least for the simple test case you mentioned:

PUT /test/t/1
{
  "main_text" : "this document contains both keyword one and keyword two"
}

PUT /test/t/2
{
  "main_text" : "this document contains keyword one as well as other keywords, including keyword two"
}

PUT /test/t/3
{
  "main_text" : "this documents contains keywords. Keyword one refers to a topic that is very interesting. One should definitely look into it. Keyword two is boring."
}

PUT /test/t/4
{
  "main_text" : "this keyword is one and another keyword is two, but I don't want to see them both"
}

GET /test/t/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match_phrase": {
            "main_text": "keyword one"
          }
        },
        {
          "match_phrase": {
            "main_text": "keyword two"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "main_text": {
              "query": "keyword one keyword two",
              "slop": 100
            }
          }
        }
      ]
    }
  }
}

The filter part ensures that you only get back documents with both keywords, but filter clauses do not contribute to the score, so on their own those documents would all score the same. The should clause is supposed to increase the score for documents where the contained terms are closer together within a certain slop distance. I don't know how much slop is acceptable in your case.
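For comparison, here is a rough sketch of what the Span Query route could look like. It is untested against your data and assumes main_text uses the standard analyzer, so the span_term values have to be the lowercased tokens. Each inner span_near (slop 0, in order) stands for one exact phrase, and the outer span_near allows up to 100 positions between the two phrases in either order; the 100 is just as arbitrary as the slop above. Span matches that sit closer together should generally score higher, but check the actual ranking on real documents.

GET /test/t/_search
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_near": {
            "clauses": [
              { "span_term": { "main_text": "keyword" } },
              { "span_term": { "main_text": "one" } }
            ],
            "slop": 0,
            "in_order": true
          }
        },
        {
          "span_near": {
            "clauses": [
              { "span_term": { "main_text": "keyword" } },
              { "span_term": { "main_text": "two" } }
            ],
            "slop": 0,
            "in_order": true
          }
        }
      ],
      "slop": 100,
      "in_order": false
    }
  }
}

Unlike the filter/should version, this only returns documents where the two phrases are within the outer slop of each other, so documents with a larger gap would drop out entirely unless you combine it with the filter clauses and use it as a should clause instead.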

This is somewhat of a quick hack, but it might get you somewhere. Keep in mind that real-world examples might be much messier than the three toy documents mentioned here (e.g. what happens if there are multiple occurrences of the keywords throughout the text? Which distance do you want to apply then?). Also, phrase matching with slop comes with a certain performance cost. If you really want this to run as fast as possible, you should think about shingling your text (so you can match e.g. bigrams directly). Using the "distance scoring" part of the query only in the rescoring phase might also help in terms of performance cost.
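A minimal sketch of the rescoring idea, again untested on real data: the main query keeps only the two filter clauses, and the sloppy phrase match moves into a rescore block so it is evaluated only for the top hits of each shard. The window_size of 50 and both weights are placeholder values I picked for illustration, not recommendations.

GET /test/t/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match_phrase": { "main_text": "keyword one" } },
        { "match_phrase": { "main_text": "keyword two" } }
      ]
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "main_text": {
            "query": "keyword one keyword two",
            "slop": 100
          }
        }
      },
      "query_weight": 1.0,
      "rescore_query_weight": 1.0
    }
  }
}

Since the filter-only query gives every hit the same score, which documents land inside the rescore window is essentially arbitrary once more than window_size documents match per shard, so this trade-off only makes sense if the window comfortably covers the hits you care about.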

