Proximity between multiple (boolean) queries

kno3 · October 19, 2016, 11:12am

I have a bool query that retrieves documents containing two specific keywords/phrases ("keyword one" and "keyword two"):

    "query": {
        "bool": {
            "must": [
                {"match_phrase": {"main_text": "keyword one"}},
                {"match_phrase": {"main_text": "keyword two"}}
            ]
        }
    }

Is it possible to add proximity between "keyword one" and "keyword two"? I.e. given these documents:

Document 1: this document contains both keyword one and keyword two.
Document 2: this document contains keyword one as well as other keywords, including keyword two.
Document 3: this documents contains keywords. Keyword one refers to a topic that is very interesting. One should definitely look into it. Keyword two is boring.

I need to retrieve all three documents because they contain both phrases, but I want to boost documents in which the two phrases are closer to each other; in this case document 1 > document 2 > document 3. Essentially a "slop" between the two queries rather than within each one (because keyword one must be matched exactly, and the same goes for keyword two).

Cheers

cbuescher · October 19, 2016, 2:03pm

Hi,

I was going to suggest taking a look at span queries, but since they are a bit messy to use I tried something else, and it seems to work, at least for the simple test case you mentioned:

PUT /test/t/1
{
  "main_text" : "this document contains both keyword one and keyword two"
}

PUT /test/t/2
{
  "main_text" : "this document contains keyword one as well as other keywords, including keyword two"
}

PUT /test/t/3
{
  "main_text" : "this documents contains keywords. Keyword one refers to a topic that is very interesting. One should definitely look into it. Keyword two is boring."
}

PUT /test/t/4
{
  "main_text" : "this keyword is one and another keyword is two, but I don't want to see them both"
}

GET /test/t/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match_phrase": {
            "main_text": "keyword one"
          }
        },
        {
          "match_phrase": {
            "main_text": "keyword two"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "main_text": {
              "query": "keyword one keyword two",
              "slop": 100
            }
          }
        }
      ]
    }
  }
}

The filter part ensures that you only get back documents containing both phrases; because filter clauses do not contribute to the score, all of these documents start out with the same score. The should clause is then supposed to increase the score for documents in which the terms occur closer together, within the given slop distance. I don't know how large a slop is acceptable in your case.
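For comparison, the span-query route mentioned above could look roughly like the following. This is an untested sketch: it keeps the same filter, and swaps the sloppy phrase in the should clause for a span_near whose two inner span_near clauses (slop 0, in order) stand for the exact phrases. It assumes main_text uses the default standard analyzer, since span_term is not analyzed and the terms therefore have to be given in lowercase. The outer slop of 100 is just as arbitrary as in the query above.

```
# Sketch only: assumes the test index above and a standard-analyzed main_text field
GET /test/t/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match_phrase": { "main_text": "keyword one" } },
        { "match_phrase": { "main_text": "keyword two" } }
      ],
      "should": [
        {
          "span_near": {
            "clauses": [
              {
                "span_near": {
                  "clauses": [
                    { "span_term": { "main_text": "keyword" } },
                    { "span_term": { "main_text": "one" } }
                  ],
                  "slop": 0,
                  "in_order": true
                }
              },
              {
                "span_near": {
                  "clauses": [
                    { "span_term": { "main_text": "keyword" } },
                    { "span_term": { "main_text": "two" } }
                  ],
                  "slop": 0,
                  "in_order": true
                }
              }
            ],
            "slop": 100,
            "in_order": false
          }
        }
      ]
    }
  }
}
```

Unlike the sloppy phrase, this variant measures the distance between the two exact phrases rather than between four loose terms, at the price of a considerably more verbose query.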

This is somewhat of a quick hack, but it might get you somewhere. Keep in mind that real-world examples might be much messier than the three toy documents mentioned here (for example, what happens if there are multiple occurrences of the keywords throughout the text? Which distance do you want to apply then?). Also, phrase matching with slop comes with a certain performance cost. If you really want this to run as fast as possible, you should think about shingling your text (so you can match e.g. bigrams directly). Using the "distance scoring" part of the query only in the rescoring phase might also help in terms of performance cost.
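A rough, untested sketch of the rescoring idea: the main query keeps only the two filters, and the sloppy phrase is moved into a rescore block so that it is evaluated only for the top hits on each shard. The window_size of 50 and both weights are placeholder values chosen for illustration, not recommendations.

```
# Sketch only: window_size and the two weights are assumed placeholder values
GET /test/t/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match_phrase": { "main_text": "keyword one" } },
        { "match_phrase": { "main_text": "keyword two" } }
      ]
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "main_text": {
            "query": "keyword one keyword two",
            "slop": 100
          }
        }
      },
      "query_weight": 1.0,
      "rescore_query_weight": 1.0
    }
  }
}
```

Since the filter-only query gives every hit the same score, which documents fall inside the rescore window is essentially arbitrary once more than window_size documents match, so the window would need to be sized with the expected result count in mind.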
