Return documents that match a minimal number of words in the same sentence

flm · August 5, 2020, 10:13am

I need to return documents that match at least N words in the same sentence.

I split each document into sentences and index each sentence as a separate value of an array field, like so:

PUT /test_index/_doc/id1
{
  "texts": [
    "Your first step is the subject line.",
    "You will have just seconds to gain the full attention of your reader."
  ]
}

and leave position_increment_gap at its default of 100.

Let's say I need to match a minimum of 2 words.
The document should be returned when I search for the terms ("bla", "attention", "reader"), but not for ("bla", "subject", "reader"): "bla" is not in the document, "attention" and "reader" are in the same sentence, and "subject" and "reader" are not.

A bool query with should clauses and minimum_should_match does not work, because it counts matches across the whole field rather than per sentence. This query returns the document when it shouldn't:

GET /test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": {"texts": "subject"}},
        {"term": {"texts": "reader"}}
      ],
      "minimum_should_match": 2
    }
  }
}
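
As a rough illustration of why this happens, here is a plain-Python model (not Elasticsearch code). It assumes that all values of the array end up in one pool of terms for the field, and it uses a simplified tokenizer as a stand-in for the standard analyzer. The function names are made up for this sketch.

```python
import re

def tokenize(text):
    # Simplified stand-in for the standard analyzer (assumption):
    # lowercase the text and split on non-word characters.
    return re.findall(r"\w+", text.lower())

def flat_field_matches(sentences, terms, minimum_should_match):
    # Every array value feeds the same field, so term clauses are
    # counted against the document as a whole, not per sentence.
    field_terms = {tok for s in sentences for tok in tokenize(s)}
    matched = sum(1 for t in terms if t in field_terms)
    return matched >= minimum_should_match

sentences = [
    "Your first step is the subject line.",
    "You will have just seconds to gain the full attention of your reader.",
]

# "subject" and "reader" sit in different sentences, yet both count.
print(flat_field_matches(sentences, ["subject", "reader"], 2))
```

In this model the call prints True, which mirrors the unwanted hit described above.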

So I need a way to combine proximity with minimum_should_match.
Is there a way to achieve that?

abdon · August 6, 2020, 11:42am

The nested type could be a solution here. By indexing every sentence as a nested object, you can query these sentences independently.

First, you set up a nested type in your index's mapping:

PUT test_index
{
  "mappings": {
    "properties": {
      "texts": {
        "type": "nested"
      }
    }
  }
}

Next, you index your document, using a slightly different structure:

PUT /test_index/_doc/id1
{
  "texts": [
    {
      "text": "Your first step is the subject line."
    },
    {
      "text": "You will have just seconds to gain the full attention of your reader."
    }
  ]
}
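
If the sentences are already available as a list of strings, the request body can be generated instead of written by hand. This is only a small sketch in Python; the helper name to_nested_doc is hypothetical, and sending the body to Elasticsearch is left out.

```python
import json

def to_nested_doc(sentences):
    # Wrap each sentence in its own object so that it is indexed
    # as a separate nested document under "texts".
    return {"texts": [{"text": s} for s in sentences]}

doc = to_nested_doc([
    "Your first step is the subject line.",
    "You will have just seconds to gain the full attention of your reader.",
])

# Body for PUT /test_index/_doc/id1
print(json.dumps(doc, indent=2))
```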

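Now you can use the nested query to get the desired results. The first search below returns the document, because "attention" and "reader" occur in the same sentence. The second returns no hits, because "subject" and "reader" occur in different sentences: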

GET /test_index/_search
{
  "query": {
    "nested": {
      "path": "texts",
      "query": {
        "match": {
          "texts.text": {
            "query": "bla attention reader",
            "minimum_should_match": 2
          }
        }
      }
    }
  }
}

GET /test_index/_search
{
  "query": {
    "nested": {
      "path": "texts",
      "query": {
        "match": {
          "texts.text": {
            "query": "bla subject reader",
            "minimum_should_match": 2
          }
        }
      }
    }
  }
}
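
To make the per-sentence behaviour concrete, here is a plain-Python model of the matching logic (not Elasticsearch code). It assumes that each nested object is evaluated on its own and that the parent document matches as soon as one sentence satisfies minimum_should_match. The tokenizer is a simplified stand-in for the standard analyzer, and the function names are made up for this sketch.

```python
import re

def tokenize(text):
    # Simplified stand-in for the standard analyzer (assumption):
    # lowercase the text and split on non-word characters.
    return re.findall(r"\w+", text.lower())

def nested_matches(sentences, query, minimum_should_match):
    # The document matches if any single sentence contains at least
    # minimum_should_match of the distinct query terms.
    query_terms = set(tokenize(query))
    for sentence in sentences:
        sentence_terms = set(tokenize(sentence))
        if len(query_terms & sentence_terms) >= minimum_should_match:
            return True
    return False

sentences = [
    "Your first step is the subject line.",
    "You will have just seconds to gain the full attention of your reader.",
]

print(nested_matches(sentences, "bla attention reader", 2))
print(nested_matches(sentences, "bla subject reader", 2))
```

With the two sample sentences this prints True and then False, in line with the two searches above.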

flm · August 10, 2020, 11:08am

Thanks Abdon, that's great. I overlooked this "nested type" feature.

