Return only the highest-scoring document from a family of documents


(Jon Hourany) #1

Let's say I have documents mapped such that:

PUT test_documents
{
  "mappings": {
    "doc": {
      "properties": {
        "parent_id": { "type": "keyword" },
        "body": { "type": "text" }
      }
    }
  }
}

Here, body is some body of text and parent_id is the id of the parent document that the text came from:

PUT test_documents/doc/1
{
  "parent_id": "ZOO BOOK",
  "body": "Zoo's are places where you can see animals"
}

PUT test_documents/doc/2
{
  "parent_id": "ZOO BOOK",
  "body": "Zoo's have lots of animals"
}

PUT test_documents/doc/3
{
  "parent_id": "VET BOOK",
  "body": "Vet's are doctors for animals"
}

When I search this text for both "zoo's" and "animals", I get all three documents back, as expected:

GET test_documents/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "body": "zoo's animals"
          }
        }
      ]
    }
  }
}

But what I'd like is for the response to contain only the highest-scoring document from each group of documents that share a parent_id. In this case, the response would have only 2 documents: the highest-scoring one from "ZOO BOOK" and the highest-scoring one from "VET BOOK", in order of relevance. So if the full order of relevance was "ZOO BOOK", "VET BOOK", "ZOO BOOK", the distinct list would just be "ZOO BOOK", "VET BOOK".

I tried an aggregation on the parent_id field, but that didn't really do what I wanted.


(Abdon Pijpelink) #2

Take a look at the field collapsing feature. It allows you to return only the highest-scoring document for each unique value of a specific field.

To get what you want, your request would look something like this:

GET test_documents/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "body": "zoo's animals"
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "parent_id"
  }
}
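
To make the expected result concrete, here is a minimal Python sketch of what collapsing on parent_id amounts to: keep the best-scoring hit per parent_id, then order the survivors by score. This only simulates the behaviour client-side and is not how Elasticsearch implements it. The function name collapse_by_field and the scores are made up for illustration; real scores depend on your index.

```python
from typing import Dict, List


def collapse_by_field(hits: List[dict], field: str) -> List[dict]:
    """Keep only the highest-scoring hit for each distinct value of `field`.

    Hypothetical helper that mimics the effect of field collapsing on an
    already-retrieved list of hits.
    """
    best: Dict[str, dict] = {}
    for hit in hits:
        key = hit["_source"][field]
        # Replace the stored hit only if this one scores strictly higher.
        if key not in best or hit["_score"] > best[key]["_score"]:
            best[key] = hit
    # Return one hit per group, most relevant first.
    return sorted(best.values(), key=lambda h: h["_score"], reverse=True)


# Illustrative hits in relevance order; the scores are invented.
hits = [
    {"_id": "2", "_score": 0.9, "_source": {"parent_id": "ZOO BOOK"}},
    {"_id": "3", "_score": 0.5, "_source": {"parent_id": "VET BOOK"}},
    {"_id": "1", "_score": 0.4, "_source": {"parent_id": "ZOO BOOK"}},
]

collapsed = collapse_by_field(hits, "parent_id")
print([h["_source"]["parent_id"] for h in collapsed])
```

With these made-up scores, the "ZOO BOOK", "VET BOOK", "ZOO BOOK" ranking reduces to one hit for "ZOO BOOK" followed by one for "VET BOOK", which is the distinct list described in the question.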
