Aggregate on fields and get distinct documents

Imran_Azad · March 9, 2016, 2:07pm

So I have three documents:

Document 1:

Title: Paracetamol

Subject: SideEffects

Body: "some content"

Document 2:

Title: Paracetamol

Type: SideEffects

Body: "some different content"

Document 3:

Title: Paracetamol

Type: IndicationsAndDose

Body: "some different content"

Is it possible to aggregate the results by Title and Type where Subject is equal to SideEffects, so that the hit count is two? The following documents would then be returned:

Document 1:

Title: Paracetamol

Subject: SideEffects

Body: "some content"

Document 3:

Title: Paracetamol

Type: IndicationsAndDose

Body: "some different content"

I'm not sure how I would control which document gets dropped between Document 1 and Document 2, though; I guess the less relevant one should be dropped.

cbuescher · March 9, 2016, 2:48pm

Hi,

you can combine a Terms Aggregation and a Top Hits Aggregation. The former lets you group on a field (and can be nested to group on several fields); the latter lets you retrieve the most relevant document in each resulting bucket.

Assuming you have documents like the ones you describe, this would roughly look like the following:

GET /index/doc/_search
{
  "aggs": {
    "agg1": {
      "terms": {
        "field": "title"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "type"
          },
          "aggs": {
            "top_docs": {
              "top_hits": {
                "sort": [
                  {
                    "subject": {
                      "order": "asc"
                    }
                  }
                ],
                "_source": {
                  "include": [
                    "title", "type", "subject"
                  ]
                },
                "size": 1
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

And a result would look something like:

  "aggregations": {
    "agg1": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "a",
          "doc_count": 3,
          "agg2": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "b",
                "doc_count": 2,
                "top_docs": {
                  "hits": {
                    "total": 2,
                    "max_score": null,
                    "hits": [
                      {
                        "_index": "index",
                        "_type": "doc",
                        "_id": "AVNby0Besrt2YTZcBcvs",
                        "_score": null,
                        "_source": {
                          "subject": "one",
                          "title": "a",
                          "type": "b"
                        },
                        "sort": [
                          "one"
                        ]
                      }
                    ]
                  }
                }
              },
              {
                "key": "c",
                "doc_count": 1,
                "top_docs": {
                  "hits": {
                    "total": 1,
                    "max_score": null,
                    "hits": [
                      {
                        "_index": "index",
                        "_type": "doc",
                        "_id": "AVNby96Usrt2YTZcBcvu",
                        "_score": null,
                        "_source": {
                          "subject": "three",
                          "title": "a",
                          "type": "c"
                        },
                        "sort": [
                          "three"
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
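
To turn a response like this into a flat list of distinct documents, a client can walk the two bucket levels and keep the first hit of each inner bucket. Here is a minimal Python sketch, assuming the aggregation names agg1, agg2 and top_docs from the request above. The helper name distinct_hits is made up for illustration, and sample_response is just a trimmed copy of the response shown above, not output from a real cluster.

```python
def distinct_hits(response, outer="agg1", inner="agg2", top="top_docs"):
    """Return the _source of the top hit of every (outer, inner) bucket pair."""
    docs = []
    for outer_bucket in response["aggregations"][outer]["buckets"]:
        for inner_bucket in outer_bucket[inner]["buckets"]:
            hits = inner_bucket[top]["hits"]["hits"]
            if hits:  # a bucket could in principle carry no hits
                docs.append(hits[0]["_source"])
    return docs


# Trimmed copy of the response above (hypothetical test data).
sample_response = {
    "aggregations": {
        "agg1": {
            "buckets": [
                {
                    "key": "a",
                    "doc_count": 3,
                    "agg2": {
                        "buckets": [
                            {
                                "key": "b",
                                "doc_count": 2,
                                "top_docs": {"hits": {"total": 2, "hits": [
                                    {"_id": "AVNby0Besrt2YTZcBcvs",
                                     "_source": {"subject": "one", "title": "a", "type": "b"}}
                                ]}},
                            },
                            {
                                "key": "c",
                                "doc_count": 1,
                                "top_docs": {"hits": {"total": 1, "hits": [
                                    {"_id": "AVNby96Usrt2YTZcBcvu",
                                     "_source": {"subject": "three", "title": "a", "type": "c"}}
                                ]}},
                            },
                        ]
                    },
                }
            ]
        }
    }
}

distinct = distinct_hits(sample_response)
print(len(distinct))  # prints 2: one document per (title, type) pair
for doc in distinct:
    print(doc["title"], doc["type"], doc["subject"])
```

The three matching documents collapse to two, one per title/type combination, which is the hit count asked for in the question.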

Note that in this case we only get the topmost document per bucket, but you could also get more by raising the size of the top_hits aggregation. The order is determined by the main query (omitted here) or, as in this case, by an ascending sort on the subject field, but there are many ways to do this.
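
As a rough illustration of those two ordering options, the request body can be assembled in client code so that the sort and the main query are both optional. This is only a sketch: build_distinct_query is a hypothetical helper, the match query on a body field is an assumption based on the documents in the question, and the "include" key simply mirrors the request above.

```python
import json


def build_distinct_query(group_fields, source_fields, sort=None, query=None, top_size=1):
    """Nest one terms aggregation per grouping field around a top_hits aggregation."""
    top_hits = {"_source": {"include": list(source_fields)}, "size": top_size}
    if sort is not None:
        # Explicit sort decides which document wins in each bucket.
        top_hits["sort"] = sort
    aggs = {"top_docs": {"top_hits": top_hits}}
    # Wrap from the innermost grouping field outwards: agg2 around top_docs, agg1 around agg2.
    for depth in range(len(group_fields), 0, -1):
        aggs = {
            "agg%d" % depth: {
                "terms": {"field": group_fields[depth - 1]},
                "aggs": aggs,
            }
        }
    body = {"size": 0, "aggs": aggs}
    if query is not None:
        # Without a sort, the ordering inside top_hits follows this query's score.
        body["query"] = query
    return body


# Same shape as the request above: winner chosen by ascending subject.
by_subject = build_distinct_query(
    ["title", "type"],
    ["title", "type", "subject"],
    sort=[{"subject": {"order": "asc"}}],
)

# Alternative: no sort, so the most relevant document for the main query wins.
by_score = build_distinct_query(
    ["title", "type"],
    ["title", "type", "subject"],
    query={"match": {"body": "some content"}},  # assumed field name and text
)

print(json.dumps(by_subject, indent=2))
```

Either dictionary can then be sent as the JSON body of the _search request with whatever HTTP or Elasticsearch client is already in use.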

Hope this helps.

Imran_Azad · March 9, 2016, 5:39pm

This is brilliant, many thanks!