Top hits/bucket agg vs field collapse [5.3]

Thomas_Millward_Wrig · January 25, 2018, 11:54am

I'm trying to understand the difference between these types of query, and when to use each.

I have documents that have a "unique_hash" value, and I want to return the latest document for each unique hash value within a given query. The script that will consume this data will need to iterate over all matching documents. To keep response times reasonable, I have explored paging with size/from for the first option below and with partitions for the second.

As far as I can tell, I can achieve this with either:

{
  "collapse": {
    "field": "unique_hash.keyword",
    "inner_hits": {
      "name": "latest",
      "size": 1,
      "sort": [
        {
          "created_at": "desc"
        }
      ]
    }
  },
  "size": 100,
  "from": 100
}

or:

{
  "aggs": {
    "unique_hashes": {
      "terms": {
        "field": "unique_hash.keyword",
        "size": 100000,
        "include": {
          "partition": 1,
          "num_partitions": 100
        }
      },
      "aggs": {
        "vulnerabilities": {
          "top_hits": {
            "sort": [
              {
                "created_at": "desc"
              }
            ],
            "size": 1
          }
        }
      }
    }
  }
}

In terms of response time, the difference doesn't seem enormous, so I'm wondering which is the most appropriate for my use case?

jimczi · February 2, 2018, 9:32am

Field collapsing is only applied to the top documents. This means that aggregations that run alongside a collapse query will see all results (without collapsing).
It was added to speed up field collapsing queries that don't need to apply the results of the collapsing to aggregations. Previously, field collapsing was only possible with aggregations (1), so we added a specialized implementation for queries.
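
To make the point about aggregations concrete, here is a minimal sketch in Python that only builds a request body and does not talk to Elasticsearch. The field names come from the question; the helper name `collapse_with_agg_body`, the aggregation name `docs_per_hash` and the top-level sort are assumptions made for illustration. In a request shaped like this, the hits would be collapsed to one per `unique_hash`, while the terms aggregation would still count every matching document.

```python
import json


def collapse_with_agg_body(size=100):
    """Build a search body that collapses hits and also runs an aggregation.

    Hypothetical helper: it only assembles a dict, nothing is sent anywhere.
    """
    return {
        # Assumption: newest first, so the newest document represents each group.
        "sort": [{"created_at": "desc"}],
        # Collapsing affects the returned top hits only.
        "collapse": {"field": "unique_hash.keyword"},
        "size": size,
        "aggs": {
            # Computed over all matching documents, not over the collapsed hits,
            # so a bucket's doc_count can be greater than 1.
            "docs_per_hash": {
                "terms": {"field": "unique_hash.keyword", "size": size}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(collapse_with_agg_body(), indent=2))
```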

For the terms aggregation, why are you setting an include with a partition? This will only cover about 1/100 of your terms, picked arbitrarily. If you want the same response as the collapse query, you need to remove the partitioning and set the terms size to 100 (or 200 if you want to retrieve results starting from offset 100, since this aggregation doesn't support paging). See the link below for a full example of field collapsing using the terms aggregation:
https://www.elastic.co/guide/en/elasticsearch/guide/current/top-hits.html
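
As a rough sketch of the suggestion above, the Python below builds the aggregation body without partitioning and emulates paging on the client side. The helper names `latest_per_hash_agg_body` and `page_of_buckets`, the sub-aggregation name `latest`, and the top-level `"size": 0` are assumptions for illustration; the field names and the sort come from the question. Note that buckets come back in the terms aggregation's own order (by document count by default), so the order may differ from the collapse query.

```python
def latest_per_hash_agg_body(from_=100, size=100):
    """Build a terms + top_hits body that covers buckets up to from_ + size.

    Hypothetical helper: the terms aggregation has no paging, so the
    request asks for every bucket up to the end of the wanted page.
    """
    return {
        # Assumption: only the buckets are needed, not the regular hits.
        "size": 0,
        "aggs": {
            "unique_hashes": {
                "terms": {
                    "field": "unique_hash.keyword",
                    # No "include" partition here; 100 + 100 = 200 by default.
                    "size": from_ + size,
                },
                "aggs": {
                    "latest": {
                        # Newest document in each bucket.
                        "top_hits": {
                            "sort": [{"created_at": "desc"}],
                            "size": 1,
                        }
                    }
                },
            }
        },
    }


def page_of_buckets(buckets, from_=100, size=100):
    """Keep only the buckets of the wanted page (client-side paging)."""
    return buckets[from_:from_ + size]
```

The consuming script would pass the `buckets` list from the response to `page_of_buckets` and discard the rest, which is part of why this approach gets more expensive as the offset grows.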

These two options have different use cases and limitations.
The terms aggregation is very powerful: it can be nested under another bucket aggregation, it supports sorting buckets by the result of another aggregation, and so on, so it can do much more than simple field collapsing.
The collapse query option is a simple way to achieve first-level field collapsing without the cost of running a full aggregation on it. It is usually preferred for search use cases that prioritize speed and don't require analytics on the collapsed results.
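
For the "iterate over everything" part of the question, a consuming script could generate one collapse request per page, roughly as sketched below in Python. This is only an illustration: the generator name `collapse_page_bodies` and its `max_window` parameter are hypothetical, the top-level sort is an assumption about what "latest" should mean, and the 10,000 default mirrors what I believe is the default `index.max_result_window` limit on `from` + `size`.

```python
def collapse_page_bodies(page_size=100, max_window=10000):
    """Yield one collapse search body per page of results.

    Hypothetical helper: stops before from + size would exceed max_window.
    """
    offset = 0
    while offset + page_size <= max_window:
        yield {
            # Assumption: newest first, so each collapsed hit is the latest
            # document for its unique_hash.
            "sort": [{"created_at": "desc"}],
            "collapse": {"field": "unique_hash.keyword"},
            "from": offset,
            "size": page_size,
        }
        offset += page_size
```

The script would send each body in turn and stop as soon as a response comes back with no hits.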

