Top hits/bucket agg vs field collapse [5.3]


(Thomas Millward Wright) #1

I'm trying to understand the difference between these types of query, and when to use each.

I have documents that have a "unique_hash" value, and I want to return the latest document for each unique hash value within a given query. The script that will consume this data needs to iterate over all matching documents. To keep response times reasonable, I have explored size/from (with collapse) and partitions (with the terms aggregation).

As far as I can tell, I can achieve this with either:

{
  "collapse": {
    "field": "unique_hash.keyword",
    "inner_hits": {
      "name": "latest",
      "size": 1,
      "sort": [
        {
          "created_at": "desc"
        }
      ]
    }
  },
  "size": 100,
  "from": 100
}

or:

{
  "aggs": {
    "unique_hashes": {
      "terms": {
        "field": "unique_hash.keyword",
        "size": 100000, 
        "include": {
          "partition": 1,
          "num_partitions": 100
        }
      },
      "aggs": {
        "vulnerabilities": {
          "top_hits": {
            "sort": [
              {
                "created_at": "desc"
              }
            ],
            "size": 1
          }
        }
      }
    }
  }
}

In terms of response time, the difference doesn't seem enormous, so I'm wondering which is more appropriate for my use case.


(Jimferenczi) #2

Field collapsing is applied only to the top documents, which means that aggregations that run on a collapse query still see all results (without collapsing).
It was added to speed up field collapsing queries that don't need the results of the collapsing applied to aggregations. Previously, field collapsing was only possible with aggregations (1), so we added a specialized implementation for queries.
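
To make the first point concrete, here is a minimal sketch in Python of a request body that combines collapse with an aggregation. The helper name and the aggregation name "docs_per_hash" are made up for illustration; only the field name comes from the question. With a body like this, the hits should come back collapsed to one per unique_hash, while each bucket's doc_count still counts every matching document.

```python
def build_collapse_with_agg_body(field="unique_hash.keyword", page_size=100):
    """Build a search body that collapses hits and also runs a terms aggregation.

    The helper and the aggregation name "docs_per_hash" are hypothetical,
    chosen only for this illustration.
    """
    return {
        # Collapsing applies to the returned hits only.
        "collapse": {"field": field},
        "size": page_size,
        "aggs": {
            # This aggregation sees all matching documents, not just the
            # collapsed ones, so doc_count can be greater than 1 per hash.
            "docs_per_hash": {"terms": {"field": field, "size": 10}},
        },
    }


body = build_collapse_with_agg_body()
```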

For the terms aggregation, why are you setting an include with a partition? This restricts the aggregation to a single partition, an arbitrary 1/100 of your terms. If you want the same response as the collapse query, you need to remove the partitioning and set the terms size to 100 (or 200 if you want to retrieve results starting from offset 100, since this aggregation doesn't support paging). See the link below for a full example of field collapsing using the terms aggregation:
https://www.elastic.co/guide/en/elasticsearch/guide/current/top-hits.html
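
As a rough sketch of the advice above (not a definitive implementation), the Python below builds a terms + top_hits body without partitioning and emulates from/size by requesting offset + page_size buckets and dropping the first ones on the client. The function names, the "latest" sub-aggregation name, and the top-level "size": 0 are my own assumptions; the field names come from the question. Note that terms buckets are ordered by document count by default, so the ordering may differ from the collapse query.

```python
def build_terms_top_hits_body(offset, page_size,
                              field="unique_hash.keyword",
                              sort_field="created_at"):
    """Build a terms + top_hits body with no partitioning (hypothetical helper)."""
    return {
        # Assumption: only the buckets are needed, so skip the top-level hits.
        "size": 0,
        "aggs": {
            "unique_hashes": {
                "terms": {
                    "field": field,
                    # terms has no "from", so request offset + page_size buckets.
                    "size": offset + page_size,
                },
                "aggs": {
                    "latest": {
                        "top_hits": {
                            # Newest document first, one per bucket.
                            "sort": [{sort_field: "desc"}],
                            "size": 1,
                        }
                    }
                },
            }
        },
    }


def page_of_buckets(response, offset):
    """Drop the first `offset` buckets to emulate paging on the client side."""
    return response["aggregations"]["unique_hashes"]["buckets"][offset:]


# Equivalent of "size": 100, "from": 100 in the collapse query.
body = build_terms_top_hits_body(offset=100, page_size=100)
```

The cost of this approach grows with the offset, since every request recomputes all earlier buckets as well.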

These two options have different use cases and limitations.
The terms aggregation is very powerful: it can be nested under another bucket aggregation, it supports sorting buckets by the result of another aggregation, and so on, so it can do much more than simple field collapsing.
The collapse query option is a simple way to achieve first-level field collapsing without the cost of running a full aggregation. It is usually preferred for search use cases that need speed and don't require analytics on the collapsed results.

