Sorting aggregation buckets on string field


#1

I'm trying to group documents on a field and then sort them on another string field. My aim is to only get one document per group and sort them.

For example, let's say the mapping looks like (simplified):

{
..
  "group" : {"type" : "string"},
  "title" : {"type" : "string"}
}

I want to group documents on group, then sort on title. For the grouping I am using a terms aggregation with field="group". For the sorting I am ordering the buckets by a sub-aggregation. For numeric values this works fine with min and max aggregations, but I can't find anything similar for string ordering. Any ideas?

This is the code for numeric sorting:

AbstractAggregationBuilder bucketsAgg = terms("group_agg_name")
        .field("group")
        .order(Terms.Order.aggregation("order_agg_name", false))
        .subAggregation(
                max("order_agg_name").field("numeric")
        )
        .subAggregation(
                topHits ...

        );

(Colin Goodheart-Smithe) #2

Sorting on non-numeric metric aggregations is not currently supported.

However, if I understand your request correctly, you are trying to sort the group buckets based on a property of an individual document (title). This would not work even if string sorting were supported, since multiple documents can fall into a bucket and ES would not know which document's title to use for sorting (or how to combine the documents' titles to produce a sorting value).
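
To make the ambiguity concrete, here is a minimal client-side sketch in plain Java. It is not an Elasticsearch API: the `Doc` class, the `groupsOrderedBy` method, and the sample titles are hypothetical names and data made up for this illustration. It shows that the bucket order depends entirely on which rule is used to reduce a bucket's titles to a single sort key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

final class BucketSortSketch {

    /** Hypothetical minimal document with the two fields from the mapping. */
    static final class Doc {
        final String group;
        final String title;

        Doc(String group, String title) {
            this.group = group;
            this.title = title;
        }
    }

    /** Reduction rules: keep the smallest or the largest title of a bucket. */
    static final BinaryOperator<String> MIN = BinaryOperator.minBy(Comparator.naturalOrder());
    static final BinaryOperator<String> MAX = BinaryOperator.maxBy(Comparator.naturalOrder());

    /**
     * Reduces every group's titles to one sort key using the given rule,
     * then returns the group keys ordered by that sort key (ascending).
     */
    static List<String> groupsOrderedBy(List<Doc> docs, BinaryOperator<String> pick) {
        Map<String, String> sortKey = new TreeMap<>();
        for (Doc d : docs) {
            // merge() applies the rule when the group already has a title
            sortKey.merge(d.group, d.title, pick);
        }
        List<String> groups = new ArrayList<>(sortKey.keySet());
        groups.sort(Comparator.comparing((String g) -> sortKey.get(g)));
        return groups;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("g1", "Thriller (Remix)"),
                new Doc("g1", "Beat It"),
                new Doc("g2", "Billie Jean"));

        System.out.println(groupsOrderedBy(docs, MIN)); // [g1, g2]
        System.out.println(groupsOrderedBy(docs, MAX)); // [g2, g1]
    }
}
```

The same three documents produce opposite bucket orders depending on the rule, which is why a sort on a per-document string needs an explicit choice of reduction before it is well defined.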


#3

Ok, thanks. Can you think of any other way of removing duplicates (duplicate based on some property in the document) other than aggregations?

We are having a performance problem: a query goes from 250ms to 9000ms when we use group aggregations to remove duplicate entries from the search results. I am looking into filters or post filters as a possible alternative.


(Colin Goodheart-Smithe) #4

Could you explain a bit more about your use case and what you are trying to achieve? Some example documents and queries, as a cURL recreation in a gist or similar, would also help.


#5

Yes, one case we have is searching for music tracks, where you search for a track title. A track may exist in several versions, which should not be shown multiple times. To filter out the duplicates we do a group aggregation on a grouping string for the track. The number of tracks is about 30 million.

I could give you some documents etc., but I see now that the problem seems to be using a string for grouping. When I use an integer type, for example, I get a response quickly. Should I avoid string fields for terms aggregations? That seems really strange. I have to test more.


#6

Or rather, the number of unique values for the field I am grouping on seems to decide how long the aggregation takes. I thought my aggregation was done only on the search result, and in that case I don't think that this should have such a big effect, but maybe I am not aggregating on all data or something.

For example, this request should only aggregate on the search result, right?

{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "title": {
        "query": "Thriller",
        "type": "boolean"
      }
    }
  },
  "aggregations": {
    "group_agg": {
      "terms": {
        "field": "group",
        "size": 10,
        "order": {
          "order_agg": "desc"
        }
      },
      "aggregations": {
        "order_agg": {
          "max": {
            "script": "_score"
          }
        },
        "group_agg_top_hit": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

#7

Update on this: I read https://github.com/elastic/elasticsearch/issues/5498 and added executionHint("map") to the terms aggregation, and that seems to have solved the problem.
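
For anyone using the REST body from post #6 instead of the Java builder: the same setting is the `execution_hint` parameter inside the `terms` object. Below is a small sketch in plain Java that only assembles that JSON fragment as a string. The class and method names are hypothetical, and it does not call Elasticsearch. Whether "map" helps is workload dependent; as far as I know it is mainly worth trying when the query matches few documents compared with the number of unique values in the field.

```java
final class ExecutionHintSketch {

    /**
     * Builds the "terms" part of an aggregation as a JSON string,
     * with the execution hint placed next to "field" and "size".
     */
    static String termsAgg(String field, int size, String executionHint) {
        return "{ \"terms\": { "
                + "\"field\": \"" + field + "\", "
                + "\"size\": " + size + ", "
                + "\"execution_hint\": \"" + executionHint + "\""
                + " } }";
    }

    public static void main(String[] args) {
        // Fragment to use as the body of "group_agg" in the request above
        System.out.println(termsAgg("group", 10, "map"));
    }
}
```

The sub-aggregations and the "order" clause from the original request stay as they were; only the extra key is added.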

