I want to group documents by the group field and then sort the groups by title. For this I am using a terms aggregation with field="group". For sorting, I order the buckets by a sub-aggregation. This works fine for numeric values using min and max aggregations, but I can't find anything similar for string ordering. Any ideas?
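For reference, the numeric case described above can be sketched roughly as follows. This is only an illustration of a terms aggregation ordered by a min sub-aggregation, built as a plain Python dict. The field name "duration" and the aggregation names "groups" and "min_value" are assumptions made for the example; only the "group" field comes from the question.

```python
import json


def terms_ordered_by_min(group_field, metric_field, size=10):
    """Build a request body: terms buckets ordered by a min sub-aggregation."""
    return {
        "size": 0,  # return no hits, only the aggregation result
        "aggs": {
            "groups": {  # hypothetical aggregation name
                "terms": {
                    "field": group_field,
                    "size": size,
                    # order buckets by the value of the sub-aggregation below
                    "order": {"min_value": "asc"},
                },
                "aggs": {
                    # hypothetical name; must match the key used in "order"
                    "min_value": {"min": {"field": metric_field}}
                },
            }
        },
    }


# "duration" is an assumed numeric field, used only for illustration
body = terms_ordered_by_min("group", "duration")
print(json.dumps(body, indent=2))
```

The key used in "order" has to match the name of the sub-aggregation, which is why both are set in one helper here.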
Sorting on non-numeric metric aggregations is not currently supported.
However, if I understand your request correctly, you are trying to sort the group buckets based on a property of an individual document (title). This would not work even if string sorting were supported, since multiple documents can fall into one bucket and ES would not know which document's title to use for sorting (or how to combine the documents' titles to produce a sorting value).
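To make the ambiguity concrete, here is a small stand-alone sketch in plain Python (no Elasticsearch involved). The three sample documents are made up for illustration. Depending on whether a bucket is represented by its smallest or its largest title, the bucket order comes out differently, so some explicit rule for combining the titles would be needed.

```python
from collections import defaultdict

# Made-up sample documents; two of them share the same group
docs = [
    {"group": "g1", "title": "Zebra"},
    {"group": "g1", "title": "Apple"},
    {"group": "g2", "title": "Mango"},
]


def titles_per_bucket(documents):
    """Collect all titles that fall into each group bucket."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[doc["group"]].append(doc["title"])
    return dict(buckets)


buckets = titles_per_bucket(docs)

# Two equally plausible ways to pick "the" title of a bucket
by_min = sorted(buckets, key=lambda g: min(buckets[g]))
by_max = sorted(buckets, key=lambda g: max(buckets[g]))

print(by_min)  # ['g1', 'g2']  (g1 represented by "Apple")
print(by_max)  # ['g2', 'g1']  (g1 represented by "Zebra")
```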
Ok, thanks. Can you think of any other way of removing duplicates (duplicate based on some property in the document) other than aggregations?
We are having performance problems. One query goes from 250 ms to 9000 ms when we use a group aggregation to remove duplicate entries from the search results. I am looking into filters or post filters as a possible alternative.
Could you explain a bit more about your use case and what you are trying to achieve? Some example documents and queries, as a cURL recreation in a gist or similar, would also help.
Yes, one case we have is searching for music tracks, where you search for a track title. A track may exist in different versions, and these should not be shown multiple times. To filter out the duplicates we do a group aggregation on a grouping string for the track. The number of tracks is about 30 million.
I could give you some documents etc., but I see now that the problem seems to be using a string for grouping. When I use an integer type, for example, I get a response quickly. Should I avoid using string fields in terms aggregations? That seems really strange. I have to test more.
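As a point of comparison only, the de-duplication described here can be sketched on the client side in plain Python: keep the first hit for each grouping key, in the order the hits were returned. This is not what the aggregation does internally, and it only de-duplicates the hits that were actually fetched. The field name "track_group" and the sample hits are assumptions for the example.

```python
def dedupe_by_key(hits, key_field):
    """Keep only the first hit for each distinct value of key_field."""
    seen = set()
    unique = []
    for hit in hits:
        key = hit.get(key_field)
        if key in seen:
            continue  # a version of this track was already kept
        seen.add(key)
        unique.append(hit)
    return unique


# Made-up hits, assumed to be in relevance order already
hits = [
    {"title": "Song A (Album Version)", "track_group": "song-a"},
    {"title": "Song A (Radio Edit)", "track_group": "song-a"},
    {"title": "Song B", "track_group": "song-b"},
]

unique_hits = dedupe_by_key(hits, "track_group")
for hit in unique_hits:
    print(hit["title"])
```

With the sample data, one version of "Song A" and "Song B" remain.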
Or rather, the number of unique values in the field I am grouping on seems to decide how long the aggregation takes. I thought my aggregation was done only on the search result, and in that case I don't think this should have such a big effect, but maybe I am not aggregating on all data or something.
For example, this request should only aggregate on the search result, right?