I need to retrieve all records grouped by the distinct values of a particular column, so that all records sharing the same value appear together in one cluster. The index could hold billions of records.
Should I be using scroll with aggregation? I read somewhere that aggregation is not the best solution for this.
Another approach would be to scroll over the records and sort on that column. For this approach, I want to know whether the sort is applied only to the 10,000 records being returned, or whether all matching records are sorted first and then returned 10,000 at a time.
You can run a filtered query on that column and then scroll through the results, but this can be slow and heavy, depending on the number of "selected documents".
Note that sorting is applied across all matching records, not just the page being returned.
Extra question: do you need to aggregate the selected documents or not?
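Because the sort is global, hits come back from the scroll already ordered, so records sharing a value are contiguous even when a cluster straddles two pages. Here is a minimal sketch of how that could be used on the client side. The field name `group_key` and both helper functions are hypothetical, and the pages are plain lists standing in for scroll responses:

```python
from itertools import chain, groupby


def build_sorted_scroll_body(field, page_size=10000):
    """Build a search body sorted on `field` (hypothetical helper)."""
    return {
        "size": page_size,
        "query": {"match_all": {}},
        "sort": [{field: "asc"}],
    }


def cluster_sorted_pages(pages, field):
    """Yield (value, docs) clusters from pages of hits sorted on `field`.

    `pages` is any iterable of hit lists, in the order the scroll returned
    them. Chaining the pages first lets a cluster span a page boundary.
    """
    hits = chain.from_iterable(pages)
    for value, group in groupby(hits, key=lambda hit: hit["_source"][field]):
        yield value, [hit["_source"] for hit in group]


# Two fake pages; the "b" cluster is split across them.
pages = [
    [{"_source": {"group_key": "a", "id": 1}},
     {"_source": {"group_key": "b", "id": 2}}],
    [{"_source": {"group_key": "b", "id": 3}},
     {"_source": {"group_key": "c", "id": 4}}],
]
clusters = list(cluster_sorted_pages(pages, "group_key"))
```

Only one cluster is held in memory at a time, which matters if the index really holds billions of records.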
Recently I "solved" a search dilemma with the following trick. We have complex searches with various parameters, and they may or may not return a lot of records. My trick is to first run the query with size = 0 to get the totalHits, and then either run a query with complex aggregations or scroll and do the aggregations in our own code.
The threshold is fixed at 30,000 records. Above it I let ES do the aggregations; below it, doing them ourselves is much quicker. Running queries with size=0 is super fast (< 100 ms), but running the same query with complex aggregations can take over 6 to 7 s!
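The decision logic could look roughly like the sketch below. All function names are hypothetical, the responses are plain dicts rather than real client calls, and treating exactly 30,000 hits as "client side" is my assumption. `track_total_hits` is included because recent ES versions otherwise cap the reported total at 10,000:

```python
from collections import Counter

CLIENT_SIDE_LIMIT = 30000  # threshold from the post above


def count_body(query):
    """Body for the cheap first pass: no hits returned, only the total."""
    return {"size": 0, "track_total_hits": True, "query": query}


def total_hits(response):
    """Read the hit count; newer ES returns a dict, older versions an int."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def choose_strategy(response, limit=CLIENT_SIDE_LIMIT):
    """Pick where to aggregate, based on the size=0 response."""
    if total_hits(response) > limit:
        return "server_aggregation"
    return "client_scroll"


def aggregate_locally(docs, field):
    """Simple client-side stand-in for a terms aggregation on `field`."""
    return Counter(doc[field] for doc in docs)


small = {"hits": {"total": {"value": 500, "relation": "eq"}}}
large = {"hits": {"total": 45000}}
strategies = (choose_strategy(small), choose_strategy(large))
```

The 30,000 figure comes from our own data and queries, so measure on yours before reusing it.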