How is partitioning in aggregations achieved? How is the data selected for each partition, and what guarantees can be made about the ordering of results within each partition and across partitions?
How can partitioning be used to paginate results if I want to guarantee some ordering of the aggregation results?
The value of each term, e.g. the ipaddress 146.204.187.221, is hashed and we then take the modulo with the number of partitions you picked, e.g. 20, so a term's partition is hash(term) % num_partitions (if the hash were, say, 1234567, then 1234567 % 20 = 7 and the term lands in partition 7). This means all terms will be spread roughly evenly across the 20 partitions. We use the same technique to route documents evenly to shards based on their doc IDs.
For each aggregation request you make, you pick one of your partitions, e.g. partition 7 of 20. This means each shard will be performing its analysis on the same subset of terms, as in the sketch below.
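As a minimal sketch, a request for partition 7 of 20 could look like this (the index name `logs` and the exact field name `ipaddress` are assumptions for illustration):

```json
GET logs/_search
{
  "size": 0,
  "aggs": {
    "ips": {
      "terms": {
        "field": "ipaddress",
        "include": {
          "partition": 7,
          "num_partitions": 20
        },
        "size": 100
      }
    }
  }
}
```

Issuing the same request twenty times, varying only `partition` from 0 to 19, walks the full set of terms.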
The terms within one request (e.g. partition 1 of 20) will be ordered by whatever criteria you pick.
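For example, within one partition you could order user IDs by their last access date via a metric sub-aggregation. This is a sketch only; the `userId` and `lastLoggedAccess` field names are assumptions:

```json
GET logs/_search
{
  "size": 0,
  "aggs": {
    "users": {
      "terms": {
        "field": "userId",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 100,
        "order": {
          "last_access": "asc"
        }
      },
      "aggs": {
        "last_access": {
          "max": {
            "field": "lastLoggedAccess"
          }
        }
      }
    }
  }
}
```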
There is no guarantee of order across partitions, because partitions are by design a randomised (hash-based) division of the terms.
You can't. "Some ordering" could be expanded to include something as tricky as userIds sorted by their last logged access date. Given very large numbers of user IDs and a distributed index, it is impossible to compute this total ordering without resorting to map/reduce-style streaming of masses of data across the network, creating temporary data files, etc.
Using term partitioning, however, you can reduce the amount of data streamed, but you have to live with in-partition ordering only. This is adequate for a scenario such as finding userIDs that have expired, but it will not solve all problems. Another option for complex scenarios like "last access dates" is to opt for an entity-centric index based around the user, as sketched below.
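A rough sketch of the entity-centric approach (the `users` index and field names are assumptions): keep one document per user, update its last-access date as events arrive, and a globally ordered, paginated view then becomes an ordinary sorted search:

```json
PUT users/_doc/u42
{
  "userId": "u42",
  "lastLoggedAccess": "2017-03-01T10:15:00Z"
}

GET users/_search
{
  "size": 100,
  "sort": [
    { "lastLoggedAccess": "asc" },
    { "userId": "asc" }
  ]
}
```

Deeper pages can then be fetched with `search_after` on the last hit's sort values, giving a consistent total ordering without the cross-partition problem.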