I have a set of user access data containing the user name, access timestamp, and some other info. There are multiple records with the same user name but different access timestamps. What I want to do is find the latest access record for each user, sort the results by latest access timestamp, and paginate them.
In general what I need is:
Aggregate by user name
Sort the aggregate result by latest access timestamp
Paginate the result.
I tried the wizard and it told me to use the terms aggregation. And here is my search script
The script above manages to paginate the result through partitions, but the result is only sorted within each bucket. I didn't manage to sort the result across partitions.
I also tried the composite aggregation, but it only supports sorting the results by the grouped key (which is the username in my case). I didn't find a way to sort by timestamp with the composite aggregation.
Hi Stephen,
Another approach is to create an "entity-centric" index using the "transforms" API. This will create a new index with one document for each user summarising their activity with your choice of attributes (e.g. first access date, last access etc). These can then be efficiently sorted by date and paginated.
However there is a trade off here because these transform documents are only updated periodically according to a schedule and this means there is a lag between user activity and their summary record being up to date.
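As a rough sketch of what that could look like, here are the two request bodies built as plain Python dicts: one for creating a continuous pivot transform, one for a paged search against the index it writes to. The index names `access-logs` and `user-last-access`, the field names `username` and `access_time`, and both helper functions are assumptions for illustration, not anything taken from your setup.

```python
def build_transform_body(source_index, dest_index, user_field, time_field,
                         frequency="1m", delay="60s"):
    """Body for a continuous pivot transform: one summary document per user."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "pivot": {
            # One output document per distinct user name.
            "group_by": {"username": {"terms": {"field": user_field}}},
            "aggregations": {
                "first_access": {"min": {"field": time_field}},
                "last_access": {"max": {"field": time_field}},
            },
        },
        # How often the transform checks for new source data (the lag).
        "frequency": frequency,
        # Makes the transform continuous, keyed on the event timestamp.
        "sync": {"time": {"field": time_field, "delay": delay}},
    }


def build_page_query(page, page_size):
    """Search body for one page of the summary index, newest access first."""
    return {
        "from": page * page_size,
        "size": page_size,
        "sort": [{"last_access": {"order": "desc"}}],
    }


# Hypothetical index and field names.
transform_body = build_transform_body(
    "access-logs", "user-last-access", "username", "access_time"
)
third_page = build_page_query(page=2, page_size=25)
```

Note that `from` plus `size` is capped by `index.max_result_window` (10,000 by default), so very deep paging would need `search_after` instead.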
Can you live with a delay in reported last accesses?
Thanks for the suggestion. I believe our business and service would be confusing if the lag is more than a few minutes. Moreover, we need not only the summarized data but also other detailed info in the response. In this case, does the transforms API still solve our issue? Or is there any other solution?
The transforms API uses aggregations (including the option of scripts) to fuse data so you can store pretty much what you want in the entity documents, e.g. details from the user's last access.
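One way this might be done without a script, assuming a version in which transforms support the `top_metrics` aggregation, is to pull selected fields from the most recent event per user. The sketch below builds only the `pivot` section as a Python dict; the detail fields `client_ip` and `status_code` and the helper name are made up for illustration, and the fields have to be of types that `top_metrics` can return.

```python
def build_pivot_with_details(user_field, time_field, detail_fields):
    """Pivot section that keeps chosen fields from each user's latest event."""
    return {
        "group_by": {"username": {"terms": {"field": user_field}}},
        "aggregations": {
            "last_access": {"max": {"field": time_field}},
            "last_access_details": {
                "top_metrics": {
                    # Values are taken from the single top document per user.
                    "metrics": [{"field": name} for name in detail_fields],
                    # Newest event first, so the top document is the last access.
                    "sort": {time_field: "desc"},
                }
            },
        },
    }


# Hypothetical field names.
pivot = build_pivot_with_details(
    "username", "access_time", ["client_ip", "status_code"]
)
```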
With regards to the lag issue - if the most popular request is to show currently active users with 100% accuracy I'd be tempted to query the last five minutes of access event data directly and use a terms aggregation on user ids sorted by a sub agg on max access date. That would be close to a live view by reading the event data.
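A minimal sketch of that request body, written as a Python dict, could look like the following. The field names `username` and `access_time`, the aggregation names, and the helper function are assumptions; the `top_hits` sub-aggregation is an optional addition to return the full latest record for each user.

```python
def build_recent_users_query(user_field, time_field, window="now-5m", size=20):
    """Search body: users active within the window, latest access first."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"range": {time_field: {"gte": window}}},
        "aggs": {
            "users": {
                "terms": {
                    "field": user_field,
                    "size": size,
                    # Order the user buckets by the max sub-aggregation below.
                    "order": {"last_access": "desc"},
                },
                "aggs": {
                    "last_access": {"max": {"field": time_field}},
                    # Optional: the complete most recent event per user.
                    "latest_record": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{time_field: {"order": "desc"}}],
                        }
                    },
                },
            }
        },
    }


# Hypothetical field names.
recent_users_query = build_recent_users_query("username", "access_time")
```

The terms aggregation itself has no `from` parameter, so for a short window the simplest way to page is to request enough buckets with `size` and slice them in the client.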
The problem with maintaining a close-to-live entity-centric index of user info is that you'd have to update each user document after every access and Lucene is not engineered for such heavy updates. When you introduce a lag, many access events made by a user can be summarised in a single update.
So for recent data it may make sense to use the live event data but for looking at older time windows an entity-centric index maintained using the transforms API will work better. You'd have to do some work in your client to make the pagination seamless.
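The client-side part could be sketched roughly as below: take the per-user rows from the live query and from the entity-centric index, keep the newer row for each user, then sort and slice. The record shape (`username`, `last_access` as epoch milliseconds) and the function name are assumptions, and this version holds all candidate rows in memory, so it is only a starting point.

```python
def merge_and_paginate(live_rows, summary_rows, page, page_size):
    """Combine live and summarised rows into one page, newest access first."""
    latest_by_user = {}
    for row in list(summary_rows) + list(live_rows):
        user = row["username"]
        current = latest_by_user.get(user)
        # Keep whichever source has the more recent access for this user.
        if current is None or row["last_access"] > current["last_access"]:
            latest_by_user[user] = row
    ordered = sorted(
        latest_by_user.values(), key=lambda r: r["last_access"], reverse=True
    )
    start = page * page_size
    return ordered[start:start + page_size]


# Made-up sample data; timestamps are epoch milliseconds.
summary_rows = [
    {"username": "alice", "last_access": 1000},
    {"username": "bob", "last_access": 2000},
    {"username": "carol", "last_access": 1500},
]
live_rows = [
    {"username": "alice", "last_access": 3000},  # newer than the summary row
]
first_page = merge_and_paginate(live_rows, summary_rows, page=0, page_size=2)
second_page = merge_and_paginate(live_rows, summary_rows, page=1, page_size=2)
```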
Thanks for the explanation! Looking up old user data is one of our requirements. So it looks like we have to do some pagination on our client side.