All of the scripting examples (whether sorting, filtering or computing new values) take place on a single document. Is it possible to remove filtered results prior to aggregation using custom scripts or a plugin?
Use case: My documents are idempotent, and represent changes to a user's account. Each account update indexes a new document. I want to aggregate records on an account field, but I only want to include the user's latest update in the aggregation.
I'd have an index that just contains the latest data per user in that case. If you still need the history then maybe two indexes? It'd be a pain to make sure they stay in sync but it'd be simpler than trying to squash the documents on the fly during an aggregation.
Are you suggesting that before I index the latest change into one index, I copy the existing record to an "archived" index and delete it from the "live" index? I don't get it.
More to the point, are you suggesting this approach because there is no way to filter an aggregated collection of records to include only the latest record?
What I'm trying to do is similar to this pseudo-SQL statement:
SELECT * FROM T WHERE T.submitted = ( SELECT MAX(T2.submitted) FROM T AS T2 WHERE T2.user = T.user )
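To make the intent of the pseudo-SQL above concrete, here is a minimal in-memory sketch of "keep only each user's latest record, then aggregate". The field names `user`, `submitted` and `plan`, and the sample records, are assumptions made up for illustration; nothing here touches Elasticsearch.

```python
from collections import Counter


def latest_per_user(records):
    """Keep only the record with the highest 'submitted' value for each user."""
    latest = {}
    for rec in records:
        current = latest.get(rec["user"])
        if current is None or rec["submitted"] > current["submitted"]:
            latest[rec["user"]] = rec
    return list(latest.values())


def count_by(records, field):
    """Count records per distinct value of `field` (a simple terms-style aggregation)."""
    return Counter(rec[field] for rec in records)


# Hypothetical account updates: alice changed plan, bob did not.
updates = [
    {"user": "alice", "submitted": 1, "plan": "free"},
    {"user": "alice", "submitted": 3, "plan": "pro"},
    {"user": "bob", "submitted": 2, "plan": "free"},
]

# Only alice's latest update (plan "pro") is counted, not her older "free" one.
plan_counts = count_by(latest_per_user(updates), "plan")
```

Aggregating over all three records would count "free" twice; squashing to the latest record per user first is what changes the result.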
I think he is suggesting keeping all the raw records in one index and having a separate index where you store the latest status, using the user ID as the document key so that any update coming in for a user overwrites the previous status. That way you have fast access to the last change made for every user as well as the entire history.
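As a rough sketch of that two-index idea, the snippet below builds a bulk request body that writes each update twice: once to a history index with an auto-generated ID, and once to a "latest" index with the user ID as `_id`, so the newer document replaces the older one. The index names `account-history` and `account-latest` and the `user` field are assumptions for illustration, and the body is only built here, not sent anywhere.

```python
import json


def dual_write_bulk_body(update, history_index="account-history",
                         latest_index="account-latest"):
    """Build a newline-delimited bulk body that indexes `update` into both indexes.

    Index names are hypothetical. The history entry gets an auto-generated ID;
    the latest entry reuses the user ID, so it overwrites the previous status.
    """
    lines = [
        {"index": {"_index": history_index}},
        update,
        {"index": {"_index": latest_index, "_id": str(update["user"])}},
        update,
    ]
    # The bulk format is one JSON object per line, ending with a newline.
    return "\n".join(json.dumps(line) for line in lines) + "\n"


body = dual_write_bulk_body({"user": "alice", "submitted": 3, "plan": "pro"})
```

Note that this simple version assumes updates arrive in order; an older update arriving late would overwrite a newer one in the "latest" index unless that case is handled separately.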
However, that is not an option for me: these indices are large, and our rollover strategy would get really complicated. I do see how it works as a workaround, though.
So, no ability to create an aggregation that limits the records to only include the latest record? Something like max(_timestamp)? No way to do it with a custom plugin?
Although I am not sure how you would do it, it may be possible to do all that processing at query time. That could be slow, however, and may be difficult to scale. Using a separate index to hold the most recent state will make these queries much more efficient and will scale better. If you run this type of query rarely, doing all the work for each query may be fine, but if you need this information on a regular basis, you will most likely benefit from doing the work up front by having a separate index.
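One possible shape for the query-time approach, offered as a sketch rather than a recommendation: a `terms` aggregation per user with a `top_hits` sub-aggregation of size 1 sorted by timestamp returns each user's most recent document. The field names `user` and `submitted` and the `max_users` limit are assumptions, and any further aggregation over the returned documents would still have to happen on the client. The need for one bucket per user is the kind of cost that makes this hard to scale.

```python
def latest_per_user_query(user_field="user", time_field="submitted", max_users=1000):
    """Build a search body returning the newest document per user (hypothetical fields)."""
    return {
        "size": 0,  # no top-level hits, only aggregation results
        "aggs": {
            "by_user": {
                "terms": {"field": user_field, "size": max_users},
                "aggs": {
                    "latest": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{time_field: {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


def latest_sources(response):
    """Pull the single latest document out of each user bucket in a search response."""
    buckets = response["aggregations"]["by_user"]["buckets"]
    return [bucket["latest"]["hits"]["hits"][0]["_source"] for bucket in buckets]
```

Users beyond `max_users` are silently left out of the result, so the limit has to be chosen with the number of distinct users in mind.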