How about the ability to define 'the most relevant document' ourselves, e.g. ordering the aggregated docs by most recently updated rather than by count?
To illustrate, reusing the tags example: order the tags not by how many documents contain a given tag, but by the recency of the documents containing that tag.
Alternatively, what if we could use a nested aggregation with the ability to include fields in the lowest-resolution bucket? E.g. aggs -> terms by fieldA -> max by fieldB -> include fieldC in the output.
The problem here is that aggregations are, by definition, summarisations of collections of documents, not documents on their own. So while you can order your terms by a summary of the date field across all the documents in a bucket (say, the maximum date) by adding a metric aggregation alongside the top_hits aggregation, you cannot order your terms by a single document, because a bucket isn't about a single document; it's about a collection of them.
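To make that concrete, here is a minimal sketch of a request body that orders the tag buckets by the maximum of a date field. The field names `tags` and `last_updated` and the aggregation names `by_tag` and `latest_update` are assumptions for illustration, not taken from a real mapping. The body is built as a Python dict so it can be serialised and sent with whatever client you use.

```python
import json

def tags_by_recency_body(tag_field="tags", date_field="last_updated", size=10):
    """Build a search body that orders tag buckets by their newest document.

    Field names are hypothetical; substitute the ones from your mapping.
    """
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "by_tag": {
                "terms": {
                    "field": tag_field,
                    "size": size,
                    # order buckets by the max sub-aggregation defined below
                    "order": {"latest_update": "desc"},
                },
                "aggs": {
                    # one summary value per bucket: the most recent date
                    "latest_update": {"max": {"field": date_field}},
                },
            }
        },
    }

print(json.dumps(tags_by_recency_body(), indent=2))
```

The ordering key has to name a metric computed per bucket, which is why the `max` aggregation sits inside the `terms` aggregation it orders.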
The same goes for including fieldC in the output. The question is how to include fieldC, since the bucket contains more than one document and so potentially more than one value of fieldC. If you wanted to return the top N values of fieldC, you could add a terms aggregation alongside the max aggregation (of fieldB) in your example. If you wanted to return the number of unique values of fieldC, you could add a cardinality aggregation. If you wanted to include the value of fieldC for the top N documents in the bucket, you could use the top_hits aggregation and set it to output only fieldC for each document. But again, because these functions are performed on buckets (collections of documents) rather than on individual documents, it is not possible to attach a single document's field value to a bucket directly; the output can only contain items computed across the documents in the bucket.
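As a rough sketch of those three options side by side, the body below puts a terms, a cardinality and a top_hits aggregation next to the max aggregation inside each fieldA bucket. The aggregation names (`by_field_a`, `max_field_b` and so on) are made up for the example, and in practice you would keep only the sub-aggregation you actually need.

```python
import json

def field_c_summaries_body(top_n=3):
    """Build a body with three ways of surfacing fieldC per fieldA bucket.

    Aggregation names are hypothetical; fieldA/fieldB/fieldC follow the
    placeholder names used in the discussion.
    """
    return {
        "size": 0,
        "aggs": {
            "by_field_a": {
                "terms": {"field": "fieldA"},
                "aggs": {
                    # the summary from the question: max of fieldB
                    "max_field_b": {"max": {"field": "fieldB"}},
                    # option 1: the top N values of fieldC in the bucket
                    "top_field_c_values": {
                        "terms": {"field": "fieldC", "size": top_n}
                    },
                    # option 2: the number of unique fieldC values
                    "unique_field_c": {"cardinality": {"field": "fieldC"}},
                    # option 3: fieldC for the top N documents in the bucket
                    "top_docs": {
                        "top_hits": {"size": top_n, "_source": ["fieldC"]}
                    },
                },
            }
        },
    }

print(json.dumps(field_c_summaries_body(), indent=2))
```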
I think the case I am running into is a combination of an aggregation and a lookup.
My document structure for playback heartbeat:
{
HeartbeatId: "guid",
ProfileId: "guid",
AssetId: "guid",
LastModifiedDate: "dateTime",
ResumePoint: "timespan"
}
Requirement: find the latest resume point for each asset for a given profileId, sorted by date in descending order.
My current solution is two requests:
First request:
filter by profileId
terms agg on assetId, ordered by maxDate: desc
child max agg on LastModifiedDate, producing maxDate for ordering the parent agg
That gives me a list of unique asset IDs for the profile, sorted by max modified date in descending order.
Second request:
multi search by profileId and assetId, with size: 1 and sort on LastModifiedDate desc.
That gives me the resume point from the latest heartbeat.
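A sketch of what these two requests could look like is below. It is a reconstruction from the description above, not the actual requests: the index name `heartbeats`, the aggregation name `by_asset` and the helper function names are assumptions, and the `bool`/`filter` syntax assumes an Elasticsearch version that supports it and ID fields that are indexed as exact (not analyzed) values.

```python
import json

INDEX = "heartbeats"  # hypothetical index name

def profile_filter(profile_id, asset_id=None):
    """Exact-match filter on ProfileId and, optionally, AssetId."""
    clauses = [{"term": {"ProfileId": profile_id}}]
    if asset_id is not None:
        clauses.append({"term": {"AssetId": asset_id}})
    return {"bool": {"filter": clauses}}

def asset_ids_request(profile_id, size=100):
    """First request: asset IDs for a profile, newest activity first."""
    return {
        "size": 0,
        "query": profile_filter(profile_id),
        "aggs": {
            "by_asset": {
                "terms": {
                    "field": "AssetId",
                    "size": size,
                    "order": {"maxDate": "desc"},
                },
                "aggs": {"maxDate": {"max": {"field": "LastModifiedDate"}}},
            }
        },
    }

def latest_heartbeats_msearch(profile_id, asset_ids):
    """Second request: multi search payload, one search per asset ID.

    The payload is newline-delimited JSON: a header line followed by a
    body line for each search, ending with a trailing newline.
    """
    lines = []
    for asset_id in asset_ids:
        lines.append(json.dumps({"index": INDEX}))
        lines.append(json.dumps({
            "size": 1,
            "query": profile_filter(profile_id, asset_id),
            "sort": [{"LastModifiedDate": {"order": "desc"}}],
        }))
    return "\n".join(lines) + "\n"

print(json.dumps(asset_ids_request("profile-guid"), indent=2))
print(latest_heartbeats_msearch("profile-guid", ["asset-1", "asset-2"]))
```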
Ideally I would encapsulate this logic in one efficient round trip between the service and Elasticsearch.
Any alternative suggestions for my implementation?
P.S. There are tens of millions of heartbeat docs in the type.
So, if I understand correctly, you want to get the most recent document for each assetId, ordered by maxDate (descending), for each of a list of profileIds. Is that correct?
Also, could you post the requests you are using to do this at the moment?
So you could use the top_hits aggregation here to list the most recent document for each assetId. Add it as a sub-aggregation of the assetId terms aggregation in your first request, with size 1 and sorted by LastModifiedDate descending, and you could get rid of the second request.
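A minimal sketch of such a single request, under the same assumptions as the description in this thread (documents carry ProfileId, AssetId, LastModifiedDate and ResumePoint, and the ID fields are indexed as exact values), might look like this. The aggregation names `by_asset` and `latest_heartbeat` and the function name are hypothetical, and the `bool`/`filter` syntax assumes an Elasticsearch version that supports it.

```python
import json

def latest_resume_points_body(profile_id, max_assets=100):
    """Build one request returning the newest heartbeat per asset.

    Buckets are ordered by the max LastModifiedDate, and top_hits picks
    the single most recent document inside each bucket.
    """
    return {
        "size": 0,  # the per-asset documents come back via top_hits
        "query": {
            "bool": {"filter": [{"term": {"ProfileId": profile_id}}]}
        },
        "aggs": {
            "by_asset": {
                "terms": {
                    "field": "AssetId",
                    "size": max_assets,
                    "order": {"maxDate": "desc"},
                },
                "aggs": {
                    # used only to order the AssetId buckets
                    "maxDate": {"max": {"field": "LastModifiedDate"}},
                    # the latest heartbeat document in each bucket
                    "latest_heartbeat": {
                        "top_hits": {
                            "size": 1,
                            "sort": [
                                {"LastModifiedDate": {"order": "desc"}}
                            ],
                            "_source": ["ResumePoint", "LastModifiedDate"],
                        }
                    },
                },
            }
        },
    }

print(json.dumps(latest_resume_points_body("profile-guid"), indent=2))
```

In the response, each AssetId bucket would then carry its own `latest_heartbeat` hits, so the resume point can be read from the bucket without a follow-up search.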