Top hits by sort criteria or include source into arbitrary aggs

Vladimir_Khazin · July 14, 2015, 7:09pm

https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-aggregations-metrics-top-hits-aggregation.html is a great feature!

How about the ability to define 'the most relevant document' ourselves, e.g. to order the aggregated docs by most recently updated rather than by count?

To illustrate by reusing the tags example: order the tags not by how many documents contain a given tag, but by the recency of the documents containing that tag.

Alternatively, what if we could use a nested aggregation with the ability to include fields in the lowest-resolution bucket? E.g. aggs -> terms by fieldA -> max by fieldB -> include fieldC in the output.

colings86 · July 14, 2015, 9:53pm

The problem here is that aggregations are by definition summarisations of collections of documents, not documents on their own. So while you can order your terms by a summary of the date field across all the documents in a bucket (say the maximum date) by adding a metric aggregation alongside the top_hits aggregation, you cannot order your terms by a single document, because a bucket isn't about a single document; it's about a collection of them.

The same goes for including fieldC in the output. The question is how to include fieldC, since the bucket contains more than one document and so potentially more than one value of fieldC. If you wanted to return the top N values of fieldC, you could add a terms aggregation alongside the max aggregation (of fieldB) in your example. If you wanted to return the number of unique values of fieldC, you could add a cardinality aggregation. If you wanted to include the value of fieldC for the top N documents in the bucket, you could use the top_hits aggregation and set it to only output fieldC for each document. But again, because these functions are performed on buckets (collections of documents) rather than on individual documents, it is not possible to include an item from the documents themselves in the aggregation output, only items computed by summarising across the documents in the bucket.
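
A minimal sketch of those three options side by side, written as a Python dict that serialises to a request body. The field names fieldA, fieldB and fieldC come from the question; the aggregation names (by_fieldA, max_fieldB and so on) and the sizes are placeholders chosen only for illustration, and the array form of `_source` inside top_hits is assumed to be accepted by the version in use.

```python
import json

# Terms on fieldA, ordered by the max of fieldB, with three alternative
# ways of surfacing fieldC per bucket. Aggregation names and sizes are
# illustrative assumptions, not part of the original question.
request_body = {
    "size": 0,
    "aggs": {
        "by_fieldA": {
            "terms": {
                "field": "fieldA",
                # Order the buckets by the sibling metric defined below.
                "order": {"max_fieldB": "desc"},
            },
            "aggs": {
                "max_fieldB": {"max": {"field": "fieldB"}},
                # Option 1: the top N values of fieldC in each bucket.
                "top_fieldC_values": {"terms": {"field": "fieldC", "size": 5}},
                # Option 2: the number of unique fieldC values in each bucket.
                "unique_fieldC_count": {"cardinality": {"field": "fieldC"}},
                # Option 3: fieldC of the top document(s) in each bucket.
                "top_docs_fieldC_only": {
                    "top_hits": {
                        "sort": [{"fieldB": {"order": "desc"}}],
                        "size": 1,
                        "_source": ["fieldC"],
                    }
                },
            },
        }
    },
}

print(json.dumps(request_body, indent=2))
```

In practice you would probably keep only the sub-aggregation you actually need, since each one adds work per bucket.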

Vladimir_Khazin · July 14, 2015, 10:22pm

Thank you for your comments!

I think the case I am running into is a combination of an aggregation and a lookup.

My document structure for playback heartbeat:
{
HeartbeatId: "guid",
ProfileId: "guid",
AssetId: "guid",
LastModifiedDate: "dateTime",
ResumePoint: "timespan"
}

Requirement: find latest resume point for each asset by profileId sorted in desc order.

My current solution is two requests:
First request:

filtered by profileId aggs
terms aggs assetId, sorted by maxDate: desc
child aggs max LastModifiedDate to generate maxDate for sorting of the parent aggs
that gives me list of unique asset ids by profile id, sorted by max modified date in desc order

Second request:

multi search by profile id and asset id with size: 1 and sort order LastModifiedDate desc.
that gives me resume point from the latest heartbeat

Ideally I would encapsulate this logic into one efficient round trip between the service and Elasticsearch.
Any alternative suggestions to my implementation?

P.S. There are tens of millions of heartbeat docs in the type.

colings86 · July 15, 2015, 12:54pm

So, if I understand correctly you want to get the most recent document for each assetId ordered by maxDate (descending), for each of a list of profileIds. Is that correct?

Also could you post the requests you are using to do this at the moment?

Vladimir_Khazin · July 21, 2015, 6:16pm

Sorry for the delay - discuss.elastic.co was not accessible for a couple of days and I switched my attention elsewhere.

Here is my first request:

{  
   "size":0,
   "aggs":{  
      "watchHistoryByProfile":{  
         "filter":{  
            "and":[  
               {  
                  "term":{  
                     "ProfileId":"74408640-3f3d-4f71-af68-f9d43c2f73a5"
                  }
               },
               {  
                  "not":{  
                     "term":{  
                        "UserDeleted":true
                     }
                  }
               },
               {  
                  "not":{  
                     "term":{  
                        "ContentType":3
                     }
                  }
               }
            ]
         },
         "aggs":{  
            "assets":{  
               "terms":{  
                  "field":"AssetId",
                  "order":{  
                     "maxDate":"desc"
                  },
                  "size":128
               },
               "aggs":{  
                  "maxDate":{  
                     "max":{  
                        "field":"LastModifiedDate"
                     }
                  }
               }
            }
         }
      }
   }
}

And here is my second request:

{"index":"shomi","type":"heartbeat"}
{"size":1,"filter":{"and":[{"term":{"ProfileId":"74408640-3f3d-4f71-af68-f9d43c2f73a5"}},{"term":{"AssetId":"c46ce139-3dd6-4dc3-ae40-76a93cc7500a"}},{"not":{"term":{"UserDeleted":true}}}]},"sort":{"LastModifiedDate":"desc"}}
{"index":"shomi","type":"heartbeat"}
{"size":1,"filter":{"and":[{"term":{"ProfileId":"74408640-3f3d-4f71-af68-f9d43c2f73a5"}},{"term":{"AssetId":"9eb4d4d1-8cb0-42d0-8401-424bf18db44a"}},{"not":{"term":{"UserDeleted":true}}}]},"sort":{"LastModifiedDate":"desc"}}
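
For reference, a rough sketch of how a multi search body like this could be assembled from the asset ids returned by the first request. The function name build_msearch_body is hypothetical; the index, type, field names and filter syntax simply mirror the request above. The one thing the format requires is that every header and every search body sits on its own line, with a trailing newline at the end.

```python
import json


def build_msearch_body(profile_id, asset_ids, index="shomi", doc_type="heartbeat"):
    """Return a multi search body: one header line and one search line per asset."""
    lines = []
    for asset_id in asset_ids:
        header = {"index": index, "type": doc_type}
        search = {
            "size": 1,
            "filter": {
                "and": [
                    {"term": {"ProfileId": profile_id}},
                    {"term": {"AssetId": asset_id}},
                    {"not": {"term": {"UserDeleted": True}}},
                ]
            },
            "sort": {"LastModifiedDate": "desc"},
        }
        # json.dumps without indent emits each object on a single line,
        # which is what the newline-delimited format needs.
        lines.append(json.dumps(header, separators=(",", ":")))
        lines.append(json.dumps(search, separators=(",", ":")))
    # The body must be terminated by a newline.
    return "\n".join(lines) + "\n"


body = build_msearch_body(
    "74408640-3f3d-4f71-af68-f9d43c2f73a5",
    [
        "c46ce139-3dd6-4dc3-ae40-76a93cc7500a",
        "9eb4d4d1-8cb0-42d0-8401-424bf18db44a",
    ],
)
print(body, end="")
```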

colings86 · July 22, 2015, 7:31am

So you could use the top_hits aggregation here to list the most recent document for each assetId. Your first request would then look something like the following and you could get rid of the second request:

{
  "size": 0,
  "aggs": {
    "watchHistoryByProfile": {
      "filter": {
        "and": [
          {
            "term": {
              "ProfileId": "74408640-3f3d-4f71-af68-f9d43c2f73a5"
            }
          },
          {
            "not": {
              "term": {
                "UserDeleted": true
              }
            }
          },
          {
            "not": {
              "term": {
                "ContentType": 3
              }
            }
          }
        ]
      },
      "aggs": {
        "assets": {
          "terms": {
            "field": "AssetId",
            "order": {
              "maxDate": "desc"
            },
            "size": 128
          },
          "aggs": {
            "most_recent_doc": {
              "top_hits": {
                "sort": [
                  {
                    "LastModifiedDate": {
                      "order": "desc"
                    }
                  }
                ],
                "size": 1
              }
            },
            "maxDate": {
              "max": {
                "field": "LastModifiedDate"
              }
            }
          }
        }
      }
    }
  }
}
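
As a rough sketch of how the client side could then read the resume points out of the single response (Python purely for illustration): the helper name latest_resume_points is hypothetical, and the sample response below is hand-written with made-up values to show the expected nesting, not output from a real cluster.

```python
def latest_resume_points(response):
    """Return (asset_id, max_date, resume_point) tuples in bucket order."""
    buckets = response["aggregations"]["watchHistoryByProfile"]["assets"]["buckets"]
    results = []
    for bucket in buckets:
        hits = bucket["most_recent_doc"]["hits"]["hits"]
        if not hits:
            continue  # no document came back for this bucket
        source = hits[0]["_source"]
        results.append(
            (bucket["key"], bucket["maxDate"]["value"], source.get("ResumePoint"))
        )
    return results


# Hand-written sample (made-up values) that mimics the response nesting.
sample_response = {
    "aggregations": {
        "watchHistoryByProfile": {
            "doc_count": 2,
            "assets": {
                "buckets": [
                    {
                        "key": "c46ce139-3dd6-4dc3-ae40-76a93cc7500a",
                        "doc_count": 2,
                        "maxDate": {"value": 1437500000000.0},
                        "most_recent_doc": {
                            "hits": {
                                "total": 2,
                                "hits": [
                                    {"_source": {"ResumePoint": "00:12:34"}}
                                ],
                            }
                        },
                    }
                ]
            },
        }
    }
}

for asset_id, max_date, resume_point in latest_resume_points(sample_response):
    print(asset_id, max_date, resume_point)
```

Because the terms aggregation is already ordered by maxDate, the buckets arrive in the order you want and no client-side sorting should be needed.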

Just out of interest, which country are you accessing the forums from? I only ask because I did not see an outage over the last week from the UK.

Vladimir_Khazin · July 22, 2015, 3:00pm

Technically from the same as you are - from Canada
