Top hits by sort criteria or include source into arbitrary aggs


(Vladimir Khazin) #1

https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-aggregations-metrics-top-hits-aggregation.html is a great feature!

How about the ability to define 'the most relevant document' ourselves, e.g. to order the aggregated docs by most recently updated rather than by count?

To illustrate by reusing the tags example: order the tags not by how many documents contain a certain tag, but by the recency of the documents containing that tag.

Alternatively, what if we could use a nested aggregation with the ability to include fields in the lowest-resolution bucket? E.g. aggs -> terms by fieldA -> max by fieldB -> include fieldC in the output.


(Colin Goodheart-Smithe) #2

The problem here is that aggregations are, by definition, summarisations of collections of documents, not documents on their own. So while you can order your terms by a summary of the date field across all the documents in a bucket (say the maximum date) by adding a metric aggregation alongside the top_hits aggregation, you could not order your terms by a single document, as a bucket isn't about a single document; it's about a collection of them.

The same goes for including fieldC in the output. The question would be how to include fieldC, since the bucket contains more than one document and so potentially more than one value of fieldC. If you wanted to return the top N values of fieldC, you could add a terms aggregation alongside the max aggregation (of fieldB) in your example. If you wanted to return the number of unique values of fieldC, you could add a cardinality aggregation. If you wanted to include the value of fieldC for the top N documents in the bucket, you could use the top_hits aggregation and set it to only output fieldC for each document. But again, because these functions are performed on the buckets (collections of documents) rather than the individual documents, it would not be possible to include an item from the documents themselves in the aggregation output, only items computed by summarising across the documents in the bucket.
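
The last of those options might be sketched as a request body like the one built below. This is only a minimal sketch: fieldA, fieldB and fieldC are the placeholder names from the question, while the helper `build_request` and the aggregation names `by_a`, `max_b` and `top_c` are made up for illustration. The `_source` / `include` spelling follows the 1.x top_hits documentation; other versions may spell it differently.

```python
import json


def build_request(size=10):
    """Build a search body: terms on fieldA, ordered by max(fieldB),
    returning fieldC of the single most recent document per bucket.

    All aggregation names used here are hypothetical.
    """
    return {
        "size": 0,  # only aggregation results, no top-level hits
        "aggs": {
            "by_a": {
                "terms": {
                    "field": "fieldA",
                    "size": size,
                    # order the buckets by the "max_b" sub-aggregation
                    "order": {"max_b": "desc"},
                },
                "aggs": {
                    "max_b": {"max": {"field": "fieldB"}},
                    "top_c": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"fieldB": {"order": "desc"}}],
                            # assumption: 1.x-style source filtering
                            "_source": {"include": ["fieldC"]},
                        }
                    },
                },
            }
        },
    }
```

Calling `json.dumps(build_request(), indent=2)` gives the JSON that would be sent as the search body. Note that the value of fieldC still comes back inside a top_hits hit for each bucket, not as a field on the bucket itself.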


(Vladimir Khazin) #3

Thank you for your comments!

I think the case I am running into is a combination of an aggregation and a lookup.

My document structure for playback heartbeat:
{
HeartbeatId: "guid",
ProfileId: "guid",
AssetId: "guid",
LastModifiedDate: "dateTime",
ResumePoint: "timespan"
}

Requirement: find the latest resume point for each asset by ProfileId, sorted in descending order.

My current solution is two requests:
First request:

  1. filter aggregation by ProfileId
  2. terms aggregation on AssetId, sorted by maxDate: desc
  3. child max aggregation on LastModifiedDate, which produces maxDate for sorting the parent aggregation
  4. that gives me a list of unique asset ids for the profile id, sorted by max modified date in descending order

Second request:

  1. multi search by profile id and asset id, with size: 1 and sort order LastModifiedDate desc
  2. that gives me the resume point from the latest heartbeat

Ideally I would encapsulate this logic into one efficient round trip between the service and Elasticsearch.
Any alternative suggestions for my implementation?

P.S. There are tens of millions of heartbeat docs in the type.


(Colin Goodheart-Smithe) #4

So, if I understand correctly you want to get the most recent document for each assetId ordered by maxDate (descending), for each of a list of profileIds. Is that correct?

Also could you post the requests you are using to do this at the moment?


(Vladimir Khazin) #5

Sorry for the delay - discuss.elastic.co was not accessible for a couple of days and I switched my attention elsewhere.

Here is my first request:

{  
   "size":0,
   "aggs":{  
      "watchHistoryByProfile":{  
         "filter":{  
            "and":[  
               {  
                  "term":{  
                     "ProfileId":"74408640-3f3d-4f71-af68-f9d43c2f73a5"
                  }
               },
               {  
                  "not":{  
                     "term":{  
                        "UserDeleted":true
                     }
                  }
               },
               {  
                  "not":{  
                     "term":{  
                        "ContentType":3
                     }
                  }
               }
            ]
         },
         "aggs":{  
            "assets":{  
               "terms":{  
                  "field":"AssetId",
                  "order":{  
                     "maxDate":"desc"
                  },
                  "size":128
               },
               "aggs":{  
                  "maxDate":{  
                     "max":{  
                        "field":"LastModifiedDate"
                     }
                  }
               }
            }
         }
      }
   }
}

And here is my second request:

{"index":"shomi","type":"heartbeat"}
{"size":1,"filter":{"and":[{"term":{"ProfileId":"74408640-3f3d-4f71-af68-f9d43c2f73a5"}},{"term":{"AssetId":"c46ce139-3dd6-4dc3-ae40-76a93cc7500a"}},{"not":{"term":{"UserDeleted":true}}}]},"sort":{"LastModifiedDate":"desc"}}
{"index":"shomi","type":"heartbeat"}
{"size":1,"filter":{"and":[{"term":{"ProfileId":"74408640-3f3d-4f71-af68-f9d43c2f73a5"}},{"term":{"AssetId":"9eb4d4d1-8cb0-42d0-8401-424bf18db44a"}},{"not":{"term":{"UserDeleted":true}}}]},"sort":{"LastModifiedDate":"desc"}}
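
For reference, a multi search body is newline-delimited: one header line and one search line per query, each terminated by a newline. A minimal sketch of assembling such a body from the asset ids returned by the first request could look like this. The helper name `build_msearch_body` is hypothetical; the index, type and field names are the ones from the request above.

```python
import json


def build_msearch_body(profile_id, asset_ids, index="shomi", doc_type="heartbeat"):
    """Return a newline-delimited multi search body with one
    header/search pair per asset id."""
    lines = []
    for asset_id in asset_ids:
        header = {"index": index, "type": doc_type}
        search = {
            "size": 1,  # only the latest heartbeat is needed
            "filter": {
                "and": [
                    {"term": {"ProfileId": profile_id}},
                    {"term": {"AssetId": asset_id}},
                    {"not": {"term": {"UserDeleted": True}}},
                ]
            },
            "sort": {"LastModifiedDate": "desc"},
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(search))
    # every line, including the last one, ends with a newline
    return "".join(line + "\n" for line in lines)
```

With 128 asset ids this produces 256 lines, which is the cost the single-request approach is meant to avoid.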


(Colin Goodheart-Smithe) #6

So you could use the top_hits aggregation here to list the most recent document for each assetId. Your first request would then look something like the following and you could get rid of the second request:

{
  "size": 0,
  "aggs": {
    "watchHistoryByProfile": {
      "filter": {
        "and": [
          {
            "term": {
              "ProfileId": "74408640-3f3d-4f71-af68-f9d43c2f73a5"
            }
          },
          {
            "not": {
              "term": {
                "UserDeleted": true
              }
            }
          },
          {
            "not": {
              "term": {
                "ContentType": 3
              }
            }
          }
        ]
      },
      "aggs": {
        "assets": {
          "terms": {
            "field": "AssetId",
            "order": {
              "maxDate": "desc"
            },
            "size": 128
          },
          "aggs": {
            "most_recent_doc": {
              "top_hits": {
                "sort": [
                  {
                    "LastModifiedDate": {
                      "order": "desc"
                    }
                  }
                ],
                "size": 1
              }
            },
            "maxDate": {
              "max": {
                "field": "LastModifiedDate"
              }
            }
          }
        }
      }
    }
  }
}
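
On the client side, the resume points can then be read out of the aggregation part of the response. The following is a rough sketch, assuming the usual response layout (`aggregations` -> filter agg -> terms `buckets` -> top_hits `hits.hits`) and the aggregation names from the request above. The function name `latest_resume_points` and the sample response, including its ResumePoint value, are made up purely for illustration.

```python
def latest_resume_points(response):
    """Extract (AssetId, ResumePoint) pairs from the aggregation response,
    keeping the bucket order (most recently modified asset first)."""
    buckets = response["aggregations"]["watchHistoryByProfile"]["assets"]["buckets"]
    results = []
    for bucket in buckets:
        hits = bucket["most_recent_doc"]["hits"]["hits"]
        if not hits:
            continue  # defensive: skip a bucket without a top hit
        source = hits[0]["_source"]
        results.append((bucket["key"], source.get("ResumePoint")))
    return results


# Hypothetical, heavily trimmed response used only to exercise the function.
sample_response = {
    "aggregations": {
        "watchHistoryByProfile": {
            "doc_count": 1,
            "assets": {
                "buckets": [
                    {
                        "key": "c46ce139-3dd6-4dc3-ae40-76a93cc7500a",
                        "doc_count": 1,
                        "most_recent_doc": {
                            "hits": {
                                "hits": [
                                    {"_source": {"ResumePoint": "00:12:34"}}
                                ]
                            }
                        },
                    }
                ]
            },
        }
    }
}

points = latest_resume_points(sample_response)
```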

Just out of interest, which country are you accessing the forums from? I only ask because I did not see an outage over the last week from the UK.


(Vladimir Khazin) #7

Technically from the same as you are - from Canada :wink:

