Top hits query with same score?

I would like to ask a basic question about the top hits aggregation.

How does the top hits aggregation choose a document when the scores are the same?
In my case, the sort order is left at the default and size is set to 1.

In the sample below, the top hits aggregation returned the earliest document.

[ywatanabe@localhost ~]$ curl -XGET localhost:9200/aggstest3/_search?pretty -d '{"size" : 2 , "query" : {"match_all" : {}}, "aggs" : { "host" : { "terms" : {"field" : "host.keyword" }, "aggs" : {"time" :{"top_hits" : { "size" : 1}} } }} }'
{
  "took" : 21,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "aggstest3",
        "_type" : "test",
        "_id" : "AV-gGu741A4C-ViaDhvV",
        "_score" : 1.0,
        "_source" : {
          "datetime" : "2017/11/09 00:00:00",
          "host" : "a"
        }
      },
      {
        "_index" : "aggstest3",
        "_type" : "test",
        "_id" : "AV-gGsQK1A4C-ViaDhuV",
        "_score" : 1.0,
        "_source" : {
          "datetime" : "2017/11/09 00:01:00",
          "host" : "a"
        }
      }
    ]
  },
  "aggregations" : {
    "host" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "a",
          "doc_count" : 2,
          "time" : {
            "hits" : {
              "total" : 2,
              "max_score" : 1.0,
              "hits" : [
                {
                  "_index" : "aggstest3",
                  "_type" : "test",
                  "_id" : "AV-gGu741A4C-ViaDhvV",
                  "_score" : 1.0,
                  "_source" : {
                    "datetime" : "2017/11/09 00:00:00",
                    "host" : "a"
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

The top_hits aggregation uses the internal doc_id (in Lucene) as a tiebreaker for documents with the same sort values.
The internal doc_id can differ for the same document in each replica of the same shard, so it's recommended to add another tiebreaker to the sort in order to get consistent results. For instance, you could use:
sort: ["_score", "datetime"] to force top_hits to rank documents by score first and use datetime as a tiebreaker.
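
To make the suggestion concrete, here is a minimal sketch of what the request body from the question could look like with the extra sort added. The helper name build_top_hits_request is made up for illustration, and the sketch assumes datetime is mapped as a sortable field (for example a date type); if it was dynamically mapped as text, a keyword sub-field would likely be needed instead. It only builds and prints the JSON body and does not talk to a cluster.

```python
import json


def build_top_hits_request(tiebreak_field="datetime"):
    """Build a search body whose top_hits sorts by score, then by a tiebreaker.

    `tiebreak_field` is assumed to be a sortable field in the index mapping.
    """
    return {
        "size": 2,
        "query": {"match_all": {}},
        "aggs": {
            "host": {
                "terms": {"field": "host.keyword"},
                "aggs": {
                    "time": {
                        "top_hits": {
                            "size": 1,
                            # Rank by score first; break ties on the chosen field
                            # instead of relying on the internal doc_id.
                            "sort": [
                                {"_score": {"order": "desc"}},
                                {tiebreak_field: {"order": "asc"}},
                            ],
                        }
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    # The printed JSON can be passed to curl with -d, as in the original request.
    print(json.dumps(build_top_hits_request(), indent=2))
```

With "order": "asc" on datetime, the earliest document wins the tie; switching it to "desc" would return the latest one instead.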

@jimczi

Thanks for the reply !

I would like to check one more thing: doc_id is not the same as _id.

Correct?

Yes, doc_id is not the document _id. It is an internal id that you cannot control and that Lucene uses to identify documents inside an index.
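
The difference can be illustrated with a small toy model. This is not Lucene or Elasticsearch code: the Doc type, the internal_doc_id numbers, and both helper functions are invented for the example. It assumes two replicas that hold the same two documents from the question but happened to assign them internal ids in a different order, and shows why an explicit tiebreaker gives the same answer on both.

```python
from typing import List, NamedTuple


class Doc(NamedTuple):
    internal_doc_id: int  # hypothetical per-replica internal id (not controllable)
    es_id: str            # the document _id, identical on every replica
    score: float
    datetime: str


def top_hit_default(docs: List[Doc]) -> Doc:
    # Highest score first; ties fall back to the internal id.
    return min(docs, key=lambda d: (-d.score, d.internal_doc_id))


def top_hit_with_tiebreaker(docs: List[Doc]) -> Doc:
    # Highest score first; ties are broken on datetime before the internal id.
    return min(docs, key=lambda d: (-d.score, d.datetime, d.internal_doc_id))


# Same documents, but the made-up internal ids are swapped between replicas.
replica_a = [
    Doc(0, "AV-gGu741A4C-ViaDhvV", 1.0, "2017/11/09 00:00:00"),
    Doc(1, "AV-gGsQK1A4C-ViaDhuV", 1.0, "2017/11/09 00:01:00"),
]
replica_b = [
    Doc(1, "AV-gGu741A4C-ViaDhvV", 1.0, "2017/11/09 00:00:00"),
    Doc(0, "AV-gGsQK1A4C-ViaDhuV", 1.0, "2017/11/09 00:01:00"),
]

if __name__ == "__main__":
    print(top_hit_default(replica_a).es_id, top_hit_default(replica_b).es_id)
    print(top_hit_with_tiebreaker(replica_a).es_id,
          top_hit_with_tiebreaker(replica_b).es_id)
```

In this model the default ordering picks a different document depending on which replica answers, while the datetime tiebreaker picks the same one on both.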

Got it. Thanks.
