Top Hits aggregation not working with rescoring despite 2.4.2 fix


(Janek Bevendorff) #1

Hey there,

I have a large collection of web documents that I want to search. To do that, I run a fast match query over all documents and then rescore the top N hits. To avoid getting, e.g., only Wikipedia pages when searching for "wikipedia", I want to aggregate by domain name using a top hits aggregation. After hours of debugging, I finally found out why it wasn't working: the top hits aggregation did not work with rescoring, a bug that was fixed in Elasticsearch 2.4.2.

However, the max aggregation that I use to order the buckets still doesn't work. My aggregation looks like this:

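// Bucket hits by host name and order the buckets by their best score, descending.
final AggregationBuilder aggregation = AggregationBuilders.terms("hosts")
        .field("warc_target_hostname_raw")
        .order(Terms.Order.aggregation("top_score", false))
        // Keep the four best hits per host.
        .subAggregation(AggregationBuilders.topHits("top_sites").setSize(4))
        // Highest document score in the bucket; referenced by order() above.
        .subAggregation(AggregationBuilders.max("top_score").script(new Script("_score")));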

which translates to this JSON:

{
  "hosts": {
    "terms": {
      "field": "warc_target_hostname_raw",
      "order": {
        "top_score": "desc"
      }
    },
    "aggregations": {
      "top_sites": {
        "top_hits": {
          "size": 4
        }
      },
      "top_score": {
        "max": {
          "script": {
            "inline": "_score"
          }
        }
      }
    }
  }
}
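
To make the goal concrete, here is a minimal plain-Java sketch of the result this aggregation is meant to produce: hits grouped by host, at most four hits per host, and hosts ordered by their best score. It does not use any Elasticsearch API. The `Hit` class, the `groupByHost` method, and the sample data are hypothetical and only for illustration, and the scores are assumed to be the ones after rescoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HostGrouping {

    /** Hypothetical stand-in for a search hit; not an Elasticsearch class. */
    static final class Hit {
        final String host;
        final String url;
        final double score; // assumed to be the score after rescoring

        Hit(String host, String url, double score) {
            this.host = host;
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Groups hits by host, keeps the best perHost hits in each group, and
     * orders the groups by their highest score, descending.
     */
    static Map<String, List<Hit>> groupByHost(List<Hit> hits, int perHost) {
        // Sort a copy by score, best first, so each bucket fills in rank order.
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        // A LinkedHashMap keeps insertion order. Each host is inserted when its
        // best hit is seen, so hosts end up ordered by max score, descending.
        Map<String, List<Hit>> buckets = new LinkedHashMap<>();
        for (Hit hit : sorted) {
            List<Hit> bucket = buckets.computeIfAbsent(hit.host, k -> new ArrayList<>());
            if (bucket.size() < perHost) {
                bucket.add(hit);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<Hit> hits = Arrays.asList(
                new Hit("en.wikipedia.org", "/wiki/Wikipedia", 9.1),
                new Hit("en.wikipedia.org", "/wiki/Main_Page", 8.7),
                new Hit("example.org", "/about-wikipedia", 9.4),
                new Hit("example.com", "/blog/wikipedia", 3.2));

        for (Map.Entry<String, List<Hit>> entry : groupByHost(hits, 4).entrySet()) {
            System.out.println(entry.getKey());
            for (Hit hit : entry.getValue()) {
                System.out.println("  " + hit.url + " (" + hit.score + ")");
            }
        }
    }
}
```

In the aggregation, "top_sites" plays the role of the per-host lists and "top_score" plays the role of the ordering key, so both need to see the same scores for the buckets to come out in this order.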

It turns out that the "_score" script still sees the old scores from before rescoring, so I only get garbage results. I can remove the max aggregation altogether, but then my query is terribly slow and the buckets aren't sorted by their top-ranked documents, which makes the aggregation useless.

Is there any way to work around this problem, or does this need another fix in Elasticsearch?

