Significant terms aggregations results dependent on size request parameter?

Hi,

So I have a very complicated query that involves:

  • terms filters
  • custom function scores
  • minimum score
  • parent-child setup with a has_parent query

I then ran a significant terms aggregation on this query, using the chi_square heuristic with a custom background filter and negatives excluded.

I ran this twice, once where I set the size request parameter to 0. Note this is the size request parameter that is on the same level as the query and aggs, NOT the size parameter on the aggregation itself. The second time, I did not set this size request parameter.
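As a rough sketch of the two runs (Python dicts; every field name and value below is a placeholder, the bool/filter layout is an assumption, and the real query is more involved), the only difference between the requests is the top-level size key:

```python
def build_body(size=None):
    """Build a search body. All field names and values are hypothetical."""
    body = {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # terms filter (placeholder field and values)
                        "filter": [{"terms": {"category": ["a", "b"]}}],
                        # parent-child restriction (placeholder parent type)
                        "must": [{
                            "has_parent": {
                                "parent_type": "account",
                                "query": {"match_all": {}},
                            }
                        }],
                    }
                },
                # custom scoring function (placeholder field)
                "functions": [{"field_value_factor": {"field": "popularity"}}],
            }
        },
        # minimum score threshold (placeholder value)
        "min_score": 1.5,
        "aggs": {
            "sig": {
                "significant_terms": {
                    "field": "tags",
                    "chi_square": {"include_negatives": False},
                    "background_filter": {"term": {"lang": "en"}},
                }
            }
        },
    }
    # The request-level size sits next to "query" and "aggs",
    # not inside the aggregation.
    if size is not None:
        body["size"] = size
    return body


run_a = build_body(size=0)  # first run: size explicitly 0
run_b = build_body()        # second run: size left unset

# Top-level keys whose values differ between the two runs.
diff = {k for k in run_a.keys() | run_b.keys() if run_a.get(k) != run_b.get(k)}
print(diff)
```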

Comparing the two runs, I got significantly different significant terms, even though I verified that the number of matching documents from the query was the same both times.

Is there a reason why the significant terms aggregation results depend so heavily on this top-level size parameter? I was under the impression that it only affects the number of hits returned for matching docs and doesn't affect the results of the aggregation. I only notice this significant difference on this complicated query. I generally don't see a difference on a simpler query. Also, if the size parameter were somehow limiting the document set for the aggregations, I would've expected setting the size to 0 to result in no significant terms, but that's not the case either. I'm running this on a 1 shard/1 replica index.

Here's some related background that should help with debugging (I'm hazy on the exact versions of Elasticsearch where these changes came in).

We decided to optimise for the case where size:0 was passed. The thinking was that if you only want aggs and no hits, then scores are not important, and:

  1. We know there is no need for a 2nd network trip to retrieve top-scoring docs.
  2. Aggs look at all matching docs so there was no need to score any of them.

We then discovered there were some aggs that were interested in scores, e.g. top_hits and sampler, so we added an extra method that checks whether any of the aggs in the request need scores and, if so, runs the scoring logic even if size=0.
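A toy model of that decision, written in Python purely for illustration: this is not the actual Elasticsearch code, the function names are made up, and the set of score-dependent aggs only contains the two mentioned above.

```python
# Aggregation types assumed to need scores (only the two named above).
SCORE_DEPENDENT_AGGS = {"top_hits", "sampler"}


def aggs_need_scores(aggs):
    """Return True if any agg, at any nesting level, is score-dependent."""
    for definition in aggs.values():
        for key, value in definition.items():
            if key in ("aggs", "aggregations"):
                # Recurse into sub-aggregations.
                if aggs_need_scores(value):
                    return True
            elif key in SCORE_DEPENDENT_AGGS:
                return True
    return False


def should_score(body):
    """Hypothetical switch: score only if hits are wanted or an agg needs scores."""
    size = body.get("size", 10)  # size defaults to 10 when omitted
    return size > 0 or aggs_need_scores(body.get("aggs", {}))


sig_only = {"aggs": {"sig": {"significant_terms": {"field": "tags"}}}}
with_top_hits = {
    "size": 0,
    "aggs": {
        "sig": {
            "significant_terms": {"field": "tags"},
            "aggs": {"examples": {"top_hits": {"size": 1}}},
        }
    },
}

print(should_score(sig_only))                 # size unset
print(should_score({**sig_only, "size": 0}))  # size 0, no score-dependent agg
print(should_score(with_top_hits))            # size 0, nested top_hits
```

Note that in this simplified model nothing in the query itself (a function_score or a minimum score, say) is consulted, only size and the aggs, which is the kind of interaction worth checking against the version you are running.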

Without digging further into which versions these changes landed in, and without knowing which version you are running, I can't say exactly what behaviour you should expect.

Perhaps an easier route is to simplify your query to the smallest reproducible example. Maybe swap the simpler terms agg in for the funkier significant_terms agg and remove other cruft. Also I wonder if the function_score and size:0 pairing may be the key to the problem given the switch designed to turn off scoring.
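A rough sketch of that narrowing-down in Python, assuming you have the request body as a dict and the two responses in the standard search response shape. The helper names are made up, and the example field and bucket values are placeholders.

```python
import copy


def simplify(body):
    """Return a copy of the body with significant_terms swapped for plain terms."""
    slim = copy.deepcopy(body)
    for definition in slim.get("aggs", {}).values():
        sig = definition.pop("significant_terms", None)
        if sig is not None:
            # Keep only the field; drop heuristic and background filter.
            definition["terms"] = {"field": sig["field"]}
    return slim


def bucket_counts(response, agg_name):
    """Map bucket key -> doc_count for one aggregation in a search response."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}


def differing_keys(resp_a, resp_b, agg_name):
    """Bucket keys that are missing from one run or have different counts."""
    a = bucket_counts(resp_a, agg_name)
    b = bucket_counts(resp_b, agg_name)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Placeholder request and made-up responses, just to show the shape.
body = {
    "size": 0,
    "aggs": {"sig": {"significant_terms": {"field": "tags", "chi_square": {}}}},
}
slim = simplify(body)

resp_size_0 = {"aggregations": {"sig": {"buckets": [
    {"key": "x", "doc_count": 5}, {"key": "y", "doc_count": 2}]}}}
resp_default = {"aggregations": {"sig": {"buckets": [
    {"key": "x", "doc_count": 5}, {"key": "z", "doc_count": 1}]}}}

print(slim["aggs"]["sig"])
print(differing_keys(resp_size_0, resp_default, "sig"))
```

If the plain terms agg already disagrees between the two runs, the aggregation type is probably not the culprit and the query side (function_score, minimum score) is the next thing to strip out.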

Cheers
Mark