Aggregation is not limited to documents in the query in multi-field results

Hi everyone, running Elasticsearch 7.10.1.

We have the following index mapping (excerpt):

"potential_participants": {
    "properties": {
        "surface": {
            "type": "text",
            "fields": {
                "raw": {
                    "type": "keyword"
                }
            }
        },
        "type": {
            "type": "keyword"
        }
    }
}

We are trying to send a /_search request with the following aggregation:

{
	"min_score": 10,
	"query": {
		"match": {
			"events.event_units.potential_participants.surface": "Nyenburgh Holding B.V."
		}
	},
	"aggs": {
		"entities": {
			"terms": {
				"field": "events.event_units.potential_participants.surface.raw"
			}
		}
	}
}

We really like the quality of the hits from this query: good documents come back containing the "potential named entities" we are actually interested in.
However, to our great surprise, the "potential named entities" contained in the "hits" section do not show up in the buckets aggregated on the "surface.raw" field. Why is that?
According to the documentation, the scope of an aggregation is defined by the query. We would therefore expect at least one surface.raw bucket to match a surface value in "hits". That is not the case: we see zero overlap between the two results.
How would we go about debugging this aggregation? All the best

What is the cardinality of the surface field? The terms aggregation returns only the top 10 buckets by default (its "size" parameter). Could that be the cause of your results?
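
To check whether the default bucket limit is hiding the expected values, one option is to resend the same request with an explicit "size" on the terms aggregation. The sketch below only builds the request body as a Python dict; the helper name build_search_body and the bucket count of 100 are illustrative assumptions, not values from this thread.

```python
import json

# Field name taken from the request in the question.
FIELD = "events.event_units.potential_participants.surface"


def build_search_body(text, min_score=10, bucket_count=100):
    """Build the _search body from the question with an explicit terms size.

    bucket_count=100 is an arbitrary illustrative value (assumption).
    """
    return {
        "min_score": min_score,
        "query": {"match": {FIELD: text}},
        "aggs": {
            "entities": {
                "terms": {
                    "field": FIELD + ".raw",
                    # Without "size", only the 10 largest buckets are returned.
                    "size": bucket_count,
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("Nyenburgh Holding B.V."), indent=2))
```

If the expected surface values appear once the size is raised, the aggregation scope was fine and the values were simply outside the top 10 buckets.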

This is in fact the correct hint. Thanks, Tomo_M. We've decided to use the results returned by the match query to fire a secondary query, based on a terms query, that carries the aggregations. In terms of performance this is orders of magnitude faster.
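
A minimal sketch of one possible reading of this two-step approach: take the document IDs from the first response's hits and build a second request that restricts the aggregation to those documents with a terms query on "_id". Whether the original solution filtered on IDs or on surface values is not stated in the thread, so the choice of "_id", the helper names, and the bucket count are all assumptions.

```python
RAW_FIELD = "events.event_units.potential_participants.surface.raw"


def hit_ids(search_response):
    """Collect the _id of every hit in a parsed _search response."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


def build_followup_body(doc_ids, bucket_count=100):
    """Build a second _search body that aggregates only over the given documents.

    Filtering on "_id" is an assumption about the approach described above;
    bucket_count=100 is an arbitrary illustrative value.
    """
    return {
        # The hits are already known from the first request; only buckets are needed.
        "size": 0,
        "query": {"terms": {"_id": doc_ids}},
        "aggs": {
            "entities": {
                "terms": {"field": RAW_FIELD, "size": bucket_count}
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical first response, trimmed to the parts used here.
    first_response = {"hits": {"hits": [{"_id": "a1"}, {"_id": "b2"}]}}
    print(build_followup_body(hit_ids(first_response)))
```

The first request can then drop its "aggs" section entirely, so the expensive scoring query and the aggregation no longer run over the same large document set.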
