Is using "filter" inside "aggs" more efficient than "query" at the top level of an aggregation body?

I'm working in a legacy code base and I think I've uncovered an inefficient way of doing a filter aggregation based on what I read in the docs.

These are my assumptions:

  • Using filters to eliminate documents from search results is more efficient than using full text search and only taking into account the highest score results
  • For aggregations, because you never care about score and only care about which bucket something falls into, you should only use filtering, not querying

My concern right now is only with how the query is done, not with the index mapping. So I'm combining the aggregation from the legacy code base with what I believe is the optimal mapping, based on what I've read in the Elasticsearch 7.12 docs (copy-pasted from JS):

{
    properties: {
        bio: {
            type: 'text',
            fields: {
                keyword: {
                    type: 'keyword',
                },
            },
        },
    },
},
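For anyone reproducing this, here is a minimal sketch of how that mapping could be wrapped into an index-creation request. It assumes the 7.x JavaScript client and the `test_index` name used for the documents in this post; `buildCreateIndexParams` is a hypothetical helper of mine, not part of the client.

```javascript
// Hypothetical helper: builds the params object for creating the test index.
function buildCreateIndexParams(indexName) {
    return {
        index: indexName,
        body: {
            mappings: {
                properties: {
                    bio: {
                        // Analyzed text for full text search...
                        type: 'text',
                        // ...plus an exact-value sub-field for aggregations.
                        fields: { keyword: { type: 'keyword' } },
                    },
                },
            },
        },
    };
}

const createParams = buildCreateIndexParams('test_index');
// Assumption: `client` is an already-connected @elastic/elasticsearch 7.x Client.
// await client.indices.create(createParams);
```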

This is the data I've indexed (code from JS):

await client.index({
    index: 'test_index',
    body:
    {
        bio: 'Dogs are the best pet.',
    },
});
await client.index({
    index: 'test_index',
    body: {
        bio: 'Cats are cute.',
    },
});
await client.index({
    index: 'test_index',
    body: {
        bio: 'Cats are cute.',
    },
});
await client.index({
    index: 'test_index',
    body: {
        bio: 'Cats are the greatest.',
    },
});

This is a minimal version of the aggregation in the legacy code I found. It's a use case where we want to know "For all documents whose 'bio' property is like 'cats', return a count of each distinct 'bio' property":

POST http://localhost:9200/test_index/_search

{
	"size": 0,
	"query": {
		"match": {
			"bio": "cats"
		}
	},
	"aggs": {
		"bios_with_cats": {
			"terms": {
				"field": "bio.keyword"
			}
		}
	}
}

The results make sense:

"buckets": [
    {
        "key": "Cats are cute.",
        "doc_count": 2
    },
    {
        "key": "Cats are the greatest.",
        "doc_count": 1
    }
]

But my concern is with the fact that this doesn't appear to be the way to do this use case according to what I read in the docs (Filter aggregation | Elasticsearch Guide [7.12] | Elastic). In the docs, they show a nested "aggs" object, with a "filter" object at the same level as the nested "aggs" object.

Here's my version of the aggregation according to what I see in the docs:

POST http://localhost:9200/test_index/_search

{
	"size": 0,
	"aggs": {
		"bios_with_cats": {
			"filter": {
				"match": {
					"bio": "cats"
				}
			},
			"aggs": {
				"bios": {
					"terms": {
						"field": "bio.keyword"
					}
				}
			}
		}
	}
}

The results are the same.
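One practical difference is where the buckets sit in the response. A rough sketch of reading them in both cases, assuming `body` is the parsed search response body; the two sample objects are hand-written and trimmed to the relevant parts, not real output, and the helper names are hypothetical:

```javascript
// Top-level "query" version: the terms agg sits directly under its name.
function bucketsFromTopLevelQuery(body) {
    return body.aggregations.bios_with_cats.buckets;
}

// Filter agg version: the terms agg ("bios") is nested one level down,
// next to the filter agg's own doc_count.
function bucketsFromFilterAgg(body) {
    return body.aggregations.bios_with_cats.bios.buckets;
}

// Hand-written sample bodies shaped like the two responses.
const sampleBuckets = [
    { key: 'Cats are cute.', doc_count: 2 },
    { key: 'Cats are the greatest.', doc_count: 1 },
];
const topLevelQueryBody = {
    aggregations: { bios_with_cats: { buckets: sampleBuckets } },
};
const filterAggBody = {
    aggregations: {
        bios_with_cats: { doc_count: 3, bios: { buckets: sampleBuckets } },
    },
};

const fromQuery = bucketsFromTopLevelQuery(topLevelQueryBody);
const fromFilter = bucketsFromFilterAgg(filterAggBody);
```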

This version following how it's written in the docs makes more sense to me. My understanding is that "query" contexts are about scoring documents, whereas "filter" contexts are about excluding them from results being calculated altogether. It strikes me as more efficient to use filters whenever possible, especially since the docs say that Elasticsearch caches filters to improve performance.

Is the version in the docs better to use? Would there ever be a reason to write it the way I found it in our legacy code base where a "query" context is used instead?

Generally the top level query is more efficient than the filter agg. The filter agg is really for if you need an extra filter inside of another agg or something like that.

You're right that the aggs don't do scoring, but if you don't fetch any documents then the top level query doesn't use the score either. If we don't need the score we don't build it.
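To make the "extra filter inside of another agg" case concrete, here is an illustrative request body built in JS. It reuses the `bio` field from the question; the `mentions_cute` agg name and the choice of "cute" as the inner filter are made up for the example.

```javascript
// Illustrative only: top-level query narrows the documents first,
// then a filter agg runs as a sub-aggregation of each terms bucket.
const searchBody = {
    size: 0,
    query: {
        match: { bio: 'cats' },
    },
    aggs: {
        bios_with_cats: {
            terms: { field: 'bio.keyword' },
            aggs: {
                // Hypothetical sub-agg: per bucket, count docs that also match "cute".
                mentions_cute: {
                    filter: { match: { bio: 'cute' } },
                },
            },
        },
    },
};

// Assumption: `client` is an already-connected @elastic/elasticsearch 7.x Client.
// const { body } = await client.search({ index: 'test_index', body: searchBody });
```

Here the inner filter can't be moved to the top level without changing which documents end up in the outer buckets, which is the kind of situation the filter agg is for.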


Wow, I'm glad I asked, since I had it backwards. I think what threw me off was that the docs seem to emphasize the difference between "query context" and "filter context". So I figured since I saw the word filter in the part of the docs on aggregations, that this would be the ideal way to do it. I don't recall seeing a clear example of my use case in the search part of the docs. I first saw a clear example when I got to that specific page I linked on the filter aggregation.

Mind clarifying a few more things for me?

"Generally the top level query is more efficient than the filter agg."

In general, with Elasticsearch, since it's a good idea to exclude documents as early as possible, is it a good idea to put the things that would exclude the most documents at the top level if possible?

"The filter agg is really for if you need an extra filter inside of another agg or something like that."

Would an example of this be doing an aggregation into buckets and then excluding things from those buckets? If so, wouldn't it always be better to do the filter at the top level then because it'd be more efficient to filter and then group into buckets instead of the other way around?

"if you don't fetch any documents then the top level query doesn't use the score either. If we don't need the score we don't build it."

Is this an optimization Elasticsearch does, similar to an SQL database analyzing the complete SQL query and then choosing to use an index because it thinks it will help? In this case, Elasticsearch chooses to skip building scores because it knows that it's doing an aggregation and therefore doesn't need them?

The top level query is generally evaluated "leap frog" fashion so the order doesn't matter. The queries have cost heuristics we use to help drive the evaluation too. But the aggs and the top level queries don't always mesh to run the optimizations "together". Sometimes they do. Sometimes they don't.
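A very simplified sketch of the "leap frog" idea, for readers who haven't seen it. This is not Lucene's actual implementation: each clause is modeled as a plain sorted array of matching doc IDs, and the linear scan stands in for a real skip-ahead operation. `leapfrogIntersect` is a name made up for the example.

```javascript
// Intersect two sorted lists of doc IDs by letting each side
// jump forward to the other side's current candidate.
function leapfrogIntersect(a, b) {
    const result = [];
    let i = 0;
    let j = 0;
    while (i < a.length && j < b.length) {
        if (a[i] === b[j]) {
            // Both clauses match this doc.
            result.push(a[i]);
            i++;
            j++;
        } else if (a[i] < b[j]) {
            // Advance a to the first doc ID >= b's candidate.
            while (i < a.length && a[i] < b[j]) i++;
        } else {
            // Advance b to the first doc ID >= a's candidate.
            while (j < b.length && b[j] < a[i]) j++;
        }
    }
    return result;
}

const matchesAB = leapfrogIntersect([1, 3, 5, 7, 9], [3, 4, 5, 8, 9]);
const matchesBA = leapfrogIntersect([3, 4, 5, 8, 9], [1, 3, 5, 7, 9]);
// Both calls yield [3, 5, 9]: the result does not depend on clause order.
```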

You are right about filtering things out in the top level query. In the past few releases I've merged some changes that let aggs participate in the top level query optimization. It can really help.

Like postgres we use metadata to decide how to execute stuff. But postgres has 30 years on us so there are places where we don't have alternative implementations to pick. And we have some aggs they don't. I think. It's been a few years since I seriously used postgres. And we have the distributed stuff. Different problems.

Anyway, in an ideal world the top level filter agg would be as good as a top level query. And if I have my way it will be in a release or two. But it's just never made the top of my list.


Awesome answers. Thanks for sharing that info.

I'd love to see those improvements merged in, but I understand why it might be a low priority. It seems like aggregations with full text searching is a bit of a niche use case for Elasticsearch. Besides, when it comes to this legacy code base, there are other areas where I'm sure optimizing will get us lots of performance improvement. In my optimized mapping, I'm using multi fields, one text and one keyword. In the code base that I'm refreshing, we were locked into using just a text field, and so we had to enable fielddata explicitly in the mappings upon reaching newer versions to retain that behavior. I believe once we switch to multi field and aggregate only on the keyword field, performance will improve.

That'll help, yeah. Field data on text fields is super expensive. It's always expensive in terms of memory, and when you modify the index it's very CPU intensive too.
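For comparison, a sketch of the two mapping styles being discussed. The legacy mapping is a guess at what "text field with fielddata enabled" looks like, since the real one isn't shown in this thread; `termsAggOn` is a hypothetical helper.

```javascript
// Assumed shape of the legacy mapping: one text field with fielddata turned on.
const legacyMapping = {
    properties: {
        bio: { type: 'text', fielddata: true },
    },
};

// Multi-field mapping from the question: text for search, keyword for aggs.
const multiFieldMapping = {
    properties: {
        bio: {
            type: 'text',
            fields: { keyword: { type: 'keyword' } },
        },
    },
};

// Hypothetical helper: builds a terms agg on the given field.
function termsAggOn(field) {
    return { terms: { field } };
}

// Legacy: aggregates on the text field itself, which relies on fielddata.
const legacyAgg = termsAggOn('bio');
// Multi-field: aggregates on the keyword sub-field, no fielddata needed.
const keywordAgg = termsAggOn('bio.keyword');
```

Note the two aggs are not interchangeable: on the text field the buckets are analyzed tokens, while on the keyword sub-field they are whole `bio` values.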

