I'm working in a legacy code base and I think I've uncovered an inefficient way of doing a filter aggregation based on what I read in the docs.
These are my assumptions:
- Using filters to eliminate documents from search results is more efficient than running a full-text search and only taking the highest-scoring results into account
- For aggregations, because you never care about score and only care about which bucket something falls into, you should only use filtering, not querying
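To make the second assumption concrete, here's roughly what I mean by "filtering, not querying": a match clause run in a filter context inside a bool query, so no relevance scores are computed at all (this is just my own sketch against the test index described below, not something from the legacy code):
POST http://localhost:9200/test_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "bio": "cats" } }
      ]
    }
  }
}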
Because my concern right now is only with how the query is written, and not with the index mapping, I'm combining the aggregation from the legacy code base with what I believe is the optimal way to index the data according to the Elasticsearch 7.12 docs, using this mapping (copy-pasted from JS):
{
  properties: {
    bio: {
      type: 'text',
      fields: {
        keyword: {
          type: 'keyword',
        },
      },
    },
  },
}
This is the data I've indexed (code from JS):
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

// One document about dogs and three about cats (two of them identical).
await client.index({
  index: 'test_index',
  body: {
    bio: 'Dogs are the best pet.',
  },
});
await client.index({
  index: 'test_index',
  body: {
    bio: 'Cats are cute.',
  },
});
await client.index({
  index: 'test_index',
  body: {
    bio: 'Cats are cute.',
  },
});
await client.index({
  index: 'test_index',
  body: {
    bio: 'Cats are the greatest.',
  },
});
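One note in case anyone wants to reproduce this: newly indexed documents only become visible to search after the index refreshes, so it may be necessary to force a refresh before running the searches below (this isn't part of the legacy code, just a reproduction aid):
POST http://localhost:9200/test_index/_refresh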
This is a minimal version of the aggregation I found in the legacy code. The use case is: "For all documents whose 'bio' property is like 'cats', return a count of each distinct 'bio' value":
POST http://localhost:9200/test_index/_search
{
  "size": 0,
  "query": {
    "match": {
      "bio": "cats"
    }
  },
  "aggs": {
    "bios_with_cats": {
      "terms": {
        "field": "bio.keyword"
      }
    }
  }
}
The results make sense:
"buckets": [
{
"key": "Cats are cute.",
"doc_count": 2
},
{
"key": "Cats are the greatest.",
"doc_count": 1
}
]
But my concern is that this doesn't appear to be the way the docs (Filter aggregation | Elasticsearch Guide [7.12] | Elastic) show for this use case. There, they use a nested "aggs" object, with a "filter" object at the same level as the nested "aggs".
Here's my version of the aggregation according to what I see in the docs:
POST http://localhost:9200/test_index/_search
{
  "size": 0,
  "aggs": {
    "bios_with_cats": {
      "filter": {
        "match": {
          "bio": "cats"
        }
      },
      "aggs": {
        "bios": {
          "terms": {
            "field": "bio.keyword"
          }
        }
      }
    }
  }
}
The results are the same.
This version, which follows the docs, makes more sense to me. My understanding is that the "query" context is about scoring documents, whereas the "filter" context only decides whether a document is included or excluded, without calculating scores at all. That makes it seem more efficient to use filters whenever possible, especially since the docs say Elasticsearch caches frequently used filters to improve performance.
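Following that reasoning, I could even imagine a third variant (again, my own sketch, not something from the legacy code or the docs) that keeps the terms aggregation but uses the filter-context match from my assumptions above as the top-level query:
POST http://localhost:9200/test_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "match": { "bio": "cats" } }
      ]
    }
  },
  "aggs": {
    "bios_with_cats": {
      "terms": {
        "field": "bio.keyword"
      }
    }
  }
}
As far as I can tell, clauses inside a bool query's "filter" run in filter context, which is exactly the kind of thing the docs describe as cacheable.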
Is the version in the docs better to use? Would there ever be a reason to write it the way I found it in our legacy code base where a "query" context is used instead?