DISCLAIMER: I am relatively new to Elasticsearch, so I apologize if my question is too "basic" or falls into the "everybody should know this" category.
Hi! I have a performance question. Let's say we have this denormalized data in an index:
[
  {
    "key_id": 1,
    "language": "en",
    "value": "<some long value here>"
  },
  {
    "key_id": 1,
    "language": "fr",
    "value": "<some long value here>"
  },
  {
    "key_id": 1,
    "language": "de",
    "value": "<some long value here>"
  },
  {
    "key_id": 2,
    "language": "en",
    "value": "<some long value here>"
  },
  {
    "key_id": 2,
    "language": "fr",
    "value": "<some long value here>"
  },
  {
    "key_id": 2,
    "language": "de",
    "value": "<some long value here>"
  }
]
The goal is to let the user search the values the way a text editor does. This means a wildcard search must be used to allow partial word matching (please do not focus on the wildcard part; we know it's expensive).
So, each key_id has a set of languages and a value for each of them. The editor displays all languages for each key_id, so when we search the values we do not care whether every language value for a key_id satisfies the search; a single match is enough. So basically a query would be something like this:
{
  "collapse": {
    "field": "key_id"
  },
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "value": "*ello wor*"
          }
        }
      ]
    }
  }
}
As you can see, we only need to know whether a given key_id contains what we are searching for. However, it looks like Elasticsearch performs this wildcard search on each language item of the key_id. So if the wildcard search has already matched the "en" value, it will still run a wildcard search on the "fr" and "de" values of the same key_id, which is a bit wasteful if you ask me.
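To make the difference concrete, here is a minimal plain-Python sketch of the two behaviors I have in mind. It is only my mental model, not how Elasticsearch actually executes the query: the sample values, the function names, and the use of `fnmatchcase` as a stand-in for the wildcard query are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Sample documents shaped like the index above; the values are made up.
DOCS = [
    {"key_id": 1, "language": "en", "value": "hello world"},
    {"key_id": 1, "language": "fr", "value": "bonjour le monde"},
    {"key_id": 1, "language": "de", "value": "hello world auf Deutsch"},
    {"key_id": 2, "language": "en", "value": "goodbye"},
    {"key_id": 2, "language": "fr", "value": "au revoir"},
    {"key_id": 2, "language": "de", "value": "auf Wiedersehen"},
]


def match_every_document(docs, pattern):
    """Test every value, then keep one hit per key_id (what I believe happens now)."""
    checked = 0
    matched_keys = []
    for doc in docs:
        checked += 1  # the pattern is evaluated for every document
        if fnmatchcase(doc["value"], pattern) and doc["key_id"] not in matched_keys:
            matched_keys.append(doc["key_id"])
    return matched_keys, checked


def match_skipping_known_keys(docs, pattern):
    """Skip values whose key_id has already matched (what I would like to happen)."""
    checked = 0
    matched_keys = []
    for doc in docs:
        if doc["key_id"] in matched_keys:
            continue  # this key_id already matched, so its other languages are not tested
        checked += 1
        if fnmatchcase(doc["value"], pattern):
            matched_keys.append(doc["key_id"])
    return matched_keys, checked
```

With the pattern `*ello wor*` on this sample data, both functions report only key_id 1, but the first one evaluates the pattern against all 6 values while the second evaluates it against 4.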
The actual data is a bit more complicated: each "key" can have an unlimited number of languages assigned to it, and the length of the values is also potentially unlimited. This means these "extra" searches add up very quickly. Maybe I just don't get it and this is not how it works.
So the question is: is there a more efficient way to "collapse" the search result per key_id, or to make Elasticsearch skip the values of key_ids that have already matched the query?
Thanks in advance!