Apologies for that, it was written in a bit of a hurry. I will try to explain it more clearly.
As far as I know, we are currently using ES 7.10.1.
I will use a simplified analogy, but it should be enough to highlight the current issue.
Let's say we use one index called fulltext for all tenants. Each tenant's documents have the same structure with, let's say, 2 fields, tenant_id and title, with the following mappings:
{
"tenant_id": {
"type": "integer",
"index": true
},
"title": {
"type": "text"
}
}
The title field has an analyzer as well, but I omitted it here. We set up an alias for each tenant, as in the following example for the tenant with id 12345:
{
"actions": [
{
"add": {
"index": "fulltext",
"alias": "fulltext-12345",
"filter": {
"term": { "tenant_id": "12345" }
}
}
}
]
}
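Since one such alias is needed per tenant, the request body can be generated rather than written by hand. Below is a minimal Python sketch, assuming the body is sent to the aliases API (POST /_aliases); the helper name alias_actions is hypothetical, and the term value is kept as a string only to mirror the example above.

```python
import json


def alias_actions(tenant_id, index="fulltext"):
    """Build an aliases request body that adds one filtered alias for a tenant.

    Hypothetical helper: the alias name pattern and the filter mirror the
    example above, nothing more.
    """
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": f"{index}-{tenant_id}",
                    # String value, as in the original example.
                    "filter": {"term": {"tenant_id": str(tenant_id)}},
                }
            }
        ]
    }


body = alias_actions(12345)
print(json.dumps(body, indent=2))
```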
All actions regarding adding documents or searching are performed on the alias directly, e.g. GET /fulltext-12345/_search.
When a user starts to type a search query, let's say com, we try to find relevant results for words starting with com using the following simplified query:
GET /fulltext-12345/_search
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{
"wildcard": {
"title": {
"value": "com*",
"boost": 2,
"rewrite": "top_terms_60"
}
}
}
]
}
}
}
We use top_terms_60 to get at least some relevance score calculation on the results. Otherwise the returned documents don't look good, as their order seems chaotic (constant scoring issue).
However, it seems that this rewrite does the following:
- finds the top 60 terms matching the provided value across the whole index
- applies the filter for the specific tenant id (12345 in this case)
- tries to search items based on those top 60 terms via a should query, as mentioned in the docs
- because those 60 terms were affected by other tenants, we don't get the expected items in our tenant (to be more precise, we get just a subset of the expected items, or none at all)
Let's say that tenant 12345 has 3 items with titles such as communication, comparison, community. But in the same index, there are many other items from other tenants with titles containing compile, compilation, compare, company, etc. Based on that, the top 60 terms contain many terms, but only one of them, comparison, matches anything in tenant 12345. Therefore, when I expect to get all 3 items (communication, comparison, community), I get only 1 (comparison).
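The scenario above can be reproduced in a toy model. The following Python sketch is not Elasticsearch's actual implementation: it only models the suspected order of operations (pick the top terms index-wide, then filter by tenant) and compares it with picking the terms inside the tenant. The ranking criterion (document count, then term) is a stand-in assumption, the cut-off is 4 instead of 60 to keep the data small, and the other tenants (1, 2, 3) and their documents are made up.

```python
from collections import Counter

# Toy index of (tenant_id, title) pairs. All data is invented for illustration.
DOCS = [
    (12345, "communication"), (12345, "comparison"), (12345, "community"),
    (1, "compile"), (2, "compile"), (3, "compile"),
    (1, "company"), (2, "company"), (3, "company"),
    (1, "compare"), (2, "compare"),
    (1, "compilation"), (2, "compilation"),
    (3, "comparison"),
]


def top_terms(docs, prefix, n):
    """Return n terms starting with prefix, ranked by a stand-in criterion.

    Assumption: terms are ranked by document count, ties broken by term.
    The real selection logic is internal to Lucene and is not modelled here.
    """
    counts = Counter(title for _, title in docs if title.startswith(prefix))
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ranked[:n]


def search_with_index_wide_terms(docs, tenant_id, prefix, n):
    """Suspected behaviour: choose terms over the whole index, then filter."""
    terms = set(top_terms(docs, prefix, n))
    return sorted(t for tid, t in docs if tid == tenant_id and t in terms)


def search_with_tenant_terms(docs, tenant_id, prefix, n):
    """Expected behaviour: choose terms only among the tenant's documents."""
    tenant_docs = [d for d in docs if d[0] == tenant_id]
    terms = set(top_terms(tenant_docs, prefix, n))
    return sorted(t for _, t in tenant_docs if t in terms)


index_wide = search_with_index_wide_terms(DOCS, 12345, "com", 4)
tenant_scoped = search_with_tenant_terms(DOCS, 12345, "com", 4)
print(index_wide)     # ['comparison']
print(tenant_scoped)  # ['communication', 'community', 'comparison']
```

In this model the index-wide top terms are company, compile, compare and comparison, so only one of the tenant's three titles survives the cut-off, which matches the symptom described above.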
We know that the solution would be to not use wildcards at all and to switch to, for instance, edge n-grams, which have proper scoring logic. However, there is not enough capacity to rework our current solution, so I am trying my luck: is there something we missed that could improve the query with the setup described above?
I hope this explains our current problem. If you need to know anything else, I will try my best to answer. Thank you!
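For readers unfamiliar with the edge n-gram alternative mentioned above, here is a minimal Python sketch of the idea only, not of our setup: each term is expanded into its leading prefixes at index time, so a typed prefix such as com can be matched as an ordinary term instead of a wildcard. The function name and the min_gram/max_gram defaults are illustrative assumptions.

```python
def edge_ngrams(term, min_gram=2, max_gram=5):
    """Return the leading prefixes of term with lengths min_gram..max_gram."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


titles = ["communication", "comparison", "community"]
for title in titles:
    print(title, edge_ngrams(title))

# Each title yields the token "com", so a plain term match on "com"
# would find all three without expanding a wildcard.
matches = [t for t in titles if "com" in edge_ngrams(t)]
```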