Multitenant index: wildcard rewrite processes documents before filters are applied

In our project, we use a multitenant index with aliases, where each alias points to a specific tenant via a filter on the tenant id.

We use full-text search, currently via wildcards (I know this is the least efficient solution, but reworking it to n-grams or something similar is out of scope for now because of the effort required). We need to set 'rewrite' => 'top_terms_60' to get any scoring of the results, because by default a pure wildcard search doesn't score our documents. This rewrite finds the top 60 terms across all documents in the index before the tenant filter is applied, rather than finding those terms only within the specific tenant.

I think I already know the answer, but it's worth asking anyway: is there any possibility to apply this top_terms_60 rewrite after the documents have been filtered?

The problem is that as the number of tenants rises, searches for some tenants return only 1 result (or none) for 3-letter inputs, because the top 60 terms were collected from documents of other tenants and are barely present in the requested tenant's data. For instance, a tenant that should return 15 results for a 3-letter input returns only 1 because of this issue. Later on, it might even return no results at all.

Which version of Elasticsearch are you using?

I am not sure I understand what you are doing. Can you please provide a sample query?

Apologies for that, it was written in a bit of a hurry. I will try to explain it more clearly.

As far as I know, we are currently using ES 7.10.1.

I will use a simplified analogy, but it should be enough to highlight the current issue.

Let's say we use one index called fulltext for all tenants. Each tenant's documents have the same structure with, say, two fields, tenant_id and title, with the following mappings:

{
  "tenant_id": {
    "type": "integer",
    "index": true
  },
  "title": {
    "type": "text"
  }
}

The title field has an analyzer as well, but I omitted it here. We set up an alias for each tenant, as in the following example for the tenant with id 12345:

{
  "actions": [
    {
      "add": {
        "index": "fulltext",
        "alias": "fulltext-12345",
        "filter": {
          "term": { "tenant_id": "12345" }
        }
      }
    }
  ]
}

All operations, whether adding documents or searching, are performed on the alias directly, e.g. GET /fulltext-12345/_search.
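For completeness, indexing through the alias would look something like the sketch below. Note that, as far as I know, the alias filter applies only at search time, so the document itself must still carry its tenant_id (worth double-checking):

```json
POST /fulltext-12345/_doc
{
  "tenant_id": 12345,
  "title": "communication"
}
```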

When a user starts typing a search query, let's say com, we try to find relevant results for words starting with com using the following simplified query:

GET /fulltext-12345/_search
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "wildcard": {
            "title": {
              "value": "com*",
              "boost": 2,
              "rewrite": "top_terms_60"
            }
          }
        }
      ]
    }
  }
}

We use top_terms_60 to get at least some relevance scoring on the results; otherwise the returned documents look chaotic (the constant-score issue).

However, it seems that this rewrite does the following:

  • find the top 60 terms matching the provided value across the whole index
  • apply the filter for the specific tenant id (12345 in this case)
  • search for items matching those top 60 terms via a should query, as mentioned in the docs
  • because those 60 terms were influenced by other tenants, we don't get the expected items for our tenant (more precisely, just a subset of the expected items, or none at all)

Let's say that tenant 12345 has 3 items with titles such as communication, comparison, community. But in the same index, there are many other items from other tenants with titles containing compile, compilation, compare, company, etc. As a result, the top 60 terms contain many terms but only one that matches anything in tenant 12345: comparison. Therefore, when I expect to get all 3 items (communication, comparison, community), I get only 1 (comparison).
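One way to inspect which terms the rewrite actually picked is the validate API with rewrite=true, which (if I read the docs correctly) returns the query as it would be executed after term expansion; the response's explanation should list the expanded terms:

```json
GET /fulltext-12345/_validate/query?rewrite=true
{
  "query": {
    "wildcard": {
      "title": {
        "value": "com*",
        "rewrite": "top_terms_60"
      }
    }
  }
}
```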

We know the proper solution would be to stop using wildcards entirely and switch to edge n-grams, for instance, which have proper scoring logic, but there is not enough capacity to rework our current solution. So I am trying my luck: perhaps there is something we missed that could improve the query with the current setup.

Hope this explains our current problem; if you need to know anything else, I will do my best. Thank you!

That is a very good explanation, but I must admit I have never used rewrite with wildcard queries. Based on what I read about it, I doubt it is possible to make it take the filter into consideration, so I suspect you might be stuck. What would be the impact of skipping the rewriting completely?

Yes, this is exactly what I would have suggested. It would be a lot more efficient and be easier to score.

The impact would be the constant-score issue mentioned above. In the example it's fine if I get 3 results in any order, but in real cases it might be 15 results or more, and relevance matters there: the most relevant results must be on top. Without scoring, the item you are looking for might be at the very end.

Couldn't agree more.

However thank you very much for making your time for this topic, appreciate that! :slight_smile:

One ugly potential workaround that may reduce the problem to some extent would be to send two queries in parallel using the multi-search API, although as your queries are expensive this may be a bad idea. It may also not be much less effort than actually addressing the problem properly.

The first query would be the one with the tenant filter and no rewrite, which would return results unsorted. The second query would be with the rewrite but without the tenant filter. You would then need to re-sort the results of the first query based on the scores from the second in your application. It would probably give you the same result you get today, but with a full result set that is partially sorted by relevancy.
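A sketch of what I mean, as a single multi-search request using the alias and index names from your example (the re-sorting by document id would live in your application):

```json
GET /_msearch
{"index": "fulltext-12345"}
{"query": {"wildcard": {"title": {"value": "com*"}}}}
{"index": "fulltext"}
{"query": {"wildcard": {"title": {"value": "com*", "boost": 2, "rewrite": "top_terms_60"}}}}
```

The first response gives the complete, tenant-filtered result set; the second gives the scores you can join onto it.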

Thank you for this idea!

From a feasibility point of view, it's definitely more feasible, as the only things that need to be altered are the search query and the post-processing. But as you mention, it still requires some effort to write the application logic that produces the expected results.

The performance of this approach troubles me as well, as we can have thousands of documents per tenant (even tens of thousands), and querying them with too general an input might bring back a lot of results, which would cause memory issues besides slower performance. As the number of tenants keeps rising, this would become less effective over time.

However, it might not be a completely bad idea when weighing the trade-offs against our business needs. It's still just a bandage over a problem that should be addressed differently, but it doesn't require as big an effort as the original rework. We need to think about it and consider all the trade-offs. Thank you again for coming up with a workaround!

Update: thinking about it again, I am wondering about the scoring issue. If we run the second query, we get scores computed across the whole index. So items in one tenant might be scored incorrectly: even if some term is the most frequent term in the specific tenant, it might be among the least frequent in the whole index, so other items in that tenant might get a higher score, resulting in wrongly sorted results. That seems to be another trade-off of this approach :smile: probably a very risky game changer.

Update 2: thinking about it one more time, I believe we wouldn't get scores for all the items we need in a specific tenant. As the top_terms_60 rewrite finds the top 60 terms across the whole index, the second query would return only items matching those 60 terms. So the score-pairing step would suffer from the same issue. If a specific tenant should return 15 results for a given input but returns only 1 with our current approach, then with the workaround I would get a score for only that one item, and the others would remain unscored, as they wouldn't be present in the second response.

Update 3: maybe the workaround could simply be to increase the number of top terms to search. Currently it's 60; if we use, for instance, 200, the chance of picking up terms belonging to the specific tenant is significantly higher. But I don't know how this affects overall performance; I guess we would have to give it a try. :smile:


So, just to give you an update :slight_smile:

We went with the Update 3 option and experimented with the number of top terms; 200 was already enough to get all results for the specific tenant. I decided to increase it further to 500 to have some safety margin, and performance seems barely affected (maybe 5 ms more per response).
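For reference, the only change from the query earlier in the thread is the rewrite parameter (500 being the value we settled on):

```json
GET /fulltext-12345/_search
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "wildcard": {
            "title": {
              "value": "com*",
              "boost": 2,
              "rewrite": "top_terms_500"
            }
          }
        }
      ]
    }
  }
}
```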

So I believe we can close this topic for now; we know the path forward for future problems, and hopefully we get a chance to rework our solution to the n-gram alternative. Thank you for helping me here! :slight_smile: