Collapse performance with heavy operations

DISCLAIMER: I am relatively new to Elasticsearch, so I apologize in case my question is too "basic" or falls into "everybody should know this" category :smile:

Hi! I have a performance question. Let's say we have this denormalized data in an index:

[
  {
    "key_id": 1,
    "language": "en",
    "value": "<some long value here>"
  },
  {
    "key_id": 1,
    "language": "fr",
    "value": "<some long value here>"
  },
  {
    "key_id": 1,
    "language": "de",
    "value": "<some long value here>"
  },
  {
    "key_id": 2,
    "language": "en",
    "value": "<some long value here>"
  },
  {
    "key_id": 2,
    "language": "fr",
    "value": "<some long value here>"
  },
  {
    "key_id": 2,
    "language": "de",
    "value": "<some long value here>"
  }
]

The goal is to allow the user to search the values the way a text editor does. This means that wildcard search must be used to allow for partial word matching (please do not focus on the wildcard part :laughing: we know it's expensive).

So, each key_id has a set of languages and a value for each of them. The editor displays all languages for each key_id, so when we search the values we do not need every language value of a key_id to match; a single match is enough. So basically a query would be something like this:

{
  "collapse":
  {
    "field": "key_id"
  },
  "query":
  {
    "bool":
    {
      "must":
      [
        {
          "wildcard":
          {
            "value": "*ello wor*"
          }
        }
      ]
    }
  }
}

As you can see, we only need to know whether a given key_id contains what we are searching for. However, it looks like Elasticsearch performs this wildcard search on every language item of the key_id. So if the wildcard search has already matched the "en" value, it will still run against the "fr" and "de" values of the same key_id, which is a bit wasteful if you ask me.

The actual data is a bit more complicated: each "key" can have a potentially unlimited number of languages assigned to it, and the length of the values is potentially unlimited too. This means these "extra" searches add up very quickly. Maybe I just don't get it and this is not how it works.

So the question is: Is there a more efficient way to "collapse" the search result per key_id or make Elasticsearch not search values for key_ids that already matched the query?

Thanks in advance!

Anyone? :smile:

Did you try just putting all the translations into a single document? Elasticsearch is good at returning single docs, and it sounds like you want to retrieve all the translations at once, so splitting them up into multiple documents is working against the grain.

You mentioned "unlimited" a couple of times in your question, but realistically there are only finitely many languages in the world, and the size of the text in a natural-language document is rarely measured in MBs, let alone GBs. So maybe you can revisit your assumption that this needs to support truly unlimited scaling.

Thank you for your reply!

Right now, these translations are nested within the key. However, we are thinking about denormalizing and putting each translation into its own document (since nested is bad for performance + right now if a single translation changes, that means that all nested documents are reindexed).

Elasticsearch is only used for filtering (to retrieve key_ids) and the original data is selected from another database.

The problem is further complicated by the fact that different users may have access to different languages (for example, a German translator does not need access to the French language). This means that the search must only take a certain set of languages into account. Placing all translations into a single array would give incorrect results whenever the string is found in a translation that the user does not have access to.

So the question comes down to this: why does Elasticsearch still need to search all items with the same key_id even though the results are collapsed on key_id (and no aggregations are present)? It would seem logical that once a match is found in a "collapsed" group, Elasticsearch should exclude that group from further searching.
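As a rough illustration of the per-user language restriction described above, here is a minimal sketch that builds the search body from the first post with an extra `terms` filter on `language`. The function name `build_search_body` is hypothetical, and the sketch assumes `language` is indexed so that exact-match `terms` filtering works (for example as a `keyword` field). It only builds the request body and does not contact a cluster.

```python
import json


def build_search_body(pattern, allowed_languages):
    """Build a collapsed wildcard search limited to the languages a user may see.

    Hypothetical helper: field names follow the sample documents in this thread.
    """
    return {
        "collapse": {"field": "key_id"},
        "query": {
            "bool": {
                # The wildcard clause is kept as in the original query.
                "must": [{"wildcard": {"value": pattern}}],
                # Assumption: `language` supports exact matching (e.g. keyword).
                "filter": [{"terms": {"language": allowed_languages}}],
            }
        },
    }


# A German translator who also reads English never searches the French values.
body = build_search_body("*ello wor*", ["en", "de"])
print(json.dumps(body, indent=2))
```

With a filter like this, documents in languages outside the user's set are excluded before the wildcard results are collapsed, which is something a single array of all translations cannot express as directly.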

P.S.
Yes, technically the number of languages in the world is not infinite, but if you take into account the combinations of regions and scripts, plus fictional languages, the number gets pretty big. Also, some users store something like "Terms of Service" complete with HTML formatting per key, so the values get pretty big.

It returns the top document for each key, so even if it finds one matching doc it needs to check whether any of the other docs with that key score more highly. I don't see an obvious way to break out of this early as you want.
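To make that point concrete, here is a toy model in plain Python. It is not Elasticsearch's implementation, and the scores are invented for illustration. It compares keeping the best-scoring hit per key_id with stopping at the first hit per key_id.

```python
def collapse_top_hits(hits):
    """Keep the highest-scoring hit per key_id; every hit must be inspected."""
    best = {}
    for hit in hits:
        current = best.get(hit["key_id"])
        if current is None or hit["score"] > current["score"]:
            best[hit["key_id"]] = hit
    return best


def collapse_first_hits(hits):
    """Keep only the first hit seen per key_id, as an early exit would."""
    first = {}
    for hit in hits:
        first.setdefault(hit["key_id"], hit)
    return first


# Invented scores: the "fr" document for key 1 outranks the "en" one.
hits = [
    {"key_id": 1, "language": "en", "score": 0.4},
    {"key_id": 1, "language": "fr", "score": 0.9},
    {"key_id": 2, "language": "de", "score": 0.7},
]

top = collapse_top_hits(hits)
first = collapse_first_hits(hits)
print(top[1]["language"])    # fr
print(first[1]["language"])  # en
```

The two strategies disagree on which document represents key 1, so skipping the remaining documents of a group is only safe when the score does not matter.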

Breaking out early is only reasonable if you don't care about the score, for which you need to tell Elasticsearch that any matching doc(s) will do. For instance, try using a filter instead of a must and/or specify a sort order (e.g. _doc says to get the first matching doc). I'm not familiar with the implementation, so I'm not sure if either of those actually triggers the optimisation you seek, but that's the first thing I'd try.
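For reference, the suggested variant of the request might look like the sketch below, which moves the wildcard clause into `filter` and sorts by `_doc`. The helper name `build_unscored_body` is hypothetical, and whether this form triggers any early-exit optimisation is exactly the open question here. The sketch only builds the request body.

```python
import json


def build_unscored_body(pattern):
    """Collapsed wildcard search that ignores relevance scores.

    Hypothetical helper: field names follow the sample documents in this thread.
    """
    return {
        "collapse": {"field": "key_id"},
        # Sort by index order instead of by score.
        "sort": ["_doc"],
        "query": {
            "bool": {
                # Filter clauses match documents without scoring them.
                "filter": [{"wildcard": {"value": pattern}}]
            }
        },
    }


body = build_unscored_body("*ello wor*")
print(json.dumps(body, indent=2))
```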

Nope, filtering or sorting does not help. The total returned by the query still reflects the number of items that wildcard would match, which means that all items are scanned anyway.

Looks like Elasticsearch does the collapsing as a post-action or something. Maybe it's a good optimization to look into for newer versions of Elasticsearch? :laughing:

Anyway, thank you for your time!

Sounds like a reasonable feature request to me. Would you open a GitHub issue to suggest it?

Sure, will do. Thanks for the suggestion.

P.S.
Added a suggestion: "[Optimisation] Do not do additional search for collapsed groups" (Issue #80281 in elastic/elasticsearch on GitHub). Hopefully, the description is clear enough.

:+1: Maybe mention that you want this to happen only when we're sorting by doc ID or otherwise ignoring the scores. If you care about scores, then there's no way to avoid the extra work.

Made some tweaks in the proposal :+1:

Welcome to our community! :smiley:

I know you've got a solution, but I wanted to say there's no such thing as a basic question, just a question you are looking for an answer for, and we're happy to help!

Well, since I don't have much experience with Elasticsearch, I thought that maybe I just lacked the knowledge and the answer was actually simple :laughing:. Turns out it's not...

Thanks for the warm welcome! :laughing:
