Collapse performance with heavy operations

DrParanoia · October 31, 2021, 9:32am

DISCLAIMER: I am relatively new to Elasticsearch, so I apologize if my question is too "basic" or falls into the "everybody should know this" category.

Hi! I have a performance question. Let's say we have this denormalized data in an index:

[
  {
    "key_id": 1,
    "language": "en",
    "value": "<some long value here>"
  },
  {
    "key_id": 1,
    "language": "fr",
    "value": "<some long value here>"
  },
  {
    "key_id": 1,
    "language": "de",
    "value": "<some long value here>"
  },
  {
    "key_id": 2,
    "language": "en",
    "value": "<some long value here>"
  },
  {
    "key_id": 2,
    "language": "fr",
    "value": "<some long value here>"
  },
  {
    "key_id": 2,
    "language": "de",
    "value": "<some long value here>"
  }
]

The goal is to let the user search the values the way a text editor does. This means a wildcard search must be used to allow partial word matching (please do not focus on the wildcard part; we know it's expensive).

So, each key_id has a set of languages and a value for each of them. The editor displays all languages for each key_id, so when we search the values we do not care whether every language value for a key_id satisfies the search; a single matching value is enough. So basically a query would be something like this:

{
  "collapse":
  {
    "field": "key_id"
  },
  "query":
  {
    "bool":
    {
      "must":
      [
        {
          "wildcard":
          {
            "value": "*ello wor*"
          }
        }
      ]
    }
  }
}

As you can see, we only need to know if a given key_id contains what we are searching for, however, it looks like Elasticsearch is performing this wildcard search on each language item of the key_id. So let's say, the wildcard search has matched the result in the "en" value, it will still perform a wildcard search on "fr" and "de" values of the same key_id, which is a bit wasteful if you ask me.

The actual data is a bit more complicated: each "key" can potentially have an unlimited number of languages assigned to it, and the length of the values is potentially unlimited as well. This means these "extra" searches add up very quickly. Maybe I just don't get it and this is not how it works.

So the question is: Is there a more efficient way to "collapse" the search result per key_id or make Elasticsearch not search values for key_ids that already matched the query?

Thanks in advance!

DrParanoia · November 3, 2021, 10:31am

Anyone?

DavidTurner · November 3, 2021, 10:57am

Did you try just putting all the translations into a single document? Elasticsearch is good at returning single docs, and it sounds like you want to retrieve all the translations at once, so splitting them up into multiple documents is working against the grain.

You mentioned "unlimited" a couple of times in your question but realistically there's only finitely many languages in the world and the size of the text in a natural-language document is rarely measured in MBs let alone GBs so maybe you can revisit your assumptions that this needs to support truly unlimited scaling.
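For illustration, a minimal sketch of what such a single-document layout might look like for key 1 from the question. The `translations` object and its per-language sub-fields are hypothetical names chosen here, not something the original data is known to use:

```json
{
  "key_id": 1,
  "translations": {
    "en": "<some long value here>",
    "fr": "<some long value here>",
    "de": "<some long value here>"
  }
}
```

With this shape there is one document per key, so no collapsing is needed to get one hit per key_id.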

DrParanoia · November 3, 2021, 11:18am

Thank you for your reply!

Right now, these translations are nested within the key. However, we are thinking about denormalizing and putting each translation into its own document (since nested is bad for performance + right now if a single translation changes, that means that all nested documents are reindexed).

Elasticsearch is only used for filtering (to retrieve key_ids) and the original data is selected from another database.

The problem is further complicated by the fact that different users may have access to different languages (for example, a German translator does not need access to the French language). This means that the search must take into account only a certain set of languages, and placing all translations into a single array would give incorrect results whenever the string is found in a translation that the user does not have access to.
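As a rough sketch of that per-user restriction in the denormalized model, the query from the first post could be narrowed with a `terms` filter on `language`. This assumes `language` is mapped as a keyword field, and the list `["de", "en"]` is just an example of one user's allowed languages:

```json
{
  "collapse": {
    "field": "key_id"
  },
  "query": {
    "bool": {
      "filter": [
        { "terms": { "language": ["de", "en"] } }
      ],
      "must": [
        { "wildcard": { "value": "*ello wor*" } }
      ]
    }
  }
}
```

Only documents in the allowed languages are candidates for the wildcard match, so a hit in "fr" never surfaces a key for this user.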

So the question comes down to this: why does Elasticsearch still need to search all items with the same key_id even though the results are collapsed using key_id (also, no aggregations are present)? It would seem logical that if a match is found in a "collapsed" group, Elasticsearch should exclude that group from further searching.

P.S.
Yes, technically the number of languages in the world is not infinite, but if you take into account the combinations of regions and scripts plus fictional languages, the number gets pretty big. Also, some users store something like a complete "Terms of Service" with HTML formatting under a single key, so the values get pretty big too.

DavidTurner · November 3, 2021, 12:47pm

It returns the top document for each key, so even if it finds one matching doc it needs to check whether any of the other docs with that key score more highly. I don't see an obvious way to break out of this early as you want.

Breaking out early is only reasonable if you don't care about the score, for which you need to tell Elasticsearch that any matching doc(s) will do. For instance, try using a filter instead of a must and/or specify a sort order (e.g. _doc says to get the first matching doc). I'm not familiar with the implementation, so I'm not sure whether either of those actually triggers the optimisation you seek, but that's the first thing I'd try.
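Concretely, the request from the first post with both changes applied might look like the sketch below. It only shows the shape of the suggestion; whether it lets Elasticsearch skip any work is not confirmed here:

```json
{
  "collapse": {
    "field": "key_id"
  },
  "sort": ["_doc"],
  "query": {
    "bool": {
      "filter": [
        { "wildcard": { "value": "*ello wor*" } }
      ]
    }
  }
}
```

The `filter` clause matches without computing scores, and sorting by `_doc` asks for hits in index order rather than by relevance.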

DrParanoia · November 3, 2021, 1:52pm

Nope, filtering or sorting does not help. The total returned by the query still reflects the number of items that wildcard would match, which means that all items are scanned anyway.

Looks like Elasticsearch does the collapsing as a post-action or something. Maybe it's a good optimization to look into for newer versions of Elasticsearch?

Anyway, thank you for your time!

DavidTurner · November 3, 2021, 2:10pm

Sounds like a reasonable feature request to me - would you open a Github issue to suggest it?

DrParanoia · November 3, 2021, 2:37pm

Sure, will do. Thanks for the suggestion.

P.S.
Added a suggestion: "[Optimisation] Do not do additional search for collapsed groups" (Issue #80281, elastic/elasticsearch on GitHub). Hopefully, the description is clear enough.

DavidTurner · November 3, 2021, 2:39pm

Maybe mention that you want this to happen when we're sorting by doc ID or otherwise ignoring the scores. If you care about scores then there's no way to avoid the extra work.

DrParanoia · November 3, 2021, 2:52pm

Made some tweaks in the proposal

warkolm · November 4, 2021, 12:29am

Welcome to our community!

I know you've got a solution, but I wanted to say there's no such thing as a basic question, just a question you are looking for an answer for, and we're happy to help!

DrParanoia · November 4, 2021, 10:11am

Well, since I don't have much experience with Elasticsearch, I thought that maybe I just lacked the knowledge and the answer was actually simple. Turns out it's not...

Thanks for the warm welcome!

system · December 2, 2021, 10:12am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
