Improve performance for update_by_query relational updates

Hi,
I would like to ask for advice, because I'm not sure how to optimize my data structures or requests to get more performance out of my system.

I have an Elasticsearch index with 1,000,000 objects stored in it. Let's say I store users in my index. Each user has an organization.

{
  "id": 2,
  "organization_id": 1,
  "login": "nicole.braun@example.org",
  "firstname": "Nicole",
  "lastname": "Braun",
  "email": "nicole.braun@example.org",
  "organization": {
    "id": 1,
    "name": "Example Foundation",
    "shared": true,
    "domain": "",
    "domain_assignment": false,
    "active": true,
    "note": "",
    "updated_by_id": 1,
    "created_by_id": 1,
    "created_at": "2023-07-12T08:04:56.110Z",
    "updated_at": "2023-07-13T14:01:37.675Z"
  }
}

For this example user Nicole, the document also stores relational data. Normally it is a hash structure which contains the full related object.
This solves two use cases for us:

  1. We can find the user by searching for the organization name Example Foundation.

  2. We can use the sub-structure for extended queries, e.g. searching for all users which belong to an active organization (organization.active: true).
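A minimal sketch of what those two use cases could look like as queries, built as plain Ruby hashes. The field names come from the user document above; the exact query DSL shape here is my assumption, not taken from the original post:

```ruby
# Use case 1: find the user via the organization name.
query_by_name = {
  query: {
    match: { 'organization.name' => 'Example Foundation' },
  },
}

# Use case 2: find all users belonging to an active organization.
query_by_active = {
  query: {
    term: { 'organization.active' => true },
  },
}
```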

Now, in the app the organization name can change from time to time, and the relational data (user.organization) needs to be updated.

I used an update_by_query request to solve this: I search for all users which have user.organization_id == 1 and update the data structure.

    url = build_url(type: type, action: '_update_by_query', with_pipeline: false, with_document_type: false, url_params: { conflicts: 'proceed' })
    return if url.blank?

    where = { organization_id: 1 }
    data  = {
      'organization' => {
        'id'                => 1,
        'name'              => 'Example Foundation NEW NAME', # example: changed name
        'shared'            => true,
        'domain'            => '',
        'domain_assignment' => false,
        'active'            => true,
        'note'              => '',
        'updated_by_id'     => 1,
        'created_by_id'     => 1,
        'created_at'        => '2023-07-12T08:04:56.110Z',
        'updated_at'        => '2023-07-13T14:01:37.675Z',
      },
    }

    script_list = []
    data.each do |key, _value|
      script_list.push("ctx._source.#{key}=params.#{key}")
    end

    data = {
      script: {
        lang:   'painless',
        source: script_list.join(';'),
        params: data,
      },
      query:  {
        term: where,
      },
    }

    # ...
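Putting the snippet above together, the generated painless script is a single assignment per top-level key. A trimmed, runnable sketch of the resulting request body (the data hash is shortened here for brevity):

```ruby
require 'json'

# Trimmed version of the data hash from the snippet above.
data = {
  'organization' => { 'id' => 1, 'name' => 'Example Foundation NEW NAME' },
}

# Same construction as above: one painless assignment per top-level key.
script_list = data.keys.map { |key| "ctx._source.#{key}=params.#{key}" }

body = {
  script: {
    lang:   'painless',
    source: script_list.join(';'), # => "ctx._source.organization=params.organization"
    params: data,
  },
  query:  {
    term: { organization_id: 1 },
  },
}

puts JSON.pretty_generate(body)
```

So every matching user document gets the whole organization sub-object replaced by one script statement.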

This worked great for a while, but now the system has grown and there is one organization which has 1,000,000 users.
In our system it takes about 10 minutes to update the organization data in those 1,000,000 user documents.

My questions are:

  1. Is there a better way to store the relational data so that it stays searchable (the two use cases above) and is easier to update?

  2. Are there any config settings which would allow us to update faster?

I am grateful for any answer that could help.

Hi @rolfschmidt

Maintaining relationships is really complicated, but have you tried changing the slicing?
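For reference, _update_by_query accepts a slices URL parameter that splits the work into parallel sub-requests; slices=auto lets Elasticsearch pick the count, normally one slice per shard, so a single-shard index gains nothing from it. A sketch of how the URL could carry the parameter; the helper below is hypothetical plain string handling, not the build_url helper from the question:

```ruby
require 'uri'

# Hypothetical helper: builds an _update_by_query URL with query parameters.
# `slices=auto` asks Elasticsearch to parallelize the update, typically one
# slice per shard (so it only helps on indices with more than one shard).
def update_by_query_url(index, params)
  "/#{index}/_update_by_query?#{URI.encode_www_form(params)}"
end

url = update_by_query_url('users', conflicts: 'proceed', slices: 'auto')
# => "/users/_update_by_query?conflicts=proceed&slices=auto"
```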

I would also try a new approach that avoids these relationships, for example denormalize everything and check how it performs.

Thanks @RabBit_BR for your reply. I tried slicing, but it did not work on the first try. We have to increase the number of shards to make it work, I guess :smiley: I will keep this thread up to date; I don't have a dump of the index for testing yet.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.