Improve performance for update_by_query relational updates

Hi,
I would like to ask for advice, because I'm not sure how to optimize my data structures or requests to get more performance out of my system.

I have an Elasticsearch index with 1,000,000 objects stored in it. Let's say I store users in my index. Each user has an organization.

{
  "id": 2,
  "organization_id": 1,
  "login": "nicole.braun@example.org",
  "firstname": "Nicole",
  "lastname": "Braun",
  "email": "nicole.braun@example.org",
  "organization": {
    "id": 1,
    "name": "Example Foundation",
    "shared": true,
    "domain": "",
    "domain_assignment": false,
    "active": true,
    "note": "",
    "updated_by_id": 1,
    "created_by_id": 1,
    "created_at": "2023-07-12T08:04:56.110Z",
    "updated_at": "2023-07-13T14:01:37.675Z"
  }
}

For this example user Nicole, the document also stores relational data. Normally it is a hash structure which contains the full related object.
This solves two use cases for us:

  1. We can find the user by searching for the organization name Example Foundation.

  2. We can use the sub-structure for extended queries, e.g. searching for all users which belong to an active organization (organization.active: true).
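A minimal sketch of what those two use cases could look like as queries, built as plain Ruby hashes. The field names come from the user document above; the exact query DSL shape here is my assumption, not taken from the original post:

```ruby
# Use case 1: find the user via the organization name.
query_by_name = {
  query: {
    match: { 'organization.name' => 'Example Foundation' },
  },
}

# Use case 2: find all users belonging to an active organization.
query_by_active = {
  query: {
    term: { 'organization.active' => true },
  },
}
```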

Now, in the app the organization name can change from time to time, and the relational data (user.organization) needs to be updated.

I used an update_by_query request to solve this: I search for all users which have user.organization_id == 1 and update the data structure.

    url = build_url(type: type, action: '_update_by_query', with_pipeline: false, with_document_type: false, url_params: { conflicts: 'proceed' })
    return if url.blank?

    where = { organization_id: 1 }
    data  = {
      'organization' => {
        'id'                => 1,
        'name'              => 'Example Foundation NEW NAME', # example: changed name
        'shared'            => true,
        'domain'            => '',
        'domain_assignment' => false,
        'active'            => true,
        'note'              => '',
        'updated_by_id'     => 1,
        'created_by_id'     => 1,
        'created_at'        => '2023-07-12T08:04:56.110Z',
        'updated_at'        => '2023-07-13T14:01:37.675Z',
      },
    }

    script_list = []
    data.each do |key, _value|
      script_list.push("ctx._source.#{key}=params.#{key}")
    end

    data = {
      script: {
        lang:   'painless',
        source: script_list.join(';'),
        params: data,
      },
      query:  {
        term: where,
      },
    }

    # ...
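Putting the snippet above together, the generated painless script is a single assignment per top-level key. A trimmed, runnable sketch of the resulting request body (the data hash is shortened here for brevity):

```ruby
require 'json'

# Trimmed version of the data hash from the snippet above.
data = {
  'organization' => { 'id' => 1, 'name' => 'Example Foundation NEW NAME' },
}

# Same construction as above: one painless assignment per top-level key.
script_list = data.keys.map { |key| "ctx._source.#{key}=params.#{key}" }

body = {
  script: {
    lang:   'painless',
    source: script_list.join(';'), # => "ctx._source.organization=params.organization"
    params: data,
  },
  query:  {
    term: { organization_id: 1 },
  },
}

puts JSON.pretty_generate(body)
```

So every matching user document gets the whole organization sub-object replaced by one script statement.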

This worked great for a while, but now the system has grown and there is one organization which has 1,000,000 users.
In our system it takes about 10 minutes to update the organization data in those 1,000,000 user documents.

My questions are:

  1. Is there a better way to store the relational data so that it stays searchable (the two use cases above) and is easier to update?

  2. Are there any config settings which would allow us to update faster?

I am grateful for any answer that could help.

Hi @rolfschmidt

Maintaining relationships is really complicated, but have you tried changing the slicing?
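For reference, _update_by_query accepts a slices URL parameter that splits the work into parallel sub-requests; slices=auto lets Elasticsearch pick the count, normally one slice per shard, so a single-shard index gains nothing from it. A sketch of how the URL could carry the parameter; the helper below is hypothetical plain string handling, not the build_url helper from the question:

```ruby
require 'uri'

# Hypothetical helper: builds an _update_by_query URL with query parameters.
# `slices=auto` asks Elasticsearch to parallelize the update, typically one
# slice per shard (so it only helps on indices with more than one shard).
def update_by_query_url(index, params)
  "/#{index}/_update_by_query?#{URI.encode_www_form(params)}"
end

url = update_by_query_url('users', conflicts: 'proceed', slices: 'auto')
# => "/users/_update_by_query?conflicts=proceed&slices=auto"
```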

I would also try a new approach that avoids these relationships, for example denormalize everything and check how it performs.

Thanks @RabBit_BR for your reply. I tried slicing, but it did not work on the first try. We have to increase the number of shards to make it work, I guess :smiley: I will keep this thread up to date; I don't have a dump of the index for testing yet.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.