Hi,
I would like to ask for advice because I'm not sure how to optimize my data structures or requests to get more performance out of my system.
I have a Elasticsearch with 1.000.000 objects stored in it. Let's say I have stored users in my elastic. Each user has an organization.
{
"id": 2,
"organization_id": 1,
"login": "nicole.braun@example.org",
"firstname": "Nicole",
"lastname": "Braun",
"email": "nicole.braun@example.org",
"organization": {
"id": 1,
"name": "Example Foundation",
"shared": true,
"domain": "",
"domain_assignment": false,
"active": true,
"note": "",
"updated_by_id": 1,
"created_by_id": 1,
"created_at": "2023-07-12T08:04:56.110Z",
"updated_at": "2023-07-13T14:01:37.675Z"
}
}
For this example user Nicole it also stores relational data. Normally it is a hash structure which contains the full object.
This solves 2 use cases for us:
-
We can find the user by searching for the organizaton name Example Foundation
-
We can use the sub structure to have extended queries. E.g. Search for all users which have active organizations (organization.active: true).
Now in the app it is possible that the organization name can change sometimes and the relational data (user.organization) needs to get updated.
I used an update_by_query request to solve this. So I search for all users which have the user.organization_id == 1 and update the data structure.
url = build_url(type: type, action: '_update_by_query', with_pipeline: false, with_document_type: false, url_params: { conflicts: 'proceed' })
return if url.blank?
where = { organization_id: 1 }
data = {
'organization' => {
"id": 1,
"name": "Example Foundation NEW NAME", # example change name
"shared": true,
"domain": "",
"domain_assignment": false,
"active": true,
"note": "",
"updated_by_id": 1,
"created_by_id": 1,
"created_at": "2023-07-12T08:04:56.110Z",
"updated_at": "2023-07-13T14:01:37.675Z"
}
}
script_list = []
data.each do |key, _value|
script_list.push("ctx._source.#{key}=params.#{key}")
end
data = {
script: {
lang: 'painless',
source: script_list.join(';'),
params: data,
},
query: {
term: where,
},
}
# ...
This worked great for a while, but now we have the situation that the system has grown and there is an organization which has 1.000.000 users.
In our system this takes like 10 minutes to update the organization data in the 1kk users.
My questions are:
-
Is there a better way to store the relational data, so it keeps searchable (2 use cases in the top) and is easier to update?
-
Are there any config settings which would allow us to update faster?
I am grateful for any answer that could help.