Is it possible to use the Bulk API to upsert a large number of documents (100, 1000, ...), but matched by a query, for example on my userId property, rather than by document ID?
Our use case is that we need to handle thousands of updates to files whose IDs we do not know, without reading them all first and putting the IDs in Redis or something similar.
Is this something that cannot be done? We really need to do it this way to ensure scalability and performance.
No, that is not possible. When using the bulk API you do need to specify document IDs and cannot use queries.
The update by query API seems to be what you will need to use, and while you may be able to group updates through some clever scripted updates you will need to run this multiple times. If you run it in parallel you likely need to ensure the queries do not affect the same documents as that would lead to version conflicts.
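One way to read "group updates through some clever scripted updates" is to send a single update-by-query call that matches several users at once and looks up each document's new value in the script params. The following is only a sketch under assumptions: `UserUpdate` and `buildUpdateByQueryRequest` are hypothetical names, `user_id` is assumed to be mapped as a `keyword` field, and every document of a given user is assumed to receive the same `provider` value.

```typescript
// Hypothetical shape of one grouped update: all documents of a user get the same provider.
interface UserUpdate {
  userId: string;
  provider: string;
}

// Builds the request for ONE update-by-query call that covers several users.
// The new values travel in the script params, keyed by user_id.
function buildUpdateByQueryRequest(index: string, updates: UserUpdate[]) {
  const providerByUser: Record<string, string> = {};
  for (const u of updates) {
    providerByUser[u.userId] = u.provider;
  }
  return {
    index,
    // Count version conflicts instead of aborting the whole request;
    // documents that hit a conflict are left unchanged.
    conflicts: "proceed" as const,
    // Assumption: user_id is a keyword field, so terms matches exact values.
    query: { terms: { user_id: Object.keys(providerByUser) } },
    script: {
      lang: "painless",
      source: "ctx._source.provider = params.providerByUser[ctx._source.user_id]",
      params: { providerByUser },
    },
  };
}
```

With the 8.x JavaScript client this object could presumably be passed as `await elasticSearchClient.updateByQuery(buildUpdateByQueryRequest(index, updates))`; older clients expect `query` and `script` under `body`. When running several such calls in parallel, the user groups should not overlap, for the version-conflict reason given above.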
Thank you very much.
EDIT: Nevertheless, I think this is a common requirement for a lot of Elastic users and worth looking into.
I have tried updateByQuery, but I can only update one file at a time. Can you provide an example of what you were thinking of?
My example:
const updateResponse = await elasticSearchClient.update<ElasticDocument>({
  index: index,
  id: userId,
  doc: {
    user_id: userId,
    provider: provider,
  },
  doc_as_upsert: true,
});
No, not really. It would depend on how you are updating and how many matches/updates there are per query. It is quite possible that you may need to run one for each.
We have thousands of users, each doing thousands of updates, and that would kill our servers. I imagine we are not the first to ask this kind of question. Are there any similar posts I could search through to find something useful? Updating one by one is unfortunately a no-go for us.
Do you have any combination of fields that is never updated and is unique to each document, so that you could use it (or maybe a hash of these fields) as the document ID?
I have a property that is unique, is stored in the document, and never changes.
If that is known or retrievable when you update, it would be a good candidate for the document ID.
That is not something I can guarantee, because we have multiple sources of data and, theoretically, the values could be the same across sources.
But a hash of it combined with the provider field might be a good solution.
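A minimal sketch of that idea, assuming the unique property is available at update time: derive the document ID from a hash of the provider and the unique property, then send the upserts through the bulk API, which needs IDs but no prior read. `FileRecord`, `externalId`, `documentId`, and `buildBulkUpsertOperations` are hypothetical names, and SHA-256 is just one reasonable choice of hash.

```typescript
import { createHash } from "crypto";

// Hypothetical record shape; externalId stands in for the never-changing unique property.
interface FileRecord {
  provider: string;
  externalId: string;
  userId: string;
}

// Deterministic document ID: the same (provider, externalId) pair always gives the same ID.
// The pair is JSON-encoded so that ("ab", "c") and ("a", "bc") cannot collide.
function documentId(provider: string, externalId: string): string {
  return createHash("sha256")
    .update(JSON.stringify([provider, externalId]))
    .digest("hex");
}

// Builds the alternating action/payload lines of a bulk request,
// one "update" action with doc_as_upsert per record.
function buildBulkUpsertOperations(
  index: string,
  records: FileRecord[],
): Array<Record<string, unknown>> {
  const operations: Array<Record<string, unknown>> = [];
  for (const r of records) {
    operations.push({
      update: { _index: index, _id: documentId(r.provider, r.externalId) },
    });
    operations.push({
      doc: { user_id: r.userId, provider: r.provider },
      doc_as_upsert: true,
    });
  }
  return operations;
}
```

With the 8.x JavaScript client the result could presumably be sent as `await elasticSearchClient.bulk({ operations: buildBulkUpsertOperations(index, records) })`; older clients take the same array as `body`. Because the ID is computed rather than looked up, no read or Redis cache is needed before the update.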