I'm curious if a bulk indexing API request compacts / uniques when ingested by Elasticsearch, i.e. will a create, then delete, then create, then update of a single document id collapse into just 1 create?
If not, why not?
No, it doesn't. The actions on an individual document execute independently. With your example:
If the document already existed then the first step, a create, would fail, but the other actions would work normally. In principle they could be collapsed, particularly in this example since it includes a delete, but in general collapsible sequences like this don't seem to occur often enough to make the optimisation worth implementing.
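To make the "execute independently" point concrete, here is a toy in-memory model of that sequence. It is only a sketch and not how Elasticsearch is implemented: `apply_bulk` and the dict-based `store` are hypothetical names, the update is a shallow merge for simplicity, and the numbers mimic the per-item HTTP statuses a bulk response reports (409 for a create on an existing id, 404 for a delete or update of a missing document).

```python
from typing import Any, Dict, List, Optional, Tuple

# (operation, document id, body); body is None for a delete.
Action = Tuple[str, str, Optional[Dict[str, Any]]]


def apply_bulk(store: Dict[str, Dict[str, Any]], actions: List[Action]) -> List[int]:
    """Apply each action in order and return one status per action.

    Toy model only: every action sees whatever state the previous
    actions left behind, and a failure does not stop the later ones.
    """
    statuses = []
    for op, doc_id, body in actions:
        if op == "create":
            if doc_id in store:
                statuses.append(409)  # conflict: document already exists
            else:
                store[doc_id] = dict(body or {})
                statuses.append(201)
        elif op == "delete":
            if doc_id in store:
                del store[doc_id]
                statuses.append(200)
            else:
                statuses.append(404)  # nothing to delete
        elif op == "update":
            if doc_id in store:
                store[doc_id].update(body or {})  # shallow merge (simplification)
                statuses.append(200)
            else:
                statuses.append(404)  # cannot update a missing document
        else:
            raise ValueError(f"unsupported operation: {op}")
    return statuses


# The sequence from the question, applied to an id that already exists.
existing = {"1": {"v": 0}}
sequence = [
    ("create", "1", {"v": 1}),
    ("delete", "1", None),
    ("create", "1", {"v": 2}),
    ("update", "1", {"n": 3}),
]
result = apply_bulk(existing, sequence)
```

With a pre-existing document only the first create is rejected; the delete, second create and update all go through, so the four actions cannot simply be treated as one create.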
When working with Flink (or possibly any other stateful streaming system), which produces continuous retracts and inserts from many joins, I've found that it may update the same document many times in a single bulk request.
Are there any cases where it's not possible to collapse multiple updates on the client? I think you're asking about doing this within Elasticsearch itself, but I'd expect the client to have more information about what the documents mean and therefore be able to do a better job of collapsing the sequence of operations before putting it into a bulk request at all.
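As a rough illustration of what client-side collapsing could look like, here is a minimal sketch. It assumes the client only emits full-document `index` actions and `delete` actions, in which case the last action per id decides the final state; `create` and partial `update` actions depend on earlier state and are rejected rather than guessed at. The names `collapse_actions` and `to_bulk_lines` and the index name `my-index` are made up for the example.

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Tuple

# (operation, document id, full document); document is None for a delete.
Action = Tuple[str, str, Optional[Dict[str, Any]]]


def collapse_actions(actions: Iterable[Action]) -> List[Action]:
    """Keep only the last action for each document id.

    Assumption: every action is either a full-document "index" or a
    "delete", so earlier actions on the same id have no effect on the
    final state. Output order follows each id's last occurrence.
    """
    latest: Dict[str, Action] = {}
    for op, doc_id, source in actions:
        if op not in ("index", "delete"):
            raise ValueError(f"cannot safely collapse operation: {op}")
        latest.pop(doc_id, None)  # re-insert so the id moves to the end
        latest[doc_id] = (op, doc_id, source)
    return list(latest.values())


def to_bulk_lines(index: str, actions: Iterable[Action]) -> List[str]:
    """Render actions as the newline-delimited JSON lines of a bulk body."""
    lines = []
    for op, doc_id, source in actions:
        lines.append(json.dumps({op: {"_index": index, "_id": doc_id}}))
        if op == "index":
            lines.append(json.dumps(source))
    return lines


raw = [
    ("index", "1", {"v": 1}),
    ("delete", "1", None),
    ("index", "1", {"v": 2}),
    ("index", "2", {"v": 9}),
    ("index", "1", {"v": 3}),
]
collapsed = collapse_actions(raw)
bulk_lines = to_bulk_lines("my-index", collapsed)
```

Note that collapsing changes how many times the document's version is incremented, so this only suits cases where the final document state is all that matters.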
Flink's Elasticsearch connector is fairly opaque and non-configurable, but at the end of the pipeline it just wraps the ES Java client with mostly default settings. It might be possible to fork it, but then there's a concern about divergence from the upstream source, and it's also not the smallest codebase to work with. Doing this from Elasticsearch itself seems like it would be easier in this case.
I think the same arguments apply to not doing this server-side in Elasticsearch too - you can't configure anything like this today, and it's certainly not a small or simple codebase either. IMO it's better to push this kind of highly-parallelisable work out to the edges as much as possible.