Ensuring Document Ordering in Bulk Ingestion

Hi everyone,

I'm currently working on implementing bulk operations for documents. In my scenario, I receive PubSub messages and generate corresponding Elasticsearch documents using the data from these PubSub messages. I have a concern regarding the ordering of these documents during bulk operations.

For example:

I receive messages through PubSub:
message_1, message_2, message_3, message_4

Subsequently, I create the following Elasticsearch documents:

document_1, document_2, document_3, document_4

My objective is to maintain the order of these documents within Elasticsearch. I'm wondering if the bulk operation guarantees the preservation of the order I've created.

Any insights or guidance on whether the order of documents can be ensured during bulk ingestion would be greatly appreciated.

Hi @Ivelin_Yanev, welcome to the community!

What version are you on... Just curious.

Can you explain a bit more what you mean by ordering? Is that based on ingestion timestamp or some other field?

Hi,

Elasticsearch does maintain the order of operations in a bulk request. When you send a bulk request to Elasticsearch, it processes the operations in the order they appear in the request. This means that if you have document_1, document_2, document_3, and document_4 in that order in your bulk request, they will be processed in that same order.
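
For example, here is a minimal sketch with the 8.x Java API client (the index name "messages", the id scheme, and the Map payloads are just placeholders, not anything from your setup): the operations are added to the bulk request in exactly the order you want them applied.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch.core.BulkRequest;
import co.elastic.clients.elasticsearch.core.BulkResponse;

public class OrderedBulkExample {

    // Builds one bulk request whose operations appear in the same order
    // as the incoming documents; Elasticsearch processes them in that order.
    static BulkResponse indexInOrder(ElasticsearchClient esClient,
                                     List<Map<String, Object>> documents) throws IOException {
        BulkRequest.Builder br = new BulkRequest.Builder();
        int i = 1;
        for (Map<String, Object> doc : documents) {
            String id = "document_" + i++;        // placeholder id scheme
            br.operations(op -> op
                .index(idx -> idx
                    .index("messages")            // placeholder index name
                    .id(id)
                    .document(doc)));
        }
        return esClient.bulk(br.build());
    }
}
```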

However, it's important to note that this doesn't necessarily mean that the documents will be available for search in the order they were processed. Elasticsearch is a distributed system, and once documents are indexed, they go through a process called refresh to make them available for search. This process is asynchronous and doesn't guarantee that documents will become searchable in the order they were indexed.

Regards

Hi @stephenb

Regarding the Elasticsearch version: I'm currently using the v8.11 Java client.

Allow me to provide more context:

I receive ordered PubSub messages, and I've taken steps to ensure their sequencing. Specifically, when I receive multiple messages of the same type, I can be certain that the last one received is the most recent version.

My question concerns the ordering of the messages coming from the broker. If I send these ordered messages (converted into the corresponding Elasticsearch documents) in a single bulk request, is there any assurance that they will be stored in Elasticsearch exactly as I've arranged them?

Your insights into this matter would be greatly appreciated.

Hi, Thank you for your response. Could you kindly share the documentation resource supporting your statement regarding the order of operations in a bulk request?

To clarify, I'm not concerned about searching or retrieving based on the order of processing. My main question is whether the bulk operation guarantees ordering; if it does, that resolves my issue.

Here's the specific challenge I'm addressing: I might receive multiple messages of the same type, and the order is guaranteed by the PubSub broker. Consequently, I expect the last message to represent the latest version of the Elasticsearch document.

I am not sure this is explicitly called out in the documentation.

If you are indexing a document with a single id multiple times in a single bulk request, I believe the ordering is preserved and the last one will prevail. If, however, you are sending multiple bulk requests in parallel based on the data in the PubSub broker, there is no guarantee across bulk requests. As described earlier, note that some of the documents in the bulk request may become available for search before others.
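
As a rough illustration with the 8.x Java client (index name, id, and payloads are placeholders), here are three index operations on the same id within one bulk request; the document from the last operation is the one that should remain:

```java
import java.io.IOException;
import java.util.Map;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch.core.BulkRequest;
import co.elastic.clients.elasticsearch.core.BulkResponse;

public class SameIdBulkExample {

    // Three operations on the same id in one bulk request; the last one
    // in the request should be the version that ends up in the index.
    static BulkResponse lastWriteWins(ElasticsearchClient esClient) throws IOException {
        BulkRequest request = new BulkRequest.Builder()
            .operations(op -> op.index(idx -> idx
                .index("messages").id("order-42")               // placeholder index/id
                .document(Map.of("status", "created"))))
            .operations(op -> op.index(idx -> idx
                .index("messages").id("order-42")
                .document(Map.of("status", "paid"))))
            .operations(op -> op.index(idx -> idx
                .index("messages").id("order-42")
                .document(Map.of("status", "shipped"))))        // this version prevails
            .build();
        return esClient.bulk(request);
    }
}
```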

It is also worth noting that frequently indexing/updating the same document can result in significant overhead and reduce indexing throughput.

Thanks @Christian_Dahlqvist

I would like to ask about the scenario where a bulk operation is executed: if a specific document fails, will that cause the entire bulk request to fail or not?

All items of a bulk request are indexed individually, so failures only affect parts of the bulk request. The success or failure of each individual document is reported in the response. You could therefore have, e.g., 3 updates to the same document in a bulk request and have only the second one fail. There are no transactions or atomicity guarantees.
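
For instance, a minimal sketch of checking the per-item results with the 8.x Java client (the ack/requeue callbacks below are placeholders standing in for whatever your PubSub consumer actually provides):

```java
import java.util.function.Consumer;

import co.elastic.clients.elasticsearch.core.BulkResponse;
import co.elastic.clients.elasticsearch.core.bulk.BulkResponseItem;

public class BulkResultCheck {

    // Walks the per-item results of a bulk response: failed items are reported
    // individually while the rest of the request can still succeed.
    static void handle(BulkResponse response,
                       Consumer<String> ackMessage,       // placeholder for your PubSub ack
                       Consumer<String> requeueMessage) { // placeholder for retry/requeue
        if (!response.errors()) {
            response.items().forEach(item -> ackMessage.accept(item.id()));
            return;
        }
        for (BulkResponseItem item : response.items()) {
            if (item.error() != null) {
                // Only this item failed; the reason is available via item.error().reason()
                requeueMessage.accept(item.id());
            } else {
                ackMessage.accept(item.id());
            }
        }
    }
}
```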


In this scenario, I have three documents (doc1, doc2, doc3) targeting the same document. Ideally, the last document (doc3) would represent the latest version, given that my ordering is guaranteed by PubSub. However, if the last document (doc3) fails, Elasticsearch would be left with an outdated version (not the latest).

Additionally, I need to confirm the acknowledgment message to PubSub once these documents are successfully stored in Elasticsearch.

True, but you can identify this from the bulk response and potentially retry (if needed).

Hi @Christian_Dahlqvist
I believe I've found a good approach for my case.

I will use versioning with version_type set to external. My data already carries a version, so I can leverage it: Elasticsearch will ensure that the latest version is stored, rejecting any write whose version is not higher than the one already indexed. Consequently, I won't need to be concerned about the ordering of the documents in this scenario.
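
Here is roughly what I have in mind, as a sketch (the index name, id, and where the version number comes from are placeholder assumptions): each index operation carries the external version, and Elasticsearch rejects writes whose version is not higher than the stored one.

```java
import java.io.IOException;
import java.util.Map;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.VersionType;
import co.elastic.clients.elasticsearch.core.BulkRequest;
import co.elastic.clients.elasticsearch.core.BulkResponse;

public class ExternalVersionBulk {

    // Indexes a document with an externally supplied version; Elasticsearch
    // rejects the write if the stored version is already >= this version,
    // so out-of-order or duplicate deliveries cannot overwrite newer data.
    static BulkResponse indexWithVersion(ElasticsearchClient esClient,
                                         String id,
                                         long externalVersion,   // e.g. taken from the PubSub message
                                         Map<String, Object> doc) throws IOException {
        BulkRequest request = new BulkRequest.Builder()
            .operations(op -> op.index(idx -> idx
                .index("messages")                               // placeholder index name
                .id(id)
                .version(externalVersion)
                .versionType(VersionType.External)
                .document(doc)))
            .build();
        return esClient.bulk(request);
    }
}
```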

If you have a field in your data that indicates the version, that may indeed be a good solution.
