[hadoop] Why is the default value of es.batch.write.refresh true?

The elasticsearch-hadoop documentation says:

es.batch.write.refresh (default true)
Whether to invoke an index refresh or not after a bulk update has been completed. Note this is called only after the entire write (meaning multiple bulk updates) have been executed.

However, the official Elasticsearch documentation on refresh says:

The Index, Update, Delete, and Bulk APIs support setting refresh to control when changes made by this request are made visible to search.
This (set refresh=true of the bulk request) should ONLY be done after careful thought and verification that it does not lead to poor performance, both from an indexing and a search standpoint.
true creates less efficient indexes constructs (tiny segments) that must later be merged into more efficient index constructs (larger segments). Meaning that the cost of true is paid at index time to create the tiny segment, at search time to search the tiny segment, and at merge time to make the larger segments.

The conflict between the two documents is confusing. Can anybody explain why elasticsearch-hadoop uses true as the default value?

@f9865 It's important to note that ES-Hadoop only performs the refresh operation on the index after the task is done writing all bulk operations to Elasticsearch. The logic here is that 90% of the time, the expectation is that the data is already searchable once the job is completed. For many users that are executing at much larger scales in production, we suggest turning this feature off. Like most of the configuration settings in ES-Hadoop, the default values are meant to make the getting started experience with the software as easy as possible, while still allowing for production grade tuning.

after the task is done writing all bulk operations to Elasticsearch
once the job is completed

Does this mean when the whole Hadoop task finishes, or when a single batch finishes? Let's say I have a task that will write 1bn documents to the index in total, with the following batch configuration:

es.batch.size.bytes = 5mb
es.batch.size.entries = 6000

So does ES-Hadoop refresh only after all 1bn documents have been written to the index, or after every single batch finishes?

It will refresh only after the Hadoop task finishes.

It's important to note, though, that it does this at the task level. If you have 10 tasks, each writing out 1bn documents, it's possible that one task will finish first and do a refresh while the other 9 tasks are still running. If each task was writing out 1000 documents, the refresh wouldn't be that big of a deal: you likely won't see any performance issues, and ES will eventually merge those segments. But since we're (hypothetically, in this case) writing a large amount of data in each task, tasks that finish earlier can and will trigger an index-wide refresh while the other tasks are still writing.
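To put rough numbers on the example in this thread, here is a small back-of-the-envelope sketch. It assumes the es.batch.size.entries limit (6000) is reached before the es.batch.size.bytes limit on every flush, which depends on document size, and the function names are made up for illustration.

```python
def bulk_requests_per_task(docs_per_task: int, batch_entries: int) -> int:
    """Bulk requests one task sends, assuming the entry limit triggers every flush."""
    # Integer ceiling division: the last, partially filled batch still counts.
    return (docs_per_task + batch_entries - 1) // batch_entries


def refreshes_per_job(num_tasks: int, refresh_enabled: bool = True) -> int:
    """Each task issues one index-wide refresh when it finishes, if the setting is on."""
    return num_tasks if refresh_enabled else 0


if __name__ == "__main__":
    print(bulk_requests_per_task(1_000_000_000, 6000))   # 166667 bulk requests per task
    print(refreshes_per_job(10))                         # 10 refreshes, one per task
    print(refreshes_per_job(10, refresh_enabled=False))  # 0 refreshes from ES-Hadoop
```

So the refresh count follows the number of tasks, not the number of bulk requests. The cost comes from the early refreshes landing while other tasks are still indexing heavily.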

In practice, if you're going to be writing a very large amount of data per task, it's probably best to disable the refresh option and do the refresh yourself after the job completes. There's no hook for ES-Hadoop to reliably use at the moment to do a single refresh after the job is completed. Only at the task level can we do that.
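A minimal sketch of that approach, assuming you can run a small driver-side step once the job has finished. The setting name comes from the ES-Hadoop docs and `POST /<index>/_refresh` is the standard Elasticsearch refresh API; the host, index name, and helper functions below are hypothetical, and authentication and error handling are left out.

```python
import urllib.parse
import urllib.request


def writer_settings() -> dict:
    """Job settings that turn off the per-task refresh in ES-Hadoop."""
    return {"es.batch.write.refresh": "false"}


def build_refresh_request(base_url: str, index: str) -> urllib.request.Request:
    """Build (but do not send) a POST /<index>/_refresh request."""
    path = urllib.parse.quote(index, safe="")
    url = f"{base_url.rstrip('/')}/{path}/_refresh"
    return urllib.request.Request(url, method="POST")


def refresh_index(base_url: str, index: str) -> int:
    """Send the refresh once, after the whole job has completed; returns the HTTP status."""
    with urllib.request.urlopen(build_refresh_request(base_url, index)) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical host and index name; pass writer_settings() to the job configuration,
    # run the job, then call refresh_index(...) from the driver.
    req = build_refresh_request("http://localhost:9200", "my-index")
    print(writer_settings(), req.get_method(), req.full_url)
```

This way the index gets exactly one refresh, after every task has finished, instead of one per task.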

Hope that helps!

It does help. Thanks a lot.
