[hadoop] Why is the default value of es.batch.write.refresh true?

f9865 · July 24, 2019, 6:22am

The elasticsearch-hadoop documentation says:

es.batch.write.refresh (default true)
Whether to invoke an index refresh or not after a bulk update has been completed. Note this is called only after the entire write (meaning multiple bulk updates) have been executed.

However, the official Elasticsearch documentation on refresh says:

The Index, Update, Delete, and Bulk APIs support setting refresh to control when changes made by this request are made visible to search.
This (set refresh=true of the bulk request) should ONLY be done after careful thought and verification that it does not lead to poor performance, both from an indexing and a search standpoint.
true creates less efficient indexes constructs (tiny segments) that must later be merged into more efficient index constructs (larger segments). Meaning that the cost of true is paid at index time to create the tiny segment, at search time to search the tiny segment, and at merge time to make the larger segments.

The conflict between the two documents is confusing. Can anybody explain why elasticsearch-hadoop uses true as the default value?

james.baiera · August 14, 2019, 7:14pm

@f9865 It's important to note that ES-Hadoop only performs the refresh operation on the index after the task is done writing all bulk operations to Elasticsearch. The logic here is that 90% of the time, the expectation is that the data is already searchable once the job is completed. For many users who are running at much larger scales in production, we suggest turning this feature off. Like most of the configuration settings in ES-Hadoop, the default values are meant to make the getting-started experience with the software as easy as possible, while still allowing for production-grade tuning.

f9865 · August 15, 2019, 3:56am

after the task is done writing all bulk operations to Elasticsearch
once the job is completed

Does this mean when the whole Hadoop task finishes, or when a single batch finishes? Let's say I have a task that will write 1bn documents to the index in total, and the batch configuration is

es.batch.size.bytes = 5mb
es.batch.size.entries = 6000

So does ES-Hadoop refresh only after all 1bn documents have been written to the index, or after each single batch finishes?
Thanks

james.baiera · August 15, 2019, 1:54pm

It will refresh only after the Hadoop task finishes.

It's important to note, though, that it does this at the task level. If you have 10 tasks, each writing out 1bn documents, it's possible that 1 task will finish first and do a refresh while the other 9 tasks are still running. If each task was writing out 1000 documents, the refresh wouldn't be that big of a deal. You likely won't see any performance issues, and ES will eventually merge those segments. Since we're (in this case hypothetically) writing a large amount of data in each task, tasks that finish earlier can and will trigger an index-wide refresh when they finish, sooner than the other tasks might expect.
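
To put rough numbers on the task-level behavior described above, here is a minimal sketch. It assumes every task completes normally and uses the hypothetical figures from this thread (10 tasks, 1bn documents each, es.batch.size.entries = 6000). The function name is made up for illustration. The bulk count is a lower bound, because a batch is also flushed when es.batch.size.bytes is reached first.

```python
def bulk_and_refresh_counts(docs_per_task, tasks, batch_entries):
    """Estimate bulk requests and refresh calls for an ES-Hadoop write job."""
    # Integer ceiling division: minimum number of bulk requests one task sends.
    # The byte limit (es.batch.size.bytes) can only make this number larger.
    bulk_per_task = -(-docs_per_task // batch_entries)
    # With es.batch.write.refresh=true, each task refreshes once, when it finishes.
    refreshes = tasks
    return bulk_per_task, bulk_per_task * tasks, refreshes


per_task, total_bulk, refreshes = bulk_and_refresh_counts(1_000_000_000, 10, 6000)
print(per_task, total_bulk, refreshes)  # 166667 1666670 10
```

So the refresh count scales with the number of tasks, not with the number of bulk requests.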

In practice, if you're going to be writing a very large amount of data per task, it's probably best to disable the refresh option and do the refresh yourself after the job completes. There's no hook for ES-Hadoop to reliably use at the moment to do a single refresh after the job is completed. Only at the task level can we do that.
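
One way the "disable it and refresh yourself" approach could look is sketched below. This is only an illustration, not something ES-Hadoop ships: the helper names, the host URL, and the index name are placeholders. It assumes an unauthenticated cluster reachable over HTTP and uses the Elasticsearch refresh API (POST /<index>/_refresh).

```python
import urllib.request

# Setting to pass to the ES-Hadoop job so that tasks skip their own refresh.
job_conf = {"es.batch.write.refresh": "false"}


def build_refresh_request(base_url, index):
    """Build a POST request for the refresh API of a single index."""
    url = f"{base_url.rstrip('/')}/{index}/_refresh"
    return urllib.request.Request(url, method="POST")


def refresh_index(base_url, index, timeout=30):
    """Send one refresh after the whole job is done; returns the HTTP status."""
    req = build_refresh_request(base_url, index)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


# Usage, once the driver has confirmed that every task has finished
# (placeholder host and index name):
# refresh_index("http://localhost:9200", "my-index")
```

The call is made once from the driver side after the job completes, so the index sees a single refresh regardless of how many tasks wrote to it.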

Hope that helps!

f9865 · August 16, 2019, 3:45am

It does help. Thanks a lot.

system · September 13, 2019, 3:45am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
