Java Client Bulk API performance settings ES 5.x

Eduard_Kubanda · September 1, 2017, 10:21am

Hi there,
I am writing a Java application to perform bulk updates to Elasticsearch. I need to reach the best possible indexing performance. I understand the complexity of cluster and hardware settings, but in this thread I want to clarify some Java client settings that should help the indexing rate.

I am currently using the Java client, but I am considering rewriting the code to use the Java BulkProcessor. Does the BulkProcessor provide any optimization, or is it only a kind of automatic task-processing interface?

My inspiration comes from documentation articles and this article series:
https://qbox.io/blog/maximize-guide-elasticsearch-indexing-performance-part-1

My workflow:

Update index settings before bulk.
Get data, create a thread pool, and assign a number of index/update requests to each thread. Send the bulk requests. Repeat until there is no more data.
Update index settings after bulk.
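
The middle step of this workflow (split the data into chunks, hand each chunk to a worker thread, send one bulk request per chunk) can be sketched roughly as follows. This is a stdlib-only illustration under assumptions: `ParallelBulk`, `chunk`, `sendAll` and the `sendBulk` callback are hypothetical names, and the callback stands in for whatever actually builds and executes the bulk request against Elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of the Elasticsearch client.
class ParallelBulk {

    // Split the documents into consecutive chunks of at most chunkSize items.
    static <T> List<List<T>> chunk(List<T> docs, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, docs.size());
            chunks.add(new ArrayList<>(docs.subList(i, end)));
        }
        return chunks;
    }

    // Pass every chunk to sendBulk on a fixed-size pool and wait for all of them.
    static <T> void sendAll(List<T> docs, int chunkSize, int threads,
                            Consumer<List<T>> sendBulk) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<T> c : chunk(docs, chunkSize)) {
                futures.add(pool.submit(() -> sendBulk.accept(c)));
            }
            for (Future<?> f : futures) {
                f.get(); // surfaces a failure from the worker thread
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for bulk requests", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a bulk request failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The chunk size and thread count are the two knobs to experiment with; the sketch does not suggest values for either.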

Update index settings before bulk:
I temporarily disable the refresh interval and set the number of replicas to 0.

// Disable periodic refresh and drop replicas for the duration of the bulk load.
UpdateSettingsResponse updateResponse = client.admin().indices().prepareUpdateSettings("test")
        .setSettings(Settings.builder()
                .put("index.refresh_interval", -1)
                .put("index.number_of_replicas", 0))
        .get();

Update index settings after bulk:
Restore the default settings for refresh interval and replicas.

// Restore the 1s refresh interval and one replica once the bulk load is done.
UpdateSettingsResponse updateResponse = client.admin().indices().prepareUpdateSettings("test")
        .setSettings(Settings.builder()
                .put("index.refresh_interval", "1s")
                .put("index.number_of_replicas", 1))
        .get();

Do Merge, Flush (?), Refresh (?) (this is the part I am most confused about)
My code:

ForceMergeResponse mergeResponse = client.admin().indices().prepareForceMerge("test").setMaxNumSegments(1).get();

FlushResponse flushResponse = client.admin().indices().prepareFlush("test").get();

Flush performs a Lucene commit and empties the transaction log.

RefreshResponse refreshResponse = indicesAdminClient.prepareRefresh(elasticSearchIndexName).get();

Refresh makes documents searchable.

The index will be used for bulk index/update operations only (search requests will be allowed after the bulk operations complete successfully).
Am I doing the post-bulk operations correctly?
Thank you.

thiago · September 2, 2017, 1:11pm

Changing those settings before the bulk is a good idea only if it's an exceptionally long bulk operation. If this is a routine bulk operation that runs on a regular basis, then it's better to just keep the refresh interval set. Also, remember that while refresh doesn't happen, newly indexed data won't be available for searching.

Do not force merge an index that you will still index into; otherwise it will disrupt the index's automatic merging algorithm. You should only do that for an index that you are not going to index anymore after this operation.

Do not call refresh manually in an ad hoc way. This creates many tiny segments, which makes automatic merging even more resource-intensive. You should just set a refresh interval and let the refresh happen automatically in the background.

Just keep it simple and do not do any of what you have mentioned here. Use the BulkProcessor, feed it with documents and let it do the rest of the job. Do not mess around with refresh and merging; bulk operations should be as simple as sending documents, and no extra work is needed.
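
To illustrate what "feed it with documents and let it do the rest" amounts to, here is a stdlib-only toy model of count-based batching, that is, flushing once a fixed number of actions has accumulated. `BatchingBuffer` and its `sender` callback are hypothetical names and this is not the Elasticsearch `BulkProcessor` API; the real class can also flush on request size and on a time interval, and it handles concurrent requests and backoff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of count-based batching; NOT the Elasticsearch BulkProcessor.
class BatchingBuffer<T> implements AutoCloseable {
    private final int maxActions;
    private final Consumer<List<T>> sender; // stands in for "execute one bulk request"
    private List<T> pending = new ArrayList<>();

    BatchingBuffer(int maxActions, Consumer<List<T>> sender) {
        if (maxActions < 1) {
            throw new IllegalArgumentException("maxActions must be >= 1");
        }
        this.maxActions = maxActions;
        this.sender = sender;
    }

    // Buffer one item and send a batch as soon as the threshold is reached.
    synchronized void add(T item) {
        pending.add(item);
        if (pending.size() >= maxActions) {
            flush();
        }
    }

    // Send whatever is buffered; does nothing when the buffer is empty.
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        sender.accept(batch);
    }

    // Closing sends the final, possibly partial, batch.
    @Override
    public synchronized void close() {
        flush();
    }
}
```

The caller only ever calls `add` and, at the end, `close`; when and how batches are sent is decided in one place.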

Eduard_Kubanda · September 6, 2017, 12:53pm

Hi,
thank you for the reply.

Changing those settings before the bulk is a good idea only if it’s an exceptionally long bulk operation. If this is a routine bulk operation that runs on a regular basis, then it’s better to just keep the refresh interval set. Also, remember that while refresh doesn’t happen, newly indexed data won’t be available for searching.

Do not force merge an index that you will still index into; otherwise it will disrupt the index’s automatic merging algorithm. You should only do that for an index that you are not going to index anymore after this operation.

My indexing and publishing process should work like this: I do a large import of data into ES, and then I allow this data to be published and searchable.
I do not fully understand "an index that you are not going to index anymore" for my situation, and here comes my next question.
Consider the situation where I need to do a large import of data into ES, but on the following day(s) I need to import updated data.
Is there any advantage in doing an update/upsert of this data in the same index, or can I just create a new index with the updated data (and delete the old one)?
Is there an internal logic which influences search results depending on index/update (upsert) operation?

I think for my situation I will update just the refresh and replica settings before and after the bulk import (as I described in the post above). Can I consider that a safe operation?

Thank you.

Christian_Dahlqvist · September 6, 2017, 1:40pm

If you perform bulk indexing and updates and want to make sure the full set of changes is made available at once, you can use aliases to switch between different versions of an index.
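
A minimal sketch of such an alias switch, assuming the REST `_aliases` endpoint is used: the body built below removes the alias from the old index and adds it to the new one in a single request, so searches against the alias move over in one step. `AliasSwap`, `swapBody` and the index and alias names in the usage note are hypothetical.

```java
// Hypothetical helper that builds the JSON body for POST /_aliases.
class AliasSwap {

    // No JSON escaping is done here: valid index and alias names
    // cannot contain double quotes or backslashes.
    static String swapBody(String alias, String oldIndex, String newIndex) {
        return String.format(
                "{\"actions\":["
                        + "{\"remove\":{\"index\":\"%s\",\"alias\":\"%s\"}},"
                        + "{\"add\":{\"index\":\"%s\",\"alias\":\"%s\"}}"
                        + "]}",
                oldIndex, alias, newIndex, alias);
    }
}
```

For example, `AliasSwap.swapBody("products", "products_v1", "products_v2")` yields a body that moves the `products` alias from `products_v1` to `products_v2`; the application keeps searching `products` and never needs to know which concrete index is behind it.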

thiago · September 7, 2017, 2:09am

Regarding my statement "an index that you are not going to index anymore": I meant precisely this, an index that is practically read-only, meaning that it won't receive any new data, nor any updates or deletes. Such an index can be force merged to 1 segment safely.

Regarding the process you have described, consider creating a new index every day (you may use aliases to assist you with that, just as Christian mentioned). If, by the end of the indexing operation, this index with new data can be safely considered read-only, then you can force merge it to 1 segment, as explained above.

I don't follow that question, sorry.

Changing refresh is safe as long as you remember to either (1) revert to an actual refresh interval or (2) call refresh manually. If refresh is somehow never called, some data won't be searchable.
Changing replicas can be considered safe if you are just mirroring data from another source and, in the event of a node crash while the index has 0 replicas, you can simply repeat the operation from the original source.
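
One way to make sure the settings are always reverted, even if the import fails halfway, is a try/finally wrapper. This is only a sketch under assumptions: `BulkLoadSettings` and its three `Runnable` parameters are hypothetical placeholders for the update-settings calls and the import itself.

```java
// Hypothetical wrapper; the Runnables stand in for the real client calls.
class BulkLoadSettings {

    // relax:    e.g. set refresh_interval to -1 and number_of_replicas to 0
    // bulkLoad: the import itself
    // restore:  e.g. set refresh_interval and number_of_replicas back
    static void runWithTemporarySettings(Runnable relax, Runnable bulkLoad, Runnable restore) {
        relax.run();
        try {
            bulkLoad.run();
        } finally {
            // Runs whether the import succeeded or threw.
            restore.run();
        }
    }
}
```

If `bulkLoad` throws, the exception still propagates to the caller, but only after `restore` has run.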

Eduard_Kubanda · September 7, 2017, 9:38am

Thanks for the replies and explanations, guys. Really helpful.

Is there an internal logic which influences search results depending on index/update (upsert) operation?

Sorry, I phrased it too generally.
Consider that I have an index with data. The index is used for search operations for some time. Later I will create a new index with new and updated data, and this new index will be used for search operations (the old one will be backed up and no longer used).
There were search requests directed to the old index, so I guess ES creates metadata (or a cache) mapping search request -> search result. When I create a new index, no such metadata will be available. Will search relevance for requests directed to the new index be different in some way?
I understand that changing the content of full-text fields and/or adding new records will influence search results. My question was about the usefulness of update/upsert operations in the same index, but after this topic I guess I can simply create new indices when I need to.

Thank you.

system · October 5, 2017, 9:38am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
