Modifying Refresh Settings for BulkIngester Instances

Hi team,

I'm currently working with the BulkIngester in my Elasticsearch project and I have a question regarding the refresh setting.

After creating the BulkIngester instance using the following code snippet:

ingester = BulkIngester.of(
    b -> b.client(esClient)
          .maxOperations(elasticSearchConfig.maxOperations())
          .flushInterval(elasticSearchConfig.flushInterval(), TimeUnit.SECONDS)
          .globalSettings(builder -> builder.refresh(Refresh.True))
          .listener(new BulkIngesterListener()));

I've been trying to figure out how to change the refresh setting from Refresh.True to Refresh.False, but so far, I haven't found a proper way to do it.

Could you please advise if there's a way to modify the refresh setting after the BulkIngester has been created?

Alternatively, if modifying the refresh setting directly is not possible, I'm curious if it's feasible to have multiple BulkIngester instances with different refresh settings concurrently.

Any insights or suggestions would be greatly appreciated. Thank you!

I'm curious about the "why" here.
Normally, what I do on my side is refresh the index outside the bulkIngester, i.e. when I need to call search.

Why would you want to turn it on and off?
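For anyone reading along, here is a minimal sketch of what "refresh outside the ingester" could look like. It only builds the REST call to the refresh API (POST /<index>/_refresh) with the JDK HTTP client so that it stays dependency-free. The base URL, the index name my-index and the class name are assumptions for illustration, not something taken from this thread.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: build (but do not send) an explicit refresh request for one index.
// With the Elasticsearch Java API client the equivalent call should be
// esClient.indices().refresh(r -> r.index(index)).
class RefreshBeforeSearch {

    // baseUrl and index are hypothetical values supplied by the caller.
    static HttpRequest refreshRequest(String baseUrl, String index) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index + "/_refresh"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = refreshRequest("http://localhost:9200", "my-index");
        // Send this with java.net.http.HttpClient right before the search
        // that needs to see the most recently indexed documents.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The idea is that the ingester itself keeps the default refresh behaviour, and the refresh cost is only paid when a search actually needs fresh data.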

Hi @dadoonet,

Thank you for your question. The current approach I'm taking involves using the new Java library (v8.2). Initially, this approach seemed unfamiliar to me as well. However, after conducting some research on the matter, I've found that the recommended approach is to utilize refresh on search by setting refresh=true in your GET API calls.

However, it's important to note that this method doesn't guarantee the absolutely latest data in every scenario, correct? If this is the case, I'm curious about how to ensure that the data found in the index is indeed the latest available.

Looking forward to your insights on this matter.

It does guarantee that the latest data that has been indexed is now visible.

@dadoonet Ah, I couldn't find any information on whether setting refresh=true in my GET API calls ensures that the latest indexed data is immediately visible. Could you point me to where this is documented?

Have a look at "Tune for indexing speed" in the Elasticsearch Guide [8.12].

@dadoonet Thank you very much. However, my understanding is that having refresh=true in every GET API call triggers a refresh of indices with each call. Wouldn't this potentially impact performance negatively?

Refreshing frequently, either ahead of a query or with every bulk request, adds a lot of overhead and can have a huge negative impact on indexing throughput. Performing a refresh is an expensive operation, which is why Elasticsearch by default does not do it for every operation but rather at a configurable interval. If you look at the guidelines for tuning indexing throughput, you can see that one of the main recommendations is to disable refreshes or lengthen the refresh interval. Making refreshes more frequent than the default naturally has the opposite effect.
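As a rough illustration of the "lengthen the refresh interval" recommendation, the sketch below builds the update index settings call (PUT /<index>/_settings) with the JDK HTTP client. The base URL, the index name my-index, the 30s value and the class name are assumptions chosen for the example; pick an interval that matches how stale your searches are allowed to be.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: build (but do not send) a request that changes index.refresh_interval.
// With the Elasticsearch Java API client the roughly equivalent call should be
// esClient.indices().putSettings(p -> p.index(index)
//         .settings(s -> s.refreshInterval(t -> t.time(interval))));
// check it against the client version you are using.
class RefreshIntervalSetting {

    // JSON body for the update index settings API.
    static String settingsBody(String interval) {
        return "{\"index\":{\"refresh_interval\":\"" + interval + "\"}}";
    }

    // baseUrl, index and interval are hypothetical values supplied by the caller.
    static HttpRequest settingsRequest(String baseUrl, String index, String interval) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index + "/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(interval)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = settingsRequest("http://localhost:9200", "my-index", "30s");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(settingsBody("30s"));
    }
}
```

Setting the interval to "-1" disables periodic refreshes entirely, which is usually only sensible during a large initial load and should be reverted afterwards.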

@Christian_Dahlqvist So, if we use refresh=true in every GET API call, wouldn't that seriously impact performance?

@dadoonet if my understanding is correct, wouldn't it be more appropriate to refresh the index inside the bulkIngester rather than externally?

I'm feeling a bit confused about this.

How often do you query the data? How are you querying the data? If you are querying frequently, you will end up refreshing the index frequently, so it is not necessarily much better than refreshing after each bulk request, and it will affect indexing performance. If you only query occasionally, it is likely a lot better to refresh ahead of the query rather than for each bulk request, although it may still affect indexing performance.

Frequent refreshes, irrespective of how they are triggered, have a negative impact on indexing performance.

@Christian_Dahlqvist The data is queried frequently, which is why I decided to refresh the index during the bulk step. Creating, updating, or deleting documents in the index is not a common occurrence.
