I have done two benchmark runs using esrally, one with the default refresh interval and the second with "refresh_interval" set to "-1" (disabled). Both runs took the same amount of time (382 seconds).
I was expecting the second benchmark run to complete faster, as there was no refresh involved.
Here are the shard stats for the refresh interval (5sec)
Disabling refresh allows Elasticsearch to fill its entire indexing buffer during indexing. Refreshes are triggered when the buffer is full and it is time to write segments to disk.
There were no search requests, only bulk indexing.
I have added "indices.memory.index_buffer_size" (=50%, and also 100%) to the Elasticsearch config file, set the heap size to 8GB (32GB RAM machine), and repeated the esrally run with the refresh interval disabled. I didn't see any improvement in run times.
I am not sure what else should be done to speed up indexing.
If there are no search requests issued, using default settings, Elasticsearch will not periodically refresh, see the docs.
Therefore there shouldn't be any performance difference between indexing with an explicit setting of index.refresh_interval to -1 and indexing with the defaults.
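For reference, here is a minimal sketch of how the explicit setting could be applied through the update index settings API (PUT <index>/_settings). The host http://localhost:9200 and the index name my-index are placeholders, not values from this thread, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_refresh_settings_request(base_url, index, refresh_interval):
    """Build (but do not send) a PUT <index>/_settings request.

    refresh_interval is a string such as "-1" (disabled) or "5s";
    passing None serializes to JSON null, which resets the setting
    to its default.
    """
    body = {"index": {"refresh_interval": refresh_interval}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder host and index name (assumptions for illustration only).
req = build_refresh_settings_request("http://localhost:9200", "my-index", "-1")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen(req) would send it to a running cluster. When using Rally, the track parameters are usually the more convenient place to change this.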
EDIT
That said, if you are talking specifically about the eventdata track, the default setting for refresh_interval is 5s: GitHub - elastic/rally-eventdata-track: Rally track for simulating event-based data use-cases. If you aren't seeing a difference in indexing throughput, I'd suggest you take a look at your methodology, and especially at whether there are other bottlenecks (e.g. is the load driver perhaps saturated?).
The command you provided earlier indicates that the load driver is running on the same machine as Elasticsearch, which is an anti-pattern for benchmarking. I recommend watching the "7 deadly sins of benchmarking" video to avoid some common pitfalls.
Yes, I am aware that Elasticsearch and the benchmark client shouldn't be on the same instance. This is just a basic test to evaluate performance on a very small dataset. In the actual benchmark test, the Elasticsearch and esrally client nodes are on two different instances.
My understanding is that when refresh is disabled, there should be at least a 1% performance improvement. But that's not happening.
The refresh interval used to have a significant impact in older versions of Elasticsearch, but due to improvements in newer versions this is not necessarily the case any longer.
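To put the expected 1% in perspective, here is a small sketch of the arithmetic using the 382-second run time reported earlier in the thread. The function name is made up for illustration. Whether a difference of this size can be told apart from run-to-run variation depends on the setup, so repeating each configuration several times is probably worthwhile.

```python
def relative_change_percent(baseline_seconds, candidate_seconds):
    """Percentage by which the candidate run is faster than the baseline.

    Positive means the candidate finished sooner; 0.0 means no difference.
    """
    return (baseline_seconds - candidate_seconds) / baseline_seconds * 100.0


baseline = 382.0   # run with the default refresh interval (from the thread)
candidate = 382.0  # run with refresh_interval set to -1 (from the thread)

change = relative_change_percent(baseline, candidate)
one_percent_seconds = round(baseline * 0.01, 2)

print(f"observed change: {change:.2f}%")
print(f"a 1% improvement would be about {one_percent_seconds} seconds")
```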