We've had some success improving bulk insertion times by setting a higher
value for refresh_interval during bulk inserts.
However, the global nature of this setting seems to cause some problems.
We want some insertions processed with a higher value and others processed
immediately (under the default 1s). There is no way to do this safely in
a concurrent environment where end-user actions are triggering index
updates.
refresh_interval does not control insertions document by document. It works
at the index level: it tells a shard when to switch from its write buffer to
a readable (searchable) index, which can be a heavy operation when your
documents contain hundreds or thousands of words.
You can index into two different indexes, one with refresh disabled, one
with refresh enabled.
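
A minimal sketch of that two-index setup, assuming the standard index settings endpoint (PUT /<index>/_settings) and that a refresh_interval of "-1" disables periodic refresh. The index names bulk_docs and live_docs and both helper functions are hypothetical, made up for illustration. The code only builds the requests and does not send anything to a cluster.

```python
import json

# Hypothetical index names, chosen only for illustration.
BULK_INDEX = "bulk_docs"   # periodic refresh disabled, for bulk loads
LIVE_INDEX = "live_docs"   # default 1s refresh, for end-user updates


def refresh_settings_request(index, interval):
    """Build (method, path, body) for setting refresh_interval on one index."""
    body = json.dumps({"index": {"refresh_interval": interval}})
    return ("PUT", f"/{index}/_settings", body)


def pick_index(is_bulk):
    """Route a write to the index whose refresh behaviour matches it."""
    return BULK_INDEX if is_bulk else LIVE_INDEX


# Requests you would send once, with whatever HTTP client you already use.
setup_requests = [
    refresh_settings_request(BULK_INDEX, "-1"),  # "-1" turns periodic refresh off
    refresh_settings_request(LIVE_INDEX, "1s"),
]

for method, path, body in setup_requests:
    print(method, path, body)
```

Because the setting is per index, concurrent bulk jobs and end-user updates no longer compete over a single refresh_interval value; each write simply goes to the index picked by pick_index.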