We are importing data into an Elasticsearch cluster, into a few indices of around 10 GB each.
At the same time, we care about search performance on the existing indices; a few of them are small (~100 MB) and a few are big (~10 GB).
In order to optimize indexing, we:
use the bulk API with a tuned bulk size;
set the refresh interval to -1;
set the number of replicas to 0.
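For readers who want to see the two settings changes above spelled out, here is a minimal sketch that only builds the JSON bodies one would send to PUT /<index>/_settings. The helper name bulk_load_settings is hypothetical, and the values restored after the import ("1s" and 1) are assumed to be the Elasticsearch defaults rather than whatever the cluster used before.

```python
import json


def bulk_load_settings(loading: bool) -> dict:
    """Build a body for PUT /<index>/_settings (hypothetical helper)."""
    if loading:
        # Disable periodic refresh and drop replicas while importing.
        return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}
    # Assumed restore values: the Elasticsearch defaults, not necessarily
    # what the index had before the import.
    return {"index": {"refresh_interval": "1s", "number_of_replicas": 1}}


if __name__ == "__main__":
    print(json.dumps(bulk_load_settings(loading=True)))
    print(json.dumps(bulk_load_settings(loading=False)))
```

The first body would be applied before the import starts and the second once it finishes, so that the new indices become searchable and replicated again.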
Now we are trying to understand how merge throttling can help. How are search and segment merging related, if we only search against the existing indices?
According to this article, we can disable merge throttling.
Does that mean merges will "eat" all the disk I/O?
Does that mean merges won't happen at all and we have to run _forcemerge manually after indexing is done? Should we be worried about the max open file descriptors limit in that case?
According to this article and this pull request, we shouldn't touch the merge settings at all.
I'm very confused here; any help is highly appreciated.
@warkolm, sorry for bothering you, the documentation is really poor regarding this internal logic.
As far as I understand, importing data at a certain rate might cause the merge processes to "eat" all available disk I/O.
In order to keep some room for search queries, there is the setting indices.store.throttle.max_bytes_per_sec, which throttles indexing threads if the merge rate is higher than this number.
Using the setting indices.store.throttle.type we can disable or enable index throttling.
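As an illustration of the two settings just mentioned, here is a minimal sketch that builds a body for PUT /_cluster/settings. It assumes an old (pre-2.0) cluster where these settings still exist and, as far as I know, accept the types none, merge and all; the helper name store_throttle_body and the "100mb" value are hypothetical.

```python
import json

# Assumption: throttle types accepted by pre-2.0 Elasticsearch.
VALID_TYPES = ("none", "merge", "all")


def store_throttle_body(throttle_type, max_bytes_per_sec=None):
    """Build a transient cluster-settings body (hypothetical helper)."""
    if throttle_type not in VALID_TYPES:
        raise ValueError("unknown throttle type: %r" % (throttle_type,))
    settings = {"indices.store.throttle.type": throttle_type}
    if max_bytes_per_sec is not None:
        settings["indices.store.throttle.max_bytes_per_sec"] = max_bytes_per_sec
    return {"transient": settings}


if __name__ == "__main__":
    # Disable store throttling entirely.
    print(json.dumps(store_throttle_body("none")))
    # Keep merge throttling but raise the limit (illustrative value).
    print(json.dumps(store_throttle_body("merge", "100mb")))
```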
It looks like merge throttling actually means index throttling.
See the PR here and the qbox article here.
I thought that if merges don't happen, a huge index might exceed the OS limit on open file descriptors.
Since ES 2.x, IO throttling is handled automatically by Lucene: it starts with a 20 MB/sec throttle on bytes written to the merged segment, then increases that rate when merges fall behind and decreases it otherwise. This means the merges, over time, only soak up as much IO bandwidth as is needed to keep up with your rate of indexing.
You don't need to forceMerge yourself: the merges will happen naturally as you are indexing.
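To make the adaptive throttling described above more concrete, here is a toy sketch of the feedback loop. Only the 20 MB/sec starting point comes from the explanation above; the floor, ceiling and step factors are illustrative assumptions, and this is not Lucene's actual code.

```python
START_MB_PER_SEC = 20.0   # starting throttle mentioned above
MIN_MB_PER_SEC = 5.0      # assumed floor, for illustration only
MAX_MB_PER_SEC = 10240.0  # assumed ceiling, for illustration only


def next_rate(current, merges_behind, up=1.2, down=0.9):
    """Return the next throttle rate in MB/sec.

    The rate grows when merges are falling behind and shrinks otherwise.
    The step factors `up` and `down` are assumptions for this sketch.
    """
    factor = up if merges_behind else down
    return max(MIN_MB_PER_SEC, min(MAX_MB_PER_SEC, current * factor))


if __name__ == "__main__":
    rate = START_MB_PER_SEC
    # Hypothetical sequence: merges fall behind twice, then catch up.
    for behind in [True, True, False, False, False]:
        rate = next_rate(rate, behind)
        print("merges behind=%s -> throttle %.1f MB/sec" % (behind, rate))
```

The point of the loop is that the throttle settles near whatever rate keeps merges level with indexing, which is why no manual tuning of the limit is needed.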