Elasticsearch indexing performance: throttle merging

We are importing data into an Elasticsearch cluster across a few indices, around ~10 GB each.
At the same time, we care about search on the existing indices; some of them are small (~100 MB) and some are big (~10 GB).

In order to optimize indexing, we do the following (a rough sketch in code follows the list):

  • use the bulk API with an optimized bulk size;
  • set refresh_interval to -1;
  • set number_of_replicas to 0.
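
For context, here is roughly how we apply those three steps with the Python client (the host, index name, bulk size and document source below are placeholders, not our real values):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])   # placeholder host
index = "daily-import-2016.01.01"               # placeholder index name

# Steps 2 and 3: disable refresh and replication for the duration of the import.
es.indices.put_settings(
    index=index,
    body={"index": {"refresh_interval": "-1", "number_of_replicas": 0}},
)

# Step 1: bulk-index with an empirically tuned chunk size (5000 is only an example).
docs = ({"_index": index, "_type": "doc", "_source": {"n": i}} for i in range(100000))
helpers.bulk(es, docs, chunk_size=5000)

# Restore normal settings once the import is done.
es.indices.put_settings(
    index=index,
    body={"index": {"refresh_interval": "1s", "number_of_replicas": 1}},
)
```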

Now we are trying to understand how merge throttling can help. How are search and segment merging related, if we only search against existing indices?

According to this article, we can disable merge throttling.

  • Does that mean merges will "eat" disk I/O?
  • Does that mean merges won't happen at all and we have to run _forcemerge manually after indexing is done? Should we be worried about the max open file descriptors limit in that case?

According to this article and pull request, we shouldn't touch the merge settings at all.

Very confused here, any help is highly appreciated.

Don't worry about it, let ES handle the merging automatically :slight_smile:

Your initial 3 steps are all you need to do!

@warkolm I would be grateful if you could add more details and answer the 2 questions above. I want to understand how it works and what actually happens.

Which two questions?

These two, Mark.

The best answer to those is: don't disable merging, as I mentioned.

Otherwise: yes, merges use IO, and if you disable them they won't happen and a force merge is required.
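
If you did go down that path, the manual step is just a force merge call after the import; a minimal sketch with the Python client (the host and index name are placeholders, and on clusters/clients older than 2.1 the same API is called optimize):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# Merge the index down to a single segment once the bulk import has finished.
es.indices.forcemerge(index="daily-import-2016.01.01", max_num_segments=1)
```

But again, leaving merging enabled is the better option.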

@warkolm, sorry for pestering you; the documentation is really poor regarding this internal logic.
As far as I understand, importing data at a certain rate might cause the merge processes to 'eat' all available disk I/O.
In order to keep some room for search queries, there is a setting, indices.store.throttle.max_bytes_per_sec, that throttles indexing threads if the merge rate is higher than this number.

Using the configuration option indices.store.throttle.type, we can enable/disable this throttling.
It looks like merge throttling actually means index throttling.
See the PR here and the Qbox article here.
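
Just to check I'm reading that correctly: on 1.x those are dynamic cluster settings, so presumably they would be applied roughly like this (the host and the 50mb rate are only examples, and both settings were removed in 2.0):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# 1.x-only store throttle settings: cap merge IO at 50 MB/sec, or set the
# type to "none" to disable store throttling entirely ("merge" is the default).
es.cluster.put_settings(body={
    "transient": {
        "indices.store.throttle.max_bytes_per_sec": "50mb",
        "indices.store.throttle.type": "merge",
    }
})
```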

I thought that if merges don't happen, a huge index might exceed the max open file descriptors limit in the OS.
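
In the meantime I assume the descriptor usage can be watched from the nodes stats API; something like this is what I had in mind (host is a placeholder, field names as I understand them from the docs):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# Compare open vs. maximum file descriptors on every node while the import runs.
stats = es.nodes.stats(metric="process")
for node_id, node in stats["nodes"].items():
    proc = node["process"]
    print(node_id, proc["open_file_descriptors"], "/", proc.get("max_file_descriptors"))
```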

Are you indexing into these indices continuously or doing bulk inserts/updates periodically?

Periodically, on a daily basis, with new indices every day.

Since ES 2.x, the IO throttling is handled automatically by Lucene, meaning it starts with a 20 MB/sec throttle on writing bytes to the merged segment. It then increases that rate when merges fall behind, and decreases it otherwise. This means that, over time, merges only soak up as much IO bandwidth as is needed to keep up with your rate of indexing.

You don't need to forceMerge yourself: the merges will happen naturally as you are indexing.
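
If you want to see it in action, the per-index merge stats expose the auto-throttle rate Lucene is currently using; roughly (the host and index name are placeholders, field names as of 2.x):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
index = "daily-import-2016.01.01"              # placeholder index name

# total_auto_throttle_in_bytes is the merge write rate Lucene has settled on;
# total_throttled_time_in_millis shows how long merges have actually been throttled.
merges = es.indices.stats(index=index, metric="merge")["indices"][index]["primaries"]["merges"]
print(merges["total_auto_throttle_in_bytes"], merges["total_throttled_time_in_millis"])
```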

Mike McCandless
