Index throttling issue

Hi,

We are facing index throttling issues using ES 1.7.2 after around 40-50 hours of running.
Even though our settings specify index.merge.policy.max_merge_at_once: 4, maxNumMerges is shown as 6 in the logs.

My questions:

  • How is maxNumMerges determined?
  • Is there anything wrong with the configurations below?
  • How do we handle more merge requests?

Would appreciate any hints.
Thanks.

Logs show:

[2015-10-01 00:21:32,036][INFO ][index.engine             ] [metrics-datastore-4] [index1575884345][0] stop throttling indexing: numMergesInFlight=5, maxNumMerges=6
[2015-10-01 00:21:32,051][INFO ][index.engine             ] [metrics-datastore-4] [index1575884345][0] now throttling indexing: numMergesInFlight=7, maxNumMerges=6
[2015-10-01 00:21:32,128][INFO ][index.engine             ] [metrics-datastore-4] [index1575884345][0] stop throttling indexing: numMergesInFlight=5, maxNumMerges=6

Setup:

  • 3 master nodes (c3.large, 1 core, 1g heap, 3.5g ram, 2*16G of SSD drives)
  • 6 data nodes (m3.xlarge, 4 cores, 7.5g heap, 15g ram, 2*40G of SSD drives - data is stored on both disks)

Load

  • 5000 tps across 3 indices, each with 3 shards and 1 async replica

Relevant configurations:

index.merge.policy.max_merge_at_once: 4
index.merge.policy.max_merge_at_once_explicit: 4
index.merge.policy.max_merged_segment: 5gb
index.merge.policy.segments_per_tier: 4
index.merge.policy.type: tiered
index.merge.scheduler.max_thread_count: 4
index.merge.scheduler.type: concurrent
index.refresh_interval: 20s
index.translog.flush_threshold_ops: 50000
index.translog.interval: 20s
indices.store.throttle.type: none

Just to add to the details, the same settings were working fine earlier. Recently we enabled doc_values on all properties and observed a 1.5-2 times increase in disk utilization. Could that have anything to do with increased merge activity?
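
For reference, the change was roughly of this shape - only a sketch, with the template, type, and field names made up for illustration rather than copied from our real mapping:

# hypothetical template enabling doc_values per field (ES 1.x mapping syntax)
curl -XPUT 'localhost:9200/_template/metrics_docvalues' -d '{
  "template": "index*",
  "mappings": {
    "metric": {
      "properties": {
        "timestamp": { "type": "date",   "doc_values": true },
        "host":      { "type": "string", "index": "not_analyzed", "doc_values": true },
        "value":     { "type": "double", "doc_values": true }
      }
    }
  }
}'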

Your merges are falling too far behind, and so ES throttles incoming indexing to one thread to let them catch up.

The "maxNumMerges=6" in the logged INFO comes from 2 + index.merge.scheduler.max_thread_count. It's the total allowed merge backlog before index throttling will kick in.

I think you should first try removing all settings, so ES defaults apply, except for "indices.store.throttle.type: none" (so that store IO throttling is disabled). Then see if you still hit index throttling ...
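
Concretely, that means all of the index.merge.* and index.translog.* lines above go away, and the only related setting left in elasticsearch.yml would be:

indices.store.throttle.type: none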

And don't call optimize, unless the index will not be updated again (e.g. time-based indices).

Mike McCandless

Thanks for the clarification @mikemccand. I'll try it out as suggested and get back on this thread. We never call optimize, but we have a TTL of 24 hours for each document.
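
For context, the TTL is declared through the mapping, roughly along these lines (the template and type names here are placeholders, not our real ones):

curl -XPUT 'localhost:9200/_template/metrics_ttl' -d '{
  "template": "index*",
  "mappings": {
    "metric": {
      "_ttl": { "enabled": true, "default": "24h" }
    }
  }
}'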

Hi @mikemccand,

After removing these configurations, we did not notice any "now throttling indexing" messages in the logs. Thanks for the input.

But we did end up with "No space left on disk" errors after ~36 hours, which I'm thinking could be related to enabling doc_values. With doc_values enabled, we observed a huge increase in the amount of disk used (nearly twice). Our documents have a TTL of 24 hours, so after 24 hours space should have been continuously reclaimed. Do you see any reason why the space is not getting reclaimed, or does merging perhaps require additional space to complete?

After around 24 hours, there was around 10G+10G (2 drives) free on each node. But how would we end up with 100% disk utilization on 1-2 nodes if data was continuously getting removed over the next 12 hours of merging activity?

I'm glad your index throttling is fixed.

Doc values inherently consume disk space ... this is the tradeoff vs field data (which consumes java heap).

But, do you have very sparse fields? Or, many different types where each type has different fields? The storage format for doc values is not sparse, so this can consume more disk space than you expect ...

I must mention that doc_values have had a tremendous positive impact on our query response times while also keeping the pressure off the heap.

During the tests that we ran, it's almost like the same record is getting indexed repeatedly - only a few fields change, like the timestamp. All fields are set, and there are around 2-3 nested documents.

Is compression enabled by default in 1.7.2?

The "maxNumMerges=6" in the logged INFO comes from 2 + index.merge.scheduler.max_thread_count. It's the total allowed merge backlog before index throttling will kick in.

So if I increase index.merge.scheduler.max_thread_count, I postpone index throttling?

And don't call optimize, unless the index will not be updated again (e.g. time-based indices).

Do you mean that optimize shouldn't be called on an index that is going to be written to later? That only indices which will not be written to again should be optimized?

Yes.

Optimize, especially if you ask it to merge down to a single segment, is going to create huge segments which cause "interesting" tradeoffs later on if you keep writing to the index - especially if you delete. Your best bet is never to call optimize unless you are done writing.

Basically, updates and deletes eventually have to rewrite chunks of your index to reclaim space. Optimize makes those chunks much larger.
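
So the one case where calling it makes sense is a rolled-over, time-based index that will never be written to again - roughly something like this (the index name is just an example):

# force-merge a finished daily index down to a single segment (ES 1.x _optimize API)
curl -XPOST 'localhost:9200/logs-2015-09-30/_optimize?max_num_segments=1'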

Just chiming in to say that if you can avoid TTL, you'll greatly reduce your merge pressure.

TTL works by (essentially) running a query every 60s and finding all docs that have expired, then executing individual deletes against those documents. These deleted docs linger in your segments until Lucene's merge scheduler decides to merge them out.

Basically, TTL pokes a lot of little holes in all of your segments, which causes the merge scheduler to constantly be cleaning up all the half-filled segments. Which ultimately means you are moving a lot of data around the disk all the time.
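
For what it's worth, that purge cycle is driven by node-level settings; if I remember right, the relevant ones and their defaults look like this:

indices.ttl.interval: 60s        # how often the TTL purge thread searches for expired docs
indices.ttl.bulk_size: 10000     # how many deletes it sends per bulk request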

If, instead, you can structure your indices using a time-based approach (e.g. index-per-day), you can simply delete the entire index. This is equivalent to deleting a directory off the disk, and doesn't require any expensive merging.

Usually the time-based index doesn't provide a fine enough granularity for your application, so you'll likely want to include an expire_time field in the document and a corresponding range filter in your query, to make sure docs are no longer served after the 24hr period (but before the index is deleted).
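
Roughly, that pattern looks like this (the index name and the expire_time field are just examples):

# only return docs that have not yet expired
curl -XGET 'localhost:9200/index-2015-10-01/_search' -d '{
  "query": {
    "filtered": {
      "query":  { "match_all": {} },
      "filter": { "range": { "expire_time": { "gte": "now" } } }
    }
  }
}'

# once everything in the index is past its 24h window, drop the whole index in one shot
curl -XDELETE 'localhost:9200/index-2015-10-01'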

Just checking - TTL is not the most efficient way of doing mass deletion, so I wanted to be sure you actually needed it rather than the usual index-per-timeframe type approaches: Time-Based Data | Elasticsearch: The Definitive Guide [2.x] | Elastic
Some heavy logging users adopt an index-per-hour approach.

Now that I think about it, would it be worth replacing TTL with a feature like this? It sounds complex, but it would be much more efficient.

IIRC, the one legitimate use-case for TTL is something like an auction where the expiration is dynamic. Some auctions like to extend the time after a bid for example (to prevent sniping), so a strict field expiration + delete-index approach wouldn't work because, theoretically, an auction could extend indefinitely if people keep bidding.

There are probably a few other rare edge-cases, but that's the one that comes to mind.

That said, it might be nice if we could perhaps provide an "efficientTTL" which helps manage the non-TTL backed approach. Not sure how it'd look on the query end...a new "expired" filter? A special expired date expression, so you could say "gte" : "expired" which is just shorthand for now - <defined retention period>?

I think Clinton used TTL "legitimately" for managing web sessions one time too.

You'd index documents based on their expiry date and store it with the document - so that you could smash the whole index after a while and know "everything in there was expired anyway". And you'd use an alias/alias-ish thing to add a simple date filter, I think.
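
A filtered alias would be one way to wire up that date filter - something like this, if I'm remembering the syntax right (the index, alias, and field names are made up):

curl -XPOST 'localhost:9200/_aliases' -d '{
  "actions": [
    { "add": {
        "index": "sessions-2015-10-01",
        "alias": "live_sessions",
        "filter": { "range": { "expire_time": { "gte": "now" } } }
    } }
  ]
}'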

I think if you know when the document will expire up front that is really the way to go for TTL like stuff. But you can build that on the client side.

That doesn't solve the problem where you don't know the expiration up front, because changing it would require removing the document from one index and dropping it into another, which is problematic because refresh times don't line up. I bet someone sufficiently motivated could make TTL more efficient in the single-index case by being sneaky with the merge scheduler - never merging segments whose TTLs differ by more than an hour, and letting segments get very delete-ful without merging if it knows the whole thing will be past its TTL soon. It's fun to think about, but it'd be a bunch of work.