ES rejecting bulk messages when writing to indices with 40 shards

Hi all.

tl;dr: We had a weird issue in our ES cluster (v6.5.1) this week: bulk messages were being rejected when ES was writing to an index with 40 shards, and the problem was solved by creating a new index with only 5 shards.


Let me explain better...

We have an ES cluster with 7 nodes:

  • 3 master nodes with 6GB RAM, 3.5GB heap and 1 CPU each
  • 4 data nodes with 100GB RAM, 31GB heap, 16 CPUs and 3TB disks each

The cluster has a total of 212 indices, 952 shards and > 1,500,000,000 docs

The cluster is used to store logs from our applications and we create an index per log level per day.
The biggest index (INFO logs) is ~900GB in size and the others are rather small (less than 20GB).

According to https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster, shards should be 20-40GB in size.
Following this recommendation, we increased the number of shards for the INFO index from 20 to 40 (so that their size would decrease from ~50GB to ~25GB).

The problem was that, once this change took effect, ES started to reject bulk messages. A lot of them.

We then changed the number of shards to 5 (just because it's the default value) and the bulk rejections stopped. We made the change by creating a new index for the same day (we just added another word to the index name in Logstash to force a new index to be created).
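For reference, the new index simply picked up the default settings; created explicitly it would look something like this (the index name here is only an example, not the exact one we use):

    PUT logs-info-v2-2019.02.20
    {
      "settings": {
        "index.number_of_shards": 5
      }
    }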

We read the blog post about bulk rejections, but we still don't understand why we don't have enough threads to process all the bulk requests. Shouldn't data nodes with 16 CPUs be able to handle a 40-shard index?

Can someone help?
Why is writing into a 40-shard index a problem, while writing into a new 5-shard index is not? We didn't delete any indices, we just created a new one with fewer shards...

Let me know if you need more information. Thank you!

This is the exception that we see in Logstash, in case it helps somehow:

[2019-02-20T08:42:54,496][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [91870173][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[logs-info-v6.5-2019.02.20][5]] containing [5] requests, target allocation id: BlL0WHfBS5OnXFJ8A2wDPA, primary term: 1 on EsThreadPoolExecutor[name = elasticsearch-data/4/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@49a4bbf5[Running, pool size = 16, active threads = 16, queued tasks = 202, completed tasks = 44409551]]"})
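(For anyone else hitting this: a generic way to watch the write queue filling up and the rejections growing is the cat thread pool API; nothing in this call is specific to our setup.)

    GET _cat/thread_pool/write?v&h=node_name,name,active,queue,queue_size,rejected,completed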

We ran into this issue before, and it appears that the root cause is stress on the cluster due to an excessive number of shards combined with a high number of indexing/update requests. Your cluster has only 4 data nodes for 950+ shards, which seems like a lot. You should consider increasing the number of data nodes or reducing the number of shards.

How many of these indices and shards are you concurrently writing to? How many concurrent indexing processes/threads do you have accessing the cluster? What is the indexing latency? What type of disks do you have? Do you see any evidence in statistics or logs that merging takes a long time or cannot keep up?

But the blog post I linked above says:

TIP: The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.

So a 4-node cluster, each node with a ~30GB heap, should be fine with ~950 shards (4 × 600 = 2,400 by that rule of thumb), no?

Also: why did the problem get fixed once we stopped writing to the 40-shard index and started writing to the 5-shard one? (The 40-shard index was still there and still being read.)

That number is generally an upper bound, not a recommended level. I have, however, not said that the cluster cannot handle that number of shards. If you look at the blog post about bulk rejections you linked to, it is the number of shards written to and the number of concurrent writers that fill up the queue and cause rejections. You can have a very large number of threads available for processing the data in the queues, but that will not help much if your cluster's throughput is limited by slow storage and disk I/O.
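If you want to check that from the Elasticsearch side, the node stats expose filesystem usage and (on Linux) basic I/O counters; this is a generic call, nothing specific to your cluster:

    GET _nodes/stats/fs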


It might help to note that every bulk indexing request generates a task for every shard that it hits, and it is these tasks that are being enqueued. Thus if you double the number of shards in the index you effectively double the number of tasks that each indexing request generates.
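To make that concrete with some illustrative numbers (not measurements from your cluster):

    1 bulk request into a 40-shard index   -> up to 40 shard-level write tasks
    20 concurrent bulk requests            -> up to 800 tasks, spread over 4 data nodes
                                           -> roughly 200 tasks per node, i.e. the queue capacity in your log

    the same 20 bulks into a 5-shard index -> at most 100 tasks, roughly 25 per node

So a fairly modest number of concurrent bulks is enough to overflow the per-node write queue once the index has 40 shards.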

The usual approach to dealing with logs is to split them into smaller time-based indices (e.g. weekly or monthly), often each containing a single shard. If you do this then each bulk indexing request doesn't explode into hundreds of tasks and overwhelm the indexing queues. This has other advantages too:

  • common searches become more efficient, because most searches over logs filter to a time range, and Elasticsearch is very good at not searching shards that cannot match because of a time range constraint

  • it is easy to remove data once it has expired, simply by deleting the whole index. It is much more expensive to run a delete-by-query to delete only some of the documents in an index (compare the two calls below).
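For example, expiring a whole day of logs is a single cheap call, whereas trimming documents out of a larger index needs a delete-by-query that has to find and delete every matching document (the index names and the @timestamp field here are illustrative):

    DELETE logs-info-2019.01.20

    POST logs-info-2019.01/_delete_by_query
    {
      "query": {
        "range": { "@timestamp": { "lt": "2019-01-21" } }
      }
    }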


When we were having the issue, we were writing to 5 indices: 4 of them had 1 shard each and the other one had 40 shards.

How many concurrent indexing processes/threads do you have accessing the cluster?

We have nodes with 16 CPUs each, not sure if that answers the question...

What is the indexing latency?

It was around 1.100 ms, give or take.

What type of disks do you have?

We have 3TB SSD disks.

Do you see any evidence in statistics or logs that merging takes long or can not keep up?

No, no evidence of that...

We already do this: we have daily indices. The ~900GB of data is the total amount of logs that we have per day (it's a lot, we know, but that's a different issue :sweat_smile: )

So the problem with writing into too many shards at the same time is that the cluster needs more threads (and queue slots) to process all the resulting tasks, right?

Would it be better to have smaller indices (say, 1 per hour or something like that) with fewer shards?
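(For context, cutting hourly indices would just be a change to the index pattern in our Logstash elasticsearch output, roughly like this; the names here are illustrative, not our actual config:)

    elasticsearch {
      hosts => ["es-data:9200"]
      index => "logs-info-v6.5-%{+yyyy.MM.dd.HH}"
    }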

From what we know, having shards that are too big is a problem.
Having too many shards is also a problem, because of what you just said.

Is the best approach to have more indices, with fewer shards each? We would only write to one index at a time...

Woah ok :astonished: Yes, sub-daily time-based indices would be an option. As would using rollover to limit the size of each index.

Did you actually come across a problem with over-large shards or are you just following the advice in that blog post? It's good advice for most setups but most setups aren't ingesting 1TB/day, so maybe it needs adjusting. Note that it also recommends benchmarking:

TIP: The best way to determine the maximum shard size from a query performance perspective is to benchmark using realistic data and queries. Always benchmark with a query and indexing load representative of what the node would need to handle in production, as optimizing for a single query might give misleading results.

I would recommend using the rollover API instead of having indices that cover a fixed time period. This allows you to select a suitable number of primary shards and roll over when the index reaches a specific size and/or age. You can therefore cut indices at different intervals to account for varying data volumes. You might, for example, create a new index every 30 hours during the work week and only once every 50 hours over the weekend, when the volume is perhaps lower.
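A minimal sketch of what that could look like (the index/alias names and the thresholds are just examples):

    PUT logs-info-000001
    {
      "aliases": { "logs-info-write": {} },
      "settings": { "index.number_of_shards": 5 }
    }

    POST logs-info-write/_rollover
    {
      "conditions": {
        "max_age": "1d",
        "max_size": "200gb"
      }
    }

Logstash then keeps writing to the logs-info-write alias, and something (a cron job or Curator, for instance) calls _rollover periodically; when a condition is met a new index is created and the alias moves to it.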

Until a few weeks ago, we had a very old cluster (ES 1.6) ingesting the same ~900GB of data per day. We had 10 nodes with daily indices, 1 shard each.

When we changed the number of shards from 1 to 20, it seemed to help cluster health: we started having fewer outages and less downtime. However, we are not sure whether that was a direct consequence of increasing the number of shards. It might have been just a coincidence...

Thanks!

We will definitely try that and see how it impacts the performance of the cluster.

Thank you all for the explanations, they were indeed very helpful!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.