tl;dr: We had a weird issue in our ES cluster (v6.5.1) this week: bulk requests were being rejected while ES was writing to an index with 40 shards, and the problem was solved by creating a new index with only 5 shards.
Let me explain in more detail...
We have an ES cluster with 7 nodes:
- 3 master nodes with 6GB RAM, 3.5GB heap, and 1 CPU
- 4 data nodes with 100GB RAM, 31GB heap, 16 CPUs, and 3TB disks
The cluster has a total of 212 indices, 952 shards, and more than 1,500,000,000 docs.
The cluster stores logs from our applications, and we create one index per log level per day.
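To make the naming concrete, it looks roughly like this (a minimal Python sketch; the `logs-` prefix and the date format are hypothetical, our actual Logstash output differs):

```python
from datetime import date

# Hypothetical naming scheme: one index per log level per day,
# e.g. "logs-info-2018.12.03". The real prefix/format may differ.
def index_name(level: str, day: date) -> str:
    return f"logs-{level.lower()}-{day:%Y.%m.%d}"

print(index_name("INFO", date(2018, 12, 3)))  # logs-info-2018.12.03
```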
The biggest index (INFO logs) is ~900GB, and the others are rather small (less than 20GB).
According to https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster, shards should be 20-40GB in size.
Following this recommendation, we increased the number of shards for the INFO index from 20 to 40 (so that each shard would shrink from ~50GB to ~25GB).
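For context on how such a change is applied: the shard count can only be set at index creation time, so for daily indices it goes through an index template. A minimal sketch with Python's requests; the host, template name, and index pattern are placeholders, not our exact setup:

```python
import requests

ES = "http://localhost:9200"  # placeholder; point at any node

# Shard count is fixed at index creation, so new daily indices pick
# it up from an index template (the 6.x _template API).
template = {
    "index_patterns": ["logs-info-*"],  # hypothetical pattern
    "settings": {
        "number_of_shards": 40,
        "number_of_replicas": 1,  # assumption; our real replica count may differ
    },
}
resp = requests.put(f"{ES}/_template/logs-info", json=template)
resp.raise_for_status()
print(resp.json())
```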
The problem was that once this change took effect, ES started rejecting bulk requests. A lot of them.
We then changed the shard count to 5 (just because that's the default value), and the bulk rejections stopped. We made the change by creating a new index for the same day (we just added another word to the index name in Logstash to force the creation of a new index).
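For reference, the rejections show up in the thread pool stats. A sketch of how to watch them (placeholder host; as far as I know, on 6.3+ the bulk thread pool is named write):

```python
import requests

ES = "http://localhost:9200"  # placeholder

# _cat/thread_pool reports per-node counters; bulk indexing goes
# through the "write" pool on 6.3+ (formerly called "bulk").
params = {"v": "true", "h": "node_name,name,active,queue,rejected"}
resp = requests.get(f"{ES}/_cat/thread_pool/write", params=params)
print(resp.text)
```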
We read this article, but we don't understand why we don't have enough threads to process all the bulk requests. Shouldn't data nodes with 16 CPUs be able to handle an index with 40 shards?
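In case it helps, here is how we can check the per-node pool settings (again a sketch with a placeholder host; if I understand the docs right, the write pool defaults to one thread per processor with a fixed queue of 200 in 6.x, and bulks are rejected once that queue is full):

```python
import requests

ES = "http://localhost:9200"  # placeholder

# Node info with the thread_pool metric returns each node's pool
# configuration (type, size, queue_size).
resp = requests.get(f"{ES}/_nodes/thread_pool")
for node in resp.json()["nodes"].values():
    print(node["name"], node["thread_pool"].get("write"))
```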
Can someone help?
Why is writing into a 40-shard index a problem, while writing into a new 5-shard index is not? We didn't delete any indices; we just created a new one with fewer shards...
Let me know if you need more information. Thank you!