The load on the machines is very low. The specs are as follows:
Machine A: 16 cores, 64 GB RAM, 30 GB for ES, swap disabled
Machine B: 12 cores, 48 GB RAM, 30 GB for ES, swap disabled
I changed the bulk queue size to 200 with no improvement, and also tried restarting the service, with no change in the rejection rate.
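A quick way to confirm the rejections per node (assuming the nodes answer on localhost:9200) is the cat thread pool API; the rejected counter resets when a node restarts, so sampling it twice a few minutes apart shows whether rejections are still accumulating after the queue size change:

```
# Show the bulk thread pool of every node: active threads, queued
# requests and the cumulative rejected count since the last restart.
curl -s 'http://localhost:9200/_cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected'
```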
The only metric that stands out on these machines is disk usage: both are at 88%, while the rest of the machines in the cluster are below 60%.
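For reference, the per-node comparison comes from the cat allocation API (hostname assumed), which lists shard count, disk used and disk percent for every data node:

```
# Compare shard counts and disk usage across all data nodes.
curl -s 'http://localhost:9200/_cat/allocation?v'
```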
I am also not sending bulk requests directly to the data nodes - I use nginx to define an upstream of coordinating nodes and send the requests to that upstream.
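For completeness, a bulk request through that setup looks roughly like this - the es-coordinators hostname and the index name are placeholders for whatever the nginx upstream and indices are actually called:

```
# Hypothetical example: a tiny bulk request sent to the nginx endpoint
# that load-balances over the coordinating nodes; the coordinating node
# then routes each document to the data node holding the target shard.
cat <<'EOF' > bulk.ndjson
{"index":{"_index":"logs-2017.11.01","_type":"doc"}}
{"message":"hello","@timestamp":"2017-11-01T00:00:00Z"}
EOF

curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'http://es-coordinators.example.com:9200/_bulk' \
  --data-binary @bulk.ndjson
```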
Is it possible that the disk usage is the root cause here? Why am I not seeing shards relocating? And when shards do relocate, why are they not moved off these machines?
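As far as I know the default disk watermarks are 85% (low) and 90% (high), so at 88% these nodes would - if the defaults are in effect - stop receiving new shards without having existing shards pushed off them. A sketch of how to check whether the watermarks were overridden, and to ask the cluster directly why a specific shard stays where it is (hostname and index name are placeholders):

```
# Check for any persistent/transient overrides of the disk watermark settings.
curl -s 'http://localhost:9200/_cluster/settings?flat_settings=true&pretty'

# Ask the allocation explain API (available since 5.0) why a given shard
# is, or is not, being moved; the decision fields in the response show
# which allocation decider is keeping it in place.
curl -s -H 'Content-Type: application/json' \
  -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty' -d '
{
  "index": "logs-2017.11.01",
  "shard": 0,
  "primary": true
}'
```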
Are all nodes in the cluster using exactly the same version of Elasticsearch? You should be able to see this easily using the cat nodes API. Do these two nodes hold more shards than the other nodes in the cluster? How many indices and shards do you have in the cluster? How many of these are you actively indexing into?
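For example, something like this (node address assumed) lists the version and role of every node in one place:

```
# Any version mismatch shows up immediately in the version column.
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,version,node.role,heap.percent'
```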
There is a total of 43 nodes, 21 of which are data nodes.
All nodes in the cluster are running the same ES version (5.6.3).
There are 2,500 indices spanning roughly 13,000 shards - at any given time about 70 indices are actively being indexed into.
These two nodes do not have an unusually high shard count - one holds 778 shards and the other 272.
Are you mixing documents for all 70 indices in your bulk requests? How many concurrent bulk indexing threads do you have? How many of the shards that are actively being indexed into reside on the two nodes that stand out? How does this compare to the other nodes?
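If it helps, a rough way to get the last two numbers - assuming the actively indexed indices share a name pattern such as logs-* - is to count their shards per node:

```
# Count the shards of the actively indexed indices on each node;
# replace logs-* with the actual index pattern being written to.
curl -s 'http://localhost:9200/_cat/shards/logs-*?h=node' | sort | uniq -c | sort -rn
```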
This blog post provides a bit of background and may be useful.