Specific node is working "harder" than the others in the cluster

Hi,

I have an Elasticsearch cluster (version 2.3.4) with a custom app pulling data from it and a Couchbase cluster pushing data to it (once an hour). The cluster runs on Amazon EC2 machines with identical specs and settings. After a few hours one of the nodes seems to work "harder" than the others: in the monitoring plugins (KOPF, Elastic HQ) I can see its load is constantly high, and once every few days the number of "Field Evictions" rises.

While I understand this is an indication of a lack of memory (which leads to high IOPS and high CPU), I'd like to know why only one (specific) node shows these symptoms and why the load isn't spreading across the cluster. If I restart the cluster, another node will show the same symptoms a few days later, until the cluster is restarted again.
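(For reference, a per-node view of the fielddata cache can be pulled with the _cat API; this is the kind of check I'd use to spot the node that keeps evicting. "localhost" below stands in for any node in the cluster:

curl -s 'http://localhost:9200/_cat/fielddata?v&fields=*'

The node with the largest and fastest-growing totals should be the one showing the evictions.)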

Settings:
M4 instances
16 GB RAM
150 IOPS
4 nodes

index.number_of_shards: 5
index.number_of_replicas: 2
indices.fielddata.cache.size: "30%"
indices.cache.filter.size: "30%"
indices.breaker.fielddata.limit: "60%"
ES_HEAP_SIZE=7g
MAX_OPEN_FILES=65536
MAX_LOCKED_MEMORY=unlimited

Elasticsearch version 2.3.4

Thanks for the help in advance.

Impossible to say without more info.

Are all your apps using proper load balancing?

Thank you for the quick reply. I forgot to mention the version of Elasticsearch we're using: 2.3.4 (added to the original post too).

Yes, the app is allowed to access any/all of the Elasticsearch instances, and the instances aren't limited to a specific role (master, data storage, router).
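(As a rough sanity check that the requests really are spread, the per-node HTTP stats can be compared, e.g.:

curl -s 'http://localhost:9200/_nodes/stats/http?pretty'

If one node shows far more opened connections than the others, the clients are effectively pinned to it. The hostname/port here is a placeholder.)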

After we restart the cluster (gradually), the load will "bounce" to another instance and will stay there until we restart the cluster again.

You have a skew in your design: 4 nodes but 5 shards. So one node must hold two shards and carry roughly double the load.

Golden rule: always align the shard count with the number of data nodes, either at a 1:1 ratio (which is easiest) or, in the case of many indices, at a 1:n ratio, so that each data node holds the same number of shards.
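For example, with your 4 data nodes, an index created with 4 primary shards distributes evenly (a sketch only; "my_index" is a placeholder, and since the shard count of an existing index cannot be changed, this means reindexing):

curl -XPUT 'http://localhost:9200/my_index' -d '{
  "settings": {
    "index.number_of_shards": 4,
    "index.number_of_replicas": 2
  }
}'

With 4 primaries and 2 replicas that is 12 shard copies in total, i.e. 3 per node.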

Besides that, I would strongly recommend setting up an odd number of master-eligible nodes to make the distributed system resilient against split-brain situations. See the minimum_master_nodes setting.
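For instance, with 3 of the 4 nodes left master-eligible, the quorum is (3 / 2) + 1 = 2, so the elasticsearch.yml lines would look roughly like this (a sketch only):

# on the three master-eligible nodes
discovery.zen.minimum_master_nodes: 2

# on the fourth node, to take it out of master elections
node.master: false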
