Unusually high load on the node, never on the master (2-node setup)

I have a 2-node setup with a master and a data node, both running in Docker containers that volume-mount the config and data directories.

As far as the Docker-specific configuration goes, I have a ulimit set on both the master and the node for memlock=-1:-1 (a rough sketch of the startup command is below the specs).

Master specs:
10 GB RAM
4 cores
4 GB heap
bootstrap memory lock = true

Node specs:
6 GB RAM
6 cores
3 GB heap
bootstrap memory lock = true
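
For reference, the data node container is started with something roughly like the command below. This is only a sketch: the image tag, mount paths, and the env-var style of passing the heap and memory-lock settings are assumptions (they match the official 5.x/6.x images; older images used ES_HEAP_SIZE instead), and discovery settings are left out.

```
# Rough sketch of the data node's startup (paths and image tag are placeholders).
# --ulimit memlock=-1:-1 lifts the memory-lock limit so bootstrap.memory_lock can succeed;
# ES_JAVA_OPTS pins the heap to 3 GB, roughly half of the node's 6 GB of RAM.
docker run -d --name es-data \
  --ulimit memlock=-1:-1 \
  -e "bootstrap.memory_lock=true" \
  -e "ES_JAVA_OPTS=-Xms3g -Xmx3g" \
  -v /path/to/config:/usr/share/elasticsearch/config \
  -v /path/to/data:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:6.8.23
```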

The node is using dm-crypt to encrypt data at rest, which is why we added 2 more cores to compensate for the additional overhead (though I have read that it should only add between 1-5% overhead).

We have approximately 40 million documents indexed and about 300K new documents coming in per day. Our indexing strategy defaults to daily indices using the logstash-(yyyy-mm-dd) format. While monitoring the load, the master rarely tops 4, but the node can sometimes go up to 16 and then start tapering off after about 20 minutes. I can't understand why there is such a high load on the node while the master seems unfazed by the amount of data we are sending into Elasticsearch. Please help!
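
In case it matters, this is roughly how I check what the daily indices look like (host and port are whatever your cluster exposes, and the pattern assumes the default logstash-* naming):

```
# List the daily logstash indices with shard counts, doc counts and on-disk size.
curl 'http://localhost:9200/_cat/indices/logstash-*?v&h=index,pri,rep,docs.count,store.size'
```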

Hi Antonio,

If I'm reading this correctly, you have one node that is a dedicated master and another that is a dedicated data node, right?
In such a configuration it's quite normal that you only see high load on the data node, since it is the node most affected by indexing documents. The master will not do much more than decide where to put the index and update the cluster state when a new field arrives, while the data node does the real work: analyzing your data, storing it, and so on.
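
If you want to double-check the role split and see where the load actually sits, the cat nodes API gives a quick overview (adjust the host/port to your setup; column names vary a bit between versions, e.g. load vs. load_1m):

```
# Show each node's role, whether it is the elected master, and its heap/RAM/CPU/load.
curl 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,ram.percent,cpu,load_1m'
```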

So what you are seeing is actually quite normal. The other way around (high load on the master) would be something to worry about.

I think the following two chapters of the Definitive Guide are a good read to better understand what's going on when indexing documents:
https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/data-in-data-out.html

Hope that helps,
Jakob


Thank you for your insight, I really appreciate it. Glad to hear that what I am seeing is actually normal. After much more investigation we found that the load was getting out of control because of slow disks on an over-provisioned hypervisor. The I/O wait times were getting ridiculous, so we moved the VM over to another hypervisor backed by a RAID 10 SAN and, poof, the problems went away.
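
For anyone who runs into the same thing: the I/O wait was easy to spot with plain iostat from the sysstat package; high %iowait in the CPU summary plus high await (or r_await/w_await on newer sysstat versions) and %util on the data disk pointed straight at the storage.

```
# Extended device stats, 1-second interval, 5 samples; watch %iowait, await and %util.
iostat -x 1 5
```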
