One failed data node causes HTTP connections to the master node to be disconnected (6 data nodes)


I am trying to set up an ES cluster directly on EC2 instances, without using the AWS managed Elasticsearch service.

I set up an ES cluster running 1 dedicated master node and 6 dedicated data nodes (all on EC2 m4.large instances with 2 vCPUs and 8 GB RAM).

Then I took a snapshot of an index (1.2M docs, ~200 GB, 40 shards, replicas x 1) from the AWS ES cluster and restored it to my own EC2 cluster, with all default settings except "index.unassigned.node_left.delayed_timeout": "10m".
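For reference, one way to apply that setting is to override it at restore time through the `index_settings` field of the restore request body (sent to `POST /_snapshot/<repository>/<snapshot>/_restore`). This is only a sketch: the helper name `build_restore_body` and the index name `my-index` are placeholders, and the setting could equally have been applied after the restore.

```python
import json


def build_restore_body(index_name, delayed_timeout="10m"):
    """Build a snapshot restore request body (hypothetical helper).

    Only the delayed-allocation timeout is overridden; every other
    index setting is taken from the snapshot as-is.
    """
    return {
        "indices": index_name,
        "index_settings": {
            "index.unassigned.node_left.delayed_timeout": delayed_timeout,
        },
    }


# "my-index" is a placeholder, not the real index name.
print(json.dumps(build_restore_body("my-index"), indent=2))
```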

Then I ran a Python script that uses the bulk API and scroll to reindex this index into a new index on the same cluster, using the master node as the endpoint.
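The script is roughly of the following shape. This is a minimal sketch and not the exact script: it assumes the official `elasticsearch` Python client and its `helpers.scan` / `helpers.bulk` functions, and the function names, host list, index names, and chunk size are all placeholders.

```python
def to_bulk_actions(hits, dest_index):
    """Turn scroll hits into bulk 'index' actions for the destination index."""
    for hit in hits:
        yield {
            "_op_type": "index",
            "_index": dest_index,
            "_id": hit["_id"],
            "_source": hit["_source"],
        }


def reindex(hosts, source_index, dest_index, chunk_size=500):
    """Scroll through source_index and bulk-index every doc into dest_index.

    `hosts` is a list of node URLs; in my setup it holds only the master node.
    """
    # Imported here so the helper above can be used without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(hosts, retry_on_timeout=True, max_retries=3)
    hits = helpers.scan(
        es,
        index=source_index,
        query={"query": {"match_all": {}}},
        scroll="5m",
    )
    # Returns (number of successful actions, list of errors).
    return helpers.bulk(
        es,
        to_bulk_actions(hits, dest_index),
        chunk_size=chunk_size,
        raise_on_error=False,
    )
```

A call would look like `reindex(["http://master-node:9200"], "old-index", "new-index")`, where the URL and index names are again placeholders.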

Previously this index was on an ES cluster provided by the AWS Elasticsearch service, with 8 t2.medium instances, and the same reindexing took about 8 hours to finish without any problem.

However, using my own cluster, I ran into two issues:

  1. a data node always dies due to an OOM / heap size issue;
  2. once this happens, my Python script dies shortly afterwards due to a connection timeout (although the master node itself never died).

So my question is:

When one data node dies, shouldn't the master node automatically stop sending traffic to the failed node, given that the master's ES log clearly shows it detected the failure and removed that node from the cluster?

I am less concerned about the OOM error, since my design is to scale up automatically by adding new data nodes to the cluster when CPU load on the remaining data nodes increases.

