Node Leaving Cluster on Reindex

Hello all!

So I have been having this issue lately with Elasticsearch 2.4.

When I run

POST /_reindex
{
  "source": {
    "index": "indextest-2016.12.04"
  },
  "dest": {
    "index": "indextest-2016.12.04-2"
  }
}

I see random nodes leaving the cluster momentarily and then rejoining. If I throttle the reindex with requests_per_second=50, that seems to be a sweet spot where it barely happens.
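
For reference, this is roughly how I run the throttled version. As I understand it on 2.4, requests_per_second goes on the URL (please correct me if it belongs elsewhere):

POST /_reindex?requests_per_second=50
{
  "source": {
    "index": "indextest-2016.12.04"
  },
  "dest": {
    "index": "indextest-2016.12.04-2"
  }
}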

Here is a response from an unthrottled run once it's finished (the failing nodes change each time):

{
   "took": 182120,
   "timed_out": false,
   "total": 2999257,
   "updated": 7200,
   "created": 600,
   "batches": 9,
   "version_conflicts": 0,
   "noops": 0,
   "retries": 0,
   "throttled_millis": 0,
   "requests_per_second": "unlimited",
   "throttled_until_millis": 0,
   "failures": [
      {
         "shard": -1,
         "index": null,
         "reason": {
            "type": "node_not_connected_exception",
            "reason": "[NODE_2][IP:9300] Node not connected"
         }
      },
      {
         "shard": -1,
         "index": null,
         "reason": {
            "type": "node_not_connected_exception",
            "reason": "[NODE_2][IP:9300] Node not connected"
         }
      },
      {
         "shard": -1,
         "index": null,
         "reason": {
            "type": "node_not_connected_exception",
            "reason": "[NODE_6][IP:9300] Node not connected"
         }
      }
   ]
}

I have never noticed this before, and I am wondering how I can best go about seeing why a node is leaving and rejoining. There are no relevant entries in the master node's logs, or in the logs of the data node I see leaving. Very weird!

I was hoping for some insight into which stats or logs to look at.
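
The only concrete idea I have so far is to turn up discovery logging while the reindex runs, roughly like this (a sketch assuming the 2.x dynamic logger settings; I have not tried it yet):

PUT /_cluster/settings
{
  "transient": {
    "logger.discovery": "DEBUG"
  }
}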

Thanks

I've never seen it before either! Reindex can't abide the node that it is pulling data from leaving the cluster....

I wonder if the node that it is pulling data from is close to the edge performance-wise? Or are your documents really, really big? You might try setting the batch size lower if you have huge documents:

POST _reindex
{
  "source": {
    "index": "source",
    "size": 10  <------ Here. The default is 1000.
  },
  "dest": {
    "index": "dest"
  }
}
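
You can also keep an eye on the reindex while it runs with the task management API (it has been around since 2.3, though double-check the exact parameters on your version); it at least tells you which node the request is running on:

GET /_tasks?detailed=true&actions=*reindex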

Yeah, it's kind of weird!

It is in production, so as you said it might just be hitting some sort of performance ceiling.

The nodes are pretty decent performance-wise though, so that is a little odd.

I will try the batch size on source and see what happens, thanks!

I have included some stats on the index as well:

      "primaries": {
         "docs": {
            "count": 2999257,
            "deleted": 0
         },
         "store": {
            "size_in_bytes": 2722495408,
            "throttle_time_in_millis": 0
         },

Averaging roughly 907 bytes per document (2,722,495,408 bytes / 2,999,257 docs) doesn't look big.

Yeah... I tried with size: 10 and got the same issue. A node almost immediately disconnects from the cluster.

Load/CPU are basically nothing. I am thinking it may be LAN saturation or some kind of throttling.
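
For the network theory, the check I have in mind is the node stats API with the transport metric, something like this (I believe metric filtering on the URL works on 2.4, but I have not confirmed it):

GET /_nodes/stats/transport,os,process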

Another thought is that maybe the index has some bad spots on disk. I will try with different/smaller indices.
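
For the "smaller" test, the plan is to reindex just a slice by adding a query to the source, something like this (a sketch; the @timestamp field, the range, and the destination name are only placeholders for my data):

POST /_reindex
{
  "source": {
    "index": "indextest-2016.12.04",
    "query": {
      "range": {
        "@timestamp": {
          "lt": "2016-12-04T06:00:00"
        }
      }
    }
  },
  "dest": {
    "index": "indextest-2016.12.04-small"
  }
}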

Thanks anyway!
