Unassigned replica shards on Elasticsearch 5.2 [is it a bug?]

Hello,
I have an Elasticsearch cluster with 4 nodes, running two Elasticsearch processes on each node (listening on ports 8200 and 9200).
I have an index with 8 shards and 1 replica, but the index always has one unassigned shard. Before asking this question, I ran POST /_cluster/reroute?retry_failed=true and also set {"number_of_replicas":0} and then back to {"number_of_replicas":1} many times, but it still does not work; the cluster is still yellow.
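
For reference, the commands I used were along these lines (the host and index name are from my setup; exact flags may differ slightly):

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'
curl -XPUT 'localhost:9200/adservice1_20190321/_settings' -d '{"index":{"number_of_replicas":0}}'
curl -XPUT 'localhost:9200/adservice1_20190321/_settings' -d '{"index":{"number_of_replicas":1}}'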

Below are the index settings and the /_cluster/allocation/explain result.

It tells me one node cannot be connected to, but I am sure that node's listening port is fine, because there are 2 primary shards on that node.
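
The explain result below was obtained with a request roughly like this (asking about the unassigned replica of shard 3):

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -d '{"index":"adservice1_20190321","shard":3,"primary":false}'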

curl -s 'localhost:9200/adservice1_20190321/_settings?pretty'
{
  "adservice1_20190321" : {
    "settings" : {
      "index" : {
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "require" : {
              "rack" : "stale"
            }
          }
        },
        "search" : {
          "slowlog" : {
            "threshold" : {
              "fetch" : {
                "warn" : "1s",
                "trace" : "200ms",
                "debug" : "500ms",
                "info" : "800ms"
              },
              "query" : {
                "warn" : "10s",
                "trace" : "500ms",
                "debug" : "2s",
                "info" : "5s"
              }
            }
          }
        },
        "refresh_interval" : "60s",
        "number_of_shards" : "8",
        "translog" : {
          "sync_interval" : "5s",
          "durability" : "async"
        },
        "provided_name" : "adservice1_20190321",
        "merge" : {
          "scheduler" : {
            "max_thread_count" : "2"
          }
        },
        "creation_date" : "1553212811542",
        "number_of_replicas" : "1",
        "uuid" : "TAngC1U3R8qneD-dnbMgNA",
        "version" : {
          "created" : "5020299"
        }
      }
    }
  }
}
{
  "index" : "adservice1_20190321",
  "shard" : 3,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2019-03-21T01:44:50.171Z",
    "failed_allocation_attempts" : 7,
    "details" : "failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=X0fb0yW0QRWZs_OwdDZPag], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T01:44:34.951Z], failed_attempts[6], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=-Mr1-CWPR7O2giG4cHBuHQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T00:01:11.615Z], failed_attempts[5], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=_qxUN09-TdmBmkQDFAKjZQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T00:01:10.202Z], failed_attempts[4], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[x9q2x22EQMi_PlQiau552w], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Eo4-nMHETUG4ReGUYi5-ng], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T00:01:08.793Z], failed_attempts[3], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=7ZJ2I8XzTWmlPXwPM_fKWw], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T00:01:01.983Z], failed_attempts[2], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[x9q2x22EQMi_PlQiau552w], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=_vEVLtoZQFCeAY8bg_fK-w], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T00:00:58.817Z], failed_attempts[1], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], s[STARTED], a[id=xqjVvuU9TzKxmjp2o5v-Pg], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[130], failure NodeDisconnectedException[[node1-10.226.137.176][10.226.137.176:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[130], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[130], failure NodeDisconnectedException[[node1-10.226.137.176][10.226.137.176:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[130], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",

Can somebody help me? Thanks very much.

Below are the shard list and the explain result for one of the nodes where allocation failed:

adservice1_20190321                    1 r STARTED     22754875   72.8gb 10.226.137.176 node1-10.226.137.176
adservice1_20190321                    1 p STARTED     22742824   71.8gb 9.37.152.24    node1-9.37.152.24
adservice1_20190321                    3 p STARTED     22761158   64.1gb 100.107.135.3  node1-100.107.135.3
adservice1_20190321                    3 r UNASSIGNED                                   
adservice1_20190321                    5 p STARTED     22732798   64.9gb 9.37.152.24    node1-9.37.152.24
adservice1_20190321                    5 r STARTED     22719831   66.3gb 10.226.154.25  node1-10.226.154.25
adservice1_20190321                    7 r STARTED     22748021   63.9gb 10.226.137.176 node1-10.226.137.176
adservice1_20190321                    7 p STARTED     22759485   68.6gb 100.107.135.3  node1-100.107.135.3
adservice1_20190321                    2 r STARTED     22736215   68.5gb 9.37.152.24    node1-9.37.152.24
adservice1_20190321                    2 p STARTED     22723809   65.5gb 10.226.154.25  node1-10.226.154.25
adservice1_20190321                    4 p STARTED     22752633   71.7gb 10.226.137.176 node1-10.226.137.176
adservice1_20190321                    4 r STARTED     22763864     69gb 100.107.135.3  node1-100.107.135.3
adservice1_20190321                    6 p STARTED     22733711   64.6gb 10.226.154.25  node1-10.226.154.25
adservice1_20190321                    6 r STARTED     22773291   68.3gb 100.107.135.3  node1-100.107.135.3
adservice1_20190321                    0 p STARTED     22750002   71.5gb 10.226.137.176 node1-10.226.137.176
adservice1_20190321                    0 r STARTED     22739022   62.6gb 9.37.152.24    node1-9.37.152.24
 {
      "node_id" : "assw7TKcSkaNcLzO7CMvGQ",
      "node_name" : "node1-9.37.152.24",
      "transport_address" : "9.37.152.24:9400",
      "node_attributes" : {
        "rack" : "stale"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T01:44:50.171Z], failed_attempts[7], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=X0fb0yW0QRWZs_OwdDZPag], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T01:44:34.951Z], failed_attempts[6], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=-Mr1-CWPR7O2giG4cHBuHQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T00:01:11.615Z], failed_attempts[5], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=_qxUN09-TdmBmkQDFAKjZQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T00:01:10.202Z], failed_attempts[4], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[x9q2x22EQMi_PlQiau552w], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Eo4-nMHETUG4ReGUYi5-ng], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T00:01:08.793Z], failed_attempts[3], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=7ZJ2I8XzTWmlPXwPM_fKWw], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T00:01:01.983Z], failed_attempts[2], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[x9q2x22EQMi_PlQiau552w], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=_vEVLtoZQFCeAY8bg_fK-w], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T00:00:58.817Z], failed_attempts[1], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], s[STARTED], a[id=xqjVvuU9TzKxmjp2o5v-Pg], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[130], failure NodeDisconnectedException[[node1-10.226.137.176][10.226.137.176:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[130], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[130], failure NodeDisconnectedException[[node1-10.226.137.176][10.226.137.176:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[130], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]]]"
        }
      ]
    }

Have you given the two Elasticsearch processes on each node different data paths (path.data in elasticsearch.yml)? They need that because two processes or instances can't share data files: both will try to lock and open the data files for their own access and updates.
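
If you want to double-check without logging into each machine, the nodes info API should show the effective settings of every process, with something like this (adjust the host and port to one of your instances):

curl -s 'localhost:9200/_nodes/settings?pretty'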

Yes, they have different path.data settings and install directories.
This cluster has been running for more than a year, and the unassigned-shard problem only appeared recently.

You have only quoted half of the allocation explain API output. You have deleted the node-by-node information that will answer your question.

Ah, you gave more information lower down:

This tells us that the replica became unassigned because the node disconnected:

NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]

Subsequent repeated attempts to reallocate this shard are failing because it is rejecting writes:

failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw]

There should be information in this node's log with a stack trace indicating why it's rejecting these writes.
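
For example, something along these lines run on that node should surface the relevant entries (the log path is just a placeholder; use whatever you have configured):

grep -A 30 '\[adservice1_20190321\]\[3\]' /path/to/elasticsearch.log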

Yes, you are right, but what confuses me is why it reports the node as disconnected and rejects writes. The node with the failed allocation looks perfectly normal; it even holds primary shards, and the other nodes replicate data from it without problems (you can see this in the /_cat/shards output above, which suggests the network is OK).
Disk space is also sufficient (only 40% used, with no quota limit), and data is being written every day.
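
(For reference, the per-node disk usage can be seen with something like this:)

curl -s 'localhost:9200/_cat/allocation?v'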

I will paste the stack trace information. Thank you very much.

Here is the stack trace information. Does it mean a data format mismatch is making the replicas crash?

[2019-03-21T18:34:08,567][DEBUG][o.e.a.b.TransportShardBulkAction] [node1-10.226.154.25] [adservice1_20190321][6] failed to execute bulk item (index) index {[adservice1_20190321][log][AWmk9w8OkfzP2BP2k477], source[{"@version":"1","@timestamp":"2019-03-22T10:34:07.964Z","kafka":{"msg_size":3652,"topic":"ads_api","consumer_group":"logstash_ads","partition":60,"offset":1322449607,"key":null},"fields":{"name":"ads_api","server_ip":"10.175.105.116"},"beat":{"name":"TNG","hostname":"TNG","version":"6.5.4"},"host":{"name":"TNG"},"source":"/data/log/Biz_RD/ad_service/trace/20190322/adtarget_get_adtarget_18.log","offset":9228094,"tags":["biz_ads","_rubyexception"],"datetime":"9633\\u5e02\\u3001\\u516d\\u76d8\\u6c34\\u5e02\\u3001\\u9ed4\\u4e1c\\u5357\\u82d7\\u65cf\\u4f97\\u65cf\\u81ea\\u6cbb\\u5dde\\u3001\\u9ed4\\u5357\\u5e03\\u4f9d\\u65cf\\u82d7\\u65cf\\u81ea\\u6cbb\\u5dde\\u3001\\u9ed4\\u897f\\u5357\\u5e03\\u4f9d\\u65cf\\u82d7\\u65cf\\u81ea\\u6cbb\\u5dde\\u3001\\u94dc\\u4ec1\\u5e02\\u3001\\u9075\\u4e49\\u5e02\\u3001\\u4f0a\\u6625\\u5e02\\u3001\\u4e03\\u53f0\\u6cb3\\u5e02\\u3001\\u9f50\\u9f50\\u54c8\\u5c14\\u5e02\\u3001\\u53cc\\u9e2d\\u5c71\\u5e02\\u3001\\u7ee5\\u5316\\u5e02\\u3001\\u5927\\u5e86\\u5e02\\u3001\\u5927\\u5174\\u5b89\\u5cad\\u5730\\u533a\\u3001\\u54c8\\u5c14\\u6ee8\\u5e02\\u3001\\u9e64\\u5c97\\u5e02\\u3001\\u9ed1\\u6cb3\\u5e02\\u3001\\u9e21\\u897f\\u5e02\\u3001\\u4f73\\u6728\\u65af\\u5e02\\u3001\\u7261\\u4e39\\u6c5f\\u5e02\\u3001\\u767d\\u6c99\\u9ece\\u65cf\\u81ea\\u6cbb\\u53bf\\u3001\\u4fdd\\u4ead\\u9ece\\u65cf\\u82d7\\u65cf\\u81ea\\u6cbb\\u53bf\\u3001\\u660c\\u6c5f\\u9ece\\u65cf\\u81ea\\u6cbb\\u53bf\\u3001\\u6f84\\u8fc8\\u53bf\\u3001\\u510b\\u5dde\\u5e02\\u3001\\u5b9a\\u5b89\\u53bf\\u3001\\u4e1c\\u65b9\\u5e02\\u3001\\u6d77\\u53e3\\u5e02\\u3001\\u4e34\\u9ad8\\u53bf\\u3001\\u9675\\u6c34\\u9ece\\u65cf\\u81ea\\u6cbb\\u53bf\\u3001\\u743c\\u6d77\\u5e02\\u3001\\u4e09\\u4e9a\\u5e02\\u3001\\u4e94\\u6307\\u5c71\\u5e02\\u3001\\u4e07\\u5b81\\u5e02\\u3001\\u6587\\u660c\\u5e02\\u3001\\u73e0\\u6d77\\u5e02\\u3001\\u4e91\\u6d6e\\u5e02\\u3001\\u6e5b\\u6c5f\\u5e02\\u3001\\u8087\\u5e86\\u5e02\\u3001\\u4e2d\\u5c71\\u5e02\\u3001\\u8302\\u540d\\u5e02\\u3001\\u6885\\u5dde\\u5e02\\u3001\\u6e05\\u8fdc\\u5e02\\u3001\\u6c55\\u5934\\u5e02\\u3001\\u6c55\\u5c3e\\u5e02\\u3001\\u97f6\\u5173\\u5e02\\u3001\\u6df1\\u5733\\u5e02\\u3001\\u9633\\u6c5f\\u5e02\\u3001\\u6f6e\\u5dde\\u5e02\\u3001\\u4e1c\\u839e\\u5e02\\u3001\\u4f5b\\u5c71\\u5e02\\u3001\\u5e7f\\u5dde\\u5e02\\u3001\\u6cb3\\u6e90\\u5e02\\u3001\\u60e0\\u5dde\\u5e02\\u3001\\u6c5f\\u95e8\\u5e02\\u3001\\u63ed\\u9633\\u5e02\\u3001\\u5b89\\u9633\\u5e02\\u3001\\u9e64\\u58c1\\u5e02\\u3001\\u6d4e\\u6e90\\u5e02\\u3001\\u7126\\u4f5c\\u5e02\\u3001\\u5f00\\u5c01\\u5e02\\u3001\\u6d1b\\u9633\\u5e02\\u3001\\u6f2f\\u6cb3\\u5e02\\u3001\\u5357\\u9633\\u5e02\\u3001\\u5e73\\u9876\\u5c71\\u5e02\\u3001\\u6fee\\u9633\\u5e02\\u3001\\u4e09\\u95e8\\u5ce1\\u5e02\\u3001\\u5546\\u4e18\\u5e02\\u3001\\u65b0\\u4e61\\u5e02\\u3001\\u4fe1\\u9633\\u5e02\\u3001\\u8bb8\\u660c\\u5e02\\u3001\\u90d1\\u5dde\\u5e02\\u3001\\u5468\\u53e3\\u5e02\\u3001\\u9a7b\\u9a6c\\u5e97\\u5e02\\u3001\\u5170\\u5dde\\u5e02\\u3001\\u4e34\\u590f\\u56de\\u65cf\\u81ea\\u6cbb\\u5dde\\u3001\\u9647\\u5357\\u5e02\\u3001\\u5e73\\u51c9\\u5e02\\u3001\\u5e86\\u9633\\u5e02\\u3001\\u5929\\u6c34\\u5e02\\u3001\\u6b66\\u5a01\\u5e02\\u3001\\u5f20\\u6396\\u5e02\\u3001\\u767d\\u94f6\\u5e02\\u3001\\u5b9a\\u897f\\u5e02\\u3001\\u7518\\u5357\\u85cf\\u65cf\\u81ea\\u6cbb\\u5dde\\u3001\\u5609\\u5cea\\u5173\\u5e02\\u3001\\u91d1\\u660c\\u5e02\\u3001\\u91
52\\u6cc9\\u5e02\\u3001\\u957f\\u6c99\\u5e02\\u3001\\u5e38\\u5fb7\\u5e02\\u3001\\u6c38\\u5dde\\u5e02\\u3001\\u5cb3\\u9633\\u5e02\\u3001\\u5f20\\u5bb6\\u754c\\u5e02\\u3001\\u682a\\u6d32\\u5e02\\u3001\\u90f4\\u5dde\\u5e02\\u3001\\u8861\\u9633\\u5e02\\u3001\\u6000\\u5316\\u5e02\\u3001\\u5a04\\u5e95\\u5e02\\u3001\\u90b5\\u9633\\u5e02\\u3001\\u6e58\\u6f6d\\u5","level":"0","logger":null,"thread":null,"version":null,"trace_id":null,"span_id":null,"app_name":null,"client":null,"callee":null,"uid":null,"mod_name":null,"act_name":null,"client_ip":null,"server_ip":null,"operator_id":null,"exec_time_us":null,"file":null,"error_code":null,"error_msg":null,"line":null,"req_method":null,"rsp_code":null,"req_url":null,"req_body":null,"rsp_body":null,"error_type":null}]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [datetime]
        at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:298) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:449) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:575) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:395) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:372) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:93) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:66) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:275) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:533) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.shard.IndexShard.prepareIndexOnPrimary(IndexShard.java:510) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:196) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:201) ~[elasticsearch-5.2.2.jar:5.2.2]

Maybe this is the wrong log, because I could not find any entries mentioning shard 3, i.e. [adservice1_20190321][3].

No, this is not the exception we're looking for. This one pertains to an operation on the primary of [adservice1_20190321][6]:

The failure we are interested in was on a replica of [adservice1_20190321][3].

Maybe this is the wrong log, because I could not find any entries mentioning shard 3, i.e. [adservice1_20190321][3].

You seem to be editing your posts to add replies and/or extra information. This makes the conversation hard to follow, and means that we might not see this extra information. It's better to use the edit feature only to fix minor errors (typos, spelling etc.) and to post new information as a new post.

But I did find shard 3 in the log on the node holding the primary:

[2019-03-21T18:35:47,910][WARN ][o.e.a.b.TransportShardBulkAction] [node1-100.107.135.3] [[adservice1_20190321][3]] failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[assw7TKcSkaNcLzO7CMvGQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=pn9M5CtkT-KStxS-85T4jQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T08:20:28.813Z], failed_attempts[8], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[assw7TKcSkaNcLzO7CMvGQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=mrQCiq6jRE2NMQqn-WKYjg], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T08:18:54.603Z], failed_attempts[7], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[assw7TKcSkaNcLzO7CMvGQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=E9v2LmvqRk2rEuNgFV940Q], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T06:20:41.715Z], failed_attempts[6], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[assw7TKcSkaNcLzO7CMvGQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=rhkqVV3qRO-A0BrK59b_Uw], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T04:31:11.161Z], failed_attempts[5], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[assw7TKcSkaNcLzO7CMvGQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=UmAZHN-1SfW8aqHnetZLlg], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T04:15:42.111Z], failed_attempts[4], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=rmiOwygnRASFZVr2XrdzFw], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T02:48:55.579Z], failed_attempts[3], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=n-I-zq2FRIGX5S9Wm7qQGw], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T02:48:52.411Z], failed_attempts[2], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=Fr3Y87NhRwym-lkLSE-ZhQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-03-21T02:48:50.861Z], failed_attempts[1], delayed=false, details[failed to perform indices:data/write/bulk[s] on replica [adservice1_20190321][3], node[nlTM3tPtSLKn7vJItD0gqw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=rjIQoIy-ScCXCthB3rLwhA], unassigned_info[[reason=REPLICA_ADDED], at[2019-03-21T02:48:47.158Z], delayed=false, allocation_status[no_attempt]], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[0], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[0], failure NodeDisconnectedException[[node1-10.226.154.25][10.226.154.25:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], expected_shard_size[0], failure NodeNotConnectedException[[node1-10.226.154.25][10.226.154.25:9400] Node not 
connected]], allocation_status[no_attempt]], expected_shard_size[31636823288], failure NodeNotConnectedException[[node1-9.37.152.24][9.37.152.24:9400] Node not connected]], allocation_status[no_attempt]], failure NodeDisconnectedException[[node1-9.37.152.24][9.37.152.24:9400][indices:data/write/bulk[s][r]] disconnected]], allocation_status[no_attempt]], failure NodeNotConnectedException[[node1-9.37.152.24][9.37.152.24:9400] Node not connected]], allocation_status[no_attempt]], failure NodeNotConnectedException[[node1-9.37.152.24][9.37.152.24:9400] Node not connected]], allocation_status[no_attempt]]
org.elasticsearch.transport.NodeNotConnectedException: [node1-9.37.152.24][9.37.152.24:9400] Node not connected
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:590) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:113) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:496) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:472) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:935) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:184) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:168) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:129) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:327) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:262) [elasticsearch-5.2.2.jar:5.2.2]

Thank you very much, I will do that. I have to go out now; I will do it in about 4 hours. Thank you again, very much.


NodeNotConnectedException means the node isn't (or wasn't) connected. This indicates a network problem, but it could have been transient. If the connectivity issue persists then I think you should be seeing log messages containing the string failed to connect to node roughly every minute.
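
For example, on the node holding the primary, something like this (log path is a placeholder) should show a steadily growing count while the problem persists:

grep -c 'failed to connect to node' /path/to/elasticsearch.log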

Hello,
Yes, I found 3 occurrences of the string "failed to connect to node" in the log on the node holding the primary shard, so you are right: at those times the node really could not be connected to. But I have tried POST /_cluster/reroute?retry_failed=true more than 10 times, the explain result shows the cluster retried 7 times, and yet there are only 3 log records.

[root@100 /data1/ttt-es-log/node1]# grep 'failed to connect to node' ttt_log_es-2019-03-21.log
[2019-03-21T08:00:35,749][WARN ][o.e.c.NodeConnectionsService] [node1-100.107.135.3] failed to connect to node {node1-10.226.137.176}{x9q2x22EQMi_PlQiau552w}{k-D4YwmiRsy6zXmGI6b8Gg}{10.226.137.176}{10.226.137.176:9400}{rack=stale} (tried [1] times)
[2019-03-21T08:00:51,329][WARN ][o.e.c.NodeConnectionsService] [node1-100.107.135.3] failed to connect to node {node1-10.226.137.176}{x9q2x22EQMi_PlQiau552w}{k-D4YwmiRsy6zXmGI6b8Gg}{10.226.137.176}{10.226.137.176:9400}{rack=stale} (tried [1] times)
[2019-03-21T13:02:42,636][WARN ][o.e.c.NodeConnectionsService] [node1-100.107.135.3] failed to connect to node {node1-10.226.137.176}{x9q2x22EQMi_PlQiau552w}{k-D4YwmiRsy6zXmGI6b8Gg}{10.226.137.176}{10.226.137.176:9400}{rack=stale} (tried [1] times)
[root@100 /data1/ttt-es-log/node1]# grep 'failed to connect to node' ttt_log_es-2019-03-21.log -c
3
[root@100 /data1/ttt-es-log/node1]#

Below is the detailed information:

[root@100 /data1/ttt-es-log/node1]# grep 'failed to connect to node' ttt_log_es-2019-03-21.log -A 100|head -50
[2019-03-21T08:00:35,749][WARN ][o.e.c.NodeConnectionsService] [node1-100.107.135.3] failed to connect to node {node1-10.226.137.176}{x9q2x22EQMi_PlQiau552w}{k-D4YwmiRsy6zXmGI6b8Gg}{10.226.137.176}{10.226.137.176:9400}{rack=stale} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [node1-10.226.137.176][10.226.137.176:9400] general node connection failure
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:519) ~[elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:460) ~[elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:318) ~[elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:305) ~[elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:121) [elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.cluster.NodeConnectionsService.connectToNodes(NodeConnectionsService.java:87) [elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:775) [elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.2.jar:5.2.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_73]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_73]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
Caused by: java.lang.IllegalStateException: handshake failed
	at org.elasticsearch.transport.TcpTransport.executeHandshake(TcpTransport.java:1568) ~[elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:502) ~[elasticsearch-5.2.2.jar:5.2.2]
	... 14 more
Caused by: org.elasticsearch.transport.TransportException: connection reset
	at org.elasticsearch.transport.TcpTransport.onChannelClosed(TcpTransport.java:1610) ~[elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.transport.netty4.Netty4Transport.access$100(Netty4Transport.java:93) ~[?:?]
	at org.elasticsearch.transport.netty4.Netty4Transport$ChannelCloseListener.operationComplete(Netty4Transport.java:412) ~[?:?]
	at org.elasticsearch.transport.netty4.Netty4Transport$ChannelCloseListener.operationComplete(Netty4Transport.java:402) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104) ~[?:?]
	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82) ~[?:?]
	at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:1058) ~[?:?]
	at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:686) ~[?:?]
	at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:664) ~[?:?]
	at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:607) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.closeOnRead(AbstractNioByteChannel.java:82) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:101) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:140) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
	... 1 more
[2019-03-21T08:00:40,315][WARN ][o.e.t.n.Netty4Transport  ] [node1-100.107.135.3] exception caught on transport layer [[id: 0xfce89b3d, L:/100.107.135.3:64038 - R:10.226.137.176/10.226.137.176:9400]], closing connection
java.lang.ClassCastException: null
[2019-03-21T08:00:40,315][WARN ][o.e.t.n.Netty4Transport  ] [node1-100.107.135.3] exception caught on transport layer [[id: 0xfce89b3d, L:/100.107.135.3:64038 - R:10.226.137.176/10.226.137.176:9400]], closing connection

I also have some new information from the node holding primary shards [3,7] of the new index created today. It shows unassigned_info[[reason=INDEX_CREATED].
And I did not find "failed to connect to node" in that log file.

[root@100 /data1/ttt-es-log/node1]# ll -htr|tail -3
-rw-r--r-- 1 user_00 users   41M Mar 22 23:59 ttt_log_es-2019-03-22.log
-rw-r--r-- 1 user_00 users  310K Mar 23 17:04 ttt_log_es_index_search_slowlog.log
-rw-r--r-- 1 user_00 users   33M Mar 23 18:12 ttt_log_es.log
[root@100 /data1/ttt-es-log/node1]# grep 'failed to connect to node' ttt_log_es.log                            
[root@100 /data1/ttt-es-log/node1]# grep 'failed to connect to node' ttt_log_es.log -c
0

Below are the shard list for the new index and the stack trace information:

[root@100 /data1/ttt-es-log/node1]# curl -s  'localhost:9200/_cat/shards?v'|grep adservice1_20190323
adservice1_20190323                    3     p      STARTED     17123343   47.6gb 100.107.135.3  node1-100.107.135.3
adservice1_20190323                    3     r      UNASSIGNED                                   
adservice1_20190323                    4     r      STARTED     17105518   48.8gb 10.226.137.176 node1-10.226.137.176
adservice1_20190323                    4     p      STARTED     17112854     48gb 10.226.154.25  node1-10.226.154.25
adservice1_20190323                    1     p      STARTED     17106768   47.7gb 10.226.137.176 node1-10.226.137.176
adservice1_20190323                    1     r      STARTED     17113804   48.5gb 10.226.154.25  node1-10.226.154.25
adservice1_20190323                    6     p      STARTED     17140392   49.7gb 9.37.152.24    node1-9.37.152.24
adservice1_20190323                    6     r      STARTED     17124441   48.4gb 100.107.135.3  node1-100.107.135.3
adservice1_20190323                    5     p      STARTED     17113245   48.2gb 10.226.137.176 node1-10.226.137.176
adservice1_20190323                    5     r      STARTED     17120503   47.9gb 10.226.154.25  node1-10.226.154.25
adservice1_20190323                    2     p      STARTED     17137361   49.7gb 9.37.152.24    node1-9.37.152.24
adservice1_20190323                    2     r      STARTED     17117608   47.9gb 10.226.154.25  node1-10.226.154.25
adservice1_20190323                    7     p      STARTED     17121300   47.3gb 100.107.135.3  node1-100.107.135.3
adservice1_20190323                    7     r      UNASSIGNED                                   
adservice1_20190323                    0     r      STARTED     17101462     49gb 10.226.137.176 node1-10.226.137.176
adservice1_20190323                    0     p      STARTED     17108760   48.3gb 10.226.154.25  node1-10.226.154.25
[2019-03-23T08:01:12,209][WARN ][o.e.a.b.TransportShardBulkAction] [node1-100.107.135.3] [[adservice1_20190323][7]] failed to perform indices:data/write/bulk[s] on replica [adservice1_20190323][7], node[assw7TKcSkaNcLzO7CMvGQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=zLi5WsHeQ7yM0X_N8WRWiw], unassigned_info[[reason=INDEX_CREATED], at[2019-03-23T00:00:00.302Z], delayed=false, allocation_status[no_attempt]]
org.elasticsearch.transport.NodeDisconnectedException: [node1-9.37.152.24][9.37.152.24:9400][indices:data/write/bulk[s][r]] disconnected
[2019-03-23T08:01:12,209][WARN ][o.e.a.b.TransportShardBulkAction] [node1-100.107.135.3] [[adservice1_20190323][3]] failed to perform indices:data/write/bulk[s] on replica [adservice1_20190323][3], node[assw7TKcSkaNcLzO7CMvGQ], [R], s[STARTED], a[id=liVEnESdQnmcCGRmd6x53Q]
org.elasticsearch.transport.NodeDisconnectedException: [node1-9.37.152.24][9.37.152.24:9400][indices:data/write/bulk[s][r]] disconnected
[2019-03-23T08:01:12,209][WARN ][o.e.a.b.TransportShardBulkAction] [node1-100.107.135.3] [[adservice1_20190323][3]] failed to perform indices:data/write/bulk[s] on replica [adservice1_20190323][3], node[assw7TKcSkaNcLzO7CMvGQ], [R], s[STARTED], a[id=liVEnESdQnmcCGRmd6x53Q]
org.elasticsearch.transport.NodeDisconnectedException: [node1-9.37.152.24][9.37.152.24:9400][indices:data/write/bulk[s][r]] disconnected
[2019-03-23T08:01:12,212][WARN ][o.e.a.b.TransportShardBulkAction] [node1-100.107.135.3] [[adservice1_20190323][7]] failed to perform indices:data/write/bulk[s] on replica [adservice1_20190323][7], node[assw7TKcSkaNcLzO7CMvGQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=zLi5WsHeQ7yM0X_N8WRWiw], unassigned_info[[reason=INDEX_CREATED], at[2019-03-23T00:00:00.302Z], delayed=false, allocation_status[no_attempt]]
org.elasticsearch.transport.NodeNotConnectedException: [node1-9.37.152.24][9.37.152.24:9400] Node not connected
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:590) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:113) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:496) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:472) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:935) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:184) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:168) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:129) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:327) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:262) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:864) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:861) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147) [elasticsearch-5.2.2.jar:5.2.2]

To be clear, I think this message should be appearing every minute in the log file of the node holding the primary. Which log are you looking at there?

I can provide the log file, but uploading attachments is not supported on this platform. Can you give me an email address?

Sure, I'm david.turner@elastic.co.

Thank you very much.
The mail has been sent.