Killing 1 node causes hanging bulk requests

After killing 1 node (without -9) out of the 3 nodes in the cluster, a bulk create via the transport client hangs for several minutes!

The logs show:

[2017-12-13T10:52:58,703][INFO ][o.e.c.r.a.AllocationService] [node-3] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{node-2}{y5SsksY3RYqoroDPrSvfdg}{u1cSLNqgRgusPfdhhY90Fw}{172.16.69.2}{172.16.69.2:9300} failed to ping, tried [3] times, each with maximum [1s] timeout]).
[2017-12-13T10:52:58,704][INFO ][o.e.c.s.ClusterService   ] [node-3] removed {{node-2}{y5SsksY3RYqoroDPrSvfdg}{u1cSLNqgRgusPfdhhY90Fw}{172.16.69.2}{172.16.69.2:9300},}, reason: zen-disco-node-failed({node-2}{y5SsksY3RYqoroDPrSvfdg}{u1cSLNqgRgusPfdhhY90Fw}{172.16.69.2}{172.16.69.2:9300}), reason(failed to ping, tried [3] times, each with maximum [1s] timeout)[{node-2}{y5SsksY3RYqoroDPrSvfdg}{u1cSLNqgRgusPfdhhY90Fw}{172.16.69.2}{172.16.69.2:9300} failed to ping, tried [3] times, each with maximum [1s] timeout]
[2017-12-13T10:52:58,827][INFO ][o.e.c.r.DelayedAllocationService] [node-3] scheduling reroute for delayed shards in [59.8s] (1 delayed shards)
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:58:06,206][INFO ][o.e.c.s.ClusterService   ] [node-3] added {{node-2}{y5SsksY3RYqoroDPrSvfdg}{wEPJXyNNRG2vAnj1Qb5oMA}{172.16.69.2}{172.16.69.2:9300},}, reason: zen-disco-node-join[{node-2}{y5SsksY3RYqoroDPrSvfdg}{wEPJXyNNRG2vAnj1Qb5oMA}{172.16.69.2}{172.16.69.2:9300}]

Our configuration is:

discovery.zen.commit_timeout: 2s
discovery.zen.publish_timeout: 2s
discovery.zen.fd.ping_timeout: 1s
transport.tcp.connect_timeout: 5s
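(One related setting, not shown above, and I am assuming it is at its default: with 3 master-eligible nodes the usual quorum value is

discovery.zen.minimum_master_nodes: 2

so that a master can still be elected after losing a single node.)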

We are using ES version 5.4.1.
This is the code of the bulk request:

BulkRequestBuilder bulkRequestBuilder = client().prepareBulk();

for (Map.Entry<Long, String> eventJson : eventJsons.entrySet()) {
    // One create-only index request per event, keyed by the event id.
    IndexRequestBuilder indexRequestBuilder = client().prepareIndex(
            EventsConstants.CURRENT_ALIAS, EventsConstants.BASE_TYPE,
            eventJson.getKey().toString());
    bulkRequestBuilder.add(indexRequestBuilder
            .setSource(eventJson.getValue(), XContentType.JSON)
            .setCreate(true));
}

bulkRequestBuilder.setTimeout(TIMEOUT);

bulkRequestBuilder.execute().actionGet();
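One note on reading the result (a sketch; the variable name is mine): bulk reports most indexing problems per item in the BulkResponse rather than by throwing, so a slow-but-successful bulk and a partially failed one look the same until the response is inspected:

BulkResponse bulkResponse = bulkRequestBuilder.execute().actionGet();
if (bulkResponse.hasFailures()) {
    // Per-item failures (e.g. UnavailableShardsException when no shard copy
    // becomes available within the request timeout) show up here, not as an
    // exception thrown by actionGet().
    System.err.println(bulkResponse.buildFailureMessage());
}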

In case of no master, I expect org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master]; not a hang... What could be the problem?

Any ideas?

Is there any chance that all your queries/indexing requests are going to only one node?

I am working with the transport client. All 3 nodes are configured, and except for bulk, other requests do return on time when there is a failover.
I am adding all 3 nodes with addTransportAddress (see the sketch below).
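A sketch of the wiring (only 172.16.69.2 appears in the log above; the other two addresses and the cluster name are my assumptions):

Settings settings = Settings.builder()
        .put("cluster.name", "my-cluster") // assumed cluster name
        .build();
// PreBuiltTransportClient comes from the transport plugin;
// InetAddress.getByName throws UnknownHostException, omitted here for brevity.
TransportClient client = new PreBuiltTransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("172.16.69.1"), 9300))
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("172.16.69.2"), 9300))
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("172.16.69.3"), 9300));

As I understand it, the client round-robins over the listed addresses and drops a node once it fails its ping, so requests should not keep going to the dead node for long.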
The bulk response also returns successfully, but only after a few minutes.
Could it be that some other operation is blocking it? I also don't understand why a timeout exception is not thrown in this situation... I do add a timeout to the bulk request:

bulkRequestBuilder.setTimeout(TIMEOUT);
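From what I can tell, though, that timeout is enforced on the server side: it bounds how long Elasticsearch waits for a usable shard copy before failing the affected items with UnavailableShardsException. It does not bound how long the client thread blocks; for that, actionGet() can take its own timeout (a sketch):

// Throws ElasticsearchTimeoutException if no response arrives in time,
// instead of blocking indefinitely like the no-arg actionGet().
BulkResponse response = bulkRequestBuilder.execute().actionGet(TimeValue.timeValueSeconds(30));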

In addition, we have only 1 primary shard and 2 replica shards, so I expect one of the other 2 nodes that hold a replica to be promoted to master, and its shard to become the new primary. Isn't that the flow in a failover?
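The log above also shows "scheduling reroute for delayed shards in [59.8s]", which, if I read it right, is index.unassigned.node_left.delayed_timeout (default 1m) postponing re-allocation of the dead node's replica. I don't know that it explains the hang, but it can be watched from the same client (a sketch):

ClusterHealthResponse health = client().admin().cluster().prepareHealth().get();
// delayedUnassignedShards counts replicas waiting out
// index.unassigned.node_left.delayed_timeout before being re-allocated.
System.out.println("status=" + health.getStatus()
        + ", delayed unassigned shards=" + health.getDelayedUnassignedShards());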
