Killing 1 node causes hanging bulk requests

After killing 1 node (without -9) out of the 3 nodes in the cluster, a bulk create via the transport client hangs for several minutes!

The logs show:

[2017-12-13T10:52:58,703][INFO ][o.e.c.r.a.AllocationService] [node-3] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{node-2}{y5SsksY3RYqoroDPrSvfdg}{u1cSLNqgRgusPfdhhY90Fw}{172.16.69.2}{172.16.69.2:9300} failed to ping, tried [3] times, each with maximum [1s] timeout]).
[2017-12-13T10:52:58,704][INFO ][o.e.c.s.ClusterService   ] [node-3] removed {{node-2}{y5SsksY3RYqoroDPrSvfdg}{u1cSLNqgRgusPfdhhY90Fw}{172.16.69.2}{172.16.69.2:9300},}, reason: zen-disco-node-failed({node-2}{y5SsksY3RYqoroDPrSvfdg}{u1cSLNqgRgusPfdhhY90Fw}{172.16.69.2}{172.16.69.2:9300}), reason(failed to ping, tried [3] times, each with maximum [1s] timeout)[{node-2}{y5SsksY3RYqoroDPrSvfdg}{u1cSLNqgRgusPfdhhY90Fw}{172.16.69.2}{172.16.69.2:9300} failed to ping, tried [3] times, each with maximum [1s] timeout]
[2017-12-13T10:52:58,827][INFO ][o.e.c.r.DelayedAllocationService] [node-3] scheduling reroute for delayed shards in [59.8s] (1 delayed shards)
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:52:58,836][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513161010363][0] received shard failed for shard id [[events_1513161010363][0]], allocation id [KXtrTmKQSkmIuZ528faD7A], primary term [2], message [mark copy as stale]
[2017-12-13T10:58:06,206][INFO ][o.e.c.s.ClusterService   ] [node-3] added {{node-2}{y5SsksY3RYqoroDPrSvfdg}{wEPJXyNNRG2vAnj1Qb5oMA}{172.16.69.2}{172.16.69.2:9300},}, reason: zen-disco-node-join[{node-2}{y5SsksY3RYqoroDPrSvfdg}{wEPJXyNNRG2vAnj1Qb5oMA}{172.16.69.2}{172.16.69.2:9300}]

Our configuration is:

discovery.zen.commit_timeout: 2s
discovery.zen.publish_timeout: 2s
discovery.zen.fd.ping_timeout: 1s
transport.tcp.connect_timeout: 5s
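(One related setting, not shown above, and I am assuming it is at its default: with 3 master-eligible nodes the usual quorum value is

discovery.zen.minimum_master_nodes: 2

so that a master can still be elected after losing a single node.)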

We are using ES version 5.4.1.
This is the code of the bulk request:

BulkRequestBuilder bulkRequestBuilder = client().prepareBulk();

for (Map.Entry<Long, String> eventJson : eventJsons.entrySet()) {
    // One create-only index request per event, keyed by the event id.
    IndexRequestBuilder indexRequestBuilder = client().prepareIndex(
            EventsConstants.CURRENT_ALIAS, EventsConstants.BASE_TYPE,
            eventJson.getKey().toString());
    bulkRequestBuilder.add(indexRequestBuilder
            .setSource(eventJson.getValue(), XContentType.JSON)
            .setCreate(true));
}

bulkRequestBuilder.setTimeout(TIMEOUT);

bulkRequestBuilder.execute().actionGet();
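One note on reading the result (a sketch; the variable name is mine): bulk reports most indexing problems per item in the BulkResponse rather than by throwing, so a slow-but-successful bulk and a partially failed one look the same until the response is inspected:

BulkResponse bulkResponse = bulkRequestBuilder.execute().actionGet();
if (bulkResponse.hasFailures()) {
    // Per-item failures (e.g. UnavailableShardsException when no shard copy
    // becomes available within the request timeout) show up here, not as an
    // exception thrown by actionGet().
    System.err.println(bulkResponse.buildFailureMessage());
}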

In case of no master, I expect org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master]; not a hang... What could be the problem?

Any ideas?

Is there any chance that all your queries/indexing requests are going to only one node?

I am working with the transport client. All 3 nodes are configured, and except for bulk, other requests do return on time when there is a failover.
I am adding all 3 nodes with addTransportAddress (see the sketch below).
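A sketch of the wiring (only 172.16.69.2 appears in the log above; the other two addresses and the cluster name are my assumptions):

Settings settings = Settings.builder()
        .put("cluster.name", "my-cluster") // assumed cluster name
        .build();
// PreBuiltTransportClient comes from the transport plugin;
// InetAddress.getByName throws UnknownHostException, omitted here for brevity.
TransportClient client = new PreBuiltTransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("172.16.69.1"), 9300))
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("172.16.69.2"), 9300))
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("172.16.69.3"), 9300));

As I understand it, the client round-robins over the listed addresses and drops a node once it fails its ping, so requests should not keep going to the dead node for long.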
The bulk response also returns successfully, but only after a few minutes.
Could it be that some other operation is blocking it? I also don't understand why a timeout exception is not thrown in this situation... I do add a timeout to the bulk request:

bulkRequestBuilder.setTimeout(TIMEOUT);
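From what I can tell, though, that timeout is enforced on the server side: it bounds how long Elasticsearch waits for a usable shard copy before failing the affected items with UnavailableShardsException. It does not bound how long the client thread blocks; for that, actionGet() can take its own timeout (a sketch):

// Throws ElasticsearchTimeoutException if no response arrives in time,
// instead of blocking indefinitely like the no-arg actionGet().
BulkResponse response = bulkRequestBuilder.execute().actionGet(TimeValue.timeValueSeconds(30));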

In addition, we have only 1 primary shard and 2 replica shards, so I expect one of the other 2 nodes that hold a replica to be promoted to master, and its shard to become the new primary. Isn't that the flow in a failover?
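The log above also shows "scheduling reroute for delayed shards in [59.8s]", which, if I read it right, is index.unassigned.node_left.delayed_timeout (default 1m) postponing re-allocation of the dead node's replica. I don't know that it explains the hang, but it can be watched from the same client (a sketch):

ClusterHealthResponse health = client().admin().cluster().prepareHealth().get();
// delayedUnassignedShards counts replicas waiting out
// index.unassigned.node_left.delayed_timeout before being re-allocated.
System.out.println("status=" + health.getStatus()
        + ", delayed unassigned shards=" + health.getDelayedUnassignedShards());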
