I am working with Elasticsearch 5.4.3 and indexing events via the transport client.
I have a cluster of 3 nodes with 1 primary shard and 2 replicas.
During a test, node-2 was shut down.
At one specific point, when indexing a document via the transport client, the following line did not return for a very long time (15 minutes): builder.execute().actionGet();
Other threads also executed this line of code and got a response successfully.
I can see that after 15 minutes, the thread got a response, right after the client wrote to the log:
DEBUG 2019-06-20 21:45:39,721 [elasticsearch[_client_][generic][T#4]] : Netty4Transport(TcpTransport.closeAndNotify:605) - disconnecting from [{node-2}{jfhMi92vThOz0x801XnniA}{1fN_YK1oR1KOyC-swbLiNw}{node-2}{9.151.141.2:9300}], IOException[Connection timed out]
The node sampler writes to the log every 5 seconds:
DEBUG 2019-06-23 13:58:34,240 [elasticsearch[_client_][generic][T#1]] : TransportClientNodesService(TransportClientNodesService$SimpleNodeSampler.doSample:432) - failed to connect to node [{#transport#-2}{kBEaZTXMTyCodE1w9RR_qg}{node-2}{9.151.141.2:9300}], ignoring...
I don't use a sniffer sampler, and I set a timeout on the bulkRequestBuilder as well as waitForActiveShards.
I wait for 2 active shard copies, and 2 copies are still active because only 1 node was shut down.
In addition, I want to emphasize that other threads do get responses at around that time, so this may be a corner case.
How can I configure a timeout for this kind of situation, or configure my transport client to handle it without getting stuck?
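To make the question concrete, here is a minimal sketch of the kind of client-side bound I have in mind. It uses only the JDK: a CompletableFuture stands in for the future returned by builder.execute(), and the class name, method name and timeout values are made up for illustration. As far as I can tell, the Elasticsearch ActionFuture extends java.util.concurrent.Future and also offers actionGet(long timeoutMillis), so the same pattern should apply there, but I have not verified how it behaves in this exact scenario.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWaitSketch {

    /**
     * Waits at most timeoutMillis for the future to complete.
     * Returns the fallback value if no result arrives in time.
     */
    public static <T> T getOrFallback(Future<T> future, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Stop waiting. Cancelling the future only releases the caller;
            // it does not guarantee that the underlying request is abandoned.
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a request whose response never arrives.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        System.out.println(getOrFallback(stuck, 200, "timed out"));

        // Hypothetical stand-in for a request that completes normally.
        CompletableFuture<String> ok = CompletableFuture.completedFuture("indexed");
        System.out.println(getOrFallback(ok, 200, "timed out"));
    }
}
```

This only bounds how long the calling thread waits; it does not explain why the request was stuck until the connection timed out, which is what I would like to understand.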