Request to Elasticsearch cluster hangs


#1

Hi,
we have a 3-node cluster (say abc, pqr, xyz). Sometimes a search request we send hangs and we never get any response back.

When we check the Elasticsearch logs on abc (the master), we see the following:

[NodeName-{abc}] [index_name][1] received shard failed for [index_name][1], node[N9_z7xYSSO6E4-W6DqEDVA], [R], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-09-13T06:13:53.463Z], details[shard failure [failed recovery][RecoveryFailedException[[index_name][1]: Recovery failed from [NodeName-pqr][nzkNy894SMe9FJBNv45k1Q][pqr][inet[/20.222.146.196:9300]]{master=true} into [NodeName-{abc}][GRBa3JlKRSybCTQgu76jvQ][abc][inet[/20.222.146.221:9300]]{master=true}]; nested: RemoteTransportException[[NodeName-pqr][inet[/20.222.146.196:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[index_name][1] Phase[1] Execution failed]; nested: RecoverFilesRecoveryException[[index_name][1] Failed to transfer [299] files with total size of [119.1gb]]; nested: ReceiveTimeoutTransportException[[NodeName-{abc}][inet[/20.222.146.221:9300]][internal:index/shard/recovery/clean_files] request_id [371207311] timed out after [900000ms]]; ]]], indexUUID [FImaN7b3RriRIT55eeeJXw], reason [Failed to perform [indices:data/write/delete] on replica, message [NodeDisconnectedException[[NodeName-{xyz}][inet[/20.222.146.220:9300]][indices:data/write/delete[r]] disconnected]]]
[2016-09-13 01:26:54,256][WARN ][cluster.action.shard ] [NodeName-{abc}] [index_name][2] received shard failed for [index_name][2], node[N9_z7xYSSO6E4-W6DqEDVA], [R], s[STARTED], indexUUID [FImaN7b3RriRIT55eeeJXw], reason [Failed to perform [indices:data/write/index] on replica, message [SendRequestTransportException[[NodeName-{xyz}][inet[/20.222.146.220:9300]][indices:data/write/index[r]]]; nested: NodeNotConnectedException[[NodeName-{xyz}][inet[/20.222.146.220:9300]] Node not connected]; ]]

and on node pqr we see the following logs:

[action.admin.cluster.health] [NodeName-pqr] connection exception while trying to forward request to master node [[NodeName-{abc}][GRBa3JlKRSybCTQgu76jvQ][abc][inet[/20.222.146.221:9300]]{master=true}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [NodeName-{abc}][inet[/20.222.146.221:9300]][cluster:monitor/health] disconnected]
[2016-09-13 01:14:26,234][WARN ][search.action ] [NodeName-pqr] Failed to send release search context
org.elasticsearch.transport.SendRequestTransportException: [NodeName-{abc}][inet[/20.222.146.221:9300]][indices:data/read/search[free_context]]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:249)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:143)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.sendReleaseSearchContext(TransportSearchTypeAction.java:353)
at org.elasticsearch.action.search.type.TransportSearchDfsQueryThenFetchAction$AsyncAction$1.onFailure(TransportSearchDfsQueryThenFetchAction.java:123)
at org.elasticsearch.search.action.SearchServiceTransportAction$8.handleException(SearchServiceTransportAction.java:283)
at org.elasticsearch.transport.TransportService$3.run(TransportService.java:290)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [NodeName-{abc}][inet[/20.222.146.221:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:964)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:656)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
... 9 more
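
The ReceiveTimeoutTransportException above shows the clean_files recovery step timing out after 900000ms (15 minutes) while transferring 299 files totalling 119.1gb. One thing we are considering is raising the recovery timeouts and the recovery throttle in elasticsearch.yml so large shards can finish copying (a sketch only; the setting names are what we believe apply to our 1.x version, please correct us if they differ):

```yaml
# elasticsearch.yml -- recovery tuning (sketch; verify setting names for your ES version)
# Give internal recovery actions such as clean_files more time than the 15m default.
indices.recovery.internal_action_timeout: 30m
# Raise the recovery throttle (default 20mb/s) so a ~119gb shard copies faster.
indices.recovery.max_bytes_per_sec: 100mb
```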

Can somebody help us understand why this is happening? On the client side there are no timeouts or errors; the request just hangs.
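
As a workaround, would setting a timeout on the search request itself at least stop the client from hanging forever? Something like this (a sketch, untested on our side; as we understand it the timeout is applied per shard and may return partial results with timed_out=true):

```json
POST /index_name/_search
{
  "timeout": "30s",
  "query": { "match_all": {} }
}
```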

