Node stops responding if remote monitoring cluster is unresponsive

Dan_Markhasin · June 7, 2017, 1:00pm

We have multiple clusters that export monitoring data to a remote cluster.
When the monitoring target configured in elasticsearch.yml became unavailable and requests to it started timing out (I believe this is an important factor: it was accepting connections but not returning any response), multiple nodes in different clusters (!) stopped responding at the same time.
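For anyone trying to reproduce the "accepts connections but never responds" condition described above, here is a minimal sketch in plain Python. It is not X-Pack code and says nothing about how the exporter behaves internally; the helper names `start_silent_server` and `probe` are made up for illustration. It only shows that such a target looks healthy at connect time and fails only once a read timeout expires.

```python
import socket
import threading


def start_silent_server():
    """Listen on an ephemeral local port, accept connections, never reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(5)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # listener was closed
            held.append(conn)  # hold the connection, send nothing back

    threading.Thread(target=accept_forever, daemon=True).start()
    return listener, listener.getsockname()[1]


def probe(host, port, timeout_s):
    """Return 'connect-failed', 'read-timeout', 'closed', or 'responded'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        return "connect-failed"
    with sock:
        sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
        try:
            data = sock.recv(1024)  # blocks until data arrives or timeout
        except socket.timeout:
            return "read-timeout"
    return "responded" if data else "closed"


if __name__ == "__main__":
    listener, port = start_silent_server()
    print(probe("127.0.0.1", port, timeout_s=1.0))
    listener.close()
```

Against the silent server the connect succeeds immediately and the probe reports a read timeout only after the full `timeout_s` has elapsed, so a caller that waits synchronously on such a target is held up for the whole timeout on every request.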

All nodes were reporting timeouts connecting to the monitoring cluster, but some were not responding at all on their own REST API.

From what I could gather, the issue did not affect data nodes, since we haven't seen any failed search or bulk requests. However, all Kibana instances, which are connected to ingest nodes, also began timing out when querying their respective clusters, and all clusters have gaps in their own local monitoring indices.
Perhaps some odd bug in X-Pack?

When a node was in this state, the Kibana instance connected to it also stopped functioning (its requests timed out).

Dan_Markhasin · June 7, 2017, 1:01pm

Stack trace of a node while it was not responding to Kibana:

[2017-06-06T20:41:13,851][WARN ][o.e.x.m.e.h.NodeFailureListener] connection failed to node at [https://arm-or-006.localdomain:9996]
[2017-06-06T20:41:13,851][WARN ][o.e.x.m.e.h.HttpExportBulkResponseListener] bulk request failed unexpectedly
java.net.SocketTimeoutException: null
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) [httpasyncclient-4.1.2.jar:4.1.2]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
[2017-06-06T20:41:33,125][ERROR][o.e.x.m.e.h.VersionHttpResource] failed to verify minimum version [5.0.0-beta1] on the [xpack.monitoring.exporters.arm] monitoring cluster
java.io.IOException: listener timeout after waiting for [30000] ms

spinscale · June 12, 2017, 6:59am

Hey,

Can you provide some more information about which Elasticsearch versions you are using? I am wondering about the 5.0.0-beta1 string in there, which is clearly a beta version, and I assume you don't run it in production?

--Alex

Dan_Markhasin · June 12, 2017, 7:23am

The monitoring cluster itself is 5.4.1; the rest of the clusters (including the one this stack trace was taken from) are 5.4.0.
The "5.0.0-beta1" is a hard-coded value in X-Pack for the minimum version it supports sending the monitoring data to.

