Node stops responding if remote monitoring cluster is unresponsive

We have multiple clusters that export monitoring data to a remote cluster.
The target configured in elasticsearch.yml became unavailable and requests to it were timing out. I believe an important factor is that it was still accepting connections but not returning any response. At the same moment, multiple nodes in different clusters (!) also stopped responding.
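For anyone trying to reproduce the "accepts connections but never responds" condition, here is a minimal sketch of a stand-in endpoint. It is not what our monitoring cluster actually runs; the names `start_blackhole` and `probe` are made up for illustration, and it speaks plain TCP only, so an HTTPS exporter pointed at it would presumably hang during the TLS handshake rather than while waiting for the HTTP response.

```python
import socket
import threading


def start_blackhole(host="127.0.0.1", port=0):
    """Listen on host:port, accept every connection, and never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(16)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_loop():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)  # never read, never write

    threading.Thread(target=accept_loop, daemon=True).start()
    # port=0 lets the OS pick a free port; report the one actually bound
    return srv, srv.getsockname()[1]


def probe(port, timeout=1.0):
    """Connect, send a request, and report whether any response arrives."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as c:
        c.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        try:
            c.recv(1024)
            return "responded"
        except socket.timeout:
            return "read timeout"
```

Calling `start_blackhole()` and then `probe(port)` should show the connect succeeding and the read timing out, which is the behaviour I believe our nodes were seeing from the monitoring cluster.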

All nodes were reporting timeouts connecting to the monitoring cluster, but some were not responding at all on their own REST API.

From what I could gather, the issue did not affect data nodes, since we haven't seen any failed search or bulk requests. However, all individual Kibana instances, which are connected to ingest nodes, also began timing out when querying their respective clusters, and all clusters have gaps in their own local monitoring indices.
Perhaps some odd bug in X-Pack?

While a node was in this state, the Kibana instance connected to it also stopped functioning (its requests timed out).

Log output (with stack trace) from a node while it was not responding to Kibana:

[2017-06-06T20:41:13,851][WARN ][o.e.x.m.e.h.NodeFailureListener] connection failed to node at [https://arm-or-006.localdomain:9996]
[2017-06-06T20:41:13,851][WARN ][o.e.x.m.e.h.HttpExportBulkResponseListener] bulk request failed unexpectedly
java.net.SocketTimeoutException: null
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) [httpasyncclient-4.1.2.jar:4.1.2]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
[2017-06-06T20:41:33,125][ERROR][o.e.x.m.e.h.VersionHttpResource] failed to verify minimum version [5.0.0-beta1] on the [xpack.monitoring.exporters.arm] monitoring cluster
java.io.IOException: listener timeout after waiting for [30000] ms

Hey,

Can you provide some more information about which Elasticsearch versions you are using? I am wondering about the 5.0.0-beta1 string in there, which clearly is a beta version, and I assume you don't run it in production?

--Alex

The monitoring cluster itself is 5.4.1, the rest of the clusters (including the one where this stack was taken from) are 5.4.0.
The "5.0.0-beta1" is a hard-coded value in X-Pack for the minimum version it supports sending the monitoring data to.
