We have multiple clusters that export monitoring data to a remote cluster.
When the monitoring target configured in elasticsearch.yml became unavailable and requests to it started timing out (I believe this is an important factor: it was accepting connections but not returning any response), multiple nodes in different clusters (!) also stopped responding at the same time.
All nodes were reporting timeouts connecting to the monitoring cluster, but some were not responding at all on their own REST API.
From what I could gather, the issue did not affect data nodes, since we haven't seen any failed search or bulk requests. However, all the individual Kibana instances, which are connected to ingest nodes, also began timing out when querying their respective clusters, and all clusters have gaps in their own local monitoring indices.
Perhaps some odd bug in X-Pack?
When a node was in this state, the Kibana instance connected to it also stopped functioning (its requests were timing out).
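For anyone trying to reproduce this, the condition described above (a target that accepts connections but never returns a response) can be simulated locally. The following is only a minimal sketch in Python using plain sockets. It does not involve Elasticsearch or X-Pack at all, and the function names `start_black_hole_server` and `probe` are made up for illustration. It only shows that such a target produces a read timeout rather than a connection failure, which is the distinction I suspect matters here.

```python
import socket
import threading


def start_black_hole_server():
    """Listen on a free local port, accept connections, and never reply.

    Hypothetical helper for illustration; not part of any Elastic tooling.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(5)
    held = []  # keep accepted sockets referenced so they are never closed

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)  # accept, then stay silent

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, held


def probe(host, port, timeout):
    """Return 'connect_failed', 'read_timeout', 'closed', or 'responded'."""
    try:
        client = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect_failed"
    with client:
        client.settimeout(timeout)
        client.sendall(b"GET / HTTP/1.1\r\nHost: test\r\n\r\n")
        try:
            data = client.recv(1024)
        except socket.timeout:
            return "read_timeout"  # connected fine, but no response arrived
        return "responded" if data else "closed"


if __name__ == "__main__":
    server, _held = start_black_hole_server()
    port = server.getsockname()[1]
    print(probe("127.0.0.1", port, timeout=1.0))
```

A target that refuses connections outright fails fast, while this kind of target holds every caller for the full read timeout, so anything waiting synchronously on the export would be blocked for that long.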
Stack trace of a node while it was not responding to Kibana:
[2017-06-06T20:41:13,851][WARN ][o.e.x.m.e.h.NodeFailureListener] connection failed to node at [https://arm-or-006.localdomain:9996]
[2017-06-06T20:41:13,851][WARN ][o.e.x.m.e.h.HttpExportBulkResponseListener] bulk request failed unexpectedly
java.net.SocketTimeoutException: null
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
[2017-06-06T20:41:33,125][ERROR][o.e.x.m.e.h.VersionHttpResource] failed to verify minimum version [5.0.0-beta1] on the [xpack.monitoring.exporters.arm] monitoring cluster
java.io.IOException: listener timeout after waiting for [30000] ms
Can you provide some more information about which Elasticsearch versions you are using? I am wondering about the 5.0.0-beta1 string in there, which is clearly a beta version, and I assume you don't run it in production?
The monitoring cluster itself is 5.4.1, the rest of the clusters (including the one where this stack was taken from) are 5.4.0.
The "5.0.0-beta1" is a hard-coded value in X-Pack for the minimum version it supports sending the monitoring data to.