Xpack.monitoring ExportException: failed to flush export bulks

Hi team

 We have multiple clusters that export monitoring data to a remote cluster.

All nodes were reporting timeouts when connecting to the monitoring cluster. Some of the nodes had high CPU usage (close to 100%), memory usage was normal, and nothing had changed in the configuration.

I want to find the root cause. Could you help me? Thanks!
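For context, the exporters on each monitored cluster are configured along these lines in elasticsearch.yml (the exporter name and timeout values below are illustrative placeholders, not the exact production settings; the host is the monitoring cluster address that also appears in the logs):

xpack.monitoring.exporters:
  remote_monitoring:                  # hypothetical exporter name
    type: http
    host: ["http://X.X.X.X:9200"]     # the remote monitoring cluster
    connection.timeout: 6s            # connect timeout (6s is the documented default, as far as I know)
    connection.read_timeout: 60s      # read timeout (defaults to 10x connection.timeout)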

Node logs:
[2019-03-27T09:04:18,070][WARN ][o.e.x.m.e.h.NodeFailureListener] connection failed to node at [http://X.X.X.X:9200]
[2019-03-27T09:04:18,077][WARN ][o.e.x.m.e.h.HttpExportBulkResponseListener] bulk request failed unexpectedly
java.io.IOException: request retries exceeded max retry timeout [30000]
at org.elasticsearch.client.RestClient$1.retryIfPossible(RestClient.java:574) [elasticsearch-rest-client-6.4.0.jar:6.4.0]
at org.elasticsearch.client.RestClient$1.failed(RestClient.java:561) [elasticsearch-rest-client-6.4.0.jar:6.4.0]
at org.apache.http.concurrent.BasicFuture.failed(BasicFuture.java:134) [httpcore-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.client.AbstractClientExchangeHandler.failed(AbstractClientExchangeHandler.java:419) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
[2019-03-27T09:04:18,080][WARN ][o.e.x.m.MonitoringService] [node-5] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:95) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:93) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:206) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:200) ~[?:?]
at org.elasticsearch.xpack.core.common.IteratingActionListener.onResponse(IteratingActionListener.java:96) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:164) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:68) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.xpack.monitoring.exporter.http.HttpExportBulk$1.onFailure(HttpExportBulk.java:114) [x-pack-monitoring-6.4.0.jar:6.4.0]
at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onDefinitiveFailure(RestClient.java:844) [elasticsearch-rest-client-6.4.0.jar:6.4.0]
at org.elasticsearch.client.RestClient$1.retryIfPossible(RestClient.java:575) [elasticsearch-rest-client-6.4.0.jar:6.4.0]
at org.elasticsearch.client.RestClient$1.failed(RestClient.java:561) [elasticsearch-rest-client-6.4.0.jar:6.4.0]
at org.apache.http.concurrent.BasicFuture.failed(BasicFuture.java:134) [httpcore-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.client.AbstractClientExchangeHandler.failed(AbstractClientExchangeHandler.java:419) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:156) ~[?:?]
... 18 more
Caused by: java.io.IOException: request retries exceeded max retry timeout [30000]
at org.elasticsearch.client.RestClient$1.retryIfPossible(RestClient.java:574) ~[?:?]
... 14 more

Monitoring cluster logs:
[2019-03-27T09:08:29,062][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-2] collector [node_stats] timed out when collecting data
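For reference, the timeout this collector hit appears to be governed by a monitoring collection setting; a sketch of what raising it might look like (setting name taken from the 6.x docs as far as I know, the 30s value is only an example, the default is 10s):

xpack.monitoring.collection.node.stats.timeout: 30s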

Do you know anything more about the state of the monitoring cluster at the time this occurred? It looks like the monitoring cluster stopped responding, but there isn't enough information here to really understand why that might have happened.
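For example, output from these standard cluster APIs on the monitoring cluster around the time of the failures would help narrow it down (shown here as console-style requests):

GET _cluster/health
GET _cat/thread_pool?v
GET _nodes/hot_threads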
