Consistently losing monitoring data for clusters

ryan.dyer · July 10, 2017, 5:28pm

We have multiple clusters which log their monitoring data to their own monitoring clusters. All clusters consistently stop logging monitoring data on several, if not all, nodes in the cluster after an extended period of time (days to weeks). The logs from the nodes that stop logging contain the errors shown at the end of this post. Restarting an affected node resolves the issue. From the affected nodes I can still query /?filter_path=version.number on the monitoring cluster, and it returns the version info as expected.

[es5-node] # curl -v http://monitoring.cluster:9200/?filter_path=version.number

*   Trying ...
* Connected to monitoring.cluster () port 9200 (#0)
> GET /?filter_path=version.number HTTP/1.1
> Host: monitoring.cluster:9200
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Mon, 10 Jul 2017 16:27:56 GMT
< Content-Type: application/json; charset=UTF-8
< Content-Length: 47
< Connection: keep-alive
<
{
  "version" : {
    "number" : "5.2.0"
  }
}

[2017-07-10T17:09:22,594][INFO ][o.e.x.m.e.Exporters      ] [master-ip] skipping exporter [es5-monitoring] as it is not ready yet
[2017-07-10T17:09:37,689][WARN ][o.e.x.m.e.h.NodeFailureListener] connection failed to node at [http://monitoring.cluster:9200]
[2017-07-10T17:09:37,689][ERROR][o.e.x.m.e.h.VersionHttpResource] failed to verify minimum version [5.0.0-beta1] on the [xpack.monitoring.exporters.es5-monitoring] monitoring cluster
java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
    at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:171) ~[?:?]
    at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:145) ~[?:?]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348) ~[?:?]
    at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192) ~[?:?]
    at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) ~[?:?]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

pickypg · July 17, 2017, 8:03pm

Hi @ryan.dyer

Is it possible that the IP address behind the DNS hostname changes after Elasticsearch has started? If so, the JVM isn't going to notice, because it caches DNS lookups. That would also fit with curl succeeding from the same node, since curl resolves the hostname afresh on each run. You can change the caching behavior for Java itself via the networkaddress.cache.ttl setting in the $JAVA_HOME/lib/security/java.security file.

https://www.elastic.co/guide/en/cloud/current/_dns_caching.html

This documentation is for Elastic Cloud, but it applies to any instance of Elasticsearch.
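
To illustrate what the setting does, here is a minimal standalone sketch of reading and overriding networkaddress.cache.ttl through the standard java.security.Security API. The class name DnsCacheTtl, the helper applyTtl, and the 60-second value are only assumptions for the example; for Elasticsearch itself the change belongs in the java.security file mentioned above, not in code like this.

```java
import java.security.Security;

// Hypothetical demo class; not part of Elasticsearch or X-Pack.
public class DnsCacheTtl {
    // Security property that controls how long successful DNS lookups are cached.
    // The value is in seconds; -1 means "cache forever".
    static final String TTL_PROPERTY = "networkaddress.cache.ttl";

    // Sets the TTL and returns the value the JVM now reports for the property.
    // Assumption: this is called before the JVM performs its first hostname
    // lookup, because the caching policy is typically read only once.
    static String applyTtl(int seconds) {
        Security.setProperty(TTL_PROPERTY, Integer.toString(seconds));
        return Security.getProperty(TTL_PROPERTY);
    }

    public static void main(String[] args) {
        // Prints null when the property is not set in java.security.
        System.out.println("configured TTL before: " + Security.getProperty(TTL_PROPERTY));
        // 60 seconds is an arbitrary example value, not a recommendation.
        System.out.println("configured TTL after:  " + applyTtl(60));
    }
}
```

With a finite TTL the JVM looks the hostname up again once the cached entry expires, so a changed IP address should get picked up without restarting the node.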

Hope that helps,
Chris

