Monitoring cluster brings down production

I have a monitoring cluster that is separate from the production one. However, when the monitoring cluster does not respond (e.g. maintenance, network problem, internal error), the production cluster goes RED and stops working. The Kibana instance connected to production gets 500 errors.

Is there a way to make the production cluster behave less disruptively when monitoring is unreachable? Am I doing something wrong?

On each Elasticsearch production node, I have these settings for X-Pack monitoring:

xpack.monitoring.enabled: true
xpack.monitoring.exporters:
  id1:
    type: http
    host: ["elasticsearch_domain"]
    auth.username: "monitoring"
    auth.password: "monitoring_password"

Hi Alessandro, which version of Kibana are you using? I believe this has been fixed in 6.2.

Thanks,
CJ

Hi CJ,

I run 6.2 for monitoring and 5.6.4 for production. However, I do not think it is a Kibana problem but an Elasticsearch problem: I did not set any xpack.monitoring variable on the production Kibana, I set xpack.monitoring only on the ES nodes.

Hi Alessandro, I'm afraid I don't understand what you mean by "6.2 for Monitoring and 5.6.4 for production." Do you mean you're using X-Pack 6.2 with Kibana 5.6.4?

Thanks,
CJ

Hi CJ,

I mean that the monitoring cluster runs on a 6.2 stack (Logstash, Elasticsearch, Kibana, X-Pack) and the production cluster runs on a 5.6.4 stack.

Hey Alessandro, thanks for clarifying for me. I spoke with an engineer who works on Monitoring, and he thinks you could try temporarily disabling monitoring collection on your production cluster when the monitoring cluster needs some downtime. You can do this with a dynamic cluster setting that sets the collection interval to -1:

PUT /_cluster/settings
{
    "transient" : {
        "xpack.monitoring.collection.interval" : "-1"
    }
}

The transient setting will be reset if your cluster restarts. If you need it to survive a restart, change "transient" to "persistent".
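For scheduled maintenance, the same request can be scripted so that collection is paused before the monitoring cluster goes down and resumed afterwards. The following is only a minimal sketch using the Python standard library: the helper names and the http://localhost:9200 address are made up for illustration, authentication is left out (a cluster with X-Pack security would need credentials), and resuming by sending null relies on the cluster settings API resetting a setting to its default when it is assigned null.

```python
import json
import urllib.request

# Setting name taken from the request shown above.
SETTING = "xpack.monitoring.collection.interval"


def interval_body(interval, scope="transient"):
    """Build the body for PUT /_cluster/settings.

    interval: "-1" pauses collection; None (JSON null) resets the
    setting to its default, which resumes collection.
    scope: "transient" or "persistent", as described above.
    """
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    return {scope: {SETTING: interval}}


def build_request(base_url, body):
    """Create the PUT request without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request, timeout=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical usage around a maintenance window:
#   base = "http://localhost:9200"   # assumed production node address
#   send(build_request(base, interval_body("-1")))   # pause collection
#   ... monitoring cluster maintenance ...
#   send(build_request(base, interval_body(None)))   # resume collection
```

Keeping the request construction separate from sending makes it easy to inspect the body before anything touches the production cluster.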

He was also wondering whether you could provide some Elasticsearch logs from when the "production cluster goes RED and stops working". That will help us figure out if this is a known issue or if it's been fixed.

Thanks,
CJ

Hi CJ,

Thanks for the tip. I do not have the logs to post, but I remember that there were a lot of errors from the xpack.monitoring module about connection refused, which makes sense because the monitoring cluster was down.

The collection interval is a good workaround for scheduled downtime, but it will not work for unplanned outages.

It seems odd to me that a production environment stops working only because the monitoring one is not reachable. Maybe there is a good reason for that.

Cheers,
